## Abstract

Acute Lymphoblastic Leukaemia (ALL) is the most frequent paediatric cancer. Modern therapies have improved survival rates, but approximately 15-20 % of patients relapse. At present, patients’ risk of relapse are assessed by projecting high-dimensional flow cytometry data onto a subset of biomarkers and manually estimating the shape of this reduced data. Here, we apply methods from topological data analysis (TDA), which quantify shape in data via features such as connected components and loops, to pre-treatment ALL datasets with known outcomes. We combine these fully unsupervised analyses with machine learning to identify features in the pre-treatment data that are prognostic for risk of relapse. We find significant topological differences between relapsing and non-relapsing patients and confirm the predictive power of CD10, CD20, CD38, and CD45. Further, we are able to use the TDA descriptors to predict patients who relapsed. We propose three prognostic pipelines that readily extend to other haematological malignancies.

**Teaser** Topology reveals features in flow cytometry data which predict relapse of patients with acute lymphoblastic leukemia

## Introduction

Acute Lymphoblastic Leukaemia (ALL) is a haematological malignancy that affects all age groups, particularly children. Indeed, it is the most frequent type of paediatric cancer (*1*). While current chemotherapy regimes have improved survival rates, as many as 15-20% of patients relapse following chemotherapy, i.e. the disease returns, even though the initial response to treatment may be good. Identifying patients, specifically at the paediatric age, at risk of relapse as early as possible - ideally at diagnosis - and tailoring treatment plans accordingly is, therefore, of major clinical importance. Rather than receiving toxic and ineffective first-line chemotherapy regimes, patients at risk of relapse could, for example, immediately receive alternative treatments, such as more intensive chemotherapeutic regimes, before resistant clones emerge and become dominant (*2, 3*), saving valuable time and improving survival rates. Alternatively, it may be better to immediately use more aggressive or expensive therapies, such as bone marrow transplants or CAR-T-cell therapies as a first line treatment (*4*).

At diagnosis, clinicians typically collect a combination of quantitative biomolecular, and qualitative (usually binary) cell morphology and genetic data from a patient (*5*) to determine the type of hematological cancer, disease severity, i.e. prognosis, and likely response to treatment. Based on experience gained from multi-centre clinical trials, patient data is manually assessed to determine a patient’s treatment plan, following an induction-consolidationreinduction-maintenance chemotherapy regime, which is standard in many countries (*6*). Many clinical trials also collect data about patient responses to treatment, including whether they experienced relapse. These data are ideally suited to retrospective analyses, not only to optimise treatment plans, but also to predict relapse. Exploiting existing datasets represents a promising approach to overcome many of the challenges in personalised medicine in oncology, such as patient phenotyping for prognosis or biomarker discovery (*7–9*).

Flow cytometry data are routinely used for the diagnosis of haematological malignancies such as ALL (*10*). These data provide quantitative information about the bone marrow status at the level of single cells. A particular challenge of flow cytometry data is its high-dimensionality: each cell from a patient sample is represented by one data point whose coordinates correspond to the intensities of a set of predetermined fluorescent biomarkers. This collection of data points is also referred to as a point cloud. The biomarkers usually comprise immunophenotypic (IPT) markers (*11*) such as CD19, CD10 and CD20, which are characteristic of B-lymphocytes subpopulations, more general markers such as CD45, CD38 and CD34, which indicate cell differentiation and maturation status, as well as CD3, CD33 or cyMPO, which are specific to other cell lineages, e.g. T lymphocytes or myeloid cells. The current strategy for managing high-dimensionality in these data is to generate two-dimensional projections of pairwise combinations of biomarkers (*12*), and then to manually assess the shapes of the projected data. For example, the shapes can be inspected to identify subsets of points within the point cloud that correspond to cancerous cells. However, manual assessment is time-consuming, subjective, and prone to errors. Therefore, more systematic and automated methods of analysis are required. Conventional data analysis methods, such as principal component analysis or machine learning (ML) methods (neural networks, support vector machines, etc), which focus on linear relationships between biomarkers rather than characteristics of shape, could be applied for classification. However, they are not ideal in this context because the results are typically difficult to interpret.

Topological data analysis (TDA) (*13, 14*) is a set of methods that uses ideas from the mathematical discipline topology to study the ‘shape’ of high-dimensional data. The most prominent method in TDA is persistent homology (PH) (*13–17*). PH provides a quantification of the shape of point cloud data at different spatial scales. Thus, in contrast to other data analysis methods, it is not necessary to predetermine a spatial scale up to which relationships between data points are considered; PH takes into account the full range of possible scales. PH studies the shape of data via topological features such as connected components, i.e. data point clusters, and loops, which are formed by data points surrounding an empty hole. One can imagine PH as a process in which data is viewed at different resolution scales. The output from PH are topological signatures of the data called barcodes which visualise topological features and their scales in the data. Features that appear across a range of scales in these barcodes, are called ‘persistent’ and are often viewed as particularly characteristic of a dataset (for an example, see Fig. 1 (E) and (F)). The mathematical foundations of PH ensure that small changes in the data lead to only small changes in the barcodes. PH is readily computable (*17*), robust to noise (*18*) and its outputs are interpretable. In order to compare barcodes, perform statistical analyses, and/or apply classification methods from ML, barcodes can further be transformed into persistent images (*55*) which are a vectorised representation of the output of PH. In recent years, improved computational feasibility of PH has increased its applications to (high-dimensional) data in many different contexts, including studies of the shape of brain arteries (*20*), neurons (*21*), the neural code (*22*), airways (*23*), stenosis (*24*), zebrafish patterns (*25*), ion aggregation (*26*), contagion dynamics (*27*), spatial networks (*28–31*), and geometric anomalies (*32*). In oncology, PH has been used to construct new biomarkers (*29, 33, 34*), to classify tumours (*35, 36*) and genetic alterations (*37*) and to quantify patterns of immune cell infiltration into solid tumours (*38*).

In this paper we design a pipeline that combines PH and ML methods to predict relapse in ALL with very high accuracy (see Fig. 1). We combine flow cytometry data collected at three different hospitals in Spain (*39*) from patients with the most common subtype of paediatric ALL (*40*), one that is characterised by rapid growth of early B lymphocytes (B-ALL). We apply PH to these datasets, obtain their topological signatures and analyse them to identify those biomarkers with the strongest discriminatory power to classify relapsing (R) and non-relapsing (NR) patients. Our approach confirms that the known biomarkers CD10, CD20, CD38 and/or CD45 are most relevant for this classification. We identify specific topological features (connected components and loops) which appear predominantly in patients at risk of relapse. We further transform the barcodes into persistent images which allow for the topological classification of patients at risk of relapse. Finally, by applying ML to persistence images, we accurately classify persistent images of relapsing (R) versus non-relapsing (NR) patients. Based on our analyses, we suggest three novel and practical approaches for prognosis. The first approach relies only on PH and can be applied when ML is not available. With this approach, we identify interpretable shape characteristics in the data that are specific to patients at risk of relapse. The second approach includes a combination of PH and ML and is fully automated. It achieves optimal accuracy in relapse prediction compared to manual assessments performed at diagnosis. If only PH is available, we suggest a third approach where barcodes are summarised by computing how many connected components (topological features in dimensions 0) and loops (topological features in dimension 1) persist longer than a given range of scale thresholds. Our work can be readily extended to many other haematological malignancies.

## Results

### Machine learning classifier applied to TDA confirmed state-of-the-art markers for prognosis

We applied PH to our datasets and then applied machine learning to the output descriptors of PH to identify those immunophenotypic (IPT) markers that contain the information most relevant for classifying patients into either the non-relapsing (NR) group, i.e. patients who show no refractory values for minimal residual disease after three years, or the relapsing (R) group (see Figure 2). Our data contained 80 NR patients and 16 R patients. We merged data from all bone marrow samples for each patient into a single file (*41*) which included values for the IPT parameters CD10, CD13, CD19, CD20, CD22, CD3, CD33, CD34, CD38, CD45, CD58, CD66c, IGM, cyCD3, cyMPO, cyTDT (see Figure 1(B)). As we were focussing on B-ALL, we also manually selected data corresponding to B-lymphocytes in each patient (see Figure 1(C)). Since this data was too large for efficient computation of PH, we considered subsamples of markers. We identified the IPT markers which hold the most information for classification between R and NR patients. In line with standard clinical practice to assess prognosis, we constructed 2D projections of these high-dimensional point clouds, considering all pairwise combinations of the 16 IPT markers for each patient (see Figure 2(A–C)). We then applied PH (specifically, Vietoris-Rips complex, see ‘Methods’) to each of the 120 pairwise combinations of the 16 markers and obtained PH barcodes for dimension 0, i.e. connected components, and dimension 1, i.e. loops (see Figure 2(D); and, for a detailed description, see ‘Methods’). In a first step, we extracted the following simple descriptors from each barcode: the maximum, minimum and mean persistence and the standard deviation in dimensions 0 and 1. We used the descriptors to create 8-dimensional topological feature vectors for each patient, and for each of the 120 pairwise combinations of IPT markers (see Figure 2(E)).

To classify our patients, we applied Machine Learning methods. We divided the datasets into a discovery set and a validation set, as described in ‘Methods’. As a preliminary study, we downsampled the number of NR patients in the discovery set to match the number of R patients, i.e. 13 R and 13 NR patients (we held back 3 R patients for the validation set). We obtained a total of 3120 persistence barcodes for each dimension analysed. We then performed a Random Forest analysis, with cross-validation, on each patient subgroup within the discovery set, assigning 60% of the patients to a training group, and 40% to a testing group. We computed receiver operating characteristic (ROC) curves for the classification of R versus NR patients. IPT pairs with an area under the ROC curve of less than 50% were deemed to have very low information and excluded from subsequent analysis. The IPT markers that appeared most frequently in the remaining pairs were CD10, CD19, CD20, CD22, CD34, CD38, cyMPO and cyTdT (see Figure 2(F) and Data in S1). Only combinations including CD10 or CD20 had a mean AUC*>*70%.

Next, we considered the validation dataset. The complete set of pairwise combinations resulted in an additional 5520 persistence barcodes in each dimension. As before, these were analysed via Random Forest analysis, with cross-validation. Only IPT markers CD10, CD20, CD38 and CD45 produced mean AUCs*>*50% (see Data S1-S2). To establish whether these markers carried independent information, we computed the correlations between the markers. None of the IPT markers showed strong correlations in either the NR or R groups (see Figures S6 and S7). Biologically, it is well-known that CD10, CD20 and CD45 characterise the shape of CD19^{+} B lymphocyte clouds (*42*), while CD38 is a marker of aberrant B-lymphoblasts (*43, 44*). Our independent Random Forest analysis, performed on simple topological descriptors, confirmed the importance of these biomarkers. We performed our subsequent analysis on a reduced dataset containing only these variables.

### PH found that the connected components and loops were more persistent in R patients than in NR patients

Following our selection of the four most informative biomarkers for prognosis (CD10, CD20, CD38 and CD45), we applied PH to all pairwise combinations of these biomarkers for each patient.

We visualised the persistence of connected components (dimension 0) and loops (dimension 1) by constructing what we term *persistence threshold (PT) curves*; PT curves reveal characteristic topological features, i.e. persistent features, for each patient cohort. We counted the number of topological features in each barcode which persist over scales that exceed a fixed threshold. To obtain PT curves, we performed these computations for a range of threshold values (see Figure 3(A)). We averaged the PT curves for both patient cohorts and all pairwise combinations of biomarkers and found differences between the discovery and validation sets. The results for the mean PT curves for all pairwise combinations are presented in Figures S10-S15 for both dimensions 0 and 1. We considered different threshold *τ* ranges to capture bars with long persistence, e.g. *τ* ∈ [0.1, 0.18] for dimension 0 and *τ* ∈ [0.04, 0.05] for dimension 1, versus bars with medium scale persistence, e.g. *τ* ∈ [0.03, 0.07] for dimension 0 and *τ* ∈ [0.007, 0.015] for dimension 1.

We present illustrative results for CD10-CD20 in Figure 3. In both the validation and discovery datasets, the mean values of the dimension 1 PT curves were larger in R patients than in NR patients (see Figure 3(A.2)). These differences were statistically significant in dimension 1, where *τ* is the persistence threshold with *τ* ∈ [0.04, 0.05]. An illustrative example for dimension 0 for CD10-CD20 is presented in Figure S9.

Intuitively, there are fewer persistent loops in the NR patients than in the R patients as the loops in the NR patients tend to be ‘narrower’ and disappear at smaller spatial scales (see Figures 3(B.1) and (B.2), respectively). In the barcodes these less persistent loops correspond to shorter bars. This is captured by the PT curves. In dimension 0, we observed the same trends for CD10-CD20: the number of less persistent connected components in the data of R patients was larger than that of NR patients (see SI for details). Complete results for all pairwise combinations are included in Figures S14 and S15, respectively; they show the absolute and relative numbers of bars that are larger than a fixed threshold value for each patient. The number of persistent loops was higher in R patients for all combinations that included either CD10 or CD20, except for the combination CD10-CD45. We only found a higher number of persistent loops for NR patients in CD38-CD45. The number of non-persistent loops was only larger for NR patients than for R patients for the combination CD20-CD38. In dimension 0, connected components with short persistence were only found in R patients for the combination CD10-CD20. These results were statistically significant and consistent between discovery and validation datasets.

When we applied the same analysis to the 4-dimensional point clouds of each patient, the differences between the R and NR group were not significant and the discovery and validation datasets yielded different results (see SI). As before, we first performed statistical analyses on simple barcode descriptors, i.e. the maximum, minimum and mean persistence and the standard deviation, extracted from the dimension 0 and 1 barcodes of the NR and R groups. Since the resulting p-values for all variables were larger than 0.05, we concluded that, in general, the simple barcode descriptors are unable to discriminate between the two groups (see results summary in Figure S8(A)).

We performed clustering to verify the dimension 0 PH results. We considered 4-dimensional patient point clouds in the discovery set and applied the FlowSom algorithm (*45*) (see ‘Methods’ and Figures S16-S19) to each patient point cloud to determine the number of clusters. We then compared results between both cohorts. The number of clusters in point clouds from NR patients (4.01 ± 1.17) was lower (t-test, p-value 0.09) than the number in point clouds from R patients (4.56 ± ± 1.11). When we performed the same analysis with the 2-dimensional data, the same trends were observed for the 2D projections of CD10-CD20 (t-test, p-value 0.18), with fewer clusters in NR patients (3.51 ± 0.73) than in R patients (3.81 1.17). While these results were marginally significant, they are consistent with those obtained from the persistence threshold curves (see Figure 3).

### Persistence images accurately identified relapsing patients

In our previous analyses, we computed simple barcode descriptors from the PH barcodes. To fully exploit the information contained in the barcodes from the 4-dimensional datasets (with biomarkers CD10-CD20-CD38-CD45; see Figure 4(A)), we transformed the barcodes into persistence images (*55*). This method transforms barcodes into matrices which can be vectorised and, as such, are more amenable to analysis while also being stable to noise and maintaining an interpretable relationship to the barcodes from which they were generated (see Figure 4(B) and Section S1.4 and S2.3.c-d for methodological details).

We used persistence images in matrix form as input to Logistic Regression (LR) and Support Vector Machine (SVM) analyses. We considered three variables: datasets studied (balanced, or not, by uppersampling), dimensions of the topological features analysed (0, 1 or 2) and biomarkers included (CD10, CD20, CD38, CD45). We included dimension 2 barcodes in this analysis by reducing the number of data points from 10^{5} to 10^{4}.

We used the discovery and validation datasets for training and testing, respectively. The classification results are presented in Table 1. We performed cross-validation within the training sets to obtain representative SVM parameters as described in ‘Methods’. We obtained perfect classification scores and 100% accuracy. By comparison, a random forest analysis of clinical variables (such as age, sex and other molecular biology data as presence of mutations) lead to an accuracy of 86% after cross-validation, and an expert clinical analysis based on the same data lead to an accuracy of approximately 91% (for more details, see Section 2.1.c in the SI).

Since R patients constituted approximately 20% of the total number of patients, we performed upper-sampling and reran our analysis. These results are presented in Table 2. As before, we obtained 100% accuracy, with better classification for dimensions 0 and 1 than for dimension 2. We obtained the same trends when all three PIs (in dimensions 0,1 and 2) were used together as input for the classification methods (see Table S4). Thus, the PH methods combined with the machine learning techniques were able to identify the R patients from the diagnostic FC data with high accuracy.

We constructed the mean PIs for R and NR patients to compare differences between the two cohorts. Additionally, we computed the coefficients of the decision function in the LR classification and showed by means of a an image the discriminatory areas between both cohorts. These differences depended on the hyperparameters used for the construction of PIs, and were obtained for the discovery and validation datasets (see Figure S21 for details).

## Discussion and conclusions

Predicting risk of relapse after treatment is one of the major challenges in cancer, particularly when effective second and third-line treatments are available. In B acute lymphoblastic leukemia (B-ALL), first-line chemotherapy regimes have increased overall survival by up to 90%, but a substantial fraction of patients relapse. Being able to start other treatments as soon as possible, i.e. before resistant clones emerge and become dominant (*2, 3*), could significantly improve treatment outcomes in these patients. In our present work we focused on predicting risk of relapse based on flow cytometry data, which is routinely collected at diagnosis of haematological malignancies. Clinicians manually assess the shape of 2D projections of this high-dimensional data for diagnosis. We combined persistent homology (PH), a method from topological data analysis, with machine learning to obtain quantitative descriptors of the data which distinguish R and NR B-ALL patients at diagnosis with high accuracy. Regarding PH, we based our analysis on the following topological features: connected components (dimension 0) and loops (dimension 1).

We first used PH together with machine learning to detect markers relevant for the classification of R patients in an unsupervised manner. IPT markers CD10, CD20, CD45 and CD38 were among those with the highest relevance for classification. These biomarkers are characteristic of the B-cell line; CD10, CD20 and CD45 are routinely used to identify early B cell sub-populations (*42*), while CD38 is known to have prognostic value for B-ALL (*39, 44*). Thus, our results confirmed the importance of these four biomarkers and we retained only these four biomarkers for further analyses.

We analysed both the 4D dataset (with the above biomarkers) and pairwise, 2D projections of the data with PH and identified the features which most influence the shape of the point clouds. Interestingly, using simple barcode descriptors extracted from the 2D data we were able to better distinguish between R and NR patients than when considering the 4D data. Our ability to distinguish between R and NR patients increased further when barcodes from the 4D data were transformed into persistent images, i.e. when more of the topological information provided by PH was used for classification. This implies that while the shape of the 2D projections can give insights via both visual inspection and crude shape descriptors, PH captures valuable information that goes beyond these features in the 4D data. This additional information can only be exploited when the data is transformed into a format that can be combined with machine learning while retaining as much information from the barcodes as possible.

In the 2D data, by considering differences between the R and NR persistence threshold curves (PT curves), R patients were characterised by larger numbers of persistent connected components and loops (see Figure 3(B)). These results are consistent with the bone marrow cell population of R patients being more phenotypically diverse than those who do not relapse. In the R patients, such cellular heterogeneity can be linked to non-dominant leukaemic subpopulations that might be the drivers of disease relapse (*46*). Increased cellular heterogeneity in R patients manifests as a higher number of isolated leukaemic point clouds (i.e. clusters of cells spread across phenotype space) in their projected flow cytometry data. In particular, the projections of marker combinations, including CD10 and CD20, show such B-cell subpopulations in the data. This may also explain why there are more loops in the data of R patients in the corresponding marker combinations; a more heterogeneous cell population leads to more and larger voids in the data which can result in more persistent loops. In contrast, for NR patients we found more persistent loops in the marker combination CD38-CD45, the only combination not including either CD10 or CD20. A possible explanation is that long-time-survival patients with B-ALL express higher levels of CD38^{+} in their FC data, which could lead to homogeneity in NR patients. These considerations may also explain why 4D analysis of PT curves did not discriminate well between R and NR groups: we observe more isolated clusters and heterogeneity in the data point clouds of R patients in CD10-CD20 projections, while the intensity of CD38^{-} in CD38-CD45 projections (which could be related to more homogeneity in these biomarkers for R patients) points to bad prognosis (*39, 43*).

In the 4D data, persistence images (PIs) which use the full information content of the barcodes rather than summaries (simple barcode descriptors and PT curves) enabled us to classify the data from R and NR patients via machine learning. We computed PIs from the barcodes in dimension 0, 1 and also 2 (voids) for this analysis. Following balancing by uppersampling, the classification was excellent in every dimension. On average, classification scores were higher for dimension 0 than dimension 1, and lower for dimension 2. Concatenating PIs from all three dimensions yielded the same trends as those for dimensions 0 and 1. Regardless of classification method, dimension 0 attained a 100% classification score. Similarly, for dimension 1, the results showed a high score for almost every choice of hyperparameters in the construction of the PIs. We obtained slightly worse results in dimension 2 when Logistic Regression was used for classification. However, we note that, due to the high computational costs of PH in dimension 2, only a subset of 10^{3} data points were used, instead of the full number of 10^{4} points used to construct PIs in dimensions 0 and 1. Support Vector Machine (SVM) analysis applied to the PIs from dimensions 0 and 1 achieved 93% and 100% accuracy in the original and balanced data, respectively. Even when a 5×5 grid was used to genereate the PIs the accuracy of SVM was maintained.

Our results show how using PH to analyse routine high-dimensional FC data can significantly improve the accuracy of predictions of relapse for patients with B-ALL. Our levels of accuracy and precision highlight the potential for topological methods to dissect phenotypic characteristics of B-ALL relapse. While other methods used to predict clinical outcomes exploit the intensity of marker expression or cell proportions (such as (*47, 48*)), PH measures characteristics of the shape of the point cloud in high-dimensional spaces that are not accessible with other methods. These topological features obtained do account for variation in the data due to noise or calibration of the flow cytometer. However, several variables considered during routine, manual inspections performed in the clinic (such as the range of expression levels or the mean fluorescence intensity) can be affected by data transformations. Fully automated and validated, this methodology could be readily integrated into clinical practice. Based on our observations, we propose three practical approaches to assess the risk of relapse on diagnosis, depending on the available methodology (see Figure 5). Each method requires biomarkers CD10, CD20, CD38 and CD45 to be included in the flow cytometry data.

When only manual inspection of 2D projections of flow cytometry data is possible, i.e. when neither PH nor ML methods are available, insights from PH applied to 2D projected data may help to identify qualitative patterns that distinguish R and NR patients. As seen in the Table from Figure 5, patients at risk of relapse tend to have large enclosed voids in most pairwise combinations of the biomarkers and smaller clusters that are far apart from each other in the CD10-CD20 projection. The NR group also exhibit smaller and larger loops in some projections including CD38.

If both PH and ML analyses are available, we suggest PH analysis of the 4D datasets, followed by transformation of the barcodes to persistent images (PIs). Using a classification algorithm on PIs, such as SVM, would allow a high degree of accuracy.

If only PH methods are available, then PH analysis should be applied to all 2D projections of biomarkers CD10, CD20, CD38 and CD45 and the numbers of long bars in the dimension 0 and dimension 1 barcodes computed. If, for the dimension 1 barcode, the distance of the corresponding PT curve to the mean PT curve of NR patients is larger than the distance to the PT curve of R patients, then the data is likely to correspond to a relapsing patient. The same analysis can also be applied to the CD10-CD20 projection in dimension 0.

While our analysis reveals that TDA has great potential for assessing the risk of relapse on diagnosis, our study has several limitations. The size of the datasets used may render the results sensitive to overfitting. Despite the robustness to noise of topological methods, further validation of the method on different datasets is needed. Additionally, a larger study using bone marrow samples with a larger number of common, standardized IPT markers could be insightful (*49*), particularly when data would be collected using the same hospital protocols for preprocessing. The use of a single flow cytometer would further improve the accuracy of the study, as machines can differ with respect to the data scales and compensation techniques that they use. Another limitation of our study is the high computational cost of PH: we reduced our data to 10^{4} data points to enable computations in dimensions 0 and 1, and to 10^{3} data points to enable computations in dimension 2. This represented 10% and 1%, respectively, of the 10^{5} points selected from the whole bone marrow. Even when a representative subsample was selected using the max-min algorithm (see ‘Methods’), it took on average a week to compute all 16 pairwise combinations of IPT markers, with high-performance computing machines (see ‘Methods’ for details). However, as PH is an active research area, these computational limitations are constantly being improved. Finally, PH does not consider information about the intensity of markers or cell proportions. It would be interesting to incorporate such information into future PH analysis.

An extension of the methods presented here could be to apply our techniques to other FC datasets, monitoring patient responses to e.g. chemotherapy, transplant or CAR-T cell infusion. This could reveal new information about the emergence of resistance to these treatments. Flow cytometry data can also provide information about all bone marrow cells, not just B-lymphocytes, and is therefore routinely collected for many haematological conditions such as lymphomas, T-cell leukaemias, myeloblastic disorders. Our analysis could be adapted to study these diseases. Combining information from IPT cell markers with qualitative biomolecular or morphology features has been used in (*50, 51*) to perform risk stratification and ALL diagnosis. By combining such qualitative data with PH analyses of flow cytometry data it may be possible to further increase the accuracy of patient prognosis and relapse prediction, and the selection of effective personalised treatment strategies.

Our results could have been obtained using either PH or ML. However, their combination provides biological interpretability while also exploiting the high-dimensional character of FC data, whose shape is currently manually assessed in the clinic by focussing on 2D projections of the data. The approaches outlined in this article represent a first step towards a more general approach in which topological descriptors computed from FC data inform patient classification and treatment. In other datasets, the same systematic approach could be used to identify informative biomarkers.

We have demonstrated the potential of topological methods to reveal prognostic information in data that is routinely collected from B-ALL patients. Our work is a first step towards automating risk prognosis with high accuracy and interpretable results. We propose our approach to be tested prospectively in clinic to determine its practical use. Finally, our work could stimulate further studies which investigate the potential of methods from topological data analysis to reveal new features of cancer relapse.

## Materials and Methods

### Patients

A retrospective study was designed in accordance with the Declaration of Helsinki, and the protocol was approved by the institutional review board (IRB) of the four participating local institutions (LLAMAT Project, 2018). The inclusion criteria for the study were B-ALL diagnosis between February 2009 and July 2019, age less than 19 years, and availability of BM flow cytometry data in Flow Cytometry Standard 3.0 (FCS 3.0). A total of 129 patients were included in the study: 2 from Jerez Hospital (HJ), 54 from Virgen de la Arrixaca Hospital (HVA), 27 from Niño Jesús Hospital (HNJ) and 45 from Virgen del Rocío Hospital (HVR). After preprocessing and marker selection, 96 patients were retained for further analysis. A discovery set was constructed with 30 patients from HVR (dataset 1) and 20 patients from HNJ (dataset 2). A validation set was later reanalysed with 46 patients from HVA (dataset 3). In all datasets, we counted with 80 non-relapsing (NR) patients, with at least three years of follow-up with no evidence of minimal residual disease, and 16 relapsing (R) patients. Patients characterstics can be found in Table S3).

### Flow cytometry markers and data preprocessing

Marker expression was obtained on FAC-SCanto II flow cytometers, in accordance with the manufacturer’s specifications for sample preparation. FCS files were first imported into FlowJo (Becton Dickinson, 10.6.1) and FACS-Diva (Becton Dickinson, 8.0.1) and inspected manually by standard methods (see Figure S4): Acquisition anomalies, margin events (measurements that match the maximum or minimum values for any parameter), doublets (cells that are analysed together, as a single event), debris and dead cells were removed (*12*). Files were further processed in R (4.1) by means of flow cytometry specific packages from Bioconductor. Specifically, files were compensated and transformed with Logicle transform (*52*).

For each patient, we subsampled aliquots to the same number of cells and then brought them into a single file via nearest neighbour imputation (*53*). This resulted in one file per patient. We subsampled every patient file to 10^{5} cells. Batch effects were accounted for with a rescaling transformation. We chose the 0.001 and 0.999 quantiles to transform every intensity value *x* to . This was performed to avoid outliers, where *x* is the quantile 0.001 *x*_{q0.999} is the quantile 0.999. Subsequently, the common B-cell antigen CD19 was used to select the B-cell subpopulation.

Finally, we selected 10^{4} landmarks from the lymphocyte cloud by applying the max-min algorithm (*54*) to each group of markers considered. Because of computational constraints in persistent homology, we reduced the number of landmarks to 10^{3} for studies performed in dimension 2. The last step involved the selection of a common set of potential biomarkers for each patient; these included FSC-A, SSC-A, CD10, CD13, CD19, CD20, CD22, CD3, CD33, CD34, CD38, CD45, CD58, CD66c, IGM, cyCD3, cyMPO, cyTDT. We omitted geometric markers FSC-A and SSC-A from the persistence analysis, to prevent unconscious bias. Patients that did not contain the full set were excluded from the analysis.

### TDA and Persistence Images

Persistent homology is the most commonly used method from topological data analysis. The technique provides information on topological features such as connected components and loops in point cloud data. To apply persistent homology, one needs to construct a filtration of simplicial complexes from the point cloud data, i.e. a sequence of nested graph-like structures that include nodes (e.g., the data points) and edges as well as higher-order connections (simplices) such as triangles and tetrahedra. A method to obtain such simplicial complexes is the Vietoris-Rips filtration which we use here: Given a point cloud such as the one shown in Figure 1(E), we place a ball of radius *r* centered at each point. We now connect two points by an edge, if the balls surrounding them intersect. If three points have pairwise intersecting balls, we consider the edges to form a filled-in triangle, if four such points have pairwise intersections, their edges form a filled-in tetrahedron. As the radius *r* increases, the simplicial complex built on the data contains more edges, triangles, and tetrahedra until at some point all points are connected to each other forming a high-dimensional simplex. PH monitors how topological invariants such as connected components, loops and 3D voids are born and later die along this filtration as the radius *r* grows. For example, initially for very small *r* all data points form their own connected components. It is only as *r* grows that components merge together, i.e. one component dies while the other lives on now including two data points connected by an edge. For each radius *r*, the homology group of the simplicial complex captures the connected components (dimension 0), loops (dimension 1), and voids (dimension 2). The term “persistence”, or lifetime, refers to how long these topological features live within the filtration. This can be represented by barcodes, which are collections of intervals [*r*_{b}, *r*_{d}) where each interval represents the lifetime of one topological feature in the filtration. Intuitively, *r*_{b} is the radius at which the feature is first observed in the filtration and *r*_{d} corresponds to the radius when the feature dies in the filtration. For an example of a barcode, see Figure 1(F).

To further quantify these features, here we constructed so-called PT curves: we counted the number of bars in each barcode whose corresponding topological features persist longer than a threshold taken from a range of thresholds. Specifically, we studied threshold ranges *τ* ∈ [0.03, 0.07] and *τ* ∈ [0.1, 0.18] for dimension 0, as well as threshold ranges *τ* ∈ [0.007, 0.015] and *τ* ∈ [0.04, 0.05] for dimension 1. We chose these threshold ranges to represent different scales of persistence (‘medium’ versus ‘long’ scales).

In our study, we further applied persistence images (PIs), which we generated from persistence barcodes. These images are real-valued matrices that can be used as an input into a variety of machine-learning approaches and are then useful for classification. The construction of PIs depends on the hyperparameters, such as the grid size (5×5, 10×10, 25×25, 50×50, or 100×100 in this study) and the spread of the Gaussian distribution (0.01 and 0.05 in this study) which is used when transforming the information from the barcode to the image. A more detailed summary of TDA can be found in the SI, as well as in further references (*17, 55, 56*).

### Machine learning methods

We divided data into discovery and validation datasets. For classification and prediction, we used Random Forests, a popular and efficient algorithm based on model aggregation ideas (*57*). We also applied Support Vector Machines (SVM) (*58*) for classification by assigning R and NR labels to the data matrices obtained from the PIs of the patients. We varied two parameters associated with the SVM classification kernel: *γ*, which represents the curvature of the decision boundary, and *C*, which is the trade-off between correct classification and the distance between the decision boundary and support vectors. We selected these parameters by randomly splitting the training set and performing internal cross-validations for a logarithmic range of values of the parameters *C* ∈ [10^{−2}, 10^{13}] and *γ* ∈ [10^{−9}, 10^{3}]. We also used Logistic Regression to construct binary regressions. We computed the mean score of fitting the model after 6-fold cross-validation. R patients constituted approximately 20% of the total number of patients. In order to address this imbalance, we repeated the SVM method for classifying R and NR patients with upper-sampling on the first ones.

### Computing machines

The max-min algorithm and Persistence Analysis were run on six machines from the Oxford Mathematical Institute, each with 36 cores, with up to 3.9 Ghz speed, and 768 GB of RAM. Figures and other calculations were run on an iMac, running under Mac OS 10.15, with four i5 cores, 3.4 Ghz speed and 32 GB RAM.

### Software

Python (3.1) was used for the computations. R (3.6.0) and RStudio (1.2.1335) were used for data curation. Persistence barcodes and images were constructed using Ripser (0.3.2) (*59*) and Persim packages (0.1.3) in the Python Scikit-TDA toolbox. FCS files were read using the Python CytoFlow library (1.0) and Flow Cytometry libraries from Bioconductor (3.11) in RStudio. FlowJo (Becton Dickinson, 10.6.1) and FACSDiva (Becton Dickinson, 8.0.1) software packages were used for manual gating of FCS data. Random Forest and Support Vector Machine methods were obtained from the Python Scikit-Learn (0.24.2) classification library. Clustering was performed by means of the FlowSom algorithm (*45*), normally used for Flow Cytometry data. This last is included in the Spectre package (from R, v.0.5.0), embedded in Python with the rpy2 module (v. 3.4.5).

## Data Availability

Implementation details, code and processed PH data are freely available on https://github.com/salvadorchulian/shapecancerrelapse. Flow cytometry files are available upon request. These datasets can be provided by the corresponding author, S.C., or H.B., pending scientific review and a completed material transfer agreement. Requests for the datasets should be submitted to: S.C., or H.B.. All other data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials.

## Data and Code Availability

Implementation details, code and processed PH data are freely available on https://github.com/salvadorchulian/shapecancerrelapse. Flow cytometry files are available upon request. These datasets can be provided by the corresponding author, S.C., or H.B., pending scientific review and a completed material transfer agreement. Requests for the datasets should be submitted to: S.C., or H.B.. All other data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials.

## Author Contributions

Conceptualization, S.C., B.J.S., V.M.P.-G., M.R. and H.B.; Data curation, S.C., Á.M.-R., C.B.G., J.F.R.G., Á.M.Q., T.C.-V., M.R.-O., A.C.R., J.L.F.S., A.M.P. and M.V.M.S.; Formal analysis, S.C.; Funding acquisition, M.R., V.M.P.-G. and H.B.; Investigation, S.C. and B.J.S.; Methodology, S.C. and B.J.S.; Project administration, C.B.G., M.R., V.M.P.-G. and H.B.; Resources, Á.M.Q., T.C.-V., M.R.-O., A.C.R., J.L.F.S., A.M.P. and M.V.M.S.; Software, S.C. and Á.M.-R.; Supervision, V.M.P.-G., M.R. and H.B.; Validation, S.C., C.B.G., J.F.R.G.; Visualization, S.C. and Á.M.-R.; Writing—original draft, S.C. and B.J.S.; Writing—review & editing, S.C., B.J.S., Á.M.-R., C.B.G., J.F.R.G., Á.M.Q., T.C.-V., M.R.-O., A.C.R., J.L.F.S., A.M.P., M.V.M.S., M.R., V.M.P.-G. and H.B. All authors have read and agreed to the published version of the manuscript.

## Competing Interests

The authors declare that they have no competing financial interests. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

## Acknowledgments

This work has been partially supported by the Fundación Española para la Ciencia y la Tecnología (FECYT project PR214), the Asociación Pablo Ugarte (APU, Spain), Junta de Andalucía (Spain) group FQM-201, Junta de Comunidades de Castilla-La Mancha (grant number SBPLY/17/180501/000154), the Programme of Research and Transfer Promotion from the University of Cádiz (grant number EST2020-025), Ministery of Science and Technology, Spain (grant number PID2019-110895RB-I00, funded by MCIN/AEI/10.13039/501100011033). This work was also subsidized by a grant for the research and biomedical innovation in the health sciences within the framework of the Integrated Territorial Initiative (ITI) for the province of Cadiz (grant number ITI-0038-2019). ITI is 80% co-financed by the funds of the FEDER Operational Program of Andalusia 2014-2020 (Council of Health and Families). BJS and HMB are members of the Centre for Topological Data Analysis, funded by the EPSRC grant (EP/R018472/1). We would also like to acknowledge Heather Harrington (Centre for Topological Data Analysis, Mathematical Institute, Oxford) for helpful discussions.