Abstract
Oncology therapeutic development continues to be plagued by high failure rates leading to substantial costs with only incremental improvements in overall benefit and survival. Advances in technology including the molecular characterisation of cancer and computational power provide the opportunity to better model therapeutic response and resistance. Here we use a novel approach which utilises Bayesian statistical principles used by astrophysicists to measure the mass of dark matter to predict therapeutic response. We construct “Digital Twins” of individual cancer patients and predict response for cancer treatments. We validate the approach by predicting the results of clinical trials. Better prediction of therapeutic response would improve current clinical decision-making and oncology therapeutic development.
Introduction
Therapeutic development in oncology continues to be challenging. Whilst significant advances have been made in some instances, progress continues to be slow and incremental. The vast majority of candidate therapies fail, and the failure rate in advanced phase 3 clinical trials remains high. This inefficiency costs over $50 billion per annum, which is unsustainable for most health systems and economies.
Advances in the molecular profiling of cancer, coupled with accelerated computing power, provide the promise of moving away from a “trial and error” approach to cancer treatment and therapeutic development, to one where we can predict therapeutic efficacy prior to treatment. “Digital Twins”; in-silico virtual replicas of cancer patients, offer enticing possibilities for improving cancer treatment. The benefits of accurate prediction of therapeutic response and patient outcome can be applied at many points in therapeutic development, from early candidate drug selection through to late phase clinical trials and routine cancer care.
Here, we present a machine learning approach which simulates treatment with cytotoxic and small molecule therapies with applicability across numerous cancer types. We demonstrate how these models can predict overall response rates (ORR) for a range of cancer types and treatments. The digital twins predict drug efficacy, for single agent or drug combinations and can predict if treatment A will perform better than treatment B in individual patients and in virtual clinical trials. The prediction accuracies were tested against actual response rates and overall survival metrics in historical clinical trials. Synthetic controls for comparator arms of clinical trials were constructed to enable benchmarking of predicted clinical efficacy of investigational drugs versus standard of care. The simulated clinical trial can then predict survival. Finally, we demonstrate how this approach can be used for patient cohort enrichment for an investigational drug of interest, and calculate the predicted increase in response rates achieved through such enrichment strategies.
Results
Constructing Digital Twins to simulate therapeutic response and clinical trials
The modelling approach we used arose out of a collaboration with astrophysicists1 to develop advanced Bayesian inference software that enables integrative modelling of gravitational lensing and cancer biology. These partnerships motivated a transfer learning approach where detailed molecular and therapeutic data generated from biological experimentation was used to build generalisable Bayesian models that can be applied to predict treatment efficacy for single agents and combinations.
We created a computational framework that could predict in vitro therapeutic response and clinical response and survival using multi-dimensional data that included molecular profiles, predominantly genomic and transcriptomic. Digital twins were created to address specific questions using 3 distinct models: 1) Drug Efficacy Model 2) Treatment Response Model and 3) Overall Survival Model (Figure 1).
The perturbation kernel is derived from the drug-efficacy model and is leveraged by multiple Bayesian inference models to transfer understanding about shared molecular mechanisms across in vitro combination screens and clinical treatment settings. It defines the similarity in the molecular mechanisms between every pair of datapoints (for example a patient treated with a taxane such as docetaxel vs. a different patient treated with an anthracycline such as doxorubicin), which can be used to make predictions of effect using Gaussian processes2.
The drug efficacy and perturbation kernel were built using in-vitro dose-response data from the Cancer Therapeutic Response Portal (CTRP)3–5. The CTRP dataset consists of 481 anti-cancer compounds which include chemotherapy and targeted small molecules. These were dosed against 860 cancer cell lines. The molecular data for the cell lines was obtained from the Cancer Cell Line Encyclopedia6. This dataset was used to train the perturbation kernel to predict IC50 for the compounds in the dataset using a Sparse Gaussian Process. The perturbation kernel can also accurately predict synergy scores from the NCI-ALAMANAC dataset (unpublished data) for combination treatments. In this study, the model was tested using the perturbation kernel to predict treatment response in clinical data using the TCGA dataset located at the NCI Genomic Data Commons7. A summary of the datasets used and abbreviations can be found in Table 1 and Table 2. A detailed breakdown of the cohorts used in this study can be found in Extended Data Figure 1.
To evaluate the performance of the model across the TCGA dataset we split the dataset into 5 cross-fold splits, stratified by cancer type and overall survival. We then trained the models on 4 of the splits and predicted outcomes for the remaining (omitted) split. All accuracy metrics reported are averages of the metrics calculated for each split in turn. Many patients had missing information, these missing variables were either imputed by the mean value of that column for that patient’s cancer type or from the entire cohort. These mean values were calculated only from the training cross folds when imputing for the validation cohort. The treatment response model was used to calculate treatment response probabilities for all the patients using the training cross-folds. If the patient received no treatment, then the patient was considered to have 0 probability of treatment response.
The model used molecular fingerprints generated by the CDK8–11 and Cinfony12 chemoinformatics libraries from canonical SMILES structures obtained from PubCHEM13 to incorporate structural information about each therapeutic. This process restricts the treatment response predictions to small molecule therapies at this time and hence we focus on chemotherapy drugs as monotherapy and drug combinations.
Validation of Digital Twin predictions using clinical trial simulations
We simulated digital trial arms for single chemotherapy drugs and combinations to predict treatment response in cancer patients with the goal of assessing the accuracy of digital twin predictions through comparison to historical clinical trial results. TCGA data was used as input, no original individual participant clinical trial data is available and was not used for predictions. Initially, the digital twin predictions were evaluated using an unblinded approach where our technology team was aware of the results. (Figure 2). The model treatment predictions were compared to the results of four historical phase 2 and phase 3 clinical studies (1997 - 2018). These were trials in metastatic pancreatic cancer (Burris, et al.50), advanced breast cancer (Chan, et al.51 and Tutt, et al.52) and platinum-sensitive recurrent ovarian cancer (Cantù, et al.53). We compared the predicted log odds ratio (OR; Figure 2a) generated by the digital twin model for Overall Response Rate (ORR) for each treatment arm tested in the clinical study, and then compared this against the actual reported log odds ratios (log OR) from the historical trial, the ground truth (Figure 2b). We started with single-agent predictions, then progressively increased the complexity through combinations and heterogenous treatments in more sophisticated clinical trial designs (Table 3).
Unblinded validation
Single-agent chemotherapy arms
Study 1
The first clinical trial we simulated reported by Burris et al. in 1997, was a prospective, randomised clinical trial in advanced pancreatic cancer50 (n=126; 17 sites in USA & Canada; 1997), which randomised participants to either single-agent 5-fluorouracil (n=63) or gemcitabine (n=63). Clinical benefit was 23.8% for gemcitabine compared with 4.8% for 5-fluorouracil (5-FU) (P = 0.002). Median survival was 5.65 months for gemcitabine compared to 4.41 months for 5-FU (P = .0025). Survival at 12 months was 18% for gemcitabine and 2% for 5-FU. Both chemotherapy drugs are considered to be anti-metabolites, therefore this experiment tested the model’s ability to detect the difference in drug efficacy for two drugs belonging to the same drug class. The Digital Twin drug response predictions were based upon either no response (stable disease or disease progression) or response (partial response or complete response). The model correctly predicted gemcitabine chemotherapy would have greater clinical benefit than 5-FU (predicted log odds ratio −0.10, P < 0.0001)(Figure 2).
Study 2
Two metastatic breast cancer studies were simulated, both designed to prospectively compare single-agent therapeutic arms. These studies considered either an anthracycline, a taxane or a platinum. Chan et al.51 reported a prospectively randomised phase 3 study comparing taxane monotherapy (docetaxel; n=161) vs. anthracycline monotherapy (doxorubicin; n=164) in metastatic breast cancer (n=326) previously treated with an anthracycline-containing regimen (UK & Europe). The Digital Twin model predicted single-agent docetaxel would be a better treatment than single-agent doxorubicin using a relatively small dataset of n=21 (predicted log odds ratio −0.09, P = 0.07)(Figure 2 and Table 3) The borderline P value reflecting the small number of patients available for prediction expanding the confidence interval. The actual trial (n=326) showed that docetaxel had a higher objective response rate than doxorubicin (47.8% vs. 33.3%; P =.008).
Study 3
The other Phase 3 Clinical study in breast cancer reported by Tutt et al.52 (TNT; 17 sites, UK) compared carboplatin (n=188) vs. a taxane, docetaxel (n=188) as first-line treatment in metastatic triple negative breast cancer (n=376). This study showed that carboplatin was no more active than docetaxel in the BRCA wildtype subpopulation of patients (ORR, 31.4% vs. 34.0%, respectively; P = 0.66). The Digital twin model predicted, in alignment with the trial results, that neither carboplatin nor docetaxel would have superior efficacy in the BRCA wild-type population. In summary, the Digital twin model accurately predicts chemotherapy responses for different drug classes and can effectively predict the difference in drug activity, if present for single-agent treatments.
Single-agent vs. combination chemotherapy in relapsed disease
Study 4
To continue to ascertain the potential limitations of the Digital Twin model, we then added complexity to the predictions we set by including combination treatments and used a study comparing a single drug against a drug combination. The Digital Twin model virtually simulated a prospective randomised study in ovarian cancer reported by Cantù et al. 202253 (n=97; Italy) allocating participants to either single agent paclitaxel (taxane; n=50) or a combination of cyclophosphamide (alkylating agent), doxorubicin (anthracycline) and cisplatin (platinum) (CAP; n=47). This study recruited patients with recurrent ovarian cancer who had achieved complete remission with previous platinum-based regimens, and whose disease recurred after a progression-free interval of more than 12 months. The Digital Twin model used data inputs from the TCGA cohort to predict that the cisplatin-based combination (CAP) would have a higher response rate than paclitaxel. The Digital twin prediction reflected the actual published higher overall treatment response rate for CAP, which was 55% vs. 44.7% for CAP vs. Paclitaxel. (P = 0.062). Predicted ORR log odds ratio −0.24, P < 0.001 for n= 251 vs. calculated log odds ratio from clinical trial data −.044, P=0.133).
Blinded validation
As part of the validation process, the technology team were blinded to the published outcomes of an additional four phase 3 clinical trials. Three clinical studies in early breast cancer evaluated adjuvant chemotherapy regimens54–56 and one evaluated first-line metastatic pancreatic cancer57 (Figure 2). The Digital Twin model correctly predicted drug efficacy and the clinical trial result for all four clinical studies.
Study 5 (USA, Europe & Australia)
Von Hoff et al.57 The phase III Metastatic Pancreatic Adenocarcinoma Clinical Trial (MPACT) in metastatic pancreatic cancer compared the combination of nab-paclitaxel plus gemcitabine vs. gemcitabine alone as first-line therapy. The study randomised 861 previously untreated metastatic pancreatic cancer patients between these treatment arms. The response rates were 23% for nab-paclitaxel plus gemcitabine versus 7% for gemcitabine alone (P<0.001). Median overall survival was 8.5 months in the nab-paclitaxel-gemcitabine group vs. 6.7 months with gemcitabine alone (hazard ratio 0.72, p<0.001). Using data for a similar chemotherapy drug, paclitaxel, the Digital Twin model was able to correctly predict that nab-paclitaxel plus gemcitabine was superior to gemcitabine alone (Predicted log odds ratio −0.090, p = <0.001) (Figure 2).
Study 6 (National Cancer Institute of Canada Clinical Trials Group)
In order to test the model’s limitations with regard to the number of patients the model needed to train on, we designed a virtual trial with methotrexate chemotherapy, because the model had trained on only eight cancer patients treated with methotrexate. The study was a Phase 3 prospective randomised trial reported by Levine et al.55 (1998), which enrolled high-risk, node positive pre/peri-menopausal women post mastectomy or lumpectomy and axillary dissection (n=716) and randomised them to either adjuvant ECF (epirubicin, cyclophosphamide and fluorouracil), or adjuvant CMF (cyclophosphamide, methotrexate and fluorouracil) treatment. The relapse-free survival for CMF (control arm) was 53% (95% CI, 45-58%) and 63% (95% CI, 57-68; P=0.009) for CEF at 5 years. Although CEF was a more effective chemotherapy regimen, it was associated with significantly more acute toxicities and as a consequence is not widely used. In order to predict treatment response in the adjuvant setting where the cancer has been surgically removed. Virtual simulations by the digital twin model accurately predicted that adjuvant ECF would be superior to adjuvant CMF in early breast cancer (Predicted log odds ratio −0.023, P =<0.001; Table 3).
Study 7 (USA)
Reported by Jones et al. 200656 with 1016 participants with a median follow up of 5.5 years, a phase 3 prospective randomised trial in stage 1-3 breast cancer reported disease-free survival at 5.0 years for TC (docetaxel and cyclophosphamide) of 86% vs. 80% for AC (doxorubicin and cyclophosphamide) (HR 0.67; 95% CI 0.5-0.94; P=0.015)(Figure 2). The purpose of this trial was to compare the clinical outcomes in patients treated with a standard adjuvant anthracycline regimen vs. a non-anthracycline regimen. Virtual simulations by the model correctly predicted that adjuvant TC (Docetaxel and Cyclophosphamide) would be superior to adjuvant AC (doxorubicin and cyclophosphamide) in early breast cancer (Predicted log odds ratio −0.059, P < 0.001; Table 3)
Study 8 (Japan & South Korea)
To further test the limitations of the Digital Twin’s performance, we aimed to challenge it further. We tested it across mixed populations who received different neoadjuvant chemotherapy regimens containing either an anthracycline, a taxane, or both, and then subsequently received heterogeneous adjuvant therapy. CREATE-X (Masuda et al.54 2017) was a Phase 3 prospective, randomised study (n = 910) that randomised participants with residual disease following different neoadjuvant chemotherapy regimens for breast cancer (stage I-III) to either capecitabine or a no capecitabine. The study participants included both hormone-positive (ER+ve, HER2-ve) and triple-negative (ER-ve, HER2-ve) patients. For simulation purposes, we focused on the hormone-positive subpopulation only and assumed participants in the control arm would receive endocrine treatment but no capecitabine. For the CREATE-X study, the overall survival 95% confidence intervals and hazard ratios for the hormone-positive subgroup crossed 1.0 (n=601; HR 0.73 0.38-1.40; P = 0.41) suggesting adjuvant capecitabine was no better control. Virtual simulations by the Digital Twin model predicted that in people with hormone-positive breast cancer (HER2-ve; stages 1-3), treatment with adjuvant capecitabine would be inferior to standard of care such as adjuvant hormone treatment with tamoxifen (log odds ratio = 0.07). Although the predicted confidence intervals inferred inferiority, the prediction was within the confidence interval of the actual trial results.
Predicting survival
An Overall Survival (OS) model was integrated into the Digital Twin clinical trial simulator using a Random Survival Forest (RSF)58, a statistical non-parametric ensemble learning method. The learning target is time-to-event and event (censored/deceased) data, and primary output is survival probability vs. time curves.
Clinical data from 10,913 patients was pre-processed to yield a dataset comprising 4029 patients, with ages ranging from 11 to 90 years, spanning 23 different cancer types. This dataset along with the RECIST response categories from the TRM stage was used as input for the OS model.
We analysed five different prediction accuracy metrics: 1) cumulative dynamic AUC ROC, 2) Uno’s concordance index59 (C-index), 3) time-dependent Brier score, 4) the Brier skill score, and, 5) explained variance. Table 4 shows the average outcome across 5-fold training and testing splits. A detailed explanation of the metrics is provided in the Online Methods. Both the dynamic AUC and the C-index scores of the Digital Twin were above 0.7 in a pan-cancer setting, a threshold set for a good predictive model.
Additionally, we evaluated the performance of our Digital Twin OS prediction in relation to existing computational models found in the literature. The results are shown in Figure 3, with further details on these benchmarks available in Extended Data Table 1. When evaluating across all tissue types, the Digital Twin model exhibited commendable performance in comparison to the mean of all 29 other model-data methods and tissue types identified in the literature60–77. Regarding Glioma, our model demonstrated favourable performance compared to XGBoost-Surv by Dal Bo et al.69 (2023) and Deep Learning with Cox proportional hazard (CPH) by Jiang et al.77 (2021) on the Udine Hospital and TCGA datasets, respectively. In Breast Cancer and Glioblastoma our model performed comparably. Our model underperformed benchmarks for Ovarian Cancers and Head and Neck Cancers. This may be attributed to the absence of crucial data inputs, specifically, TNM cancer staging data.
However, benchmark studies typically do not address the prediction of survival curves for various drugs, including novel ones, and primarily focus on predicting survival curves for specific cancer types. In contrast, our model possesses the capability to address hypothetical scenarios, offering insights into questions such as the projected survival curve when a specific patient undergoes treatment with a novel drug.
Cohort Enrichment
The FDA defines cohort enrichment as the “prospective use of any patient characteristic to select a study population in which detection of a drug effect (if one is in fact present) is more likely than it would be in an unselected population.” Enrichment strategies should accelerate drug development, increase the magnitude of drug responses and therefore accelerate the path to drug approval.
Here we evaluated the effectiveness of using the predicted response score to segment cohorts into responder and non-responder cohorts. For each cohort we chose two thresholds to split the cohort into positive, intermediate and negative groups and evaluated the log odds ratio of overall response rate to assess the potential for cohort enrichment.
The cohorts were split into 3 groups to assess the interpretability and quality of risk stratification of the response scores predicted by the model.This cohort molecular enrichment approach was tested in 19 different solid tumour types and 17 different cytotoxic drugs (Figure 4). The Digital Twin output data suggests that 16 solid tumour types, except ovarian cancer, would benefit from a molecular predictive enrichment strategy integrated into clinical trial designs. This data also confirms that machine learning approaches can successfully identify molecular patterns and enrich drug response for commonly used chemotherapy drugs such as docetaxel or cisplatin, which currently do not have robust predictive biomarkers and are used in unselected cancer populations.
Discussion
Predicting therapeutic response has the potential to transform cancer care and oncology therapeutic development. We developed a Bayesian statistical approach that is similar to modelling gravitational lensing to predict the mass of dark matter by astrophysicists, where the complexity and interactivity of multiple data points is required. We show that this approach predicts with reasonable accuracy the response of therapeutics preclinically and clinically which we validated through comparison to clinical trials.
This approach can be applied at various points in therapeutic development. These include, but are not limited to:
Predicting response in preclinical models to inform decisions regarding which cancer and specific indication is more likely to be associated with response;
Combination strategies;
Selecting which therapeutics to advance through early clinical development;
Selecting which therapeutic to advance to late-stage development;
Improved patient selection for clinical development;
Construction of synthetic controls for clinical trials. In addition, other potential applications, not tested here, include prediction of toxicity.
One of the aspects of this approach is that the model can predict likely response for individual patients as well as cohorts. Predicting an individual’s response to a specific treatment ahead of time has the potential to substantially impact on routine cancer care. This would better inform clinical decision-making for an individual patient, avoiding likely ineffective therapies, and selecting the better option or a clinical trial. Moreover, increasing the accuracy of prediction for an individual would mean that they could potentially serve as their own control in a clinical trial. If the survival of an individual patient with standard of care could be predicted with a known level of accuracy, more meaningful information could be drawn from that individual’s response or lack of response to a novel therapy.
An important factor to consider is how accurate a prediction needs to be in order for it to be useful. With the high failure rate of oncology therapeutic development, an incremental increase in predictive accuracy for critical decisions would have potential significant impact on the probability of success.
Important current limitations that need to be addressed are the variability in predictive accuracy between different classes of therapeutics and different cancer types. Whilst this may be simply the amount and quality of data ingested, adjustments to the model may need to be made to reflect the mechanism of action of therapeutics, where known. Biological inferences from the model as it stands need to be developed further so as to better define candidate biomarkers that could be rapidly translated into the clinic.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
Online Methods
Here, we describe a Digital Twin, an integrated machine-learning model for simulating clinical trials and estimating clinical trial endpoints for various treatment scenarios including novel drugs. It consists of three main models: the Drug Efficacy Model (DEM), the Treatment Response Model (TRM), and the Overall Survival (OS) model and employs multi-modal data input as illustrated previously in Figure 1. The DEM is trained to estimate the impact of treatment. It provides an estimation of a patient’s treatment response in the prospective cohort under the prospective treatment, entirely from preclinical data. With treatment response estimation, the overall survival model predicts the clinical trial endpoints.
Drug Efficacy Model (DEM)
The DEM learns to estimate half-maximal inhibitory concentration (IC50) based on the pre-clinical information from cell-line studies. The model takes a set of features for the drug-patient pair: Whole Exome Sequencing (WXS), RNA sequencing (RNAseq), drug structures, and dose-response curves. The input data includes 545 compounds using SMILES (simplified molecular-input line-entry system) describing the structure of chemical species, 1343 genes in RNAseq, 1342 copy number variation (CNVs) from the WXS data and 130,000 Hill parameters for the dose–response curves. The drug structures were encoded as CDK8–10 molecular descriptors from their SMILES ID using cinfony12. The description of the input data for the Digital Twin DEM is shown in Extended Data Figure 1a.
Here, we deliberately opted for Random Forest78,79 (RF) because it allowed us to examine the leaf assignments that create the IC50 estimates. These leaf assignments are later used in the estimation of treatment response categories in the next stages of Digital Twin. Besides, RF tends to outperform other models in prognostic value80. DEM produces three types of outputs: 1) an estimate of the IC50 of a given drug in the given patient’s tissue, 2) drug combination synergy curves (the phenomenon where the combined effect of two or more drugs is greater than the sum of their individual effects), and 3) Perturbation Kernel81 leaf assignments. The latter is the main output, and input to the next stage of the Digital Twin, to the Treatment Response Model.
To evaluate the performance of the DEM, i.e. evaluate how good the predictions of a model on unseen validation data, we compare predicted against observed IC50 in Extended Data Figure 2. The root mean squared error (RMSE) of the log10 IC50 was 0.47 the coefficient of determination was 0.61.
Perturbation Kernel
Here, a Perturbation Kernel is a method based on a Random Forest Kernel (RFK)81, that helps find patterns and relationships between data points in a more complex space where those patterns might be more apparent. It is constructed using the random partition sampling scheme, generating several RF decision trees, which are trained on a subset of features. In trained RF, each leaf in each tree is considered a partition. In learning algorithms, a kernel is a function that calculates the similarity or distance between pairs of data points. It is calculated based on counting the fraction of times these data points share a partition. Thus, the more these points share the same partition, the less distance between them and the more similar they are.
The Perturbation Kernel leaf assignments are output derived from the DEM and are utilised to transfer knowledge concerning shared molecular mechanisms across in vitro studies, combination screens, and clinical treatments. It outlines the similarity in molecular mechanisms between pairs of patients, for example, a patient treated with a taxane and another patient treated with an anthracycline. Subsequently, these are employed to train the Treatment Response Model for the prediction of treatment response categories.
Treatment Response Model (TRM)
The TRM learns to accurately predict four RECIST treatment response categories49 from the output of DEM: 1) clinical progressive disease 2) stable disease, 3) partial response 4) complete response. The input to the model are the leaf assignments from the Perturbation Kernel, that is the output from the DEM. Random Forest in DEM allows the use of leaf assignments to form a kernel function between inputs. Assuming that similarity in IC50 estimates correlates with similarity in treatment response, it was possible to conduct kernel regression to estimate treatment response categories. To perform kernel regression we use the Gaussian Process2.
The performance of the TRM was evaluated based on its ability to classify the response categories in the overall pan-cancer setting, considering individual cancer types and distinct cancer treatments (see Extended Data Table 2). The following grouped RECIST categories were analysed: 1) Disease Control (combined complete response, partial response and stable disease) 2) Response (combined complete response and partial response) and 3) Complete Response. Overall, TRM demonstrated high accuracy, scoring 0.75, 0.74 and 0.69 in the respective grouped categories, along with Area Under the Curve (AUC) values of 0.63, 0.73, and 0.72. AUC reports the Receiver Operating Characteristic area under the curve for the non-thresholded prediction, and can be interpreted as the probability that a positive responder had a higher predicted probability of response than a negative responder. These were evaluated in a 5-fold cross-validation, and the reported value is the mean across the held-out cross-fold validation sets.
In addition, Extended Data Table 2 shows detailed model performance evaluation. Specifically, it displays accuracy, AUC, precision, recall and F1 score across a) grouped RECIST categories b) 9 cancer types and c) 12 cancer drugs. The Weighted Average provides scores weighted by the support for positive and negative responses. Average precision is the area under the precision-recall curve, measuring the average precision overall of all classification thresholds as a function of recall, with higher values being preferable.
Significant variations in performance are evident across treatments and tissue types, and much of this variability is likely attributed to the limited availability of response data for many cohorts.
Statistics
The clinical trials confidence intervals are calculated according to Altman82 and Sheskin83. To calculate the predicted log(OR) (log odds ratio), the log(OR) was calculated for each individual patient in the dataset based on the treatment response model’s prediction, and the mean log(OR) and standard error were directly calculated. Because the log(OR) can be calculated for each individual patient the confidence intervals are much smaller than for an equivalent clinical study which is measuring the difference in response between two populations. When calculating the log odds Overall Response Rate (ORR) for a combination therapy the highest log odds for all treatments in the combination was taken.
Overall Survival
We built an Overall Survival (OS) model as an integral component of the Digital Twin clinical trial simulator. A survival model is a statistical method to predict the time until an event, such as death. It deals with the censored data, indicating that the event of interest did not happen during the study period, and produces survival probability vs. time curves. We employed Random Survival Forest (RSF), a non-parametric ensemble learning method that can incorporate censored and time-to-event data58. The learning process involves the creation of multiple decision trees, and the model is selected based on the accuracy of predictions on unseen data.
The inputs for the OS are: 1) clinical records data, 2) cancer tissue type and 3) RECIST response categories from the TRM stage of the Digital Twin. Initially, 10,967 patients with survival data were pre-processed to end up with data from 4,029 patients with cancer tumour stages 0-4, lymph node stages 0-3, metastasis status (yes/no), and ages ranging from 11 to 90 years, across 23 cancer tissue types. Cancer types with small amounts of patients were removed from the analysis.
For patients with missing data, missing values were imputed with the mean values. The imputation was performed separately in the train set and validation set. Here, we group RECIST response categories into binary: 1) Disease (clinical progressive disease and stable disease, and 2) Response (partial response and complete response). The descriptive figure about the data is shown in Extended Data Figure 1.
To evaluate the performance of the model, we analysed five different prediction accuracy metrics across the solid tumours: 1) area under the receiver operating characteristic (AUC ROC) averaged for all times also known as cumulative dynamic AUC84 and 2) Uno’s concordance index (C-index) based on the inverse probability of censoring weights59. It is a goodness of fit measure for models that produce risk scores, commonly used in survival analysis. The intuition behind the C-index is – when comparing patients against each other if the patient with the higher risk score has a shorter time-to-event; 3) the integrated time-dependent Brier score85. It provides an overall calculation of the model performance at all available times. The smaller numerical values represent higher prediction accuracy (0 is the best achievable score with perfect accuracy and 1 is the worst score). 4) the Brier skill score that is the difference between the Brier score of the reference mode and the Brier score of the forecast model, divided by the Brier score of the reference model (1 is the best achievable score) and 5) explained variance, which is a proportion to which a model accounts for the variation of a given data set. Time-dependent metrics are integrated over time. We used scikit-survival86 for calculating the majority of metrics.
We conduct the performance evaluation using a 5-fold split (train/test splits), and all survival metrics reported are averaged across all cross-folds. In the pan-cancer setting, the model attained high accuracies as described by the following: AUC ROC = 0.78, C-index = 0.71, Brier score = 0.168, Brier skill = 0.42, and Explained Variance = 0.44. The performance of the OS per cancer was shown in Table 5 of the main body.
For the comparison with the benchmark studies (Figure 3 in the main body), we conducted a literature survey, selecting studies that focus on the survival analysis modelling within the oncology field. The complete list of studies is available in Extended Data Table 1. Given that the majority of studies utilise Harrell’s C-index88, we computed and compared it on a per-cancer basis.
Feature Importance
We performed the feature importance analysis which emphasised the fact that the predicted output from the DEM modulates the Overall Survival, influencing clinical trial outcomes.
We used two methods: permutation-based importance and Cox proportional hazards model. The former is calculated by randomly permuting a feature and measuring the difference between the model prediction score after permutation and without. Cox estimates the impact of individual covariates on the hazard ratio (HR), allowing to quantify how changes in specific features affect the risk of an event occurring over time. The latter is a standard survival analysis model in oncology87.
Results are shown in Extended Data Fig 3. Results indicate that the “Disease (No response)”, which is inferred from the DEM plays a significant role in model performance. A stronger “disease” results in a positive log(HR) on the Cox model with the highest HR among all features. The permutation importance method shows that this feature has high importance, almost comparable to the patient’s age, which is considered one of the most important factors in cancer survival.
Integration
Finally, we integrate DEM, TRM and OS models in order to be able to simulate clinical trials for both existing and novel cancer therapies. We call it a Digital Twin, a virtual model designed to accurately reflect clinical trials of cancer treatments with cytotoxic and small molecule therapies across various cancer types. The simulation output can then be any desired endpoint as predicted by the overall survival model.
The integration of distinct models involves utilising outputs from the preceding model as inputs for the subsequent model. DEM incorporates pre-clinical data, including gene expression and mutation profiles, drug response curves and drug compounds and produces a perturbation kernel. This perturbation kernel is subsequently employed in the TRM for predicting RECIST response categories. These predicted response categories are then incorporated into the OS model, along with clinical records data, to forecast overall patient survival over time.
Acknowledgements
This work was supported by Innovate UK [grant number 50074]. The data used in this study is in whole or part based upon data generated by the TCGA Research Network.