Integrated radiogenomics models predict response to neoadjuvant chemotherapy in high grade serous ovarian cancer

High grade serous ovarian cancer (HGSOC) is a highly heterogeneous disease that often presents at an advanced, metastatic state. The multi-scale complexity of HGSOC is a major obstacle to measuring response to neoadjuvant chemotherapy (NACT) and understanding its determinants. Here we propose a radiogenomic framework integrating clinical, radiomic, and blood-based biomarkers to measure and predict the response of HGSOC patients to NACT, showing how quantitative imaging data can serve as the backbone of multi-scale data integration. We developed and validated our approach in two independent highly-annotated multi-omic multi-lesion data sets. In a discovery cohort (n=72) we found that different tumour sites present distinct response patterns, and identified volumetric response assessment as a better predictor of overall survival (OS) than RECIST~1.1 status. We trained an ensemble machine learning approach to predict tumour volume response to NACT from data obtained prior to treatment, and validated the model in an internal hold-out cohort (n=20) and an independent external patient cohort (n=42). Benchmarking integrated models against models built on single data types highlighted the importance of comprehensive patient characterisation. Our study sets the foundation for developing new clinical trials of NACT in HGSOC.


15
High grade serous ovarian cancer (HGSOC) remains a major therapeutic challenge and usually presents 16 with advanced, multi-site metastatic disease. Neoadjuvant chemotherapy (NACT) followed by delayed 17 primary surgery (DPS) is now the most frequent first line therapy for advanced HGSOC (1, 2). However, 18 39% of patients did not obtain any direct benefit from it in the large international ICON8 trial (3-5). 19 Patient care would be substantially improved if these different sub-populations could be predicted before 20 treatment started, for example by identifying likely non-responders who should receive immediate primary 21 surgery. This variability in the response is driven by the complexity of HGSOC, which spans a large range 22 of scales -from macroscopic tumour volumes observed on radiological imaging to microscopic immune 23 microenvironments and sub-microscopic genomic diversity (6-8)-, but multiple-sampling strategies across 24 different metastatic sites are not scalable. 25 Here we present an integrative radiogenomic framework to address the two fundamental challenges in 26 the treatment of heterogeneous multi-site disease. The first challenge is accurately measuring response. 27 The widely used RECIST 1.1 criteria are based on one-dimensional measurements performed on a small 28 subset of lesions (9) and are thus insufficient for complex diseases such as HGSOC (10-12). This lack 29 of accurate response measures prevents the optimisation of treatment strategies and the development 30 of clinical trials for new combination therapies (13). The second major challenge is predicting response. 31 Studies so far have focused on single data streams which can only offer a partial view of the disease, such 32 as clinical features (14, 15), CA-125 (16-18), computed tomography (CT) imaging (19,20), and circulating 33 tumour DNA (ctDNA) (21). The superior predictive power of integrative models for complex endpoints is 34 1 D R A F T well documented in several cancer types (22-24), but training such models requires large, well-annotated, 35 multi-omic datasets, which have not been available in HGSOC. In particular, radiological imaging is the 36 only data source to capture the spatial heterogeneity of metastatic disease, but has so far been underused in 37 HGSOC, where existing radiomics studies have focused only on correlations (19,20), rather than combined 38 predictive power, and none have considered NACT. 39 To overcome these challenges we used two independent highly annotated data sets to develop an integrated 40 radiogenomic framework for measuring and predicting the response of HGSOC patients to NACT. Our 41 framework showcases the power of quantitative imaging data to characterise metastatic disease and to serve 42 as the backbone of multi-scale data integration.

44
This study is based on two data sets, the first one (n = 92) from the NeOV trial and the second one (n = 42) 45 from the Barts Health NHS Trust (Materials and Methods). Patients show large variability in treatment 46 regimes and response patterns (Figure 1a). The NeOv dataset was randomly divided into a training set 47 (n = 72) and a hold-out internal validation set (n = 20, Figure S1a). The training data set was used for the 48 exploratory analyses and to train the machine learning models. The hold-out set was kept aside and only 49 used for performance assessment after all predictive models had been fully trained. The Barts data were 50 used as an independent, external validation data set.  Despite the overall differences between sites, RECIST 1.1 status provided significant stratification for 59 lesion-wise response (Figure 1f). In addition, the volume of omental disease at baseline was higher (p = 0.05) 60 in responders as assessed by RECIST (complete or partial response; median = 85 cm 3 ) compared to 61 non-responders (stable disease or progression; median = 31 cm 3 ). 62 While the number of disease locations at baseline was significantly correlated with response, disease 63 volume at baseline (overall or in specific anatomic locations) was not (Table 1). This indicates that 64 multivariable predictors are required to predict response to NACT rather than simple knowledge about 65 disease burden and its anatomic distribution.    been shown to enhance tumour DNA detection (27). Owing to the strong correlation of t-MAD with TP53 71 MAF (p < 0.0001), only TP53 MAF was included in univariable analyses ( Figure S1c).

72
Total disease burden at baseline (total volume, number of lesions, and summed RECIST 1.1 diameters) 73 correlated significantly with both CA-125 and TP53 MAF (Table 2). Neither baseline CA-125 nor TP53 74 MAF correlated significantly with response (Table 1), but they did show a significant positive correlation 75 with the summed RECIST 1.1 diameters post-chemotherapy.

76
Baseline CA-125 correlated with both omental disease and pelvic and ovarian disease measured before 77 chemotherapy. Similarly, baseline TP53 MAF correlated significantly with pelvic and ovarian disease 78 measured both before and after chemotherapy. However, it did not correlate with omental disease at either 79 time point (Table 2). This suggests that high TP53 MAF at baseline could be a specific indicator for high 80 disease burden in the ovaries or pelvis, which tends to show poorer response ( Figure 1).

D R A F T
Some families of radiomics features correlate with clinical and biological characteristics. To capture the 82 radiological complexity of the disease we defined several collections of radiomics features (see Table S9 83 for the full list). Volumes and number of lesions were calculated for the relevant anatomical sites. Shape 84 features, first-order histogram statistics and texture features ('intensity radiomics') were calculated for each 85 lesion and averaged over the whole disease. Intra-lesion heterogeneity was assessed by contracting the lesion 86 contours and calculating the ratio of radiomics features before and after the contraction ('rim radiomics').

87
Similarly, the external context of the lesions was assessed by calculating the ratio of radiomics features 88 before and after dilating the contours ('peripheric radiomics'). Finally, we also defined a series of binary 89 variables to describe additional radiological findings ('semantic features'): ascites and pleural effusion were 90 assessed manually, and we used a previously developed automated tissue-specific sub-segmentation tool to 91 identify hypodense (cystic/necrotic spaces) and hyperdense (calcifications) lesion parts within the manual 92 segmentations (28). 93 We found that imaging features grouped into 6 distinct clusters (Figure 2). Cluster 1 was associated with 94 baseline CA-125 levels (r median = 0.34) and contained mostly lesion volume metrics. This is consistent with 95 previous work suggesting that CA-125 correlates with lesion volume (29). Clusters associated with ctDNA 96 features were generally dominated by features quantifying lesion heterogeneity and context. Cluster 5 was 97 primarily associated with ctDNA TP53 status (r median = 0.25), and contained predominantly peripheric 98 radiomics features, which quantify lesion context. Cluster 6 was associated with ctDNA TP53 MAF and 99 tMAD (r median = 0.20 and 0.19 respectively), and contained predominantly rim ratio radiomics features, 100 which provide information on intra-lesion heterogeneity.

101
Cluster 4 was highly correlated to stage (r median = 0.36), and was composed of a mixture of features 102 related to shape, volume, and number of lesions, which quantify the disease burden. This is consistent with 103 the definition of FIGO stage, which relies on the assessment of the extent and spread of the disease (30).

104
Cluster 2 was associated mostly with age (r median = 0.23) and contained almost exclusively rim ratio 105 radiomics features, which provide information about intra-tumour heterogeneity. The remaining group 106 (cluster 3) was formed by a heterogeneous mixture of features, and did not associate with any biological or 107 clinical feature.

108
These results indicate that some of the information that global biomarkers such as stage, CA-125 or 109 ctDNA provide can also be captured in multi-lesion radiomics features that quantify the extent, spread, 110 heterogeneity and context of the disease. In addition, as shown in Figure 2, clusters 1, 2, 3, and 5 contain 111 imaging features that are significantly correlated with volumetric response to treatment after multiple 112 comparison correction. Crucially, the fact that cluster 3 has negligible biological or clinical associations 113 Table 2. Spearman correlation coefficients between pre-and post-NACT measurements of tumour burden and blood biomarkers measured at baseline.

D R A F T
suggests that there may be additional value in the integration of all sources of data to predict treatment 114 response.

115
Volumetric response is a better prognostic marker than RECIST 1.1 status. We considered three possible 116 metrics for treatment response assessment: RECIST 1.1 response status, RECIST 1.1 summed diameter 117 ratio, and total volume ratio. The three metrics have the advantage of being defined at the end of NACT, 118 and can therefore be used as clinical trial endpoints for rapid response evaluation. The decision of which one 119 to use for predictive modelling was based on the evaluation of their prognostic power. We built univariable 120 Cox proportional hazards models for PFS and OS (Table S1). RECIST 1.1 status was not significant for 121 either PFS or OS. PFS was significantly predicted by the summed diameter ratio (p = 0.014), with the 122 total volume ratio (p = 0.077) showing a strong trend. OS was strongly predicted by the total volume ratio 123 (p = 0.026) while the summed diameter ratio was borderline significant (p = 0.051). 124 We also studied the power to discriminate very good or very poor responders by stratifying Kaplan-Meier 125 curves by the top and bottom 20th percentiles. We found that RECIST 1.1 status did not provide significant 126 separation between the groups (Figure 3), for either PFS or OS. By contrast, top responders identified 127 with the RECIST 1.1 summed diameter ratio (p = 0.03) and particularly the total volume ratio (p = 0.01) 128 both had significantly longer PFS. Similarly, the worst responders identified according to the total volume 129 ratio had significantly worse OS (p = 0.03), but no discrimination was provided by the ratio of summed 130 diameters. We therefore decided to use the total volume ratio as the response metric on which to train our 131 predictive modelling framework.

132
Integrative models predict volume response to neoadjuvant chemotherapy. We built a machine learning 133 framework to integrate the different streams of data into a set of predictive models of response to neoadjuvant 134 chemotherapy. We used three distinct datasets in the process. The NeOv cohort (n = 92) was first randomly 135 split into a training set (n = 72) and a hold-out validation set (n = 20). Model parameters were optimised 136 and fixed using the training dataset in a cross-validation setup. The hold-out validation set was used as 137 an independent test of model performance. A further external dataset, the Barts cohort, was used as an 138 independent validation set (n = 42). 139 We trained a collection of models with an incremental, cumulative number of integrated features: age and 140 FIGO stage, treatment characteristics, CA-125, radiomics features, and ctDNA, as depicted in Figure 4a. 141 The full list of features is included in Tables S8 and S9.

142
The response variable was defined as the logarithm of the post-chemotherapy total tumour volume 143 divided by the pre-chemotherapy total tumour volume, as measured on the corresponding CT scans. This 144 choice of response variable was motivated by the previous observation that volume shrinkage indicates 145 longer OS and PFS (Figure 3), and is therefore prognostically important.

146
Model predictions were obtained by averaging the outputs of three machine learning pipelines based 147 on elastic net, support vector regression, and random forest, respectively. The pipelines also included 148 collinearity reduction and feature selection steps. We optimised model hyperparameters by minimizing 149 the mean square error (MSE) in a 5-times 5-fold cross-validation scheme applied on the training cohort. 150 During the training process, we found that the cross-validated MSE was reduced by 25% as a result of the 151 successive integration of clinical, radiomic, and blood-based biomarkers. Adding ctDNA only resulted in a 152 marginal improvement from 24% to 25% reduction in MSE (Table S7).

153
The final models were obtained by fixing the optimal hyperparameters and re-fitting to the entire training 154 cohort. To assess the discriminative performance and calibration of the final models we applied them on 155 the hold-out validation cohort. We observed a similar, gradual reduction in MSE, reaching a reduction of 156 15% for the integrated model without ctDNA, and 14% after integrating ctDNA ( Figure 4 and Table S7). 157 In addition, radiomics and ctDNA cumulative integration models produced response scores that were 158 significantly correlated with the observed volume response (Spearman r = 0.5, p = 0.02 in both cases, 159 Figure 4b and Table S7). Although the models were not trained to predict RECIST 1.1, we observed 160 that the predicted scores were able to correctly rank the three RECIST 1.1 response groups in both the 161   radiomics and ctDNA cumulative integration models (p = 0.02, Figure 5). 162 We tested the generalizability of the models by applying them on an independent external cohort (Barts 163 cohort, n = 42). As this cohort of patients did not have ctDNA data available, we were able to test the 164 clinical, CA-125, and radiomics integration models only. We confirmed that integration was beneficial for 165 discrimination, as only the radiomics integration model produced significant response scores (Spearman 166 r = 0.32, p = 0.04, Figure 4b and Table S7), with a reduction in MSE of 8%. We also confirmed that the 167 predicted scores were able to correctly rank the RECIST 1.1 response groups in the radiomics integration 168 model (p = 0.005), but not the simpler ones ( Figure 5). 169 We studied the relative contribution of the features used by the final models in two different ways. First, 170 we examined which features passed the feature selection step in the three different component pipelines that 171 integrate the machine learning models. Figure 6a shows the number of pipelines that selected a particular 172 feature in each of the cumulative models, a metric that we call the 'selection frequency'. We found that 173 the treatment regimen (whether the patient received paclitaxel weekly, and whether the patient received 174 carboplatin only) and the number of cycles of chemotherapy before the second scan were consistently 175 selected across models. We also found that CA-125 was used in models that did not include radiomics, but 176 was dropped from radiomics integration models. Semantic features, in particular pleural thickening and the 177 presence of a hyperdense region in an omental lesion, were selected across all the relevant models. Mean 178 volume, number of lesions, and volume and number of infrarenal lymph nodes were also selected. A small 179 and consistent number of radiomics features were also selected, most of them belonging to the category of 180 features defined to quantify lesion context. 181 We also studied the feature importances in the elastic net and random forest component pipelines. 182 To quantify the importances we used feature coefficients in the elastic net pipeline, and impurity-based 183 Gini importances for the random forest pipeline. Figure 6b shows the averages of the normalised feature 184 importances for those two component pipelines, for each of the 9 cumulative integration models. We found 185 Crispin-Ortuzar, Woitek et al.

RECIST status Ratio of RECIST diameters Ratio of total volumes
June 7, 2021 | vol. XXX | no. XX | 7 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 23, 2021. ;   that most of the models integrating a large number of features tend to be more dense, with features sharing 186 similar, lower importance levels. Features that tended to have larger importances were generally consistent 187 with those that had the highest selection frequency, as can be observed by comparing the two panels in  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 23, 2021. ;   . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 23, 2021. ;

D R A F T
attempted. We have shown that integrated machine learning models based on baseline multi-scale data 195 predict volumetric response to NACT (p = 0.04, external validation cohort). Our approach uses radiomics, 196 the only data source capable of simultaneously measuring the entire metastatic disease, as the backbone of 197 the prediction. Radiomics features had the strongest effects in the prediction models, which were validated 198 in an independent cohort. We also found that volumetric response was a better predictor of OS (p = 0.03) 199 than RECIST 1.1 status (p = 0.3) and revealed site-specific patterns of differential response and correlation 200 with ctDNA detection.

201
The field of radiomics has grown exponentially in recent years, showing great promise across tumour 202 sites and endpoints (31). At the same time, radiomics has been criticised for lack of robustness and 203 reproducibility, as well as lack of biological interpretability (32, 33). Our study shows that both problems 204 can be overcome by the right design choices. We made robustness a design priority for our predictive 205 framework, which included strong feature selection, model ensembles on multiple levels, and repeated 206 re-shufflings of the data to avoid biases. In addition, we trained the model on a dataset with heterogeneous 207 imaging parameters (34); and we included new families of imaging features based on ratios between different 208 volumes of interest, which were designed to partially cancel out such biases. We also curated an independent 209 cohort from a different institution for validation, which we were able to perform successfully.

210
Our feature definitions were also designed to improve interpretability: the ratio features are not only 211 robust to imaging parameters, but they also helped us to explore internal heterogeneity (rim features) and 212 the external context of the lesions (peripheric features). Interestingly, we found that ratio features were the 213 most numerous in the final models. This result is in line with previous work that found that qualitative 214 radiological features describing the edges of peritoneal disease (nodular, diffuse, or mixed) were significantly 215 associated with CLOVAR subtype (35) and BRCA mutation status (36). The concept of peritumoural 216 radiomics has been explored before in breast, lung, and liver cancer (37-40), but to our knowledge our 217 study is the first one to apply it to ovarian cancer. Our findings motivate the development of more detailed 218 studies focusing on the boundary regions of ovarian cancer lesions -and show that data-driven radiomics 219 analyses can support biological hypothesis generation.

220
Even if the hypothesis-motivated features had not played a significant role, the interpretability of the 221 results was naturally supported by the integrative framework that the radiomics analysis was embedded 222 in. For example, our study confirms previous observations that the presence of ctDNA is correlated with 223 volume of disease at the start of treatment in HGSOC patients (26). The integration of clinical data into 224 our models also yielded important insights, as features describing the type of NACT and its timing were 225 consistently selected by our models. The ICON8 study (to which part of our discovery cohort was recruited) 226 showed that PFS in patients with ovarian cancer undergoing NACT was unaffected by the administration 227 regimen of paclitaxel (3-weekly as is standard or in a dose-dense weekly regimen) (41). However, the feature 228 describing the mode of administration (weekly versus 3-weekly) was consistently selected by our prediction  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 23, 2021. ;

D R A F T
biomarkers, including ctDNA and CA-125. We showed that global tumour burden and volume of pelvic and 244 ovarian tumors at baseline are significantly correlated with both CA-125 and ctDNA TP53 MAF. However, 245 our analysis of volumetric data also showed that the ctDNA signal was only due to the presence of disease 246 in the ovaries and/or pelvis, suggesting that the simultaneous reading of ctDNA and CA-125 at baseline 247 could play a role in helping determine disease spread and response in the diagnostic setting.

248
Some of these findings highlight the limitations of previous studies and ours. For example, even though 249 we found that change in total volume may be a better prognostic indicator than RECIST-1.1, it is still 250 treating the disease as a single entity, ignoring the subtle interplay between the pelvis/ovaries and the 251 omentum. Similarly, the importance of some data streams is driven by the specific features that were 252 included in the model. Indeed, given the correlation between ctDNA and total disease volume, and the 253 higher dimensionality of the collections of clinical and radiomics features, it is not surprising that we did 254 not find any additional value in adding ctDNA based biomarkers to our integrated models. However, our 255 study was limited to ctDNA t-MAD, TP53 MAF and TP53 mutation status; a higher-dimensional panel 256 of ctDNA sequencing data, applied on a larger dataset, and potentially measured longitudinally across 257 different time points, may find that there are other complementarities between ctDNA and imaging data. 258 Cell-of-origin analysis via mutation, methylation or fragmentomics to infer contribution from specific cell 259 types to the circulating free DNA pool could help to predict specific volume reductions. Our dataset also 260 lacked detailed quantification of BRCA1/2 status and homologous recombination deficiency, which are 261 likely to be important factors and should be incorporated into future models (43).

262
Another key limitation of the study is that manual segmentation for volumetry is still extremely time 263 consuming. However, accurate and highly adaptable deep learning models for automatic segmentation are 264 being rapidly developed across cancer types, facilitating the integration of volumetry into clinical workflows 265 as well as large-scale radiomics computation (44).

266
In conclusion, our study is a proof-of-principle for the integration of multi-scale data to describe and 267 predict the response of HGSOC patients to NACT. We demonstrate that the systematic multi-scale 268 integration of standard-of-care biomarkers provides critical predictive power and important insights into the 269 complex spatial configuration of the disease. After the necessary clinical validation in larger, prospective 270 cohorts, a framework like the one we propose could have significant impact as a stratification tool in clinical 271 and experimental settings -for example avoiding delays in surgery for patients who are unlikely to respond 272 to chemotherapy-, and could bring forward a new generation of clinical trials for HGSOC, with rapid, 273 effective endpoints that improve and expedite the discovery of new therapies.

275
Patient cohorts. Two patient cohorts were used in this study. The main cohort (the 'NeOV' cohort) was 276 randomly split into a training set and a hold-out set. The training set was used to train the machine 277 learning models, and as a discovery dataset for univariable analyses. The hold-out set was set aside and 278 used only to validate the model predictions. A second dataset (the 'Barts' cohort) was used for external 279 validation.

280
For both data sets, patients had a confirmed histopathological diagnosis of HGSOC and were treated 281 with neoadjuvant chemotherapy before delayed primary surgery. All patients within the main data set 282 were treated at Cambridge University Hospitals NHS Foundation Trust between 2009 and 2020 and were 283 recruited into a prospective clinical study approved by the local research ethics committee (REC reference 284 numbers: 08/H0306/61). All patients within the Barts data set were treated at Barts Health NHS Trust 285 between 2009 and 2018 and recruited into prospective clinical study approved by the local research ethics 286 committee (IRAS reference numbers: 243824).

287
Written informed consent was obtained from all patients prior to any study related procedures. The 288 study was performed in accordance with the principles of the 1964 Declaration of Helsinki and its later 289 amendments or comparable ethical standards. Patients were identified and included based on the availability 290 of at least two CT scans at baseline and prior to DPS. Additionally, for the main data set at least one 291 D R A F T baseline plasma sample for ctDNA assessments was required.

292
Clinical data. Data regarding patient demographics, treatment, and disease were collected from the patient 293 electronic medical records including notes from multidisciplinary team discussions (MDTs). PFS was defined 294 as the time between histopathological diagnosis and first radiological evidence of progression or recurrence.

295
Where progression date was unclear from radiology reports alone, (e.g. successive imaging studies with 296 subtle/mixed changes), clinical interpretation of progression was incorporated in PFS date calling (e.g.    Table S2 shows the breakdown for the three datasets.    . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 23, 2021. ;

D R A F T
Therefore, different manufacturers and scanning protocols were used. Baseline scans were acquired between 338 0 and 14 weeks before initiation of neoadjuvant chemotherapy and post-treatment scans were acquired for 339 response assessment after 1.6 to 5.8 months of treatment. All scans were initially identified on the local 340 PACS and then fully anonymised for further study related processing.

Data processing.
342 Image segmentation and labelling On axial images reconstructed with a slice thickness of typically 5 mm 343 (Table S3), and pixel spacings ranging between 0.053 and 0.095, using abdominal soft tissue window 344 settings, all cancer lesions were segmented semi-automatically by a board certified radiologist with ten 345 years of experience in clinical imaging, using Microsoft Radiomics (project InnerEye; Microsoft, Redmond, 346 WA, USA). The volumes of interest (VOIs) were annotated for their anatomic location: omentum, right 347 upper quadrant, left upper quadrant, epigastrium, mesentery, right paracolic gutter, left paracolic gutter, 348 ovaries & pelvis, infrarenal abdominal lymph nodes, suprarenal abdominal lymph nodes, inguinal lymph 349 nodes, supradiaphragmatic lymph nodes, other chest lymph nodes, parenchymal liver metastases, and 350 lung metastases. Cystic and solid tumour parts were included in these segmentations. Automated sub-351 segmentation of hyperdense/calcified, hypodense/cystic or fatty and intermediately dense/solid tissue was 352 performed for omental lesions and lesions of the ovaries and pelvis using a previously described and validated 353 technique (28). Baseline and follow-up CT scans were evaluated according to RECIST 1.1 for response 354 assessment (9). Pleural effusions and ascites were assessed semiquantitatively (0 = none, 1 = trace, 2 = 355 less than 5 cm when measured perpendicularly to chest/abdominal wall, 3 = 5 cm or more when measured 356 perpendicularly to chest/abdominal wall.

357
Radiomics features VOIs drawn manually were split into connected components using MATLAB's bwlabeln 358 function with a three-dimensional connectivity of 26, which assumes that voxels are connected if their 359 faces, edges, or corners touch. Voxels with intensities below -100 HU were removed from the radiomics 360 calculations. Radiomics features were extracted using the CERR Radiomics toolbox (48) (December 2018 361 version, GitHub hash: 5974376be7103d5c3831690c62aa721fc784d949), including shape, intensity-volume 362 histogram, first-order, and Haralick texture features (see Table S9 for the full list). Intensity-volume 363 histogram features, inspired by the Vx features commonly extrated in radiotherapy dose-volume histograms, 364 corresponded to the volumes spanned by voxels above a certain intensity value (denoted 'HU>x' in Table S9). 365 To calculate Haralick texture features for each lesion, co-occurrence matrices for 100 grey levels (up to 366 a maximum of 1000 HU) were computed independently for each direction along 2D slices and averaged. 367 To calculate the rim and peripheric radiomics features, for each VOI two copies were created by eroding 368 and dilating the contours by 0.4 cm along the 2D slices. The value of the margin was chosen in order to 369 capture slices of at least 1 cm diameter. Erosion and dilation were achieved by convolving the contour with 370 a circular mask of the desired margin. Ratio features were computed by dividing the results obtained from 371 the eroded and standard volumes (rim features), or the dilated and standard volumes (peripheric features). 372 Shape features were not included in the ratios. Once standard and ratio features were calculated for each 373 lesion, a single value was extracted for the whole patient by taking the unweighted mean of all lesions.

374
Other imaging features In contrast to the radiomics features, which we averaged across lesions, we did use 375 the volume and the number of lesions in each of the anatomic locations as individual features. We also 376 computed the mean, maximum, and total volume, as well as the number of lesions with volume bigger than 377 0, 1, 10, and 100 cm 3 . In addition, we defined four binary features that indicated whether or not there were 378 hypodense or hyperdense regions in either omental or pelvic/ovarian lesions. Ascites and pleural effusion 379 were used as defined by the radiologist, as explained above (section 4, Radiological image analysis).

380
Clinical features Chemotherapy regimens were extracted from the clinical records. We recorded whether 381 the patient had received any Carboplatin, Paclitaxel or Doxorubicin in three binary variables. Mean periods 382 were calculated by averaging the time intervals between sessions. We defined weekly regimen as having 383 a mean period of 6 to 10 days, both included; and three-weekly as having a mean frequency of 18 to 24 384 D R A F T days, both included. Typical combinations included weekly Paclitaxel and three-weekly Carboplatin; both 385 weekly; and both three-weekly ( Figure 1). These combinations were therefore encoded in binary variables. 386 We also recorded whether patients had received Carboplatin only in a binary variable. FIGO stage was 387 encoded by assigning ordinal numbers to in order of aggressiveness, from 1 for stage 1A to 10 for stage 4B.

388
The exact mapping is listed in Table S4. CA-125 values were also extracted from the clinical records, with 389 the measurement closest to the beginning of treatment being used for analysis. The full list of clinical and 390 blood biomarker features can be found in Table S8.   Table S1. Kaplan-Meier curves were 410 also computed using the Lifelines library. 95% confidence intervals were computed using Greenwood's 411 Exponential formula. p values were computed using the log-rank test.

412
Machine learning models.

413
Training We created a machine learning framework to predict response to chemotherapy, evaluated on the 414 basis of relative total volume change. We used the NeOv training set to train and optimise the models.

415
Once trained and frozen, we evaluated the models in the internal hold-out set and in the external validation 416 set. We used an increasing number of features, in order of general availability. We started with clinical 417 features (age, stage, and treatment); then added baseline CA-125; then imaging features; and finally ctDNA.

418
For each combination we retrained the framework and derived a new model. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 23, 2021. ;

D R A F T
the hyperparameter space to optimise mean square error (MSE). Once the optimal hyperparameters were 431 found, we determined model parameters by re-fitting the model to the entire training set. To increase model 432 robustness, we repeated this process five times with five different cross-validation seeds. The five resulting 433 optimal models were combined to form the final ensemble, in which the prediction is simply the average of 434 the five predicted scores. In this regard, our ensemble setup has two different tiers: the randomisation tier 435 (5 seeds), and the algorithmic tier (3 regressors acting in parallel for each seed).

436
Validation We validated the models on the hold-out internal validation set (n = 20) and the external 437 validation set (n = 42). To quantify the calibration of the models, we computed the MSE. To quantify the 438 discriminative power of the models, we computed the Spearman correlation coefficient and p value between 439 the predicted and observed response scores. Finally, we evaluated whether the predicted scores were also 440 able to rank patients into the different RECIST 1.1 categories using the p value associated with the point 441 biserial correlation coefficient.

442
Feature importance We evaluated feature importances in two different steps. First we computed the 443 frequency with which features were selected after the collinearity reduction and univariable selection steps. 444 We repeated the process for each of the three algorithms and each of the five cross-validation seeds, which 445 means that features could be selected between 0 and a maximum of 15 times, as seen in Figure 6. The 446 table on Figure 6 displays only features that were chosen at least 3 out of 5 times in each cross-validation 447 loops, for robustness. Second, we computed the importance of each individual feature within the regression 448 algorithm. This was only possible for the elastic net, where we used the feature's coefficients, and the 449 random forests, where we used impurity-based feature importances. The results were averaged across the 450 five seeds and the two algorithms, with the table on Figure 6 displaying only features that were chosen at 451 least 3 out of 5 times in the cross-validation loops, as before.

452
Data and code availability 453 The code and data can be found at https://github.com/micrisor/OvarianIntegration.    . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 23, 2021. ; . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 23, 2021. ;  Fig. 2. Site-specific change in volume for each of the patients in the training dataset. Patients are ordered as a function of the total change in volume.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 23, 2021. ; . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 23, 2021.            . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 23, 2021. ;   . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)