Predicting Success of Phase III Trials in Oncology

We developed a model predicting the probability of success (PoS) for single planned or ongoing PhIII trials based on information available at trial initiation. Such a model is highly relevant for study sponsors to capture risk and opportunity on a trial-by-trial basis through trial optimization, and for investors to select drugs whose trial design matches their investment strategy.

Objectives: To predict the outcome of planned or ongoing PhIII trials in oncology, given publicly available prior information.

Design, Setting, Participants: Predictive modeling using publicly available data for 360 completed PhIII and 1240 PhII studies initiated between 2003 and 2012. Success and failure of PhIII studies were modeled using a Bayesian logistic regression model.

Main Outcome Measures: Predicted PoS of individual PhIII trials based on a Bayesian model calibrated on publicly available data translated into 16 composite scores. Those scores cover aspects such as trial design, indication, number of patients, phase II (PhII) study outcomes, experience of the sponsor at the time of trial initiation, and others.

Results: The model yields the PoS distribution, including credible intervals, for a PhIII trial in oncology. Predictive performance was measured as the area under the receiver operating characteristic curve (AUROC), giving an overall predictive performance of 73% (mean AUROC). We identified two key factors contributing to the predictive performance of the model: quality and strength of PhII data, and experience of the sponsor at the time of study initiation.

Conclusion and Relevance: We describe the generation and application of a statistical model predicting the PoS for individual PhIII trials in oncological indications with unprecedented predictive performance. Compared to other approaches, this is the first study generating a fully transparent model resulting in trial-specific PoS distributions. Moreover, we have shown that qualitative concepts such as PhII knowledge or sponsor R&D strength can be captured in quantitative scores and that these scores have high predictive power.


Introduction
In recent years, predictive algorithms have become ubiquitous across a wide range of industries, such as logistics (e.g. Amazon's predictive shipping [1]) or information retrieval (e.g. Google's predictive search algorithm [2]). By combining information from a vast number of sources in an objective, unbiased manner, predictive algorithms can outperform human decision making with respect to accuracy and speed at marginal cost. Even in the public sector, political decision makers have become increasingly aware of the importance of accurate predictions and have started evaluating different approaches in forecasting tournaments [3]. Predictive algorithms can aid decision makers in the pharmaceutical industry as well. However, so far, adoption has been limited. The current decision making process, from discovery to clinical development phases, is characterized by a series of decision points defined by formal go/no-go criteria [4] that relate to the available clinical data. This process is implemented to aid executives as it reflects regulatory requirements for each indication. However, it has been demonstrated that decisions based on real world cases vary greatly, ranging from absolutely 'go' to absolutely 'no-go', due to subjective interpretation of identical data [5].

We argue that predictive algorithms can be employed for improving and rationalizing decision making in Pharma, particularly in clinical trials, which represent the most crucial and expensive centerpiece of drug development. Predictive algorithms based on Big Data have demonstrated that they are able to aid or even outperform drug developers and physicians when it comes to predicting patient accrual rates in clinical trials [6,7], optimal cancer rehabilitation [8], or supportive care interventions [8].

Trial decision making for any drug in clinical development has far-reaching implications. On one hand, hundreds to thousands of patients are recruited to test the effects, each of them hoping to benefit from the new drug. On the other hand, sponsors risk tens to hundreds of millions of dollars on trials that may or may not demonstrate the drug's effect [9]. It is therefore in the interest of sponsors to terminate failures early, without compromising on the quality of

Results
Database
ClinicalTrials.gov is the largest clinical trial registry, with >183,000 registered trials at the time of query (Fig 1, I). Strict application of exclusion criteria resulted in 360 PhIII trials across 37 oncology indications (Fig 1, IV). The 111 NBEs and NCEs associated with these PhIII studies served as a starting point for the complementary analysis, which aimed to gather all publicly available PhII and PhIII information associated with the original set of 360 PhIII studies by searching for the drug name, the trial identifier, and trial names and their synonyms. We searched the European and Japanese trial registries EudraCT and JAPIC, scanned published literature using PubMed.gov and company homepages, and checked Adis R&D, a database for drug research and development (Fig 1, V). This approach gave rise to 1240 PhII studies, 488 publications, and 111 ADIS data entries, respectively, with detailed information about PhII and PhIII study outcomes, endpoint measures, study design, actual patient population, etc. (eTables 1-3). Note that PhI results were not considered for this proof-of-principle analysis, given that PhI studies are less well defined with regard to patient stratification (mixed patient population with regard to tumor type, tumor stage, line of treatment) and endpoint measures. We defined trial success as meeting all primary endpoints (see eTable 3 for detailed classification).

Scoring approach for predictive modeling
The key step in predictive modeling is to construct informative (predictive) scores from the available raw data. To this end, we employed a hybrid approach that combines (a) expert-driven design of complex scores to incorporate human judgement and experience and (b) a purely data-driven approach to assigning weights to variables and then selecting the most predictive variables. The group of domain experts consisted of eight consultants with an academic life science background and several years of work experience in pharma R&D. Notably, expert input was not employed with regard to individual trials (which would lead to biased results), but exclusively in the structural design of the scoring methodology. Fig 2 provides a conceptual overview of this approach illustrated by an exemplary drug X, which is developed in three different indications A, B, and C (Fig 2A). We identified a battery of information available at a given point in time (Fig 2A, time of assessment, vertical grey line) to assess the probability of success for a given PhIII trial of interest (TOI, blue box). We classified all available information into time-dependent variables (Fig 2B, eTable 1), drug-related characteristics (Fig 2C, eTable 2) and trial-specific characteristics (Fig 2D, eTable 3). Within each of these categories we created composite scores combining weight and magnitude of information. All composites were broken down to objective and quantifiable elements (Fig 2B i-iii). We developed unbiased decision matrices for variables that offered no direct read-out from the database (e.g. Relatedness between PhII and PhIII TOI, eFig 2). In total, we created three classes of parameters comprising 16 composite scores (see eTables 2-4) based on 82 variables. Consequently, each PhIII trial is described by a unique set of descriptors, allowing the generation of a trial-specific prediction.

medRxiv preprint (not certified by peer review), doi: https://doi.org/10.1101/2020.12.15.20248240; this version posted December 16, 2020. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Generation of predictive model
The overall approach to predictive modeling is a dynamic Bayesian logistic regression [23]. More specifically, the log-odds of the binary target variable ( [...] scores only such information is used that was available at the time point of the respective PhIII trial's initiation date (eTable 1). For variable selection (b) and overall performance evaluation (c) we adopt a time-series cross-validation strategy.
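To make the idea of a Bayesian logistic regression concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single standardized composite score plus an intercept, independent Gaussian priors on the coefficients, and a plain random-walk Metropolis sampler; the "dynamic" aspect of the model is not reproduced. The toy data, the prior width `prior_sd`, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_posterior(beta, X, y, prior_sd=2.0):
    """Log of (Gaussian prior x Bernoulli likelihood), up to a constant.

    The Gaussian prior is an assumption of this sketch.
    """
    lp = -sum(b * b for b in beta) / (2.0 * prior_sd ** 2)
    for xi, yi in zip(X, y):
        z = sum(b * v for b, v in zip(beta, xi))  # log-odds of success
        # y*z - log(1 + exp(z)), written so that exp() cannot overflow
        lp += yi * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return lp


def sample_posterior(X, y, n_draws=4000, burn_in=1000, step=0.3, seed=0):
    """Random-walk Metropolis sampler over the coefficient vector."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    current = log_posterior(beta, X, y)
    draws = []
    for i in range(burn_in + n_draws):
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, X, y)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
        if i >= burn_in:
            draws.append(list(beta))
    return draws


def pos_summary(draws, x_new, level=0.9):
    """Posterior mean and equal-tailed credible interval of the PoS."""
    pos = sorted(
        sigmoid(sum(b * v for b, v in zip(beta, x_new))) for beta in draws
    )
    lo = pos[int((1.0 - level) / 2.0 * (len(pos) - 1))]
    hi = pos[int((1.0 + level) / 2.0 * (len(pos) - 1))]
    return sum(pos) / len(pos), lo, hi


# Toy data: each row is [intercept, one standardized composite score].
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -1.0, 1.0, 0.2]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0]
X = [[1.0, s] for s in scores]

draws = sample_posterior(X, outcomes)
weak_trial = pos_summary(draws, [1.0, -1.5])
strong_trial = pos_summary(draws, [1.0, 1.5])
print("weak trial:   mean {:.2f}, 90% interval ({:.2f}, {:.2f})".format(*weak_trial))
print("strong trial: mean {:.2f}, 90% interval ({:.2f}, {:.2f})".format(*strong_trial))
```

Because every posterior draw of the coefficients maps to one PoS value, the spread of those values gives a trial-specific PoS distribution rather than a single point estimate.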

(Ir-)Relevance of variables
The predictive performance of single composite scores can be calculated for 15 out of the 16 composite scores that we designed, i.e. all scores but novelty of MoA (Fig 4A). When applying the AUROC performance metric, we found positive correlation (mean AUROC >55%, ranging up to 64% AUROC) with success or failure for the metrics SPIIK, Company R&D strength, number of subjects in trial, designations, trial type and prior registrations of the drug in other indications.
We found no correlation (AUROC >45% and <55%) with success or failure for the metrics modality, indication, geography, and involvement of Big Pharma.

Model selection
The best full predictive risk model (PRM) was built based on the single composite scores with a positive contribution to the model, while scores without impact were omitted (Fig 4B). To arrive at the best PRM, we added, at each step, the variable producing the highest mean AUROC increase of the model (or the smallest decrease). The variables selected (Fig 4B) are those that bring the highest overall predictive performance, reported as the mean AUROC over five time-series cross-validation splits (Fig 4C, Fig 3S). Note that the predictive performance of the single scores is not strictly additive due to overlapping information.
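The stepwise procedure described above can be sketched as a greedy forward selection combined with time-ordered validation splits. This is an illustrative sketch under stated assumptions: the expanding-window split scheme, the function names, and the toy scoring function are hypothetical, and in practice `mean_score` would refit the model on each training block and average the AUROC over the test blocks.

```python
def time_series_splits(dates, n_splits=5):
    """Yield (train, test) index lists; each test block follows its training data.

    Assumes an expanding window over trials sorted by initiation date.
    """
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    block = len(order) // (n_splits + 1)
    for k in range(1, n_splits + 1):
        stop = (k + 1) * block if k < n_splits else len(order)
        yield order[: k * block], order[k * block : stop]


def forward_select(candidates, mean_score):
    """Greedy forward selection.

    At each step, add the variable giving the highest score (largest
    increase or smallest decrease), then keep the prefix of the path
    with the best overall score.
    """
    selected, history = [], []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda c: mean_score(selected + [c]))
        selected.append(best)
        remaining.remove(best)
        history.append((best, mean_score(selected)))
    best_k = max(range(len(history)), key=lambda k: history[k][1])
    return selected[: best_k + 1], history


# Toy stand-in for "mean AUROC over the time-series splits" (hypothetical values).
toy_gain = {"SPIIK": 0.10, "rd_strength": 0.06, "indication": -0.01}


def toy_mean_auroc(subset):
    return 0.5 + sum(toy_gain[name] for name in subset)


selected, history = forward_select(list(toy_gain), toy_mean_auroc)
splits = list(time_series_splits(list(range(2003, 2015)), n_splits=5))
print(selected)
print(splits[0], splits[-1])
```

In the toy run, the variable with a negative contribution is still placed on the selection path but falls outside the best-scoring prefix, which mirrors how scores without impact are omitted from the final model.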

Overall model performance
The model with the best combination of variables achieves a performance of ~73% (mean AUROC) and includes 12 variables (Fig 4C, eFig 3B, dotted line). In other words, confronted with one successful and one unsuccessful trial, using the PRM one will correctly identify the successful trial in ~73% of the cases by picking the trial with the higher predicted PoS.
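The pairwise reading of the AUROC given above can be checked directly: over all pairs of one successful and one unsuccessful trial, count how often the successful trial has the higher predicted PoS. The sketch below uses made-up predictions and counts ties as half; the function name is hypothetical.

```python
def pairwise_auroc(predicted_pos, outcomes):
    """AUROC as the fraction of (success, failure) pairs ranked correctly."""
    successes = [p for p, o in zip(predicted_pos, outcomes) if o == 1]
    failures = [p for p, o in zip(predicted_pos, outcomes) if o == 0]
    wins = 0.0
    for s in successes:
        for f in failures:
            if s > f:
                wins += 1.0   # successful trial correctly picked
            elif s == f:
                wins += 0.5   # tie counts as a coin flip
    return wins / (len(successes) * len(failures))


# Made-up predicted PoS values and observed outcomes (1 = success).
predicted = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
observed = [1, 1, 1, 0, 0, 0]
print(pairwise_auroc(predicted, observed))  # 8 of 9 pairs are ranked correctly
```

A value of 0.5 corresponds to random picking, and 1.0 to a model that always ranks the successful trial higher.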

Exemplary model outcomes
To illustrate the model's output on single trials, we elected to highlight the variables characterizing exemplary clinical trials and to visually signify their influence on the PoS prediction (Fig 5A). Consequently, we selected two clinical studies at different ends of the spectrum, one with low mean PoS: gefitinib in treating patients with esophageal cancer that is progressing after chemotherapy, NCT01243398 [26], Fig [...]

Discussion
Drug developers use historical success rates as forward-looking estimates to determine the PoS of an individual trial to inform their decision making. These PoS rates are often adjusted based on the opinion of subject matter experts, so-called Key Opinion Leaders (KOLs). For more than 60 years, studies [30-32] have kept demonstrating that expert opinions are no better than guessing, while in the last decade or so it was established that algorithms are able to aid or even outperform drug developers and physicians when it comes to predicting patient accrual rates in clinical trials [33,34], optimal cancer rehabilitation, or supportive care interventions [8]. Therefore, it is surprising that an industry praising rational drug development still makes decisions regarding investments, strategy, and science based on subjective expert advice. While the traditional approach certainly has some merit, we argue that a data-driven prediction can complement, if not quite replace, traditional methods such as KOL interviews and historical success rate (hSR) benchmarking. In particular, we have demonstrated that the consideration of complex drivers of PhIII success, such as the accumulated knowledge from prior phases, is not limited to the judgement of experts (KOLs), but can also be addressed in an empirical, data-driven manner using the scoring approach presented in this study.

Discussion of results
Employing several publicly available databases [18,21,22] we developed a predictive model [...]
The SPIIK includes the strength and relatedness of the combined PhII evidence that exists before starting the PhIII (Fig 2, eTable 2). Our use of a decision tree optimized for prediction allowed us to model the relevance of a combined PhII body of evidence to a particular PhIII study and its design (eFig 2). The predictive performance of SPIIK alone confirms that the information building this composite score is indeed relevant. This is in line with both common sense and regulatory guidance [35], suggesting that the sum of PhII results is to some degree indicative of PhIII trial outcomes.
The sponsor's past track record in oncology (Company R&D Strength) had the second largest predictive performance for PhIII studies. As this criterion includes both the number of past PhII and PhIII studies in oncology and their outcomes, it essentially describes where an organization stands on the learning curve when it comes to designing studies in oncology. Notably, this value is time-dependent to reflect the situation on the day of trial initiation. There is a strong effect of having designed phase II and III studies that met their primary endpoints on the ability to do it again. Janssen, Celgene, and Genentech are the top 3 performers in this category of companies with at least 10 PhIII studies (2003-2012) in our sample.
The role of indication
The factor 'indication' provides no additional value to the predictive risk model (Fig 4A, B). Notably, this is in line with expectations due to the technical bias in trial selection (Fig 1): we started our search with PhIII trials and only subsequently enriched with PhII trials associated with the selected PhIII trials. Therefore, we introduced a bias for drugs that made it into PhIII in at least one indication. Drugs exclusively developed in indications known for high failure rates (e.g. pancreatic cancer) [14] hardly make it into a PhIII trial in the respective indication, hence PhIII trials in these indications are underrepresented in this proof of concept study.

Novelty of MoA
In order to factor in the degree of innovation brought about by the compounds investigated in PhIII, we designed a composite score taking into account the novelty of MoA. That composite score was excluded from the model, as the results of our attempt were not conclusive. On one hand, this is due to the complexity of embodying the qualitative nature of innovation into a quantitative variable. On the other hand, there is a lack of systematic, comprehensive data due to fundamental differences in the MoA classification schemes used and the level of information provided by companies (eTables 2 and 4).

Comparison to other approaches
Empirical studies of clinical trial success broadly fall into two categories: (i) retrospective descriptive analysis of success rates [11,12,15,16,36,37] [...] In contrast to others, we use a linear (logistic) regression model to reduce overfitting and for ease of interpretation, but calibrate parameters in a Bayesian fashion so that credible intervals for parameter estimates and a posterior predictive distribution are available for PoS estimates. [...]

The current algorithm is focused on oncology PhIII trials. For this proof of concept study, we chose oncology over other therapeutic areas because trial endpoints (mPFS and mOS) across all oncology indications are both quantifiable and comparable in nature, providing a strong foundation for modeling approaches.
We excluded several non-standard modalities, including the cellular therapies which are changing the treatment landscape as we speak. In principle, the algorithm is also susceptible to certain regulatory aspects (e.g. breakthrough designations) that allow a development program to move from phase I (PhI) directly to PhIII or approval, respectively.

Next steps & outlook
Based upon this proof of concept study, the model can potentially be expanded to (1) predict PhII trial outcomes based on data from pre-clinical and PhI studies, (2) cover therapeutic areas other than oncology (e.g. cardiovascular diseases), (3) incorporate more modalities (e.g. CAR T cells) for which a growing body of evidence is becoming available, and (4) allow for the integration of non-public information available to drug developers (sponsors and investigators) in cooperation with the project teams.

Conclusions
The algorithm presented here can distinguish successful from unsuccessful trials with much greater confidence than any other publicly available approach reviewed [10,12,14-17,38]. The positive predictive value (PPV) can be tuned up to >80% by accepting more false negatives (lower sensitivity). To our knowledge, this is the first approach that allows quantitative prediction of the probability of success for single trials. Our model uses publicly available information only, including that of prior trials with perhaps only remote relatedness to the trial in question, and then delivers a specific prediction for a given trial. In addition, the model is fully transparent, adaptive on a trial-to-trial basis, provides unprecedented granularity (e.g. [...]
Such an algorithm has a number of obvious applications of high medical, strategic and financial value, quite apart from the ethical dimension of a doctor's decision to enroll patients in a study. Both sponsors and investors involved in the field of oncology could benefit greatly from a predictive algorithm assessing the prospects of a specific study, in particular by
• Supporting sponsoring companies to maximize success by designing their individual studies based on the highest possible PoS_Trial
• Helping investors determine the impact of PhIII outcomes on valuation. This is especially relevant for those biotechs with a single PhIII asset. In addition, investors able to pursue different strategies could identify trials (and companies behind the studies) that match their investment strategy, e.g. pick-the-winner-drop-the-loser or vice versa.
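The trade-off between positive predictive value and sensitivity mentioned in the Conclusions can be illustrated by moving the PoS cut-off above which a trial is called a predicted success. The sketch below uses made-up predictions; the thresholds and the function name are hypothetical and do not reproduce the reported figures.

```python
def ppv_and_sensitivity(predicted_pos, outcomes, threshold):
    """PPV and sensitivity when trials with PoS >= threshold are called successes."""
    pairs = list(zip(predicted_pos, outcomes))
    tp = sum(1 for p, o in pairs if p >= threshold and o == 1)
    fp = sum(1 for p, o in pairs if p >= threshold and o == 0)
    fn = sum(1 for p, o in pairs if p < threshold and o == 1)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return ppv, sensitivity


# Made-up predicted PoS values and observed outcomes (1 = success).
predicted = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
observed = [1, 1, 0, 1, 0, 1, 0, 0]

for cutoff in (0.5, 0.75):
    ppv, sens = ppv_and_sensitivity(predicted, observed, cutoff)
    print(f"cut-off {cutoff}: PPV {ppv:.2f}, sensitivity {sens:.2f}")
```

In this toy data, raising the cut-off from 0.5 to 0.75 removes the false positives at the cost of missing half of the true successes, which is the pattern of higher PPV for lower sensitivity.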
[...] that were completed prior to the assessment carry information, but the relevance of this information (shades of green) varies with regard to the specific characteristics of the TOI. Generally speaking, the closer the patient population, the study design and the treatment algorithm of a given trial with regard to the TOI, the higher its relevance. Note that the clinical development plan (CDP) illustrates only elements that are considered in the database. We [...]
[...] conducted with drug X (not shown) carries information (black arrows) that is potentially relevant for assessing the PoS of the TOI (blue). The ROI for a given trial i is defined as the product of three factors, each of which can be broken down into further subfactors that may even be further broken down (e.g. Relatedness, *see eFig 2 for details) until an objective and quantifiable level of information is found. Note that there is a unique ROI for each combination of PhII and TOI, thus a unique SPIIK for each TOI. [...]
[...] segment are not included in the model because they fail to increase its performance. (C) For the best model, the receiver operating characteristic curves are displayed for each time series. The overall model performance is given by the mean AUROC over all time points and is ~73%.