Abstract
Background There are many clinical prediction models (CPMs) available to inform treatment decisions for patients with cardiovascular disease. However, the extent to which they have been externally tested and how well they generally perform has not been broadly evaluated.
Methods A SCOPUS citation search was run on March 22, 2017 to identify external validations of cardiovascular CPMs in the Tufts PACE CPM Registry. We assessed the extent of external validation, performance heterogeneity across databases, and explored factors associated with model performance, including a global assessment of the clinical relatedness between the derivation and validation data.
Results 2030 external validations of 1382 CPMs were identified. 807 (58%) of the CPMs in the Registry have never been externally validated. On average there were 1.5 validations per CPM (range 0-94). The median external validation AUC was 0.73 (25th −75th percentile [IQR] 0.66, 0.79), representing a median percent decrease in discrimination of −11.1% (IQR −32.4%, +2.7%) compared to performance on derivation data. 81% (n = 1333) of validations reporting AUC showed discrimination below that reported in the derivation dataset. 53% (n = 983) of the validations report some measure of CPM calibration. For CPMs evaluated more than once, there was typically a large range of performance. Of 1702 validations classified by relatedness, the percent change in discrimination was −3.7% (IQR −13.2, 3.1) for ‘closely related’ validations (n=123), −9.0 (IQR −27.6, 3.9) for ‘related validations’ (n=862) and −17.2% (IQR −42.3, 0) for ‘distantly related’ validations (n=717) (p<0.001).
Conclusion Many published cardiovascular CPMs have never been externally validated and for those that have, apparent performance during development is often overly optimistic. A single external validation appears insufficient to broadly understand the performance heterogeneity across different settings.
Introduction
Clinical prediction models (CPMs) are widely available to inform decisions in cardiovascular medicine. Our own database, the Tufts Predictive Analytics and Comparative Effectiveness (PACE) CPM Registry,1 demonstrates continued growth of prediction models for patients with cardiovascular disease (CVD) despite apparent substantial redundancy. The growth in the literature reflects the increasing ease with which these models can be developed, given the wide availability of both data and statistical software. Despite the publication of methodologic2 and reporting guidelines3 and a large set of potential performance metrics4, much remains unknown about the broad performance of these models, including the extent to which they have been validated, how well they validate, and how performance varies from one setting to another.
While there are various ways to assess the performance of a statistical model4, clinically beneficial CPMs will yield accurate predictions on new cohorts (external validation)5 and improve decision making and subsequent clinical outcomes. Despite the increasing number of CPMs in the literature, how models perform generally during external validations and the determinants of that performance is largely unknown. Current reporting recommendations reinforce the need for external validation3 though recent analyses suggest that most CPMs either have not been externally validated6 or have only been validated on a single external cohort.7 CPM discriminatory performance cannot not be assumed to be stable (i.e. equivalent to model performance at derivation) when tested in new settings.8 Model calibration has been largely neglected and unless it is known to be excellent, CPMs may lead to harm if they are used to inform decisions at certain risk thresholds9,10.
Here, we perform a field synopsis of external validation studies of cardiovascular CPMs reported in a prior systematic review.1 We aimed to describe the extent of external validation, variation in performance of models across databases, and to explore factors that are associated with worse model performance.
Methods
Cardiovascular CPMs
The cardiovascular CPMs that form the basis of this review are found within the Tufts PACE CPM Registry. This Registry is available at www.pacecpmregistry.org and represents a field synopsis of prediction models for patients at risk for and with known cardiovascular disease. The search strategy and inclusion criteria have been previously reported.1 Briefly, for inclusion in the Registry, an article must present the development of a cardiovascular CPM, contain a model predicting a binary clinical outcome, and the model must be presented in a way that allows prediction of outcome risk for a future patient. The search strategy for CPM identification was previously reported1 and is presented in the Supplement Fiigure 1. This analysis looked at cardiovascular CPMs published from 1990 through March 2015.
External Validation Search
A SCOPUS citation search of these cardiovascular CPMs was conducted on March 22, 2017. Citations were reviewed by two members of the study team to identify external validations of CPMs in the Registry. Discrepancies were reviewed by a third member of the research team. Consistent with prior work6, external validations were defined as any report that claimed to study the CPM for the same outcome as originally reported, but in a non-overlapping population.
Data Extraction
Information about each CPM/validation pair was extracted, including sample size, continent of study, number of events, and reporting of measures of discrimination and calibration. CPM validation performance focused on discrimination (AUC) change compared to the AUC seen in the derivation population. We also document whether validations include any assessment of CPM calibration. There are many methods to assess model calibration and only recent consensus on best practices.4,11 Given this lack of consistency and interpretability in the literature, we report whether or not this dimension of performance was assessed during external validation. Calibration assessment included any comparison of observed versus expected outcomes. Examples include a Hosmer-Lemeshow statistic or calibration plot. For this study we also included measures of calibration-in-the-large, where overall observed event rates are compared to predicted rates.
CPM performance
Consistent with prior work12, percent changes in CPM discrimination from derivation to validation are described on a scale of 0% (no change in discrimination) to −100% (complete loss of discrimination) because it more intuitively reflects the true changes in discriminatory power.13 Positive changes represent improvements in discrimination. The percent change in discrimination is calculated using the following equation [(Validation AUC − 0.5) – (Derivation AUC - 0.5) / (Derivation AUC - 0.5) ∗ 100].
Population Relatedness
To explore potential explanations for decreased performance on validation data sets, we assessed the similarity between the derivation and validation populations by creating detailed relatedness rubrics for the 10 index conditions with the greatest number of CPMs (Supplement Taable 1). These rubrics were created by investigators with expertise in these clinical areas. Relatedness was assessed for each CPM/validation pair to divide validation databases into 3 categories —“closely related,” ”related” and “distantly related.” A fourth category “no match” was assigned to validations that were excluded from the analysis because they were not clinically appropriate matches (e.g., CPM validated on population with non-overlapping index condition or outcome). Generally, the relatedness rubrics were based on 5 domains: (1) recruitment setting (e.g., outpatient vs emergency room vs inpatient), (2) major inclusion/ exclusion criteria, (3) intervention type (e.g., percutaneous coronary intervention versus thrombolysis for acute myocardial infarction), (4) therapeutic era, (5) follow-up time. Two clinicians reviewed these domains for each CPM/validation match and assigned a relatedness category. Non-random split-sample validations were labeled as “closely related” validations. Discrepancies were reviewed by the study team to arrive at a consensus.
Factors associated with CPM external validation
We identified a set of study level factors to evaluate associations with whether or not a CPM was externally validated. These factors were identified based on observed methodologic and reporting patterns as well as prior literature.8 These factors included: Index clinical condition, internal validation performed, year of publication (divided here before 2004, 2004-2009, 2009-2012, after 2012), continent of origin, study design (e.g., clinical trial vs. medical record), sample size, number of events, number of predictors, prediction time horizon (< 30 days, 30-265 days, >365 days), regression method (e.g., logistic regression vs Cox regression), and reporting of discrimination or calibration. We analyzed unadjusted associations and used multivariable logistic regression to assess whether these variables were associated with CPM external validation.
Factors associated with poor performance
A set of study level factors defined a priori were evaluated for association with worse CPM performance (discrimination) during validation. These factors included: population relatedness (here, dichotomized as distantly related versus other), presence of overlapping authors, same or different paper, CPM modeling method, CPM data source, validation data source, outcome rate difference between derivation and validation data (defined as > versus < 40%), CPM events per included variable (EPV). We used generalized estimating equations (GEE)14,15 with robust covariance estimator to assess the multivariable association with the observed change in discrimination, taking into account the correlation between validations of the same CPM. Multiple imputation of 20 imputed data sets was used to account for missingness. These analyses estimated the absolute difference in the estimated percent change in the c-statistic from derivation to validation populations, as calculated above. All statistical analyses were performed using SAS Enterprise Guide version 8.2 (SAS Institute Inc., Cary, NC, USA).
Results
Overview of Validations
The Registry includes 1382 CPMs for CVD and the citation search of these CPMs identified 54,086 citations that were screened (Figure 1). These citations identified 14,615 abstracts that were screened to identify 6039 full text articles. A total of 2030 external validations were extracted from 413 papers. Only 575 (42%) of the CPMs in the Registry have ever been validated. On average there were 1.5 validations per de novo CPM, with a very skewed distribution. The Logistic EuroSCORE16 has been externally validated 94 times. For this analysis, we included 1846 validations of 556 CPMs after exclusion of 19 decision trees and 156 validations performed on ‘unrelated’ (i.e. populations with different index conditions or non-overlapping outcomes) samples. The median external validation sample size was 861 (25th-75th percentile [IQR] 326, 3306) and the median number of outcome events was 68 (IQR 29, 192) (Table 2).
CPM Validation Discrimination
Overall, 91.3% (n = 1685) of the external validations report area under the receiver operating characteristic curve (AUC). The median derivation AUC was 0.77 (IQR 0.73, 0.82). The median external validation AUC was 0.73 (IQR 0.66, 0.79) representing a median percent change in discrimination of - 11.1% (IQR −32.4%, +2.7%) (Table 2). Of the validations with decreased performance (n = 795), 25% (n = 195) had less than 10% decrement in discrimination. Two percent (n = 35) had greater than 80% drop in discrimination; 19% (n = 352) of model validations showed CPM discrimination at or above the performance reported in the derivation dataset.
CPM Calibration
In total, 53% (n = 983) of the validations report some measure of CPM calibration. The Hosmer- Lemeshow test of goodness-of-fit was most commonly reported (30%, n = 555) followed by calibration- in-the-large (26%, 488), and calibration plots (22%, n = 399). (Table 2). Overall, there was no externally assessed calibration information available for 86% (n = 1182) of the CPMs in the Registry.
Clinical Domains
The ten conditions with the most CPM validations comprised 92% (1702/1846) of the total validations included in this analysis (Table 3). The condition with the largest number of validations was Stroke (299 validations performed on 104 CPMs). There were a total of 286 validations of 87 CPMs for populations at risk for developing CVD (population samples) and 286 validations of 52 CPMs for Cardiac Surgery.
Only five index conditions had > 50% of available CPMs externally validated [Arrhythmias (81%), Valve Disease (62%), Venous Thromboembolism (53%), Cardiac Surgery (51%), and Aortic Diseases (50%)]. There is an extreme range of CPM performance and consistent loss of discriminatory performance during external validations (Figure 2, Table 3). These observations were apparent for all conditions that were studied (specific condition waterfall analyses shown in Supplemental Figure 2).
Relatedness
Relatedness was assigned to each of the 1702 of the CPM/validation pairs for the top 10 index conditions. Of these, 123 (7%) of the validations were performed on ‘closely related’ populations, 862 (51%) were performed on ‘related’ populations, while 717 (42%) were performed on ‘distantly related’ populations (Table 2). The median AUC for ‘closely related’ validations was 0.78 (IQR 0.719, 0.841). The median AUC for ‘related population’ validations was 0.75 (IQR 0.68, 0.803). The median AUC for ‘distantly related’ validations was 0.70 (IQR 0.64, 0.77) (p <0.001). Overall, the median percent change in discrimination was −3.7% (IQR −13.2, 3.1) for ‘closely related’ validations, −9.0 (IQR −27.6, 3.9) for ‘related validations’ and −17.2% (−42.3, 0) for ‘distantly related’ validations (p<0.001).
Range of Performance for Individual CPMs
Table 4 shows the variation in performance across the 10 CPMs that were validated most frequently. Uniformly, there was a substantial range in performance of each CPM across datasets, from virtually useless to excellent. For example, discrimination for the Logistic EuroSCORE (validated 94 times) ranged from 0.48 to 0.90 across different databases. None of these highly cited (and validated) CPMs had consistently good discrimination across validation databases.
Predictors of External Validation
Study features that are associated with CPM external validation (yes/no) are shown in Supplemental Table 2. The index condition was strongly associated with subsequent external validation. Models that were internally validated and models that were published more recently were less likely to be externally validated. Sample size, number of predictors, and reporting of discrimination or calibration were positively associated with external validation. On multivariable analysis, these predictors remained associated with CPM external validation. Study design, prediction time horizon, and regression method were not associated with a model being externally validated.
Predictors of Poor Performance
Predictors of CPM validation performance are shown in Table 5. On univariate analysis, population relatedness was significantly associated with CPM discrimination in validations. When CPMs were tested on “distantly related” cohorts, the AUC decrease was −15.6% (95% CI −22.0, −9.1) compared to the reference (validations done on “closely related” cohorts). When evaluated in a multivariable model, population relatedness remained significantly associated with CPM discrimination in validations (−9.8%; 95% CI −18.8, −0.8). We also observed that validations demonstrated AUCs that were 9.8% (95% CI 5.4, 14.2) higher when reported in the same manuscript (with the same authors) as the de novo CPM report compared to validations reported in different manuscripts with non-overlapping authors.
Discussion
Our Tufts PACE CPM Registry documents the tremendous proliferation and redundancy of CPMs being developed and published. The review reported here underscores that this proliferation is occurring without adequate—or even minimal—external evaluation. Approximately 60% of published CPMs have never been externally validated. Approximately half of the CPMs that have been validated have been validated only once. A small minority of models have been validated numerous times. The value of single validations is unclear, since there is substantial performance heterogeneity and good (or poor) performance on a single validation does not appear to reliably forecast performance on subsequent validations. No CPM showed consistently good discrimination across multiple validation databases. For example, the ten most validated CPMs have each been validated more than 20 times; all show substantial variation in discrimination across these validation studies, from virtually useless (i.e., c- statistic = ∼ 0.5) to very good (c-statistic = ∼0.8 or higher). This demonstrates the difficulty of defining the quality of a model generically, since performance greatly depends on characteristics of the database on which a model is tested. These findings underscore recent calls for a fundamental paradigm shift in how models are assessed for validity and utility17 and calls for more robust stewardship of algorithms for health care18.
The majority of cardiovascular CPMs in our Registry have never been externally validated. This finding mirrors an observation made in previous assessment of primary prevention models8 and broadly suggests that cardiovascular clinicians should be skeptical about the accuracy of individual risk estimates. In our Registry, model level predictors associated with subsequent external validation include the disease being studied and also larger sample size, higher outcome rates, and whether discrimination or calibration were reported in the original presentation. Older CPMs were generally more likely to be externally validated—an observation that may relate to insufficient time to allow for validation of more recently published CPMs. Given the extreme redundancy of CPMs and the relative scarcity of external validations, it seems reasonable to prioritize the study of existing cardiovascular CPMs (as opposed to developing new ones), and how these might be optimized for clinical use.
In our review, it was common to observe substantial decrements in discrimination during validations. This finding is consistent with prior reports that have shown CPM validation discriminatory ability that is highly variable and often worse than anticipated (when compared to performance on the derivation database).6,8 There are several potential reasons why model performance might decrease, including model invalidity (e.g., due to over-fitting on the derivation population) and a change in case mix.5 Model invalidity might be expected to be more pronounced when models are evaluated in populations that are dissimilar to the derivation population. We found that models had a substantially larger decrease in model performance when tested on distantly-related populations compared to either related or closely-related populations. However, judging the relatedness of the populations is laborious and requires substantial clinical expertise. Differences that may appear subtle can be very influential.
For example, a CPM developed on patients in the emergency room might not be expected to have similar discriminatory performance if the validation cohort includes only patients admitted to the hospital since—as in the case of many acute cardiac syndromes—care19 and outcome predictors20 are different very early in the disease course. So too changes in treatments received (e.g., different ACS revascularization approaches,21 stent types22, or outcome definitions23,24) likely impact model validation performance. If the model was derived on patients receiving lytic therapy and validated using data from a more contemporary percutaneous coronary intervention (PCI) trial, it should not be surprising that model performance appears worse than expected. Other study-level characteristics we examined apart from relatedness, did not appear to greatly influence model performance.
One of the most striking observations of this work is that isolated validations appear insufficient to understand the performance of CPMs when tested in new populations. There was often an extreme range in performance for CPMs evaluated in multiple databases—an observation that calls into question the generalizability of any one validation result. These data challenge the current approach in which a model might be evaluated on a single external population and then declared to be a ‘validated’ prediction model that is ready for use. Even when a model performs well using statistical criteria, it is unclear whether such a model improves decision making when used on a closely related population.
Further, good statistical performance on one external database does not guarantee good statistical performance in another setting—such as where a CPM is eventually used to support care. There is no evidence from our analysis that so called “validated” CPMs that have been integrated into clinical practice guidelines25,26 should be accepted as trustworthy unless CPM performance is specifically known to be excellent on populations like those being treated. While having a single CPM that is accepted by the clinical community and promoted in guidelines is appealing as a means of standardizing practice across a range of different settings, the degree of variation seen in our review suggests that this paradigm may result in substantial variation of performance across different settings, and poor performance in some settings. Testing CPMs for improved decision making and better clinical outcomes (e.g. in a cluster randomized trial27) is rarely performed prior to dissemination into practice. Novel paradigms, emphasizing increasing the accuracy of model performance on local populations, through continual recalibration and updating, are an appealing approach that deserves further consideration.
Our review has several limitations. First, the review was limited by the information collected and presented in the original articles. We relied on changes in discrimination largely because CPM calibration is woefully under-assessed. Only 62% of models in the CPM Registry have had calibration formally assessed in an external population; even among the models that were validated only 48% report any calibration. Finally, even when calibration is reported, it is usually reported in a form that is not clinically interpretable (e.g., as a Hosmer-Lemeshow statistic4,13) or graphically [easy to summarize according to calibration slope (ideal: 1) and systematic under or overestimation (intercept ideally 0)]. Some less frequently used metrics, such as the integrated calibration index28, may help compare performance across multiple validations. Decrements in calibration may be as serious as, or even more serious than, decrements in discrimination, since miscalibrated models yield misinformation which may cause harmful decision making.9 Ideally, we would be able to evaluate the ‘net benefit’ of model use, which integrates discrimination, calibration and relative utility to compare the value of prediction-based decision making compared to best “one-sized-fits-all” strategies.4,29 Such evaluations would have required individual patient data, since these approaches are so rarely used in the published literature. Similarly, we could not assess how much of the decrement in discrimination was due to differences in case mix, rather than invalidity, which would have also required evaluation of patient level data30. Finally, our systematic review does not include more recent validations after 2017, due to the enormous scope of this literature, the lack of efficient search strategies and the laborious nature of comprehensive data extraction and evaluation of relatedness. We do not anticipate the more recent literature would substantially change our findings. Maintenance and continual updating data of this registry will registry will require a semi-automated approach heavily reliant on natural language processing.31
Conclusion
Many published cardiovascular CPMs have never been externally validated and for those that have, it is common to see significant performance heterogeneity and marked decreases in the discriminatory performance compared to the model development phase. Calibration has been widely under-assessed and single validations do not sufficiently capture CPM performance. Granular information about population relatedness is associated with CPM performance in external validations and when CPMs are tested on distantly related populations, model performance is often substantially worse than expected. This review raises substantial concerns about the current approach to ‘validating’ cardiovascular CPMs and underscores the need for a radical rethinking for how performance heterogeneity is explored and quantified (e.g. through multiple validations across various practice settings) and how models are evaluated for clinical use.
Data Availability
The authors confirm that the data supporting the findings of this study are available within the article.
Acknowledgements
Research reported in this work was funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1606-35555). The views in this work are solely the responsibility of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology Committee.
The authors wish to acknowledge the contributions of Mr. Vandan Patel for his work on the relatedness effort.
BW is supported by K23AG055667 from NIH-NIA and R03AG056447 from NIH-NIA