Development and validation of the 4C Deterioration model for adults hospitalised with COVID-19

Prognostic models to predict the risk of clinical deterioration in acute COVID-19 are required to inform clinical management decisions. Among 75,016 consecutive adults across England, Scotland and Wales prospectively recruited to the ISARIC Coronavirus Clinical Characterisation Consortium (ISARIC4C) study, we developed and validated a multivariable logistic regression model for in-hospital clinical deterioration (defined as any requirement of ventilatory support or critical care, or death) using 11 routinely measured variables. We used internal-external cross-validation to show consistent measures of discrimination, calibration and clinical utility across eight geographical regions. We further validated the final model in held-out data from 8,252 individuals in London, with similarly consistent performance (C-statistic 0.77 (95% CI 0.75 to 0.78); calibration-in-the-large 0.01 (-0.04 to 0.06); calibration slope 0.96 (0.90 to 1.02)). Importantly, this model demonstrated higher net benefit than using other candidate scores to inform decision-making. Our 4C Deterioration model thus demonstrates unprecedented clinical utility and generalisability to predict clinical deterioration among adults hospitalised with COVID-19.


Introduction
The coronavirus disease 2019 (COVID-19) pandemic continues to overwhelm healthcare systems worldwide 1 . Effective triage of patients presenting to hospital for risk of progressive deterioration is critical to inform clinical decision-making and facilitate effective resource allocation. Moreover, early identification of higher risk subgroups enables targeted recruitment for randomised controlled trials of therapies with equipoise 2 , and more precise delivery of treatments for which effectiveness is known to vary according to disease severity [3][4][5] .
A large number of multivariable clinical prognostic models for patients with COVID-19 have rapidly accrued to predict adverse outcomes of mortality or clinical deterioration 6 . The vast majority of candidate models subjected to comprehensive quality assessment have been classified as being at high risk of bias, and therefore may not be generalisable 6,7 . Moreover, none of the multivariable prognostic models included in a systematic head-to-head external validation study outperformed univariable predictors 8 , highlighting a critical need to adhere to rigorous model development methodology using large scale multi-site data, in order to facilitate generalisability.
We have previously reported a pragmatic prognostic index for in-hospital mortality from the ISARIC Coronavirus Clinical Characterisation Consortium (ISARIC4C) study 9 . Here, we extend this work through a larger study cohort to develop and validate a prognostic model for in-hospital clinical deterioration. We use the wide geographic coverage of the ISARIC4C study cohort in England, Wales and Scotland to explore between-region heterogeneity, and to comprehensively assess model generalisability with respect to discrimination, calibration and clinical utility. We have called this the 4C Deterioration model.

[...] predictors (including restricted cubic spline transformations) and the outcome in the imputation models to ensure compatibility. Imputation was done separately for each NHS region to preserve potential inter-region heterogeneity. We generated 10 multiply imputed datasets; all primary analyses were performed in each imputed dataset and model and validation parameters were pooled using Rubin's rules 25 .
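As an illustration of the pooling step, the following is a minimal sketch of Rubin's rules for a single scalar parameter. It is not the study code: the function name `pool_rubin` and the three input values are hypothetical, and only three imputations are used here for brevity.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one scalar parameter across m imputed datasets using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)   # per-imputation point estimates
    u = np.asarray(variances, dtype=float)   # per-imputation squared standard errors
    m = q.size
    q_bar = q.mean()                         # pooled point estimate
    u_bar = u.mean()                         # within-imputation variance
    b = q.var(ddof=1)                        # between-imputation variance
    total_var = u_bar + (1.0 + 1.0 / m) * b  # total variance of the pooled estimate
    return q_bar, total_var

# Hypothetical calibration slopes and squared standard errors from 3 imputations
est, var = pool_rubin([0.94, 0.96, 0.98], [0.0009, 0.0010, 0.0011])
```

The pooled standard error is the square root of the returned total variance; the between-imputation term grows when the imputed datasets disagree.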

Model development
We hypothesised that heterogeneity among populations, healthcare services and clinical management may be present between NHS regions, which may contribute to differences in model performance; we therefore split the dataset according to the region of the contributing hospital. We included eight regions in model development and internal-external cross validation (East of England, the Midlands, North East England and Yorkshire, North West England, Scotland, South East England, South West England and Wales), with one region held out for further validation (London).
We used a logistic regression modelling approach and performed backward elimination of the a priori candidate variables using Akaike information criterion (AIC). This process was done separately in each multiply imputed dataset and in each NHS region in the development set. Predictors were required to be retained in >50% of multiply imputed datasets in >50% of development NHS regions in order to enter the final model. We specified this in order to retain a parsimonious set of predictors that had consistent prognostic value across the development NHS regions.
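The retention rule described above can be sketched as a simple majority vote at two levels. This is an illustrative sketch only: the function `select_predictors`, the region labels and the toy selection results are hypothetical, and the backward elimination step itself is assumed to have been run already.

```python
def select_predictors(retained, candidates):
    """Apply the >50% of imputations in >50% of regions rule.

    retained[region] is a list with one set per imputed dataset, holding the
    predictors that survived backward elimination in that dataset.
    """
    selected = []
    n_regions = len(retained)
    for var in candidates:
        regions_supporting = 0
        for imputations in retained.values():
            kept = sum(var in kept_set for kept_set in imputations)
            if kept > 0.5 * len(imputations):      # majority of imputed datasets
                regions_supporting += 1
        if regions_supporting > 0.5 * n_regions:   # majority of regions
            selected.append(var)
    return selected

# Hypothetical elimination results: 3 regions, 3 imputed datasets each
retained = {
    "A": [{"age", "urea"}, {"age", "urea"}, {"age"}],
    "B": [{"age", "urea", "sex"}, {"age"}, {"age", "sex"}],
    "C": [{"age", "urea"}, {"age", "sex"}, {"urea"}],
}
final = select_predictors(retained, ["age", "sex", "urea"])
```

In this toy input, `sex` is kept by a majority of imputations in only one of three regions, so it does not enter the final set.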

Internal-external cross-validation
The model including the selected variables was then validated in the development dataset using the internal-external cross-validation framework, in order to concurrently examine between-region heterogeneity and assess generalisability 26 . During each internal-external cross-validation cycle, one of the contributing NHS regions within the development set was iteratively discarded from model training. Validation was then evaluated in the omitted NHS region by quantifying the model C-statistic, calibration slope and calibration-in-the-large 27 . We used random-effects meta-analysis to calculate pooled C-statistic, calibration slope and calibration-in-the-large statistics from internal-external cross-validation, and forest plots were examined to assess heterogeneity between regions. Calibration plots were also generated for each internal-external cross-validation cycle by fitting a loess smoother between the model predictions and the outcome in the stacked multiply imputed datasets.
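A minimal sketch of one pass of internal-external cross-validation on synthetic data is given below, assuming a plain logistic model fitted by Newton-Raphson in NumPy. All function names, the four synthetic regions and the simulated predictors are hypothetical; splines, multiple imputation and the random-effects pooling step are omitted.

```python
import numpy as np

def fit_logistic(X, y, offset=0.0, n_iter=25):
    """Maximum-likelihood logistic regression by Newton-Raphson.
    X must already contain any intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta + offset)))
        w = p * (1.0 - p)
        beta = beta + np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    return beta

def c_statistic(y, score):
    """Share of event/non-event pairs ranked correctly; ties count as half.
    Uses an all-pairs matrix, so it is only suitable for small samples."""
    diff = score[y == 1][:, None] - score[y == 0][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def internal_external_cv(X, y, region):
    """Hold out each region in turn, train on the rest, validate in the held-out region."""
    results = {}
    for r in np.unique(region):
        test = region == r
        beta = fit_logistic(X[~test], y[~test])
        lp = X[test] @ beta                        # linear predictor in held-out region
        ones = np.ones((int(test.sum()), 1))
        # Calibration slope: coefficient of lp when the outcome is regressed on lp
        slope = fit_logistic(np.column_stack([ones, lp]), y[test])[1]
        # Calibration-in-the-large: intercept with lp entered as a fixed offset
        citl = fit_logistic(ones, y[test], offset=lp)[0]
        results[int(r)] = {"c": c_statistic(y[test], lp), "slope": slope, "citl": citl}
    return results

# Synthetic data, purely illustrative
rng = np.random.default_rng(0)
n = 2000
region = rng.integers(0, 4, size=n)
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])
true_lp = -1.0 + 1.0 * x[:, 0] + 0.5 * x[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_lp))).astype(float)
iecv = internal_external_cv(X, y, region)
```

Each entry of `iecv` holds the three validation statistics for one held-out region; these per-region values are what a forest plot would display.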
medRxiv preprint doi: https://doi.org/10.1101/2020.10.09.20209957; this version posted October 13, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Recalibration to each region was performed by re-estimating the model intercept in the validation sets during each internal-external cross-validation cycle.
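Intercept-only recalibration can be sketched as finding the constant shift on the logit scale that makes the mean predicted risk match the observed event rate, with the slope held at 1. The sketch below is an assumption about how this could be done, not the study code; `recalibrate_intercept` and the four example patients are hypothetical.

```python
import numpy as np

def recalibrate_intercept(lp, y, n_iter=50):
    """Return the shift delta maximising the likelihood of
    logit(p) = delta + lp, with the coefficient of lp fixed at 1."""
    delta = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(lp + delta)))
        # Newton step: score divided by information for a single parameter
        delta += np.sum(y - p) / np.sum(p * (1.0 - p))
    return delta

# Hypothetical linear predictors and outcomes in one validation region
lp = np.array([-2.0, -1.0, 0.0, 1.0])
y = np.array([0.0, 1.0, 1.0, 1.0])
delta = recalibrate_intercept(lp, y)
p_new = 1.0 / (1.0 + np.exp(-(lp + delta)))
```

A positive `delta` indicates that the original model underestimated risk in that region; after the update, the mean of `p_new` equals the observed event rate.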
Decision curve analysis was also done in the internal-external cross-validation validation sets to quantify the net benefit of implementing the model in clinical practice 28 , compared to: (a) a 'treat all' approach; (b) a 'treat none' approach; and (c) using other candidate generic and COVID-specific clinical prognostic models to stratify treatment, identified by recent systematic reviews 6,8,9 . Only candidate models where constituent variables were available among >60% of the cohort were considered. Candidate models using points scores were calibrated to the validation data during decision curve analysis, resulting in optimistic estimates of their net benefit. All decision curves were loess-smoothed, from stacked multiply imputed datasets.
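For illustration, net benefit at a single threshold probability can be computed as below, using the standard decision curve definition. This is a sketch with hypothetical function names and toy data; it does not include the loess smoothing or the stacking of imputed datasets described above.

```python
import numpy as np

def net_benefit(y, risk, threshold):
    """Net benefit of treating patients whose predicted risk is at or above the threshold."""
    treat = risk >= threshold
    n = y.size
    tp = np.sum(treat & (y == 1))   # treated patients who have the outcome
    fp = np.sum(treat & (y == 0))   # treated patients who do not
    return tp / n - fp / n * threshold / (1.0 - threshold)

def net_benefit_treat_all(y, threshold):
    """Net benefit of treating everyone regardless of predicted risk."""
    prevalence = y.mean()
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# Hypothetical outcomes and predicted risks for 10 patients
y = np.array([1, 1, 0, 0, 1, 0, 0, 0, 1, 0])
risk = np.array([0.9, 0.7, 0.6, 0.2, 0.4, 0.1, 0.3, 0.05, 0.8, 0.15])
nb_model = net_benefit(y, risk, threshold=0.5)
nb_all = net_benefit_treat_all(y, threshold=0.5)
nb_none = 0.0  # treating no one has zero net benefit by definition
```

Repeating the calculation over a grid of thresholds and plotting the results gives the decision curve for each strategy.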

Model validation in held-out NHS region
The final model was then trained using the full development dataset and validation was further evaluated in the held-out NHS region (London) by quantifying the C-statistic, calibration slope and calibration-in-the-large, and by visualisation of calibration plots 27 . Decision curve analysis was also performed, with comparison to other candidate models.

Sensitivity analyses
We assessed validation of the final model using complete case data only in the held-out NHS region.
We also recalculated validation metrics when: (a) excluding participants who experienced the outcome on the day of admission, in order to assess discrimination of the final model without these early events; (b) excluding participants in the validation cohort who had ongoing hospital care at the end of follow-up; (c) stratifying the validation cohort by community vs. nosocomial infection; and (d) excluding community-acquired cases who developed symptoms in the interval between admission and the temporal threshold for nosocomial infection, in order to assess any effect of incorrect inclusion of nosocomial infections within the community-acquired cases. We also repeated the analysis using an alternative multiple imputation approach, using the aregImpute function from the rms package in R 30 , and recalculated model parameters using alternative temporal definitions of nosocomial SARS-CoV-2 infection (>5 days and >10 days after admission, compared to >7 days in the primary analysis).

Ethical approval
Ethical approval was given by the South Central-Oxford C Research Ethics Committee in England (reference 13/SC/0149), and by the Scotland A Research Ethics Committee (reference 20/SS/0028).
The study was registered at https://www.isrctn.com/ISRCTN66726260.

Overview of study cohort
A total of 75,016 adults were recruited to the ISARIC4C study during the study period. Baseline demographic, physiological and laboratory characteristics of the cohort are shown stratified by outcome in Table 1. Outcomes were missing for 998/75,016 (1.3%) participants.

Development of the 4C Deterioration model
Since we hypothesised that geographic heterogeneity may contribute to model performance, we analysed the dataset by UK National Health Service (NHS) region. We included eight regions in model development and internal-external cross-validation, with one region (London) held out for further validation. Following our backward elimination procedure, 11 predictors were retained in >50% of multiply imputed datasets in >50% of NHS regions in the development cohort. These were: age, sex, nosocomial infection, Glasgow coma scale, admission oxygen saturation, breathing room air or oxygen therapy, respiratory rate, urea, C-reactive protein, lymphocyte count and presence of radiographic chest infiltrates. Associations (including non-linearities) between these predictors and the outcome from the model trained on the full development cohort are shown visually in Figure 1. Full model coefficients are presented to enable independent model reconstruction in Supplementary Table 2.

Internal-external cross-validation
In order to examine potential heterogeneity between NHS regions and evaluate generalisability, we conducted internal-external cross-validation 26 [...] existing candidate prognostic models for which the constituent variables were available in >60% of participants in our data. In decision curve analysis, net benefit allows assessment of clinical utility by quantifying the trade-off between correctly identifying true positives and incorrectly identifying false positives weighted according to the threshold probability 28 . The threshold probability represents the risk cut-off above which any given treatment or intervention might be considered, and reflects the underlying risk:benefit ratio for the intervention. The new model for clinical deterioration had higher net benefit than any of the existing models as well as 'treat all' or 'treat none' strategies, across a broad range of threshold probabilities, in all development NHS regions (without local recalibration).
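For reference, the trade-off described above corresponds to the standard decision curve definition of net benefit. The notation is introduced here for illustration and is not taken from the manuscript: p_t is the threshold probability, N the number of patients, and TP and FP the numbers of true and false positives when patients with predicted risk of at least p_t are classified as positive.

```latex
\mathrm{NB}(p_t) = \frac{\mathrm{TP}(p_t)}{N} - \frac{\mathrm{FP}(p_t)}{N} \cdot \frac{p_t}{1 - p_t}
```

Under this definition a 'treat none' strategy has net benefit 0 at every threshold, and a 'treat all' strategy has net benefit equal to the outcome prevalence minus (1 - prevalence) multiplied by p_t / (1 - p_t).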

Validation in held-out NHS region
Next, we validated the final prognostic model, trained on the full development cohort, in the held-out NHS region (London; n=8,252). Discrimination and calibration metrics for the 4C Deterioration model were similar to the estimates from internal-external cross-validation (Table 2), with C-statistic 0.77 (95% CI 0.75 to 0.78), calibration-in-the-large (CITL) 0.01 (-0.04 to 0.06) and calibration slope 0.96 (0.90 to 1.02). Discrimination was higher for the 4C Deterioration model than for the other existing candidates. A loess-smoothed calibration curve for the held-out London region is shown in Figure 3a.
We then conducted decision curve analysis in the held-out NHS region to further examine clinical utility for the 4C Deterioration model. Importantly, this demonstrated higher net benefit than all other candidates that we were able to recreate, as well as the 'treat all' and 'treat none' approaches, across a range of threshold probabilities (Figure 3b).
We anticipate that clinicians may wish to evaluate risk of deterioration or death separately. Therefore, for illustration, we compared predictions from the 4C Deterioration model to our previously reported 4C Mortality Score 9 in the London validation cohort, stratified by age (Figure 4a). In addition, 10 example participants selected at random from each decile of 4C Deterioration predictions in the London cohort are shown in Figure 4b, with their clinical characteristics summarised in Figure 4c.
Overall, deterioration predictions appeared appropriately higher than those for mortality, but these differences were exaggerated among younger age groups.

Sensitivity analyses
Recalculation of model validation metrics in complete case data from the held-out London region showed similar results to the primary analyses (Supplementary Table 3). Exclusion of deterioration events on the day of admission in the London validation cohort resulted in slightly lower C-statistics for most models; discrimination remained higher for the 4C Deterioration model, compared to other candidates (Supplementary Table 4). Validation metrics in the London cohort appeared similar to the primary analysis when excluding participants who had ongoing hospital care at the end of follow-up (Supplementary Table 5), when restricted to community-acquired infections (Supplementary Table 6a), and when community-acquired infections with symptom onset after admission were excluded (Supplementary Table 7). Among nosocomial infections, the C-statistic appeared slightly lower for the 4C Deterioration model than in the primary analysis (0.72; 95% CI 0.67 to 0.77), though discrimination remained higher than other candidate models, and CITL was 0.39 (95% CI 0.2 to 0.59), suggesting some underestimation of risk (Supplementary Table 6b [...] Table 8).
[...] London validation cohort showed higher predictions for deterioration overall; these differences were amplified among younger age groups. This suggests that younger people who deteriorated were more likely to have escalation of treatment through HDU/ICU admission or ventilatory support, while older people who deteriorated were more likely to die. These observations are likely to be mediated, in part, by differential treatment escalation decisions by age. Moreover, our comparison of the models for 10 randomly selected patients across the distribution of outcome risks from the held-out validation cohort illustrates examples of cases with relatively low risk of death, but moderate to high risk of deterioration. These discordances underline the need for independent prognostic models for deterioration and mortality outcomes, thus empowering clinicians to predict their desired outcome as required to inform clinical management decisions.
Our study has a number of strengths. Previous studies seeking to develop prognostic models for people with COVID-19 have been evaluated as being at high risk of bias due to suboptimal development methodology, and are often limited to single hospital sites 6 , thus impeding generalisability during external validation 8 . In the current study, we adhered to TRIPOD standards 11 and retained continuous variables without arbitrary categorisation, while accounting for non-linearities, to avoid loss of information 32 . Moreover, we used the largest dataset to date, to our knowledge, to develop and validate the 4C Deterioration model, including data from hospitals across nine NHS regions in England, Scotland and Wales. We exploited this wide geographic coverage to explore between-region heterogeneity in model performance using the recommended approach of internal-external cross-validation 33 . While discrimination, calibration slopes and net benefit were largely consistent, we noted minor variation in CITL, suggesting some variation in baseline risk between regions. Our approach of recalibrating the model intercept to each NHS region demonstrated the potential to address such heterogeneity and could be used to update the model if risk is found to vary temporally (as novel therapies are implemented) and among different populations. However, it is notable that net benefit, which accounts for model discrimination and calibration in quantifying clinical utility, appeared higher for the 4C Deterioration model than all other candidates even without recalibration, across all NHS regions and in the held-out validation dataset. This was the case even when comparing to points-based models, which may achieve overly optimistic performance during decision curve analysis since they were recalibrated to the validation data set. We also used a robust approach to missing data with multiple imputation, as widely recommended in prediction model studies 34 , and performed a sensitivity analysis using an alternative approach, with similar findings.
Ongoing prospective external validation of the 4C Deterioration model will be required over time to consider the need for temporal recalibration 35 , and to include diverse international settings outside of the ISARIC4C study. Another limitation is that we only included predictors that were routinely measured as part of clinical care during the study period, and specified that they had to be available among >60% of the population for inclusion in the analysis. Thus, we were unable to assess candidate models that include predictors such as lactate dehydrogenase or D-dimer, since these variables were only available in a small minority of participants. Future studies could consider standardised capture of laboratory measurements considered to have prognostic value to enable inclusion of these variables in model development and validation at scale. Moreover, we note that novel molecular biomarkers currently under investigation may also offer prognostic value 36 . Blood transcript, protein and metabolite measurements will be available from a subset of the ISARIC4C participants and could be integrated into risk-stratification tools in future studies.
In summary, we present a prognostic model for clinical deterioration among hospitalised adults with community- or hospital-acquired COVID-19, validated in nine NHS regions in England, Scotland and Wales. The model uses readily available clinical predictors and will be made freely available online alongside our previously reported mortality risk score (https://isaric4c.net/outputs/4c_score/) 9 at the point of peer-reviewed publication, to inform clinical decision-making and patient stratification for therapeutic interventions. The underlying model coefficients are presented and code will be published to enable independent external validation in new datasets.

Data sharing statement
We welcome applications for data and material access via our Independent Data and Material Access Committee (https://isaric4c.net).

Code availability statement
The final prognostic model developed in this study will be made freely available at the point of peer-reviewed publication, to enable implementation in clinical practice and independent external validation in new datasets. The code underlying the prediction tool will also be made available.

Figure 1: Multivariable associations between selected predictors and outcome in final model
Variable selection was done in each imputed dataset using backward elimination within each NHS region using AIC. Variables retained in >50% of multiply imputed datasets in >50% of NHS regions were selected. Continuous variables are modelled using restricted cubic splines. Final model parameters are pooled across multiply imputed datasets (total sample size for model development = 66,764 participants). Black lines indicate point estimates; grey shaded regions indicate 95% confidence intervals.

Figure 2: Internal-external cross validation of selected model by NHS region.
Pooled estimates are calculated through random-effects meta-analysis (total sample size = 66,764 participants). Dashed lines indicate lines of perfect calibration in the large (0) and slope (1), respectively. Black squares indicate point estimates; bars indicate 95% confidence intervals; diamonds indicate pooled random-effects meta-analysis estimates.

Figure 3: Calibration and decision curve analysis in held-out London region
Calibration (a) is shown using a loess-smoother across multiply imputed datasets. The rug plot indicates the distribution of predicted risk. Net benefit (b) is shown with loess smoothing for each candidate model compared to the 'treat all' and 'treat none' approaches. Points score models are recalibrated to the validation data, resulting in optimistic estimates of net benefit for these models.

Figure 4: 4C Deterioration vs Mortality predictions for (a) London validation cohort (n = 8,252) and (b-c) randomly sampled example patients.
4C Mortality probabilities are calculated from points scores, based on observed mortality risk for each score in the original validation data. In panel (a), the smoothed plot reflects a loess fit, stratified by age.