Protocol for an observational study evaluating new approaches to modelling diagnostic information from large administrative hospital datasets

The Charlson and Elixhauser indices define sets of conditions used to adjust for patients’ comorbidities in administrative hospital data. A strength of these indices is the parsimony that results from including only 19 and 30 conditions respectively, but the conditions included may not be the ones most relevant to a specific outcome and population. Our objectives are to: (1) test an approach to developing parsimonious indices for the specific outcome and populations being studied, while comparing performance to the Charlson and Elixhauser indices; and (2) evaluate several approaches that involve models with more diagnosis-related terms and aim to improve prediction performance by capturing more of the information in large datasets.This is a modelling study using a linked national dataset of administrative hospital records and death records. The study populations are patients admitted to hospital for acute myocardial infarction, hip fracture, or major surgery for colorectal cancer in England between 1 January 2015 and 31 December 2017. The outcome is death within 365 days of the date of admission (acute myocardial infarction and hip fracture) or procedure (colorectal surgery). In the ‘First analysis’, prognostic indices will be developed based on the presence/absence of individual ICD-10 codes in patients’ medical histories. Logistic regression will be used to estimate associations with a full set of sociodemographic and diagnostic predictors from which reduced models (with fewer diagnostic predictors) will be produced using a step-down approach. In the ‘Second analysis’, models will also account for the timing that each ICD-10 code was last recorded and allow for non-linear relationships and interactions between conditions and the timings of records. Validation will include an overall measure of performance (scaled Brier score) and measures of discrimination (c-statistic) and calibration (such as the Integrated Calibration Index) in bootstrap or cross-validation samples. Sensitivity analyses will include varying the length of medical history analysed, using a comparator that combines the Charlson and Elixhauser sets of conditions, and aggregating ICD-10 codes into clinical groups.


Introduction
In 1987, Charlson et al. 1 proposed a set of comorbidities that could be used to predict patient mortality in longitudinal studies. The corresponding article is now the 3 rd most cited article of all time in the 'Medicine, General and Internal' category of the Web of Science 2 with almost 30,000 citations (Google Scholar). The wider set of comorbidities proposed in 1998 by Elixhauser et al. 3 , for use with large administrative hospital datasets, has almost 6,000 citations. These two indices are a cornerstone of modern clinical research and healthcare evaluation: they are often used to assess and reduce confounding of outcomes between trial arms, 4-6 observed treatment groups, [7][8][9] and healthcare providers. [10][11][12] They are also often included in prediction models. [13][14][15] These indices are likely to be popular partly because they are highly transparent, reproducible, and parsimonious. The Charlson 1 and Elixhauser 3 indices include 19 and 30 conditions, respectively, which have well-established International Classification of Diseases (ICD) coding methods. 16 The weighted number of conditions is often used as a summary measure. 1,17 However, the relatively small number of conditions included in each index may not account for the ones most relevant to a specific outcome and population, 18 thus limiting the degree to which confounding is reduced. Several authors have therefore suggested approaches that include more conditions. [19][20][21][22] For example, in 2005, Holman et al. 23 used the 100 commonest ICD codes in the population to define candidate predictor variables. While this approach performed better than the Charlson index in its development study, 23 it has not been widely used (with less than 100 citations on Google Scholar). This may be partly because the large number of variables is likely to increase the risk that models fitted in smaller datasets do not generalise well to other datasets. It is also more time-consuming to apply and present a larger model in different studies.
This highlights the need for a new approach to developing prognostic indices that considers a large number of candidate conditions but produces a final model with a relatively small number of conditions that account for most of the explainable variation in the outcome. This model is adapted to the specific outcome and population being studied and should share the strengths of the Charlson and Elixhauser indices given above, especially their parsimony.
Several studies focusing on risk prediction in large datasets, rather than a comorbidity index in itself, have developed particularly large models with many more diagnosis-related predictors. These models have accounted for the timing of diagnosis records [24][25][26] and were estimated using machine learning methods that do not assume a simple functional form for associations between outcomes and predictors. [24][25][26][27][28] This could help these approaches to better predict outcomes than more conventional methods, but this has not been tested or quantified empirically in the existing literature. Moreover, more complex models may be less interpretable and reproducible than simpler ones, which may negate any benefits in terms of prediction performance.
In this study, we will: (1) test an approach to developing parsimonious prognostic indices, based on ICD-10 codes, for the specific outcome and populations being studied, while comparing performance to the Charlson and Elixhauser indices; and (2) evaluate several approaches that involve models with more diagnosis-related terms and aim to improve prediction performance by capturing more of the information in large datasets.
The study outcome and populations will be 1-year mortality of acute myocardial infarction, hip fracture, and major colorectal cancer surgery patients in England. We have chosen to analyse specific groups rather than all inpatients as most studies using comorbidity methods focus on clinical populations defined by disease, events, or interventions, so comparisons of methods are most informative (and most common 18 ) when done within clinical groups. 23 We chose these three groups to illustrate the general applicability of the approaches. The groups correspond to a large number of admissions per year but vary in terms of the clinical areas affected, the conditions that coexist, and mortality rates. Most deaths amongst patients admitted for hip fracture are due to other conditions, both in the short and longer terms, 29,30 with a 1-year mortality rate of approximately 30%. 31 In contrast, acute myocardial infarction presents a greater risk of death in itself; most deaths within 30 days of admission for this condition are due to the primary event. 32 Unlike these two conditions, most major surgery for colorectal cancer is planned and the 1-year mortality rate is less than 10%. 36 The approaches will therefore be tested in a heterogeneous group of conditions where performance is more likely to vary.
. CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/19011338 doi: medRxiv preprint

Study populations
We will conduct a modelling study using a large national dataset of administrative hospital records. The Hospital Episode Statistics Admitted Patient Care dataset includes all inpatient hospital care funded by the National Health Service (NHS) in England. 37 The three study populations are patients admitted to hospital for acute myocardial infarction (ICD-10: I21-22 38,39 ), hip fracture (S72.0-S72.2 40,41 ), or major surgery for colorectal cancer (C18-20; OPCS-4: H04-11, H29, H33, X14 [42][43][44][45]. Acute myocardial infarction and hip fracture patients will be identified from the ICD-10 code recorded as the primary diagnosis for the first episode in each admission. Colorectal surgery patients will be identified from the primary diagnosis of any episode with a relevant OPCS-4 code recorded for the main procedure of that same episode.
We will analyse patients aged 18 years or older in the acute myocardial infarction and major colorectal surgery populations and aged 60 years or older 41 in the hip fracture population. We will include those with a relevant admission from 1 January 2015 to 31 December 2017.
Each record in Hospital Episode Statistics corresponds to an 'episode', defined as a continuous period of hospital care under the same consultant doctor. Each episode has 20 diagnosis fields which contain ICD-10 codes corresponding to diagnoses recorded for that specific episode. The first field contains the primary diagnosis-the main condition treated or investigated during that episode. The other 19 fields contain secondary diagnoses. Three-character categories of ICD-10 codes define single conditions or groups of diseases with a common characteristic; fourth characters are used variably to define sites, subtypes, and causes. 46

Outcome
The outcome variable is death within 365 days of the date of admission (acute myocardial infarction and hip fracture) or procedure (major colorectal surgery). Death in hospital or within 30 days and 365 days of admission are the commonest outcome variables in studies comparing comorbidity methods 18,47 ; we will focus on the 365 days period (as in Charlson et al. 1 ) as in-hospital and 30day mortality may be much more strongly affected by the primary event than other conditions. In addition, 30-day mortality after colorectal excision is less than 3%. 36 If a patient has more than one index admission of the same type (such as two admissions for hip fracture), the earliest date of these admissions from 2015 to 2017 will be taken as the index date.
We will identify dates of death from Civil Registrations Data. 48 We have these records up to 31 December 2018, providing complete follow-up for the outcome. Death records were linked to Hospital Episode Statistics using deterministic linkage on each patient's unique NHS number, date of birth, sex, and postcode.
. CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/19011338 doi: medRxiv preprint

Definition of predictors
We will analyse diagnosis codes from episodes that started within 365 days before the index date of each patient. We will use the three-character categories of ICD-10 codes only (excluding fourth characters) to increase the frequencies of predictors so that model predictions are more reliable. In each population, we will exclude code categories recorded for less than 0.5% of patients as these codes are so rare that they are unlikely to explain much variation in outcomes. 25,49 We will define a binary predictor variable for each code that denotes whether that code was recorded or not within 365 days before a patient's index date. We chose this 1-year 'look-back period' rather than only using information from the index admission as the former approach can also be applied in non-inpatient populations and some studies suggest that it improves model performance. 18 Patient age (coded as a continuous variable) at the index date, sex, and socioeconomic status will be predictors in all models, as is common when comparing comorbidity indices. 18,47 Socioeconomic status will be measured by the national rank of the Index of Multiple Deprivation score for the lower super output area in which a patient lived; these areas are ranked from 1 (most deprived) to 32,482 (least deprived) and have population sizes between 1,000 and 3,000. 50

Model estimation
We will first estimate associations between the outcome and the full set of predictors as the maximum likelihood estimates from logistic regression. Interaction terms will not be considered. We will then develop a reduced model that approximates this full model using the backwards stepdown approach proposed by Harrell. 51,52 First in this approach, the predicted log-odds of the outcome for each patient is calculated from the coefficients of the full model. Second, an ordinary least squares regression model is fitted between these values and the full set of predictors (with an R 2 value of 1 by definition). Third, R 2 is calculated for a series of models in which each successive model has an additional predictor removed, in a backwards step-down process based on likelihood ratio estimates. 51,53 This process will stop when all predictors have been removed. Fourth, the model with the fewest predictors and an R 2 value equal to or greater than 0.95 will be selected as the final reduced model. This model will explain at least 95% of the variation in the outcome explained by the full set of predictors.

Comparators
We will compare this resulting prediction approach to the Charlson 1 and Elixhauser 3 methods, as systematic reviews found them to be the most commonly used comorbidity indices by far. 18,47 An established list of ICD-10 codes 16 will be used to identify whether each condition included in these indices was recorded or not within 365 days before the index date. This list categorises the Charlson 1 and Elixhauser 3 conditions into 17 and 31 groups, respectively; 14 of the 17 Charlson groups are also included in the Elixhauser groups. 54 We will estimate associations between the outcome and each condition using multivariable logistic regression, separately for the two sets of . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/19011338 doi: medRxiv preprint conditions. We will model a binary predictor variable for each condition rather than an overall weighted summary measure, as optimal weights differ across datasets, populations, outcomes, and time. 3,33 Systematic reviews of comorbidity indices suggest that modelling conditions using this approach generally improves prediction performance. 18,47

Use of diagnosis timings as predictors
The analysis described above will model the presence or absence of each ICD-10 code within a year before the index dates. In the second analysis, we will use a longer look-back period of five years and define an additional predictor for each ICD-10 code that records the number of days before the index date that the code was last recorded (from 0 to 1825 days). Logistic regression will be used to estimate the coefficients of the following model: where is the predicted probability of death for individual , each of the variables represents a binary indicator denoting the presence (1) or absence (0) of the th ICD-10 code (in the lookback period) for individual and is a continuous variable denoting the corresponding number of days between the last record of that code and the date of the index admission/procedure. The model parameters are interpreted as follows:  represents the log-odds of death within 365 days for a person without any ICD-10 codes recorded  , = 1, … , is the change in log-odds of death for the th ICD-10 code on the day of the index admission/procedure  , = 1, … , is the change in log-odds of death per day for the interval between the last record of a given ICD-10 code and the index admission/procedure Patient age, sex, and socioeconomic status will also be included as predictors (as before) and are omitted from the equation above only for clarity. This equation assumes linear relationships between the timings that ICD-10 codes were last recorded and the log-odds of death. To relax this assumption, we will also fit generalised additive models that include non-linear associations for the predictors using smoothing splines. 55,56

Use of boosted trees to account for interactions
A potential limitation of the approaches described above for the first and second analyses is that they do not model interactions between conditions. As conditions could modify each other's effects on prognosis, we will also test the performance of boosted tree models. These models have the potential benefit of modelling interactions between conditions without these interactions having to be pre-specified, unlike when using conventional logistic regression for example. The approach involves fitting a series of decision trees to the data sequentially with each tree attempting to improve on the predictions of the previous tree (termed 'boosting').
. CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/19011338 doi: medRxiv preprint We will use the gradient boosted tree approach of Friedman et al. [57][58][59] as implemented in the XGBoost 60 R package (explained accessibly elsewhere [61][62][63] ). This is a popular statistical learning algorithm that has outperformed other methods in large routine healthcare datasets with many predictors. 24,26,28 The tuning parameters of the algorithm are given in Appendix 1.

Validation methods
For all analyses, Brier scores will be calculated to assess the overall predictive performance of the approaches. 64 These scores equal the mean of squared differences between the predicted probability of death and the observed death outcome. Since Brier scores are affected by the overall risk of the outcome, we will scale these scores (Brierscaled) to account for the maximum possible value (Briermax) given the overall risk of death: Brierscaled = 1 -Brier/Briermax. 65 Larger values of Brierscaled are better: this measure equals 100% when predictions are perfect and 0% when the predicted probability of death for each patient equals the overall risk of death in the population.
To assess discrimination, we will calculate the c-statistic. 66 This measure equals the probability that a randomly chosen patient who died had a greater predicted probability of death than a randomly chosen patient who did not die; it assesses the rank order of predictions. A perfect model would have a value of 1 and a set of random predictions would have a value of 0.5.
To assess model calibration, we will calculate calibration-in-the-large, calibration slopes, 67 and the Integrated Calibration Index (ICI). 68 Calibration-in-the-large is the difference in means between predicted probabilities and observed death outcomes, such that perfect models have values of zero. Calibration slopes equal the coefficient of the predicted log-odds of death in a logistic regression model with the log-odds of death as the outcome variable and no other predictors; perfect models have slopes equal to one. The ICI is a weighted average of absolute differences between a calibration curve and the diagonal line of perfect calibration 68 ; the perfect ICI is zero.
In the 'First analysis', we will calculate the above measures in the entire dataset used to fit the regression models ('apparent performance') and also optimism-corrected estimates of these measures using the bootstrapping procedure given in the TRIPOD guidelines [69][70][71] ; 500 bootstrap samples will be used with all modelling steps repeated in each sample.
In the 'Second analysis', the models will take longer to estimate due to the larger number of parameters and bootstrapping will be too time-consuming, so a different validation process will be used. In each population, we will randomly allocate 75% of patients to a 'training' dataset and the other 25% to a 'test' dataset. We will first calculate the above performance measures in the training set using five repeats of 10-fold cross-validation. The results will be used to select the tuning parameters of the boosted trees that minimise the negative log-likelihood. We will then fit the final models on the whole training dataset and evaluate their prediction performance in the test dataset.

Sensitivity analyses
Five analyses will test the sensitivity of results to changes in the methods described above for the 'First analysis'. First, ICD-10 code categories recorded for less than 1% of patients (rather than 0.5%) will be excluded. Second, we will examine the performance of the full models with diagnostic . CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/19011338 doi: medRxiv preprint predictors based on the index episode only or a 3-year look-back period (instead of a 1-year period). Third, we will fit a model using the conditions included in either of the Charlson and Elixhauser indices, 19,22 using the most inclusive ICD-10 code definition of conditions in both indices. 54 Fourth, we will aggregate ICD-10 codes into the groups of the Clinical Classification Software (CCS) scheme, 72,73 excluding groups recorded for less than 0.5% of patients, and reestimating the full models. Fifth, we will assess changes in coefficients of the full models when penalised maximum likelihood estimation 51,74 is used; the penalty factor will be chosen to maximise a modified Akaike's Information Criterion (AIC). 51,71,74 The results of these analyses will inform which sensitivity analyses, if any, to also conduct for the approaches given under 'Second analysis' above.
We will follow the 'Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis' (TRIPOD) guidelines for reporting the study. 70 The corresponding articles will provide code for estimating the different models in R. Data management and preparation will be done using Stata (v15) 75 and R (v3.5) 76 will be used for all statistical analysis.
. CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/19011338 doi: medRxiv preprint

Implications for research
This study will evaluate new approaches to modelling diagnostic information from large administrative hospital datasets, using mortality as the outcome. These approaches could be particularly useful for risk adjustment when comparing patient outcomes between treatment groups or healthcare providers, in addition to being useful for risk prediction models. Based on the study's results, we will suggest which approaches should be considered by other researchers with similar large datasets depending on the intended purpose of the model developed.
We expect that researchers who aim to develop a parsimonious prognostic index specifically for the outcome and population being studied could use the approach proposed under 'First analysis' above. This index-a single measure defined by a weighted number of conditions-could then be incorporated into future models to help reduce confounding or improve the performance of prediction models. Alternatively, the list of conditions that define the final index (without the weights) could be used as separate predictor variables in subsequent models. The index could therefore be used as the Charlson and Elixhauser indices have been; the difference is that the new approach produces an index that is optimal for a specific outcome and population.
Researchers who do not intend to develop a parsimonious index may prefer one of the approaches given under 'Second analysis'. These approaches produce models that have a large number of diagnostic parameters and are more difficult to interpret but may predict outcomes better. A single measure can be produced as the predicted log-odds of death based on the diagnostic predictors in the model. This measure could then be included in future models as before. Alternatively, it may be preferable to estimate a model with diagnostic information and all other predictors simultaneously (using the same approaches), such as when it is acceptable for the final model to represent the conditions as a large number of separate terms.
Each of the approaches involves a large number of predictors such that large sample sizes are needed to develop the models. The adequate sample size depends on several factors including the frequencies of the outcome and co-existing conditions in the population. Guidance for determining adequate sample sizes is available elsewhere. 77 Researchers with smaller datasets can still apply the models developed in larger studies to their own data, including those developed for acute myocardial infarction, hip fracture, and major colorectal cancer surgery in this study.
. CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint Contributors TEC conceived the initial idea for the work and designed the study with substantial contributions from DAC, LDS, and JvdM. TEC will conduct the analysis for this study. TEC drafted the initial study protocol. All authors commented on draft manuscripts and approved this version of the protocol.
Funding This work was supported by a Medical Research Council fellowship awarded to TEC (grant number: MR/S020470/1).

Competing interests None to declare.
Ethical approval Hospital Episode Statistics and Civil Registrations Data were made available by NHS Digital; approval for the use of the data was obtained from NHS Digital as part of the standard approval process (digital.nhs.uk).
. CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/19011338 doi: medRxiv preprint

Appendix 1. Tuning parameters of gradient boosted trees
We will use the caret 78 package in R to tune the gradient boosted trees fitted by the XGBoost package. The final model will use the values of tuning parameters that minimised the negative loglikelihood for binary outcome variables. The parameters and the values tested will be: The parameter values to test were chosen partly based on the default values of the XGBoost package as well as the gbm package (which also implements gradient boosted trees). For each parameter, we will test values that are as or more conservative than the defaults in these packages to reduce the risk of overfitting. More conservative values correspond to fewer boosting iterations; lower learning rates, tree depths, and subsampling fractions; and greater minimum node weights. Overall, 336 unique combinations of the parameter values will be tested.
. CC-BY 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/19011338 doi: medRxiv preprint