PT - JOURNAL ARTICLE
AU - Harvineet Singh
AU - Vishwali Mhasawade
AU - Rumi Chunara
TI - Generalizability Challenges of Mortality Risk Prediction Models: A Retrospective Analysis on a Multi-center Database
AID - 10.1101/2021.07.14.21260493
DP - 2021 Jan 01
TA - medRxiv
PG - 2021.07.14.21260493
4099 - http://medrxiv.org/content/early/2021/07/15/2021.07.14.21260493.short
4100 - http://medrxiv.org/content/early/2021/07/15/2021.07.14.21260493.full
AB - Importance: Modern predictive models require large amounts of data for training and evaluation, which can result in models that are specific to certain locations, the populations in them, and their clinical practices. Yet best practices and guidelines for clinical risk prediction models have not considered such challenges to generalizability. Objectives: To investigate changes in measures of predictive discrimination, calibration, and algorithmic fairness when transferring models for predicting in-hospital mortality across ICUs in different populations, and to study the reasons for the lack of generalizability in these measures. Design, Setting, and Participants: In this multi-center cross-sectional study, electronic health records from 179 hospitals across the US with 70,126 hospitalizations were analyzed. Data were collected from 2014 to 2015. Main Outcomes and Measures: The main outcome is in-hospital mortality. The generalization gap, defined as the difference between model performance metrics across hospitals, is computed for discrimination and calibration metrics, namely area under the receiver operating characteristic curve (AUC) and calibration slope. To assess model performance by the race variable, we report differences in false negative rates across groups.
Data were also analyzed using a causal discovery algorithm, “Fast Causal Inference” (FCI), that infers paths of causal influence while identifying potential influences associated with unmeasured variables. Results: In-hospital mortality rates ranged from 3.9% to 9.3% (1st to 3rd quartile) across hospitals. When transferring models across hospitals, AUC at the test hospital ranged from 0.777 to 0.832 (1st to 3rd quartile; median 0.801); calibration slope from 0.725 to 0.983 (1st to 3rd quartile; median 0.853); and disparity in false negative rates from 0.046 to 0.168 (1st to 3rd quartile; median 0.092). When transferring models across geographies, AUC ranged from 0.795 to 0.813 (1st to 3rd quartile; median 0.804); calibration slope from 0.904 to 1.018 (1st to 3rd quartile; median 0.968); and disparity in false negative rates from 0.018 to 0.074 (1st to 3rd quartile; median 0.040). The distribution of all variable types (demography, vitals, and labs) differed significantly across hospitals and regions. Shifts in the race variable distribution and in some clinical (vitals, labs and surgery) variables were observed by hospital or region. The race variable also mediates differences in the relationship between clinical variables and mortality, by hospital/region. Conclusions and Relevance: Group-specific metrics should be assessed during generalizability checks to identify potential harms to the groups. To develop methods that improve and guarantee the performance of prediction models in new environments for groups and individuals, better understanding and provenance of health processes, as well as of data-generating processes by sub-group, are needed to identify and mitigate sources of variation. Question: Does the sub-group level performance of mortality risk prediction models vary significantly when applied to hospitals or geographies different from the ones in which they are developed?
What characteristics of the datasets explain the performance variation? Findings: In this retrospective cross-sectional study based on a multi-center critical care database, mortality risk prediction models developed in one hospital or geographic region exhibited a lack of generalizability to different hospitals/regions. The distribution of clinical (vitals, labs and surgery) variables varied significantly across hospitals and regions. According to the causal inference results, dataset shifts in race and clinical variables due to hospital or geography result in mortality prediction differences, and the race variable commonly mediated shifts in clinical variables. Meaning: The findings provide evidence that such models can exhibit disparities in performance across racial groups even while performing well in terms of average population-wide metrics. Therefore, assessing subgroup performance differences should be included in model evaluation guidelines. Based on shifts in variables mediated by the race variable, understanding and provenance of data-generating processes by population sub-group are needed to identify and mitigate sources of variation, and can be used to decide whether to use a risk prediction model in new environments.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: The study was funded by NSF grant number 1845487.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The study performs secondary analyses of publicly available de-identified data, and thus is exempt from IRB approval. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes.
The data are available from the PhysioNet website and are accessible after completing a training course. https://eicu-crd.mit.edu/gettingstarted/access/ https://github.com/ChunaraLab/medshifts
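
The measures named under Main Outcomes and Measures can be illustrated with a small sketch. This is not the authors' implementation (their code is in the repository linked above). It is a minimal pure-Python illustration that makes several assumptions: the function names are hypothetical, the toy labels and scores are invented for demonstration, the generalization gap is taken as the training-site metric minus the test-site metric, the false negative rate uses a 0.5 decision threshold, and the disparity is taken as the largest minus the smallest group-wise rate. Calibration slope is omitted to keep the example short.

```python
from typing import Hashable, Sequence


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def false_negative_rate(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> float:
    """Share of true positives (deaths, label 1) scored below the threshold."""
    pos_scores = [s for y, s in zip(y_true, y_score) if y == 1]
    if not pos_scores:
        raise ValueError("FNR needs at least one positive label")
    return sum(s < threshold for s in pos_scores) / len(pos_scores)


def fnr_disparity(
    y_true: Sequence[int],
    y_score: Sequence[float],
    groups: Sequence[Hashable],
    threshold: float = 0.5,
) -> float:
    """Assumed definition: largest minus smallest group-wise false negative rate."""
    rates = []
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates.append(
            false_negative_rate(
                [y_true[i] for i in idx], [y_score[i] for i in idx], threshold
            )
        )
    return max(rates) - min(rates)


def generalization_gap(metric_train_site: float, metric_test_site: float) -> float:
    """Assumed sign convention: metric at the training site minus metric at the test site."""
    return metric_train_site - metric_test_site


# Toy data (invented): labels and predicted risks at a training and a test hospital.
train_y = [1, 1, 0, 0, 1, 0]
train_s = [0.9, 0.8, 0.2, 0.3, 0.7, 0.1]
test_y = [1, 0, 1, 0, 1, 0]
test_s = [0.8, 0.6, 0.4, 0.3, 0.7, 0.2]
test_groups = ["a", "a", "a", "b", "b", "b"]  # hypothetical group labels

auc_gap = generalization_gap(auc(train_y, train_s), auc(test_y, test_s))
disparity = fnr_disparity(test_y, test_s, test_groups)
print(f"AUC generalization gap: {auc_gap:.3f}")
print(f"FNR disparity at test hospital: {disparity:.3f}")
```

In a real analysis, the scores would come from a model fitted at one hospital and applied to another, and the quartile summaries reported in the abstract would be taken over many such train/test hospital pairs.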