ABSTRACT
Objective Health inequities can be influenced by demographic factors such as race and ethnicity, proficiency in English, and biological sex. Disparities may manifest as differential likelihood of testing which correlates directly with the likelihood of an intervention to address an abnormal finding. Our retrospective observational study evaluated the presence of variation in glucose measurements in the Intensive Care Unit (ICU).
Methods Using the MIMIC-IV database (2008-2019), a single-center, academic referral hospital in Boston (USA), we identified adult patients meeting sepsis-3 criteria. Exclusion criteria were diabetic ketoacidosis, ICU length of stay under 1 day, and unknown race or ethnicity. We performed a logistic regression analysis to assess differential likelihoods of glucose measurements on day 1. A negative binomial regression was fitted to assess the frequency of subsequent glucose readings. Analyses were adjusted for relevant clinical confounders, and performed across three disparity proxy axes: race and ethnicity, sex, and English proficiency.
Results We studied 24,927 patients, of which 19.5% represented racial and ethnic minority groups, 42.4% were female, and 9.8% had limited English proficiency. No significant differences were found for glucose measurement on day 1 in the ICU. This pattern was consistent irrespective of the axis of analysis, i.e. race and ethnicity, sex, or English proficiency. Conversely, subsequent measurement frequency revealed potential disparities. Specifically, males (incidence rate ratio (IRR) 1.06, 95% confidence interval (CI) 1.01 - 1.21), patients who identify themselves as Hispanic (IRR 1.11, 95% CI 1.01 - 1.21), or Black (IRR 1.06, 95% CI 1.01 - 1.12), and patients being English proficient (IRR 1.08, 95% CI 1.01 - 1.15) had higher chances of subsequent glucose readings.
Conclusion We found disparities in ICU glucose measurements among patients with sepsis, albeit the magnitude was small. Variation in disease monitoring is a source of data bias that may lead to spurious correlations when modeling health data.
INTRODUCTION
Healthcare disparities, especially based on race and ethnicity, have been extensively documented across various medical conditions and stages of disease (1–3). Disparities manifest as unequal access to the healthcare system, differences in the quality of care, variations in medical device performance, and discrepancies in health outcomes not explained by clinical factors (4). They reflect systemic and societal biases that are easily incorporated into artificial intelligence (AI) models if researchers are not aware of them (5).
One of the main challenges in fair AI modeling is missingness handling (6,7). Notably, a 2017 study found that 49 out of 107 electronic health record (EHR)-based risk prediction approaches evaluated did not mention missing data at all (8). A common strategy is to replace missing values by physiologically-normal values. Similarly to clinical practice – where certain readings are only performed if there is a diagnostic suspicion – an “absent test” would be treated by the AI model as a “normal test” (9).
While this is considered a valid approach, it has important drawbacks. The likelihood of detecting an abnormal finding largely depends on how often a test is conducted. Thus, data that is not missing at random can drive spurious correlations, which are noncausal relationships between the input and the outcome which may shift in deployment (10). When the reason for missing data is a result of systemic social discrimination, these biases can be embedded in subsequent AI models, perpetuating and exacerbating existing disparities (11,12). Especially, as frequency of measurements has gained interest in the AI building community (13). As a case study of a potential source of such spurious correlations in medical AI models, we picked glucose measurement frequency among patients with sepsis admitted to the Intensive Care Unit (ICU).
Sepsis is a severe life-threatening systemic infection that has a significant impact on the health systems (14–16). Up to 90% of ICU patients exhibit elevated glucose levels irrespective of pre-existing diabetes (17–19). While elevated glucose has been linked to severity of illness, and increased mortality rates (20–22). Numerous studies have shown conflicting results, partly attributed to the detrimental effects of hypoglycemia, a consequence of strict glucose control (22–25). Sepsis patients who are admitted to the ICU are particularly vulnerable to blood glucose fluctuations (23,26) due to the inflammatory response and various aspects of care, such as corticosteroid use (27).
This paper aims to investigate potential disparities in glucose monitoring practices. We advocate that this kind of analysis must be done prior to any AI modeling. The primary objective is to determine if racial and ethnic or demographic differences influence the frequency of glucose measurements during sepsis management. By shedding light on this aspect of care, we aim to contribute to a more comprehensive understanding of healthcare inequalities and to provide researchers developing AI models with a framework to prevent potential biases adversely influencing model fairness and equity.
METHODS
This observational retrospective study is reported in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement (28). The health equity language, narrative and concepts of this paper follows the American Medical Association’s recommendations (29).
Data Extraction
Data was extracted from the publicly-available MIMIC-IV database using SQL via Google’s BigQuery (30). The MIMIC database is maintained by the Laboratory for Computational Physiology at the Massachusetts Institute of Technology and shared via the PhysioNet platform (31). The dataset has been de-identified, and the institutional review boards of the Massachusetts Institute of Technology (No. 0403000206) and Beth Israel Deaconess Medical Center (2001-P-001699/14) both approved the use of the database for research. The MIMIC-IV database includes physiologic data collected from bedside monitors, laboratory test results, medications, medical images and clinical progress notes captured in the electronic health record from patients admitted to the ICU between 2008-2019. Approximately 70,000 de-identified medical records are archived in the MIMIC-IV database.
Hypothesis
We hypothesized that the likelihood for a patient to receive measurement as well as the frequency of those measurements are not equally distributed across race and ethnicity, sex, or English proficiency.
Cohort Selection
The following cases are excluded to create a study cohort: patients under 18 years of age, those without sepsis as defined by the sepsis 3 criteria (32), those with length of ICU stay less than 1 day, those with a diagnosis of diabetic ketoacidosis, and with racial description that does not fit within White, Asian, Black, or Hispanic, excluding those of the heterogenous group “other”.
Covariates
We drew directed acyclic graphs to understand which variables to extract (see Supplemental Figure 1 and 2), as well as their interplay. A total of 13 patient-level variables were extracted, including non-time-varying variables such as demographics, comorbidities, and admission information and also time-varying variables including Sequential Organ Failure Assessment (SOFA) score (33), insulin treatment, glucocorticoid treatment, and instances of glucose measurements. Time-varying variables were aggregated in different ways. SOFA was calculated for the day of ICU admission, insulin treatment was used as a binary variable for whether or not it was received on day one. Glucose measurement was also used as a binary variable for whether or not it was measured on day one, in addition to taking the overall number of measurements for the whole ICU stay normalized by length of ICU stay. Glucocorticoid treatment was also normalized by ICU length of stay (see Supplemental Table 1).
Outcomes
We had two primary outcomes: The first was a binary variable predicting whether a patient received measurement on day 1, and the second being a prediction of how many glucose measurements a patient would receive per day throughout the length of their ICU stay.
Statistical Analysis
Statistical analysis was performed using Python 3.10. For summary statistics, we used the tableone Python package (34). For the outcome of whether or not a patient received a glucose measurement on day 1, we fitted a penalized, ridge logistic regression using the Python package statsmodel (35) adjusted for confounders to estimate the likelihood of receiving a glucose measurement. We report our findings as odds ratios (OR) with 95% confidence intervals (95% CI). White patients were considered as the reference group.
For the outcome of the number of glucose measurements during an ICU stay, we fitted a non-penalized, negative binomial regression (also with statsmodel) adjusted for confounders to estimate the number of glucose measurements for each patient each day during their respective ICU stay. We report our findings as incident rate ratios (IRR) with 95% CI. White patients were considered as the reference group.
RESULTS
Baseline Study Cohort
The MIMIC-IV database has 73,181 ICU stays, of which 24,927 (see Figure 1) made up our final cohort following the inclusion and exclusion criteria. The race and ethnicity distribution was 11.8% Black, 3.3% Asian, 3.3% Hispanic, 80.47% White and did not change relevantly after applying exclusion criteria.
In the resulting cohort, English proficiency was distributed unevenly across race and ethnicity with Asian (37.8%) and Hispanic (39.3%) patients being significantly less frequently English proficient than White (95.2%) and Black (89.6%) patients (see Table 1). Insurance coverage also differed between race and ethnicity with White patients having the highest Medicare coverage (52.3%), and Hispanic patients had the highest Medicaid coverage (20.2%), followed by Asian patients (17.5%). Of note, even though the percentage of patients with diabetes was higher for Black (18.6%) and Hispanic (17.9%) patients when compared to White (11.4%) and Asian (14.2%) patients, the probability for receiving a glucose measurement on day 1 was roughly even for all groups (see Supplementary Figure 3). Furthermore, Hispanic (56.1%) patients had the lowest probability of receiving insulin on day 1.
Model Results
We used logistic regression to predict whether a patient received a glucose measurement on day 1 with the null hypothesis that all patients are equally likely to receive a measurement. Asian patients (OR 0.80, 95% CI 0.62 - 1.03), Black patients (OR 1.00, 95% CI 0.87 - 1.16), and Hispanic patients (OR 1.05, 95% CI 0.82 - 1.35) patients were compared to White patients as a reference (see Figure 2A, Table 2, and Supplementary Figure 4). In addition, the effect of being English proficient (OR 0.98, 95% CI 0.82 - 1.17) and being male (OR 0.94, 95% CI 0.82 - 1.17) were not statistically significant.
A negative binomial model was fitted to predict the total frequency of glucose measurements per patient and length of stay with the null hypothesis that all patients have the same frequency. Being male (IRR 1.06, 95% CI 1.01 - 1.21), Hispanic (IRR 1.11, 95% CI 1.01 - 1.21), Black (IRR 1.06, 95% CI 1.01 - 1.12), or English proficient (IRR 1.08, 95% CI 1.01 - 1.15) significantly increased the frequency of receiving a measurement (see Figure 2B, Table 3, and Supplementary Figure 5), while being Asian (IRR 1.04, 95% CI 0.94 - 1.15) did not.
DISCUSSION
Policymakers have increasingly recognized the importance of health equity to public health(36). In December of 2022, tying reimbursements to health equity outcomes was proposed by the Center for Medicare and Medicaid Services (CMS)(37). If these policies were to be implemented, they would also need to be considered in AI development as disparities could have unintended consequences on a model’s prediction (38). There is growing awareness in the domain of medical AI development that potential advancements through AI could be hindered by such biases (39). The low accuracy of the EPIC sepsis model in external datasets is a good example of what can happen if temporal bias is neglected(40). Another study found increased illness burden in Black patients because the model wrongly included the post-treatment cost as a pre-treatment feature(41).
Aside from the risks of inputting biased data into a model, unmeasured data for a particular group reflect quality of care. Absence of data in medicine is not simply a void. In fact, caregivers often do not measure a variable deliberately; the reason to do so is usually highly confounded, affected by ongoing treatment, and reflected in the potential outcome. Models also interpret these voids as information influencing predictions, driving spurious correlations (10). Therefore, it is important to establish a thorough understanding of such associations before fitting any AI model.
To avoid encrypting biases into AI models, many advocate the use of causal inference frameworks to check for, understand, and control for potential biases (42,43). Performing similar analyses as we did – even before fitting a causal inference model – should become a common practice, as the awareness and understanding of such incongruences can help to understand these models better.
Furthermore, one key assumption of causal inference models is positivity, i.e. subjects from the treatment and control group should have enough overlap in the distribution of their confounding variables(44). Usually researchers check this assumption by tabulating the data across the most pertinent variables which can become unwieldy in case of many strata. As an additional check, we propose fitting simple regression models as we did to ascertain that the positivity assumption is not being violated and adding frequency of measurements as a distinct feature to any model.
Besides all these implications for AI models, our findings are also relevant clinically. While our study found no differences on glucose measurements in the glucose measurements on day 1 among patients admitted to the ICU due to sepsis, there was a slightly higher frequency of glucose measurement among Black, Hispanic, male, and English proficient patients.
Glucose measurement plays a pivotal role in in-patient care, as glucose levels are linked to severity of the disease and patient outcomes (20,21). In the context of sepsis patients, hyperglycemia is a common bystander associated with severity of disease and mortality (23). Studies have shown that frequent glucose measurements are associated with less hypoglycemia and hyperglycemia episodes, as well as decreased glucose variability (45–48) which has been associated with decreased hospital mortality, decreased length of hospital stay, and composite hospital complications (48). Reports of Black patients admitted to the ICU having poorer glycemic control are thus troublesome (49).
LIMITATIONS
While our study adds to the discussion of disparities and algorithmic bias in the critical care setting, we acknowledge that it has certain limitations. These include potential selection bias, as the database does not include patients who were admitted to the regular ward, and the data comes from only one academic tertiary care center in the U.S, where most of the population is White (80.47%). Also, our study design does not allow us to test whether the associations are from unmeasured confounding. Future research should strive to address these limitations for a more comprehensive understanding of the topic and should also cover other areas of care such as emergency departments, regular wards, or ambulatory care.
CONCLUSION
The implications of our study goes beyond glucose monitoring during sepsis management. Besides the challenge of achieving healthcare equity in a system marked by systemic biases, researchers need to be aware of such disparities before building any AI model. Such biases cannot only skew predictions, but may even generate further adverse effects when the predictions are being used for treatment or management decisions.
Conflicts of Interest
None of the authors have any conflicts of interest relevant to this work.
Author Contributions
All authors contributed to writing the manuscript.
KT and YJ performed the statistical analysis.
NLW wrote the first draft of the manuscript.
LAC, JM, and TS provided idea, concept, technical and administrative support.
Funding
The funding organizations had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
LAC is supported by the NIBIB, under R01 EB001659.
JM was supported by a Fulbright / FLAD Grant, Portugal, AY 2022/2023.
TS is supported by the Swiss National Science Foundation (P400PM_194497 / 1).
Data Availability
The data that support the findings of this study are available in MIMIC-IV with the identifier doi.org/10.1093/jamia/ocx084 publicly available on PhysioNet (https://physionet.org/).
Code Availability
The code that produces the results in this manuscript can be accessed at https://github.com/e754/Glucose-Research, which includes detailed instructions for running the code.