ABSTRACT
Importance Diabetic kidney disease (DKD) is the leading cause of kidney failure in the United States and predicting progression is necessary for improving outcomes.
Objective To develop and validate a machine-learned, prognostic risk score (KidneyIntelX™) combining data from electronic health records (EHR) and circulating biomarkers to predict DKD progression.
Design Observational cohort study
Setting Two EHR linked biobanks: Mount Sinai BioMe Biobank and the Penn Medicine Biobank.
Participants Patients with prevalent DKD (G3a-G3b with all grades of albuminuria (A1-A3) and G1 & G2 with A2-A3 level albuminuria) and banked plasma.
Main outcomes and measures Plasma biomarkers soluble tumor necrosis factor 1/2 (sTNFR1, sTNFR2) and kidney injury molecule-1 (KIM-1) were measured at baseline. Patients were divided into derivation [60%] and validation sets [40%]. A composite primary end point of rapid kidney function decline (RKFD) (estimated glomerular filtration rate (eGFR) decline of ≥5 ml/min/1.73m2/year), ≥40% sustained decline, or kidney failure within 5-years. A machine learning model (random forest) was trained and performance assessed using standard metrics.
Results In 1146 patients with DKD the median age was 63, 51% were female, median baseline eGFR was 54 ml/min/1.73 m2, urine albumin to creatinine ratio (uACR) was 61 mg/g, and follow-up was 4.3 years. 241 patients (21%) experienced RKFD. On 10-fold cross validation in the derivation set (n=686), the risk model had an area under the curve (AUC) of 0.77 (95% CI 0.74-0.79). In validation (n=460), the AUC was 0.77 (95% CI 0.76-0.79). By comparison, the AUC for an optimized clinical model was 0.62 (95% CI 0.61-0.63) in derivation and 0.61 (95% CI 0.60-0.63) in validation. Using cutoffs from derivation, KidneyIntelX stratified 47%, 37% and 16% of validation cohort into low-, intermediate- and high-risk groups, with a positive predictive value (PPV) of 62% (vs. 41% for KDIGO) in the high-risk group and a negative predictive value (NPV) of 91% in the low-risk group. The net reclassification index for events into high-risk group was 41% (p<0.05).
Conclusions and Relevance A machine learned model combining plasma biomarkers and EHR data improved prediction of adverse kidney events within 5 years over KDIGO and standard clinical models in patients with early DKD.
INTRODUCTION
Approximately 1 out of 4 adults with type 2 diabetes mellitus (T2D) has kidney disease (i.e. Diabetic Kidney Disease or DKD). Each year, 50,000 individuals with DKD progress to kidney failure in the United States.1 The Mount Sinai Health System alone provides care for over 70,000 patients with DKD. Estimated glomerular filtration rate (eGFR) and urinary albumin creatinine ratio (uACR), existing diagnostic measurements incorporated into the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines for risk stratification,2 lack precision in identifying patients who will experience rapid kidney function decline (RKFD), especially in earlier stages of DKD (G1-G3).3 As a result, primary care physicians (PCP) and diabetologists are often not able to appropriately risk stratify and counsel patients on the progressive nature of their disease. Easily interpretable and accurate prognostic tools that integrate into clinical workflow are lacking, resulting in suboptimal treatment and referral delays to a nephrology specialist. This has led, in part, to the unacceptable level of RKFD (and kidney failure) in this population4-8 with a high proportion of patients starting unplanned dialysis.1,9,10
Several blood-based biomarkers have shown associations with DKD progression, most significantly soluble tumor necrosis factor receptors 1/2 (TNFR1/2), and plasma kidney injury molecule-1 (KIM-1).11-13 However, implementation of accurate prognostic models combining clinical data from patients’ electronic health record (EHR) with blood-based biomarkers is lacking. Although EHR data is widely available, its volume and complexity limits integration with biomarker values using traditional methodologies. Recently, machine learning approaches have been developed that can combine biomarkers and EHR data to produce prognostic risk scores. A simple risk score that improves the ability to identify patients with DKD at low, intermediate, and high risk of RKFD has the potential to improve outcomes through more effective use of medications and efficient resource allocation at the primary care physician level.
In the current study, we developed and validated the performance of a biomarker-enriched, machine learned risk score (i.e., the KidneyIntelX™ test) to predict RFKD in patients with early stage DKD and compared performance to standard clinical models. We determined risk-based thresholds that can easily be integrated into standard clinical workflows and enhance existing clinical practice guidelines.
MATERIALS AND METHODS
Study Sample
The cohort is derived from the BioMe Biobank at the Icahn School of Medicine at Mount Sinai (ISMMS) and Penn Medicine Biobank (PMBB). The BioMe Biobank is a plasma and DNA biorepository with recruitment from 2007 which includes informed-consent access to the patients’ EHR from a diverse local community in New York City.14,15 PMBB is a research cohort enrolled from the University of Pennsylvania Health System with recruitment from 2008.14 Participants actively consented to allow the linkage of biospecimens with their longitudinal EHR (eFigure 1). Both BioMe and PMBB are institutional biobanks that attempt to be representative of the patient populations of the institutions they serve. Patients are recruited from outpatient general medicine clinics and certain subspecialty clinics with limited pre-selection criteria.16,17
The study protocol was approved by institutional review boards at both ISMMS and University of Pennsylvania; all participants had provided broad written informed consent for research and were not specifically compensated for participation in the current study. Blood was collected on the day of enrollment into BioMe or PMBB with plasma isolation as per standard procedures and continuously stored at −80°C until shipping to the RenalytixAI laboratory, New York, NY where biomarker measurements were performed.
Inclusion Criteria
We selected patients from BioMe and PMBB who were 21-81 years at the time of biobank enrollment (“baseline”), with T2D, an eGFR between 30 and 59.9 ml/min/1.73m2 or an eGFR≥60 ml/min/1.73 m2 with uACR ≥ 30mg/g. The KDIGO risk model categorizes patients based on eGFR and albuminuria and has 3 colors that correspond to the prognosis of prevalent CKD (we did not include patients at “low risk” or green because they do not have CKD).2 Patients were included if by the KDIGO eGFR and uACR criteria they were stage G3a-G3b with all grades of albuminuria (A1-A3) and G1, G2 with moderate to high albuminuria (UACR ≥ 30 mg/g (A2-A3)).2 Proportion of each DKD stage was evaluated against national estimates derived from the National Health and Nutrition Examination Survey (NHANES) years 2018-2019.18 For eGFR, we defined the baseline period as 1 year before or up to 3 months after the biobank enrolment date. Baseline uACR values were derived from closest values ±1 year from enrolment to maximize sample size as these are measured less frequently; subjects without baseline values of eGFR and uACR meeting these criteria were excluded. Only patients with a stored plasma specimen, a minimum follow-up time from enrolment of at least 21 months, and ≥3 eGFR values after baseline (eFigure 1) were included. Patients with kidney transplants or on chronic maintenance dialysis before baseline were excluded from the study.
Ascertainment of clinical variables
BioMe Biobank and PMBB
Sex and race were obtained from biobank questionnaires or EHR data. Clinical data was extracted for all EHR variables with concordant time stamps. Hypertension and T2D status at baseline were determined using the eMERGE Network phenotyping algorithms.16 Cardiovascular disease and heart failure were determined by International classification of disease (ICD)-9/10 clinical modification codes.
Biomarker Assays
The three plasma biomarkers were measured in a proprietary, analytically validated multiplex format using the Mesoscale platform (MesoScale Diagnostics, Gaithersburg, Maryland, USA), which employs electrochemiluminescence detection methods combined with patterned arrays to allow for multiplexing of assays. Each sample was run in duplicate, along with quality control samples with known low, moderate, and high concentrations of each biomarker on each plate. Assay precision was assessed using a panel of 7 reference samples that spanned the measurement range. The intra-assay coefficient of variation (CV) results for KIM-1, TNFR1, and TNFR2 were mean CV 3.9%, 5.4%, and 3.7%, respectively. The inter-assay CV results for the reference samples for KIM-1, TNFR-I, and TNFR-2 were mean CV 9.9%, 10.1%, and 7.8%, respectively. Assays satisfied dilution linearity and were run at 1:4 dilution. Levey-Jennings plots were employed and followed the Westguard rules for re-run of samples. The laboratory personnel performing the biomarker assays were blinded to all clinical information.
Data Harmonization
We harmonized data from BioMe and PMBB. Race/ethnicity was collapsed into 4 major, non-overlapping categories (White, non-Hispanic Black, Hispanic, and Other). ICD and Current Procedural Terminology (CPT) codes were included as yes/no variables with timestamps. Medications were mapped to RxNorm codes19 and laboratory values to Logical Observation Identifiers Names and Codes (LOINC) codes.20 Only variables represented in >70% of subjects throughout the combined dataset (except uACR and blood pressure due to their established clinical importance) were included and used for training of the KidneyIntelX algorithm.
Ascertainment and definition of the kidney endpoint
We determined eGFR using the CKD-EPI creatinine equation.21 We employed linear mixed models with an unstructured variance-covariance matrix and random intercept/slope for each individual to estimate eGFR slope.22 The primary composite outcome included RKFD defined as an eGFR slope decline of ≥ 5 ml/min/1.73 m2/year,2 a sustained (confirmed at least 3 months later) decline in eGFR of ≥40%23 from baseline, or “kidney failure” defined by sustained eGFR < 15 ml/min/1.73 m2 confirmed at least 30 days later, or receipt of long-term maintenance dialysis or receipt of a kidney transplant.2 Additionally, two nephrologists (SC/GNN) independently adjudicated all outcomes examining each individual patient over their longitudinal course, accounting for eGFR changes (ensuring annualized decline of ≥5 ml/min or ≥ 40% sustained decrease), corresponding ICD/CPT codes and medications to ensure that outcomes represented true decline rather than a context dependent temporary change (e.g., due to medications/hospitalizations). Follow up time was censored after loss to follow-up, after the date that the non-slope components of the composite kidney endpoint were met, or 5 years after baseline.
Statistical Analysis
The datasets were randomized into a derivation (60%) and validation sets (40%). The validation dataset was completely blinded and sequestered from the total derivation dataset. Using only the derivation set, we evaluated supervised random forest algorithms on the combined biomarker and all structured EHR features without a priori feature selection and identified a candidate feature set; eTable 1. The derivation set was then randomly split into secondary training and test sets for model optimization with 70%-30% spitting and a 10-fold cross-validation for AUC. We considered both raw values and ratios of the biomarkers. Missing uACR values were imputed to 10 mg/g,24 missing blood pressure (BP) values were imputed using multiple predictors (age, sex, race and antihypertensive medications),25 and median value was used for other features where missingness was < 30%. (eTable 2) We conducted further iterations of the model by tuning the individual hyperparameters. A hyperparameter is a parameter which is used to control the learning process (e.g., no of RF trees) as opposed to parameters whose weights are learned during the training (e.g.: weight of a variable). Tuning hyperparameters refers to iteration of model architecture after setting parameter weights to achieve the ideal performance. For random forest architecture, it could include components such as maximum depth of decision tree, number of trees in forest, and majority voting rules.26 The final model was selected based on AUC performance.
We generated risk probabilities for the composite kidney endpoint using the final model in the derivation set, scaled them to align with a continuous score from 5-100 by increments of 5, and applied this score to the validation set. Risk cut-offs were chosen in the derivation set to encompass the top 15% as the high risk (scores 90-100), bottom 45% as the low risk (scores 5-45), and the intervening 40% as the intermediate risk group (scores 50-85). Primary performance criteria were AUC, positive predictive value for high risk group and negative predictive values for low risk group (PPV and NPV, respectively) at the pre-determined cut-offs. The selected model and associated cut-offs were then validated by an independent biostatistician (MK) in the sequestered validation cohort.
In addition to these traditional test statistics, we assessed calibration by examination of the slope of observed vs. expected outcome plots of the KidneyIntelX score vs. only the observed outcomes. We also constructed Kaplan Meier curves for time-dependent outcomes of 40% decline and kidney failure with hazard ratios using the Cox proportional hazards method.
The discrimination of the KidneyIntelX model was compared to a recently validated comprehensive clinical model which included age, sex, race, eGFR, cardiovascular disease, smoking, hypertension, BMI, UACR, insulin, diabetes medications, and HbA1c and was developed to predict 40% eGFR decline in eGFR in T2D.24 Finally, we calculated the net reclassification index (NRI) for events and non-events.27,28 All a-priori levels of significance were <0.05. All hypothesis tests were two-sided. All analyses were performed with R software (www.rproject.org).
RESULTS
Baseline Characteristics of Cohorts
Baseline characteristics of the total study cohort incorporating derivation and validation (n=1146) were as follows; median age 63 years, 581 (51%) female, median eGFR was 54 ml/min/1.73 m2, and the median uACR was 61 mg/g. uACR was available in 62% of the cohort and imputed to 10 mg/g in 38%. The most common comorbidities were hypertension (91%), coronary heart disease (35%), and heart failure (33%). The majority (81%) were on ACE inhibitors or angiotensin receptor blockers (ARBs). Baseline characteristics between derivation and validation sets including event rates were balanced. The median number of serum creatinine/eGFR values per patient during the follow-up period was 16 (Table 1). Distribution of DKD stages of the study cohort is similar to national estimates (eTable 3).
Prediction of the composite kidney endpoint
Overall, 241 patients (21%) experienced the composite kidney endpoint over over a median 4.3 (IQR 3.0-4.8) years. In the complete derivation set (n=686), using 10-fold cross validation for discrimination, the average AUC for the KidneyIntelX model was 0.77 (95% CI 0.74-0.79). The most significant data features contributing to performance of the KidneyIntelX model included the three plasma biomarkers (TNFR1, TNFR2 and KIM1, as discrete values and ratios), eGFR, uACR, and systolic blood pressure (Figure 1). This final model had an AUC of 0.77 (95% CI 0.76-0.79) in the validation set (n=460). The risk for the composite kidney event increased by predicted probabilities of the KidneyIntelX score (Figures 2a and 2b and eTable 4). The slope of the observed vs. the predicted risk for KidneyIntelX was 0.8 in the training set and 1.0 in the validation set, indicating good calibration (eFigure 2). By comparison, the comprehensive clinical model yielded an AUC of 0.62 (95% CI 0.61-0.63) in the full derivation set (n=686) and 0.61 (95% CI 0.60-0.63) in validation set (n=460); Delong p value for KidneyIntelX vs. clinical model <0.001).
Shapley additive explanations (SHAP) plot showing relative feature importance
Composite Kidney Endpoint Event Rates by A. Ranks of KidneyIntelX Predicted Risk and B. KidneyIntelX Score
KidneyIntelX Clinical Utility Cut-points
The distribution of patients into KDIGO risk categories was established using 296 subjects (64%) with uACR available in the validation cohort with the population stratified into “moderately increased risk” (53%), “high risk” (31%), and “very high risk” (16%) with respective probabilities of 0.15, 0.29 and 0.41 for the composite kidney endpoint over 5 years. The risk probability cutoffs of KidneyIntelX selected in the derivation set (n=686) were 0.157 for the lowest 45% of patients and 0.275 for the top 15% of patients. When these risk cut-offs were applied to the complete validation set, with imputed uACR for missing values (n=460), KidneyIntelX stratified patients to low-(47%), intermediate-(37%), and high-risk (16.5%) groups with respective probabilities for the composite kidney endpoint of 0.09, 0.22, and 0.62. This translated into a PPV of 62% in the high-risk group (compared to a PPV 41% for KDIGO “very high risk”, p value < 0.001; Table 2). The NPV in the low-risk group was 91% for KidneylntelX compared to 85% for KDIGO “moderately increased risk” group (p= 0.33). In the subgroup with non-imputed uACR (n=296), the PPV for KidneylntelX in the high-risk strata further improved to 69% and the NPV improved to 93%. Confusion matrices are available in eTable 5. Additional risk cutoffs by potentially relevant proportions of the population are shown in Table 2 and of the KDIGO model are in eTable 6.
KidneylntelX correctly classified more cases into the appropriate risk strata (NRIevent = 55% in the derivation set and 41% in the validation set, p value < 0.05; eTable 7) compared to KDIGO. NRInon-event was −8.2% in the derivation set and −7.9% in the validation set (p value NS).
Supplementary Analyses
Time to Event Analyses for 40% Sustained Decline or Kidney Failure
Patients with high-risk KidneyIntelX scores (top 15% in the derivation set and top 16.5% in the validation set) had greater risk of progression to time-to-event categorical outcomes of 40% sustained decline or kidney failure than patients in the low-or medium-risk strata combined (hazard ratio (HR) 9.2; 95% CI: 6.2-13.6 in derivation and 9.1, 95% CI 5.8-14.4 in the validation set; Figure 3A&B). Kaplan-Meier curves by KDIGO risk categories in the training and validation set are shown in eFigure 3.
Kaplan-Meier Curves by KidneyIntelX Risk Strata for the Endpoint of Sustained 40% Decline in eGFR or Kidney Failure in Derivation (Panel A) and Validation (Panel B)
Subgroup analysis
KidneyIntelX performed similarly across patients with an eGFR greater or less than 60 at baseline (0.78 and 0.76 respectively). Additionally, when only data in the year prior to enrollment was included, the AUC was identical (0.77) as was the PPV for the top 16% (62%) and the NPV for the bottom 45% (91%). Kaplan-Meier plots did not change when limited to patients with data ≥5 years to ensure alive for at least 5 years (eFigure 4)
Discrimination for “Kidney Failure” Endpoint
Using the same KidneyIntelX risk score specifically trained for the composite kidney endpoint, the AUC of KidneyIntelX for the “kidney failure” endpoint alone was 0.87 (95% CI 0.84-0.89) in the derivation cohort and 0.89 (95% CI 0.87-0.91) in the validation cohort.
DISCUSSION
Utilizing patients with T2D from two biobanks with plasma samples and linked EHR data, we developed and validated a risk score combining clinical data and three plasma biomarkers via a random forest algorithm to predict a composite kidney outcome consisting of RKFD, sustained 40% decline in eGFR, and kidney failure over 5 years. We demonstrated that the KidneyIntelX outperformed models using only standard clinical variables, including KDIGO risk categories.3,20 There were marked improvements in discrimination over clinical models, as measured by AUC, NRI, and improvements in PPV compared to KDIGO risk categories. Furthermore, we showed that KidneyIntelX accurately identified over 40% more patients experiencing events than the KDIGO strata. Finally, KidneyIntelX provided good risk-stratification for the accepted FDA endpoint of sustained 40% decline in eGFR or kidney failure with a 9-fold difference in risk between the high-risk and low-and intermediate-risk strata for this clinical and objective endpoint.
DKD is an increasingly complex and common problem challenging modern healthcare systems. In real world practice, the prediction of RKFD in patients with T2D is challenging, particularly in early disease with preserved kidney function and therefore, implementation of improved prognostic tests is paramount. Our integrated risk score has near-term clinical implications, especially when linked to clinical decision support (CDS) and embedded care pathways. The current standard for clinical risk stratification (KDIGO risk strata)2 has three risk strata that overlap with the population of DKD patients that we included in our study. We also created a risk score with three risk strata (low, intermediate, and high) incorporating KDIGO classification components (eGFR and uACR), as well as the addition of other clinical variables, and three blood-based biomarkers. In this way, we were able to augment the ability to accurately risk-stratify DKD patients, thereby enabling improved patient management.
Low- and intermediate-risk patients with DKD can continue care with their existing PCP’s or diabetologists and require less intensity of treatments, unless repeat testing, changes in clinical status or local arrangements regarding referral to specialist care indicate otherwise. For those with high-risk scores, oversight may include more referrals to nephrology,29,30 increased monitoring intervals, improved awareness of kidney health, referral to dieticians, reinforcement of usage of antagonists of the renin angiotensin aldosterone system, and increased motivation to start recently approved medications, including SGLT2 inhibitors and GLP-1 receptor agonists to slow progression.31-34 Adoption of these new therapies is lagging, especially in patients considered to be ‘low-risk’ by standard criteria, where cost of treatment and presence of adverse events are limiting factors. Earlier engagement with nephrologists may also allow for more time to advise and educate patients about home-based dialysis and pre-emptive or early kidney transplant as patient-centered kidney replacement options if more aggressive treatment does not ultimately prevent progression of DKD. The use of a risk score as part of the enrollment process in future RCTs may enrich the trial participants for greater likelihood of events and thus reduce the chances for type 2 error, or minimize the sample size needed to detect a statistically significant difference with treatment vs. control. Interventions that prevent or slow CKD progression and foster patient-centered kidney replacement modalities support the goals of the US Department of Health and Human Services’ Advancing American Kidney Health initiative.35
KidneyIntelX included inputs from biomarkers examined in several settings, including patients with DKD. Soluble TNFR1 and 2 and plasma KIM-1 have demonstrated reliable independent prognostic signals for kidney function decline and ESRD.11,12,15,36-41 By incorporating biomarker levels and the EHR data into our machine learning algorithm, we were able to provide a multidimensional representation of the patient and allow for the model to generate improved prognostic estimates.42,43
Our study should be interpreted in light of the following limitations. uACR was missing in 38% of the cohort, but this is representative of current state of care. For example, uACR was missing in over 50% of the diabetes population in two large nationally representative datasets.1,44 Moreover, our goal was to develop a risk score using real world data from EHR for prediction where uACR is missing in a significant number of patients. More widespread availability of uACR values would enhance the performance of KidneyIntelX, as it was a contributing feature in our model. However, even with this limitation, the performance of KidneyIntelX was more robust than KDIGO strata in those with uACR measured. Second, there was lack of protocolized follow-up resulting in missing data and lack of kidney biopsies, as real-world data from the EHR were used. Missing data can lead to biased machine learning models and the data are prone to ascertainment bias.45 However, the median number of eGFR values per patient was 16, and the median time of follow-up was 4.3 years, thereby providing the opportunity to determine whether the kidney outcome was met. Although the primary biobanked cohorts used in the study are broadly representative of the parent hospital populations in terms of age, race/ethnicity and gender distribution and our study is representative of the US DKD population, we cannot rule out an inherent bias since the recruitment was opt-in recruitment and patients who chose to participate in the cohorts from which the study population was selected may be different than those who did not participate in the primary cohorts. Finally, both cohorts are from the Northeast of USA and an independent validation cohort is needed to ensure generalizability. However, only 1/3rd of the participants were white, thus there was adequate representation of racial groups that experience disparities for kidney disease.
In conclusion, a machine learned model combining plasma biomarkers and EHR data significantly improved prediction of adverse kidney outcomes over standard clinical models in patients with T2 DKD from two large academic medical centers.
Data Availability
Data is available on request and subject to institutional approvals due to patient health information.
Funding/Support
This project was funded by RenalytixAI plc.
Role of funder
RenalytixAI was involved in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Author Contributions
GNN and FF had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Nadkarni, Fleming, Donovan, Coca
Acquisition, analysis, or interpretation of data: Nadkarni, Fleming, Donovan, Coca, Damrauer
Drafting of the manuscript: Chan, Nadkarni, Coca, Damrauer
Critical revision of the manuscript for important intellectual content: All authors
Statistical analysis: Nadkarni, Fleming, Kattan, Coca
Administrative, technical, or material support: McCullough, Connolly, Mosoyan
Supervision: Nadkarni, Fleming, Murphy, Donovan, Coca
Conflict of Interest Disclosures
GNN, MD, SGC receive financial compensation as consultants and advisory board members for RenalytixAI, and own equity in RenalytixAI. GNN and SGC are scientific co-founders of RenalytixAI. GM, MWK and JAV are consultants for RenalytixAI.
FF and JRM are Executive Directors and BM is a Non-Executive Director of RenalytixAI.
SGC has received consulting fees from CHF Solutions, Relypsa, Bayer, Boehringer Ingelheim, and Takeda Pharmaceuticals in the past three years. GNN has received operational funding from Goldfinch Bio and consulting fees from BioVie Inc, AstraZeneca, Reata and GLG consulting in the past three years.
This research was supported by RenalytixAI. GNN is supported by a career development award from the National Institutes of Health (NIH) (K23DK107908) and is also supported by R01DK108803, U01HG007278, U01HG009610, and 1U01DK116100. SGC and GNN are members and are supported in part by the Chronic Kidney Disease Biomarker Consortium (U01DK106962). SGC is also supported by the following grants: R01DK106085, R01HL85757, R01DK112258, and U01OH011326.