Developing deep learning-based strategies to predict the risk of hepatocellular carcinoma among patients with nonalcoholic fatty liver disease from electronic health records

Background: Deep learning models have shown great success and potential when applied to many biomedical problems. However, when structured electronic health records (EHR) data are used, the accuracy of deep learning models for many disease prediction problems is affected by time-varying covariates, rare incidence, and covariate imbalance. The situation is further exacerbated when predicting the risk of one disease conditional on another disease, such as hepatocellular carcinoma risk among patients with nonalcoholic fatty liver disease, because of the slow, chronic progression of the disease, the scarcity of data on patients with both conditions, and the sex bias of the diseases. Objective: The goal of this study was to investigate the extent to which time-varying covariates, rare incidence, and covariate imbalance influence deep learning performance, and then to devise strategies to tackle these challenges. These strategies were applied to improve hepatocellular carcinoma risk prediction among patients with nonalcoholic fatty liver disease. Methods: We evaluated two representative deep learning models in the task of predicting the occurrence of hepatocellular carcinoma in a cohort of patients with nonalcoholic fatty liver disease (n = 220,838) from a national EHR database. The disease prediction task was carefully formulated as a classification problem while taking censoring and the length of follow-up into consideration. Results: We developed a novel backward masking scheme to evaluate how the length of longitudinal information after the index date affects disease prediction. We observed that modeling time-varying covariates improved the performance of the algorithms and that transfer learning mitigated the reduced performance caused by the lack of data. In addition, covariate imbalance, such as sex bias in the data, impaired performance.
Deep learning models trained on one sex and evaluated on the other sex showed reduced performance, indicating the importance of assessing covariate imbalance while preparing data for model training. Conclusions: Devising proper strategies to address challenges from time-varying covariates, lack of data, and covariate imbalance can be key to counteracting data bias and accurately predicting disease occurrence using deep learning models. The novel strategies developed in this work can significantly improve the performance of hepatocellular carcinoma risk prediction among patients with nonalcoholic fatty liver disease. Furthermore, our novel strategies can be generalized to other disease risk predictions using structured electronic health records, especially for disease risks conditional on another disease.


Supplementary Materials
A common approach for analyzing time-to-event data with a classifier is to use the event status alone while ignoring the event times. Instead, we derive class labels by examining the event status at a chosen time point while removing patients who have insufficient follow-up (Figure 2). Specifically, the classification problem in this study is to predict NAFLD patients who will develop HCC within 10 years and those who will not, and the main objective is to identify risk factors and protective factors associated with the occurrence of HCC. We showed with Monte Carlo simulations that analyzing time-to-event data as a classification problem in this way allows us to identify risk factors reliably with strong control of type I error (Supplementary Figure S1-2). Further, under many conditions, the frequency of the positive class label is a well-calibrated estimator of the disease risk at the chosen time point (Supplementary Figure S3), although we do not need this property to identify risk factors. Accordingly, with this formulation, we can compare the performances of DeepHit and RETAIN in terms of predicting whether patients in a defined cohort develop a disease of interest within a specified time.
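The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's code: the function names `derive_label` and `build_dataset`, the record layout `(patient_id, time_years, event)`, and the treatment of follow-up exactly equal to the horizon as sufficient are all hypothetical choices made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical record layout: (patient_id, follow-up time in years, event observed?)
Record = Tuple[str, float, bool]


def derive_label(time_years: float, event: bool, horizon: float = 10.0) -> Optional[int]:
    """Return 1 if the event occurred within the horizon, 0 if the patient
    was followed event-free through the horizon, and None otherwise."""
    if event and time_years <= horizon:
        return 1
    if time_years >= horizon:
        # Still event-free at the horizon (a later event does not count).
        return 0
    # Censored before the horizon: insufficient follow-up, so no label.
    return None


def build_dataset(records: List[Record], horizon: float = 10.0) -> List[Tuple[str, int]]:
    """Keep only patients whose class label can be determined."""
    labeled = []
    for patient_id, time_years, event in records:
        label = derive_label(time_years, event, horizon)
        if label is not None:
            labeled.append((patient_id, label))
    return labeled


records = [
    ("p1", 4.0, True),    # event at year 4: positive
    ("p2", 12.5, False),  # censored at year 12.5: negative
    ("p3", 6.0, False),   # censored at year 6: removed
    ("p4", 11.0, True),   # event at year 11, after the horizon: negative
]
labels = build_dataset(records)
```

Patients such as `p3` are removed rather than labeled negative, which is what distinguishes this formulation from using the event status alone.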
The formulation of disease prediction as a classification problem enabled us to take advantage of powerful deep learning models not developed for time-to-event data with censoring. We reframed the time-to-event problem as a classification problem by assigning class labels based on whether the event has occurred within a pre-specified time threshold, using the observed event times and statuses. We showed that this formulation has desirable statistical properties in terms of the high accuracy of identifying risk factors and the calibration of the class probability as an estimator for the disease risk. This formulation thus allowed us to apply the RETAIN deep learning model in addition to the more statistically rigorous DeepHit model to identify risk factors for HCC progression among NAFLD patients and to explore sex-specific patterns of HCC progression. It also makes it convenient to incorporate longitudinal information in the classification algorithm without specialized models for longitudinal data.
Our classification framework does not consider competing risks, so the identified risk factors must be interpreted carefully. For example, we found that being a non-smoker appeared to be a risk factor for HCC in NAFLD patients, which is inconsistent with a prior large retrospective study. A more reasonable interpretation would be that being a non-smoker lowers the competing risk of death and thus allows NAFLD patients more time to develop HCC. Moreover, to estimate disease risk accurately in this setting, we continue to advocate the use of competing risk models for now, especially those that can account for changes in covariate values over time. The situation could be handled by considering a new, multi-category classification algorithm in the future. Nonetheless, our formulation of disease prediction as a classification problem facilitates the application of powerful longitudinal deep learning models that do not model event censoring and thus provides timely clinical insights into disease progression over time. We expect that this algorithm works best in common situations where there are notable changes in a patient's health in the years close to the clinical event.
We showed that powerful predictive models such as deep learning can be sensitive to covariate imbalance, such as sex bias. The performance of disease prediction decreased by > 5% when RETAIN was trained on data from one sex and tested on data from the other sex. This performance reduction occurs because RETAIN can identify sex-specific features. For example, in female patients with NAFLD, analysis of the attention weights in the trained RETAIN models revealed that rheumatoid arthritis may be a risk factor and kidney stones may be a protective factor for HCC progression. Our results thus indicate that using a sex-biased dataset for training can reduce the predictive performance and generalizability of the trained deep learning model to other datasets. This finding can also be applied to other disease prediction tasks where sex and other patient characteristics such as race and ethnicity play important roles in disease progression.
risk factor as statistically significant. Time-to-event data (n = 1000) were generated with event times following the Weibull distribution with various shape parameters (alpha) and rate parameters (lambda), under a uniform censoring scheme. The hazard rates were modified multiplicatively by a null, negative, or positive risk factor. The time-to-event analysis used Cox proportional-hazards regression to identify risk factors, while the classification analysis used logistic regression, for which class labels were defined as event occurrence within a time cutoff. Results are shown for the cutoff set at the 75% quantile of observed event times; similar results were obtained for cutoffs set at the 25% and 50% quantiles.
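The simulation design in the caption above can be approximated with the standard library alone. The sketch below makes several assumptions that are not in the source: the Weibull survival function is parameterized as S(t) = exp(-hr * (lambda * t)^alpha), the risk factor is binary with 50% prevalence, censoring times are uniform on (0, `censor_max`), and a pooled two-proportion z-test stands in for the logistic regression (the Cox analysis is not reproduced). All function and parameter names are hypothetical.

```python
import math
import random


def simulate_once(rng, n=1000, alpha=1.0, lam=0.1, hazard_ratio=1.0,
                  censor_max=30.0, quantile=0.75):
    """Simulate one dataset; return a two-sided p-value for the risk factor."""
    rows = []
    for _ in range(n):
        x = rng.random() < 0.5                 # binary risk factor (assumption)
        hr = hazard_ratio if x else 1.0        # multiplicative effect on the hazard
        u = 1.0 - rng.random()                 # uniform on (0, 1]
        # Inverse-transform sampling from S(t) = exp(-hr * (lam * t) ** alpha)
        t_event = (-math.log(u) / hr) ** (1.0 / alpha) / lam
        t_censor = rng.uniform(0.0, censor_max)  # uniform censoring
        rows.append((x, min(t_event, t_censor), t_event <= t_censor))

    # Time cutoff: chosen quantile of the observed (uncensored) event times.
    event_times = sorted(t for _, t, observed in rows if observed)
    cutoff = event_times[int(quantile * (len(event_times) - 1))]

    # counts[x] = [patients retained, patients labeled positive]
    counts = {False: [0, 0], True: [0, 0]}
    for x, t, observed in rows:
        if observed and t <= cutoff:
            counts[x][0] += 1
            counts[x][1] += 1
        elif t >= cutoff:
            counts[x][0] += 1
        # Otherwise censored before the cutoff: dropped from the analysis.

    n0, k0 = counts[False]
    n1, k1 = counts[True]
    if n0 == 0 or n1 == 0:
        return 1.0
    pooled = (k0 + k1) / (n0 + n1)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n0 + 1.0 / n1))
    if se == 0.0:
        return 1.0
    z = (k1 / n1 - k0 / n0) / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value


def rejection_rate(hazard_ratio, n_sim=200, level=0.05, seed=0):
    """Fraction of simulated datasets in which the risk factor is significant."""
    rng = random.Random(seed)
    hits = sum(simulate_once(rng, hazard_ratio=hazard_ratio) < level
               for _ in range(n_sim))
    return hits / n_sim


null_rate = rejection_rate(1.0)  # empirical type I error under a null risk factor
power = rejection_rate(2.0)      # empirical power under a positive risk factor
```

Under a null risk factor the rejection rate should stay near the nominal level, and under a hazard ratio of 2 it should be close to 1; the exact values depend on the seed and on the assumed parameters.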
3) all AUDIT-C scores (<4 in men and <3 in women); 4) the date of the first ALT test in the study timeframe as the index date of follow-up for controls; 5) random sampling without replacement (case:control = 1:1), matched on sex, age at first ALT test (index date), and duration from the first visit to the first ALT test date (cases and controls should ideally have the same values for these three variables; when no exact match is available, the nearest match is taken).
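One way to read the 1:1 sampling step is as greedy nearest-neighbor matching without replacement. The sketch below is an illustration only: it assumes exact matching on sex, an unweighted sum of absolute differences in age and in first-visit-to-first-ALT duration as the distance, and random tie-breaking via an initial shuffle. The dictionary keys (`id`, `sex`, `age`, `lead_time`) and the function name are hypothetical, and the original matching procedure may differ.

```python
import random


def match_controls(cases, controls, seed=0):
    """Greedy 1:1 matching without replacement.

    Each record is a dict with hypothetical keys: id, sex, age, lead_time
    (years from first visit to first ALT test).
    """
    rng = random.Random(seed)
    available = list(controls)
    rng.shuffle(available)  # so that equally close controls are chosen at random
    pairs = []
    for case in cases:
        # Sex must match exactly (assumption).
        candidates = [c for c in available if c["sex"] == case["sex"]]
        if not candidates:
            continue  # no eligible control left for this case
        # Nearest control on age and lead time (unweighted distance, an assumption).
        best = min(
            candidates,
            key=lambda c: abs(c["age"] - case["age"])
            + abs(c["lead_time"] - case["lead_time"]),
        )
        pairs.append((case["id"], best["id"]))
        available.remove(best)  # without replacement
    return pairs


cases = [
    {"id": "c1", "sex": "F", "age": 50, "lead_time": 2.0},
    {"id": "c2", "sex": "M", "age": 60, "lead_time": 1.0},
]
controls = [
    {"id": "k1", "sex": "F", "age": 51, "lead_time": 2.0},
    {"id": "k2", "sex": "F", "age": 50, "lead_time": 2.0},
    {"id": "k3", "sex": "M", "age": 65, "lead_time": 1.0},
    {"id": "k4", "sex": "M", "age": 61, "lead_time": 1.5},
]
pairs = match_controls(cases, controls)
```

Greedy matching depends on the order in which cases are processed; an optimal-assignment approach would avoid that at the cost of more code.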

Inclusion and Exclusion Criteria for Hepatocellular cancer (HCC) cohort from EHR
Inclusion:
1) patients diagnosed with HCC (4 ICD codes for HCC: 155.0, C22.0, C22.8, and C22.9)
2) ≥18 years old at first visit date
Exclusion:
1) patients were classified as having non-alcoholic fatty liver disease (NAFLD) if they had 2 or more elevated alanine aminotransferase (ALT) values (≥40 IU/mL for men and ≥31 IU/mL for women) in the ambulatory settings and more than 6 months apart (LOINC code for ALT test: 1742-6)
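The ALT-based rule quoted above can be written as a small predicate. This is a sketch under assumptions: "more than 6 months" is approximated as more than 183 days, each ALT result is represented as a hypothetical `(test_date, value, is_ambulatory)` tuple, and sex is coded as `"M"` or `"F"`. Only the thresholds (40 for men, 31 for women) come from the criteria themselves.

```python
from datetime import date, timedelta

ALT_THRESHOLD = {"M": 40.0, "F": 31.0}  # sex-specific cutoffs for an elevated ALT value
SIX_MONTHS = timedelta(days=183)         # approximation of 6 months (assumption)


def meets_alt_criterion(sex, alt_results):
    """Return True if the patient has 2 or more elevated ambulatory ALT values
    with at least one pair more than 6 months apart.

    alt_results: iterable of (test_date, value, is_ambulatory) tuples.
    """
    elevated_dates = sorted(
        test_date
        for test_date, value, is_ambulatory in alt_results
        if is_ambulatory and value >= ALT_THRESHOLD[sex]
    )
    if len(elevated_dates) < 2:
        return False
    # The earliest and latest elevated results are the widest-spaced pair.
    return (elevated_dates[-1] - elevated_dates[0]) > SIX_MONTHS


female_results = [
    (date(2015, 1, 10), 35.0, True),
    (date(2015, 9, 1), 33.0, True),
]
male_results = [
    (date(2015, 1, 10), 35.0, True),   # below the male threshold
    (date(2015, 9, 1), 38.0, True),    # below the male threshold
]
female_flag = meets_alt_criterion("F", female_results)
male_flag = meets_alt_criterion("M", male_results)
```

The same values that qualify a female patient here do not qualify a male patient, because the thresholds differ by sex.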


Figure S1. The distribution of propensity scores for positive and negative samples before and

Figure S3. Comparison of the accuracy of risk factor identification under a survival model vs. a

Figure S4. Comparison of the accuracy of risk factor identification under a survival model vs. a

Figure S5. Class label probability is a well-calibrated estimator of the cumulative event

Figure S6.

Table S2. High contribution medical codes with positive attention for male patients.

Table S3. High contribution medical codes with positive attention for female patients.

Table S4. High contribution medical codes with negative attention for male patients.

Table S5. High contribution medical codes with negative attention for female patients.