Predicting Long COVID in the National COVID Cohort Collaborative Using Super Learner

Post-acute Sequelae of COVID-19 (PASC), also known as Long COVID, is a broad grouping of a range of long-term symptoms following acute COVID-19 infection. An understanding of characteristics that are predictive of future PASC is valuable, as this can inform the identification of high-risk individuals and future preventative efforts. However, current knowledge regarding PASC risk factors is limited. Using a sample of 55,257 participants from the National COVID Cohort Collaborative, as part of the NIH Long COVID Computational Challenge, we sought to predict individual risk of PASC diagnosis from a curated set of clinically informed covariates. We predicted individual PASC status, given covariate information, using Super Learner (an ensemble machine learning algorithm also known as stacking) to learn the optimal, AUC-maximizing combination of gradient boosting and random forest algorithms. We were able to predict individual PASC diagnoses accurately (AUC 0.947). Temporally, we found that baseline characteristics were most predictive of future PASC diagnosis, compared with characteristics immediately before, during, or after COVID-19 infection. This finding supports the hypothesis that clinicians may be able to accurately assess the risk of PASC in patients prior to acute COVID diagnosis, which could improve early interventions and preventive care. We found that medical utilization, demographics and anthropometry, and respiratory factors were most predictive of PASC diagnosis. This highlights the importance of respiratory characteristics in PASC risk assessment. The methods outlined here provide an open-source, applied example of using Super Learner to predict PASC status using electronic health record data, which can be replicated across a variety of settings.


BACKGROUND
As the mortality rate associated with acute COVID-19 incidence wanes, investigators have shifted focus to determining its longer-term, chronic impacts. 1 Post-acute Sequelae of COVID-19 (PASC) is a loosely categorized consequence of acute infection that is related to dysfunction across multiple biological systems. 2 Electronic health record (EHR) databases, such as the National COVID Cohort Collaborative (N3C), provide an important tool for predicting, evaluating, and understanding PASC. 3,4 Given the broad range of factors associated with PASC, the high dimensionality of the N3C Enclave data, and the unknown determinants of Long COVID, modeling methods for predicting PASC must be highly flexible.
Super Learner (SL) is a flexible, ensemble (stacked) machine learning algorithm that uses cross-validation to learn the optimal weighted combination of a specified set of algorithms. 5,6 The SL is grounded in statistical optimality theory that guarantees that, for large sample sizes, it will perform at least as well as the best-performing algorithm included in the library. Thus, a rich library of learners, with a sufficient sample size, will ensure optimal performance. This robustness is supported by numerous applications, and the SL can be specified to optimize any performance metric, for example by minimizing mean squared error. 6
Here, we used the SL to estimate the function for predicting PASC diagnosis in COVID-infected patients, given a diverse set of features curated from the EHR. The SL was specified such that it learned the combination of algorithms, including variations of gradient boosting and random forest, that maximized the area under the receiver operating characteristic curve (AUC). 7 Our set of features for predicting PASC included those previously described in the literature, 3 and additional features related to subject-matter knowledge and patterns of missingness. We also investigated the importance of features for predicting PASC across multiple levels, including assessing the importance of each individual feature, and groups of features based on temporality (baseline, pre-COVID, acute COVID, and post-COVID features) and hypothesized biological pathways of PASC.

METHODS
Sample
The Long COVID Computational Challenge (L3C, DUR RP-5A73BA) sample population was selected from the N3C dataset, a national, open dataset that has been described previously. 3,4 The L3C sample included participants diagnosed with PASC (ICD code U09.9) and controls with a documented COVID-19 diagnosis who had at least one medical visit more than 4 weeks after their initial COVID diagnosis date. Controls were selected at a 1:4 (case:control) ratio and were matched based on the distribution of medical visits prior to COVID-19 diagnosis. The primary outcome of interest was PASC diagnosis via ICD code U09.9.
The dataset included 57,672 patients: 9,031 cases, 46,226 controls, and 2,415 patients who were excluded because they had a PASC diagnosis before 4 weeks following acute COVID diagnosis. This yielded a final analytic sample of 55,257 participants.

Feature selection
We extracted 304 features from N3C data. After indexing across four time periods and transforming features into formats amenable to machine learning analysis, our sample included 1,339 features (see Supplemental Table 1. Metadata). Details regarding feature selection and processing can be accessed via GitHub (https://github.com/BerkeleyBiostats/l3c_ctml/tree/v1). For continuous features, we included the minimum, maximum, and mean values for each measurement in each temporal window. For binary features, we either included an indicator (when repetition was not relevant) or a count (when repetition was relevant) over each time period, and we re-coded categorical variables as indicators.
Temporal windows: We divided each participant's records into four temporal windows: baseline, which consisted of all records occurring at least 37 days before the COVID index date (up to t - 37, where t represents the COVID index date) and all time-invariant factors (such as sex and ethnicity); pre-COVID, observations from 37 days to 7 days before the index date (t - 37 to t - 7); acute COVID, observations from 7 days before to 14 days after the index date (t - 7 to t + 14); and post-COVID, observations from 14 to 28 days after the index date (t + 14 to t + 28).
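The windowing and per-window summaries described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the Enclave: the function names are hypothetical, and the handling of the boundary days (which window owns day -37, -7, or +14) is an assumption, since only the cut points are stated.

```python
from statistics import mean


def assign_window(day_offset):
    """Map days relative to the COVID index date (t = 0) to a window name.

    Boundary ownership is an assumption made for this sketch.
    """
    if day_offset <= -37:
        return "baseline"
    if day_offset < -7:
        return "pre_covid"
    if day_offset < 14:
        return "acute_covid"
    if day_offset <= 28:
        return "post_covid"
    return None  # after day 28: outside the prediction windows


def summarize(records):
    """Summarize one continuous measurement per window.

    records: iterable of (day_offset, value) pairs for one participant.
    Returns {window: {"min": ..., "max": ..., "mean": ...}}.
    """
    by_window = {}
    for day_offset, value in records:
        window = assign_window(day_offset)
        if window is not None:
            by_window.setdefault(window, []).append(value)
    return {
        window: {"min": min(vals), "max": max(vals), "mean": mean(vals)}
        for window, vals in by_window.items()
    }
```

Windows with no observations are simply absent from the result, which is where the missingness indicators described under "Missing data" would come in.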
Features described in the literature: Pfaff et al. used gradient-boosting machine learning models (XGBoost) to identify patients at risk for PASC using N3C data. 3 We extracted and transformed key features that were identified by Pfaff et al. These features included 199 previously described factors related to medical history, diagnoses, demographics, and comorbidities. 3
Temporality: To account for differences in follow-up, we included as an additional factor a continuous variable for follow-up time, defined as the number of days between the COVID index date and the most recent observation. To account for temporal trends of COVID (such as seasonality and dominant variant), we included categorical (ordinal) covariates for the season and months since the first observed COVID index date.
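As an illustration of the temporality covariates, the sketch below derives follow-up time, season, and months since the first observed index date from three dates. The function name and the season definition (meteorological seasons, Northern Hemisphere) are assumptions; the text does not say how season was coded.

```python
from datetime import date

# Assumed coding: meteorological seasons, Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def temporality_features(index_date, last_observation, first_index_date):
    """Build the follow-up and calendar-time covariates for one participant.

    index_date: the participant's COVID index date.
    last_observation: date of the participant's most recent record.
    first_index_date: earliest COVID index date observed in the whole sample.
    """
    months_since_first = (
        (index_date.year - first_index_date.year) * 12
        + (index_date.month - first_index_date.month)
    )
    return {
        "follow_up_days": (last_observation - index_date).days,
        "season": SEASON_BY_MONTH[index_date.month],
        "months_since_first_index": months_since_first,
    }
```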
Missing data: To allow prediction for future observations with missing data, we created indicator basis functions that flag, for each variable, whether the observation was missing (yes/no). 8 By including these indicators (along with filling each missing value with 0), we allow the machine to determine what predictive information can be drawn from the missingness process, without relying on an imputation model. Thus, these indicators allow the pattern of missingness itself to be a predictor of PASC.
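A minimal sketch of this indicator-plus-zero-fill encoding is shown below. The row format (dictionaries with `None` for missing values) and the `_missing` suffix are assumptions made for the example.

```python
def add_missingness_indicators(rows, feature_names):
    """Zero-fill missing values and add one 0/1 missingness flag per feature.

    rows: list of dicts mapping feature name -> value, with None (or an
    absent key) meaning "missing".
    """
    encoded = []
    for row in rows:
        new_row = {}
        for name in feature_names:
            value = row.get(name)
            is_missing = value is None
            new_row[name] = 0.0 if is_missing else value
            new_row[f"{name}_missing"] = int(is_missing)
        encoded.append(new_row)
    return encoded
```

Because the flag is kept alongside the zero-filled value, a learner can distinguish a true 0 from a filled 0 and can use the missingness pattern itself as a predictor.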
COVID-19 positivity: We added several measures of COVID severity and persistent SARS-CoV-2 viral load, which are associated with PASC incidence. 9 We imported measures of COVID severity as well as 15 measures of COVID infection from laboratory measurements, which provided insights on persistent SARS-CoV-2 viral load. We assessed the duration of COVID viral positivity separately for each laboratory measure of COVID and each temporal window. For participants who had both a positive and negative value of a given test during a temporal window, we took the midpoint between the last positive test and the first negative test as being the endpoint of their positivity. For individuals who had a positive test but no subsequent negative test within that temporal window, we determined their endpoint to be their final positive test plus three days. We included separate missingness indicators in each temporal window for each test, for a positive value for each test, and for a negative value following a positive value to indicate an imputed positivity endpoint.
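The endpoint rule can be written out as a short function. This is a sketch under one stated assumption: "the first negative test" is read as the first negative test that follows the last positive test. The function name and input format are hypothetical.

```python
def positivity_endpoint(tests):
    """Estimate the end of viral positivity for one lab measure in one window.

    tests: list of (day, is_positive) pairs.
    Returns (endpoint_day, imputed), where imputed is True when no negative
    test followed the last positive one, so the endpoint was set to the
    final positive test plus three days. Returns (None, False) when there
    is no positive test in the window.
    """
    positive_days = [day for day, is_positive in tests if is_positive]
    if not positive_days:
        return None, False
    last_positive = max(positive_days)
    later_negatives = [
        day for day, is_positive in tests
        if not is_positive and day > last_positive
    ]
    if later_negatives:
        first_negative = min(later_negatives)
        return (last_positive + first_negative) / 2, False
    return last_positive + 3, True
```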
Additional features: We incorporated the laboratory measurements related to anthropometry, nutrition, COVID positivity, inflammation, tissue damage due to viral infection, auto-antibodies and immunity, cardiovascular health, and microvascular disease, which are potential predictors of PASC. 9 We also extracted information about smoking status, alcohol use, marital status, and use of insulin or anticoagulants from the observation table as baseline characteristics of individuals, and we included the number of times a person has been exposed to respiratory devices in each of the four windows from the device table. We extracted covariates related to COVID severity, vaccination history, demographics, medical history, and previous diagnoses from before and during acute COVID infection.

Prediction using ensemble machine learning
We used the SL, an ensemble machine learning method also known as stacking, to learn the optimally weighted combination of candidate algorithms for maximizing the AUC. We reprogrammed the SL in Python in order to capitalize on the resources available in the N3C Data Enclave (e.g., PySpark parallelization), and this software is available to external researchers (https://github.com/BerkeleyBiostats/l3c_ctml/tree/v1). We used a relatively small ensemble of four learners (a mix of robust parametric models and machine learning models): 1. Logistic regression; 2. L1 penalized logistic regression (with penalty parameter lambda = 0.01);
One important decision in optimizing an ensemble is choosing the metric that will be used to evaluate the fit and to optimize the weighting of the algorithms in the ensemble. We used an approach developed specifically for maximizing the area under the curve (AUC). 7 Specifically, we used an AUC-maximizing meta-learner with Powell optimization to learn the convex combination of these four candidate algorithms. 7 The SL was implemented with a 10-fold (V-fold) cross-validation scheme.
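To make the meta-learning step concrete, the sketch below finds a convex combination of candidate learners that maximizes AUC on their cross-validated predictions. It is not the implementation in the repository above: to stay dependency-free it swaps Powell optimization for a coarse grid search over the weight simplex, and it takes the cross-validated predictions as given inputs. All function names are hypothetical.

```python
from itertools import product


def auc(y_true, scores):
    """AUC as the chance that a random case outranks a random control.

    Ties between a case and a control count as one half.
    """
    cases = [s for y, s in zip(y_true, scores) if y == 1]
    controls = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def convex_weights(n_learners, steps):
    """Yield non-negative weight vectors summing to 1, on a grid of 1/steps."""
    for combo in product(range(steps + 1), repeat=n_learners - 1):
        if sum(combo) <= steps:
            yield [c / steps for c in combo] + [(steps - sum(combo)) / steps]


def fit_auc_metalearner(y_true, cv_preds, steps=20):
    """Pick the convex combination of learners with the highest AUC.

    cv_preds[k][i] is learner k's cross-validated prediction for subject i,
    i.e. the prediction made by a fit that did not see subject i.
    Grid search stands in for Powell optimization in this sketch.
    """
    best_weights, best_auc = None, -1.0
    for weights in convex_weights(len(cv_preds), steps):
        blend = [
            sum(w * preds[i] for w, preds in zip(weights, cv_preds))
            for i in range(len(y_true))
        ]
        score = auc(y_true, blend)
        if score > best_auc:
            best_weights, best_auc = weights, score
    return best_weights, best_auc
```

Because the grid includes the vertices of the simplex (all weight on one learner), the selected combination has cross-validated AUC at least as high as the best single candidate. With more than a handful of learners the grid becomes expensive, which is one reason to prefer a derivative-free optimizer such as SciPy's `scipy.optimize.minimize(..., method="Powell")` in practice.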

Variable importance
In this section, for the sake of computational efficiency, we worked with the discrete SL selector (the single candidate learner in the library with the highest cross-validated AUC) instead of the entire ensemble SL. In this case, the gradient-boosting learner was the candidate learner with the highest cross-validated AUC. We assessed variable importance using Shapley values, a general approach that can be applied to any machine learning algorithm. 10 We generated these values within three groupings of predictors for ease of interpretability: individual features (e.g., cough diagnosis during the acute COVID window), the temporal window when measurements were made relative to acute COVID infection (e.g., pre-COVID window), and specific biological pathways (e.g., respiratory pathway). At the individual level, we assessed the importance of each variable (indexed across each of the four temporal windows) in predicting PASC. At the temporal level, we assessed the relative importance of each of the four temporal windows (baseline, pre-COVID, acute COVID, and post-COVID) in predicting PASC status. At the level of the biological pathway, we grouped variables based on the following hypothesized mechanistic pathways of PASC: 1) Baseline demographics and anthropometry, 2) Medical visitation and procedures, 3) Respiratory system, 4) Antimicrobials and infectious disease, 5) Cardiovascular system, 6) Female hormones and pregnancy, 7) Mental health and wellbeing, 8) Pain, skin sensitivity, and headaches, 9) Digestive system, 10) Inflammation, autoimmune, and autoantibodies, 11) Renal function, liver function, and diabetes, 12) Nutrition, 13) COVID Positivity, 14) Uncategorized disease, nervous system, injury, mobility, and age-related factors. 9 For temporal and biological groupings, we assessed the mean Shapley value of the 10 most predictive features in each group. A full list of our included covariates along with their grouping by temporality and biological pathway is included in our metatable (Supplemental Table 1. Metadata).
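The group-level summary can be illustrated as follows, assuming per-subject Shapley values have already been computed for each feature. The sketch takes the mean absolute Shapley value per feature and then averages the 10 largest values within each group; averaging over fewer features when a group has fewer than 10 is an assumption, and the function names are hypothetical.

```python
def mean_abs_shap(shap_values):
    """Mean absolute Shapley value of one feature across subjects."""
    return sum(abs(v) for v in shap_values) / len(shap_values)


def group_importance(feature_importance, feature_group, top_k=10):
    """Average the top_k largest feature importances within each group.

    feature_importance: {feature name: mean absolute Shapley value}.
    feature_group: {feature name: group label}, e.g. a temporal window
    or a biological pathway.
    Groups with fewer than top_k features are averaged over what they have.
    """
    by_group = {}
    for feature, importance in feature_importance.items():
        by_group.setdefault(feature_group[feature], []).append(importance)
    result = {}
    for group, values in by_group.items():
        top = sorted(values, reverse=True)[:top_k]
        result[group] = sum(top) / len(top)
    return result
```

Restricting to the top features keeps groups with many weak predictors from being penalized relative to small groups, at the cost of making the summary sensitive to the choice of cutoff.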

RESULTS
Predictive performance
Our models accurately predicted PASC diagnosis status among participants in the training sample, with an AUC of 0.947 on a holdout test set (10% of full data).

Variable importance
Individual predictors: We found that the strongest individual predictors (mean absolute Shapley value) of PASC diagnosis were the length of follow-up (0.40), the number of medical visits associated with a diagnosis during the acute COVID window (0.26), data partner ID (0.25), viral lower respiratory infection during the acute COVID window (0.11), and age (0.06) (Figures 1 and 2).
Temporal windows: Baseline and time-invariant characteristics were the strongest predictors of PASC (mean 0.093), followed by characteristics during the acute COVID window (mean 0.049) (Figure 3).
Biological pathways: We found that medical visitation and procedures included the strongest predictors (mean 0.085), followed by demographics and anthropometry (mean 0.054), respiratory factors (mean 0.023), COVID markers (mean 0.0064), and markers of pain (mean 0.0047) (Figure 4).

DISCUSSION
Predictive performance
These results provide strong support for 1) the choice of an ensemble learning approach, 2) the specific learners used, 3) how the missing data was handled, and 4) the choice of optimization criteria (maximizing the AUC).

Variable importance
Individual predictors: We found that the individual predictors most associated with PASC diagnosis were related to medical utilization rate and site of care, such as length of follow-up and data partner ID. These factors are unlikely to be causal drivers of PASC incidence. On the other hand, we found that viral lower respiratory infection during acute COVID was highly predictive of PASC diagnosis. Lower respiratory infection during acute COVID may be a causal pathway by which acute COVID leads to PASC, although future studies should apply a causal inference framework to evaluate this hypothesis.
Temporal windows: We found that baseline factors were the strongest predictors of PASC diagnosis, compared with factors immediately before, during, or after acute COVID-19 infection. This suggests that clinicians may be able to effectively identify who is at risk for PASC based on baseline characteristics and COVID infection symptoms. It should be noted, however, that baseline characteristics covered the greatest interval of time and included some time-variant factors that were not linked to any specific time point. Future analyses should expand on this finding to evaluate the feasibility of predicting individual PASC incidence, rather than diagnosis, using baseline characteristics alone. Additional information regarding this relationship could identify patients at risk for PASC prior to acute COVID-19 and could inform early interventions to prevent PASC.
Biological pathways: These results are consistent with published literature and highlight respiratory features (e.g., asthma) as important factors in predicting who may develop PASC, which is consistent with the fact that SARS-CoV-2 is a respiratory virus. 2,3 Respiratory factors can influence individual susceptibility to COVID-19, are important features of acute COVID-19 severity, and are key symptoms of PASC. 2,3,11 Therefore, future studies should seek to parse the contributions of respiratory symptoms to PASC through the pathways of baseline susceptibility to COVID-19 versus phenotyping of severe COVID-19 in order to improve our understanding of respiratory features as a risk factor for PASC. Despite the range of PASC phenotypes, these findings are consistent with respiratory symptoms (e.g., dyspnea, cough) being the most commonly reported PASC symptoms. 9,11 Other biological pathways, such as cardiovascular factors, have similar roles as both markers of susceptibility and severity of COVID-19 and should also be explored further in future studies.

Limitations
Our goal for this analysis was to maximize predictive accuracy, rather than to make causal inferences regarding exposure-outcome relationships. Therefore, we included all predictors prior to four weeks post-COVID (censored window). The inclusion of pre-COVID, acute COVID, and post-COVID factors complicates inference regarding whether predictive features (e.g., respiratory factors) reflect vulnerability to acute COVID, COVID symptoms, or early PASC symptoms. This analytic sample was matched 1:4 (PASC:non-PASC), with matching based on pre-COVID medical visitation rate, and this matched sample was drawn from N3C, which is a matched sample of COVID patients and healthy controls. Therefore, this sample may not be representative of a broader population. We note that, for future use of these data, if the prevalence of PASC in the target population is known, and the matching identifier is available, there are methods to calibrate the results to the actual population. Because that was not the case here, predictions from these models may need to be re-calibrated to the target population of interest.
We found measures of medical visitation to be strong predictors of PASC diagnosis. It is plausible that medical visitation may be associated with increased diagnoses in general, rather than true PASC incidence. However, increased medical visitation may be an effect of early PASC symptoms.
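One simple form of such recalibration is a prior (intercept) correction, which rescales predicted odds by the ratio of the target-population odds of PASC to the odds in the sample. The sketch below shows that adjustment only. It is not a method used in this analysis, it ignores the matching on pre-COVID visitation (which a full correction would need to account for), and any target prevalence passed to it is an input the user must supply.

```python
# Case fraction in the analytic sample: 9,031 cases among 55,257 participants.
SAMPLE_PREVALENCE = 9031 / 55257


def recalibrate(p, sample_prevalence, target_prevalence):
    """Shift a predicted probability from the sample case fraction to a
    target-population prevalence by rescaling the odds.

    Assumes the only difference between sample and target population is
    the outcome prevalence (no adjustment for the matching variables).
    """
    if p <= 0.0 or p >= 1.0:
        return p  # an odds rescaling leaves probabilities of 0 or 1 unchanged
    sample_odds = sample_prevalence / (1.0 - sample_prevalence)
    target_odds = target_prevalence / (1.0 - target_prevalence)
    odds = p / (1.0 - p) * (target_odds / sample_odds)
    return odds / (1.0 + odds)
```

This adjustment is monotone in the predicted probability, so it changes calibration but leaves the ranking of patients, and therefore the AUC, unchanged.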

Future steps
In order to improve upon the interpretation and clinical applications of these findings, future studies should apply a causal inference approach to evaluate the potential causal impact of individual predictors on the risk of PASC. Future studies should rigorously evaluate highly-predictive features, e.g.3][14][15] TMLE is a general method for deriving estimates and robust inference for nonparametric measures of association, so it is particularly well-suited for use in the context of machine learning. It can produce estimates of parameters such as the average treatment effect, causal relative risk, causal attributable risk, direct effects, and many others. Interpretation of results as estimates of causal parameters requires assumptions outside of the data (e.g., no unmeasured confounding); thus, although these estimates provide good insights about the magnitude and direction of the average impact of a predictor, causal interpretation of the results should be made with caution.
One key exposure of interest is vaccination, which is a key strategy in preventing acute COVID-19 infection. There is evidence that COVID-19 vaccination is protective against PASC, but less is known about how vaccination timing (i.e.7][18] Additional information on the relationship between vaccination timing and PASC may inform vaccination guidelines. Furthermore, we lack biomarkers that can objectively diagnose or quantify the risk of PASC, which limits our ability to research, prevent, and treat this condition. 9,19 Evidence regarding these potential mechanistic biomarkers will be a key step in the efforts to combat this disease.

Summary
These findings highlight the importance of respiratory symptoms, healthcare utilization, and age in predicting PASC incidence, which is consistent with Pfaff et al. 3 Although further investigation is needed, this supports the referral of COVID-19 patients with severe respiratory symptoms for subsequent PASC monitoring. In future work, we plan to investigate predictive performance when only baseline information is used as input to classify PASC, as this provides a practical implementation based on readily available clinical features that could identify participants at risk of PASC prior to COVID diagnosis.