Who has long-COVID? A big data approach

Background Post-acute sequelae of SARS-CoV-2 infection (PASC), otherwise known as long-COVID, have severely impacted recovery from the pandemic for patients and society alike. This new disease is characterized by evolving, heterogeneous symptoms, making it challenging to derive an unambiguous long-COVID definition. Electronic health record (EHR) studies are a critical element of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, which is addressing the urgent need to understand PASC, accurately identify who has PASC, and identify treatments. Methods Using the National COVID Cohort Collaborative (N3C) EHR repository, we developed XGBoost machine learning (ML) models to identify potential long-COVID patients. We examined demographics, healthcare utilization, diagnoses, and medications for 97,995 adult COVID-19 patients. We used these features and 597 long-COVID clinic patients to train three ML models to identify potential long-COVID patients among (1) all COVID-19 patients, (2) patients hospitalized with COVID-19, and (3) patients who had COVID-19 but were not hospitalized. Findings Our models identified potential long-COVID patients with high accuracy, achieving areas under the receiver operator characteristic curve of 0.91 (all patients), 0.90 (hospitalized); and 0.85 (non-hospitalized). Important features include rate of healthcare utilization, patient age, dyspnea, and other diagnosis and medication information available within the EHR. Applying the "all patients" model to the larger N3C cohort identified 100,263 potential long-COVID patients. Interpretation Patients flagged by our models can be interpreted as patients likely to be referred to or seek care at a long-COVID specialty clinic, an essential proxy for long-COVID diagnosis in the current absence of a definition. We also achieve the urgent goal of identifying potential long-COVID patients for clinical trials. As more data sources are identified, the models can be retrained and tuned based on study needs. Funding This study was funded by NCATS and NIH through the RECOVER Initiative.

symptoms more than four weeks after severe, mild, or asymptomatic SARS-CoV-2 infection. 4,5 Characterizing, diagnosing, treating, and caring for long-COVID patients has proven very challenging due to heterogeneous signs and symptoms that evolve over long trajectories. 6 The impact of long-COVID on patients' quality of life and ability to work can be profound.
The wide range of symptoms attributed to long-COVID was highlighted in an extensive patient-led survey, 7 which conducted deep longitudinal characterization of long-COVID symptoms and trajectories in suspected and confirmed COVID-19 patients reporting illness lasting more than 28 days. 8 Evaluation and harmonization of patient-and clinically reported long-COVID features using the Human Phenotype Ontology (HPO) also revealed heterogeneous signs and symptoms, supporting the conclusion that a complex constellation of both patient-and clinically reported features is necessary to correctly classify and manage long-COVID patients. 9 The World Health Organization (WHO) recently published its own case definition of "post COVID-19 condition" (WHO's term) that includes twelve criteria, which similarly require a wide variety of patient-declared and clinical information. 10 In order to gain an understanding of the complexities of long-COVID, it will be necessary to recruit a large and diverse cohort of research participants. The NIH's RECOVER study 11 is one such initiative aiming to recruit thousands of participants nationwide in order to answer critical research questions about PASC such as understanding pregnancy risk factors, cognitive impairment and mental health, and outcome disparities and comorbidities. Efficient recruitment of cohorts of this size and scope often entails leveraging computable phenotypes 12-14 (i.e., electronic cohort definitions) to find sufficient numbers of patients meeting a study's inclusion criteria. Poor cohort definition can result in poor study outcomes. 15, 16 For long-COVID, as with other novel conditions, the lack of an agreed-upon definition and the heterogeneity of the condition's presentation poses a significant challenge to cohort identification. Machine learning (ML) can help address this by using the rich longitudinal data available in electronic health records (EHRs) to algorithmically identify patients similar to those in a long-COVID "gold standard." The National COVID Cohort Collaborative (N3C) 17 offers a data-driven solution to quantifying the features of long-COVID and an appropriate proving ground for an ML approach. 18 N3C is an NIH National Center for Advancing Translational Sciences (NCATS)-sponsored data and analytic environment which compiles and harmonizes longitudinal electronic EHR data from 65 sites and over 8 million patients who have (1) tested positive for SARS-CoV-2 infection, (2) whose symptoms are consistent with a COVID-19 diagnosis, or (3) are demographically matched controls who have tested negative for SARS-CoV-2 infection (and have never tested positive) to support comparative studies. 19 To build a foundation for a robust clinical definition of long-COVID, we linked curated long-COVID clinic patient lists from three N3C sites with data in the N3C repository. We then used the linked dataset to train and test three ML models and applied those models to (1) define a nationwide cohort of potential long-COVID patients and (2) derive a list of prominent clinical features shared among that cohort to help identify patients for research studies and target features for further investigation.

Research in context
Evidence before this study Initial characterization of long-COVID patients has contributed to an emerging clinical understanding, but the significant heterogeneity of disease features makes diagnosing and treating this new disease extremely challenging. This challenge is urgent to address, as many patients report long-COVID symptoms as debilitating, significantly affecting their ability to engage in activities of daily life. Few studies have utilized large-scale databases to understand concordance of clinical patterns and generate data-driven definitions of long-COVID. The NIH RECOVER program has invested in EHR studies to understand PASC, to accurately identify who has PASC, and to prevent and treat PASC.
Added value of this study N3C harmonizes patient-level EHR data from over 8 million demographically diverse and geographically distributed patients. Here, we describe highly accurate XGBoost ML models that use N3C to identify presumptive long-COVID patients, trained using EHR data from patients who attended a long-COVID specialty clinic. The most powerful predictors in these models are outpatient clinic utilization post-acute COVID-19, patient age, dyspnea, and other diagnosis and medication features readily available in the EHR. The model is transparent and reproducible, and can be widely deployed at individual healthcare systems to enable local research recruitment or secondary data analysis.
Implications of all the available evidence N3C's longitudinal data for COVID-19 patients provides a comprehensive foundation for the development of ML models to identify potential long-COVID patients. Such models enable efficient study recruitment efforts to deepen our understanding of long-COVID and offer opportunities for hypothesis generation. Moreover, as more patients are diagnosed with long-COVID and novel sources of data become more available (e.g., sensor data from wearables), our models can be refined and retrained, evolving the algorithm as more evidence emerges. Our model can also be the basis for predicting who will seek care for long-COVID, and therefore provide the basis for generalizable clinical decision support tools.

Role of the funding source
This study was funded by NCATS, which contributed to the design, maintenance, and security of the N3C Enclave; and the NIH RECOVER Initiative, which is coordinating the participant recruitment protocol to which this work contributes. No authors have been paid by a pharmaceutical computer or other agency to write this article. Authors were not precluded from accessing data in the study, and they accept responsibility to submit for publication.

Selecting and subsetting the base population
To model long-COVID, we used EHR data integrated and harmonized in Palantir Foundry inside the N3C Secure Data Enclave to identify unique healthcare utilization patterns and clinical features among COVID-19 patients. For the purpose of this study, we defined our base population (n = 1,793,604) as any non-deceased adult patient (age >= 18 years) with either an ICD-10-CM COVID-19 diagnosis code (U07.1) from an inpatient or emergency visit, or a positive SARS-CoV-2 PCR or antigen test, and for whom at least 90 days have passed since COVID-19 index date. Prior to this analysis, patients from six N3C sites were removed from the cohort due to their sites' use of randomly shifted dates of service, which limits our ability to use temporal logic during analysis.
Without an agreed-upon case definition, no long-COVID gold standard exists to validate computable phenotypes and train ML models. However, three N3C sites provided lists of locally identified patients who had visited that site's long-COVID speciality clinic one or more times. These patients represent a "silver standard" within our base population (n = 597, once our base population criteria were applied). Hereafter, we will refer to this group of patients as "long-COVID clinic patients"; patients identified by our trained model as long-COVID patients will be referred to as "potential long-COVID patients." This silver standard enabled us to develop a model to identify patients likely to be referred to or seek care at a long-COVID clinic-a valuable proxy for long-COVID until a true gold standard is available.
For the purposes of training and testing ML models, we created a subset of our overall cohort containing only patients originating from those three sites (n = 97,995), including the long-COVID clinic patients. This subset was stratified further into patients who had been hospitalized with acute COVID-19 (n = 19,368) and patients not hospitalized (n = 78,627). We narrowed further to patients who had at least one encounter and at least one diagnosis or one medication in their post-COVID-19 window (15,621 hospitalized, 58,351 not hospitalized). The full cohort selection and subsetting process is illustrated in Supplemental Figure 1.

Feature selection
Within the three-site subset, we examined demographics, encounter details, conditions, and medication orders for each patient before and after their period of acute COVID-19. Though its use was considered, lab result data ultimately proved too sparse among the cohort for use in the models, especially for non-hospitalized patients. Features were selected for inclusion in the model by gathering data points in these domains associated with the long-COVID clinic patients in the time period of interest (see Figure 1). For each patient, we only counted diagnoses that newly occurred or occurred in greater frequency in the post-COVID-19 period compared to the pre-COVID-19 period, and only counted medications that were newly prescribed in the . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 22, 2021. ; post-COVID-19 period, with no order records in the pre-COVID-19 period. Detailed feature engineering methods are described in Supplemental Methods.

Modeling
To reflect that long-COVID may look different depending on the severity of the patient's acute COVID-19, we built three different ML models using the three-site subset: (1) all patients, (2) patients who had been hospitalized with acute COVID-19, and (3) patients who were not hospitalized. The intent of each model is to identify the patients most likely to have long-COVID, using attendance at a long-COVID specialty clinic as a proxy for long-COVID diagnosis. To train and test each model, patients were randomly sampled to yield similar patient counts in both classes (long-COVID clinic patients and patients who did not attend the long-COVID clinic). For the all-patient model, data were also sampled to yield similar numbers of hospitalized and non-hospitalized patients. Counts of patients in each group used for training and testing are shown in Supplemental Table 1.
The python package XGBoost was used to construct the models, using 924 features in total. Categorical features were one-hot encoded. Age and encounter rates were treated as continuous variables and conditions and drugs were modeled as binary features; see Supplemental Methods for feature engineering details. Model hyperparameters were tuned using scikit-learn's GridSearchCV, with 5-fold cross validation, set to optimize the area under the receiver operating characteristic curve (AUROC). We trained each model using 5-fold cross validation, repeated 5 times. To assess performance, we calculated the AUROC, as well as the . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 22, 2021. ; https://doi.org/10.1101/2021.10.18.21265168 doi: medRxiv preprint precision, recall, and F-score for each model with a predictive probability threshold of 0.45. To aid with interpretability, we calculated SHapley Additive exPlanation (SHAP) values 20 for all features to quantify each feature's importance to the classifications made by each model.
Once trained, we ran the "all patients" model over the full base population of patients who had at least one encounter and at least one diagnosis or one medication in their post-COVID-19 window (n = 846,981), in order to flag potential long-COVID patients in the N3C Enclave.

Findings
The combined demographics of the long-COVID clinic patients supplied by the three sites (first two columns of Table 1) show significant differences from the COVID-19 patients at those sites who did not attend the long-COVID clinic (third and fourth columns of Table 1). Notably, non-hospitalized long-COVID clinic patients are disproportionately female. Long-COVID clinic patients who were hospitalized with acute COVID-19 are disproportionately Black when compared with all patients hospitalized with acute COVID-19, and are much more likely to have a pre-COVID-19 comorbidity (diabetes, kidney disease, congestive heart failure, or pulmonary disease).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 22, 2021. ; Each model was run against this three-site population; the performance metrics including receiver operating characteristic (ROC) curves are shown in Figure 2. For the purpose of calculating these metrics, long-COVID clinic patients are considered as actual positives; patients from the three sites who have not visited the specialty clinic are counted as true negatives. Patients labeled by the model as potential long-COVID patients should therefore be interpreted as "patients likely to be referred to or seek care at a long-COVID specialty clinic;" a proxy for long-COVID diagnosis in the current absence of a definition.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.   Table  2a-c. Figure 4 shows the aggregate feature importance and univariate odds ratios for each model.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 22, 2021. ; . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 22, 2021. ; . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 22, 2021. ;

Interpretation Important model features
To avoid influencing the model with prior assumptions, we took a light-touch approach to feature selection, performing as little manual curation of features as possible prior to training and testing our models. Because of this approach, the reasons that a given feature may be important to one or more of the models is not always obvious. However, review by clinical experts of the features shown in Figures 3 and 4 and Supplementary Table 2a-c revealed a number of possible themes, described here.
Post-COVID respiratory symptoms and associated treatments: These features are commonly reported for long-COVID patients. 7,9,22 A confounding factor that prioritizes these . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 22, 2021. ; features may be that the long-COVID clinics at two of the three sites contributing long-COVID clinic patients are based in the pulmonary department. However, given that SARS-CoV-2 is a primary respiratory virus, it is not surprising that long-term respiratory symptoms were observed. Similar long-term respiratory symptomatology is well-described with respiratory viral syndromes including those from severe acute respiratory syndrome, respiratory syncytial virus, influenza, and COVID-19. 23,24 The high proportion of albuterol use and use of inhaled steroids is consistent with the expected high prevalence of post-viral reactive airways disease. Example features include dyspnea/difficulty breathing, cough, albuterol, guaifenesin, and hypoxemia.
Non-respiratory symptoms widely reported as part of long-COVID and associated treatments: Sleep disorders, anxiety, malaise, chest pain, and constipation have all been reported as symptoms of long-COVID, and are included in WHO's recent case definition. 10 The example features in this group include both symptoms and mitigating treatments. Example features include dyssomnia, chest pain, malaise; and treatments with lorazepam, melatonin, and polyethylene glycol 3350.
Preexisting risk factors for greater acute COVID severity: Some known risk factors for acute COVID-19 and severity are associated with long-COVID. This includes chronic conditions like diabetes, chronic kidney disease, and chronic pulmonary disease, which predispose patients at increased risk for worsened symptoms. 25 Proxies for hospitalization: Features that are representative of standard hospital admission orders are likely contributing to the model as proxies for hospitalization in general rather than being individually meaningful. These features tend to feature more prominently in the non-long-COVID group, suggesting that the model is (correctly) differentiating between acute illness requiring hospitalization and long-COVID. Example features include the use of glucose, ketorolac, propofol, and naloxone. Table 2a-c, while there is considerable overlap between the most important features across the three models, there are also distinct differences. Notable differences include the high importance of dexamethasone in the hospitalized model, which decreased the likelihood of being labeled a potential long-COVID patient. Dexamethasone is not present in the top 50 features of the non-hospitalized model. Similarly, cough and dyssomnia, which increased the likelihood of being labeled a potential long-COVID patient, are important features in the non-hospitalized model, but do not appear in the hospitalized model. COVID-19 vaccination after acute disease, consistently an important feature in all three models, decreased the likelihood of patients being labeled potential long-COVID. This result is noteworthy, and indicates that not only does vaccination against COVID-19 protect against hospitalization and death, but that it may also protect against long-COVID.

As shown in Figures 3 and 4 and Supplementary
Rates of outpatient and inpatient utilization are clearly important features in all three models. This can be interpreted in a number of ways-patients who continue to feel unwell long after acute COVID-19 may be more likely to visit their providers repeatedly than those patients who . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 22, 2021. ; fully recover. Because diagnosing and treating the heterogenous symptoms of long-COVID is an ongoing challenge, these patients may be referred to one or more specialists, increasing their utilization even more.
ML models do not consider each feature individually; rather, complex relationships between features can greatly influence classification. Each patient has their own path through the model, based on the data available about that individual. Figure 5 illustrates the path taken by three hypothetical patients through each of our three models, respectively. Information of this type is useful to make the outcomes of the ML models interpretable.

Figure 5. Example paths taken by the ML models to classify potential long-COVID patients. These
Force plots illustrate the contribution of individual features to the final predicted probability of long-COVID generated by each of the three models for individual synthetic patients. Features colored red increase the probability of a long-COVID prediction, while features colored blue decrease that probability. The width of the bar for a given feature is proportional to the impact that feature has on the prediction for that patient. The final predicted probability is shown in bold text. 6A shows the "all patients" model, 6B the hospitalized model, and 6C the non-hospitalized model.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 22, 2021 Use of EHR data for long-COVID phenotyping Though it contains rich clinical features, EHR data is also a proxy for healthcare utilization and can be interpreted through that lens. Diagnoses coded in the EHR are not representative of the whole patient, but rather are focused on the specific reasons the patient has come to the clinic or hospital on that day. Hospitalized and non-hospitalized patients are likely quite clinically different-but inpatient and outpatient coding practices are also different. Some of the differences between the hospitalized and non-hospitalized models may be an artifact of EHR data itself, and thus divergences between the two models are likely a result of both factors.
However, even as a proxy for utilization, EHR data is particularly well-suited to the task of cohort definition by way of computable phenotyping, especially when the end goal is study recruitment. Balancing inclusion and exclusion criteria is critically important to sufficiently define as homogeneous a population as possible, while still retaining key diversity and other recruitment goals (such as age distribution). While there are other methods of identifying potential study participants, a computable phenotype allows us to efficiently narrow the recruitment pool down from "everyone" to "patients likely to qualify"-easily eliminating large numbers of patients that do not qualify, and ascertaining patients that may elude human curation.
There are additional advantages to utilizing EHR data to identify long-COVID patients. With no single definition and no gold standard to compare with, the EHR allows us to define proxies for a condition and select on those-in this case, a patient's visit to a long-COVID specialty clinic. However, rather than settling for a very restrictive inclusion criterion of "must have at least one visit to a long-COVID speciality clinic," our ML models allow us to decouple patients' utilization patterns from the clinic visit, meaning that we can use the models to identify similar patients who may not have access to a long-COVID clinic. Moreover, institutions without a long-COVID specialty clinic will be able to make use of the model for cohort identification within their own data. While using a proxy all but ensures that not all of the patients identified by the model have long-COVID, the resulting utilization patterns allow us to make some educated assumptions. EHR data is skewed toward patients who make more use of health care systems, and is further skewed toward high utilizers, sicker patients, and inpatients. When we train models on N3C's EHR data, it is essential to acknowledge whose data is less likely to be represented--uninsured patients, patients with limited access to or ability to pay for care, or patients seeking care at small practices or community hospitals with limited data exchange capabilities. Thus, while we believe our models are highly useful for identification and recruitment of long-COVID study participants, they should be used as one of several complementary recruitment methods to enable the widest possible reach and the most representative participants in research.
We did not include race and ethnicity as model features, as we did not believe our three-site sample of long-COVID clinic patients to be appropriately representative. As more long-COVID patients and training data are available over time, we will have the ability to balance the cohort based on demographics and, critically, to carefully account for race and ethnicity in future iterations of the model. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 22, 2021. ; Beyond identifying cohorts for research studies, the models presented here can be applied in a variety of applications and could be enhanced in a number of ways. Specifically, it will undoubtedly be necessary to utilize a large sample size of long-COVID patients to validate hypotheses relating to social determinants of health and demographics, comorbidities, and treatment implications. It is also critical to understand the relationship between acute COVID-19 severity and specific long-COVID signs and symptoms and their longitudinal progression, as well as the influence of vaccination in such trajectories. The inclusion of additional data resources such as patient surveys, physiological data (e.g. sleep data), and wearables may further suggest EHR features that allow stratification to better inform mechanistic studies and therapy selection. Finally, it is plausible that long-COVID will not ultimately have a single definition, and may be better described as a set of related conditions with their own symptoms, trajectories, and treatments. Thus, as larger cohorts of long-COVID patients are established, future research should identify sub-phenotypes of long-COVID by clustering long-COVID patients with similar EHR data "fingerprints." Future iterations of our models could discern among these clusters given N3C's large sample size and recurring data feeds.
ncats.nih.gov/n3c/resources. Enclave data is protected, and can be accessed for COVID-related research with an approved (1) IRB protocol and (2) Data Use Request (DUR). A detailed accounting of data protections and access tiers is found in [1]. Enclave and data access instructions can be found at https://covid.cd2h.org/for-researchers; all code used to produce the analyses in this manuscript is available within the N3C Enclave to users with valid login credentials to support reproducibility.

Ethics and Regulatory
The N3C data transfer to NCATS is performed under a Johns Hopkins University Reliance Protocol # IRB00249128 or individual site agreements with NIH. Use of the N3C data for this study is authorized under the following IRB Protocols: . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 22, 2021. ; https://doi.org/10.1101/2021.10.18.21265168 doi: medRxiv preprint