Abstract
Since late 2019, the novel coronavirus SARS-CoV-2 has introduced a wide array of health challenges globally. In addition to a complex acute presentation that can affect multiple organ systems, increasing evidence points to long-term sequelae being common and impactful. The worldwide scientific community is forging ahead to characterize a wide range of outcomes associated with SARS-CoV-2 infection; however, the underlying assumptions in these studies have varied so widely that the resulting data are difficult to compare. Formal definitions are needed in order to design robust studies of Long COVID that consistently capture variation in long-term outcomes. Even the condition itself goes by three terms, most widely “Long COVID”, but also “post-acute COVID-19 syndrome (PACS)” or “post-acute sequelae of SARS-CoV-2 infection (PASC)”. In the present study, we investigate the definitions used in the literature published to date and compare them against data available from electronic health records and patient-reported information collected via surveys. Long COVID holds the potential to produce a second public health crisis on the heels of the pandemic itself. Proactive efforts to identify the characteristics of this heterogeneous condition are imperative for a rigorous scientific effort to investigate and mitigate this threat.
Introduction
SARS-CoV-2 emerged in late 2019 as the third human coronavirus identified in the 21st century. As of early 2021, new impacts of the virus are still being identified. The virus initially targets epithelial cells, endothelial cells, and alveolar macrophages (via ACE2 proteins and the TMPRSS2 protease), causing symptoms attributable to the lungs, digestive tract, kidneys, heart, brain, and other organs.1,2 Additional research has begun to explore viral presence in other tissues that exhibit ACE2 and TMPRSS2 expression; these include skeletal muscle, smooth muscle, bone, cartilage, and synovia.3–6 Collectively, these symptoms constitute coronavirus disease 2019 (COVID-19). Individual symptoms and disease severity vary widely among patients, with some patients developing mild or even asymptomatic infections, while others experience acute respiratory distress syndrome (ARDS), sepsis, and other life-threatening conditions.7,8
As more information about patient recovery has been collected, and as pathophysiologic mechanisms are revealed, a wide range of outcomes following acute COVID-19 have emerged. Some patients experience residual symptoms and others develop new symptoms long after the initial infection. These symptoms can present across a wide range of organ systems and tissues. Given the timeline of SARS-CoV-2’s emergence, studies to date have tracked patients’ clinical course up to six months post-infection,9–14 but anecdotal reports are available describing patients with ongoing symptoms as long as a year post-infection.15,16 Symptoms experienced after the acute illness represent a significant challenge for patients, physicians, and society as a whole. The causes, patient profile, and even symptom patterns associated with Long COVID remain difficult to isolate, and the natural history of this condition remains uncharacterized.
Post-Acute Sequelae after Other Infections
The fact that some COVID-19 patients experience symptoms following recovery from acute infection is not unexpected. Other infectious diseases, including Epstein-Barr Virus,17 Giardia lamblia, Coxiella burnetii, Borrelia burgdorferi (Lyme disease) and Ross River virus,18 are also associated with an increased risk for post-infectious sequelae. These sequelae include symptoms such as disabling fatigue, musculoskeletal pain, neurocognitive difficulties, and mood disturbance.17–19 Chronic fatigue syndrome (CFS) is frequently preceded by a viral infection.20 However, although these sequelae are well documented, they are still not well understood, and the molecular mechanisms underlying these post-acute presentations have yet to be elucidated.
Post-infectious sequelae have also been documented following infection by other coronaviruses. A subset of patients with severe acute respiratory syndrome (SARS), caused by the coronavirus SARS-CoV, and Middle East respiratory syndrome (MERS), caused by the coronavirus MERS-CoV, were observed to experience persistent or new-onset symptoms, including fatigue,21 following recovery from the acute infection.21–23 For SARS, follow-ups have been conducted up to 15 years post-infection. In addition to fatigue, studies reported effects on lung health and capacity,24–27 psychological health,21 bone health,27 and lipid metabolism,28 with the latter two attributed to treatments involving large doses of steroids.27,28 Most of the improvements among SARS patients occurred within the first one to two years following infection,27,29,30 although some patients continued to experience decreased quality of life for more than a decade following the acute illness.28 Though follow-up studies in MERS patients are sparser, effects on pulmonary function were observed at one year post-infection, with patients who experienced more severe cases at greater risk for long-term effects.31
Post-Acute Sequelae Following COVID-19
While post-acute sequelae are not an unexpected outcome of SARS-CoV-2 infection, the number of people affected and the range of symptoms associated with Long COVID are unprecedented. The multisystem nature of Long COVID compared to previously studied post-acute sequelae of human coronaviruses has raised questions about how to most effectively identify indicators of Long COVID. An analysis of 32 symptoms in patients with and without SARS-CoV-2 infection identified several symptoms that were enriched in patients with COVID-19 in comparison to other illnesses of comparable severity.32 After 30 days, loss of smell, loss of taste, memory loss, chest pain, and muscle weakness were the symptoms enriched in patients who were positive for SARS-CoV-2 at the time of their acute illness. The association between these symptoms and COVID-19 diagnosis fluctuated slightly at 60 and 90 days, with muscle weakness no longer associated at 60 days, difficulty concentrating emerging at 60 days, and confusion and bone or joint pain emerging at 90 days. Many of the symptoms most strongly associated with Long COVID are therefore distinct from those observed in post-acute SARS or MERS and may be challenging to identify based on research on other post-infectious sequelae.
Furthermore, regardless of whether they are unique to Long COVID, symptoms frequently reported by Long COVID patients are not assessed consistently across studies. A systematic review available as a preprint33 evaluated all research on Long COVID released prior to January 1, 2021, that included at least 100 patients; based on the 15 studies that met the inclusion criteria, the authors identified 55 symptoms of Long COVID. None of the most common symptoms were assessed by all 15 studies. They reported that the five most common symptoms evaluated in the literature were fatigue, headache, attention disorder, hair loss, and dyspnea. They also reported the frequency at which clinical measurements such as chest X-ray and biomarkers such as C-reactive protein and D-dimer were evaluated. The authors concluded that the symptoms of Long COVID are extremely heterogeneous, and that the assessment of these symptoms varies widely among studies.
However, Long COVID’s emergence has followed a different trajectory than that of most medical syndromes. Rather than building from a clinically determined framework of the illness, to date, much of the growing awareness of Long COVID and its symptoms has been driven by patient-led efforts.34,35 Observing residual or new symptoms months after experiencing COVID-19, patients have established online communities to provide support and identify similarities in their experiences.36 Some Long COVID patients who are also researchers have led efforts to systematically categorize the range of experiences associated with Long COVID. An extensive patient-led survey (Patient-Led Research Collaborative) performed deep longitudinal characterization of the Long COVID symptoms and trajectories in suspected and confirmed COVID-19 patients who reported illness lasting more than 28 days.13 Evaluating data from 3,762 respondents to 257 survey questions, this analysis documented 205 phenotypic features associated with Long COVID. The symptoms most frequently reported after six months were fatigue, post-exertional malaise, and cognitive dysfunction. Patients who reported symptoms lasting for longer than six months following acute infection experienced an average of 14 symptoms in month 7, and 86% of patients experienced relapses during the period assessed, with exercise, physical or mental activity, and stress reported as common triggers. The diversity of the symptoms reported by Long COVID patients underscores the urgent need to understand the natural history of COVID-19 following the initial infection in order to manage medical care of affected individuals.
Given Long COVID’s very recent emergence, no standard framework has yet been established for identifying and assessing associated symptoms or other clinical indicators. Most of the studies analyzed in the systematic review33 utilized a survey-based approach, meaning that they were able to analyze only symptoms identified a priori as concerns. These studies also varied in whether they included formerly hospitalized patients exclusively or a mixture of patients with mild, moderate, and severe cases of acute COVID-19. Long COVID can occur following either severe or relatively mild acute illness,37 and it has been suggested that the severity of acute illness affects the clinical course of Long COVID,32 as it does in SARS and MERS.38 Additionally, patients who are treated in the intensive care unit (ICU) would be assumed to be particularly likely to experience ongoing health challenges due to the well-documented occurrence of post-intensive care syndrome (PICS).39 Several different frameworks have been proposed to describe Long COVID cases, without any clear criteria emerging about how to define the condition or how to stratify patients. This ambiguity presents a concern as more and more data are collected: as of the end of 2020, at least 239 papers and preprints about the post-acute effects of COVID-19 had been released, and approximately 20 additional papers become available each month.40 These papers do not conform to a single definition of Long COVID and do not evaluate consistent symptoms or markers of the disorder (or constituent disorders). In addition, the differences between the common symptoms identified in a systematic review of the literature33 and those identified in the patient-led assessment13 indicate that current research on Long COVID may fail to address the full diversity of, and even the most significant, symptoms identified by patients with lived experience of Long COVID.
Additionally, because proactive self-report is an important component of the patient-led research collaborative, symptoms and experiences of persons with low access to and uptake of technology may be under-represented in Long COVID studies thus far.
In order to develop clinical management strategies to prevent or mitigate Long COVID, it will be essential for studies to use a unified definition of Long COVID and its subforms so that data from different studies can be integrated to provide the foundation for robust statistical inferences about risk factors for the development of Long COVID, as well as the natural history and response to treatments. Additionally, it is essential that survey-based research efforts to investigate Long COVID operate from a framework that addresses the symptoms most common among and most debilitating to Long COVID patients. A rigorous framework for evaluating Long COVID will also help to elucidate the organ systems involved in the disease and its subforms; such a framework could help to distinguish, for example, pulmonary versus cardiovascular syndromes and to determine whether these are interrelated. In this analysis, we present methodologies, findings, and perspectives related to the extraction of data from the literature, from an extensive patient survey, and from the NCATS N3C Data Enclave (covid.cd2h.org/enclave) to provide guidance towards defining and identifying symptoms and patient variables that must be considered while designing and developing studies of Long COVID. Given that Long COVID is poised to produce an additional public health crisis on top of the COVID-19 pandemic,36 rapid harmonization of existing data and the integration of this information into new efforts to characterize Long COVID will be the critical next steps in responding to this looming threat.
Methods
Literature Review
In order to explore how Long COVID is currently being characterized and reported, we conducted an exploratory landscaping review of the literature. The results of this search will inform a future, more systematic review of this topic. In addition to searching PubMed (MEDLINE), we included searches of specialized databases (e.g., CoronaCentral; WHO Global Literature on Coronavirus Disease) and relied on expert-recommended key articles, with snowball techniques to find similar studies. Both published articles and preprints were included for abstraction. The questions we explored in the review were: for observational studies of Long COVID, how are studies characterizing Long COVID, and what outcomes are reported and/or associated with this syndrome? In addition, we explored whether any COVID-specific measures or tests have been developed or validated, whether any patient subgroups or medical specialties report unique signs or symptoms, and what patient-reported and patient-centered outcomes were reported (Supplemental Table 1). While specific inclusion and exclusion criteria were not developed, we did exclude papers discussing only rehabilitation therapy, mortality, or hospitalization, as these outcomes were not specific enough to Long COVID.
The newly emergent nature of Long COVID and the lack of an agreed definition complicate traditional search methods. Each study abstracted was analyzed to identify the relationship of participant recruitment in the study to formal definitions of Long COVID that have been proposed. This analysis required evaluating how the duration of long-term symptoms was defined relative to the acute illness and whether patients were selected or stratified based on variables related to clinical course. Based on the definitions proposed at the time of analysis, the variables considered were: definition of onset of disease course (e.g., diagnosis, positive test, hospitalization), time elapsed since onset (as defined in each manuscript), patient-reported symptoms or clinical measures assessed, and tests or measurements reported or developed.
Formal Definitions Used for Comparison
Long COVID can be broadly defined as delayed recovery from an episode of COVID-19 and is characterized by lasting effects of the infection, e.g., persistence of symptoms or onset of new chronic diseases, for far longer than would be expected.41 Although no firm criteria have been established to define the post-acute period or sub-categories within Long COVID, several sets of guidelines have been proposed for the classification of COVID-19-related disease phenotypes, and these criteria were compared to the definitions used in the literature. For example, a recently proposed public health framework classifies SARS-CoV-2-related disease into three categories.42 The first is acute COVID-19, or the disease most commonly associated with acute SARS-CoV-2 infection. The second category includes Multisystem Inflammatory Syndrome in Children (MIS-C) and in adults (MIS-A), a less common presentation of SARS-CoV-2 infection characterized by hyperinflammation that can appear 4-6 weeks after viral infection.43 The third category describes late sequelae.42 In terms of defining study cohorts, adherence with this definition would therefore require a clinical diagnosis, rather than a SARS-CoV-2 test alone, in order to distinguish MIS-C/A and COVID-19.
Other frameworks break down the “late sequelae” category into subtypes depending on either timing or disease natural history. For example, the United Kingdom’s National Institute for Health and Care Excellence’s guideline on long COVID provides two definitions of post-acute COVID-19: (1) ongoing symptomatic COVID-19 for people who still have symptoms between 4 and 12 weeks after the start of acute symptoms; and (2) post-COVID-19 syndrome for people who still have symptoms for more than 12 weeks after the start of acute symptoms.44 Similarly, PACS has been defined operationally as extending beyond three weeks from the onset of first symptoms, and the term chronic COVID-19 has been proposed to refer to PACS cases where symptoms extend beyond 12 weeks;37 these PACS definitions are consistent with the virological data available thus far.45
However, other criteria recommend defining the post-acute period as starting once a patient is discharged from inpatient acute care for those hospitalized longer than three weeks.45 Some authors go further and subdivide Long COVID into three groups:
1. patients who have experienced severe COVID with ARDS and experience long-term respiratory symptoms dominated by breathlessness;
2. individuals with milder initial disease who were not necessarily hospitalized during the acute infection but who present with a multisystem disease with cardiac, respiratory, or neurological manifestations of end-organ damage; and
3. people who have persistent fatigue and other symptoms but with no evidence of organ damage.46
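The time-based definitions described earlier in this section can be contrasted with a short sketch. It encodes only the week thresholds quoted above for the NICE guideline44 and for the PACS/chronic COVID-19 split;37 the function names, the treatment of the exact boundary values, and the use of `None` for time points before any post-acute definition applies are assumptions made for illustration. The discharge-based criterion and the three clinical subgroups are not modeled.

```python
from typing import Optional


def nice_category(weeks_since_onset: float) -> Optional[str]:
    """Label a time point using the two NICE post-acute definitions quoted above."""
    if weeks_since_onset < 0:
        raise ValueError("weeks_since_onset must be non-negative")
    if weeks_since_onset < 4:
        return None  # earlier than either post-acute definition (boundary is an assumption)
    if weeks_since_onset <= 12:
        return "ongoing symptomatic COVID-19"
    return "post-COVID-19 syndrome"


def pacs_category(weeks_since_onset: float) -> Optional[str]:
    """Label a time point using the PACS / chronic COVID-19 split quoted above."""
    if weeks_since_onset < 0:
        raise ValueError("weeks_since_onset must be non-negative")
    if weeks_since_onset <= 3:
        return None  # not yet beyond three weeks from symptom onset
    if weeks_since_onset <= 12:
        return "PACS"
    return "chronic COVID-19"


# Hypothetical follow-up times, in weeks since the start of acute symptoms.
for weeks in (2, 3.5, 8, 16):
    print(weeks, nice_category(weeks), pacs_category(weeks))
```

Under these assumptions, a patient assessed 3.5 weeks after symptom onset already counts as PACS but falls outside both NICE categories, which illustrates how the choice of framework can change cohort membership.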
Terminological Extraction from the Literature
We reviewed patient-reported symptoms reported in the literature (or caregiver-reported symptoms in the case of one pediatric study47) and created a table row for each symptom in each publication. We then used a Python script to extract symptoms; symptoms remained exactly as described in the manuscripts except for adjustments to capitalization, punctuation, plurals (e.g., headache versus headaches), and spelling in British versus American English (e.g., dyspnoea versus dyspnea), and for the standardization of labels assigned to specific measures. The identifiers for specific assessments used were as follows: FLU-PRO for InFLUenza Patient-Reported Outcome,48 EQ-5D-5L for the 5-level EQ-5D,49 EQ VAS for the EQ visual analogue scale,49 and mMRC Dyspnea Scale Scores for Modified Medical Research Council Dyspnea Scale Scores.50 We tabulated the relationships between publications and the symptoms they reported; we then manually mapped symptoms to one or more body systems and visualized the result using a Sankey diagram (Figure 1).
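The kind of label normalization described here can be sketched as follows. This is not the script used in the study: the spelling map, the naive plural rule, and all function names are hypothetical, and only the dyspnoea/dyspnea and headache/headaches examples come from the text.

```python
import re
from collections import defaultdict

# Hypothetical British-to-American spelling map; only dyspnoea/dyspnea is named in the text.
SPELLING = {"dyspnoea": "dyspnea", "diarrhoea": "diarrhea", "anaemia": "anemia"}


def singularize(word: str) -> str:
    """Naive plural handling: drop a trailing 's' unless the ending suggests a singular."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def normalize_symptom(label: str) -> str:
    """Adjust capitalization, punctuation, spelling variant, and plural of the final word."""
    text = label.strip().lower()
    text = re.sub(r"[^\w\s/-]", "", text)     # drop punctuation, keep '/' and '-'
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    words = [SPELLING.get(w, w) for w in text.split(" ")]
    words[-1] = singularize(words[-1])
    return " ".join(words)


def tabulate(rows):
    """Map each normalized symptom to the set of publications that reported it."""
    table = defaultdict(set)
    for publication, label in rows:
        table[normalize_symptom(label)].add(publication)
    return dict(table)


# Hypothetical (publication, symptom) rows.
rows = [("study_a", "Headaches"), ("study_b", "headache"), ("study_b", " Dyspnoea. ")]
print(tabulate(rows))
```

A rule this simple will mishandle some labels, so in practice the output would still need manual review before mapping symptoms to body systems.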
Ontological Coding of Literature and a Patient Survey
The Human Phenotype Ontology (HPO) provides a standardized vocabulary of over 15,000 terms to describe phenotypic abnormalities observed in human disease.51 In our review of the literature, we identified studies that also contained a description of the counts of affected individuals who displayed specific phenotypic features. We manually curated the mappings between literature-reported signs and symptoms and HPO terms. Overall, 141 unique symptoms were identified, of which 80 terms were curated from the originally extracted literature terms and 112 terms were captured from both the patient-led survey questions and answers.13,52 These are available in Supplemental Table 2.
Cohort Selection
We performed analysis of electronic health record (EHR) data in the N3C Secure Data Enclave (covid.cd2h.org/enclave) with the intention of identifying unique healthcare utilization patterns among COVID-positive patients that may differentiate them as Long COVID patients. To achieve this, we looked for patterns found only in COVID-positive patients compared to COVID-negative controls. Some patients were expected to be COVID-positive but non-Long COVID, so this analysis was expected to distinguish at least three categories: COVID-positive and Long COVID, COVID-positive and non-Long COVID, and COVID-negative.
We define COVID-positive as any non-deceased patient in the N3C enclave with an ICD-10-CM diagnosis code for COVID (U07.1) or a positive PCR, antibody, or antigen test for COVID (n = 905,592). We define COVID-negative as any non-deceased patient in the N3C enclave with at least one negative PCR, antibody, or antigen test for COVID who is not also in the positive group (n = 2,473,206). We then further narrow the set of patients whose data is used for analysis in the following ways:
We require all patients to have at least one year of history with their contributing health care system.
For COVID-positive patients, we require that at least 90 days have passed since their COVID index date (minimum date of diagnosis or positive test).
Applying these restrictions resulted in a case (positive) cohort of 314,237 patients, and a control (negative) cohort of 1,917,935 patients.
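The two eligibility restrictions can be expressed as a simple filter. The sketch below is illustrative only: the `Patient` fields, the `as_of` data-extraction date, and the choice to measure history from the first record to `as_of` are assumptions, since the text does not specify how these quantities were computed in the Enclave.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Patient:
    patient_id: str
    first_record_date: date      # earliest record at the contributing site (hypothetical field)
    index_date: Optional[date]   # COVID index date; None for COVID-negative patients
    is_deceased: bool = False


def is_eligible(p: Patient, as_of: date,
                min_history_days: int = 365, min_followup_days: int = 90) -> bool:
    """Apply the history and follow-up restrictions described above."""
    if p.is_deceased:
        return False
    if (as_of - p.first_record_date).days < min_history_days:
        return False
    if p.index_date is not None and (as_of - p.index_date).days < min_followup_days:
        return False
    return True


# Hypothetical patients and extraction date.
as_of = date(2021, 3, 1)
patients = [
    Patient("p1", date(2018, 5, 1), date(2020, 10, 1)),  # long history, >90 days of follow-up
    Patient("p2", date(2018, 5, 1), date(2021, 1, 15)),  # index date too recent
    Patient("p3", date(2020, 9, 1), None),               # under one year of history
]
print([p.patient_id for p in patients if is_eligible(p, as_of)])
```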
We then employed the R package MatchIt53 to perform nearest-neighbor propensity matching on the positive and negative patients, at a ratio of 2:1 (control:case). The following factors were used in matching: age, sex, race, site (exact match required), and comorbid conditions (diabetes, chronic kidney disease, congestive heart failure, peripheral vascular disease, chronic pulmonary conditions). A patient was defined as having a comorbid condition if they had two or more ICD-10-CM codes equating to that condition in their EHR data. Two sites (representing 14,222 cases) were removed from the matching process due to a significant amount of missing data required for matching. Additionally, 2,311 cases were dropped because they were not able to be matched with two controls at their same site. This resulted in a final case set of 297,704 patients, and a final control set of 595,408 patients. The case set was further split into two groups: cases who were hospitalized for COVID (n = 51,903) and patients not hospitalized for COVID (n = 245,801).
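To make the matching step concrete, the sketch below performs a simplified greedy 2:1 nearest-neighbor match on a precomputed propensity score, with exact matching on site and without replacement. It does not call MatchIt and is not a reimplementation of it; the tuple layout, the processing order, and the tie-breaking are assumptions. As in the text, cases that cannot be given two controls at their own site are dropped.

```python
def match_controls(cases, controls, ratio=2):
    """Greedy nearest-neighbor matching within site on a propensity score.

    cases and controls are lists of (patient_id, site, propensity_score) tuples.
    Returns (matches, dropped): matches maps case id -> list of control ids.
    """
    pool = {}
    for control_id, site, score in controls:
        pool.setdefault(site, []).append((score, control_id))

    matches, dropped = {}, []
    for case_id, site, score in cases:
        available = pool.get(site, [])
        if len(available) < ratio:
            dropped.append(case_id)  # not enough controls left at this site
            continue
        available.sort(key=lambda c: abs(c[0] - score))  # closest scores first
        matches[case_id] = [control_id for _, control_id in available[:ratio]]
        pool[site] = available[ratio:]                   # matched controls are not reused
    return matches, dropped


# Hypothetical cases and controls with made-up propensity scores.
cases = [("c1", "A", 0.30), ("c2", "A", 0.70), ("c3", "B", 0.50)]
controls = [("n1", "A", 0.28), ("n2", "A", 0.35), ("n3", "A", 0.72),
            ("n4", "A", 0.60), ("n5", "B", 0.50)]
matches, dropped = match_controls(cases, controls)
print(matches, dropped)
```

The counts reported above are internally consistent with this procedure: 314,237 − 14,222 − 2,311 = 297,704 cases, and 2 × 297,704 = 595,408 controls.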
We opted to model COVID-related healthcare utilization patterns among the cases and controls by counting occurrences of COVID- and Long COVID-related diagnoses (see “Long COVID Concept Sets,” below) for each patient before and after their COVID index date. (Controls were assigned their matched case’s index date.) Diagnosis occurrences were counted across an equal time period before and after the patient’s COVID index date, based on how many days had passed since the index date. We ignored diagnoses occurring in a “buffer” period of 60 days before and after the COVID diagnosis, to attempt to differentiate “post-COVID” from active COVID.
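The counting scheme can be sketched as below. The window length is taken as the number of days between the index date and a hypothetical data-extraction date `as_of`, and the same span is applied before the index date. Whether day 60 itself belongs to the buffer is an assumption, as are the function and variable names.

```python
from collections import Counter
from datetime import date

BUFFER_DAYS = 60  # diagnoses within this many days of the index date are ignored


def count_pre_post(diagnoses, index_date, as_of):
    """Count diagnoses per concept set before and after the index date.

    diagnoses is an iterable of (concept_set_name, diagnosis_date) pairs.
    """
    window = (as_of - index_date).days  # follow-up available; mirrored before the index date
    pre, post = Counter(), Counter()
    for name, dx_date in diagnoses:
        offset = (dx_date - index_date).days
        if abs(offset) <= BUFFER_DAYS or abs(offset) > window:
            continue  # inside the buffer or outside the symmetric window
        if offset > 0:
            post[name] += 1
        else:
            pre[name] += 1
    return pre, post


# Hypothetical patient record.
index_date, as_of = date(2020, 6, 1), date(2020, 12, 1)
diagnoses = [
    ("fatigue", date(2020, 6, 20)),   # inside the buffer, ignored
    ("fatigue", date(2020, 9, 15)),   # counted after the index date
    ("fatigue", date(2020, 10, 1)),   # counted after the index date
    ("dyspnea", date(2020, 3, 1)),    # counted before the index date
    ("dyspnea", date(2019, 6, 1)),    # earlier than the mirrored window, ignored
]
pre, post = count_pre_post(diagnoses, index_date, as_of)
print(dict(pre), dict(post))
```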
After representing the data as a matrix of pre- and post-diagnosis conditions, we applied nonnegative matrix factorization in order to extract conserved co-occurring sets of diagnoses that best represent the cohort. The result of this step is a data-driven representation of which sets of diagnoses occur together. We then compared the change in frequency of these diagnoses before and after COVID to identify potential signatures of Long COVID.
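For readers unfamiliar with nonnegative matrix factorization, the sketch below factors a small patient-by-diagnosis count matrix V into nonnegative factors W (patients by components) and H (components by diagnoses) so that V ≈ WH. The text does not state which algorithm or software was used; the classic multiplicative-update rule for the squared-error objective is assumed here, and the toy matrix, rank, and iteration count are illustrative.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Sum of squared differences between V and the product WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V into W and H using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)] for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)] for i in range(n)]
    return w, h


# Toy matrix: rows are patients, columns are diagnosis counts (hypothetical values).
v = [[2, 2, 0, 0], [2, 2, 0, 0], [0, 0, 3, 3], [0, 0, 3, 3]]
w, h = nmf(v, rank=2)
print(round(frobenius_error(v, w, h), 6))
```

Each row of H can be read as a set of diagnoses that tend to occur together, and each row of W gives a patient's loading on those sets.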
Long COVID Concept Sets
Concept sets were obtained by mapping a subset of the manually curated HPO concepts to OMOP concept identifiers within the Conditions domain (Supplemental Table 3). These mappings were obtained using OMOP2OBO.54 OMOP2OBO is an algorithmic framework designed to generate clinically meaningful mappings between Open Biomedical Ontologies (OBO) and standard clinical terminologies in the OMOP common data model. Using version 1.0.0 of the mappings, each of the HPO concepts was processed and all reasonable matches were returned. HPO concepts that could not be mapped using OMOP2OBO were manually mapped using Version 1.12.0.6.210309.1608 of the Athena - OHDSI Vocabulary Repository,55 which at the time of mapping was populated with OMOP Vocabulary version: v5.0 26-FEB-21. All manual mappings were discussed with one or more professional ontologists and/or clinical phenotyping experts. Upon ingestion into the N3C Enclave, HPO concept sets were extended to include all descendant concepts for each included OMOP concept identifier. Each of the completed concept sets received an additional round of review by clinical domain experts within the Enclave prior to use in classifying the cohorts as described above.
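The descendant-expansion step amounts to a traversal of the concept hierarchy. In the sketch below, a hypothetical in-memory parent-to-children dictionary stands in for the vocabulary's hierarchy; the concept identifiers are invented, and the Enclave's actual expansion mechanism is not described in the text.

```python
from collections import deque


def expand_with_descendants(concept_ids, children):
    """Return the given concept ids together with all of their descendants.

    children maps a parent concept id to an iterable of its direct child ids.
    """
    expanded = set(concept_ids)
    queue = deque(concept_ids)
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in expanded:  # guards against repeated visits in shared branches
                expanded.add(child)
                queue.append(child)
    return expanded


# Hypothetical hierarchy: 100 -> {101, 102}, 101 -> {103}, 200 -> {201}.
children = {100: [101, 102], 101: [103], 200: [201]}
print(sorted(expand_with_descendants([100], children)))
```

Because descendants are included, a concept set seeded with a handful of identifiers can grow substantially, which is consistent with the large expanded totals reported in the Results.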
Results
Literature Review
The analysis of 39 studies revealed that a variety of criteria were used to identify and evaluate patients with post-acute COVID-19 sequelae. With nearly as many definitions as studies, it is clear that there is no agreement on the definition of Long COVID (Figure 1). Studies differed in how they referred to the phenomenon studied. Some referred to it as Long COVID or used a similar term such as post-acute COVID-19 syndrome, whereas others discussed the clinical course or patient recovery without mentioning Long COVID specifically. These definitions fell roughly into four categories (Table 1). Most studies refer to their patient recruitment in terms of recovery (e.g., “COVID-19 survivors”56 or “discharged COVID-19 patients”57) or clinical course (e.g., “medium- and long-term consequences”58 or “delayed return to usual health”59). A number of studies did refer to their participant groups using terms like “Long COVID”,13,60–68 “post-acute COVID-19”,69,70 “post-COVID syndrome”,71 or “post-acute COVID-19 syndrome”,72 but these terms were not standardized among studies. A few studies46,60 acknowledged the proposed distinction at 12 weeks post-infection between post-acute COVID-19 and chronic COVID-19,37 but otherwise the definitions used typically did not refer to any proposed operationalizations of Long COVID. Therefore, while operational definitions of the constituent components of Long COVID have been proposed,37,42,44 reviewing the Long COVID literature revealed that they are rarely used when describing cases or identifying study cohorts.
Moreover, the existing operational definitions of Long COVID differ in important ways, many of which are not differentiated by existing studies. For example, one framework46 subdivides Long COVID patients into three groups based on whether their long-term symptoms are primarily respiratory in nature following severe COVID-19 with ARDS, whether they present with a multisystem disease with cardiac, respiratory, and/or neurological manifestations of end-organ damage, or whether their primary symptoms are persistent fatigue and other symptoms that do not necessarily indicate organ damage.46,89 In the literature analyzed, this definition was never used to define cohorts. Many studies included patients with acute infections that varied in their severity, including both inpatient and outpatient convalescents (e.g., 82,85). Additionally, of the studies available thus far, data directly assessing organ damage is rarely collected, and the concept of organ damage itself has not been operationalized in this context.
Other efforts to define Long COVID identify the severity of the acute phase as an important consideration in determining the onset of the post-acute phase. Specifically, for individuals hospitalized for more than three weeks following symptom onset, some definitions identify the post-acute period as starting once the patient is discharged from inpatient acute care.45 In the literature surveyed, most studies recruited and assayed patients based on time elapsed from a COVID-19-related milestone, but the milestone itself varied widely. Some studies used the date of diagnosis or positive test, others the onset of symptoms, others hospital discharge, and others even broader criteria (e.g., patients with suspected or confirmed COVID-19 in the past). Many studies used a relatively precise window for patient assessment (e.g., 30 to 45 days after diagnosis65 or 14 to 21 days after symptom onset59), while others included participants at various distances from acute SARS-CoV-2 infection under the umbrella term of Long COVID.46 In the latter case, these patients could fall under either the Long COVID (PACS) or chronic COVID-19 definitions if using a 12-week cutoff.37,44 Because the relationship between infection, symptoms, and viral clearance occupies a wide distribution,62,90,91 this heterogeneity among and sometimes within studies could introduce significant variability in disease course within and among patient cohorts.
Finally, studies varied widely in the terminology used to describe patient-reported symptoms. Comparing symptoms described across the literature reviewed revealed 142 unique terms related to symptoms, including scales used to assess symptom profiles (e.g., the University of California San Diego Shortness of Breath Questionnaire) or other dimensions of recovery (e.g., 5-level EuroQoL 5-Dimensions for quality of life) (Figure 1). The most commonly evaluated symptoms were fatigue (15 studies), dyspnea (11 studies), chest pain (11 studies), and headache (8 studies). In many cases, studies assessed similar symptoms but differed in the nomenclature used. For example, the studies analyzed included a mixture of reports of ageusia,32,52,79 anosmia,32,52,79,82 anosmia/ageusia,76 loss of smell,59,68 loss of taste,59 loss of smell and taste,66 loss of smell or taste,65 and loss of smell and/or taste.77 While in many cases there are parallels among studies (e.g., studies reporting anosmia and loss of smell are likely to be asking the same or similar questions of patients), the lack of a strict definition prevents straightforward symptom matching across analyses. Further, neurological and systemic symptoms appeared to be surveyed only to a limited extent in some cases, hence the absence of common symptoms like cognitive dysfunction or “brain fog”, sensorimotor symptoms, and post-exertional malaise. This is where standard use of a full terminology such as HPO would be useful to create expressive and consistent meaning across studies.
Therefore, the literature indicates that at present there is little consistency among studies in definitions of Long COVID, including the symptoms analyzed. Few studies use terminology with a proposed, narrow-scope definition such as the definitions of PACS and chronic COVID proposed by Greenhalgh and colleagues.37 Instead, studies typically define a period of time in which to investigate symptoms, without regard to how this choice fits into the broader conversation on the disease. The exception is studies that state they are investigating Long COVID, which use a wide variety of definitions. The same is true for post-acute COVID-19 or post-COVID syndrome, which are typically not explicitly tied to working definitions or explicit disease phenotypes. Among studies, patient inclusion criteria can be based on any number of relevant milestones from the acute phase, and only a subset of studies separate patients based on the severity of disease they experienced in the acute phase. Finally, no standardized terminology is used for patient-reported symptoms, and studies often report symptoms using similar but non-identical terminology. Thus, the literature analysis suggests significant heterogeneity among studies with respect to how they define cohorts of interest and analyze the experiences of patients experiencing Long COVID.
Ontological Analysis of Literature
From the results of the above exploratory review, candidates were selected to comprise a cohort of studies for further abstraction and analysis. Because of the poor reporting by and heterogeneity within our initial set of literature, the number of studies in this cohort is much smaller than the set of studies summarized above. This highlights the need for improved quality and reporting of even small cohorts. Details such as the specific definition used to identify patients with COVID-19 (e.g., a PCR test versus a clinical diagnosis), hospitalization status (outpatient versus inpatient versus ICU), the severity of illness represented among patients in the cohort, and the number of patients presenting with each symptom or other clinical measure are important to efforts to compare results across studies.
Here, 21 studies, including 20 published studies14,47,52,56,58,59,63,66,72,73,75–77,79,92–97 and one preprint,13 were chosen for the in-depth phenotypic analysis using HPO. A total of 154 different phenotypic abnormalities could be encoded using HPO terms. Table 3 provides an overview of the most commonly observed abnormalities in four major categories, and Supplemental Table 2 contains information about all 154 terms. The studies investigated and reported the phenotypic features in a very heterogeneous fashion. Only one abnormality, dyspnea (shortness of breath), was reported in every study. 95 terms were reported on only a single study.
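The kind of tally described above (which terms appear in every study, and which appear in only one) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in this study: the study labels and term lists below are hypothetical toy data, and `term_coverage` is a name introduced here for illustration.

```python
from collections import Counter

# Hypothetical toy input: study label -> set of phenotype terms it reported.
# These are illustrative values, not the 21 studies analyzed in the paper.
studies = {
    "study_A": {"dyspnea", "fatigue", "anosmia"},
    "study_B": {"dyspnea", "fatigue"},
    "study_C": {"dyspnea", "chest pain"},
}

def term_coverage(studies):
    """Count how many studies report each term.

    Returns the per-term counts, the terms reported by every study,
    and the terms reported by exactly one study.
    """
    counts = Counter(term for terms in studies.values() for term in terms)
    n_studies = len(studies)
    universal = sorted(t for t, c in counts.items() if c == n_studies)
    singletons = sorted(t for t, c in counts.items() if c == 1)
    return counts, universal, singletons

counts, universal, singletons = term_coverage(studies)
print("reported in every study:", universal)
print("reported in a single study:", singletons)
```

Storing each study's terms as a set means a term mentioned several times within one study is still counted once for that study.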
EHR Analysis
Transforming the HPO codesets
For the EHR analysis, we focused on 77 HPO annotations commonly used in the literature. Of these, 76 were successfully mapped to at least 1 OMOP concept identifier within the Condition domain (min=1, max=84, median=3). The unmapped HPO concept, increased circulating brain natriuretic peptide concentration (HP:0033534), could not be reasonably aligned to an OMOP concept identifier within the Condition domain. When expanding each OMOP concept identifier to include its descendant concepts, the total number of OMOP concepts used was 7,542 (4,694 unique) and the median number of OMOP concept identifiers mapped to each HPO codeset was 16. The largest HPO codesets were paresthesia (HP:0003401; n=1,606 concepts), pain (HP:0012531; n=1,399 concepts), skin rash (HP:0000988; n=505 concepts), and anxiety (HP:0000739; n=355 concepts), which was not unexpected given the variability in the clinical presentation (e.g., severity, duration, and location) of the conditions associated with these concepts.
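The expansion step described above, and the distinction between the total and the unique number of concepts, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the integer concept identifiers are invented, the `descendants` lookup stands in for a pre-flattened descendant table (for example, one derived from the OMOP vocabulary's ancestor/descendant relationships), and `expand_codeset` and `summarize` are hypothetical helper names rather than part of the N3C pipeline.

```python
from statistics import median

# Hypothetical toy mapping: HPO term -> directly mapped OMOP concept IDs.
# The integer IDs are invented for illustration only.
hpo_to_omop = {
    "HP:0003401": {101, 102},  # paresthesia
    "HP:0012531": {201},       # pain
    "HP:0000739": {301, 102},  # anxiety
}

# Hypothetical lookup: concept ID -> all of its descendant concept IDs
# (assumed to be already flattened across every level of the hierarchy).
descendants = {
    101: {111, 112},
    102: {121},
    201: {211, 212, 213},
    301: set(),
}

def expand_codeset(concept_ids, descendants):
    """Return the concept IDs plus all of their descendants."""
    expanded = set(concept_ids)
    for cid in concept_ids:
        expanded |= descendants.get(cid, set())
    return expanded

def summarize(hpo_to_omop, descendants):
    """Expand every HPO codeset and report total, unique, and median sizes."""
    codesets = {hpo: expand_codeset(ids, descendants)
                for hpo, ids in hpo_to_omop.items()}
    sizes = [len(s) for s in codesets.values()]
    total = sum(sizes)                                # shared concepts counted once per codeset
    unique = len(set().union(*codesets.values()))     # each concept counted once overall
    return codesets, total, unique, median(sizes)

codesets, total, unique, median_size = summarize(hpo_to_omop, descendants)
print(total, unique, median_size)
```

The total exceeds the unique count whenever one OMOP concept falls under more than one HPO codeset, which is the same reason the two figures differ in the paragraph above.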
Defining EHR phenotypes, including/excluding HPO codesets
The 297,404 patients in the final case group represent the pool of patients from which we have the potential to detect Long COVID (Figure 3). Of these patients, 85,912 had at least one instance of the identified HPO codes in their post-COVID period (and thus may be more likely to have Long COVID). Slightly more than half of these patients showed an increase in HPO codes after their diagnosis, with the largest shifts observed in hospitalized patients. Reduced-dimension representation of the data suggested groups of HPO codes related to pain, anxiety/depression, and respiratory ailments. Further analysis will be required to determine which clusters of HPO codes are potentially indicative of Long COVID, allowing us to further stratify patients.
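The counts reported above imply that roughly 29% of the case group (85,912 of 297,404) had at least one of the identified HPO codes in the post-COVID period. The pre/post comparison itself can be sketched as follows. This is a minimal illustration with hypothetical patient records; the `PatientCounts` fields and the `summarize_shift` helper are assumptions introduced here and do not reflect the actual N3C data structures or analysis code.

```python
from dataclasses import dataclass

@dataclass
class PatientCounts:
    """Hypothetical per-patient summary of HPO-mapped condition codes."""
    patient_id: str
    pre_codes: int      # codes recorded before the COVID-19 diagnosis
    post_codes: int     # codes recorded in the post-COVID period
    hospitalized: bool

def summarize_shift(patients):
    """Count patients with any post-COVID code and those whose count rose."""
    with_post = [p for p in patients if p.post_codes >= 1]
    increased = [p for p in with_post if p.post_codes > p.pre_codes]
    share = len(increased) / len(with_post) if with_post else 0.0
    return {
        "n_cases": len(patients),
        "n_with_post_code": len(with_post),
        "n_increased": len(increased),
        "share_increased": share,
    }

# Invented toy records for illustration only.
patients = [
    PatientCounts("p1", pre_codes=0, post_codes=2, hospitalized=True),
    PatientCounts("p2", pre_codes=3, post_codes=1, hospitalized=False),
    PatientCounts("p3", pre_codes=1, post_codes=0, hospitalized=False),
    PatientCounts("p4", pre_codes=1, post_codes=4, hospitalized=True),
]

summary = summarize_shift(patients)
print(summary)

# Share of the reported case group with at least one post-COVID HPO code.
share_with_code = 85_912 / 297_404
print(round(share_with_code, 3))
```

A stratified version would apply the same function separately to hospitalized and non-hospitalized patients.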
Discussion
The analyses described above demonstrate the heterogeneity present in the literature, both in the symptoms associated with Long COVID and in the assessments and definitions used to study it. They also demonstrate an EHR-based approach for identifying natural language data associated with potential Long COVID patients available in N3C.
Sources of Variance in Defining Long COVID
The literature review revealed a wide variety of terms used in describing patient cohorts used for studies of symptoms occurring after the acute phase of COVID-19. Most studies do not seek to assign their patients to a particular diagnosis or operational definition, although several referred to the definition from Greenhalgh et al. (2020)37, which is consistent with the virological data available thus far.45 There are a number of dimensions in which the existing literature varies in efforts to operationalize definitions of Long COVID. These differences are expected to vary in their effects. An important goal in the next phase of Long COVID research needs to be identifying the most critical considerations in defining patient cohorts.
Ambiguity in Defining the Acute Infectious Period
Long COVID is typically defined based on an elapsed acute infectious period, but at present, the timing of COVID-19 symptoms relative to SARS-CoV-2 infection is not well understood.98 One early study examining viral load in hospitalized patients reported that viral shedding continued for at least 28 days following symptom onset in some patients.99 Another study reported that the median period between a patient’s first positive PCR test and cessation of viral shedding was 17 days and that up to 70% of patients were still symptomatic when their viral shedding ceased.90 However, viral shedding (i.e., the presence of detectable SARS-CoV-2 virus in samples such as nasopharyngeal swabs) does not necessarily indicate the presence of replication-competent viral particles. Viable viral particles have been detected from 6 days prior to up to 9 days after symptom onset.100–102 Patients have also been observed to test positive by PCR following a negative test,103–105 but the virus could not be cultured. Both asymptomatic and symptomatic patients with retest-positive COVID-19 have been identified.103 Even in individuals whose nasopharyngeal swabs produce negative PCR results, some test positive for SARS-CoV-2 in the intestine.106 These results therefore suggest that after the initial infection, patients shed non-infectious, degraded viral particles.104
In Long COVID, this relationship is further complicated by the fact that many patients who report symptoms of Long COVID lack a formal diagnosis. Due to the scarcity of tests in many places at the beginning of the COVID-19 pandemic, many patients who had suspected COVID-19 were never tested for the presence of the SARS-CoV-2 virus.107 In current studies, there is significant variability in the inclusion/exclusion criteria used for patient recruitment in terms of COVID-19 test status. While some studies require a positive test, others recruit patients with either a confirmed or suspected diagnosis (Table 1). Furthermore, some studies fail to specify whether the tests used for selecting patients are PCR-based, serum antibody based, or a mixture of the two. This distinction is important because the rate of false positives and false negatives is much higher in the antigen/antibody tests,108–111 meaning error rates may vary among studies. This limitation presents challenges for clinicians in determining the likelihood that patients with non-specific symptoms have Long COVID, and also presents difficulties for large-scale efforts to characterize symptoms associated with COVID-19 and Long COVID.32,112
Initial EHR characterization of a potential Long COVID Patient Cohort
There is not currently an ICD-10-CM diagnosis code for Long COVID; thus, our ability to find patients with Long COVID using structured EHR data is limited. Lacking an ICD-10-CM code, we utilized the HPO terms curated from the literature and patient surveys to refine the potential cohort by looking for patients with at least one of these specific HPO terms. The patients characterized in Table 4 represent a base population from which EHR analysis may be able to identify Long COVID. These are patients who had COVID, have enough pre-COVID longitudinal data to enable us to compare their healthcare utilization pre- and post-COVID, and have had enough time pass since their COVID diagnosis to be out of the acute phase. While we cannot say with certainty that the patients who reported one or more Long COVID symptoms have Long COVID, as shown in Table 4, the characteristics of this group are significantly different from those of cases lacking a reported symptom. This cohort would be an ideal group for deeper phenotyping, leveraging additional data sources such as features derived from free-text notes in the EHR, imaging, or claims data.
Related and Concurrent Disorders
One major issue arising from the challenges in determining whether a patient has recovered from COVID-19 is that post-acute symptoms can also arise from different etiologies. One potential source of ambiguity comes from PICS, which describes new or worsening cognitive, psychological, and physical limitations experienced by patients following discharge from an intensive care setting.39 Some impairments have been observed to persist for years after discharge, including pulmonary effects that are exacerbated by intubation and can persist for five years or longer and decreased ability to conduct activities of daily living that can last for 1-2 years.113 Therefore, symptoms of PICS could potentially be conflated with symptoms of Long COVID in patients who were ventilated and/or treated for COVID-19 in the ICU. Another possible source of long-term symptoms is the treatments used during the acute illness. In SARS, some of the most common post-acute sequelae are thought to be caused by treatment with corticosteroids.27,28 Therefore, the care received during the acute phase of the illness holds the potential to influence the clinical course of recovery and should be considered in efforts to identify signifiers of Long COVID.
While COVID-19 is a complex and heterogeneous multisystem illness, patients infected with SARS-CoV-2 can also develop distinct illnesses. A multisystem inflammatory illness has been observed in children and in some adults following acute infection with SARS-CoV-2. This syndrome, called multisystem inflammatory syndrome in children (MIS-C) and in adults (MIS-A),43 is characterized by hyperinflammation and can begin subsequent to host clearance of active SARS-CoV-2 infection.42 This condition is rare, with an estimate of two in every 100,000 children in a descriptive analysis of MIS-C cases in New York State.114 This report also identified a median of 21 days between when children experienced COVID-19 (or an illness likely to be COVID-19) and when they were admitted to the hospital for MIS-C, and found that they were hospitalized for a median of 6 days.114 MIS-A has been reported only very rarely, with only 30 known cases as of October 2020.115 The importance of distinguishing the natural history of MIS-C/A from that of COVID-19 has been highlighted in some efforts to operationalize definitions of Long COVID,42 but at present, MIS-C/A is not widely discussed in the Long COVID literature, even though it too manifests in the post-acute phase of infection.
Similarly, preliminary findings suggest that patients with SARS-CoV-2 infection are at risk for chronic illnesses associated with post-viral sequelae. For example, some presentations of Long COVID bear a resemblance to CFS, another chronic condition that is often triggered by a viral infection.20 The broad relationship between these known sequelae of viral infections and the specific pathogenesis of SARS-CoV-2 remains to be identified, although some mechanisms have been proposed.116 In terms of characterizing the long-term sequelae of SARS-CoV-2 infection, these post-viral illnesses may introduce additional ambiguity regarding the specific outcomes associated with this particular virus compared to viral infections more broadly.
Organ Damage
One definition of Long COVID46 specifically highlights the potential importance of distinguishing long-term symptoms arising from organ damage from those arising from other etiologies. Given that a large number of Long COVID patients suffer from fatigue, which is associated with other post-viral syndromes but for which there are limited treatment options,20 identifying whether and when Long COVID patients have sustained long-term organ damage may provide additional options for treatment and understanding of the disease. However, few studies of Long COVID to date have conducted analyses elucidating the presence or extent of organ damage. Many assessments to collect evidence of long-term organ damage are intensive, meaning that their feasibility may vary with the strain on hospitals during the course of the COVID-19 pandemic. Nevertheless, preliminary investigations of a number of organ systems have identified organ damage in Long COVID patients. These findings are also important because they highlight the possibility of asymptomatic Long COVID patients, who could sustain organ damage due to the SARS-CoV-2 virus that does not immediately present with symptoms. Therefore, an improved understanding of organ damage as an outcome of acute COVID-19 or as a long-term sequela of the SARS-CoV-2 virus may present new options for patients experiencing persistent symptoms or elucidate new information about how the SARS-CoV-2 virus interacts with a range of organ systems.
Post-AKI CKD, Diabetes, and Long COVID Syndrome
During acute SARS-CoV-2 infection, diffuse endothelial injury leads to end organ perfusion abnormalities and microthrombi. This reduced perfusion contributes to acute kidney injury (AKI), and possibly to new-onset diabetes.117–119 AKI, especially moderate/severe AKI, is a risk factor for the development of chronic kidney disease (CKD).120 Apoptosis, maladaptive repair, and fibrosis have been postulated as mechanisms involved in the transition from AKI to CKD.121 The kidney is an organ of interest in Long COVID because acute SARS-CoV-2 infection is associated with kidney injury.122,123 SARS-CoV-2-associated microvascular injury may cause perfusion abnormalities within the pancreatic islets, skeletal muscle, heart, and/or brain. In the islet, for example, microcirculation is essential for both glucose sensing and insulin secretion; abnormal islet capillary architecture and fragmentation contribute to beta cell dysfunction in type 1 and type 2 diabetes.124 Diabetes is a known contributor to CKD. Both CKD and diabetes are major risk factors for cardiovascular disease (CVD)125 and long-term disability, which may overlap with the complicated picture of PASC.126
An unpublished investigation and a complementary published analysis provide evidence highlighting the relevance of kidney damage to medium- to long-term COVID-19 outcomes. A pilot investigation (unpublished) was conducted on a subgroup of 35 COVID-19 AKI survivors who were admitted at Stony Brook University Hospital, NY between March and June 2020 and subsequently followed in a “Post-AKI COVID clinic.” Patients were observed at a 6-month follow-up to have a high incidence of persistently reduced renal function after moderate/severe AKI in the setting of hospitalization with COVID-19. De novo or progressive CKD was noted in 25.7% and 74.3% of cases based on estimated glomerular filtration rate (eGFR) plus serum creatinine (SCr) measures and on SCr measures alone, respectively. A second study in a Swedish cohort127 similarly investigated kidney dysfunction following acute illness. In a group of 60 ICU patients admitted for COVID-19 infection, they found that inpatient AKI severity was associated with higher CKD stages at 3- to 6-month follow-up.127 They found no differences between patients with CKD progression compared to those without progression in terms of demographics, comorbid conditions, or ICU admission characteristics.127 Similarly, in the unpublished study, neither inpatient AKI recovery nor a history of CKD prior to admission was associated with worsening renal function at follow-up. Both of these analyses are limited by small sample sizes. Ongoing study at Stony Brook’s Post-AKI COVID clinic will include additional patients and longer follow-up and therefore should provide a more accurate estimate of CKD risk. It is not yet known whether inadequate renal repair after severe injury or persistence of SARS-CoV-2 in the kidney drives post-AKI CKD in COVID-19.
While AKI is an established independent risk factor for CKD,128 this association has not yet been extensively explored in the setting of COVID-19, given that the virus has been circulating for just over a year at this time and studies so far have mostly reported the persistence of renal dysfunction (AKD) at time of hospital discharge.129,130 Persistent organ damage is now considered part of the Long COVID syndrome,45,131 and kidney disease should be considered part of this syndrome. While these two studies are among the first reporting this association, further multi-center studies with larger sample sizes and with pathology data are needed to further analyze the relationship between AKI and development/progression of CKD in COVID-19.
Neuroimaging in Analyses of Long COVID
A variety of neuroimaging findings have been reported in COVID-19 patient populations, and efforts to better understand pathophysiologic origins and neuroanatomical correlates are ongoing. A number of studies have been carried out to characterize COVID-19 neuroimaging findings and associated neuropsychiatric symptoms, e.g. 132–135. There have been a few focused imaging studies that attempt to dissect neuroimaging correlates associated with specific symptoms; for instance, olfactory bulb abnormalities were characterized in an MR imaging study of COVID-19 anosmic patients.136 One study comparing 35 Long COVID patients to 44 controls found significant hypometabolism in the brain, including the olfactory gyrus, right temporal lobe (including the hippocampus and amygdala), the bilateral pons/medulla brainstem, and the bilateral cerebellum; notably, the clusters of hypometabolism were correlated with patient symptoms, including hyposmia and anosmia, memory and cognitive impairment, pain, and insomnia.137 There have also been additional suggestions that brainstem dysfunction might be involved in a variety of COVID-19 clinical manifestations. For instance, Yong138 cites a number of autopsy studies to support this hypothesis.139,140
Autopsy as a Means to Diagnose Long COVID
Autopsy analysis is an important method to obtain insights into the pathology associated with COVID-19 and the presence of the SARS-CoV-2 virus in diseased tissues. Kidney tissue provides a good example of the importance of autopsy analysis. AKI is very common in patients hospitalized with COVID-19 and is a major risk factor for mortality.141–143 Kidney autopsies or biopsies in patients with COVID-19-related AKI do not generally reveal evidence of direct viral cytotoxic effects, such as nuclear or cytoplasmic inclusions or extensive tissue necrosis and inflammation.144 Autopsy and kidney biopsy tissue studies have indicated that acute tubular injury is the most common pathologic finding in COVID-19 patients with AKI or proteinuria.145 Collapsing glomerulopathy and thrombotic microangiopathy have also been associated with COVID-19 AKI in autopsy and biopsy studies.145
Multi-organ and especially renal tropism of SARS-CoV-2 has been observed in autopsy studies on COVID-19 patients. Puelles et al.1 reported the presence of viral load and also viral RNA and proteins in the kidney using in situ hybridization and indirect immunofluorescence with confocal microscopy. In an autopsy study of 26 patients with COVID-19, Su et al.146 found clusters of coronavirus-like particles in the kidney tissue on electron microscopy and also detected positive immunostaining with SARS-CoV nucleoprotein antibody associated with injury patterns on light microscopy.146 Autopsy studies have also shown that SARS-CoV-2 infects and replicates inside pancreatic beta cells, reducing insulin-sensing functions of those cells.147 This direct infection of the pancreatic beta cells is likely a leading cause of metabolic dysfunction and glycemia after SARS-CoV-2 infection. The N3C database offers a unique opportunity to study glycemia before and after SARS-CoV-2 infection, as well as how new-onset diabetes may contribute to PASC effects on quality of life for both adults and children.
In a review of brain autopsy studies summarizing 24 studies with results from 149 individuals, chronic inflammation or neural changes typically associated with viral infections were found to be largely absent.148 Interestingly, in one recent study, megakaryocytes were found in cortical capillaries in 33% of brain autopsy cases examined.149 The authors noted that this observation was consistent with reports from other observers who have noted megakaryocytes150,151 and suggested that these large cells could cause ischemic alteration in a distinct pattern and might be associated with COVID-19 neurological impairment.
The timing of autopsy is likely to be important in efforts to detect whether SARS-CoV-2 remains in tissue. In one study examining 42 postmortem samples of patients who died with COVID-19, no presence of SARS-CoV-2 was noted in analysis of the kidney tissue with immunofluorescence, electron microscopy, or in situ hybridization.152 This study raised concerns about the method and timing of post-mortem tissue collection and processing, since a significant degree of autolysis was noted in the kidney tissue in this study.153 In a recent study of immediate (≤3 hours) post-mortem renal biopsies of 16 patients with COVID-19 and 5 control patients with sepsis, investigators reported that nCoV2019 N-protein was detected in proximal and distal renal tubules in 9 of 16 cases, 6 of which were confirmed by in situ hybridization. This finding supported the presence of SARS-CoV-2 in the kidney.154 However, SARS-CoV-2 E and N1/N2 genes were detected by RT-PCR of the kidney total RNA in only one case, and classical viral inclusions were not detected via electron microscopy. Therefore, while autopsy can serve as an important tool in looking for the presence of SARS-CoV-2 in tissues and the associated pathology, the methodology used for autopsy is critical to providing accurate insights into disease patterns.
COVID-19 and Quality of Life
The circumstances surrounding infection with SARS-CoV-2 and the pandemic itself are likely to have a significant impact on patients’ health. In particular, psychosocial health, nutrition, and physical fitness may all be impacted by the broader societal response to SARS-CoV-2. Early in the pandemic, the WHO released recommendations to support psychosocial health in light of the pandemic (Mental health and psychosocial considerations during the COVID-19 outbreak). Since then, many reports have indicated concern about a rise of psychosocial distress internationally.155–158 While unique psychological stressors are likely to affect patients who experience COVID-19 and especially those with more severe cases, the impact of a broader societal decline in psychological health (including addiction/substance abuse disorder) may be difficult to identify in studies that evaluate only COVID-19 patients with and without Long COVID. Similarly, viral infections can exacerbate pain and other chronic conditions,159 but these effects are not specific to SARS-CoV-2 even though they may appear that way depending on study design. In addition, the conditions of the pandemic have reduced access to healthy food choices in some places160,161 and reduced opportunities for exercise.162 Social distancing also presents unique challenges to patients with substance use disorders, as loneliness and stress can make people more inclined to substance use.163,164 As governmental and societal responses to SARS-CoV-2 evolve, it is possible that quality of life and psychosocial reports from Long COVID patients may shift along with those of the population more broadly.165
Research Response and Measurement Problems
Because of the pandemic, there has been an incredible surge of research and a call for the surveillance of COVID-19 patients.166,167 Thousands of clinical trials are being registered, initiated and, in many cases, completed on COVID-19 treatment and prevention in the USA and across the planet.168 While this response is impressive, there are risks to rapidly planning and performing expedited clinical trials.169 For example, recent reviews of registered protocols have revealed methodologic flaws and a wide array of outcome measures, particularly patient-reported outcome measures (PROMs), being used,169–171 most of which have not been vetted for relevance to COVID-19 patients.172 Moreover, the lack of available terminological standards greatly impedes the ability to compare studies.
There are two obstacles to the design of clinical research in this area. First, there has not yet been any rigorous large-scale effort to characterize the constellation (incidence and breadth) of outcomes most important to Long COVID patients. Without this characterization it is not possible to design inclusion criteria for responsible clinical studies. A second and related obstacle is that there have not yet been efforts to define the sets of core domains and outcomes for patients in future clinical studies. To date, the lack of uniformity in outcome measurement across clinical research has created multiple problems: it undermines the validity of this research, shows a lack of relevance to the patient perspective, and limits our ability to compare findings between studies or to pool data for meta-analyses.139,173
In an effort to reduce heterogeneity in outcomes measured across clinical trials, and to improve the clinical monitoring of patients, the development of core domain sets (CDS) and core outcome sets (COS) in specific health conditions has been routinely recommended.174–176 Core outcomes are instruments (e.g., EuroQol scale, PROMIS Emotional Distress-Depression scale) that measure particular core domains (e.g., quality of life, depression, pain), the latter of which are specific symptoms or broader symptom categories. A CDS is an agreed-upon selection of symptoms or symptom domains (categories) that should be measured and reported in all clinical trials for a particular health condition. A COS is defined as an agreed minimum selection of outcomes that should be measured and reported in all clinical trials for a particular health condition.176 CDSs must be developed prior to the development of COSs of measurement instruments. Given that the scientific community has only recently started to examine Long COVID, a CDS is the first necessary step. A CDS would increase the reporting of patient-important outcomes in Long COVID, reduce the risk of selective outcome reporting,177 and increase the feasibility of conducting meta-analyses on such topics in the future.174,177 In relation to value-based health care, core domain and outcome sets are key to performing research that informs quality indicators related directly to patient outcomes. They are routinely being used by national health-care organizations in the USA and abroad178–184 and in particular can be used in the measurement of quality of care in the COVID era.185
Some work has been done to create various types of CDSs and COSs for clinical trials of acute COVID-19.170,186–188 While this work is important for the acute period of COVID-19, these efforts do not focus on the long-term outcomes associated with Long COVID. To date no work has been done to explore what is important to patients with Long COVID. Without a CDS informed by a large sample of patients that had COVID-19, clinicians and clinical trialists will lack an essential assessment tool to adequately measure patient specific and patient important outcomes and changes across time. A CDS would provide a critical means of comparing results across trials, which is extremely difficult in the current conditions where many different PROMs are being used in many different samples of patients.
These problems undermine the relevance and usefulness of this evidence for decision-making, and the research does not focus on what is most important to patients. Because evidence suggests long-term effects of COVID-19 on health-related quality of life, working to identify the domains and corresponding measures (e.g., Patient Reported Outcomes Measurement Information System [PROMIS] item banks) that are most relevant to COVID-19 patients following the acute infection is urgently needed given the rapid expansion of clinical research in this group. The incidence of those with Long COVID will climb, and soon much clinical care and research will be directed at this group, as evidenced by the increase in research in the area.
Importance of Defining Long COVID
Available evidence suggests that Long COVID is a substantial public health problem with severe consequences for affected individuals and society at large. Patients commonly report being emotionally affected by health problems related to Long COVID. In the United States, patients have reported mild to severe financial impacts related to acute or chronic COVID-19.77,189 This concern is underscored by reports that Long COVID patients experience increased disability related to breathlessness and decreased quality of life.190 Understanding the needs of these patients will allow for the development of healthcare, rehabilitation, and other resources needed to support their recovery.191,192 However, identifying patient needs is contingent on developing a research infrastructure that accurately assesses the natural history of this illness.
Given the heterogeneity of clinical presentations of individuals with prolonged clinical manifestations following acute COVID-19, it is likely that clinical management should be tailored to individuals. However, the clinical management of Long COVID remains challenging because there are no evidence-based guidelines. Existing studies do not always provide comprehensive information about the clinical course, and often present aggregated results for individuals with differing clinical courses, such as severe COVID-19 requiring admission to an ICU and moderate COVID-19 requiring hospitalization but not care in the ICU. Existing literature is contradictory with respect to the natural history of Long COVID. For instance, one study found that persistent fatigue is independent of severity of initial infection, but another found that 10 of 16 individuals (63%) with severe acute COVID-19 but only 26 of 65 individuals (40%) with moderate COVID-19 had persistent fatigue.73 It should be noted that available studies have investigated COVID-19 patients who have come to medical attention, and much less data are available at the population level about the extent of late sequelae.42
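To make the fatigue comparison above concrete, the reported counts (10 of 16 and 26 of 65) can be turned into proportions, a risk ratio, and approximate confidence intervals. The sketch below is illustrative only and is not an analysis performed in the cited study: the choice of the Wilson score interval and the `wilson_interval` helper are assumptions made here, chosen because the Wilson interval behaves reasonably for small samples.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Counts reported in the text: persistent fatigue by acute-phase severity.
severe_fatigue, severe_n = 10, 16
moderate_fatigue, moderate_n = 26, 65

p_severe = severe_fatigue / severe_n
p_moderate = moderate_fatigue / moderate_n
risk_difference = p_severe - p_moderate
risk_ratio = p_severe / p_moderate

sev_lo, sev_hi = wilson_interval(severe_fatigue, severe_n)
mod_lo, mod_hi = wilson_interval(moderate_fatigue, moderate_n)

print(f"severe:   {p_severe:.3f} ({sev_lo:.3f} to {sev_hi:.3f})")
print(f"moderate: {p_moderate:.3f} ({mod_lo:.3f} to {mod_hi:.3f})")
print(f"risk difference: {risk_difference:.3f}, risk ratio: {risk_ratio:.4f}")
```

With only 16 patients in the severe group, the interval around that proportion is wide and overlaps the interval for the moderate group, which illustrates why small cohorts make apparent contradictions between studies difficult to interpret.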
Most studies to date use survey-based methods to ascertain patient-reported symptoms of Long COVID, although some studies are beginning to use imaging and other technologies to identify the physical signs of organ damage. Vital signs are a third category of indicators that are likely to prove valuable in efforts to investigate Long COVID. Vital signs have several attractive properties for the study of COVID-19. Data are often available from prior to the illness, allowing for pre-post comparisons, and are routinely collected in affected and unaffected individuals, allowing for case-control comparisons. Moreover, analyses of discontinuities in a vital sign’s trajectory over time are possible. An ecosystem where associations between patient-reported symptoms, data available in EHR, and results of simple and/or complex clinical assessments with Long COVID have been evaluated and standardized will introduce a positive feedback cycle where clinicians are able to collect the data needed for the elucidation of the Long COVID phenotype.
While heterogeneity in the presentation of Long COVID has been identified, the specific variables influencing outcomes remain to be characterized. The number of syndromes within Long COVID and the extent to which symptom profiles, frequency of occurrence, and duration are unique to these groups remain to be explored. At present, however, data are not collected in a way that allows these subtle differences to be parsed. In order to develop clinical management strategies to prevent or mitigate Long COVID, it will be essential for studies to use a unified definition of Long COVID and its subforms so that data from different studies can be integrated to provide the foundation for robust statistical inferences about risk factors for the development of Long COVID, as well as the natural history and response to treatments.
Among the reasons for needing an unambiguous definition of Long COVID is the need to make clear contrasts and comparisons between affected and unaffected people. In addition, a clear definition is necessary to understand whether or to what extent defining phenotypic features of Long COVID were present prior to COVID-19 illness in patients affected by Long COVID or potentially serving as controls in studies. The identification of appropriate unaffected people and pre-illness time periods for comparisons is foundational to advancing the state of the art in Long COVID research. It is imperative that patient-reported symptoms be taken into account alongside deep clinical characterization and large scale observational data such as in the N3C. However, all three sources of data are subject to biases and all sources are needed to provide a more complete picture of Long COVID characterization for individuals and populations.
Ethics and Regulation
The N3C data transfer to NCATS is performed under a Johns Hopkins University Reliance Protocol # IRB00249128 or individual site agreements with NIH.
Use of the N3C data for this study is authorized under the following IRB Protocol:
The N3C Data Enclave is approved under the authority of the NIH Institutional Review Board for Protocol 000082 associated with NIH iRIS reference number: 546652 entitled: “NCATS National COVID-19 Cohort Collaborative (N3C) Data Enclave Repository.” Further information can be found at ncats.nih.gov/n3c/resources.
Data Availability
The N3C Data Enclave (covid.cd2h.org/enclave) houses fully reproducible, transparent, and broadly available limited and de-identified datasets (HIPAA definitions: https://www.hhs.gov/hipaa/for-professionals/privacy/specialtopics/de-identification/index.html). Data are accessible to investigators at institutions that have signed a Data Use Agreement with NIH, provided the investigators have taken human subjects and security training and attest to the N3C User Code of Conduct. Investigators wishing to access the limited dataset must also supply an institutional IRB protocol. All requests for data access are reviewed by the NIH Data Access Committee. A full description of the N3C Enclave governance has been published;193 information about how to apply for access is available on the NCATS website: https://ncats.nih.gov/n3c/about/applying-for-access. Reviewers and health authorities will be given access permission and guidance to aid reproducibility and outcomes assessment. A list of Frequently Asked Questions about the data and access is available at: https://ncats.nih.gov/n3c/about/program-faq. The data model is OMOP 5.3.1; specifications are posted at: https://ncats.nih.gov/files/OMOP_CDM_COVID.pdf
Contributions
Contributions are organized according to contribution roles as follows:
data curation: Halie M. Rando, Tiffany J. Callahan, Christopher G. Chute, Hannah Davis, Rachel Deer, Feifan Liu, Julie A. McMurry, Emily R. Pfaff, Rose Relevo, Peter N. Robinson, Melissa A. Haendel
data integration: Tiffany J. Callahan, Christopher G. Chute, Feifan Liu, Emily R. Pfaff, Peter N. Robinson, Melissa A. Haendel
data quality assurance: Christopher G. Chute, Emily R. Pfaff
data visualization: Julie A. McMurry, Peter N. Robinson
manuscript review and editing: Rachel Deer
clinical subject matter expertise: Tellen D. Bennett, James Brian Byrd, Hannah Davis, Rachel Deer (patient perspective), Joel Gagnier, Farrukh M Koraishy, Joel H. Saltz
manuscript drafting: Halie M. Rando, Tiffany J. Callahan, Christopher G. Chute, Rachel Deer, Farrukh M Koraishy, Feifan Liu, Julie A. McMurry, Emily R. Pfaff, Justin T. Reese, Rose Relevo, Peter N. Robinson, Joel H. Saltz, Anthony Solomonides, Melissa A. Haendel
project management: Julie A. McMurry, Melissa A. Haendel
biological subject matter expertise: Halie M. Rando, Justin T. Reese, Joel H. Saltz, Melissa A. Haendel
funding acquisition: Christopher G. Chute
database / information systems admin: Rose Relevo
clinical data model expertise: Christopher G. Chute, Feifan Liu, Emily R. Pfaff, Peter N. Robinson
N3C Phenotype definition: Carolyn Bramante, Christopher G. Chute, Farrukh M Koraishy, Emily R. Pfaff, Peter N. Robinson, Melissa A. Haendel
statistical analysis: Justin T. Reese
governance: Christopher G. Chute, Melissa A. Haendel
project evaluation: Christopher G. Chute, Julie A. McMurry
critical revision of the manuscript for important intellectual content: Halie M. Rando, Tellen D. Bennett, James Brian Byrd, Carolyn Bramante, Hannah Davis, Joel Gagnier, Farrukh M Koraishy, Feifan Liu, Julie A. McMurry, Richard A. Moffitt, Peter N. Robinson, Joel H. Saltz, Melissa A. Haendel
Declaration of Conflicts of Interest
Julie A. McMurry: co-founder, Pryzm Health; Melissa A. Haendel: co-founder, Pryzm Health
Acknowledgements
The analyses described in this publication were conducted with data or tools accessed through the NCATS N3C Data Enclave (covid.cd2h.org/enclave) and supported by NCATS U24 TR002306. Halie M. Rando was supported by The Gordon and Betty Moore Foundation (GBMF 4552) and the National Human Genome Research Institute (R01 HG010067); Tellen D. Bennett was supported by NIH UL1TR002535 03S2 and NIH UL1TR002535; James Brian Byrd was supported by NIH grant K23HL128909, which protected Dr. Byrd’s time to participate; Christopher G. Chute was supported by U24 TR002306; Rachel Deer was supported by UTMB CTSA, 2P30AG024832-16 (PI: Volpi). This research was possible because of the patients whose information is included within the data from participating organizations (covid.cd2h.org/dtas) and the scientists who have contributed to the ongoing development of this community resource.
The project described was supported by the National Institute of General Medical Sciences, 5U54GM104942-04. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Data partners include but are not limited to the following: Carilion Clinic (UL1TR003015-02S2: Provision of Clinical Data to Support a Nationwide COVID-19 Cohort Collaborative); George Washington Children’s Research Institute (UL1TR001876: Clinical and Translational Science Institute at Children’s National); Duke University (UL1TR002553: Duke CTSA); Johns Hopkins University (UL1TR003098: Johns Hopkins Institute for Clinical and Translational Research); Mayo Clinic Rochester (UL1TR002377: Mayo Clinic Center for Clinical and Translational Science); Medical University of South Carolina (UL1TR001450: South Carolina Clinical & Translational Research Institute SCTR); Penn State Health Milton S. Hershey Medical Center (UL1TR002014: Penn State Clinical and Translational Science Institute); Rush University Medical Center (UL1TR002389: Institute for Translational Medicine); Stony Brook University; The Ohio State University (UL1TR002733: The OSU Center for Clinical and Translational Science: Advancing Today’s Discoveries to Improve Health); Tufts University Boston (UL1TR002544-03S4: Tufts Clinical and Translational Science Institute N3C Supplement); University of Massachusetts Medical School Worcester (UL1TR001453: University of Massachusetts Center for Clinical and Translational Science); University of Alabama at Birmingham (UL1TR003096: Center for Clinical and Translational Science); University of Arkansas for Medical Sciences (UL1TR003107: UAMS Translational Research Institute); The University of Chicago (UL1TR002389: ITM 2.0: Advancing Translational Science in Metropolitan Chicago); University of Colorado Denver (UL1TR002535-03S2: CCTSI Participation in the National COVID Cohort Collaborative N3C); University of Illinois at Chicago (UL1TR002003: Clinical and Translational Science Award); The University of Iowa (UL1TR002537: The University of Iowa Clinical and Translational Science Award); University of Kentucky (UL1TR001998-04S1: Kentucky Center for Clinical and Translational Science); University of Miami (UL1TR002736: Miami Clinical and Translational Science Institute); The University of Michigan at Ann Arbor (UL1TR002240: Michigan Institute for Clinical and Health Research); University of Minnesota (UL1TR002494: University of Minnesota Clinical and Translational Science Institute); University of Nebraska Lincoln (U54GM115458: University of Nebraska Center for Clinical & Translational Research); University of North Carolina at Chapel Hill (UL1TR002489: ICEES+ COVID-19 Open Infrastructure to Democratize and Accelerate Cross-Institutional Clinical Data Sharing and Research); University of Southern California (UL1TR001855: Southern California Clinical and Translational Institute); The University of Texas Medical Branch at Galveston (UL1TR001439: UTMB Clinical and Translational Science Award); The University of Utah (UL1TR002538-03S3: Infrastructure Support for Participation in the N3C Data Repository); University of Washington (UL1TR002319: Institute of Translational Health Sciences); University of Wisconsin-Madison (UL1TR002373: Institutional Clinical AND Translational Science Award); University of Virginia (UL1TR003015-02S2: Provision of Clinical Data to Support a Nationwide COVID-19 Cohort Collaborative); Virginia Commonwealth University (UL1TR002649-03S3: N3C & All of Us Research Program Collaborative Project); Wake Forest University Health Sciences (UL1TR001420: Wake Forest Clinical and Translational Science Award); Washington University in St. Louis (UL1TR002345: Washington University Institute of Clinical Translational Sciences); West Virginia University (U54GM104942: West Virginia Clinical and Translational Science Institute).
Footnotes
** Contact author: Melissa A. Haendel, Center for Health AI, University of Colorado Anschutz Medical Campus, Aurora, CO, USA (melissa{at}tislab.org)