Abstract
After one year of the COVID-19 pandemic, over 130 million individuals worldwide have been infected with the novel coronavirus, yet the post-acute sequelae of COVID-19 (PASC), also referred to as the ‘long COVID’ syndrome, remains mostly uncharacterized. We leveraged machine-augmented curation of the physician notes from electronic health records (EHRs) across the multi-state Mayo Clinic health system to retrospectively contrast the occurrence of symptoms and diseases in COVID-19 patients in the post-COVID period relative to the pre-COVID period (n=6,413). Through comparison of the frequency of 10,039 signs and symptoms before and after diagnosis, we identified an increase in hypertensive chronic kidney disease (OR 47.3, 95% CI 23.9-93.6, p=3.50×10−9), thromboembolism (OR 3.84, 95% CI 3.22-4.57, p=1.18×10−4), and hair loss (OR 2.44, 95% CI 2.15-2.76, p=8.46×10−3) in COVID-19 patients three to six months after diagnosis. The sequelae associated with long COVID were notably different among male vs female patients and patients above vs under 55 years old, with the hair loss enrichment found primarily in females and the thromboembolism enrichment in males. These findings compel targeted investigations into what may be persistent dermatologic, cardiovascular, and coagulopathic phenotypes following SARS-CoV-2 infection.
Introduction
Since its emergence in December 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and its clinical syndrome coronavirus disease 2019 (COVID-19) have elicited clinical presentations ranging from asymptomatic disease to severe respiratory failure and death, as well as a number of extra-pulmonary manifestations. Despite extensive characterization of its acute presentation, relatively little is known about the long-term effects of both mild and severe COVID-191,2. Telephone interviews of outpatient SARS-CoV-2 infected patients 14-21 days after positive RT-PCR test identified persistence of fatigue, cough, dyspnea, and headache most commonly3. A study of 143 recovered COVID-19 patients 60 days after symptom onset found that 87.4% still had at least one symptom, most commonly fatigue or dyspnea4. Another study of 55 recovered COVID-19 patients 3 months after discharge found 71% had radiologic abnormalities and 25% had abnormal pulmonary function tests5.
A better understanding of the convalescent phase and long-term consequences of COVID-19 (“long COVID”) is crucial as the number of patients who have been infected with SARS-CoV-2 and recover from COVID-19 continues to rise. Several efforts are ongoing to monitor self-reported symptoms following COVID-19 recovery6, but results from these studies will be limited by the pace at which patients are recruited. Current attempts to assess the long-term sequelae of COVID-19 using structured data such as International Classification of Disease (ICD) codes are limited by a finite set of possible variables and are biased by a priori selection of variables of interest.
We have previously reported using an augmented curation approach, wherein BERT-based neural networks are used to identify diseases and symptoms that have been positively attributed to patients in unstructured clinical documentation to curate millions of notes in a short period of time7. Here we leverage augmented curation to retrospectively curate the complete electronic health records (EHRs) of COVID-19 patients in order to identify symptoms and phenotypes that persist several months after diagnosis of the viral infection.
Results
To identify signs and symptoms associated with long COVID, we performed augmented curation from clinical notes of COVID-19 patients 1-6 months before and 3-6 months after diagnosis. Using deep neural networks over an institution-wide platform, we extracted positively attributed signs, symptoms, and diseases from Mayo Clinic electronic health records (EHRs). Among 10,039 signs and symptoms identified from the EHRs of 6,413 COVID-19 patients (Table 1), we identified seven terms significantly increased after diagnosis relative to before diagnosis: infection (OR 1.72, 95% CI 1.67-1.78, p=1.31×10−29), pneumonia (OR 1.91, 95% CI 1.81-2.01, p=1.15×10−14), hypertensive chronic kidney disease (OR 47.3, 95% CI 23.9-93.6, p=3.50×10−9), thromboembolism (OR 3.84, 95% CI 3.22-4.57, p=1.18×10−4), venous thromboembolism (OR 5.91, 95% CI 3.22-3.84, p=1.84×10−4), and hair loss (OR 2.44, 95% CI 2.15-2.76, p=8.45×10−3), and immunodeficiency (OR 4.13, 95% CI 3.31-5.16, p=1.95×10−2) (Figure 1, Supplementary Table 1). In contrast, 183 terms were more prevalent before diagnosis, most significantly pain (OR 0.65, 95% CI 0.63-0.66, p=5.14×10−39), hypertension (OR 0.67, 95% CI 0.65-0.68, p=1.26×10−36), and swelling (OR 0.58, 95% CI 0.56-0.60, p=8.75×10−28). These findings suggest that hypertensive chronic kidney disease, coagulopathies, and hair loss might be associated with long COVID.
Demographic and clinical stratification of patients revealed distinct signs and symptoms associated with long COVID (Table 2). For female patients (n=3,870), only infection and hair loss were significantly enriched after COVID-19 diagnosis, while infection, pneumonia, and thromboembolism were specifically enriched among male patients in the post-COVID period (Figure 2). For patients under 55 years old (n=3,464), only infection was significantly enriched after COVID-19 diagnosis (Figure 3). Among hospitalized (n=703) and ICU-admitted (n=209) patients, pneumonia and respiratory failure were documented more frequently in clinical notes 3-6 months after COVID-19 diagnosis than in the 1-6 months before (Figure 4).
Although we focused on notes in the 3-6 months after COVID-19 diagnosis, it is important to realize that the documentation of a phenotype in this time interval does not necessarily imply that it actually occurred during this interval. For example, the enrichment of infection and pneumonia in the post-COVID period likely reflects a discussion of the prior COVID-19 infection rather than the development of a new infection. To distinguish such signals that may reflect phenotypes which occurred in the peri-COVID window from those that truly emerge in the post-COVID period, we examined the cumulative incidence of each phenotype found to be enriched in the post-COVID period as a function of time relative to COVID diagnosis (Figure 5). Indeed, we observe a distinct acceleration in occurrences of “hair loss” among females and “thromboembolism”/”venous thromboembolism” among males in the post-COVID period, whereas this was not observed for “pneumonia”, “hypertensive chronic kidney disease”, and “respiratory failure”. Taken together, these data suggest that hair loss and thromboembolism warrant further study as potentially unique features of long COVID in females and males, respectively.
Discussion
Through augmented curation analysis of over sixty thousand COVID-19 patient charts, we have identified a range of phenotypes, including type 2 diabetes, thromboembolism, and hair loss, as possible long-term associations with COVID-19 diagnosis. By comparing COVID-19 patients before and after diagnosis, we have unbiasedly characterized signs and symptoms both due to disease onset, and signs and symptoms specific to a given respiratory infection.
Early investigations of long COVID have provided helpful preliminary characterization of the sequelae of signs and symptoms that arise months after COVID-19 diagnosis. Among signs and symptoms reported in the months after COVID-19 diagnosis, the most common associations with long COVID include fatigue, cough, myalgia, headache, memory problems, and anosmia. Notably, these analyses have relied on structured patient data or self-reported symptoms, and are therefore limited to a finite set of biased variables. The present analysis of physician-documented diseases, signs, and symptoms in unstructured EHR data provides an unbiased approach to identify signs and symptoms associated with long COVID. Reassuringly, our findings corroborate early studies of the signs and symptoms of long COVID, including headache and hair loss, emphasizing the importance of monitoring for these in the months after COVID-19 diagnosis. We also report several new associations with long COVID, including hypertensive chronic kidney disease and thromboembolism, which may warrant further investigation and validation using structured patient data. Overall, these differences between our analysis using augmented curation and previous analyses of long COVID highlight the need for a deeper understanding of the differences between information captured in unstructured compared to structured health data.
The approach to automatically curating physician notes taken here has limitations. First, we curate whether or not a physician is ascribing a diagnosis or symptom to a patient independent of the acuity of the diagnosis. Documentation of a prior or historical illness is not differentiated from a newly documented illness. The within-patient comparison of post-COVID symptoms and diseases to pre-COVID symptoms and diseases mitigates this effect, but it is important to note that the pre-COVID period covers only the 1-6 months prior to COVID-19 diagnosis. Furthermore, though the inclusion criteria for the study require at least one symptom or disease documented in both the pre-COVID and post-COVID period, the type of clinical document(s) that this symptom or disease may be identified in is left unrestricted. Thus, comprehensive patient medical histories may not be captured in the pre-COVID period.
As the COVID-19 pandemic continues, there is mounting concern over its long-term consequences. Although our understanding of the acute manifestations of COVID-19 has improved, disease presentation and long-term effects remain relatively uncharacterized due to the paucity of long-term patient data. Augmented curation allows us to begin to address this shortcoming by unbiasedly extracting key word associations from tens of thousands of patient charts. Here we have identified a range of phenotypes, including thromboembolism and hypertensive chronic kidney disease, as possible long-term associations with COVID-19 diagnosis. These findings compel targeted investigations into what may be persistent coagulopathic phenotypes following SARS-CoV-2 infection.
Methods
Study design
This retrospective study was comprised of patients who presented to the Mayo Clinic Health System (including tertiary medical centers in Minnesota, Arizona, and Florida) and received at least one positive SARS-CoV-2 PCR test since the start of the COVID-19 pandemic and further limited to those that have had at least one symptom or disease attributed to them in unstructured clinical documentation during the 1 to 5 months prior to COVID positive test and 3 to 6 months post COVID positive test (n = 6,432). COVIDpos patients were identified via a positive PCR test result.
Natural Language Processing / augmented curation of electronic health records (EHRs)
We used previously developed and detailed state-of-the-art BERT-based neural networks13 to rapidly curate clinical notes that were authored within 6 months of and COVID-19 diagnoses. Specifically, the model extracts sentences containing clinical phenotypes and symptoms and classifies their sentiment into the following categories: Yes (confirmed clinical manifestation or diagnosis), No (ruled out clinical manifestation or diagnosis), Maybe (possibility of clinical manifestation or diagnosis), and Other (alternate context, e.g., family history of disease). The neural networks are pre-trained on 3.17 billion tokens from the biomedical and computer science domains (SciBERT)14 and subsequently trained using 18,490 sentences and approximately 250 phenotypes with an emphasis on cardiovascular, pulmonary, and metabolic phenotypes. It achieves 93.6% overall accuracy and over 95% precision and recall for both “Yes” and “No” sentiment classification.
Statistical analyses
To compute the statistical significance of enrichment of a specific condition following a COVID-19 diagnosis, McNemar’s test was employed. This test is specifically designed for matched pairs with binary outcomes (here, the absence or presence of symptom or disease in time period of interest) and evaluates the null hypothesis that the probability of a specific condition occurring following a COVID-19 diagnosis is equal to the probability of the aforementioned condition occurring before said diagnosis. In this analysis we used the exact version of the McNemar test to compute the p-value for the null hypothesis. We computed the Risk Difference (the difference between the probability of a condition occurring following the diagnosis and the probability of the same condition occurring before the diagnosis), the Risk Ratio (the ratio of the probability of a condition occurring following the diagnosis to the probability of the same condition occurring before the diagnosis), and the Odds Ratio (the ratio of the odds of a condition occurring following the diagnosis to the odds of the same condition occurring before the diagnosis), as well as their respective 95% Wald Confidence Intervals. All multiple hypothesis tests were Bonferroni corrected.
Institutional Review Board (IRB)
This is a retrospective study of individuals who underwent polymerase chain reaction (PCR) testing for suspected SARS-CoV-2 infection at the Mayo Clinic and hospitals affiliated with the Mayo Clinic Health System. This study was reviewed by the Mayo Clinic Institutional Review Board (IRB) and determined to be exempt from the requirement for IRB approval (45 CFR 46.104d, category 4). Subjects were excluded if they did not have a research authorization on file.
Patient and public involvement
The development of the research question and outcome measures was informed by prior literature and information from the Centers for Disease Control and Prevention (CDC) on risk factors for severe COVID-19 illness. No patients were involved in the design of the study, but physicians from the Mayo Clinic who are involved with the COVID-19 research taskforce and the clinical care for COVID-19 patients were involved with the study design and execution.
Data Availability
After publication, the data will be made available to others upon reasonable requests to the corresponding author. A proposal with detailed description of study objectives and statistical analysis plan will be needed for evaluation of the reasonability of requests. Deidentified data will be provided after approval from the corresponding author and the Mayo Clinic standard IRB process for such requests.
Data Availability
After publication, the data will be made available to others upon reasonable requests to the corresponding author. A proposal with detailed description of study objectives and statistical analysis plan will be needed for evaluation of the reasonability of requests. Deidentified data will be provided after approval from the corresponding author and the Mayo Clinic standard IRB process for such requests.
Conflict of Interest Statement
ADB is a consultant for Abbvie, is on scientific advisory boards for nference and Zentalis, and is founder and President of Splissen therapeutics. One or more of the investigators associated with this project and Mayo Clinic have a Financial Conflict of Interest in technology used in the research and that the investigator(s) and Mayo Clinic may stand to gain financially from the successful outcome of the research. This research has been reviewed by the Mayo Clinic Conflict of Interest Review Board and is being conducted in compliance with Mayo Clinic Conflict of Interest policies. The authors from nference have financial interests in nference.
Footnotes
↵+ Joint first authors
Substantially updated the manuscript by curating an additional 6 months worth of data, resulting in a 20-fold expansion of the COVID cohort size. Focused the paper on the within COVID analysis rather than COVID vs. influenza.