Abstract
Understanding the relationships between pre-existing conditions and complications of COVID-19 infection is critical to identifying which patients will develop severe disease. Here, we leverage 1.1 million clinical notes from 1,903 hospitalized COVID-19 patients and deep neural network models to characterize associations between 21 pre-existing conditions and the development of 20 complications (e.g. respiratory, cardiovascular, renal, and hematologic) of COVID-19 infection throughout the course of infection (i.e. 0-30 days, 31-60 days, and 61-90 days). Pleural effusion was the most frequent complication of early COVID-19 infection (23% of 383 complications) followed by cardiac arrhythmia (12% of 383 complications). Notably, hypertension was the most significant risk factor associated with 10 different complications including acute respiratory distress syndrome, cardiac arrhythmia and anemia. Furthermore, novel associations between cancer (risk ratio: 3, p=0.02) or immunosuppression (risk ratio: 4.3, p=0.04) with early-onset heart failure have also been identified. Onset of new complications after 30 days is rare and most commonly involves pleural effusion (31-60 days: 24% of 45 patients, 61-90 days: 25% of 36 patients). Overall, the associations between pre-COVID conditions and COVID-associated complications presented here may form the basis for the development of risk assessment scores to guide clinical care pathways.
Introduction
The COVID pandemic remains an ongoing public health crisis1, and it is critically important to understand the full spectrum of complications that arise throughout the course of COVID infection. There are already several emerging reports of risk factors of severe disease as well as lingering long-term effects such as fatigue, myalgia and kidney related complications2. However, there is an incomplete understanding of the relationship between pre-existing comorbidities and post-COVID complications.
Longitudinal multi-center patient data in EHRs of over 20,000 COVID-19 patients (1,903 hospitalized) from the Mayo Clinic (Rochester, Arizona, Florida) and associated health systems, provides a unique opportunity to understand the relationship between comorbidities and COVID complications3. While the structured EHR fields such as ICD codes are modestly informative, the true context of comorbidities and complications are buried in the 1.1 million+ unstructured patient notes of the 1,903 COVID-19 patients. In this study, we have leveraged ‘augmented curation’ of EHR notes in COVID patients3 to map the relationships between complications, comorbidities, and outcomes in the hospitalized COVID-19 patients.
Methods
Institutional Review Board (IRB)
This retrospective research was conducted under IRB 20–003278, ‘Study of COVID-19 patient characteristics with augmented curation of Electronic Health Records (EHR) to inform strategic and operational decisions’. For further information regarding the Mayo Clinic Institutional Review Board (IRB) policy, and its institutional commitment, membership requirements, review of research, informed consent, recruitment, vulnerable population protection, biologics, and confidentiality policy, please refer to www.mayo.edu/research/institutional-review-board/overview.
Study design
In this study, we consider all hospitalized COVID-19 positive patients (positive PCR for SARS-CoV2) in the Mayo Clinic electronic health record (EHR) database from March 12, 2020 to September 15, 2020 (1,903 patients total). An overview of the clinical characteristics of this study population is provided in Table 1.
Clinical characteristics of all hospitalized COVID-19 positive patients in the Mayo Clinic EHR dataset. For each clinical covariate, the number of unique patients in the dataset is shown, with the percentage of the study of population in parentheses.
In order to capture the complications and comorbidities recorded in the clinical notes with positive sentiment, we ran the 1.1 million patient notes through a neural network based software that captures the sentiments of phenotypes from clinical notes.
For comorbidities, we considered 21 risk factors for COVID-19 severe illness reported by the CDC4, including: (1) anemia, (2) asthma, (3) BMI between 25-30 (overweight), (4) BMI between 30-40 (obese), (5) BMI >= 40 (severe obesity), (6) cancer, (7) cardiomyopathy, (8) chronic kidney disease (CKD), (9) chronic obstructive pulmonary disease (COPD), (10) coronary artery disease (CAD), (11) heart failure (HF), (12) hyperglycemia, (13) hypertension, (14) immunosuppressant medication usage, (15) liver disease, (16) neurologic conditions, (17) obstructive sleep apnea (OSA), (18) smoker (former or current), (19) steroid medication usage, (20) type 1 diabetes mellitus (T1D), (21) type 2 diabetes mellitus (T2D). We also note that bone marrow transplant, HIV/AIDS, pediatric conditions, pregnancy, sickle cell disease, solid organ transplant, and thalassemia were also considered, but were not included in the analysis because fewer than 20 patients had each of these comorbidities.
For complications, we considered 20 COVID-associated complications: (1) acute respiratory distress syndrome / acute lung injury (ARDS / ALI), (2) acute kidney injury (AKI), (3) anemia, (4) cardiac arrest, (5) cardiac arrhythmias, (6) chronic fatigue syndrome, (7) disseminated intravascular coagulation (DIC), (8) heart failure, (9) hyperglycemia, (10) hypertension, (11) myocardial infarction (MI), (12) pleural effusion, (13) pulmonary embolism (PE), (14) respiratory failure, (15) sepsis, (16) septic shock, (17) stroke / cerebrovascular incident, (18) venous thromboembolism / deep vein thrombosis (VTE / DVT), (19) delirium/encephelopathy, and (20) numbness.
A patient was determined to have a clinical phenotype (the comorbidity or complication, in question) if the clinical phenotype or synonyms were mentioned (with positive sentiment) within that patients’ electronic health records. For comorbidities, the mention must have occurred within a note at any point in the patient history prior to the patient’s first positive COVID-19 PCR test. Patients were only considered if they had at least one note within Mayo Clinic EHR system dated before days -31 relative to their first positive PCR test.
For the patients included in this study, we stratified the rates of new onset complications by comorbidities (Table 2). For each comorbidity, e.g. chronic kidney disease, we compare the rates of “new onset” complications in cohorts of COVID-19 patients with and without chronic kidney disease. To calculate the rate of a new-onset complication, e.g. acute kidney injury (AKI), the numerator is the number of patients with AKI recorded in the clinical notes (with positive sentiment) during but not prior to the time period. The denominator is the number of patients without AKI recorded in the clinical notes with positive sentiment prior to the time period.
In each row, we compare the rates of “early onset” complications in cohorts of COVID-19 patients with and without comorbidities, during the time period (Days 0-30) relative to the PCR diagnosis date. To calculate the rates of new onset complications, the numerator is the number of patients with any complication recorded in the clinical notes with positive sentiment during but not prior to the time period (see Methods for full list of complications). The denominator is the number of patients without the complication recorded in the clinical notes with positive sentiment prior to Day 0. The rows with statistically significant differences are highlighted in green. The columns are: (1) Comorbidity: Comorbidity that defines the cohorts, including chronic conditions which are risk factors for severe COVID-19 disease, (2) Overall rate of new onset complications in cohort with the comorbidity: Overall rate of new onset complications from Days 0-30 in the cohort of patients with the comorbidity. (3) Overall rate of new onset complications in cohort without the comorbidity: Overall rate of new onset complications from Days 0-30 in the cohort of patients without the comorbidity, (4) Relative risk [95% C.I.]: (Rate of complication in cohort with comorbidity) / (Rate of complication in cohort without comorbidity), along with the associated 95% confidence interval, (5) BH-adjusted p-value: Benjamin-Hochberg corrected p-value for the Fisher exact statistical significance test comparing the rates of overall complications in the cohorts of patients with and without the specified comorbidity.
Augmented curation to identify comorbidities and complications in clinical notes
An augmented curation approach was used to classify the sentiment of phenotypes mentioned in the clinical notes. In order to determine if a patient is indicative of a comorbidity or complication, we used a system consisting of:
A high performance text retrieval and search engine that given a patient cohort consisting of one or more patients, pipes all the sentences, paragraphs and documents pertaining to those patients to a Named Entity Recognition (NER) system
An NER system consisting of both a knowledge graph based entity recognition subsystem as well as a Bidirectional Encoder Representations from Transformers (BERT) based5 entity recognition system that, given any sentence or paragraph is able to identity whether it contains one or more phenotypes (comorbidities or complications etc.) so that all such sentences and paragraphs then are piped to a further downstream phenotype sentiment classification system.
A BERT-based phenotype sentiment model3 that given a sentence or paragraph containing a phenotype/disease token provides accurately the sentiment as to whether or not the sentence indicates if the patient has the disease or not.
A system that aggregates, patient-wise, all the sentence/paragraph sentiments, grouped by disease, and uses practical heuristics to determine if the patient may be confidently deemed to have the disease, based on the sentence level sentiments. As part of immediate future work, we are developing formal Bayesian models to aggregate the sentence level BERT sentiment information into a sound patient level sentiment inference.
Presently we use heuristics based on the number of positive sentiment sentences found in a patient’s notes to deem whether or not the patient has a disease.
Validation of the augmented curation model
In order to validate the augmented curation model for a set of phenotypes of complications/comorbidities, we manually labelled a set of 2,404 randomly selected sentences from the clinical notes containing the phenotypes. The true positive, true negative, false positive, and false negative rates are reported in Supplementary Table S1. Overall, the out-of-sample precision, recall, and accuracy values were 0.980, 0.982, and 0.966, respectively.
Results
Patient characteristics
1,903 patients were hospitalized with a diagnosis of COVID-19 between March 12, 2020 and October 15, 2020. Using the date of the first positive SARS-CoV-2 PCR test, we analyzed the clinical notes of each patient in their pre-COVID vs the post-COVID19 phase (Figure 1A). Using deep language models (Figure 1B) we extracted the 20 risk factors for COVID-19 severe illness reported by the CDC (Figure 1C) and the 18 COVID-associated complications (Figure 1D) in order to analyze their association in our cohort (Figure 2-4).
(A) Relative timeline for each of the patients in the study, divided into the pre-COVID phase (1 year prior to the first positive PCR test), and the SARS-CoV-2 positive phase (90 days following the first positive PCR test). (B) Clinical notes from 1903 hospitalized patients are analyzed with a Bidirectional Encoder Representations from Transformers (BERT) model to extract the presence or absence of comorbidities and complications. (C) Distribution of pre-existing conditions in the first-year before the first positive PCR test (D) Distribution of complications at early (0-31 days), and late time points (31-60 days) (61-90 days).
The color of each (comorbidity, complication) pair corresponds to (number of patients with the comorbidity that had the complication for the first time in the time window (0-30 days)) / (Total number of patients with the comorbidity). In particular, darker shades of red correspond to higher rates of complications, and lighter shades of red correspond to lower rates of complications. A patient is determined to have a comorbidity if it is recorded in their clinical notes with positive sentiment any time before their first positive PCR test. For each comorbidity row and for each complication column, the number of patients is shown in parentheses.
In Table 1, we present the general characteristics of the study population. All age groups are included, and, as expected from the severity of the disease in different age groups, more than 50% of the patients were over 50 year-old with only 2.6% of pediatric patients. Female, male and different ethnic origins of the US population are adequately represented. The most frequent comorbidities were hypertension (42.8%), diabetes/hyperglycemia (38%), obesity (25.2%) and cancer (21.1%), reflecting the most common causes of chronic diseases in the US.
The most common COVID complications recorded were respiratory (ARDS, respiratory failure, pulmonary embolism), followed by cardiovascular (hypertension, myocardial infarction, arrhythmia, stroke), acute kidney injury, anemia, sepsis and diabetic decompensation/hyperglycemia (Figure 1D).
Frequency of COVID-19 complications and association with underlying comorbidities
The main objective of our analysis was to identify the association between comorbidities and short-term (up to 30 days post-infection) and long-term (31-90 days post-infection) complications of COVID-19 infection. Here, we observe the majority of complications occur within the first month post-infection (Figure 1D).
We identify multiple comorbidities that are associated with significantly higher rates of any complications in the early onset time period (Days 0-30 post-PCR diagnosis). From this analysis, we validate that many of the CDC-reported risk factors for COVID-19 severe illness are associated with increased rates of early onset COVID complications across multiple organ systems (Table 2). Among these we identify hypertension (RR: 9.4, p-value: 2.9e-64) as the most significant risk factor followed by other cardiovascular chronic disease (heart failure, coronary artery disease, cardiomyopathy), anemia (RR: 3.2, p-value: 9.8e-14) and chronic kidney disease (RR: 4.4, p-value: 1.5e-22), as the most significant predictors of clinical complication in early COVID-19 infection.
Respiratory complications
Pleural effusions are the most common early onset complications: 23% within the first months and 7% for ARDS/ALI, the second most frequent complication (Figure 1D). The primary risk factor for pleural effusion was hypertension (RR 9.2, p = 2.4e-22) (Figure 2, Table 3). While the risk of new onset of pleural effusion is reduced after a month, our data reveal persistent risk of pleural effusion beyond 30 days post-infection, particularly among patients with type 1 diabetes (∼5%) (Figures 3,4).
In each row, we compare the rates of “early onset” complications in cohorts of COVID-19 patients with and without comorbidities, during the time period (Days 0-30) relative to the PCR diagnosis date’. To calculate the rates of new onset complications, the numerator is the number of patients with the complication recorded in the clinical notes with positive sentiment during but not prior to the time period. The denominator is the number of patients without the complication recorded in the clinical notes with positive sentiment prior to the time period. Rows with cardiovascular complications are highlighted in light red, and rows with respiratory complications are highlighted in light blue. Rows are sorted first by complication, and second by statistical significance. The columns are: (1) Complication: Complication phenotype that is used to define the rates, including phenotypes associated with severe COVID-19 disease, (2) Comorbidity: Comorbidity that defines the cohorts, including chronic conditions which are risk factors for severe COVID-19 disease, (3) Rate of new onset complication in cohort with the comorbidity: Rate of complication from Days 0-30 in the cohort of patients with the comorbidity. (4) Rate of new onset complication in cohort without the comorbidity: Rate of complication from Days 0-30 in the cohort of patients without the comorbidity, (5) Relative risk [95% C.I.]: (Rate of complication in cohort with comorbidity) / (Rate of complication in cohort without comorbidity), along with the associated 95% confidence interval, (6) BH-adjusted p-value: Benjamin-Hochberg corrected p-value for the Fisher exact statistical significance test comparing the rates of the specified complications in the cohorts of patients with and without the specified comorbidity.
The color of each (comorbidity, complication) pair corresponds to (Number of patients with the comorbidity that had the complication for the first time in the time window (31-60 days)) / (Total number of patients with the comorbidity). In particular, darker shades of red correspond to higher rates of complications, and lighter shades of red correspond to lower rates of complications. A patient is determined to have a comorbidity if it is recorded in their clinical notes with positive sentiment any time before their first positive PCR test. For each comorbidity row and for each complication column, the number of patients is shown in parentheses.
The number in each (comorbidity, complication) cell corresponds to the number of patients with the comorbidity that had the complication for the first time in the time window (61-90 days). The color of each (comorbidity, complication) pair corresponds to (Number of patients with the comorbidity that had the complication for the first time in the time window (61-90 days)) / (Total number of patients with the comorbidity). In particular, darker shades of red correspond to higher rates of complications, and lighter shades of red correspond to lower rates of complications. A patient is determined to have a comorbidity if it is recorded in their clinical notes with positive sentiment any time before their first positive PCR test. For each comorbidity row and for each complication column, the number of patients is shown in parentheses.
Acute respiratory distress syndrome / acute lung injury is the second most frequent and is the most dreaded complication of severe COVID-19 infection (7%) (Figure 1D). In the early stages of COVID infection (i.e. 0-30 days post-infection), ARDS/ALI was most significantly associated with hypertension (p-value: 4.2e-8). The other most significantly associated baseline comorbidities include anemia (p-value: 2.9e-4) and chronic kidney disease (2.7e-3) (Figure 2). In later stages of infection, we observe additional instances of ARDS/ALI, but at lower rates. Further, in later stages of infection, we fail to observe significant associations between baseline comorbidities and increased risk of new onset of ARDS/ALI (Figures 3,4, Table 3).
Cardiovascular complications
Cardiac arrhythmia was the most common cardiovascular complication following COVID infection (12%) (Figure 1D). Hypertension is by far the most important risk factor (RR= 21, p = 2.7e-19) (Table 3). Up to 7% of hypertensive patients present with this complication within the first 30 days (Figure 2). But the risk declines to less than 1% for new onset after one month post-infection (Figures 3,4).
Early onset COVID heart failure is the second most common cardiovascular complication (9%) (Figure 1D). It is primarily associated with coronary heart disease (RR=7.3, p=2.9e-3) and other cardiovascular risk factors (hypertension, anemia, type 2 diabetes, smoking) (Table 3). But interestingly, cancer and immunosuppression are also uncovered as significant risk factors (RR=5.1 and 4.5, P = 4.0e-4 and 0.04, respectively). The cardiovascular complications examined in this study occur most frequently in days 0-30 post-infection (Figure 2). Beyond 30 days, the risk of new-onset arrhythmia, hypertension, MI, PE/DVT dropped to less than 1% for all comorbidities (Figures 3,4).
Renal complications
Acute kidney injury is among the most common early-onset post-COVID complications (7%), (Figure 1D) and is associated in our cohort mostly with hypertension (RR=11, p=1.2e-7), and chronic kidney disease (RR 8.9, p=4.1e-6) (Table 3) (Figure 2). Specifically, we observe acute kidney injury in 7% of hospitalized COVID patients in aggregate in early infection. The risk of acute kidney injury is highest in the early stages of infection (i.e. 0-30 days post-infection), while there is a reduction in the new onset of acute kidney injury beyond 30 days (either 31-60 days or 61-90 days) (Figures 3,4).
Neurologic Complications
Encephalopathy and delirium are a commonly observed complication of COVID (c), which is most associated in our cohort with heart failure (RR = 11, p = 3.9e-5), hypertension (RR = 7.2, p = 3.3e-4), and coronary artery disease (RR = 9.2, p = 7.5e-4) (Table 3). Further, the risk of encephalopathy and delirium was observed to be highest in early COVID infection (Figures 3,4).
Predictors of long-term complications of COVID-19 infection
While we observe a substantial reduction in the frequency of new onset of complications beyond 30 days post-infection, the risk of some complications is non-zero. In the case of pleural effusion, the risk decreases from 10-15% for all patients to less than 1% (Figures 2,3,4) with increased risk in patients with cardiomyopathy, chronic kidney disease, coronary artery disease, heart failure and hypertension. Patients with liver disease, stroke and type 1 diabetes also appear more susceptible to late-onset complications (days 31-90) (Figure 3,4).
Discussion
In the present study we have set out to understand the relationship between baseline comorbidities and clinical complications over the course of COVID-19 infection. Here, we leverage natural language processing of unstructured patient notes from 1,903 patients hospitalized with COVID-19 in the Mayo Clinic health system. While it stands to reason that individuals with poorer health status and multiple underlying comorbidities will experience worse outcomes during COVID-19 infection, our study reveals that not all risk factors are created equal and are associated with different complications. Previous studies have begun to uncover numerous factors associated with increased risk of more severe COVID-19 infection6–10 including hypertension, chronic kidney disease, type 2 diabetes, cardiovascular disease, and malignancy. In general, these studies have examined risk of severe COVID-19 infection, but have not examined the relationship between baseline comorbidities and risk of specific complications.
In our analysis, we observe that hypertension is the single most significant risk factor for all examined complications with exception of deep vein thrombosis. Notably, this is consistent with previous studies,, where patients with baseline hypertension have been reported to have higher risk of more severe COVID-19 disease7,8. Specifically, our data suggest that a recent history of hypertension is the strongest predictor of ARDS, the most significant and life-threatening complication of COVID-19, among hospitalized COVID-19 patients, similar to previous observations10. We further observed anemia, chronic kidney disease, immunosuppression, coronary artery disease and hyperglycemia to be associated with increased rates of ARDS. Our analysis further uncovered unexpected associations like between a history of cancer and immunosuppression with heart failure following COVID-19 infection.
Our data also highlight the temporal relationship between baseline health status and complications throughout COVID-19 infection. For example, cancer, obesity, and obstructive sleep apnea are associated with higher rates of short-term complications (days 0-30 post-PCR test), but not with late-onset complications. While many comorbidities are chronic (e.g. cancer, obesity, coronary artery disease and chronic kidney disease), others are amenable to short-term intervention, suggesting that tight control of modifiable risk factors might limit the risk of complication due to COVID-19 infection. For example, controlling hypertension, smoking cessation, treating anemia, and having tight glycemic control might reduce the rate of cardiovascular complications in the early stages of COVID-19.
Many of the comorbidities examined likely influence the development of complications, even in the absence of COVID-19 infection. For example, we do not observe new-onset pleural effusion among patients with pre-existing liver disease. It is possible that this is related to previous incidence of pleural effusion among patients with liver disease11. In the present analysis of the relationship between baseline co-morbidities and the likelihood of developing severe complications, we are not using a reference population of COVID-19 negative individuals. Instead, to understand these relationships, we have limited our inquiry to hospitalized COVID-19 patients and examined differences in the rates of clinical complications and various comorbidities. In future analyses, we will explore the rate of clinical complications in a control population of hospitalized COVID-negative patients to establish baseline complication rates within a hospitalized population.
At present, our analysis does not account for the co-dependent relationships between comorbidities or between complications. In many cases, individual patients likely have multiple complications, which can obscure the interpretation of data, particularly at later time points where we observe fewer events. Additionally, it is possible that many of the late-stage complications arise directly from baseline comorbidities rather than a direct result of COVID19 infection. This study can be leveraged for the development of controlled trials to identify appropriate prophylactic or therapeutic interventions for high-risk COVID-19 patients. Future analyses will focus on creation of a multivariate model to enable risk prediction of post-COVID complications12.
Data Availability
Deidentified data will be made available upon reasonable request to corresponding author after peer-reviewed publication of this manuscript
Supplementary Material
The columns are (1) Phenotype: disease token identified in the sentence, (2) TP (true positives): count of sentences in which the BERT model correctly identified the sentiment as ‘Yes’, (3) TN (true negatives): count of sentences in which the BERT model correctly identified the sentiment as not ‘Yes’, (4) FP (false positives): count of sentences in which the BERT model incorrectly identified the sentiment as ‘Yes’, (5) FN: (false negatives): count of sentences in which the BERT model incorrectly identified the sentiment as not ‘Yes’, (6) Precision: precision of the BERT model, equal to TP/(TP+FP), (7) Recall: recall of the BERT model, equal to TP/(TP+FN), (8) Accuracy: accuracy of the BERT model, equal to (TP+TN)/(TP+TN+FP+FN).
Footnotes
↵* Joint first authors