Abstract
South Korea was one of the epicenters of both the 2015 MERS and 2019 COVID-19 outbreaks. However, there has been a lack of published literature, especially literature based on EMR records, that provides a comparative summary of the prognostic factors present in patients with coronavirus-derived diseases. Therefore, in this study, we aimed to compare and evaluate the distinct clinical traits of patients infected with different coronaviruses, including the less pathogenic HCoV strains, SARS-CoV, MERS-CoV, and SARS-CoV-2. We also examined the risk factors by COVID-19 severity to investigate the extent of resemblance in clinical features between the disease groups and to identify unique factors that may influence the prognosis of COVID-19 patients. Here, we utilize the common data model (CDM), a database that houses EMR records transformed into a common format for use by multiple institutions. For the comparative analyses between the disease groups, we used the independent t-test, Scheffe post-hoc test, and Games-Howell post-hoc test for the continuous variables, and the chi-square test and Fisher’s exact test for the categorical variables. Based on these analyses, we selected the variables with p-values less than 0.05 to predict COVID-19 severity by nominal logistic regression adjusted for age and gender. We observed diabetes, cardio- and cerebrovascular diseases, cancer, pulmonary disease, gastrointestinal disease, and renal disease in all patient groups. Among these, the proportion of cancer patients was the highest in all groups, with no statistical significance. Most interestingly, we observed a high degree of clinical similarity between the COVID-19 and SARS patients, with more than 50% of the measured clinical variables showing no statistically significant difference between the two groups.
Our research is significant for the bioinformatics field in that we were able to use the integrated CDM effectively to reflect real-world challenges in the context of coronavirus. We expect the results of our study to provide clinical insights that can serve as predictors of risk factors in future coronavirus outbreaks as well as prospective guidelines for clinical treatment.
Background
COVID-19 is a global pandemic that has caused more than a million deaths and nearly 50 million cases since its outbreak [1]. Coronavirus disease 2019 (COVID-19) is a novel respiratory viral disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which was first discovered in Wuhan, China, in December 2019 [2]. Current evidence suggests that the virus is transmitted via respiratory droplets (aerosols) from coughing, sneezing, speech, and heavy breathing, yet it is still unknown whether these aerosols persist in the air for a prolonged period [2-3]. However, this is not the first time the world has faced health threats from coronaviruses. Coronavirus infections in humans were first reported in the 1960s. Coronaviruses are RNA viruses that are commonly present in bats and belong to the family Coronaviridae. The family comprises four genera, Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and Deltacoronavirus, of which two, Alphacoronavirus and Betacoronavirus, are known to cause respiratory infections in humans. Including SARS-CoV-2, there are seven human coronaviruses; the other six are HCoV-229E, HCoV-NL63, HCoV-OC43, HCoV-HKU1, SARS-CoV, and MERS-CoV. While the HCoV strains cause mild upper respiratory diseases, the more pathogenic strains, SARS-CoV, MERS-CoV, and SARS-CoV-2, cause severe respiratory symptoms and complications that may lead to death [4-6]. South Korea was no exception, being one of the epicenters of both the 2015 MERS and 2019 COVID-19 outbreaks. However, there has been a lack of published literature that provides a comprehensive and comparative summary of the prognostic factors present in patients with coronavirus-derived diseases in South Korea. In light of emerging and re-emerging viral transmissions, it is important to collect a comprehensive set of data upon which robust control measures can be established.
Therefore, in this study, we aim to compare and evaluate the distinct clinical traits of patients infected with different coronaviruses, including the HCoV strains, SARS-CoV, MERS-CoV, and SARS-CoV-2. Further, we aim to characterize the COVID-19 patients clinically in order to identify the risk factors by disease severity and to compare them with the risk factors identified for the other coronavirus infections. This allows us to see whether any common or distinct trait observed in the other coronavirus diseases significantly influences the prognosis of COVID-19 patients. The study utilizes the common data model (CDM), a database that houses EMR records transformed into a common format for use by multiple institutions for research purposes [7]. With the use of real-world data, we expect to present valuable insights into the clinical variables that matter for COVID-19 as well as the degree of resemblance between the coronaviruses.
Methods
I. Data collection and definition
We utilized the CDM of Seoul National University Hospital, located in Seoul, South Korea. The data collection period was from October 15, 2004 to July 31, 2020. Without any restriction on age or gender, we collected the records of symptoms, comorbidities, and laboratory test results of patients diagnosed with HCoV-229E, HCoV-NL63, HCoV-OC43, HCoV-HKU1 (HCoVs), SARS-CoV, MERS-CoV, and SARS-CoV-2 infections. During the data mining process, we categorized all individual diagnoses by disease type with reference to the ICD codes and selected the laboratory measurement variables primarily based on a literature review. We excluded any variable that had null or sparse patient records. We divided the COVID-19 patients by disease severity based on the criteria published by the World Health Organization (WHO) [7]. Among the confirmed COVID-19 patients, those who experienced mild cold-like symptoms with no pneumonia were classified as mild, whereas those who showed any additional clinical presentation of pneumonia were classified as non-mild.
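The severity rule described above can be written as a small function. The following is a minimal sketch for illustration only, not the code used in the study; the function name `classify_severity`, the field `has_pneumonia`, and the example records are hypothetical.

```python
from collections import Counter


def classify_severity(has_pneumonia: bool) -> str:
    """Return 'non-mild' when pneumonia is documented, otherwise 'mild'."""
    return "non-mild" if has_pneumonia else "mild"


# Hypothetical patient records, for illustration only (not study data).
patients = [
    {"id": 1, "has_pneumonia": False},
    {"id": 2, "has_pneumonia": True},
    {"id": 3, "has_pneumonia": False},
]

# Count how many of the example records fall into each severity group.
group_sizes = Counter(classify_severity(p["has_pneumonia"]) for p in patients)
print(dict(group_sizes))
```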
II. Statistical analysis
Continuous variables were compared by the independent t-test, followed by the Scheffe post-hoc test when the variances were homogeneous and the Games-Howell post-hoc test otherwise. They were expressed as mean ± standard deviation. For categorical variables, we conducted the chi-square test or Fisher’s exact test to determine whether the presence of conditions differed across disease groups. Based on these analyses, we selected the variables with p-values less than 0.05 to predict COVID-19 severity by logistic regression adjusted for age and gender [8-12]. All statistical analyses were carried out using R version 3.6.2.
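To make the two categorical tests concrete, the sketch below computes the Pearson chi-square test and the two-sided Fisher’s exact test for a 2×2 table using only the Python standard library. It is an illustration under stated assumptions rather than the study’s R code: the function names and the example counts are hypothetical, the chi-square version applies no continuity correction, and the choice between the two tests (Fisher’s exact test is commonly preferred when expected cell counts are small) follows general convention, not a rule stated in the text.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Assumes every row and column total is non-zero.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return math.comb(row1, k) * math.comb(row2, col1 - k) / math.comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    total = sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical counts: condition present/absent in two disease groups.
chi2, p_chi2 = chi_square_2x2(10, 20, 30, 40)
p_fisher = fisher_exact_2x2(3, 1, 1, 3)
print(f"chi-square = {chi2:.3f}, p = {p_chi2:.3f}; Fisher p = {p_fisher:.3f}")
```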
Results
I. Comparative analyses between the coronavirus infections
Demographic and clinical characteristics
For the study, we collected the records of 2840 COVID patients, 67 MERS patients, 39 SARS patients, and 81 other HCoV-positive patients. Table 1 summarizes the clinical characteristics of the COVID, MERS, SARS, and HCoV patients. Among the COVID patients, the mean age was 51.8 ± 26.0 and 1457 (51.3%) were males. Among the MERS patients, the mean age was 36.9 ± 23.4 and 35 (52.2%) were males, whereas among the SARS and HCoV patients, the mean age was 22.8 ± 28.9 and 7.2 ± 11.9, and 27 (69.2%) and 42 (51.9%) were males, respectively. Among the comorbidities, diabetes, cancer, pulmonary disease, gastrointestinal disease, and renal disease were present in all groups. Compared to the MERS group, the COVID patients had significantly higher proportions of diabetes (4.4% vs. 1.1%, p < 0.001), cancer (29.5% vs. 3.0%, p < 0.001), and gastrointestinal diseases (14.4% vs. 1.5%, p < 0.01). However, SARS patients, compared to the COVID patients, had higher proportions of cerebrovascular (12.8% vs. 1.1%, p < 0.001), pulmonary (61.5% vs. 8.4%, p < 0.001), and renal conditions (17.9% vs. 6.4%, p < 0.01), whereas the HCoV group had significantly higher proportions of patients with pulmonary diseases (48.1% vs. 8.4%, p < 0.001) and musculoskeletal diseases (9.9% vs. 2.2%, p < 0.001) than the COVID group. For symptoms, patients in all disease groups reported fever, cough, dyspnea, gastrointestinal symptoms, and upper respiratory infections. Compared to the COVID patients, MERS and HCoV patients presented with fever (79.1% vs. 33%, p < 0.001 and 60.5% vs. 33%, p < 0.001, respectively) and upper respiratory infections (28.4% vs. 0.8%, p < 0.001 and 37% vs. 0.8%, p < 0.001, respectively) more frequently. Except for dyspnea and sore throat, SARS patients did not show any statistically significant difference from the COVID patients in their symptom presentations.
Table 1. Summary of clinical characteristics of the HCoV, SARS, MERS, and COVID-19 groups. Data are presented as mean ± standard deviation for the continuous variables and as number of patients and percentage otherwise.
Laboratory findings
Laboratory findings of each disease group are summarized in Table 2. In complete blood cell counts, lymphocyte count (38.90 ± 27.00 vs. 18.40 ± 17.10, p < 0.001) and eosinophil count (4.38 ± 6.09 vs. 1.37 ± 2.61, p < 0.01) were higher in the HCoV group than in the COVID group, whereas hemoglobin (13.20 ± 2.12 vs. 10.90 ± 2.47, p < 0.001) and hematocrit (39.70 ± 5.59 vs. 33.00 ± 7.24, p < 0.001) levels were higher in the MERS group than in the COVID group. In liver function measurements, COVID patients had a higher level of total bilirubin than the HCoV and MERS patients (1.39 ± 2.34 vs. 0.69 ± 0.47, p < 0.001 and 1.39 ± 2.34 vs. 0.67 ± 0.33, p < 0.001, respectively). The COVID patients also had a higher level of albumin than the SARS group (3.53 ± 0.74 vs. 2.94 ± 0.83, p < 0.001) but a lower level than the MERS group (4.17 ± 0.52 vs. 3.53 ± 0.74, p < 0.001). For kidney function, levels of blood urea nitrogen and creatinine were significantly higher among the COVID patients than in the HCoV group (23.30 ± 20.20 vs. 12.90 ± 11.80, p < 0.001 and 1.33 ± 1.77 vs. 0.58 ± 0.64, p < 0.1, respectively). No statistical differences in coagulation function were observed between the groups.
Table 2. Summary of laboratory findings of the disease groups. Data are presented as mean ± standard deviation for the continuous variables and as number of patients and percentage otherwise. Hb: Hemoglobin, Hct: Hematocrit, RBC: Red blood cell, WBC: White blood cell, PLT: Platelet, PCT: Procalcitonin, AST: Aspartate aminotransferase, ALT: Alanine aminotransferase, BUN: Blood urea nitrogen, eGFR: Estimated glomerular filtration rate, aPTT: Activated partial thromboplastin time, PT: Prothrombin time.
II. Comparative analyses of the COVID-19 severity
Demographic and clinical characteristics
After dividing the COVID patients by the WHO disease severity criteria (ref), 2596 patients (91.4%) belonged to the mild group and 244 (8.6%) belonged to the non-mild group. Table 3 summarizes the clinical characteristics of the mild and non-mild patients. In the mild group, the mean age was 47.9 ± 25.6 and 1298 were males. In the non-mild group, the mean age was 67.0 ± 21.7 and 159 were males. There were 1567 COVID patients with comorbidities, of which 1327 and 240 belonged to the mild and non-mild groups, respectively. By comparison, non-mild COVID patients presented with more comorbidities upon hospital admission (98.3% vs. 51.1%, p < 0.001). As shown in Table 3, more patients in the non-mild group had cerebrovascular disease (4.9% vs. 0.8%, p < 0.001), pulmonary disease (36.9% vs. 5.7%, p < 0.001), and renal disease (16.0% vs. 5.5%, p < 0.001), whereas more patients in the mild group had hepatic disease (5.0% vs. 0.8%, p < 0.01). For symptoms, the non-mild group had higher proportions of patients reporting dyspnea (20.1% vs. 3.1%, p < 0.001), chest pain (5.7% vs. 1.9%, p < 0.001), and upper respiratory infections (2.5% vs. 0.7%, p < 0.001), whereas fever was more frequent in the mild group than in the non-mild group (34.5% vs. 18.0%, p < 0.001).
Table 3. Summary of clinical characteristics of the COVID-19 severity groups. Data are presented as mean ± standard deviation for the continuous variables and as number of patients and percentage otherwise.
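The group sizes and proportions quoted in this subsection can be cross-checked with simple arithmetic. The sketch below uses only counts stated in the text; the variable names and the helper `percent` are introduced here for illustration. The recomputed shares agree with the reported percentages to within about 0.1 percentage point.

```python
# Counts as reported in the text.
total_covid = 2840
mild, non_mild = 2596, 244
mild_with_comorbidity, non_mild_with_comorbidity = 1327, 240
mild_males, non_mild_males = 1298, 159


def percent(part, whole):
    """Share of `part` in `whole`, in percent."""
    return 100 * part / whole


# The subgroup counts should add up to the reported totals.
totals_consistent = (
    mild + non_mild == total_covid
    and mild_with_comorbidity + non_mild_with_comorbidity == 1567
    and mild_males + non_mild_males == 1457
)

# Recomputed share paired with the percentage reported in the text.
shares = {
    "mild group": (percent(mild, total_covid), 91.4),
    "non-mild group": (percent(non_mild, total_covid), 8.6),
    "comorbidity, non-mild": (percent(non_mild_with_comorbidity, non_mild), 98.3),
    "comorbidity, mild": (percent(mild_with_comorbidity, mild), 51.1),
}

print("totals consistent:", totals_consistent)
for label, (computed, reported) in shares.items():
    print(f"{label}: computed {computed:.2f}%, reported {reported}%")
```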
Laboratory findings
Laboratory findings of each severity group are summarized in Table 4. Compared to the mild group, the non-mild group showed statistically higher white blood cell count (11.13 ± 9.44 vs. 9.99 ± 15.83, p < 0.05), eosinophil count (39.70 ± 46.42 vs. 34.65 ± 45.64, p < 0.05), procalcitonin (24.14 ± 14.18 vs. 22.54 ± 14.65, p < 0.05), and platelet count (208.26 ± 120.78 vs. 190.18 ± 120.26, p < 0.001). In liver function measurements, the mild group showed higher levels of albumin (3.56 ± 0.74 vs. 3.30 ± 0.64, p < 0.001) and total bilirubin (1.45 ± 2.46 vs. 0.97 ± 1.20, p < 0.001) than the non-mild group, but lower levels of blood urea nitrogen (45.01 ± 21.11 vs. 50.31 ± 30.10, p < 0.001), creatinine (91.86 ± 86.73 vs. 123.59 ± 117.71, p < 0.001), and fibrinogen (392.81 ± 135.42 vs. 449.46 ± 117.01, p < 0.001).
Table 4. Summary of laboratory findings of the COVID-19 severity groups. Data are presented as mean ± standard deviation for the continuous variables and as number of patients and percentage otherwise. Hb: Hemoglobin, Hct: Hematocrit, RBC: Red blood cell, WBC: White blood cell, PLT: Platelet, PCT: Procalcitonin, AST: Aspartate aminotransferase, ALT: Alanine aminotransferase, BUN: Blood urea nitrogen, eGFR: Estimated glomerular filtration rate, aPTT: Activated partial thromboplastin time, PT: Prothrombin time.
III. Logistic regression
To examine which of the clinical characteristics shared among the disease groups may uniquely affect the prognosis of COVID-19 patients, we conducted nominal logistic regression with the variables that showed statistical significance at p-values less than 0.05. Table 5 shows the results of the regression, adjusted for both age and gender. The model showed that cerebrovascular disease (OR: 5.34, 95% CI: 1.06-26.27, p < 0.05), pulmonary disease (OR: 925.00, 95% CI: 426.15-2351.15, p < 0.001), renal disease (OR: 7.70, 95% CI: 3.35-19.28, p < 0.001), and increased eosinophil count (OR: 4.11, 95% CI: 1.21-14.91, p < 0.05) were statistically associated with progression to the more severe stage of COVID-19, whereas an increased level of bilirubin was statistically associated with the less severe form of COVID-19 (OR: 0.57, 95% CI: 0.33-0.96, p < 0.05).
Table 5. Nominal logistic regression adjusted for age and gender. Hb: Hemoglobin, RBC: Red blood cell, WBC: White blood cell, BUN: Blood urea nitrogen.
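For readers less familiar with the model, a logistic regression with a two-level outcome (mild vs. non-mild) adjusted for age and gender can be written in the generic textbook form below. This is a sketch under stated assumptions, not the authors’ exact specification: the symbols $x_j$ and $\beta_j$, and the way age and gender are coded, are introduced here for illustration.

```latex
\[
\log\frac{P(\text{non-mild})}{1 - P(\text{non-mild})}
  = \beta_0
  + \beta_{\text{age}}\,\text{age}
  + \beta_{\text{gender}}\,\text{gender}
  + \sum_{j} \beta_j x_j ,
\qquad
\mathrm{OR}_j = e^{\beta_j}
\]
```

Under this form, an odds ratio above 1 (for example, renal disease with OR 7.70) corresponds to higher odds of non-mild disease for patients with that factor, and an odds ratio below 1 (bilirubin, OR 0.57) to lower odds, with age and gender held fixed.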
Discussion
COVID-19 is the third coronavirus-derived disease to have caused fatalities across the global population. As such, it is noteworthy that coronaviruses may continue to be a viral source of morbidity and mortality. To understand the impact of coronaviruses comprehensively, we examined the characteristics of, and associations between, the COVID-19, HCoV, SARS, and MERS patients within South Korea, using the common data model (CDM).
In all disease groups, we observed patients with diabetes, cardio- and cerebrovascular diseases, cancer, pulmonary disease, gastrointestinal disease, and renal disease. Among the comorbidities, the proportion of cancer patients was the highest in all groups, with no statistical significance. Interestingly, we observed the most similarities in clinical features between the COVID-19 and SARS patients. Out of 17 conditions, including the comorbidities and symptoms, the two groups showed no statistical differences in 12 conditions (71%). Further, within the laboratory findings, both groups presented statistical similarities in 17 out of 19 measurements (89%). Such similarity in clinical characteristics between SARS-CoV-2 and SARS-CoV-1, which has also been supported by Petrosillo et al. [5], may be explained by their common ancestor, the bat coronavirus HKU9-1 [13]. In the comparative analyses between both the disease groups and the COVID-19 severity groups, cerebrovascular disease, hepatic disease, pulmonary disease, and renal disease showed statistical significance. When applying those factors along with the selected measurement variables, we saw that cerebrovascular disease, pulmonary disease, renal disease, and increased eosinophil count were associated with a worse prognosis of COVID-19. Studies have found that COVID-19 infection can accelerate the development of cerebrovascular disease; previously published autopsy results of COVID-19 patients showed hyperemic and edematous brain tissue with some degenerated neurons [14-15]. Based on such findings, Avula et al. suggested the possibility of hypercoagulation leading to macro- and microthrombi formation in the vessels during COVID-19 infection [16]. Further, our observations of patients’ existing conditions, especially cancer, reflect the susceptibility of immunosuppressed patients.
Other coexisting conditions in multiple organs, including the kidney and GI tract, may be attributed to the viral entry pathway: upon entry into human cells, the spike protein of SARS-CoV-2 binds to the angiotensin-converting enzyme 2 (ACE2) receptors, which are also expressed in the heart, lungs, kidneys, and other organs [17-18]. Thus, our observations of a wide range of comorbidities among the COVID-19 patients may be explained by this mechanism.
Our research is significant in that we effectively utilized big data from an integrated model, the CDM, within one of the biggest national hospitals in South Korea. CDM research is conducted worldwide to vitalize medical research [19]. By using commonly formatted EMR records, our research reflects the real-world challenges posed by coronaviruses. Thus, we expect the results of our study to provide clinical insights that can be used as a basis for predicting the clinical presentation of yet another coronavirus-derived illness in the near future.
Data Availability
The data that support the findings of this study are available on request from the corresponding author, Yeon Hee Kim. The data are not publicly available as they contain information that could compromise the privacy of research participants.