Abstract
With the rapid spread of COVID-19, the real-time reverse transcription polymerase chain reaction (RT-PCR) test has become the gold standard for confirming COVID-19 infection. However, RT-PCR tests are complicated to operate, and it usually takes 5-6 hours or even longer to get the result. Additionally, because of the low viral load in early COVID-19 patients, RT-PCR tests give false negative results in a number of cases. Analyzing complex medical datasets with machine learning gives health care workers an excellent opportunity to develop a simple and efficient COVID-19 diagnostic system. This paper aims to extract risk factors from the clinical data of patients with early COVID-19 infection and to apply four traditional machine learning approaches, namely logistic regression (LR), support vector machine (SVM), decision tree (DT) and random forest (RF), together with a deep learning-based method, to the diagnosis of early COVID-19. The results show that the LR predictive model achieves a specificity of 0.95, an area under the receiver operating characteristic curve (AUC) of 0.971 and a sensitivity of 0.82, which makes it the best choice for screening early COVID-19 infection. We also verify the generality of the best model (the LR predictive model) in the Zhejiang population and analyze the contribution of the factors to the predictive models. Our manuscript describes and highlights the ability of machine learning methods to improve the accuracy and timeliness of early COVID-19 diagnosis. The higher AUC of our LR-based predictive model makes it better suited to assisting COVID-19 diagnosis. The optimal model has been packaged as a mobile application (APP) and deployed in some hospitals in Zhejiang Province.
Introduction
Cases of coronavirus disease 2019 (COVID-19) were first reported in Wuhan in December 2019. Soon after, the newly emerged virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), spread rapidly to over 200 countries and areas [1,2]. On March 11, 2020, the World Health Organization (WHO) declared the COVID-19 outbreak a global pandemic. As of March 25, 2020, over 1,946,000 confirmed cases and over 126,000 deaths had been reported. SARS-CoV-2 is a novel pathogen characterized by fast transmission and strong infectivity [3,4]. The early symptoms of COVID-19 are similar to those of other respiratory infectious diseases, which makes early differential diagnosis difficult [5-7]. So far, the RT-PCR test has been regarded as the gold standard for the diagnosis of COVID-19. However, RT-PCR tests are complicated to operate, and it usually takes 5-6 hours or even longer to get the results [8]. Additionally, because of the low viral load in patients with early COVID-19 infection, RT-PCR tests give false negative results in a number of cases [9,10]. These shortcomings have greatly hindered the prevention and control of the global pandemic. Thus, it is essential to establish a rapid diagnostic model to screen patients at high risk of COVID-19 infection.
In recent years, machine learning solutions have been widely used to predict diagnoses and individual risk factors for diseases and to support clinical decisions [11]. Numerous researchers have adopted different methods in an attempt to improve the precision of data classification in the medical field, since a method with superior classification precision is more robust when predicting unknown data [12-14]. Some machine learning methods have achieved remarkable results in the medical field [15,16]. Jagpreet Chhatwal et al. [17] used logistic regression to create a breast cancer risk estimation model based on the descriptors of the National Mammography Database (NMD) format, which can aid decision-making for the early detection of breast cancer. Maggipinto et al. [18] used the random forest method to identify patients with Alzheimer’s disease based on ADNI datasets; the method showed better accuracy and can be used to assist clinical diagnosis. Recently, machine learning approaches have been applied extensively to epidemic infectious diseases. Soyoung Hong et al. [19] used SVM with double class analysis for a MERS-CoV epidemiological study and discovered the relevance between two sequences of MERS-CoV. Wang Jia et al. [20] combined the CART decision tree algorithm with amino acid variation sites of viral proteins to construct a predictive model with higher accuracy for antigen mutation of influenza virus subtype H1. The combination of machine learning and medical data has become a main direction of development for meeting the needs of early diagnosis and prognosis assessment.
In this study, we attempted to identify the most appropriate algorithm for early COVID-19 detection based on clinical big data. We analyzed clinical data of 912 patients who were confirmed as having COVID-19 or other respiratory infectious diseases from 18 hospitals in Zhejiang Province, focusing on the extraction of risk factors and the construction of five types of classification models: SVM, LR, DT, RF and a deep neural network (DNN). Four epidemiological factors and six clinical manifestations were selected by a feature engineering approach as inputs to the diagnostic models, far fewer than the candidate features in the medical records. A diagnostic model constructed with fewer, meaningful clinical factors is practical for outpatient service. Clinical symptoms, laboratory tests and imaging findings play significant roles in the identification of COVID-19 infection [21]. To evaluate the contributions of clinical symptoms, laboratory tests and imaging information to the diagnostic models, we established predictive models based on the data excluding epidemiological information. The diagnostic models established with clinical symptoms, laboratory tests and imaging information alone performed worse. In other words, epidemiological information strongly affects the performance of COVID-19 predictive models. In brief, making full, integrated use of clinical manifestations and epidemiological characteristics is essential for constructing an early diagnosis model of COVID-19.
Materials and methods
Data construction
The COVID-19 dataset contains clinical information on 914 patients who were confirmed as having COVID-19 or other respiratory infectious diseases at 18 hospitals in Zhejiang between Jan 17 and Feb 19, 2020. To ensure the completeness of the clinical information, we first kept only the patients with complete clinical records, which left a total of 912 eligible patients. We then randomly split the eligible patients into training (80%) and validation (20%) partitions to train and validate our models. Subsequently, we collected 115 clinical records from other hospitals in Zhejiang as a test partition to verify the generality of the implemented models in the Zhejiang population.
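The random 80/20 partition described above can be sketched as follows. This is a minimal illustration only: the function name, the seed and the integer patient identifiers are assumptions, and the paper does not report how the shuffle was seeded or the exact partition sizes.

```python
import random


def split_train_validation(records, train_fraction=0.8, seed=0):
    """Shuffle records and split them into training and validation lists."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle (the seed is an assumption)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers standing in for the 912 eligible patient records.
patient_ids = list(range(912))
train_ids, validation_ids = split_train_validation(patient_ids)
print(len(train_ids), len(validation_ids))
```

The 115 records collected from other hospitals would be kept entirely outside this split, so that the test partition never influences model fitting or model choice.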
To obtain the datasets for the early-stage COVID-19 rapid diagnostic models, all selected patients were categorized as positive or negative cases. Patients who met any one of the following criteria were considered positive cases.
Positive RT-PCR test results in throat swab, sputum or blood samples.
The genetic sequences detected in the samples are highly homologous to the known SARS-CoV-2.
Positive cases are patients confirmed as having COVID-19 infection by RT-PCR. Conversely, negative cases are patients in whom COVID-19 infection was excluded by RT-PCR at least twice. The 912 eligible participants enrolled in this study include 361 COVID-19 infected patients (positive cases) and 551 COVID-19 non-infected patients (negative cases). Each patient’s clinical record contains 31 factors, including gender, age, coexisting diseases, epidemiological information, laboratory tests, clinical symptoms and imaging findings. Details of these 31 factors and their distributions in the training and validation datasets are shown in Table 1.
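The labelling rule above can be written as a small function. This is a sketch under stated assumptions: the record type and its field names are hypothetical, since the paper does not describe the schema of its clinical records, and returning None for an undetermined patient is an illustrative choice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatientTests:
    # Field names are hypothetical; the paper does not give its record schema.
    rt_pcr_positive: bool        # any positive RT-PCR (throat swab, sputum or blood)
    sequence_homologous: bool    # sequence highly homologous to known SARS-CoV-2
    rt_pcr_negative_count: int   # number of negative RT-PCR results


def label_case(tests: PatientTests) -> Optional[int]:
    """Return 1 for a positive case, 0 for a negative case, None if undetermined."""
    if tests.rt_pcr_positive or tests.sequence_homologous:
        return 1
    if tests.rt_pcr_negative_count >= 2:
        return 0
    return None  # a single negative RT-PCR is not enough to exclude infection


print(label_case(PatientTests(False, False, 2)))
```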
Feature selection
Feature selection picks effective factors from numerous features in order to reduce the dimension of the feature space and the classification error rate. We used an embedded feature engineering approach based on the logistic regression algorithm to select COVID-19 risk factors from the 31 factors mentioned above. Setting the selection threshold to 0.85 left 10 factors for the early COVID-19 prediction task. The selected factors include four epidemiological features (relationship with a cluster outbreak; travel or residence history in Wuhan over the past 14 days; exposure over the past 14 days to patients with fever or respiratory symptoms who had a travel or residence history in Wuhan; and exposure over the past 14 days to patients with fever or respiratory symptoms who had a travel or residence history in other areas with persistent local transmission, or in a community with definite cases) and six clinical manifestations (muscle soreness, dyspnea, fatigue, lymphocyte count (×10^9/L), white blood cell count (×10^9/L) and imaging changes on chest X-ray or CT). In practice, a diagnostic model built on a small number of non-redundant factors is practical for outpatient service. Details of the selected risk factors and their coefficients are shown in Table 2. The importance of a factor is given by the absolute value of its coefficient. Table 2 suggests that imaging changes on chest X-ray or CT are more important than the other factors. Table 2 also shows that the tolerance of each of these 10 factors is greater than 0.1 and the variance inflation factor of each is less than 10, which indicates that there is no collinearity among the selected factors.
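The two checks in this section, ranking factors by the absolute value of their logistic regression coefficients and screening for collinearity with the variance inflation factor (VIF, whose reciprocal is the tolerance), can be sketched with NumPy. Everything below is illustrative: the data are synthetic, the gradient-descent fit is a stand-in for whatever solver was actually used, and the cut-off of 0.5 is hypothetical because the paper does not say what quantity its threshold of 0.85 was applied to.

```python
import numpy as np


def fit_logistic_l2(X, y, l2=0.01, lr=0.5, n_iter=500):
    """Fit L2-penalised logistic regression by batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * (X.T @ (p - y) / n + l2 * w)  # log-loss gradient plus L2 term
        b -= lr * np.mean(p - y)
    return w, b


def select_features(w, threshold):
    """Indices of features whose absolute coefficient reaches the threshold."""
    return [j for j, coef in enumerate(w) if abs(coef) >= threshold]


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    n, d = X.shape
    vifs = []
    for j in range(d):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        residual = X[:, j] - others @ coef
        r2 = 1.0 - residual.var() / X[:, j].var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


# Synthetic data: only feature 0 carries signal about the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(float)

w, _ = fit_logistic_l2(X, y)
selected = select_features(w, threshold=0.5)  # hypothetical cut-off

# Append a near-copy of feature 0 to show what collinearity does to the VIF.
X_collinear = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=300)])
vifs = variance_inflation_factors(X_collinear)
tolerance = 1.0 / vifs
print(selected, np.round(vifs, 1))
```

In this toy example the duplicated column pushes the VIF of both copies far above 10, while the independent columns stay close to 1, which is the pattern the tolerance and VIF criteria in Table 2 are meant to rule out.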
Methodology
Machine learning models
In this study, we use four conventional machine learning algorithms and a deep learning solution to establish the early-stage COVID-19 rapid diagnostic models. We implement the LR model with an L2 regularization penalty and train three other models: an SVM with a radial basis function (RBF) kernel, an ID3 DT and an RF. The RF model is built from 50 decision trees using the information gain criterion. The deep learning-based method is a DNN, a four-layer network with hidden dimensions of 64, 32, 16 and 20, respectively. A softmax layer is added on top of the network to output the probability that a patient is infected with COVID-19.
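The DNN structure described above can be illustrated with a NumPy forward pass. This is a sketch, not the trained model: the weights are random, the input size of 10 (the selected factors) and the two-class output are assumptions, and the ReLU hidden activation is assumed because the paper does not name one.

```python
import numpy as np

# 10 inputs (assumed), hidden layers of 64, 32, 16 and 20 units, 2 output classes (assumed).
LAYER_SIZES = [10, 64, 32, 16, 20, 2]


def init_params(layer_sizes, seed=0):
    """Random weights and zero biases for each consecutive pair of layers."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(params, x):
    """Hidden layers with ReLU (assumed), then a softmax output layer."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)


params = init_params(LAYER_SIZES)
x = np.random.default_rng(1).normal(size=(5, 10))  # five synthetic patients
probs = forward(params, x)  # each row sums to 1 across the two classes
print(probs.shape)
```

With a two-class softmax, one output column can be read directly as the predicted probability of COVID-19 infection, which is then thresholded or ranked like the scores of the other four models.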
We evaluate the performance of the early-stage COVID-19 diagnostic models on the 20% validation partition using familiar assessment strategies, namely accuracy and the AUC obtained by plotting sensitivity against 1 - specificity. Classification accuracy is obtained at an optimum cut-off point. AUC summarizes recall across the full range of false positive rates, which makes it a robust measure of predictive model performance [22]. Models with a higher AUC have stronger identification and diagnostic capacity to assist health care workers. High sensitivity (the recall rate of positive cases) and high specificity (the recall rate of negative cases) play a vital role in screening infectious patients [23]. A model with high sensitivity can correctly identify patients infected with COVID-19 for timely treatment, while a model with high specificity can reliably screen out non-infected patients and thereby help avoid cross infection.
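The evaluation measures above can be computed in a few lines of plain Python. The sketch below uses made-up labels and scores. Choosing the cut-off by Youden's J (sensitivity + specificity - 1) is an assumption, since the paper does not state how its optimum cut-off point was found; the AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case.

```python
def sensitivity_specificity_accuracy(y_true, scores, cutoff):
    """Classify score >= cutoff as positive and return the three rates."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < cutoff)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)


def auc(y_true, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_cutoff(y_true, scores):
    """Cut-off maximising Youden's J (an assumed selection criterion)."""
    def youden(c):
        sens, spec, _ = sensitivity_specificity_accuracy(y_true, scores, c)
        return sens + spec - 1
    return max(sorted(set(scores)), key=youden)


# Made-up labels (1 = infected) and predicted probabilities for ten patients.
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02]

cutoff = best_cutoff(y_true, scores)
sens, spec, acc = sensitivity_specificity_accuracy(y_true, scores, cutoff)
print(cutoff, sens, round(spec, 3), acc, round(auc(y_true, scores), 3))
```

Lowering the cut-off raises sensitivity at the cost of specificity, which is why the AUC, being independent of any single cut-off, is reported alongside the three cut-off-dependent rates.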
Results
This section describes the experiments we conduct to evaluate the performance of the five types of predictive models. We evaluate the predictive models on the validation set and compare the validation results to find the best solution for identifying early COVID-19 infection. Finally, we test the best model on the test dataset to obtain a general diagnostic model for the Zhejiang population.
We implement multiple model structures and train them on different combinations of input features. Table 4 summarizes the performance of the conventional solutions and the deep learning-based method. Table 3 part (a) shows the performance of the predictive models built on the raw dataset with all 31 factors (Table 1), and part (b) shows the performance of the models built on the ten factors chosen by feature selection. Feature selection is intended for data dimensionality reduction [24]. In practice, diagnostic models constructed with fewer, meaningful clinical factors are more practical for outpatient services. The results in Table 3 demonstrate that the predictive models of part (b) perform slightly better than those of part (a) in terms of AUC. The sensitivity, specificity and accuracy of the models of part (b) are close to those of part (a). Thus, feature selection partly improves the performance of the COVID-19 diagnostic models; the ROC curves of some selected high-performing machine learning models are shown in Fig 1. Table 4 part (b) shows that LR combined with feature selection outperforms the other four methods, reaching an AUC of 0.971, a specificity of 0.95 and an accuracy of 0.90. These results suggest that the combination of LR and feature selection gives the best AUC and specificity among the five categories of classification methods. Higher specificity makes it easier to rule out infectious diseases such as COVID-19. In addition, the LR diagnostic model performs better than the diagnostic scale based on the clinical experience of experts, which yields an AUC of 0.86. Therefore, LR can be selected as the optimum classification model for early-stage COVID-19 rapid screening.
Against the background of the COVID-19 pandemic, clinical symptoms, laboratory tests and imaging findings are vital clinical criteria for the diagnosis of COVID-19 infection. To verify the contribution of these three kinds of indicators to the COVID-19 diagnostic models, we establish predictive models based on the dataset excluding epidemiological information. The performance of the various predictive models is shown in Table 3 (c) and (d). The results in parts (a) and (b) illustrate that epidemiological information is beneficial for constructing early COVID-19 rapid diagnostic models. In the absence of epidemiological information, the sensitivity, specificity and accuracy of the predictive models (part (c)) drop sharply compared with the models shown in part (a). In addition, the five types of machine learning approaches combined with feature selection are built on the dataset excluding epidemiological information, as shown in Table 4 part (d). Compared with part (c), the AUC of part (d) is slightly improved. However, because epidemiological information is absent, parts (c) and (d) perform worse than parts (a) and (b). In brief, this indirectly indicates that epidemiological information is essential for constructing early COVID-19 diagnostic models in the Zhejiang population.
The above results clearly illustrate that the combination of the traditional logistic regression method and feature selection is well suited to predicting early COVID-19 infection. Moreover, constructing a highly precise diagnostic model relies on integrating and making the most of clinical symptoms, laboratory tests, imaging findings and epidemiological information.
Moreover, the LR algorithm proves to be the best of the five classification solutions for early COVID-19 rapid screening. We use the test dataset to verify the generality of the optimum diagnostic model. As shown in Table 5, the sensitivity, specificity, accuracy and AUC of the LR + FS model on the test dataset are 0.87, 0.95, 0.91 and 0.95, respectively. These results show that the predictive model built by combining logistic regression and feature selection is generally applicable in Zhejiang Province as an early COVID-19 rapid diagnostic tool.
Discussion
Against the background of the COVID-19 pandemic, the early prevention and control of COVID-19 still face severe challenges. According to reports, the most common early symptoms of COVID-19 are fever, cough, fatigue and myalgia, followed by diarrhea, nausea, headache and sore throat [25,26]. As the disease progresses, some infected patients, especially those with low immune function, gradually develop dyspnea [21,27]. Additionally, complications such as arrhythmia, shock and acute respiratory distress syndrome (ARDS) are probably related to a poor prognosis [28,29]. Thus, early prediction of suspected patients and early aggressive treatment of confirmed patients are the key to reducing cross infection and mortality. CT scanning has become the main auxiliary tool for screening COVID-19 cases. However, CT scans cannot be used to identify specific viral infections [30]. Moreover, some COVID-19 patients present with normal pulmonary imaging in the early stage [31]. Clinical symptoms and laboratory tests are sometimes non-specific for early COVID-19 infection [21,32]. At present, RT-PCR is still the accepted detection method for the diagnosis of COVID-19 infection, but its long turnaround time and the instability of its results remain the most pressing problems [8]. Therefore, to improve the timeliness of early COVID-19 diagnosis, it is essential to develop a decision-making tool that assists the early diagnosis of COVID-19 patients in fever clinics.
Current studies that analyze the symptoms and laboratory examination results of COVID-19 patients mainly focus on predicting mortality risk and disease progression [33]. Only a few studies aim at early COVID-19 diagnosis. Zirui Meng et al. [34] selected nine representative variables (age, Activated Partial Thromboplastin Time, Red Blood Cell Distribution Width-SD, Uric Acid, Triglyceride, Serum Potassium, Albumin/globulin, 3-Hydroxybutyrate and Serum Calcium) and constructed an optimized diagnostic model through Lasso regression screening and multivariate logistic regression based on 431 samples. The AUCs of their early COVID-19 screening model in the testing set and the independent validation cohort were 0.890 and 0.872. Cong Feng et al. [35] used logistic regression with Lasso regression for feature selection and screening model development based on clinical data of 132 recruited patients. The final chosen features include 1 demographic variable (age); 4 variables of vital signs (e.g., Temperature (TEM), Heart rate (HR), etc.); 5 variables of blood routine values (e.g., Platelet count (PLT), Monocyte ratio (MONO%), Eosinophil count (EO#), etc.); 7 variables of clinical signs and symptoms (e.g., Fever, Fever classification, Shiver, etc.); and 1 infection-related biomarker (Interleukin-6 (IL-6)). Their model built on the final selected features achieved AUCs of 0.841 and 0.938 and specificities of 0.727 and 0.778 in the held-out testing set and the validation cohort, respectively. In our study, we selected four epidemiological features and six clinical manifestations from the raw dataset of 31 factors, developed multiple models with various machine learning algorithms and identified an optimum early COVID-19 diagnostic model with an AUC of 0.971. We tested the best model, based on LR, on the external test dataset, where its AUC and specificity were 0.950 and 0.95, respectively.
Compared with previous studies, we selected fewer risk factors from a larger clinical dataset, and the early COVID-19 diagnostic model we established performs better and is more suitable for assisting clinical diagnosis. Moreover, our study is based on a large clinical dataset of 912 patients who were confirmed to have early COVID-19 infection or other respiratory infectious diseases, which may help to mine more potential clinical information and improve the generalization ability of the diagnostic models. Considering the indisputable role that epidemiological features play in the clinical diagnosis of infectious diseases [36], we specifically studied the role of epidemiological information in the diagnostic models. We found that the lack of epidemiological information greatly reduced the accuracy, specificity and sensitivity of the model. This means that epidemiological information is vital for building an accurate COVID-19 diagnostic tool, and it calls into question the utility and reliability of previously reported diagnostic models that do not use it.
Nevertheless, this study still has several limitations. First, the recruited participants are limited to Zhejiang Province, which places regional restrictions on the application of the predictive models. Further nationwide studies that pay close attention to epidemiological characteristics are needed to assess the generality of the suggested model. Second, there is a lack of information on the progression and prognosis of COVID-19 and on asymptomatic infection cases. Finally, more information on infections should be collected to improve the accuracy of the screening model.
Conclusion
In our study, ten representative factors with significant identification value were selected and used to construct diagnostic models. The model based on logistic regression can be used as a simple, fast and effective tool for diagnosing early COVID-19 infection, with significant clinical value.
Data Availability
All data are fully available without restriction.
Supporting information
Acknowledgments
We thank everyone who has been fighting this outbreak. This work was supported by a grant from the emergency project of the key research and development plan in Zhejiang Province (2020C03123-2) and by the National Science and Technology Major Project (2017ZX10204401).