The Role of Machine Learning Techniques to Tackle COVID-19 Crisis: A Systematic Review.

Background: The novel coronavirus responsible for COVID-19 has caused havoc, with patients presenting a spectrum of complications that has forced healthcare experts around the globe to explore new technological solutions and treatment plans. Machine learning (ML) based technologies have played a substantial role in solving complex problems, and several organizations have been swift to adopt and customize them in response to the challenges posed by the COVID-19 pandemic.
Objective: The objective of this study is to conduct a systematic literature review on the role of ML as a comprehensive and decisive technology to fight the COVID-19 crisis in the arena of epidemiology, diagnosis, and disease progression.
Methods: A systematic search in PubMed, Web of Science, and CINAHL databases was performed according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to identify all potentially relevant studies published and made available between December 1, 2019, and June 27, 2020. The search syntax was built using keywords specific to COVID-19 and ML. A total of 128 qualified articles were reviewed and analyzed based on the study objectives.
Results: The 128 publications selected were classified into three themes based on the ML applications employed to combat the COVID-19 crisis: Computational Epidemiology (CE), Early Detection and Diagnosis (EDD), and Disease Progression (DP). Of the 128 studies, 70 focused on predicting the outbreak, the impact of containment policies, and potential drug discoveries, and were grouped into the CE theme. For the EDD theme, we grouped forty studies that applied ML techniques to detect the presence of COVID-19 using the patient's radiological images or lab results. Eighteen publications that focused on predicting disease progression, outcomes (recovery and mortality), Length of Stay (LOS), and the number of Intensive Care Unit (ICU) days for COVID-19 positive patients were classified under the DP theme.
Conclusions: In this systematic review, we assembled the current COVID-19 literature that utilized ML methods to provide insights into the COVID-19 themes, highlighting the important variables, data types, and available COVID-19 resources that can assist in facilitating clinical and translational research.

COVID-19 is a worldwide health crisis: more than 16 million people have been infected and over 666,000 have died across the globe (as of 29 July 2020) [1]. As a result, many countries have overstretched their healthcare resources to mitigate the spread of the pandemic [2]. There is an urgent need for effective drugs and vaccines to treat and prevent the infection. Due to the lack of validated therapeutics, most of the containment measures to curtail the spread rely on social distancing, quarantine, and lockdown policies [2][3][4]. The transmission has been slowed but not eliminated; with the easing of restrictions, there is a fear of a second wave of infection [5,6]. To restrict a potential second outbreak, advanced containment measures such as contact tracing and hotspot identification are now needed [7,8].
There is a high degree of variance in COVID-19 symptoms, ranging from mild flu to acute respiratory distress syndrome or fulminant pneumonia [9][10][11]. Machine Learning (ML) techniques have been employed on different scales, ranging from prediction of the disease spread trajectory to the development of diagnostic and prognostic models [12,13]. A wide range of data types, including social media, radiological images, omics data, drug databases, and data collected from public health agencies, have been used for prediction [1][14][15][16][17][18]. Several studies have reviewed published articles and papers that apply Artificial Intelligence (AI) to support the coronavirus response [12,13,19,20]. One of them, a study by Wynants et al [13], focused on the critical appraisal of models that aimed to predict the risk of developing the disease, hospital admissions, and progression. However, the majority of epidemiological studies that aimed to model disease transmission, fatality rate, and similar quantities were excluded from that appraisal.
The primary aim of this study is to conduct a systematic literature review on the role of ML as a technology to combat the COVID-19 crisis and to assess its application in the epidemiological, clinical, and molecular advancements. Specifically, we summarized the area of application, data types used, types of AI and ML methods employed and their performance, scientific findings, and challenges experienced in adopting this technology.
This systematic literature review followed the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework for preparation and reporting [21].
This study focused on peer-reviewed publications, as well as preprints, that applied ML techniques to analyze and address the COVID-19 crisis on different scales, including diagnostics, prognostics, disease spread forecast, omics, and drug development. The search strategy was developed with the guidance of a professional librarian and included the following search terms: "CORONAVIRUS", "COVID-19", "covid19", "cov-19", "cov19", "severe acute respiratory syndrome coronavirus 2", "Wuhan coronavirus", "Wuhan seafood market pneumonia virus", "coronavirus disease 2019 virus", "SARS-CoV-2", "SARS2", "SARS-2", "2019-nCoV", "2019 novel coronavirus", "novel corona", "Machine Learning", "Artificial Intelligence", "Deep Learning", and "Neural Network". Refer to Multimedia Appendix 1 for search query syntax.
medRxiv preprint, doi: https://doi.org/10.1101/2020.08.23.20180158; this version posted August 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Of the 128 studies, 70 focused on predicting the outbreak, the impact of containment policies, and potential drug discoveries, and were grouped into CE. Forty studies that applied ML techniques to detect COVID-19 using patients' radiological images or lab results were grouped into EDD.
We identified 18 publications that focused on predicting disease progression, outcomes (recovery and mortality), Length of Stay (LOS), and the number of Intensive Care Unit (ICU) days for COVID-19 positive patients, which were grouped under the DP theme. The trend of COVID-19 articles over time, by month and theme, is shown in Figure 2, which depicts an initial surge of publications focusing on the CE theme, followed by EDD.
The CE theme was further divided into three groups (Table 2). Forty studies that focused on predicting COVID-19 peaks and sizes, both globally and for specific geographical locations, on estimating the impact of socioeconomic factors and environmental conditions on the spread of the disease, and on the effectiveness of social distancing policies in containing the spread were grouped into the CDT. Twenty-two studies were grouped under the MADD based on their approach: identifying existing drugs that have the potential to treat COVID-19, protein structure analysis, and predicting the mutation rate in COVID-19 positive patients. The FCR includes 8 studies that emphasize building tools to combat the ongoing pandemic, such as a COVID-19 imaging repository and AI-enabled automatic cleaning and sanitizing at healthcare facilities, that might help clinical practitioners provide timely services to the affected population. A majority of the studies in the CE theme used data either from social media (8 studies used data from Twitter, Weibo, or Facebook) or from public data repositories such as NCBI, Drug Bank databases, and other health agencies' data.
Refer to Multimedia Appendix 2 for individual study details.

MADD: Publications focused on identifying existing drugs that have the potential to treat COVID-19, analysis of protein structure, and predicting mutation rate in the COVID-19 positive patients.
Facilitate COVID-19 Response (FCR): Publications focused on building tools to combat the ongoing pandemic, such as a COVID-19 imaging repository and AI-enabled automatic cleaning and sanitizing tasks at healthcare facilities, that might help clinical practitioners to provide timely services to the affected population.
We identified forty studies that primarily focused on diagnosing COVID-19 in patients with suspected infection, mostly using chest radiological images such as Computed Tomography (CT), X-Radiation (X-Ray), and Lung Ultrasound (LUS) (see Table 3).
In this systematic review, we assembled the current COVID-19 literature that utilized ML methods and identified 3 themes: models developed to address issues central to epidemiology; models that aid the diagnosis of patients with COVID-19; and models that help the prognosis of COVID-19 patients.
On average, the conventional drug discovery process takes 10 to 15 years and has very low success rates [146]. Instead, drug repurposing attempts have been made to explore similarities between COVID-19 and other viral diseases such as Severe Acute Respiratory Syndrome (SARS) and Acquired Immunodeficiency Syndrome (AIDS) [147]. With the rapid accumulation of genetic and other biomedical data in recent years, ML techniques facilitate the analysis of already available drugs and chemical compounds to find new therapeutic indications [148].
The main protease (Mpro) of COVID-19 is a key enzyme in polyprotein processing and plays an important role in mediating viral replication and transcription [149]. Because this makes Mpro an attractive drug target, several studies have applied ML techniques to identify drug leads that target the Mpro of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [150,151].
[...] 3C-like proteinase. Computational drug repositioning ML models provide a fast and cost-effective way to identify promising repositioning opportunities and to expedite approval procedures [148,152].
Genome sequencing of various viruses is done to identify regions of similarity that may have consequences for functional, structural, or evolutionary relationships [153]. Due to the heavy computational requirements of traditional alignment-based methods, alignment-free genome comparison methods are gaining popularity [153,154]. A case study by Randhawa et al [83] proposed an ML-based alignment-free approach for ultra-fast, inexpensive taxonomic classification of whole COVID-19 virus genomes, which can be used to classify COVID-19 pathogens in real time.
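To illustrate what alignment-free comparison means in practice, the following minimal Python sketch represents each genome by its normalized k-mer frequency vector and assigns a query sequence to the closest labelled reference. It is an illustration under stated assumptions, not the method of Randhawa et al [83]: the choice of k = 3, the cosine distance, the nearest-reference rule, and all function names are assumptions made here.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(sequence, k=3):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fixed ordering over all 4**k possible k-mers so that vectors are comparable.
    alphabet = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in alphabet)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[kmer] / total for kmer in alphabet]


def cosine_distance(u, v):
    """Cosine distance between two non-zero frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_label(query, references, k=3):
    """Assign the label of the reference whose k-mer profile is closest.

    `references` maps a label to a reference sequence (hypothetical input
    format chosen for this sketch).
    """
    q = kmer_profile(query, k)
    return min(
        references,
        key=lambda label: cosine_distance(q, kmer_profile(references[label], k)),
    )
```

No alignment is computed at any point: each sequence is reduced to a fixed-length vector, so the cost of a comparison depends on the number of possible k-mers rather than on aligning two long genomes. A call such as `nearest_label(query, {"group_A": seq_a, "group_B": seq_b})` returns the label of the closest reference; the labels and sequences are placeholders.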
COVID-19 lockdown and home-confinement restrictions is having adverse impact on the mental well-being of the general population and specific to high-risk groups including health care workers, children, and older adults [155]. Several studies were done to understand and respond to these public health emergencies. Li et al [62] conducted a study using ML (Support Vector Machine) model and sentiment analysis to explore the impacts of COVID-19 on people's mental health and to assist policymakers in developing actionable policies that could aid clinical . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted August 25, 2020. . There is a growing trend of collaborations among researchers and scientists around the world to combat the COVID-19 pandemic [159]. The emergence of COVID-19 encouraged public health agencies and the scientific community including journals and publishers to promote and ensure that research findings and data relevant to this outbreak are shared rapidly and openly [13,160] . is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted August 25, 2020. . https://doi.org/10.1101/2020.08.23.20180158 doi: medRxiv preprint progress [161]. The N3C will support collection and analysis of clinical, laboratory and diagnostic data from hospitals and health care plans. N3C along with the imaging repository such as the COVID-19-CT-CXR will accelerate clinical and translational research.
During the initial days of the COVID-19 spread, most research focused on building mathematical models to estimate the transmission dynamics and predict COVID-19 developments [162,163]. Specifically, Susceptible-Exposed-Infectious-Recovered (SEIR) and Auto-Regressive Integrated Moving Average (ARIMA) models and their extensions were widely adopted for the projection of COVID-19 cases [164]. These models helped healthcare and government officials identify intervention strategies and control measures to combat the pandemic [164]. Yang et al [58] and Moftakhar et al [43] used ML models alongside the SEIR and ARIMA models, respectively: the Long Short-Term Memory (LSTM) model of Yang et al [58] and the Artificial Neural Network (ANN) model of Moftakhar et al [43] each fit the corresponding mathematical model well. However, the projections of both the SEIR and ARIMA models still deviated from the reported data, within a range of ±15% [164].
Future studies should try to fit ML techniques on both the SEIR and ARIMA models to reduce the projection error rate and to be prepared for a second wave of COVID-19.
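For readers unfamiliar with the compartmental models mentioned above, the sketch below integrates the textbook SEIR equations with a simple forward Euler scheme. The parameter values (transmission rate `beta`, incubation rate `sigma`, recovery rate `gamma`), the population size, and the function names are illustrative assumptions and are not taken from any of the reviewed studies.

```python
def simulate_seir(population, exposed0, infectious0,
                  beta, sigma, gamma, days, dt=0.1):
    """Forward-Euler integration of the standard SEIR model.

    dS/dt = -beta * S * I / N
    dE/dt =  beta * S * I / N - sigma * E
    dI/dt =  sigma * E - gamma * I
    dR/dt =  gamma * I
    Returns one (S, E, I, R) tuple per day, starting with day 0.
    """
    s = float(population - exposed0 - infectious0)
    e, i, r = float(exposed0), float(infectious0), 0.0
    steps_per_day = round(1 / dt)
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


def peak_infectious(history):
    """Largest number of simultaneously infectious people in a run."""
    return max(i for _, _, i, _ in history)


if __name__ == "__main__":
    # Assumed, illustrative parameters: halving beta stands in for a
    # containment policy that halves the contact rate.
    baseline = simulate_seir(1_000_000, 0, 10, 0.5, 0.2, 0.1, 200)
    contained = simulate_seir(1_000_000, 0, 10, 0.25, 0.2, 0.1, 200)
    print(round(peak_infectious(baseline)), round(peak_infectious(contained)))
```

Lowering `beta` is the simplest way to represent a containment policy in such a model, which is why projections built on data from one country can miss when another country enforces different restrictions.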
In the early days of the pandemic, the majority of the studies in our review predicted potential hotspots and outbreak trends using COVID-19 data from China, expecting similar epidemic growth elsewhere. However, the projections were off target due to the varying containment policies enforced by different countries [164,165]. Yang et al [58] used an ML technique to predict the COVID-19 epidemic peaks and sizes with respect to the containment policies. The study revealed that continual enforcement of quarantine restrictions, early detection, and subsequent isolation were most effective in containing the spread. Relaxing these policies would increase the spread three-fold for a five-day delay in implementation and could cause a second peak. Such policies should be strictly enforced to prevent a second coronavirus outbreak.
Many countries ramped up the production of real-time reverse transcription polymerase chain reaction (RT-PCR) testing kits to diagnose COVID-19, and at the time of writing RT-PCR remains the gold standard for confirmation [166]. However, the lab test suffers from low sensitivity, as reported by several studies [166,167]. Radiological images (CT and X-Ray) have been used by clinicians to confirm COVID-19 positive cases and serve as an important complement to the RT-PCR test [168]. Several studies have reported that chest CT used for early-stage detection has a low rate of misdiagnosis and provides accurate results even in some asymptomatic cases [169].
Researchers started applying ML techniques on radiological scans to distinguish between [...] Neural Network (CNN) models and then automated the detection of COVID-19 using chest X-[...]
[...] [115] using transfer learning and a CNN (Xception) architecture with 71 layers that was trained on the ImageNet dataset. Their model (CoroNet) achieved an average accuracy of 87% in detecting COVID-19 on a dataset of 284 COVID-19, 657 pneumonia (both viral and bacterial), and 310 normal chest X-Ray images. While chest X-Rays are cost-effective and deliver a much lower radiation dose than chest CTs, they are less sensitive, especially in the early stages of the infection and in mild cases [172]. We recommend that newer studies develop ML models that can detect COVID-19 from a combination of CT and X-Ray images to aid clinical practitioners.
LUS proved useful during the 2009 influenza epidemic (H1N1) by accurately differentiating viral and bacterial pneumonia, and was found to have higher sensitivity than chest X-Ray in detecting avian influenza (H7N9) [173]. Though clinicians recommend the use of LUS imaging in the emergency room for the diagnosis and management of COVID-19, its role is still unclear [174]. In our review, we identified one study, by Roy et al [126], that used a deep learning model on an annotated LUS COVID-19 dataset to predict disease severity. The results of the study were reported to be "satisfactory".
In general, deep learning (DL) techniques are employed to improve prediction accuracy by training on large volumes of data [175]. In our review, several studies applied ML techniques using either smaller imaging datasets specific to an organization or mid-to-large datasets from publicly available repositories. However, developing and maintaining such repositories is very costly [176]. To overcome the data size and cost limitations, Xu et al [108] proposed a decentralized AI architecture to build a generalizable ML model that is distributed and trained on in-house client datasets, eliminating the need for sharing sensitive clinical data. The proposed framework is in the early phase of adoption and needs technical improvements before it is widely employed by participating healthcare organizations.
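One common way to train a shared model without moving patient data is federated averaging, in which each site updates the model locally and only the model weights are sent to a coordinating server. The sketch below is a toy illustration of that general idea on a one-variable linear model. It is an assumption made here for illustration and is not necessarily the architecture proposed by Xu et al [108]; all function names, hyperparameters, and data are hypothetical.

```python
def local_update(weights, data, lr=0.05, epochs=20):
    """Run a few epochs of gradient descent on one client's private data.

    `data` is a list of (x, y) pairs that never leaves the client;
    only the updated (w, b) pair is returned.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradient of the mean squared error of the model y = w * x + b.
        grad_w = 2 * sum((w * x + b - y) * x for x, y in data) / n
        grad_b = 2 * sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Average client weights, weighted by each client's sample count."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


def train_federated(clients, rounds=100):
    """Alternate local training and server-side averaging."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in clients]
        sizes = [len(data) for data in clients]
        global_weights = federated_average(updates, sizes)
    return global_weights


if __name__ == "__main__":
    # Two hypothetical hospitals whose data follow y = 2x + 1.
    hospital_a = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
    hospital_b = [(1.5, 4.0), (2.0, 5.0), (2.5, 6.0), (3.0, 7.0)]
    print(train_federated([hospital_a, hospital_b]))
```

The server in this sketch only ever sees weight pairs, which is the property that removes the need to share sensitive clinical records; real deployments add secure communication and other safeguards that are outside the scope of this toy example.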
As an alternative to the RT-PCR test, Joshi et al [113] proposed an ML approach that utilizes only Complete Blood Count (CBC) and gender to predict COVID-19 positivity. The authors built a logistic regression model on retrospective data collected from a single institute and validated it on multi-institute data. The model predicted coronavirus positivity with a C-statistic of 78% and a sensitivity of 93%. The goal of the study was to develop a decision support tool that integrates readily available lab results from electronic health records (EHRs).
The novel coronavirus (COVID-19) pandemic has strained global healthcare systems, especially ICUs, because hospitalized patients have higher ICU transfer rates [133]. Identifying high-risk hospitalized patients in advance may help healthcare providers plan and prepare ICU resources (beds, ventilators, staff, etc.) [177]. Cheng et al [133] developed an ML-based model to predict ICU transfers within 24 hours of hospital admission.
The Random Forest model used for prediction was based on the following variables: vital signs, nursing assessments, lab results, and electrocardiograms collected during hospitalization. The overall AUROC of the model was reported to be 79.9%. Similar work was done by Shashikumar et al [139] to predict the need for ventilation in hospitalized patients 24 hours in advance. The prediction was not limited to COVID-19 patients but also covered the general hospitalized population. The authors used 40 clinical variables: 6 demographic and 34 dynamic variables (including lab results, vital signs, Sequential Organ Failure Assessment (SOFA), comorbidity, and length of stay). In contrast to the traditional ML model used by Cheng et al [133], Shashikumar et al [139] resorted to a DL model (VentNet) for prediction, with an Area Under the Curve (AUC) of 0.882 for the general ICU population and 0.918 for patients with COVID-19.
Both of the aforementioned studies relied on clinical variables for prediction, whereas a study by Burian et al [130] combined clinical and imaging parameters to estimate the need for ICU treatment. The major finding of the study was that patients who needed ICU transfers had significantly increased Interleukin-6 (IL-6), C-reactive protein (CRP), and leukocyte counts and significantly decreased lymphocyte counts. All studies in this category applied ML techniques to facilitate the efficient use of clinical resources and help hospitals plan their flow of operations to fight the ongoing pandemic.
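The performance figures quoted in this section (sensitivity, C-statistic, AUROC or AUC) can be computed from a model's outputs as sketched below. For a binary classifier the C-statistic equals the AUROC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels, scores, and the 0.5 decision threshold are made-up values for illustration only and do not come from any reviewed study.

```python
def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no positive cases in y_true")
    return tp / (tp + fn)


def auroc(y_true, scores):
    """AUROC (C-statistic) by direct comparison of positive/negative pairs."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical labels (1 = event occurred) and predicted risk scores.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1, 0.7]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print(sensitivity(y_true, y_pred), auroc(y_true, scores))
```

Sensitivity depends on the chosen threshold, whereas the AUROC summarizes ranking quality across all thresholds, so the two numbers are not interchangeable when comparing studies.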
Developing a risk stratification mechanism for COVID-19 patients helps to facilitate timely assessments, the allocation of hospital resources, and appropriate decision making [178]. The model of Jiang et al [136] used demographics, vital signs, comorbidities, and lab results to predict which patients are likely to develop Acute Respiratory Distress Syndrome (ARDS). Of these variables, the lab result alanine aminotransferase (ALT), the presence of myalgias, and elevated hemoglobin were the most predictive features. The overall accuracy of predicting ARDS was 80%. Moreover, using ALT alone, the model achieved an accuracy of 70%. However, the conclusion was drawn from a limited set of patients (n=53).
Yadaw et al [144] evaluated different ML models to classify COVID-19 patients as deceased or alive. The classification was based on five features: age, minimum oxygen saturation during the encounter, type of patient encounter, hydroxychloroquine use, and maximum body temperature. The study revealed that age and minimum oxygen saturation during the encounter were the [...]
[...] [137] built a pulmonary disease severity score using X-Rays and neural network models. The score was computed as the Euclidean distance between the patient's image and a pool of normal images using a Siamese neural network. The score predicted (AUROC 0.80) subsequent intubation or death within three days of hospital admission for patients who were not initially intubated.
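The distance-based scoring idea can be sketched as follows, assuming that a trained Siamese network has already mapped each image to an embedding vector. The embedding step is not shown, the use of the median to aggregate the distances to the pool of normal images is an assumption made here, and the function name is hypothetical.

```python
from math import dist
from statistics import median


def severity_score(patient_embedding, normal_embeddings):
    """Median Euclidean distance from a patient to a pool of normal images.

    Both arguments hold embedding vectors (sequences of floats) that are
    assumed to come from the same trained network. A larger score means
    the patient's image lies further from the normal pool.
    """
    if not normal_embeddings:
        raise ValueError("the pool of normal embeddings is empty")
    return median(dist(patient_embedding, ref) for ref in normal_embeddings)


if __name__ == "__main__":
    # Made-up 2-D embeddings; real embeddings have many more dimensions.
    normal_pool = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    print(severity_score([0.2, 0.2], normal_pool))  # close to the normal pool
    print(severity_score([3.0, 4.0], normal_pool))  # far from the normal pool
```

Because the score is a continuous distance rather than a class label, it can be thresholded or fed into a downstream model to predict outcomes such as intubation.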
This review has some inherent limitations. First, some studies may have been missed because of the search methodology. Second, we removed five publications for which the full text was not available, and this may have introduced bias. Third, we included studies that were available as preprints.
Finally, a comparison of ML model performance was not possible in the quantitative descriptive analysis because the variables, sample sizes, and data sources were diverse across the studies. The current systematic review includes studies that were available online as of June 27, 2020. As the pandemic progresses, we intend to write a second review on the studies published after that date.
[...] themes: epidemiology, early detection, and disease progression, highlighting the important [...]
