PT  - JOURNAL ARTICLE
AU  - Wei Tse Li
AU  - Jiayan Ma
AU  - Neil Shende
AU  - Grant Castaneda
AU  - Jaideep Chakladar
AU  - Joseph C. Tsai
AU  - Lauren Apostol
AU  - Christine O. Honda
AU  - Jingyue Xu
AU  - Lindsay M. Wong
AU  - Tianyi Zhang
AU  - Abby Lee
AU  - Aditi Gnanasekar
AU  - Thomas K. Honda
AU  - Selena Z. Kuo
AU  - Michael Andrew Yu
AU  - Eric Y. Chang
AU  - Mahadevan “Raj” Rajasekaran
AU  - Weg M. Ongkeko
TI  - Using Machine Learning of Clinical Data to Diagnose COVID-19
AID  - 10.1101/2020.06.24.20138859
DP  - 2020 Jan 01
TA  - medRxiv
PG  - 2020.06.24.20138859
4099  - http://medrxiv.org/content/early/2020/06/24/2020.06.24.20138859.short
4100  - http://medrxiv.org/content/early/2020/06/24/2020.06.24.20138859.full
AB  - The recent pandemic of Coronavirus Disease 2019 (COVID-19) has placed severe stress on healthcare systems worldwide, which is amplified by the critical shortage of COVID-19 tests. In this study, we propose to generate a more accurate diagnostic model of COVID-19 based on patient symptoms and routine test results by applying machine learning to reanalyze COVID-19 data from 151 published studies. We aimed to investigate correlations between clinical variables, cluster COVID-19 patients into subtypes, and generate a computational classification model for discriminating between COVID-19 patients and influenza patients based on clinical variables alone. We discovered several novel associations between clinical variables, including correlations between being male and having higher levels of serum lymphocytes and neutrophils. We found that COVID-19 patients could be clustered into subtypes based on serum levels of immune cells, gender, and reported symptoms. Finally, we trained an XGBoost model to achieve a sensitivity of 92.5% and a specificity of 97.9% in discriminating COVID-19 patients from influenza patients. We demonstrated that computational methods trained on large clinical datasets could yield ever more accurate COVID-19 diagnostic models to mitigate the impact of lack of testing. 
We also presented previously unknown COVID-19 clinical variable correlations and clinical subgroups. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: University of California, Office of the President/Tobacco-Related Disease Research Program Emergency COVID-19 Research Seed Funding Grant (R00RG2369) to W.M.O. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: N/A. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request. Abbreviations: CRP, C-reactive protein; ANOVA, analysis of variance; SOM, self-organizing map; XGBoost, Extreme Gradient Boosting; ROC, receiver operating characteristic; AUC, area under the curve; PR, precision-recall.