PT - JOURNAL ARTICLE
AU - Wei Tse Li
AU - Jiayan Ma
AU - Neil Shende
AU - Grant Castaneda
AU - Jaideep Chakladar
AU - Joseph C. Tsai
AU - Lauren Apostol
AU - Christine O. Honda
AU - Jingyue Xu
AU - Lindsay M. Wong
AU - Tianyi Zhang
AU - Abby Lee
AU - Aditi Gnanasekar
AU - Thomas K. Honda
AU - Selena Z. Kuo
AU - Michael Andrew Yu
AU - Eric Y. Chang
AU - Mahadevan “Raj” Rajasekaran
AU - Weg M. Ongkeko
TI - Using Machine Learning of Clinical Data to Diagnose COVID-19
AID - 10.1101/2020.06.24.20138859
DP - 2020 Jan 01
TA - medRxiv
PG - 2020.06.24.20138859
4099 - http://medrxiv.org/content/early/2020/06/24/2020.06.24.20138859.short
4100 - http://medrxiv.org/content/early/2020/06/24/2020.06.24.20138859.full
AB - The recent pandemic of Coronavirus Disease 2019 (COVID-19) has placed severe stress on healthcare systems worldwide, which is amplified by the critical shortage of COVID-19 tests. In this study, we propose to generate a more accurate diagnostic model of COVID-19 based on patient symptoms and routine test results by applying machine learning to reanalyze COVID-19 data from 151 published studies. We aimed to investigate correlations between clinical variables, cluster COVID-19 patients into subtypes, and generate a computational classification model for discriminating between COVID-19 patients and influenza patients based on clinical variables alone. We discovered several novel associations between clinical variables, including correlations between being male and having higher levels of serum lymphocytes and neutrophils. We found that COVID-19 patients could be clustered into subtypes based on serum levels of immune cells, gender, and reported symptoms. Finally, we trained an XGBoost model to achieve a sensitivity of 92.5% and a specificity of 97.9% in discriminating COVID-19 patients from influenza patients.
We demonstrated that computational methods trained on large clinical datasets could yield ever more accurate COVID-19 diagnostic models to mitigate the impact of the lack of testing. We also presented previously unknown COVID-19 clinical variable correlations and clinical subgroups.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: University of California, Office of the President/Tobacco-Related Disease Research Program Emergency COVID-19 Research Seed Funding Grant (R00RG2369) to W.M.O.
Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: N/A
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes
The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.
CRP: C-reactive Protein
ANOVA: Analysis of Variance
SOM: Self-organizing map
XGBoost: Extreme Gradient Boosting
ROC: Receiver Operating Characteristic
AUC: Area Under the Curve
PR: Precision-Recall