PT - JOURNAL ARTICLE
AU - Vijendra Ramlall
AU - Benjamin May
AU - Nicholas P Tatonetti
TI - Using machine learning probabilities to identify effects of COVID-19
AID - 10.1101/2022.07.02.22277179
DP - 2022 Jan 01
TA - medRxiv
PG - 2022.07.02.22277179
4099 - http://medrxiv.org/content/early/2022/07/05/2022.07.02.22277179.short
4100 - http://medrxiv.org/content/early/2022/07/05/2022.07.02.22277179.full
AB - COVID-19, the disease caused by the SARS-CoV-2 virus, has had and continues to have extensive economic, social and public health impacts in the United States and around the world. To date, there have been more than 500 million reported cases of SARS-CoV-2 infection worldwide, with more than 6 million reported deaths; more than 80 million of those cases and more than 1 million of those deaths have been reported in the United States. Retrospective analysis throughout the pandemic, which identified comorbidities, risk factors and treatments, has underpinned the response to COVID-19. As the situation transitions from pandemic to endemic, retrospective analyses using electronic health records will be increasingly important to identify long-term effects of COVID-19. However, these analyses can be complicated by the incompleteness of electronic health records, which in turn makes it difficult to differentiate visits where the patient has COVID-19. To address this, we trained a random forest classifier to assign a probability of a patient having been diagnosed with COVID-19 during each visit using demographic data, temporal data and visit-specific diagnoses (Training AUROC = 0.9867, Training OOB AUROC = 0.8957, Evaluation AUROC = 0.8958).
Using these probabilities, we identified conditions associated with higher COVID-19 probabilities, both irrespective of clinical history and when accounting for previous diagnoses, and estimated the hazard ratios for myocardial infarction (Hazard ratio = 121.736 (87.375, 169.611), p = 3.796E-177 and Hazard ratio = 80.262 (4.134, 4.637), p = 4.543E-256, respectively), urinary tract infection (Hazard ratio = 72.021 (58.116 - 89.253), p < 2.225E-308 and Hazard ratio = 61.380 (51.273 - 73.479), p < 2.225E-308, respectively), acute renal failure (Hazard ratio = 1.264E4 (9.278E4 - 1.724E4), p < 2.225E-308 and Hazard ratio = 6.333E3 (4.947E3 - 8.108E3), p < 2.225E-308, respectively) and type 2 diabetes (Hazard ratio = 345.730 (283.180 - 422.098), p < 2.225E-308 and Hazard ratio = 217.271 (187.898 - 251.235), p = 1.39E-22, respectively) when accounting for demographics and the ten most common clinical conditions.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was funded by US National Institutes of Health grant R35GM131905 to NT.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The study is approved by the Columbia University Irving Medical Center Institutional Review Board (IRB) no. AAAL0601 and the requirement for informed consent was waived.
A data request associated with this protocol was submitted to the Tri-Institutional Request Assessment Committee of New York-Presbyterian/Columbia and Cornell and approved.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes
Data availability
All supplementary tables are available from GitHub as .csv files (https://github.com/tatonetti-lab/predict-covid-effects).
Code availability
All scripts used for data preparation and analysis are available from GitHub as Jupyter Notebooks (https://github.com/tatonetti-lab/predict-covid-effects).
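The abstract reports AUROC values and hazard ratios with intervals, but it does not say how either quantity is computed. As a rough illustration only, the Python sketch below shows the textbook definitions: AUROC as the fraction of (positive, negative) pairs in which the positive case receives the higher score, and a hazard ratio with a 95% interval obtained by exponentiating a log-scale coefficient and its standard error. The function names `auroc`, `hazard_ratio_ci` and `log_midpoint` are hypothetical and are not taken from the authors' notebooks. The use of Wald intervals built on the log scale is an assumption here, not something the abstract states.

```python
import math


def auroc(labels, scores):
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly.

    labels are 0/1, scores are predicted probabilities. Ties count as half.
    This is the O(n_pos * n_neg) definition, chosen for clarity, not speed.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def hazard_ratio_ci(beta, se, z=1.959964):
    """Hazard ratio and Wald interval from a log-hazard coefficient.

    ASSUMPTION: beta and se come from a proportional hazards fit, and the
    interval is exp(beta +/- z * se), with z = 1.959964 for 95% coverage.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def log_midpoint(lower, upper):
    """Geometric mean of two interval bounds.

    For an interval that is symmetric on the log scale, this recovers the
    point estimate, which makes it a quick consistency check.
    """
    return math.sqrt(lower * upper)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(hazard_ratio_ci(beta=math.log(2.0), se=0.1))
    # First interval quoted in the abstract for myocardial infarction.
    print(log_midpoint(87.375, 169.611))
```

Under the log-scale assumption, the midpoint of the first myocardial infarction interval, (87.375, 169.611), comes out at about 121.74, in line with the reported estimate of 121.736.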