Machine learning based prognostic model and mobile application software platform for predicting infection susceptibility of COVID-19 using health care data

R Srivatsan; Prithviraj N Indi; Swapnil Agrahari; Siddharth Menon; S Denis Ashok

doi:10.1101/2020.10.09.20165431

Abstract

From a public health perspective, accurate estimates of individuals' infection severity are extremely valuable for informed decision making and a targeted response to an emerging pandemic such as COVID-19. This paper presents a machine learning based prognostic model that gives individuals an early warning of COVID-19 infection using a health care data set. A prognostic model using a Random Forest classifier and support vector regression is developed to predict the Infection Susceptibility Probability (ISP) score of COVID-19, and it is applied to an open health care data set containing 27 fields. Typical fields include basic personal details such as age, gender, number of children in the household, and marital status, along with medical data such as Coma score, Pulmonary score, blood glucose level, and HDL cholesterol. A preprocessing step handles the numerical values, categorical (non-numerical) values, and missing data in the health care data set. The correlation between the variables is analyzed using the correlation coefficient, and a color-coded heat map is used to identify the factors that influence the ISP score. Based on the accuracy, precision, sensitivity, and F-score, the Random Forest classifier provides better classification performance than support vector regression for the given health care data set. An Android based mobile application software platform is developed using the proposed prognostic approach so that healthy individuals can predict their COVID-19 infection susceptibility score and take precautionary measures. Based on the results of the proposed method, clinicians and government officials can focus on highly susceptible people to limit the spread of the pandemic.
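The preprocessing and correlation steps described above can be illustrated with a minimal sketch. The abstract does not state which imputation or encoding scheme was used, so mean imputation and integer label encoding are assumptions here, and the helper names and sample values are hypothetical.

```python
from math import sqrt

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values.
    Assumption: mean imputation; the source does not name its method."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def encode_categories(values):
    """Map each categorical label to an integer code (sorted label order).
    Assumption: simple label encoding for non-numerical fields."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical toy columns, not taken from the study's data set.
age = impute_mean([30.0, None, 50.0, 70.0])
gender = encode_categories(["M", "F", "M", "F"])
isp = [0.2, 0.4, 0.5, 0.9]
age_isp_correlation = pearson(age, isp)
```

Computing `pearson` for every pair of fields yields the matrix that a heat map would color-code.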

Methods In the present work, Random Forest classifier and support vector regression techniques are applied to a medical health care data set containing 27 variables to predict an individual's susceptibility score for COVID-19 infection, and the prediction accuracy of the two techniques is compared. Preprocessing is carried out to handle the missing data in the health care data set. Correlation analysis using a heat map is carried out on the health care data to analyze the factors that influence the Infection Susceptibility Probability (ISP) score of COVID-19. A confusion matrix is calculated to assess classification performance based on the number of true positives, true negatives, false positives, and false negatives. These values are then used to calculate the accuracy, precision, sensitivity, and F-score.
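The four metrics follow directly from the confusion matrix counts. The sketch below uses the standard definitions for a binary problem; the function names and the toy labels are hypothetical and are not taken from the authors' repository.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity (recall), and F-score from the counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    f_score = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f_score": f_score,
    }

# Hypothetical labels: 1 = susceptible, 0 = not susceptible.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(*confusion_counts(y_true, y_pred))
```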

Results The Random Forest classifier provides a classification accuracy of 99.7%, precision of 99.8%, sensitivity of 98.8%, and F-score of 99.29% for the given medical data set.
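As a quick consistency check, assuming the F-score is the usual harmonic mean of precision $P$ and sensitivity $S$, the rounded values reported above give:

```latex
F = \frac{2PS}{P + S}
  = \frac{2 \times 0.998 \times 0.988}{0.998 + 0.988}
  = \frac{1.972048}{1.986}
  \approx 0.993
```

This agrees with the reported F-score of 99.29% to within the rounding of the inputs.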

Conclusion The proposed machine learning approach can help individuals take additional precautions to protect themselves from COVID-19 infection, and clinicians and government officials can focus on highly susceptible people to limit the spread of the pandemic.


Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

No funding was received for this research work.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The research involves no human subjects.

All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Footnotes

Abstract updated.

Data Availability

The datasets presented in this study can be found online in open source repositories.

Code Availability

For the reproducible code, see the GitHub repository at: https://github.com/srivatsanrr/autonom_covid

The following open source libraries are used in the implementation of our work: Keras: https://keras.io; Sklearn: https://scikit-learn.org/stable/; statsmodels: https://www.statsmodels.org/stable/index.html


The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.