ABSTRACT
Introduction COVID-19 can rapidly lead to severe respiratory problems and can place an overwhelming burden on healthcare systems worldwide, making it imperative to identify high-risk patients and to predict survival and the need for intensive care (ICU). Most of the proposed models are not well reported, making them less reproducible and prone to a high risk of bias.
Methods In this study, the performances of seven classical machine learning models (Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), k-Nearest Neighbor (KNN), XGBoost, Linear Discriminant Analysis (LDA) and Gaussian Naïve Bayes (NB)) and two deep learning models (Deep Neural Network (DNN) and Long Short-Term Memory (LSTM)), in combination with two widely used feature selection methods (random forest and extra trees classifier), were investigated to predict “last status” (representing mortality), “ICU requirement”, and “ventilation days”. Fivefold cross-validation was used for training and validation. In each fold, 80% of the data were used for training the models and the remaining 20% were held out for validation. To minimize bias, the training and validation sets were split so that they maintained similar distributions. Before splitting, the k-nearest neighbour (KNN) imputation algorithm was used to fill in missing data. Bootstrapping was used for both oversampling and undersampling to address data imbalance. A publicly available dataset of 122 demographic and clinical features from 1384 patients was used. The performances of the models were evaluated using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve (AUC).
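The splitting and balancing steps described in the Methods can be illustrated with a minimal plain-Python sketch. This is not the authors' code: the function names `stratified_folds` and `bootstrap_balance`, the synthetic labels, the choice of the mean class size as the resampling target, and the decision to balance only the training portion of each fold are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so each fold keeps similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def bootstrap_balance(indices, labels, seed=0):
    """Resample with replacement so every class ends up with the same count.

    Assumption: the target is the mean class size, so the larger class is
    undersampled and the smaller class is oversampled.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = round(len(indices) / len(by_class))
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.choices(members, k=target))
    rng.shuffle(balanced)
    return balanced


# Synthetic, made-up labels (not the study data): 200 positives, 800 negatives.
labels = [1] * 200 + [0] * 800
folds = stratified_folds(labels, k=5)

splits = []
for i, val_idx in enumerate(folds):
    # 80% of the data (four folds) for training, 20% (one fold) for validation.
    train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    # Balance only the training portion; the validation fold is left untouched.
    train_balanced = bootstrap_balance(train_idx, labels, seed=i)
    splits.append((train_balanced, val_idx))
```

Each entry in `splits` pairs a balanced training index list with an untouched validation fold, so validation metrics are computed on data with the original class distribution.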
Results Only 10 features out of 122 were found to be useful in prediction modelling, with “Acute kidney injury during hospitalization” being the most important. Blood pH shows reasonable discriminative capability, especially in predicting “ICU requirement” and “ventilation days”, whereas gender and age are vital in predicting “last status”. Selecting more than 10 features was observed to lower the prediction accuracy. The performances of the different algorithms depend on the number of features and the data pre-processing technique. LSTM with balanced data and 10 features performs the best in predicting “last status” as well as “ICU requirement”, with an average of 90%, 92%, 86% and 95% accuracy, sensitivity, specificity, and AUC respectively. DNN performs the best in predicting “ventilation days”, with 88% accuracy. For “ICU requirement”, which is a binary prediction task, the data pre-processing technique does not influence the predictions, and the performances of the different methods are comparable (89%, 98%, 78% and 95% accuracy, sensitivity, specificity, and AUC respectively). However, the number of features selected varies with the data pre-processing technique.
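For readers less familiar with the four reported metrics, the following plain-Python sketch shows one common way to compute them for a binary task. It is illustrative only: the function name `binary_metrics`, the 0.5 decision threshold, and the toy labels and scores are assumptions and are unrelated to the study data.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Return accuracy, sensitivity, specificity and AUC for binary labels."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    # AUC as the probability that a randomly chosen positive case scores
    # higher than a randomly chosen negative case (ties count as half).
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "auc": wins / (len(pos) * len(neg)),
    }


# Toy example with made-up labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
metrics = binary_metrics(y_true, y_score)
print(metrics)
```

Accuracy, sensitivity and specificity depend on the chosen threshold, whereas AUC summarises ranking quality across all thresholds, which is why a model can show similar AUC but different sensitivity and specificity.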
Conclusion Considering all the factors and limitations, including the absence of the exact time point of clinical onset, LSTM with carefully selected features can predict “last status” and “ICU requirement” with approximately 90% accuracy, sensitivity, and specificity. DNN performs the best in predicting “ventilation days”. An appropriate machine learning algorithm with carefully selected features and balanced data can accurately predict mortality, ICU requirement and ventilation support. Such a model can be very useful in emergencies and pandemics, where prompt and precise decision making is crucial.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study did not receive any funding.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
https://www.cancerimagingarchive.net/collection/covid-19-ny-sbu/
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data produced in the present study are available upon reasonable request to the authors.