Novel clinical subphenotypes in COVID-19: derivation, validation, prediction, temporal patterns, and interaction with social determinants of health

Chang Su; Yongkang Zhang; James H Flory; Mark G. Weiner; Rainu Kaushal; Edward J. Schenck; Fei Wang

doi:10.1101/2021.02.28.21252645

Data Availability

All data analyzed in this work are available upon request from the INSIGHT Clinical Research Network at https://insightcrn.org/our-data/. Our implementation uses Python 3.7 and R 3.6. Clustering models were implemented with the Python packages scikit-learn 0.23.2 (https://scikit-learn.org/stable/) and scipy 1.5.3 (https://www.scipy.org). Supervised predictive modeling used XGBoost 1.2.1 (https://xgboost.readthedocs.io/en/latest/) and SHAP 0.35.0 (https://shap.readthedocs.io/en/latest/). Dimensionality reduction and visualization were performed with the Python package UMAP-learn 0.3.9 (https://umap-learn.readthedocs.io/en/latest/). The R package NbClust (https://cran.r-project.org/web/packages/NbClust/NbClust.pdf) was used to compute cluster validity measures for determining the optimal number of clusters in agglomerative hierarchical clustering. Chord diagrams were created with the R package circlize (https://cran.r-project.org/web/packages/circlize/index.html). All statistical tests and survival analyses were performed in R.
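
For readers who want to reproduce the Python environment, the following minimal sketch checks whether the installed packages match the versions listed above. It is an illustration, not part of the authors' released code: the helper name `check_versions`, the `PINNED` mapping, and the PyPI distribution names are assumptions introduced here. It relies on the standard-library module `importlib.metadata`, which is available from Python 3.8 onward; under Python 3.7 the `importlib_metadata` backport would likely be needed instead.

```python
from importlib import metadata

# Versions as listed in the Data Availability statement.
# The PyPI distribution names used as keys are assumptions.
PINNED = {
    "scikit-learn": "0.23.2",
    "scipy": "1.5.3",
    "xgboost": "1.2.1",
    "shap": "0.35.0",
    "umap-learn": "0.3.9",
}


def check_versions(pinned, lookup=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch.

    `lookup` maps a distribution name to its installed version string and
    raises PackageNotFoundError when the package is absent. It is a
    parameter so the check can be exercised without touching the real
    environment.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Print one line per package whose installed version differs.
    for name, (expected, installed) in check_versions(PINNED).items():
        found = installed or "not installed"
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every listed package is installed at the stated version. The R packages (NbClust, circlize) are not covered by this check.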