Development of Predictive Risk Models for All-cause Mortality in Pulmonary Hypertension using Machine Learning

Background: Pulmonary hypertension, a progressive lung disorder with symptoms such as breathlessness and loss of exercise capacity, is highly debilitating and has a negative impact on quality of life. In this study, we examined whether a multi-parametric approach using machine learning can improve mortality prediction. Methods: A population-based territory-wide cohort of pulmonary hypertension patients from January 1, 2000 to December 31, 2017 was retrospectively analyzed. Significant predictors of all-cause mortality were identified. Easy-to-use frailty indexes predicting mortality in primary and secondary pulmonary hypertension were derived, and the stratification performances of the derived scores were compared. A factorization machine model was used to develop an accurate predictive risk model, and the results were compared to multivariate logistic regression, support vector machine, random forests, and multilayer perceptron. Results: The cohort consisted of 2562 patients with either primary (n=1009) or secondary (n=1553) pulmonary hypertension. Multivariate Cox regression showed that age, prior cardiovascular, respiratory and kidney diseases, hypertension, and number of emergency readmissions within 28 days of discharge were all predictors of all-cause mortality. Easy-to-use frailty scores were developed from Cox regression. A factorization machine model demonstrated superior risk prediction for both primary (precision: 0.90, recall: 0.89, F1-score: 0.91, AUC: 0.91) and secondary pulmonary hypertension (precision: 0.87, recall: 0.86, F1-score: 0.89, AUC: 0.88) patients. Conclusion: We derived easy-to-use frailty scores predicting mortality in primary and secondary pulmonary hypertension. A machine learning model incorporating multi-modality clinical data significantly improves risk stratification performance.


Introduction
Pulmonary hypertension is a progressive lung disorder characterized by elevated pulmonary arterial pressure, which can have different etiologies 1 . Patients may experience symptoms like breathlessness and loss of exercise capacity that can be highly debilitating and adversely affect their quality of life. This is exacerbated by complications, such as cavitation and infection of the lungs, alveolar hemorrhage and heart failure, which can result in premature mortality 2 . However, mortality risk differs between etiologies 3 , and more accurate risk stratification strategies could potentially improve clinical management. To this end, several studies have developed predictive risk models based on different variables. For example, the first prognostic equation, which was based on pulmonary haemodynamics (right atrial pressure, mean pulmonary artery pressure and cardiac index at diagnosis), was derived from the National Institutes of Health (NIH) registry study of 194 patients with primary pulmonary hypertension from 32 centers 4 . This model was applied in a contemporary cohort from the Pulmonary Hypertension Connection (PHC) registry, which showed better survival 5 .
The primary outcome was all-cause mortality. Descriptive statistics were used to summarize patients' characteristics for each primary diagnosis and mortality outcome. Continuous variables were presented as median (95% confidence interval [CI] or interquartile range [IQR]) and categorical variables were presented as count (%). The Mann-Whitney U test was used to compare continuous variables. The χ² test with Yates' correction was used for 2×2 contingency data, and Pearson's χ² test was used for contingency data for variables with more than two categories. To evaluate the significant prognostic risk factors and the effects of drug therapies associated with disease group status and primary outcomes, a univariate logistic regression model was used with adjustments based on baseline characteristics. Multivariate logistic regression was then conducted to identify the important mortality factors, with significant univariable predictors as input (Figure 1). Frailty scores without medications were derived to predict adverse events of primary and secondary pulmonary hypertension. Odds ratios (ORs) with corresponding 95% CIs and P values were reported accordingly. All significance tests were two-tailed, and P values <0.05 were considered statistically significant. Data analyses were performed using RStudio software (Version: 1.1.456) and Python (Version: 3.6). Experiments were run on a 15-inch MacBook Pro with a 2.2 GHz Intel Core i7 processor and 16 GB RAM.
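As an illustration of two of the statistics named above, the following minimal sketch computes the χ² statistic with Yates' correction for a 2×2 table and an unadjusted odds ratio with a Wald 95% CI. The function names and the example counts are hypothetical and are not taken from the study data; the ORs reported in the study come from logistic regression models, which this sketch does not reproduce.

```python
import math


def yates_chi_square(a, b, c, d):
    """Chi-square statistic with Yates' continuity correction.

    The 2x2 table is [[a, b], [c, d]], e.g. rows = exposed / unexposed,
    columns = died / survived.
    """
    n = a + b + c + d
    # The correction subtracts n/2 from |ad - bc|, but never below zero.
    corrected = max(abs(a * d - b * c) - n / 2.0, 0.0)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * corrected ** 2 / denominator


def chi_square_p_value_1df(stat):
    """Upper-tail P value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval (z=1.96 for 95%)."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
stat = yates_chi_square(30, 20, 10, 40)
p_value = chi_square_p_value_1df(stat)
odds_ratio, ci_lower, ci_upper = odds_ratio_with_ci(30, 20, 10, 40)
print(f"chi2={stat:.3f}, P={p_value:.5f}, "
      f"OR={odds_ratio:.2f} (95% CI {ci_lower:.2f}-{ci_upper:.2f})")
```

The Wald interval is computed on the log-odds scale and then exponentiated, which is why the interval is asymmetric around the odds ratio.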

Development of a machine learning model
In this study, we develop a factorization machine (FM) model 14 for pulmonary hypertension mortality risk prediction based on baseline characteristics. We observed that some categorical variables, such as comorbidities and drug prescriptions, are sparse after one-hot encoding, even leading to some missing pairs of

medRxiv preprint doi: https://doi.org/10.1101/2021.01.16.21249934; this version posted January 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Baseline characteristics
The study cohort has 2562 patients, of whom 1009 and 1553 had primary and secondary pulmonary hypertension, respectively (Table 1). Amongst the primary pulmonary hypertension patients, 574 deaths occurred during follow-up until the end of 2019. Those who passed away were … [500-1750] mg/day), alpha-adrenoceptor blocker (median: 1.7, IQR: [0.9-2.9] mg/day) and endothelin receptor antagonists (median: 125, IQR: [67-231] mg/day). Similar patterns were observed for patients with secondary pulmonary hypertension.

Machine learning results
The performance of the FM model was compared to that of multivariate logistic regression, SVM, random forests, and multilayer perceptron, using significant univariable characteristics (without medications) as input to avoid overfitting (Figure 1). All of the models were trained with a randomly selected 80% (n=2048) of patients and tested with a five-fold cross-validation approach using the remaining 20% (n=512) of patients. The comparative performance evaluation results, with metrics of recall, precision, F1-score and AUC, are reported in Table 7.
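The evaluation metrics named above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the study's evaluation code: the helper names, the 0.5 decision threshold, the random seed and the toy labels and scores are all hypothetical.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Randomly partition indices 0..n-1 into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1-score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case is scored above
    a random negative case, with ties counted as one half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = died) and predicted risks, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
train_idx, test_idx = train_test_split_indices(len(y_true))
print(precision, recall, f1, auc, len(train_idx), len(test_idx))
```

Precision, recall and F1-score depend on the chosen decision threshold, whereas AUC is computed from the ranking of the predicted risks and does not.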
In the cross-validation, the FM model demonstrated significant improvement in pulmonary hypertension mortality risk prediction compared to the baselines. The FM model outperformed the baseline models for mortality prediction amongst both primary (precision=0.8966, recall=0.8876, F1-score=0.9106, AUC=0.9093) and secondary pulmonary hypertension patients (precision=0.8693, recall=0.8564, F1-score=0.8864, AUC=0.8887). The important hyperparameters of the baseline models were tuned to improve prediction performance. For the SVM model, the radial kernel parameter gamma and the cost of constraint violation were tuned to 0.01 and 11, respectively. The number of trees and the tree depth were tuned to 732 and 9, respectively, for the random forest model. For the multilayer perceptron model, the number of units in the hidden layer was set to 5, and the decay was set to 0.073. The hyperparameters were selected with the widely used grid and randomized search approaches 15 . These observations about model performance are consistent with previous studies in which FM produced the most accurate predictions when compared to other models 14, 16 .

Discussion
In this study, we developed a predictive risk model for all-cause mortality in primary and secondary pulmonary hypertension patients, incorporating baseline demographics, healthcare utilization metrics, comorbidities and drug prescription records from a population-based administrative database. An FM model was introduced as a multi-parametric mortality risk evaluation approach, which significantly improved risk prediction when compared to several baseline models.
The use of administrative databases for data mining in healthcare research has been a focus over recent decades. Specific to pulmonary hypertension, existing predictive risk models have largely relied on registry data. Whilst this can provide important insights by incorporating clinical parameters, such an approach can be difficult in the case of large patient numbers. In our study, we examined a territory-wide cohort of patients with both primary and secondary pulmonary hypertension, and used a multi-parametric approach incorporating data from different domains. We found common predictors of all-cause mortality for both primary and secondary causes, despite important differences in their aetiology, physiological basis and disease life course. The implication is that even without specific haemodynamic or physiological data, accurate predictions can be made.

Pharmacotherapy and association with all-cause mortality
Regarding pharmacological treatment, diuretics are used in secondary pulmonary hypertension to reduce afterload 17 . This is supported by their use to lower pulmonary arterial pressure in patients with right heart failure 18 . In our study, the use of loop diuretics was significantly associated with lower all-cause mortality. Moreover, whilst the efficacy of cardiac glycosides has not been extensively studied in pulmonary hypertension, one study found that the short-term use of digoxin can increase cardiac output for pulmonary hypertension patients with right ventricular failure 19 but not mortality 20 . In contrast, our study found that their use was associated with higher mortality, which may be attributed to greater disease severity in patients who are prescribed these medications. Interestingly, the use of anti-arrhythmic drugs was also associated with higher mortality.
This may suggest that patients with pulmonary hypertension die from causes other than cardiac arrhythmias. Unlike other drug groups in this study, the use of vasodilator antihypertensive drugs was not a statistically significant predictor of all-cause mortality. A study found that patient response, i.e. pulmonary vasodilation, to the use of different vasodilators was highly variable between individuals 21 . Furthermore, vasodilators such as calcium channel blockers are not considered empiric treatment for pulmonary hypertension and may only be effective as long-term treatment for a minority of patients who demonstrate an acute vasodilatory response 22 . The highly individualized responsiveness and efficacy of vasodilators are hence a likely explanation as to why vasodilator antihypertensives were not a predictor of mortality.

Factorization machine model for risk prediction
FM 14 is an efficient machine learning model whose main advantage stems from its generality: it is a generic classifier that works with any real-valued variable vector for supervised learning. This is in contrast to matrix factorization 23 , which only models the relation between two entities, and to traditional machine learning models such as SVM 24 , random forests 25 and multilayer perceptron 26 , which have difficulty capturing the hidden interactions among characteristics in latent space. The main reason is that FM can learn a meaningful embedding vector for each variable as long as the variable itself appears enough times in the data, making the dot product a good estimator of pair-wise interaction effects even if two variables never or seldom co-occur. Previous experimental results demonstrated the superior discrimination of the FM model and its significant improvement in prediction accuracy, supporting its application to multi-parametric mortality risk stratification in clinical practice.
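The pair-wise interaction term described above can be illustrated with a minimal sketch of a second-order factorization machine score. The symbols `w0` (bias), `w` (linear weights), `V` (one embedding vector per variable) and all numerical values are illustrative assumptions rather than parameters fitted in this study, and training is omitted. The sketch evaluates the same score two ways: directly over all variable pairs, and with the standard linear-time reformulation of the pair-wise sum.

```python
import math


def fm_score_naive(x, w0, w, V):
    """FM score: w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score


def fm_score(x, w0, w, V):
    """Same score in O(k * n) time, using the identity
    sum_{i<j} <V_i, V_j> x_i x_j
        = 0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i (V_if x_i)^2]."""
    n, k = len(x), len(V[0])
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        total = sum(V[i][f] * x[i] for i in range(n))
        total_of_squares = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        score += 0.5 * (total * total - total_of_squares)
    return score


def fm_probability(x, w0, w, V):
    """Map the score to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-fm_score(x, w0, w, V)))


# Three hypothetical one-hot variables with 2-dimensional embeddings.
x = [1, 0, 1]
w0 = 0.1
w = [0.2, -0.3, 0.4]
V = [[0.5, 0.1], [0.3, -0.2], [-0.4, 0.6]]

print(fm_score_naive(x, w0, w, V), fm_score(x, w0, w, V))
print(fm_probability(x, w0, w, V))
```

Because the interaction weight for a pair is the dot product of two separately learned embeddings, a pair that never co-occurs in the training data still receives an estimate, which is the property the paragraph above relies on.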

Strengths and limitations
The main strength of this study is the inclusion of a cohort of patients with pulmonary hypertension over an 18-year period with comprehensive laboratory, comorbidity, drug, healthcare utilization and follow-up data. This was complemented by machine learning analysis using demographic, hospitalization, comorbidity and drug prescription data.
Important indicators for predicting mortality risk were detected using the FM model, which demonstrated superior predictive performance over several baseline models.
However, several limitations should be noted. Firstly, long-term patterns in comorbidity and disease onset sequence related to pulmonary hypertension mortality were not uncovered. Secondly, clinical parameters from echocardiography, 6-minute walk tests and other physiological tests were not available in the administrative database, so these variables could not be incorporated into our predictive risk models. These remain to be explored in future investigations.

Conclusion
A machine learning model incorporating multi-modality clinical data significantly improves risk stratification performance and identifies important indicators for predicting mortality in pulmonary hypertension.

Conflicts of Interest
None.

Funding
None.

Rich S, Seidlitz M, Dodin E, Osimani D, Judd D, Genthner D, McLaughlin V, Francis G. The short-term effects of digoxin in patients with right ventricular dysfunction fro…

