Abstract
We employed machine learning (ML) approaches to evaluate 2,199 clinical features and disease phenotypes available in the UK Biobank as predictors of atrial fibrillation (AF) risk. After quality control, 99 features were selected for analysis in 21,279 prospective AF cases and an equal number of controls. Several ML methods were employed, including LightGBM, XGBoost, Random Forest (RF), Deep Neural Network (DNN) and Logistic Regression with L1 penalty (LR). To mitigate the black-box character of the tree-based ML models, we employed Shapley values (SHAP), which estimate the contribution of each feature to AF prediction. The area under the receiver operating characteristic curve (AUROC) values and the 95% confidence intervals (CI) per model were: 0.729 (0.719, 0.738) for LightGBM, 0.728 (0.718, 0.737) for XGBoost, 0.716 (0.706, 0.725) for DNN, 0.715 (0.706, 0.725) for RF and 0.622 (0.612, 0.633) for LR. Considering the running time, memory and stability of each algorithm, LightGBM was the best performing among those examined. DeLong’s test showed a statistically significant difference in the AUROCs between penalised LR and the other ML models. Among the most important features identified for LightGBM using SHAP analysis were the genetic risk score (GRS) of AF and age at recruitment. As expected, the AF GRS had a positive impact on the model output, i.e. a higher AF GRS increased AF risk. Similarly, age at recruitment had a positive impact, increasing AF risk. A secondary analysis was performed for individuals who developed ischemic stroke after AF diagnosis, employing 129 features in 3,150 prospective cases of people who developed ischemic stroke after AF and an equal number of controls in UK Biobank. The AUROC values and the 95% CI per model were: 0.631 (0.604, 0.657) for XGBoost, 0.620 (0.593, 0.647) for LightGBM, 0.599 (0.573, 0.625) for RF, 0.599 (0.572, 0.624) for SVM, 0.589 (0.562, 0.615) for DNN and 0.563 (0.536, 0.591) for penalised LR.
DeLong’s test showed no evidence of a significant difference in the AUROCs between XGBoost and any other examined ML model except the penalised LR model (pvalue=2.00E-02). SHAP analysis for XGBoost identified age at recruitment and glycated haemoglobin among the most important features. DeLong’s test showed evidence of a statistically significant difference between XGBoost and the current clinical tool for ischemic stroke prediction in AF patients, CHA2DS2-VASc (pvalue=2.20E-06), which has an AUROC and 95% CI of 0.611 (0.585, 0.638).
Introduction
Atrial fibrillation (AF) is the most common cardiac arrhythmia and is characterised by a rapid and irregular heartbeat [1-5]. The incidence of AF is increasing rapidly, with 12.1 million people expected to be affected by 2030 [6]. This is mainly attributed to the ageing of the population, along with changes in lifestyle [5, 7, 8]. Besides doubling the risk of cardiovascular mortality, AF is associated with increased risk of stroke, ischemic heart disease, heart failure and cognitive dysfunction [1, 4, 8, 9]. More specifically, AF quintuples the risk of ischemic stroke, independent of age [6, 10]. Additionally, the contribution of AF to ischemic stroke increases steeply with age; a 1.5% attribution at 50-59 years reaches 23.5% for the age range 80-89 [6]. However, AF is sometimes asymptomatic and thus remains undetected [3, 5], and consequently the ischemic stroke risk attributed to AF is under-estimated [6, 11].
In recent years, machine learning (ML) algorithms have gained popularity in the field of medicine, specifically in disease prediction, classification of medical images and diagnosis. ML models use a hypothesis-free approach; there are no prior assumptions either among the input features or between the features and the outcome. Thus, it is possible to reveal new features, along with non-linear associations amongst them, which would not have been discovered using traditional statistical models. ML models have been shown to improve the accuracy of disease prediction, although they have a “black box” character and their results are interpreted differently from those of traditional models [12-15].
Several studies have employed ML methods for the prediction of circulatory diseases. A recent study by Raghunathan et al. [16] employed Deep Neural Networks (DNN) in 430,000 patients recorded in Geisinger’s clinical MUSE database between 1984 and 2019 with no history of AF, within 1 year of an ECG, and reported a model for AF prediction with an area under the receiver operating characteristic curve (AUROC) of 0.85. They also reported that 62% of patients who had a stroke caused by AF within 3 years of an ECG, with no prior AF diagnosis, would have been identified by their prediction tool before the stroke occurred. Another study by Su et al. [17] employed four ML models to predict modified Rankin Scale (mRS) at hospital discharge and in-hospital deterioration for acute ischemic stroke patients enrolled in the Stroke Registry in Chang Gung Healthcare System (SRICHS). Random Forest (RF) performed well for both outcomes; the AUROC was 0.829 for discharge mRS and 0.710 for in-hospital deterioration.
The aim of this study is to develop ML models for prediction of: 1) AF and 2) ischemic stroke in patients with AF, using UK-Biobank’s real-world clinical data, questionnaires, hospital episode statistics data and genomic data. To achieve this, five types of ML models, including extreme gradient boosting (XGBoost) [18], light gradient boosting machine (LightGBM) [19], RF [20], support vector machine (SVM) [21], DNN [22], and penalised logistic regression (LR) [23] as a baseline model, were constructed, and their predictive performances were compared. Besides the comparison of the model performances, we also focused on the ranking of feature importance by employing the SHapley Additive exPlanations (SHAP) [24], in order to unravel each feature’s contribution both to AF and to ischemic stroke prediction in AF patients.
Methods
Overview of the research framework
We included clinical data, phenotypes, lifestyle and medications from the UK Biobank (UKB). We imputed the missing values and employed a feature selection process, described in more detail in the Data pre-processing section, in order to reduce the features employed to those relevant to the outcome. Then six models, namely XGBoost, LightGBM, RF, SVM, DNN and penalised LR, were used to create the predictive models. The models’ hyperparameters were optimised using 10-fold cross-validation on the training dataset, which comprised 80% of the original dataset. The ML models were validated on the test dataset and their performances were compared. Lastly, we employed SHAP explanations to reveal the features’ contributions to the prediction.
Data pre-processing
We examined the UKB, a prospective cohort of 502,492 participants, aged 37-73 years, recruited between 2006 and 2010. The dataset includes blood measurements, clinical assessments, anthropometry, cognitive function, hearing, arterial stiffness, hand grip strength, sociodemographic factors, lifestyle, family history, psychosocial factors and dietary intake [25]. Related individuals were removed, and the remaining dataset for analysis included 454,118 participants. Furthermore, we incorporated medications as features, derived from field 20003 (https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=20003). Additionally, we employed clinical data coded in ICD10, derived from the Hospital Episodes Statistics (HES), which are linked to the UKB. From these, we used a phecode map [26] to construct phenotype codes or “phecodes”, which are aggregated ICD10 codes defining specific diseases or traits. We employed only the umbrella phecode categories. A detailed list of all the features that we examined can be found in Supplementary_Material.xlsx (Table_S1, Table_S2, Table_S3). Moreover, we created two polygenic scores (PGS), both of which were included as features for the prediction of ischemic stroke in people with AF. The first is the AF SCORE, based on 94 genome-wide variants derived from the Roseli et al. [27] genome-wide association study (GWAS) for AF. The second is the STROKE score, based on 28 genome-wide variants derived from the Malik et al. [28] GWAS for ischemic stroke. The AF SCORE was also employed as a feature for the prediction of AF.
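The Results describe the AF SCORE as the unweighted sum of risk-increasing alleles. As a minimal sketch of how a score of this kind can be computed, the snippet below sums the number of risk-increasing alleles that each participant carries across the variants of a score. The genotype coding (0, 1 or 2 copies of the risk-increasing allele per variant), the toy data and the function name are assumptions made for illustration; the study's variant lists and pipeline are not reproduced here.

```python
def unweighted_score(genotypes):
    """Return, per participant, the total count of risk-increasing alleles.

    genotypes: list of rows, one per participant; each entry is the number of
    copies (0, 1 or 2) of the risk-increasing allele at one variant.
    """
    return [sum(row) for row in genotypes]


# Toy example: 3 hypothetical participants, 4 hypothetical variants
toy_genotypes = [
    [0, 1, 2, 0],
    [1, 1, 1, 1],
    [2, 2, 0, 2],
]
scores = unweighted_score(toy_genotypes)
print(scores)
```

A weighted score would multiply each allele count by the variant's GWAS effect size before summing; the unweighted form shown here treats all variants equally.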
The investigator phenotypes dataset from UKB includes 2,199 fields for 454,118 participants. We set answers “Do not know” and “Prefer not to answer” as NA and removed features that had more than 25% missingness, resulting in 390 investigator phenotypes. Afterwards, we imputed the missing values using a multivariate imputer that estimates each feature from all the others, using IterativeImputer from Python [29]. Then, we added 419 phecodes, available for 278,177 participants, derived from HES in UKB. Lastly, we added the medications from field 20003 (https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=20003), after applying one-hot-encoding, resulting in 1,289 medications for 294,698 participants.
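The missingness filter described above can be sketched as follows. This is an illustrative example only: the table layout (a dictionary mapping each feature name to a list of answers), the use of `None` for missing values and the toy features are assumptions, and the subsequent multivariate imputation step is not shown.

```python
NON_INFORMATIVE = {"Do not know", "Prefer not to answer"}


def mask_non_informative(column):
    """Replace non-informative answers with None (treated as missing)."""
    return [None if value in NON_INFORMATIVE else value for value in column]


def missing_fraction(column):
    """Fraction of entries in a column that are missing."""
    return sum(value is None for value in column) / len(column)


def drop_sparse_features(table, max_missing=0.25):
    """Keep only features whose missingness does not exceed max_missing."""
    cleaned = {name: mask_non_informative(col) for name, col in table.items()}
    return {
        name: col
        for name, col in cleaned.items()
        if missing_fraction(col) <= max_missing
    }


# Toy table with two hypothetical questionnaire features
toy_table = {
    "smoking": ["Never", "Do not know", "Current", "Never"],           # 25% missing
    "income": ["Prefer not to answer", None, "High", "Do not know"],  # 75% missing
}
kept = drop_sparse_features(toy_table)
print(list(kept))
```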
Next, we balanced the outcome sample size, since imbalanced data have a negative impact on ML procedures [30]. Classification algorithms tend to produce estimates biased towards the majority class, ignoring the minority class. This happens because most classification methods aim to maximise the accuracy rate, meaning the proportion of correctly classified observations [31, 32]. Therefore, we employed the fixed under-sampling technique from Python [33], which reduces the number of samples in the majority class; the control group in our case. The algorithm randomly selects samples from the control group, in order to have equal representation of both classes. After balancing the outcome, we used VarianceThreshold from Python [29], which eliminates all features whose variance does not meet a threshold of 90%. Additionally, we removed correlated continuous fields using Pearson correlation, at a 0.8 threshold; features strongly correlated with the outcome were retained. Next, we performed feature selection in order to reduce the computational cost via dimensionality reduction [34], achieve higher classification accuracy by eliminating noise, and include the most relevant features for disease prediction. A recent paper by Ramos-Pérez et al. [35] suggests that the best practice is for the fixed under-sampling technique to precede feature selection. Therefore, we filtered all the remaining features using recursive feature elimination with cross-validation from Python [29] in order to find the optimal number of features to include in the ML models.
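To illustrate the balancing step, the following sketch implements random under-sampling of the majority class in plain Python. It is not the library routine cited above; the function name, the fixed seed and the toy labels are assumptions for illustration.

```python
import random


def undersample_majority(labels, seed=0):
    """Return indices of a balanced subset of the data.

    All samples of the smaller class are kept, and an equally sized random
    subset of the larger class is drawn without replacement.
    """
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((cases, controls), key=len)
    kept_majority = rng.sample(majority, len(minority))
    return sorted(minority + kept_majority)


# Toy example: 3 cases and 10 controls
labels = [1] * 3 + [0] * 10
idx = undersample_majority(labels)
balanced = [labels[i] for i in idx]
print(balanced.count(1), balanced.count(0))
```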
Outcome-AF
We removed participants from the UKB that had cardiac dysrhythmias before the time of enrolment, with one or more of the following codes: non-cancer illness code, self-reported (1471, 1483); operation code (1524); diagnoses – main/secondary ICD10 (I44, I44.1-I44.7, I45, I45.0-I45.6, I45.8-I45.9, I46, I46.2, I46.8-I46.9, I47, I47.0-I47.2, I47.9, I48, I48.0-4, I48.9, I49, I49.0-I49.5, I49.8-I49.9, R00.0, R00.1, R00.2, R94.3, Z86.7, Z95.0, Z95.8-Z95.9); underlying (primary/secondary) cause of death: ICD10 (I44, I44.1-I44.7, I45, I45.0-I45.6, I45.8-I45.9, I46, I46.2, I46.8-I46.9, I47, I47.0-I47.2, I47.9, I48, I48.0-4, I48.9, I49, I49.0-I49.5, I49.8-I49.9, I60-I61, I63-I64 (NOT I63.6), R00.0, R00.1, R00.2, R94.3, Z86.7, Z95.0, Z95.8-Z95.9); diagnoses – main/secondary ICD9 (4273, 430, 431, 4339, 4340, 4341, 4349, 436); operative procedures – main/secondary OPCS (K57.1, K62.1-4). In total, 20,584 participants were excluded, having at least one of the above conditions, before enrolment in the UKB.
AF cases were defined as participants with one or more of the following codes: non-cancer illness code, self-reported (1471, 1483); operation code (1524); diagnoses – main/secondary ICD10 (I48, I48.0-4, I48.9); underlying (primary/secondary) cause of death: ICD10 (I48, I48.0-4, I48.9); operative procedures – main/secondary OPCS (K57.1, K62.1-4). In total, 21,279 people developed one of the conditions described above after enrolment in the UKB.
Based on the data pre-processing described above, 21,279 prospective AF cases and an equal number of controls, along with 99 features, were included in the ML models (Supplementary_Material.xlsx, Table_S4). Table 1 includes the baseline characteristics of the examined participants.
Table 1. Baseline characteristics for the 21,279 prospective AF cases and an equal number of controls.
Outcome-AF & Stroke
Cases were defined as participants who developed ischemic stroke after AF diagnosis in the UKB with one or more of the following codes: diagnoses – main/secondary ICD10 (I63, I63.0-9, I64); diagnoses – main/secondary ICD9 (434, 436); underlying (primary/secondary) cause of death: ICD10 (I63, I63.0-9, I64). Thus, 3,150 people developed ischemic stroke after AF diagnosis and were included as cases, and the controls were people diagnosed with AF who did not develop stroke, as far as the available data show.
Based on the data pre-processing described above, 3,150 prospective cases who developed stroke after AF diagnosis and an equal number of controls, along with 129 features, were included in the ML models (Supplementary_Material.xlsx, Table_S7).
Machine learning models
XGBoost
XGBoost is a “scalable machine learning system for tree boosting” [18]. This technique handles sparse data by incorporating a novel tree learning algorithm, is reported to run ten times faster than similar algorithms by using parallel and distributed computing, and employs out-of-core computation, which allows massive datasets to be processed [18, 36]. In more detail, XGBoost combines regression trees, used as weak learners, into a single strong model through a sequential learning process in which each tree attempts to correct the residuals of the predictions made by the previous trees. Each leaf, i.e. a terminal node of a fully grown tree, carries a continuous score. For a specific observation, the algorithm uses the decision rules of each tree to assign it to a leaf, and the final prediction is the sum of the scores of the assigned leaves across all trees [15, 18, 37].
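The sequential residual-fitting idea can be illustrated with a deliberately simplified sketch: one-split trees (stumps) on a single feature, fitted with squared-error loss. This is plain gradient boosting written for illustration, not XGBoost's regularised objective or its implementation; all function names, the learning rate and the toy data are assumptions.

```python
def fit_stump(x, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_score = sum(left) / len(left)      # leaf score = mean residual
        right_score = sum(right) / len(right)
        error = sum((r - left_score) ** 2 for r in left)
        error += sum((r - right_score) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_score, right_score)
    return best[1:]


def stump_predict(stump, xi):
    """Route one observation to a leaf and return that leaf's score."""
    threshold, left_score, right_score = stump
    return left_score if xi <= threshold else right_score


def boost(x, y, n_trees=20, learning_rate=0.5):
    """Fit stumps sequentially, each on the residuals left by earlier ones."""
    trees, predictions = [], [0.0] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        trees.append(stump)
        predictions = [
            p + learning_rate * stump_predict(stump, xi)
            for p, xi in zip(predictions, x)
        ]
    return trees


def predict(trees, xi, learning_rate=0.5):
    """Final prediction: sum of the (shrunken) leaf scores over all trees."""
    return sum(learning_rate * stump_predict(t, xi) for t in trees)


x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
trees = boost(x, y)
print(round(predict(trees, 2), 3), round(predict(trees, 5), 3))
```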
LightGBM
Machine learning methods relying on Gradient Boosting Decision Trees (GBDT) scan all the data instances, for all the features, in order to calculate the information gain for each possible split. As a result, the computational time and complexity increase as the number of features grows. To address this, LightGBM [19] was introduced as a lighter and faster GBDT implementation. Two techniques incorporated in the LightGBM algorithm contribute to this improvement. Firstly, the Gradient-based One-Side Sampling (GOSS) technique exploits the fact that instances with larger gradients contribute more to the information gain: these instances are retained, while instances with smaller gradients are randomly sampled, providing an accurate and fast estimation. Secondly, the Exclusive Feature Bundling (EFB) technique reduces the number of effective features. In sparse datasets, many features are mutually exclusive; they rarely take nonzero values at the same time. Therefore, to reduce the dataset’s dimensions, such features are bundled into one, reducing the complexity of the algorithm [17, 19, 38, 39].
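A minimal sketch of the GOSS sampling step is given below, following the description in the LightGBM paper [19]: the instances with the largest absolute gradients are always kept, a random fraction of the remainder is sampled, and the sampled instances are up-weighted by (1 - a) / b so that the gradient statistics remain approximately unbiased. The function name, the default rates and the toy gradients are assumptions for illustration and do not correspond to the library's internal code.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Select instances by Gradient-based One-Side Sampling.

    Returns (indices, weights): the top_rate fraction with the largest
    |gradient| gets weight 1; other_rate * n instances drawn at random from
    the rest get weight (1 - top_rate) / other_rate.
    """
    rng = random.Random(seed)
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]
    sampled = rng.sample(order[n_top:], n_other)
    amplify = (1 - top_rate) / other_rate
    return top + sampled, [1.0] * n_top + [amplify] * n_other


# Toy gradients for 100 instances
gradients = [i / 100 for i in range(100)]
indices, weights = goss_sample(gradients)
print(len(indices), weights[0], round(weights[-1], 6))
```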
Random Forests (RF)
Random forest is a popular ensemble learning method that uses bootstrap aggregation (bagging) and feature randomness to build an uncorrelated forest of several decision trees [20]. At the bagging step, each decision tree is constructed from a random sample, drawn with replacement, from the training set. As each tree is grown, a random subset of features is considered for splitting each node, which results in low correlation among the decision trees. The final prediction of RF is the result of the majority vote of all decision trees, leading to more accurate results [12, 17, 40].
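The three ingredients described above (bootstrap sampling, random feature subsets and majority voting) can be sketched in a few lines. The helper names, the number of candidate features per split and the toy votes are assumptions for illustration; tree construction itself is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Indices drawn with replacement: the bagging step for one tree."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_features(n_features, n_candidates, rng):
    """Random subset of feature indices considered at one node split."""
    return rng.sample(range(n_features), n_candidates)


def majority_vote(votes):
    """Class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
sample = bootstrap_sample(10, rng)         # some indices repeat, some are left out
features = candidate_features(99, 9, rng)  # 9 of 99 features (hypothetical choice)
forest_prediction = majority_vote([1, 0, 1, 1, 0])
print(forest_prediction)
```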
Deep Neural Networks (DNN)
Deep learning is a subdomain of ML that attempts to learn many levels of representation using multiple layers. These layers transform the data in a non-linear way and, as a result, more complex structures and relationships can be discovered. The method is inspired by the human brain and uses a series of connected layers of neurons that constitute a network with at least three layers: input, hidden and output. The input layer consists of multiple neurons, which take the original features as input. There can be more than one hidden layer, depending on the complexity of the dataset. Each layer includes multiple nodes, and each node in one layer is connected to every node in the next layer, constituting a fully connected or dense network. Lastly, the output layer, using a sigmoid activation function, produces a number between 0 and 1, which represents the probability of belonging to one of the two classes [22, 41-44].
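As an illustration of the architecture described above, the sketch below computes a forward pass through a dense network with one hidden layer and a sigmoid output. The ReLU hidden activation, the layer sizes and the weights are assumptions chosen for the example; the architecture and hyperparameters used in the study are not reproduced.

```python
import math


def relu(values):
    """Non-linear activation for the hidden layer (an assumed choice)."""
    return [max(0.0, v) for v in values]


def sigmoid(z):
    """Map a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """Fully connected layer: weights[j] holds the incoming weights of neuron j."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(features, hidden_w, hidden_b, out_w, out_b):
    """Input -> hidden (ReLU) -> output (sigmoid): probability of the positive class."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    logit = dense(hidden, [out_w], [out_b])[0]
    return sigmoid(logit)


# Toy network: 2 input features, 2 hidden neurons, 1 output neuron
probability = forward(
    features=[1.0, 2.0],
    hidden_w=[[0.5, -0.25], [-1.0, 1.0]],
    hidden_b=[0.0, 0.0],
    out_w=[1.0, 1.0],
    out_b=-1.0,
)
print(probability)
```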
Support Vector Machine (SVM)
Support Vector Machine (SVM) is an ML model with high accuracy that can deal with non-linear spaces. The basic idea is to map the input data into a higher-dimensional feature space using a kernel function, which can be linear, polynomial, radial basis function (rbf) or sigmoid. Then, a linear decision surface is created to classify the outcome, with properties that ensure the generalisation of the algorithm. The linear decision surface is commonly called a hyperplane; the optimal hyperplane separates the classes with the maximal margin and is determined by a small percentage of the training data, named support vectors. It has been shown that if the optimal hyperplane is constructed from a few support vectors, then the algorithm can generalise, even in a space with infinite dimensions [21, 45, 46]. The SVM model is not easily interpretable; however, it is included in our study in order to compare its predictive performance with the rest of the ML models [47].
Logistic regression-L1 penalty
One of the most common and easily interpretable models is logistic regression, which is used to predict a binary outcome. The Least Absolute Shrinkage and Selection Operator (Lasso) [48], i.e. L1 regularisation, efficiently reduces the number of features of large datasets, and it has been proven to produce optimal sparse estimates when the true vector is sparse. To achieve this, it shrinks the coefficients of correlated and redundant features to zero. The method thus performs shrinkage and automatic feature selection simultaneously. L1 regularisation has been shown to be effective when relevant features need to be selected from among many irrelevant ones [49, 50].
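One way to see why the L1 penalty yields coefficients that are exactly zero is the soft-thresholding operator, which appears as the update step in coordinate-descent and proximal-gradient solvers for Lasso-type problems. The sketch below shows the operator only, under the assumption of a single coefficient and a given penalty strength; it is not the solver used in the study.

```python
def soft_threshold(coefficient, penalty):
    """Shrink a coefficient towards zero by `penalty`; small ones become exactly 0."""
    if coefficient > penalty:
        return coefficient - penalty
    if coefficient < -penalty:
        return coefficient + penalty
    return 0.0


# Hypothetical unpenalised coefficients and penalty strength
raw_coefficients = [2.0, -2.0, 0.3, -0.1]
shrunk = [soft_threshold(c, 0.5) for c in raw_coefficients]
print(shrunk)
```

Coefficients whose magnitude is below the penalty are removed from the model, which is the automatic feature selection referred to above.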
Cross-validation
ML models aim to perform well on datasets different from the ones used to train them [44]. Evaluating the generalisation of ML methods therefore requires the data to be split into three non-overlapping sets: training, validation and test. Hyperparameters were tuned using grid search combined with stratified 10-fold cross-validation (CV), which maintains the same proportion of cases and controls in each fold. For each combination of hyperparameter values, the model is fitted on 9 folds and the remaining fold is used for validation. This process is repeated 10 times, until every fold has been used once for validation. The best hyperparameters are those with the highest score, which is calculated by averaging the results from all repetitions. The test dataset is then used to check for overfitting and to obtain an unbiased evaluation of the final model [29, 41, 42]. Tables with the hyperparameter values that were examined for each model can be found in Supplementary_Material.xlsx.
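The stratified splitting scheme can be sketched in plain Python as follows. This illustrates the idea only and is not the library routine used in the study; the function names, the seed and the toy labels are assumptions.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal the members of each class to the folds in turn
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)
    return folds


def cv_splits(labels, n_folds=10, seed=0):
    """Yield (training, validation) index lists; each fold validates exactly once."""
    folds = stratified_folds(labels, n_folds, seed)
    for k in range(n_folds):
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, folds[k]


# Toy balanced dataset: 50 cases and 50 controls
labels = [1] * 50 + [0] * 50
splits = list(cv_splits(labels))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

In a grid search, the loop over `splits` would be repeated for every candidate combination of hyperparameter values, and the combination with the highest mean validation score would be retained.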
SHAP
ML models, although accurate and capable of capturing non-linear relationships, are complex to interpret and until recently were treated as “black boxes” [51]. A widely used method for interpretation is SHapley Additive exPlanations (SHAP) [32, 37], which quantifies each feature’s contribution to the prediction using tools from cooperative game theory [40, 52].
In theory, SHAP values are the best solution available to date; however, computing them exactly is time-consuming, since all possible feature combinations need to be evaluated. TreeExplainer is an extension of SHAP that uses tree nodes, instead of linear models, for the estimation of Shapley values. The Shapley values of a tree-based algorithm are calculated as the weighted average of the Shapley values corresponding to the individual trees. Thus, TreeExplainer is commonly used to explain tree-based machine learning models, such as random forests and gradient boosted trees, and greatly reduces the computation time. At the same time, consistency and local accuracy are preserved [51, 53, 54].
Additionally, SHAP values seem to overcome the interpretability issue by employing both global and local interpretation for analysis methods that use trees. Global explanation relies on the effect of input features on the whole model, and local interpretation depicts the effect of input features on single predictions [24, 51, 53].
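To make the notion of a Shapley value concrete, the sketch below computes exact Shapley values for a single prediction by brute force, enumerating all feature coalitions. This is the exponential-time definition, not the TreeExplainer algorithm; replacing absent features by a fixed baseline value, the toy model and all names are assumptions for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are set to their baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical model with an interaction between the first two features."""
    a, b, _unused = features
    return 2 * a + 3 * b + a * b


x = [1.0, 2.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print([round(p, 6) for p in phi])
```

The values sum to the difference between the prediction for `x` and the prediction for the baseline, which is the local accuracy property mentioned above; the interaction term is shared equally between the two features involved, and the unused feature receives zero.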
For the methods presented above, Python programming language was employed [55].
Results
AF
We examined 21,279 prospective AF cases and an equal number of controls in UKB, including 99 features (Supplementary_Material.xlsx (Table_S4)) and using five ML models to predict AF. The results of the ML models presented in this section correspond to the optimal hyperparameters, derived after 10-fold cross-validation from the examined values included in Supplementary_Material.xlsx (Table_S5). SVM did not converge after running for 10 days and utilising 16 cores in Queen Mary’s Apocrita HPC facility.
For the test dataset, Table 2 summarises AUROC, accuracy, precision, recall and F1 score for each model. The best AUROC value was achieved with LightGBM (0.729; Table 2), although DeLong’s test (Table 3) showed no evidence of a significant difference in the AUROCs between LightGBM and XGBoost, DNN or RF. In contrast, DeLong’s test showed a statistically significant difference in the AUROCs between LightGBM and penalised LR (pvalue=1.38E-02), after correction for multiple testing. Indeed, the AUROC of penalised LR differed significantly from the AUROC of every other examined ML model based on DeLong’s test (Table 3). The AUROC curves for the 5 ML models in the test dataset are shown in Figure 1.
Table 2. Performance of the ML models for AF prediction, on the test dataset, under various metrics.
Table 3. DeLong’s test for the ML model comparisons for AF prediction.
Figure 1. AUROC for each ML model for AF prediction in the test dataset.
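For reference, the metrics reported for the test dataset can be computed as in the following sketch. The AUROC is calculated here as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The function names, the 0.5 classification threshold and the toy data are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels and predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auroc(y_true, scores):
    """Fraction of case-control pairs in which the case has the higher score."""
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy example: predicted probabilities for 3 cases and 3 controls
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = classification_metrics(y_true, y_pred)
area = auroc(y_true, scores)
print(metrics, round(area, 3))
```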
To estimate the contribution of each feature in each of the 5 models we assessed for prediction of AF, we employed the TreeExplainer SHAP analysis, which is accurate, fast and stable (see Methods). Figure 2 displays the top 20 features, ranked according to their SHAP value, for the LightGBM model; features are listed in descending order, starting with the most significant for AF prediction. SHAP values depict the distribution of the effect of each feature on the model output.
Figure 2. In both plots the top 20 features are depicted, in descending order, for the AF prediction on the test dataset, employing the LightGBM model. On the top is the feature importance plot of the mean absolute SHAP values (x-axis) for the top 20 features (y-axis). On the bottom is the summary plot of the SHAP values (x-axis) for the top 20 features (y-axis), showing the distribution of the impact that each feature has on the model. Each dot represents a participant. The red dots represent a high feature value and blue dots represent a low feature value for each participant. For example, the AF SCORE had a positive impact on the model output, i.e., a higher AF SCORE increased AF risk.
Based on Figure 2, SHAP analysis reveals that the top 3 most important variables contributing to the model were “Records in HES inpatient diagnoses dataset” (fieldID 41234), “Age at recruitment” (fieldID 21022) and “AF SCORE”, using the unweighted sum of increasing alleles from Roseli et al. [27]. All the features’ contributions, based on SHAP analysis, can be found in Supplementary_Material.xlsx (Table_S6).
AF & Stroke
We examined 3,150 prospective cases who developed ischemic stroke after being diagnosed with AF, and an equal number of controls in UKB, including 129 features (Supplementary_Material.xlsx (Table_S7)) and using six ML models to predict ischemic stroke in AF cases. As indicated previously, results correspond to the optimal hyperparameters (Supplementary_Material.xlsx (Table_S8)).
For the test dataset, Table 4 summarises the AUROC, accuracy, precision, recall and F1 score for each of the six models assessed for prediction of ischemic stroke in AF cases. The best AUROC value was achieved for XGBoost (Table 4). DeLong’s test (Table 5) showed no evidence of a significant difference in the AUROCs between XGBoost and any other examined ML model except the penalised LR model (pvalue=2.00E-02).
Table 4. Performance of the ML models for the prediction of ischemic stroke in AF patients, on the test dataset, under various metrics.
Table 5. DeLong’s test for the ML model comparisons for ischemic stroke prediction in AF patients.
As shown in Figure 4, SHAP analysis revealed that the 3 most important variables contributing to prediction of ischemic stroke in AF cases in the model were “Records in HES inpatient diagnoses dataset” (fieldID 41234), “Age at recruitment” (fieldID 21022), and “Glycated haemoglobin (HbA1c)” (fieldID 30750). Supplementary_Material.xlsx (Table_S9) lists the contribution of each of the 129 features in the model based on SHAP analysis.
Figure 3. AUROC for each ML model for predicting the development of ischemic stroke in AF patients, on the test dataset.
Figure 4. In both plots the top 20 features are depicted, in descending order, for the development of ischemic stroke in AF patients, on the test dataset, employing the XGBoost model. On the top is the feature importance plot of the mean absolute SHAP values (x-axis) for the top 20 features (y-axis). On the bottom is the summary plot of the SHAP values (x-axis) for the top 20 features (y-axis), showing the distribution of the impact that each feature has on the model. Each dot represents a participant. The red dots represent a high feature value and blue dots represent a low feature value for each participant.
Comparison with the CHA2DS2-VASc
The current tool used for the prediction of ischemic stroke occurrence among AF patients is CHA2DS2-VASc [56], which considers multiple risk factors: age, sex, heart failure, hypertension, stroke, vascular disease and diabetes. Thus, we decided to compare the performance of our best ML model, XGBoost (Table 4), with CHA2DS2-VASc in UKB. To construct the CHA2DS2-VASc we employed the codes described in Supplementary_Material.xlsx (Table_S10). The AUROC and 95% CI for CHA2DS2-VASc and XGBoost were 0.611 (0.585 – 0.638) and 0.631 (0.604 – 0.657) in the test set, respectively. The improved AUROC of the XGBoost model compared to CHA2DS2-VASc was statistically significant based on DeLong’s test for the difference between the two models (pvalue=2.20E-06). Furthermore, SHAP analysis for the XGBoost model (Figure 4) shows that a considerable number of peripheral blood markers are associated with ischemic stroke, which are overlooked by CHA2DS2-VASc.
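For readers unfamiliar with the score, the sketch below computes CHA2DS2-VASc using its standard point allocation: one point each for congestive heart failure, hypertension, diabetes, vascular disease, age 65-74 years and female sex, and two points each for age 75 years or older and prior stroke, transient ischemic attack or thromboembolism. The function and argument names are assumptions; how each component was derived from UKB codes is described in Table_S10 and is not reproduced here.

```python
def cha2ds2_vasc(age, female, heart_failure, hypertension, diabetes,
                 prior_stroke_tia, vascular_disease):
    """Return the CHA2DS2-VASc score (0-9) from its standard components."""
    score = 0
    score += 1 if heart_failure else 0      # C: congestive heart failure
    score += 1 if hypertension else 0       # H: hypertension
    if age >= 75:                           # A2: age 75 years or older
        score += 2
    elif age >= 65:                         # A: age 65-74 years
        score += 1
    score += 1 if diabetes else 0           # D: diabetes mellitus
    score += 2 if prior_stroke_tia else 0   # S2: stroke/TIA/thromboembolism
    score += 1 if vascular_disease else 0   # VA: vascular disease
    score += 1 if female else 0             # Sc: sex category (female)
    return score


# Hypothetical participant: 78-year-old woman with hypertension and a prior stroke
example_score = cha2ds2_vasc(
    age=78, female=True, heart_failure=False, hypertension=True,
    diabetes=False, prior_stroke_tia=True, vascular_disease=False,
)
print(example_score)
```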
Discussion
Comparison of the performance of ML models for prediction of AF or ischemic stroke in patients with AF
We assessed six ML models in total for prediction of AF (XGBoost, LightGBM, RF, DNN, penalised LR) or ischemic stroke in patients with AF (XGBoost, LightGBM, RF, DNN, SVM, penalised LR) and employed SHAP analysis to rank features for predictive importance in a model. To the best of our knowledge, this is the first study using ML models to predict AF and ischemic stroke occurrence in AF patients in UKB. SHAP analysis was successful in the visualisation of non-linear relationships between the features used for prediction and the outcome. Additionally, the direction of the SHAP values for the top 20 features is in agreement with what has been reported so far in the literature. We found that the ensemble learning models LightGBM (best for AF prediction) and XGBoost (best for prediction of ischemic stroke in patients with AF) achieved higher AUROCs compared to the other examined models, suggesting that these models have better generalisation. In the case of models for AF prediction, DeLong’s test showed that penalised LR model had a lower AUROC compared to all other models and these differences were statistically significant (e.g., pvalue=1.38E-32 with LightGBM), indicating that ML models capture useful information by modeling non-linear associations. However, we cannot exclude that the performance of the examined ML models in this study may differ when applied to another dataset. For this reason, validation in datasets with different patient characteristics will be required in order to generalise these findings.
AF results
Advancing age has been shown to be one of the most important risk factors for AF [3, 4, 57-60], and in our LightGBM model for AF prediction it was ranked as the second most important feature. The third most important feature in our model was the AF SCORE, a set of 94 genome-wide variants associated with AF and explaining 42% of the heritability in Europeans [27], which as expected had a positive impact on the model output, i.e. the higher the AF SCORE the higher the risk of developing AF. Thus, our results endorse the likely clinical utility of an AF score in disease prediction [61]. However, an optimised AF score for prediction in multi-ethnic populations such as the UK population will be required prior to considering clinical use. Interestingly, standing height was ranked as the fourth most significant feature in our best performing model for AF prediction. Greater height has been identified as a risk factor for AF in several studies [2, 9, 62-64] and in both males [10, 65] and females [65, 66]. Some studies report that taller people have greater heart chamber size [9, 10, 63, 65, 66], meaning a larger left atrial size, which may be a potential explanation, albeit not a very robust one, as AF is driven by left atrial stretch and fibrosis. Two other anthropometric traits, weight and waist circumference, ranked just below standing height. Obesity is associated with increased risk of left atrial enlargement, atrial fibrosis, electrical derangements of the atria, impaired diastolic function, inflammation and accumulation of pericardial fat, which are all key mechanisms in the pathogenesis of AF [10, 62, 65, 66]. The ranking of sex as the 7th most significant feature in the model is also in agreement with epidemiological studies reporting sex differences in AF risk, in the electrophysiologic properties of the atria and in structural remodelling [59, 60]; males are at higher risk, which is consistent with our results.
Our analysis also found that participants with lower albumin levels (feature ranked 9th) had an increased risk of AF. This is in agreement with findings in two recent studies: Liao et al. [67] found that albumin was inversely associated with risk of AF in the Atherosclerosis Risk in Communities (ARIC) study, and the meta-analysis by Wang et al. [68] revealed that an increase in albumin level decreased the risk of AF. However, low albumin levels are associated with poor health overall and therefore we cannot exclude confounding. Among the remaining top 20 features in the model it is worth noting that (i) direct bilirubin (13/20) has been reported as an important independent risk factor for AF development both in thyrotoxic patients and in a study of patients after cardiac surgery [69, 70], (ii) urate (14/20) has been reported to increase the risk of AF [71] and to be causally associated with AF through Mendelian randomisation (MR) analysis in Koreans [72], and (iii) the positive effect of increased testosterone (17/20) on risk of AF has been reported in males but not in females in the ARIC study [73]. Finally, only two of the top 20 features have conflicting data in the literature. FEV-1 (16/20) was associated with an increased risk of AF, as shown in other studies [74], but the Korean National Health and Nutritional Examination Survey reported an adverse association between FEV-1 and AF development [75]. Triglycerides (20/20) contributed to increased risk of AF, but a study in Chinese participants showed no evidence of association between triglycerides and incidence of AF [76].
AF & Ischemic stroke results
In our study, the XGBoost model was the best at predicting stroke in AF patients (AUROC 0.631) and performed better than CHA2DS2-VASc; the improvement was marginal but statistically significant (p-value 2.20E-06 for DeLong’s test). Unexpectedly, the genetic risk score for stroke (STROKE score [28]) was not among the top 20 features of our model, although ischemic stroke is highly heritable [77, 78]. Among the top 20 most significant features, HbA1c ranked third, after sex, and medium to high HbA1c values were associated with increased risk of stroke in AF patients. This is in agreement with findings from the Clalit Health Services electronic medical records database in Israel, where participants with diabetes and AF had an increased risk of stroke when their HbA1c levels ranged from medium to high [79]. The fourth most significant feature was albumin, which ranked 9th in the AF prediction model, suggesting a stronger relationship with ischemic stroke in AF patients than with AF per se. A study in Japanese participants reported that lower albumin levels were associated with an increased risk of ischemic stroke in both sexes, independently of AF status [80]. Four other blood biomarkers, creatinine, alkaline phosphatase, LDL cholesterol and lipoprotein(a) (Lp(a)), ranked among the top 20 features (6th, 7th, 10th and 17th, respectively). These results are in agreement with the China National Stroke Registry, which reported an association between high levels of alkaline phosphatase and recurrent stroke [81], and with the Copenhagen General Population Study, which showed that high levels of Lp(a) were associated with increased risk of ischemic stroke [82, 83]. It is worth noting that the association of Lp(a) with increased risk of ischemic stroke, although present in all examined ancestries, varies in strength, e.g. it is stronger in African Americans than in European Americans [84].
Interestingly, the use of creatinine as a marker for increased risk of ischemic stroke in AF patients has not been previously reported and merits further investigation. Lastly, the 20th feature identified from the SHAP analysis, time spent watching television, could be considered a surrogate marker for lack of sleep and physical inactivity. A study by Katzmarzyk et al. [85] showed that physical inactivity increases the risk of stroke, whereas a study in UK Biobank showed a dose-response joint association of sleep scores and physical inactivity with ischemic stroke mortality [86].
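The comparison between XGBoost and CHA2DS2-VASc reported in this section rests on DeLong's test for two correlated AUROCs, i.e. two scores evaluated on the same individuals. As a sketch of the idea, and not the implementation used in this study, the Python code below computes both AUROCs from the Mann-Whitney kernel and tests their difference. All labels and scores are hypothetical toy values; the integer column only mimics the tied values that a point-based score such as CHA2DS2-VASc produces, with ties counted as 0.5.

```python
from math import erfc, sqrt


def psi(x, y):
    """Mann-Whitney kernel: 1 if the case outranks the control, 0.5 on a tie."""
    if x > y:
        return 1.0
    return 0.5 if x == y else 0.0


def placements(case_scores, control_scores):
    """Return the AUROC and the per-case and per-control components."""
    m, n = len(case_scores), len(control_scores)
    v10 = [sum(psi(x, y) for y in control_scores) / n for x in case_scores]
    v01 = [sum(psi(x, y) for x in case_scores) / m for y in control_scores]
    return sum(v10) / m, v10, v01


def cov(a, b):
    """Sample covariance with an (n - 1) denominator."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


def delong_test(labels, scores_a, scores_b):
    """Two-sided DeLong test; returns (auc_a, auc_b, z, p_value)."""
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    auc_a, v10_a, v01_a = placements([scores_a[i] for i in cases],
                                     [scores_a[i] for i in controls])
    auc_b, v10_b, v01_b = placements([scores_b[i] for i in cases],
                                     [scores_b[i] for i in controls])
    # Variance of the AUROC difference, built from case and control components.
    var_cases = cov(v10_a, v10_a) + cov(v10_b, v10_b) - 2 * cov(v10_a, v10_b)
    var_controls = cov(v01_a, v01_a) + cov(v01_b, v01_b) - 2 * cov(v01_a, v01_b)
    var = var_cases / len(cases) + var_controls / len(controls)
    diff = auc_a - auc_b
    if var <= 0.0:
        if diff == 0.0:
            return auc_a, auc_b, 0.0, 1.0  # no difference and no variance
        raise ValueError("zero variance for the AUROC difference; test undefined")
    z = diff / sqrt(var)
    return auc_a, auc_b, z, erfc(abs(z) / sqrt(2.0))


# Hypothetical toy data: 1 = stroke after AF, 0 = control.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
model_scores = [0.9, 0.8, 0.55, 0.6, 0.3, 0.5, 0.4, 0.35, 0.2, 0.7]
point_scores = [4, 3, 2, 2, 1, 2, 1, 3, 0, 2]

auc_model, auc_points, z_stat, p_value = delong_test(labels, model_scores, point_scores)
print(f"AUROC model={auc_model:.3f}, points={auc_points:.3f}, z={z_stat:.3f}, p={p_value:.3f}")
```

This version compares every case with every control, so its cost grows with the product of the two group sizes; it is meant to make the statistic explicit rather than to be run on a full cohort.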
In summary, we present ML models for the prediction of AF (LightGBM) and of stroke in AF patients (XGBoost) that have the potential for clinical use, but validation in further independent studies is required. Importantly, the models will need to be validated across all ancestries, as some features vary by ethnicity, e.g. Lp(a) and the AF genetic score. Our results endorse the incorporation of a number of routinely measured blood biomarkers, whereas they support the inclusion of a genetic score only in the model for AF prediction.
Data Availability
The data are from UK Biobank.
Footnotes
1 This research utilised Queen Mary’s Apocrita HPC facility, supported by QMUL Research-IT. http://doi.org/10.5281/zenodo.438045