Actionable absolute risk prediction of atherosclerotic cardiovascular disease: a behavior-management approach based on data from 464,547 UK Biobank participants

Cardiovascular diseases (CVDs) are the leading cause of death globally. Timely and accurate identification of people at risk of developing an atherosclerotic CVD and its sequelae, via risk prediction models, is a central pillar of preventive cardiology. However, currently available models consider only a limited set of risk factors and outcomes, do not focus on providing actionable advice to individuals based on their holistic medical state and lifestyle, are often not interpretable, and were built with small cohorts or, as with the Framingham model, are based on lifestyle data from the 1960s. The risk of developing atherosclerotic CVDs is heavily lifestyle dependent, potentially making a high percentage of occurrences preventable. Providing actionable and accurate risk prediction tools to the public could assist in atherosclerotic CVD prevention. We developed a benchmarking pipeline to find the best combination of data preprocessing steps and algorithms for predicting absolute 10-year atherosclerotic CVD risk. Based on the data of 464,547 UK Biobank participants without atherosclerotic CVD at baseline, we used a comprehensive set of 203 consolidated risk factors associated with atherosclerosis and its sequelae (e.g. heart failure). Our two best performing absolute atherosclerotic CVD risk prediction models provided higher performance than Framingham and QRisk3. Using a subset of 25 risk factors identified with feature selection, our reduced model achieves similar performance while being less complex. Further, it is interpretable, actionable and highly generalizable. The model could be incorporated into clinical practice and could allow continuous personalized predictions with automated intervention suggestions.

Introduction

tools (10) (13). A relatively new area of screening is self-screening, carried out by proactive individuals, using smartphone or smartwatch app-based screening tools, which may use built-in app-linked sensors, or screening chatbots (14-16). There is public demand for reliable, actionable, explainable and usable health information tools (17), including for disease screening.

The risk of building up atherosclerotic plaque varies and is determined by multiple factors such as genetics, environment and lifestyle (11,18-21). With genetics being unmodifiable and the environment being difficult to change, the risk of developing atherosclerotic plaque can be reduced through an individual's lifestyle, which is modifiable (19,20). Thus, atherosclerotic CVD is actionable and preventable by addressing behavioral risk factors, such as smoking, physical activity and nutrition (1,11,19,20).

Previous studies in this area use an outdated or very limited set of risk factors and outcomes for their analysis (7,25). In recent years, the knowledge of behavioral risk factors and of the pathophysiology of atherosclerotic CVDs has advanced tremendously (11,25). Current absolute risk prediction models have limited predictive capability as they have not been trained

medRxiv preprint doi: https://doi.org/10.1101/2021.11.24.21266742; this version posted November 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

The UK Biobank is a long-term prospective large-scale biomedical database including over 500,000 participants aged 40-69 years (when recruited between 2006 and 2010). The database is globally accessible to approved researchers undertaking research into the most common and life-threatening diseases and continuously collects phenotypic and genotypic data about its participants, including data from questionnaires, physical measures, blood, urine and saliva samples, and lifestyle data (39). This data is further linked to each participant's health-related records, accelerometry, multimodal imaging, genome-wide genotyping and longitudinal follow-up data for a wide range of health-related outcomes (39,40).
The UK Biobank study protocol is available online (41). The North West Multi-centre Research Ethics Committee approved the UK Biobank study and all participants provided written informed consent prior to study enrollment. Our research is covered by the UK Biobank's Generic Research Tissue Bank (RTB) Approval and was approved by the UK Biobank Access Management Team (42).

We excluded participants with atherosclerotic CVDs present before or during baseline, participants who chose to leave the UKB study and participants who were lost for various reasons. The resulting cohort consisted of 464,547 participants. The last available date of participant follow-up was March 5th, 2020.

Risk factor definition

We curated a list of all generally known risk factors and outcomes for atherosclerotic CVDs from the medical literature and from validated risk prediction models. This preliminary list of risk factors was reduced through curation to focus on those factors that are clearly involved in the pathophysiology of atherosclerosis and those that are modifiable through behavioral change. The curation was carried out by three medical doctors with experience in diagnosing or scientifically modelling cardiovascular diseases. We consolidated all relevant UKB columns into

supplementary data (S1 Table).

Outcome definition

In the same manner as described above, an initial list of atherosclerotic CVDs was further reviewed and curated by the same team of medical doctors. All resulting CVDs of interest are associated with atherosclerotic plaque build-up, are modifiable and relate to the collected risk factors only.
Thus, we disregard brain haemorrhages due to accidents as well as congenital and pregnancy-related CVDs, which are not actionable. The curated list of all ICD-10 and ICD-9 outcomes meeting the above criteria consists of 193 total (125 unique) CVD outcomes, e.g. coronary/ischaemic heart disease, heart attack, angina, stroke, cardiac arrest, congestive heart failure, left ventricular failure, myocardial infarction, aortic valve stenosis, cerebral artery occlusions and nontraumatic haemorrhages. A list with all outcome codes used in our analysis is provided in the supplementary data (S2 Table). An atherosclerotic CVD event was defined as the first occurrence of any of the atherosclerotic CVD outcome diagnosis codes, including as a primary or secondary cause of death, during the 10-year follow-up period.

The cohort consists of 10.56 million patients between the ages of 25 and 84 years, where 75% of the patients were used for training and 25% for validation. Patients with a pre-existing CVD, a missing Townsend score or using statins were removed from the baseline. Patients were classified as having a positive CVD outcome when any of the following outcomes was present during follow-up in the GP, hospital or mortality records: coronary heart disease, ischaemic stroke, or transient ischaemic attack.
QRisk3 used the following ICD-10 codes: G45 (transient ischaemic attack and related syndromes), I20 (angina pectoris), I21 (acute myocardial infarction), I22 (subsequent myocardial infarction), I23 (complications after myocardial infarction), I24 (other acute ischaemic heart disease), I25 (chronic ischaemic heart disease), I63 (cerebral infarction), and I64 (stroke not specified as haemorrhage or infarction). The utilized

several simpler practical risk predictors, as determined by an iterative feature elimination procedure outlined below. L1 regularization for logistic regression implements a strong penalty for non-zero feature weights, resulting in a feature selection procedure that discards features that are likely to be non-predictive. Random Forest is an ensemble method that fits many decision trees independently, each to a subset of the data. We implemented both methods using their scikit-learn library implementation.

Finally, we evaluated Extreme Gradient Boosting. Gradient boosting is an ensemble tree-based machine learning method that combines many weak classifiers to produce a stronger one. It sequentially fits a series of classification or regression trees, with each tree created to predict the outcomes misclassified by the previous tree (45). By sequentially predicting the residuals of previous trees, the gradient boosting process focuses on predicting more difficult cases and correcting its own shortcomings. Extreme Gradient Boosting (XGB / XGBoost) is a specific implementation of the gradient boosting process that uses memory-efficient algorithms to improve computational speed and model performance (38,46).
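The sequential residual-fitting idea behind gradient boosting can be illustrated with a minimal sketch. This is not the XGBoost implementation used in the study: it assumes a squared-error loss on a synthetic regression target, uses scikit-learn's DecisionTreeRegressor as the weak learner, and the function names, tree count, learning rate and tree depth are hypothetical choices made only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit shallow trees one after another, each on the current residuals."""
    base = float(np.mean(y))              # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction        # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # Take a small step towards the residuals predicted by the new tree.
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees


def predict_boosted(base, trees, X, learning_rate=0.1):
    """Sum the constant base prediction and the scaled tree contributions."""
    out = np.full(X.shape[0], base)
    for tree in trees:
        out += learning_rate * tree.predict(X)
    return out


# Synthetic data (an assumption for illustration, not UK Biobank data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

base, trees = fit_boosted_trees(X_demo, y_demo)
mse_base = float(np.mean((y_demo - base) ** 2))
mse_boosted = float(np.mean((y_demo - predict_boosted(base, trees, X_demo)) ** 2))
print(f"training MSE, constant model: {mse_base:.3f}, boosted trees: {mse_boosted:.3f}")
```

Each added tree only has to explain what the earlier trees missed, which is why the ensemble concentrates on the harder cases. For a binary outcome such as a CVD event, the residuals would be replaced by the gradient of a classification loss.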
For completeness, we evaluated a number of other standard classifiers, but discarded them due to excessive computational complexity or inferior performance, so we do not report their

reporting.

We implemented all models using their respective scikit-learn or xgboost library implementation in the Python programming language (37,38). Details on the Python libraries and methods used are provided in the supplementary data (S3 and S4 Tables). Categorical values were one-hot encoded. Data normalization was performed by removing the mean and scaling to unit variance. Data imputation was performed for all models using a simple mean imputation. The models' hyper-parameters were determined using grid search and stratified k-fold cross validation with 3 folds to avoid overfitting. Finally, we assessed model performance mainly using the AUROC.

Iterative feature elimination

We employed an iterative feature elimination procedure based on the regularized logistic regression to find the best trade-off between predictive performance and number of risk factors, with the aim of creating a risk prediction algorithm that is applicable in the clinical context. We used the standard L1 regularization (also known as Lasso) proposed by (54); it implements a strong penalty on non-zero feature weights of our logistic regression model, resulting in a sparse feature set for prediction.

A logistic regression coefficient value $\beta$ can be interpreted as the expected change in the log odds of having the outcome per unit change in the feature $x$.
Therefore, increasing the feature by one unit multiplies the odds of having the outcome by $e^{\beta}$. This means that we can interpret the coefficients as feature importance values, in the sense that the feature with the smallest absolute coefficient has the least influence on model predictions. Importantly, this holds true only in the context of the features contained in the current model. Thus, we re-estimate the model after each feature elimination round.

In each iteration, we re-estimated the logistic regression model on the remaining features, and then discarded all features whose weights were set to zero by the L1 regularization; finally, we also
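The preprocessing, tuning and elimination steps described in this section can be sketched as follows. This is a minimal sketch on synthetic data, not the authors' pipeline. The helper names make_model and iterative_l1_elimination, the grid of C values and min_features are hypothetical. One-hot encoding is omitted because the synthetic features are all numeric. The source sentence describing the final step of each round is truncated, so the sketch assumes, purely for illustration, that the surviving feature with the smallest absolute weight is also dropped so that the loop always terminates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_model(C):
    """Mean imputation, standardization, then L1-penalized logistic regression."""
    return make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=C),
    )


def iterative_l1_elimination(X_train, y_train, X_test, y_test, C, min_features=2):
    """Refit on a shrinking feature set; return (n_features, hold-out AUROC) per round."""
    active = np.arange(X_train.shape[1])
    history = []
    while len(active) >= min_features:
        model = make_model(C).fit(X_train[:, active], y_train)
        scores = model.predict_proba(X_test[:, active])[:, 1]
        history.append((len(active), float(roc_auc_score(y_test, scores))))
        weights = np.abs(model[-1].coef_.ravel())
        survivors = np.flatnonzero(weights > 0)   # features kept by the L1 penalty
        if len(survivors) <= min_features:
            break
        # Assumption (the source is truncated here): also drop the weakest survivor.
        weakest = survivors[np.argmin(weights[survivors])]
        active = active[survivors[survivors != weakest]]
    return history


# Synthetic stand-in data with some missing values (not UK Biobank data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Grid search over the regularization strength with stratified 3-fold CV on AUROC.
search = GridSearchCV(
    make_model(1.0),
    {"logisticregression__C": [0.01, 0.1, 1.0]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)
best_C = search.best_params_["logisticregression__C"]

history = iterative_l1_elimination(X_train, y_train, X_test, y_test, C=best_C)
for n_features, auc in history:
    print(f"{n_features:2d} features -> hold-out AUROC {auc:.3f}")
```

Plotting the AUROC against the number of remaining features gives the kind of performance versus complexity trade-off curve from which a reduced questionnaire could be chosen. Because the features are standardized before fitting, the absolute weights are comparable across features.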

Of 502,551 patients in the UK Biobank, we filtered out 7.6%: those who had already experienced a relevant CVD outcome (during or before baseline) and those who were lost or withdrew from the biobank. This resulted in 464,547 participants who met the inclusion criteria. 28,561 (6.1%) of those participants developed at least one of the relevant CVD outcomes during their 10-year follow-up period. We used a conventional split of 70% of the data as a training set and 30% as a hold-out test set. Table 1

In order to better evaluate the clinical implications and significance of our results, we compared the results of our benchmarked models with our baseline models Framingham and QRisk3. Table 2 shows that both our XGB and Logistic Regression classifiers achieved superior

We assessed the generalizability of our models with the aforementioned approach of re-training the two previously best performing models only on a white cohort and testing them on a non-white cohort. Table 4
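The train-on-one-group, test-on-another check described above can be sketched as follows. This is a minimal sketch on synthetic data: the group indicator, the outcome-generating coefficients and the variable names are hypothetical, and both groups share the same outcome mechanism by construction, so the two AUROC values are expected to be close here. With real cohorts, a large gap between them would point to limited transferability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic cohort (an assumption for illustration, not UK Biobank data).
rng = np.random.default_rng(42)
n = 4000
X = rng.normal(size=(n, 6))
group = (rng.random(n) < 0.1).astype(int)      # 1 marks the smaller subgroup
logits = 1.2 * X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 2] - 2.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Train only on the majority group, keeping an in-group hold-out set (70/30).
majority = group == 0
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X[majority], y[majority], test_size=0.3, stratify=y[majority], random_state=0)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
model.fit(X_tr, y_tr)

# Compare discrimination within the training group and on the unseen group.
auc_within = roc_auc_score(y_ho, model.predict_proba(X_ho)[:, 1])
auc_across = roc_auc_score(y[~majority], model.predict_proba(X[~majority])[:, 1])
print(f"AUROC on majority hold-out: {auc_within:.3f}")
print(f"AUROC on minority group:    {auc_across:.3f}")
```

Because the smaller group contributes far fewer events, its AUROC estimate is noisier, so a confidence interval (for example from bootstrapping) would be a sensible addition before drawing conclusions.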

Fig 5. AUROC of Logistic Regression with L1 regularization and XGBoost when trained on whites and tested on non-whites.
Predictive ability of individual variables in UK Biobank

Table 5 shows the relative regression feature weights of the 25 most informative risk factors in descending order. A full list is provided in the supplementary materials (S5 Table). Based on our previous manual curation of risk factors and outcomes, we can see that the most informative risk factors are distributed across 5 categories (

showing high predictive power, which are social visits, walking pace and overall health rating. The frequency of social visits could be indicative of someone's current mental health status, which has been shown to be a relevant CVD risk factor (55,56). These and other non-laboratory risk factors could be collected by means of a questionnaire or passively deduced using data analytics from data sources such as GPS, calendar and sensors from smartphones, smartwatches and fitness trackers.

Additionally, our best performing models, XGBoost and Logistic Regression, showed marginal differences when trained and tested on particular sub-populations, which is indicative of good generalizability to other ethnicities.

As there was little performance difference between the best performing models, we primarily discuss the simplest model, Logistic Regression with L1 regularization. This model has the inherent benefit of offering reasoning for its predictions, through analyzing the learned coefficients for every risk factor and having feature selection performed by the L1 regularization.

(11,19,21,25).
Innovative approaches are needed in order to tackle the increasing prevalence and mortality of CVD-related diseases (2), and the associated financial burden on healthcare systems. This is especially required in low and middle income countries, where CVD prevalence has also been increasing and is expected to increase further as a consequence of an aging and growing population (2).

There is potential for novel disruptive approaches to affordably improve CVD outcomes. Areas where this may have an impact include novel approaches to screening, lifestyle coaching and prevention (2). Screening will become more accessible and widespread as more (near-)medical-grade sensors are integrated into smartphones and smartwatches, enabling continuous monitoring of relevant behavioral CVD risk factors, as well as biomarkers such as heart rate, blood pressure and blood glucose. By gathering a wider spectrum of relevant risk factors for cardiovascular disease automatically and continuously, an ongoing and personalized cardiovascular disease risk prediction could be enabled. Through linking personalised information on an individual's CVD risk with app-based programmes for sustained behavioural modification, it may be possible to lower the incidence and mortality of CVDs (58). Combined with a companion smartphone-based app, an AI or healthcare provider-generated personalised intervention program could be provided, and targeted at those people who need it the most. Many studies have shown that digital health interventions are cost effective for managing CVD (for a review see (59)). One report found that a community-based prevention program could have a mean return on investment (ROI) on medical cost savings of $5.60 for every $1 spent within a 5 year timeframe by improving physical activity and nutrition and reducing tobacco
The improvement achieved by our models might be partially attributed to being trained and assessed on the UK Biobank dataset, whereas the baseline Framingham model was derived from a different population. The population and many of the data sources used in the QRisk3 model are similar, being the general UK population and using their GP, hospital and mortality records. However, our risk model generation approach and QRisk3's approach were designed with different aims and objectives, and the modelling strategy was different. For these reasons, direct comparison between the models is limited. Notable differences between the approaches include the more limited set of risk factors included in Framingham and QRisk3 and the more focused and wider range of atherosclerotic CVDs included in our approach.

The results from our generalizability subanalysis indicate that our XGB and Logistic Regression models might generalize well to other ethnicities and do not overfit to our cohort; however, this needs to be further evaluated with more data from diverse ethnicities.

Our results show that our models have improved performance over the baseline models Framingham and QRisk3 (Table 2). This is likely because selecting an appropriate disease modelling approach and classifiers, and carefully tuning the model's hyperparameters, are crucial steps for realizing the potential benefits of ML. Our pipeline automates some of these steps, which makes the tuning and discovery of new disease risk models easily accessible for clinical research. Our prospective cohort modelling approach, which is rooted in precision medicine, is

The UK Biobank only admitted participants aged 40 and above at their initial signup. This might limit the applicability of the risk score for younger populations, and further tests with data from younger populations need to be conducted.

There are many missing data values related to the potential risk factors for many participants. Having more unimputed data on relevant CVD risk factors could improve the predictive performance of all our benchmarked classifiers and could also lead to changes in the classifier ranking in Table 2 and the relative risk factor importances in Table 5. However, the use of imputed data is highly unlikely to have an impact on our conclusion that a holistic set of risk factors and an exhaustive atherosclerotic CVD outcome definition could improve actionable atherosclerotic CVD risk prediction.

An additional limitation of our study is that the UK Biobank dataset consists of participants of predominantly (88%) British ethnicity, with an even larger portion having a white background (91%). Therefore, further assessments of the influence of the ethnicity predictor need to be carried out to enable a generalizable tool.
Previous work in this area indicates that the plaque growth process seems to be independent of ethnicity (21).

A further limitation of this UK-focused dataset is that socio-economic and other environmental factors differ between countries. This is another potential bias that needs to be further evaluated with datasets from other countries with different socio-economic characteristics.

Disease risk prediction models which include subjective non-laboratory risk factors, such as the self-reported health rating and usual walking pace, should be cautiously evaluated to minimize self-report bias. These risk factors have been found to be good predictors of someone's overall CVD risk in another study using UK Biobank data (28).

We benchmarked multiple classifiers to predict an individual's 10-year risk of developing an atherosclerotic CVD, using a holistic set of risk factors and a specific definition of atherosclerotic CVDs. Our reduced Logistic Regression with L1 regularization classifier, a simple and interpretable model, is amongst our best prediction models, includes actionable lifestyle factors, has high predictive power and requires 13 unique features. Our experiments showed that a two-feature questionnaire is as accurate as the Framingham models and a 16-feature questionnaire is as accurate as QRisk3 for 10-year atherosclerotic CVD risk prediction. Both prediction models, XGBoost and Logistic Regression, generalize well to non-white people, which might indicate that our models generalize well to other (western) countries. Framingham and QRisk3, which are well established and validated absolute risk prediction models, do not perform as well at predicting individuals' 10-year risk of developing an atherosclerotic CVD. With our Logistic Regression model, we created a promising new interpretable, actionable and accurate risk prediction tool that could assist individuals and public health in CVD risk reduction.