Leveraging Machine Learning for Enhanced and Interpretable Risk Prediction of Venous Thromboembolism in Acute Ischemic Stroke Care

Abstract Background: Stroke is the second leading cause of death globally, with acute ischemic strokes constituting the majority. Venous thromboembolism (VTE) poses a significant risk during the acute phase post-stroke, and early recognition is critical for preventive intervention of VTE. Methods: Utilizing data from the Shenzhen Neurological Disease System Platform to develop multiple machine learning models that included variables such as demographics, clinical data, and laboratory results. Advanced technologies such as K nearest neighbor and synthetic minority oversampling technique are used for data preprocessing, and algorithms such as gradient boosting machine and support vector machine are used for model development.Feature analysis of optimal models using SHapley Additive exPlanations interpretable algorithm. Results: In our study of 1,632 participants, in which women were more prevalent, the median age of patients with VTE was significantly older than that of non-VTE individuals. Data analysis showed that key predictors such as age, alcohol consumption, and specific medical conditions were significantly associated with VTE outcomes. The AUC of all prediction models is above 0.7, and the GBM model shows the highest prediction accuracy with an AUC of 0.923. These results validate the effectiveness of this model in identifying high-risk patients and demonstrate its potential for clinical application in post-stroke VTE risk management. Conclusion: This study presents an innovative, machine learning-based approach to predict VTE risk in acute ischemic stroke patients, offering a tool for personalized patient care. Future research could explore integration into clinical decision systems for broader application.


Introduction
Stroke remains a paramount health challenge worldwide, ranking as the second leading cause of mortality and contributing to approximately 6.1 million deaths each year.It also stands as a principal cause of long-term disability (1).Among its subtypes, acute ischemic strokes (AIS) predominate, significantly influencing the convalescence and functional recovery of patients.A critical concern in the aftermath of an ischemic stroke is the development of venous thromboembolism (VTE), a severe complication typically arising within the first two weeks post-event, with its highest risk noted in the initial seven days (2,3).The occurrence of VTE is closely linked with an escalated risk of mortality within three months following a stroke, highlighting its importance as a preventable aspect of post-stroke care management (4,5).
The arena of VTE risk prediction in acute ischemic stroke patients is riddled with complexities.This is largely due to the dependence on traditional risk assessment models such as the Caprini scoring system and the Padua Prediction Score, which inadequately address the multifaceted risk factor interplay inherent to this patient group.These models often overlook critical elements such as demographic nuances, clinical manifestations, and the detailed medical histories specific to individuals who have suffered a stroke (6)(7)(8).Given the heterogeneous nature of stroke survivors, accurately predicting VTE risk necessitates consideration of a myriad of determinants, including age, gender, pharmacotherapy, and chronic conditions like hypertension and diabetes, which are pivotal in assessing VTE risk (9)(10)(11).Acknowledging these challenges, our research introduces an innovative machine learning-based predictive model tailored for acute ischemic stroke patients.This model transcends traditional assessment tools by integrating an expansive range of risk factors-from demographic specifics to comprehensive clinical and laboratory data-leveraging advanced algorithms to enhance interpretability.Our approach not only aims to achieve superior accuracy in VTE risk prediction but also seeks to shed light on the relative importance of various risk factors, thereby offering a more nuanced and personalized risk assessment for stroke survivors (12).Furthermore, the literature reveals a conspicuous scarcity of studies employing advanced machine learning techniques in conjunction with SHapley Additive exPlanations (SHAP) interpretability algorithms to forecast VTE subsequent to AIS.There exists a notable gap in the research landscape concerning the development of machine learning models that not only predict VTE with high precision but also adhere to the stringent criteria set forth by the Predictive Model Bias Risk Assessment Tool (PROBAST) (13).This study aims to bridge this gap through rigorous data collection and preprocessing efforts, alongside the application of sophisticated machine learning methodologies.By establishing a robust, interpretable framework, our research endeavors to empower clinicians with the tools necessary for identifying patients at elevated risk of VTE following a stroke, thereby enabling the formulation of precise interventions to diminish the incidence of this life-threatening complication.

Study Population
The dataset for this investigation was derived from the Shenzhen Neurological Disease System Platform, an extensive repository that has been methodically aggregating detailed data on ischemic stroke patients since 2021.This dataset encompasses a wide range of information, including sociodemographic characteristics, precise details regarding pre-hospital onset,

Predictor Variables and Outcomes
The primary objective of this investigation was to construct a predictive model that accurately forecasts the occurrence of VTE following after AIS using a comprehensive dataset representative of a broad spectrum of AIS patient profiles.Patients from the Shenzhen Neurological Disease System Platform, enrolled consecutively from December 2021 to December 2023, formed the basis of this analysis.The selection of potential predictive variables was confined to characteristics documented within the initial three days of hospital admission.These variables, meticulously captured via dedicated case report forms, encompassed a diverse array of factors: gender, ethnicity, age, height, weight, smoking, drinking, diabetes mellitus (DM), hyperlipidemia, atrial fibrillation, history of cerebral infarction, anemia, other comorbid conditions, electrocardiogram (ECG), prehospital medication, pre-morbid mRS, mRS after Admission, NIHSS at onset, NIHSS during hospitalization,glasgow coma scale (GCS), Trial of ORG 10172 in acute stroke treatment(TOAST) classification, weakness, dysarthria, dizziness, paresthesia, headache, convulsion, consciousness status, symptomatic treatment, endovascular treatment (EVT), thrombolytic therapy, lymphocyte count, high-sensitivity c-reactive protein (hsCRP), international normalized ratio (INR), fibrinogen, D-dimer, alanine aminotransferase, low-density lipoprotein cholesterol (LDLC), aspirin, clopidogrel, heparin, enoxaparin, low molecular weight heparin, unfractionated heparin, warfarin, rivaroxaban, sulfonylureas, glycosidase inhibitors, anti-infective treatment, traditional Medicine, lipid medication, anti-platelet therapy, anticoagulant therapy, anti-lipidemic drugs, anti-diabetic yreatment, chinese medicines, and intracranial artery stenosis.
All VTE diagnoses were conclusively determined via color Doppler ultrasound and pulmonary CT angiography (PCTA), serving as the diagnostic benchmarks.The diagnosis of VTE was grounded on a thorough assessment of clinical manifestations coupled with imaging outcomes, primarily focusing on the management of hospitalized patients.A stratified screening methodology was employed for stroke patients manifesting potential symptoms of VTE during their hospital stay, including but not limited to leg pain or swelling, localized warmth, dyspnea, or chest discomfort.This tailored screening approach facilitated the identification and subsequent evaluation of .individuals exhibiting clinical presentations suggestive of high-risk VTE, employing color Doppler ultrasound or PCTA for comprehensive examination when warranted.

Data processing and feature selection
In this study, we harnessed advanced machine learning techniques, notably the K-nearest neighbor (KNN) algorithm and the synthetic minority oversampling technique (SMOTE), to refine our dataset, thereby augmenting the predictive accuracy for VTE risk.This meticulous strategy significantly enhanced the model's reliability and its capability to generalize across diverse clinical scenarios pertaining to VTE risk assessment.The KNN algorithm was deployed to impute missing values, capitalizing on the proximity of each data point to its nearest "neighbors" within the feature space.This method proved invaluable in preserving the integrity and richness of our dataset, allowing for the retention of complex and multidimensional clinical data without resorting to invasive data collection methodologies.The integrity of the dataset was thus maintained, ensuring minimal loss of information and retaining the nuanced interplay of clinical variables.
Regarding the missing number, 30% of the variables were eliminated in this study and were not included in the data analysis(S1 Table ).SMOTE addressed the challenge of imbalanced classification inherent in our dataset.By generating synthetic samples for minority classes, it amplified the representation of these underrepresented groups within the dataset.This enhancement was pivotal in improving the model's sensitivity and accuracy in predicting VTE occurrences, a relatively rare but clinically significant event.The incorporation of SMOTE thus balanced the dataset, ensuring a more equitable representation of all classes and enhancing the model's predictive precision.Employing these sophisticated data processing and feature selection techniques, our study laid a robust foundation for the development of a highly accurate and generalizable VTE risk prediction model.This approach not only preserved the complexity and diversity of our clinical data but also addressed critical challenges related to data imbalances, setting a new benchmark in the predictive modeling of VTE risks.

Model development and performance evaluation
In the quest to devise an accurate predictive model for VTE risk following AIS, our study employed a two-pronged approach for feature selection: stepwise forward logistic regression and Least Absolute Shrinkage and Selection Operator (LASSO) analysis.The former method was instrumental in pinpointing variables significantly correlated with VTE risk, whereas LASSO analysis played a critical role in mitigating model overfitting and enhancing variable selection through the imposition of regularization penalties.The construction of the model was underpinned by the application of ten-fold cross-validation, a technique that bolsters both the performance and generalizability of the model.By segmenting the dataset into ten discrete parts, and iteratively training on nine while testing on the remaining one, this procedure guaranteed that each segment was utilized as a test set once, thereby enabling a more nuanced and accurate assessment of the model's efficacy.To encompass a broad spectrum of analytical perspectives and augment predictive accuracy, the study harnessed an array of machine learning algorithms, including Logistic Regression (LR), Naive Bayes (NB), Decision Trees (DT), Random Forest (RF), Gradient Boosting Machines (GBM), Extreme Gradient Boosting (XGB), and Support Vector Machines (SVM).The distinct attributes and suitability of each algorithm were leveraged to furnish a comprehensive analysis of the data from multifarious angles.Data preprocessing and model building are performed in the python3.9environment.

Statistical analysis
.
Our dataset was subjected to a rigorous descriptive statistical analysis, employing Chi-square tests for categorical variables, T-tests for normally distributed continuous variables, and rank-sum tests for variables deviating from normal distribution.This phase aimed to isolate independent predictors of distal DVT in patients experiencing acute stroke, with variables demonstrating P-values <0.05 advancing to the subsequent feature selection stage.This latter stage utilized both stepwise forward logistic regression and LASSO for variable selection, with the model's construction synthesizing insights from both methodologies.Visualization techniques, such as scatter plots for illustrating SMOTE sampling outcomes and ROC curve area under the curve (AUC) analysis, provided a graphical representation of cross-validation results and real-world application efficacy for each algorithm.Additionally, SHAP algorithm analysis was employed to enhance the interpretability of model features, particularly for the optimal model.Data analysis is performed in python3.9.

Ethics approval and consent to participate
This research received the endorsement of the Ethics Review Committee of Shenzhen Longhua District People's Hospital.In adherence to ethical standards, neurology nurse specialists collected patient data for the Shenzhen Neurological Disease System Platform (SNDSP).The inclusion of patients into the system was contingent upon their ability or willingness to provide written informed consent, either personally or through their representatives, thereby ensuring adherence to ethical research practices.

Characteristics of study population
The cohort comprised 1,632 subjects, among which the incidence of VTE was 4.17% (n = 68), .with a notable predominance of female patients.The median age of individuals diagnosed with VTE was 69.00 years, significantly older than their non-VTE counterparts, who had a median age of 58.00 years (p < 0.001).Detailed demographic and clinical characteristics of the study population are delineated in Table 1.Through ten-fold cross-validation, we meticulously evaluated the performance of seven distinct machine learning models.The results showcased the Gradient Boosting Machine (GBM) model leading with an AUC score of 0.974, followed by Random Forest (RF) at 0.925, Decision Trees (DT) at 0.883, Extreme Gradient Boosting (XGB) at 0.879, Naive Bayes (NB) at 0.858, Logistic .CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted April 13, 2024.; https://doi.org/10.1101/2024.04.11.24305689 doi: medRxiv preprint Regression (LR) at 0.854, and Support Vector Machines (SVM) at 0.853 (Fig 2A).Testing these models on an independent set validated their robustness, with five models exhibiting AUC values above 0.8, highlighting the GBM model's superior predictive accuracy with an AUC of 0.923 (Fig 2B).

Model Interpretation and SHAP Analysis
Ensuring the interpretability of predictive models, particularly in clinical settings, is crucial for their acceptance application by healthcare professionals.To address this, our study employs SHAP methodology, enabling a transparent evaluation of how each variable influences the model's predictions.This approach affords a dual-layered interpretation: global insights, which elucidate the model's overall decision-making process, and local insights, which provide individualized explanations.The global interpretative framework is visualized through SHAP summary plots (Fig 3A and B), where the mean SHAP values of each feature are calculated and ranked.This hierarchy underscores the relative importance of predictors such as D-dimer levels, Prehospital medication, and Age, among others, in determining VTE risk.SHAP dependency plots further dissect the relationship between specific features and the prediction outcome, offering a granular understanding of feature impact.

Prognostic implications
The exemplary performance of the GBM model culminated in its integration into a user-friendly web application, designed to predict VTE risk in AIS patients based on the model's key variables.This digital tool, accessible at https://youlijiang236.shinyapps.io/myapp/,empowers clinicians to leverage our predictive model in real-time, facilitating personalized patient care and informed risk management strategies (Fig 5).

Discussion
Our investigation has culminated in the development of an array of machine learning models adept at predicting VTE risk among patients suffering from AIS.A distinguishing feature of our models lies in their capacity to amalgamate an extensive array of predictive variables, spanning from elementary demographic information to intricate clinical observations and laboratory findings.This comprehensive approach substantially surpasses the precision and predictive power of conventional risk assessment methodologies (14,15).Crucially, our models incorporate a broad spectrum of demographic, clinical, and laboratory parameters, thereby offering a holistic risk evaluation framework.Unique to our research is the inclusion of variables previously overlooked in model studies, such as EVT, alongside other variables that reflect the routine diagnostic and therapeutic practices in stroke management.This ensures the practicality and applicability of our .model in real-world clinical settings.The employment of the gradient boosting machine (GBM) algorithm marks a significant enhancement over existing predictive approaches (16), not merely in terms of predictive accuracy but also in handling voluminous and readily available variables.A pivotal aspect of our model is its interpretability, facilitated by SHAP value analysis, which demystifies the impact of each predictor on VTE risk.This interpretability equips clinicians with a profound understanding of the prediction process, thereby empowering them to make well-informed and precise clinical decisions.Our model improves the accuracy of VTE risk prediction and enriches the transparency and understandability of prediction results.This dual advantage heralds a novel era in the management of VTE risk post-stroke, spotlighting the potential of machine learning in transforming clinical decision-making and patient care strategies.
A review of extant literature reveals a predominant reliance on various forms of logistic regression models for constructing VTE risk prediction frameworks.Notable examples include the DVT risk model developed by Pan et al. using data from patients with acute stroke, which reported a final AUC of 0.785 (17), and the post-AIS model by Bonkhoff et al., utilizing L1 regular logit, achieving an AUC of 0.730 (18).These studies underscore the conventional approach towards VTE risk prediction, often limited by the methodologies employed.In stark contrast, our study introduces a machine learning model that exhibits superior predictive performance, as evidenced by a notably higher AUC, particularly with the implementation of the GBM algorithm, which attained an AUC of 0.923.This significant improvement in predictive accuracy is further augmented by our innovative approach to feature selection, combining LASSO with stepwise logistic regression.This methodological synergy not only enhances variable selection accuracy and model stability but also considerably reduces the risk of overfitting, thereby bolstering the .model's generalizability-a critical advancement corroborated by multiple preceding studies(19).Furthermore, our meticulous adherence to PROBAST (Predictive Model Biomarker Assessment Tool) standards throughout the processes of data collection, preprocessing, and modeling underscores our commitment to bridging the existing gaps in the literature.Our objective was to construct a model of heightened accuracy capable of identifying patients at elevated risk of VTE more effectively.This ambition was rooted in the recognition of the limitations inherent in previous studies and driven by the potential of advanced machine learning techniques to revolutionize the predictive modeling landscape in VTE risk assessment following acute ischemic stroke.
Our study's integration of SHAP analysis has illuminated the critical roles of individual risk factors in the genesis of VTE.Among these, prominent factors such as heightened D-dimer levels, prior medication use, patient age, pre-morbid mRS, GCS, the presence of large vessel occlusion, and specific medical interventions emerged as significant predictors.These findings resonate with varying degrees of corroboration from prior research endeavors, underscoring their empirical validity(20).
Notably, our analysis underscored the pivotal biomarker role of D-dimer in forecasting VTE risk, establishing an early warning threshold at a D-dimer level of 0.72 ug/ml.The elevation of D-dimer, a fibrin degradation product indicative of thrombosis and fibrinolytic activity, has been validated in earlier studies as a crucial diagnostic marker for thrombotic events(21, 22).
Concurrently, the GCS score, which gauges a patient's consciousness level, has been linked to VTE risk, with lower scores suggesting compromised brain function and consequently an elevated VTE risk(23).Furthermore, our analysis corroborates the heightened VTE susceptibility among .older patients and those with diminished pre-stroke functional capacity, as indicated by their premorbid mRS scores(24, 25).Such susceptibility is attributed to diminished mobility and circulation, factors that significantly contribute to thrombosis risk in this patient cohort.
Prehospital medications, encompassing antidiabetic and antihypertensive drugs, antibiotics, and traditional Chinese medicine, imply pre-existing health conditions that, coupled with AIS-induced mobility constraints, predispose patients to thrombotic complications(26).The administration of antiplatelet and anticoagulant therapies during hospitalization not only serves as a preventive strategy against thrombotic events but also signals a higher baseline embolic risk.Moreover, the use of antilipidemic medications underscores an indirect VTE risk associated with cardiovascular diseases, highlighting the intricate interplay between cardiovascular health and VTE incidence(27).Furthermore, our findings indicate that stroke patients who consume alcohol are at an elevated risk of developing post-stroke VTE compared to those who do not drink.This heightened risk may be attributable to alcohol's indirect influence on the coagulation and fibrinolytic systems, disrupting the delicate balance necessary for normal blood clotting and dissolution(28).The nuanced effects of different treatments on VTE risk during the acute phase of AIS patients have been somewhat overlooked in previous research(29).Nonetheless, our analysis reveals that patients undergoing EVT may face a greater VTE risk compared to those receiving thrombolytic therapy or symptomatic care alone.This increased risk could stem from the prolonged immobilization and delayed return to daily activities post-EVT, as EVT, despite its efficacy in managing acute large vessel occlusion strokes, necessitates extended bed rest(30).This shows that post-operative prevention of stroke patients receiving EVT should pay more attention to the prevention of VTE

Conclusion
Our study advances the prediction of VTE risk in patients with acute ischemic stroke through the application of the GBM algorithm.This approach offers a refined risk assessment tool that can significantly enhance early detection and management of VTE in this vulnerable population.The in-hospital diagnostic findings, treatment records, and laboratory test outcomes from 20 affiliated hospitals.The study's inclusion criteria targeted individuals hospitalized due to stroke between December 2021 and December 2023, who were diagnosed using magnetic resonance imaging (MRI) or computed tomography angiography (CTA), aged 18 years or older, and admitted within seven days of symptom onset as per the International Classification of Diseases, Tenth Revision (ICD-10) criteria.Exclusions were made for patients with transient ischemic attacks, subarachnoid hemorrhage, brain tumors, cerebral venous thrombosis, those diagnosed with distal deep vein thrombosis (DVT) prior to admission, or with a history of pulmonary embolism.For the development and validation of our model, data collected from December 2021 to June 2023 and from July 2023 to December 2023 were utilized, respectively.The platform's data management and quality control measures were stringently overseen by specialized data managers and quality assurance staff to ensure the anonymization, quality, preservation, and exportation of data met high standards.The data for this research were accessed on January 24, 2024, ensuring all patient data, directly sourced from diagnostic and treatment documentation, were meticulously recorded by experienced neurology specialists for reliability and precision in the dataset used in our research.

.
The LASSO model, employing a cross-validation mechanism, fine-tuned the regularization strength (alpha) over a logarithmic scale from 10 -6 to 10 1 , facilitating precise feature selection.A specified random_state parameter ensured the reproducibility of the findings.The model's comprehensive analysis underscored the significance of variables such as Pre-morbid mRS, unexplained stroke, in-hospital medications, among others, affirming their relevance to the study's aims (Fig1 E and F).Concurrently, stepwise forward logistic regression was utilized to identify pertinent variables for univariate analysis, complementing the LASSO model's feature selection to guarantee a comprehensive set of predictors for model development.The outcomes of this meticulous variable screening process are presented in S2 Table .
Development and validation of predictive modelsOur study employed t-distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction, facilitating a detailed visualization of the distribution patterns between VTE and non-VTE cases within our training dataset (Fig 1).The initial dataset displayed a mixed distribution (Fig 1A), which became sparser following random undersampling (Fig 1B).Conversely, distributions post-oversampling and the application of SMOTE-NC illustrated a more dispersed pattern (Fig 1C and 1D), indicating the significant impact of sampling techniques on data representation.
Fig 4A and B, highlight the model's ability to integrate individual patient data to predict VTE risk accurately.For instance, one patient was identified with a 99.7% VTE risk, with significant factors being premorbid mRS and age.Conversely, another patient presented a low risk of 1.1%, with premorbid mRS contributing negatively to VTE risk, indicating how diverse variables can influence individual risk profiles differently.Such analyses enable tailored patient care and informed risk management. .Furthermore, Fig 4C reveals a nonlinear association between D-dimer levels and VTE risk, pinpointing a threshold beyond which VTE risk escalates significantly.This insight is critical for identifying patients who might benefit from closer monitoring or preventive interventions.The SHAP dependency graph (Fig 4D) elaborates on the effect of individual variables across the patient cohort, providing a comprehensive overview of the model's predictive dynamics.

.
and strengthen the accessibility of VTE monitoring and examination after EVT.By amalgamating advanced machine learning techniques with SHAP value analysis, we have developed a server-based web calculator.This innovative tool is poised to revolutionize VTE risk screening post-AIS by enabling the input of essential patient characteristics into the model.It equips healthcare professionals with the capability to precisely identify patients at elevated risk of VTE, thereby facilitating the formulation of targeted prevention and intervention strategies grounded in a meticulous evaluation of risk factors.For instance, the early identification of patients exhibiting high D-dimer levels and diminished GCS allows for the swift adoption of more aggressive preventive approaches, such as anticoagulant therapy, alongside enhanced physical assessments and nursing care, especially for patients receiving EVT.Furthermore, the interpretability afforded by the model's SHAP values grants physicians deeper insights into the influence of each predictor on VTE risk.This enhanced understanding enables a more nuanced consideration of various treatment modalities' benefits and drawbacks, guiding clinical decision-making towards optimized antithrombotic treatment strategies and personalized patient care plans.
incorporation of SHAP values for interpretability strengthens the model's applicability in clinical decision-making, allowing for the development of tailored treatment plans.The potential integration of this model into clinical decision support systems represents a promising direction .for future research, aiming to improve clinical efficiency and patient outcomes in a practical healthcare setting.cross-validation AUC scores for various models, (B) ROC Curves for different models on the test set.

Figure 4 .
Figure 4. SHAP Value Analysis for Model Prediction.(A)Average impact on model output.

which was not certified by peer review)
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.(

which was not certified by peer review)
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.(

Table 1 .Demographic and Baseline Characteristics by VTE Status.
test set) and of Han ethnicity (97.20% in the training set, 96.33% in the test set), with other demographic and clinical attributes showing no significant differences between the two groups, .thusensuring a balanced representation of stroke-related outcomes (Table2).

Table 2 . Distribution of Demographic and Clinical Variables in Training and Test Sets.
International licenseIt is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.(

Correlation of variables with clinical outcome
CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.( 0.001), NIHSS at onset (p < 0.001), NIHSS after admission (p < 0.001), GCS (p < 0.001), TOAST classification, weakness, consciousness status, EVT, D-dimer, and LDLC, among other variables..

which was not certified by peer review)
These detailed comparison results are provided in Table1.