Abstract
Cardiovascular diseases (CVD), primarily coronary heart disease and stroke, rank amongst the leading causes of long-term disability and mortality. Providing accurate disease risk predictions and identifying genes associated with CVD are crucial for prevention, early intervention, and the development of novel medications.
The recent availability of UK Biobank Proteomics data enables the investigation of the blood proteome and its association with a wide variety of diseases. We employed the Explainable Boosting Machine (EBM), an interpretable machine learning model, for CVD risk prediction. The EBM trained on proteomics data outperforms traditional clinical models, achieving an AUROC of 0.767 and an AUPRC of 0.2405. Adding clinical features further improves the AUROC to 0.785 and the AUPRC to 0.2835. Our models demonstrate consistent performance across sexes and ethnicities.
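For readers less familiar with the two reported metrics, the following is a minimal sketch of how AUROC and AUPRC (computed here as average precision) can be derived from predicted risk scores. The labels and scores are a toy example invented for illustration, not study data, and the function names are hypothetical. The sketch ignores tied scores when ranking for average precision.

```python
def auroc(y_true, scores):
    """Probability that a random case is scored above a random non-case (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values taken at the rank of each true case."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy example (invented values): 3 cases among 8 participants.
y = [0, 0, 1, 0, 1, 0, 0, 1]
s = [0.10, 0.40, 0.35, 0.80, 0.70, 0.20, 0.05, 0.90]

print(f"AUROC = {auroc(y, s):.3f}")
print(f"AUPRC = {average_precision(y, s):.3f}")
print(f"Prevalence = {sum(y) / len(y):.3f}")
```

A classifier that ranks at random has an expected AUROC of 0.5, whereas its expected AUPRC is roughly the outcome prevalence, so AUPRC values are best read against the event rate in the cohort.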
While most prior studies using proteomics data for disease prediction have focused on maximizing accuracy at the population level, our model additionally provides individualized disease risk predictions and in-depth biological insights into biomarkers. Our analysis also uncovers nonlinear relationships between feature values and risk. We further corroborate our findings using statistical approaches and evidence from the literature.
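The individualized and nonlinear aspects follow from the additive structure of an EBM, in which the predicted log-odds is an intercept plus a sum of learned per-feature shape functions. The sketch below illustrates that idea only: the feature names, bin edges, and contribution values are hypothetical, pairwise interaction terms are omitted, and it is not the authors' fitted model or the API of any EBM library.

```python
import math
from bisect import bisect_right

# Hypothetical shape functions: feature -> (bin edges, log-odds contribution per bin).
# A non-monotonic or step-like set of values is what a "nonlinear risk" looks like here.
SHAPES = {
    "protein_a": ([0.0, 1.0, 2.0], [-0.4, -0.1, 0.3, 0.9]),
    "age": ([50.0, 60.0], [-0.5, 0.0, 0.6]),
}
INTERCEPT = -2.0  # hypothetical baseline log-odds


def contributions(x):
    """Look up each feature's additive log-odds contribution for one individual."""
    out = {}
    for name, (edges, values) in SHAPES.items():
        out[name] = values[bisect_right(edges, x[name])]
    return out


def predict_risk(x):
    """Sum the intercept and per-feature contributions, then apply the logistic link."""
    logit = INTERCEPT + sum(contributions(x).values())
    return 1.0 / (1.0 + math.exp(-logit))


person = {"protein_a": 2.5, "age": 55.0}  # invented individual
print(contributions(person))
print(round(predict_risk(person), 4))
```

Because every term is additive, the per-feature contributions printed for one person are an exact decomposition of that person's predicted log-odds rather than a post hoc approximation.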
In conclusion, we present a highly accurate and explainable framework for proteomics data analysis that offers in-depth molecular and clinical insights. Our findings support future approaches that prioritize individualized disease risk prediction and the identification of target genes for drug development.
Competing Interest Statement
H.C.-G., W.G., M.T., S.H., G.T., N.K. and J.M.M.H. are employees of Novo Nordisk Research Centre Oxford. M.O., U.C., R.H., S.M., P.P.D.V., L.D., R.A. and C.L. are employees of Microsoft Corporation. E.V. is an employee of Novo Nordisk A/S.
Funding Statement
This research was funded by Novo Nordisk A/S and Microsoft Corporation.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This research has been conducted using the UK Biobank Resource under Application Numbers 53639 and 65851.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
No new data were generated in the present study.