ABSTRACT
Objective: To develop a pragmatic model to predict total knee replacement (TKR) in knee osteoarthritis (OA) using non-imaging clinical, genetic, and lifestyle data with machine-learning (ML)–guided feature selection.
Methods: We analyzed 3,790 Osteoarthritis Initiative (OAI) participants. Nested ML feature selection on the training set identified 15 informative variables. Classifiers were benchmarked on the test set, and a multivariable logistic regression was then fit on the full cohort. Performance was summarized by discrimination (AUC with 95% CI) and calibration (Brier score). To assess the incremental value of genetics, we refit an otherwise identical Clinical model excluding the polygenic risk score (PRS) and compared specificity at fixed sensitivities using Bonferroni-adjusted McNemar tests. A pre-specified analysis examined performance by baseline Kellgren–Lawrence (KL) grade (KL 0–1 vs KL ≥2).
Results: On the test set, classifier AUCs ranged from 0.716 to 0.748, with Elastic Net and XGBoost performing best. The final logistic model fit on the full cohort achieved an AUC of 0.765 (95% CI 0.736–0.793) with acceptable calibration (Brier score 0.097). Performance held up across disease stages, with higher discrimination in pre-radiographic knees (KL 0–1: AUC 0.827) and moderate discrimination in KL ≥2 knees (AUC 0.720); decile plots showed broadly aligned observed and predicted risks. The PRS added modest, statistically significant gains in specificity at several fixed sensitivities without materially changing AUC.
Conclusions: We present a pragmatic, non-imaging, ML-informed model that predicts TKR with clinically acceptable discrimination and calibration using routinely collected data. This framework provides a practical basis for individualized risk stratification and decision support without reliance on imaging.
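The two performance summaries named in the abstract, discrimination (AUC with a 95% CI) and calibration (Brier score), can be illustrated with a minimal sketch. The abstract does not say how the confidence interval was obtained, so the percentile bootstrap used here is an assumption, and the function names and toy data are hypothetical rather than taken from the study.

```python
import random

def auc(y_true, y_prob):
    """AUC as the share of (event, non-event) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(y_true, y_prob):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def bootstrap_auc_ci(y_true, y_prob, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (an assumed method, not the authors')."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes for the AUC to exist
            stats.append(auc(ys, [y_prob[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical toy data: 1 = TKR during follow-up, 0 = no TKR.
y = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
p = [0.1, 0.3, 0.8, 0.6, 0.2, 0.4, 0.5, 0.1, 0.9, 0.3]

print(round(auc(y, p), 3), round(brier(y, p), 3), bootstrap_auc_ci(y, p))
```

A lower Brier score indicates predicted risks that sit closer to the observed outcomes, while the AUC only reflects ranking, which is why the abstract reports both.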
What is already known on this topic: Risk of total knee replacement (TKR) in knee osteoarthritis (OA) is multifactorial, and many existing models depend on imaging markers such as Kellgren–Lawrence grade or MRI findings. Established non-imaging predictors include symptoms and function (WOMAC), age, BMI, knee alignment, and prior injury. Genetic scores have been explored in OA but, to date, have shown limited standalone utility compared with routine clinical factors.
What this study adds: This study presents a clinic-friendly, non-imaging prediction model guided by a transparent ML pipeline (nested random-forest feature selection with in-fold preprocessing and SMOTE, repeated cross-validation, and SHAP-based interpretation) that achieves acceptable discrimination and calibration in the OAI cohort. It reinforces the relevance of routine clinical factors, identifies an inverse association between Mediterranean-diet adherence and TKR risk, and evaluates the incremental, though limited, contribution of genetic risk via a polygenic risk score (PRS), with a signal that persists in pre-radiographic knees despite few events.
How this study might affect research, practice or policy: The model offers a practical pathway for risk stratification where imaging is unavailable or costly, supporting shared decision-making and prioritization of follow-up. It encourages precision-medicine workflows that integrate clinical and genetic information cautiously and transparently, and it sets clear directions for future work: external validation across settings, assessment in early-stage OA populations, and refinement of genetic predictors before any policy or guideline incorporation.
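The comparison of a model with the PRS against the otherwise identical Clinical model, described in the abstract as Bonferroni-adjusted McNemar tests on specificity at fixed sensitivities, might look roughly like the sketch below. Several details are assumptions not stated in the source: the exact binomial form of McNemar's test, setting each model's threshold separately to reach the target sensitivity, and all function names and toy values.

```python
from math import ceil, comb

def threshold_for_sensitivity(y_true, y_prob, target):
    """Highest cut-off at which at least `target` of the events score >= cut-off."""
    pos = sorted((p for y, p in zip(y_true, y_prob) if y == 1), reverse=True)
    k = max(1, ceil(target * len(pos) - 1e-9))  # number of events that must be flagged
    return pos[k - 1]

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def compare_specificity(y_true, prob_a, prob_b, sensitivities):
    """For each target sensitivity, test whether models A and B classify
    non-events differently; p-values are Bonferroni-adjusted."""
    m = len(sensitivities)
    results = {}
    for s in sensitivities:
        ta = threshold_for_sensitivity(y_true, prob_a, s)
        tb = threshold_for_sensitivity(y_true, prob_b, s)
        b = c = 0
        for y, pa, pb in zip(y_true, prob_a, prob_b):
            if y == 0:  # specificity only concerns non-events
                a_ok, b_ok = pa < ta, pb < tb
                b += a_ok and not b_ok  # A correct, B false positive
                c += b_ok and not a_ok  # B correct, A false positive
        results[s] = (b, c, min(1.0, mcnemar_exact(b, c) * m))
    return results

# Hypothetical toy data: A = model with PRS, B = Clinical model without PRS.
y = [1, 1, 0, 0, 0, 0]
prob_a = [0.9, 0.7, 0.1, 0.2, 0.3, 0.6]
prob_b = [0.9, 0.7, 0.8, 0.75, 0.3, 0.6]

print(compare_specificity(y, prob_a, prob_b, [1.0]))
```

Because McNemar's test uses only the discordant non-events, two models can differ significantly in specificity at a given sensitivity even when their overall AUCs are nearly the same, which is consistent with the pattern the abstract reports for the PRS.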
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study has been funded by Instituto de Salud Carlos III (ISCIII) through the projects RD21/0002/0009, RD24/0007/0026, PMP22/00101, PMPTA22/00115, PI17/00210, PI22/01165, PI22/01155 and PI23/00913 and co-funded by the European Union. This work was also funded by grants IN607A 2021/07 and IN607D 2021/13 from Axencia Galega de Innovacion-Xunta de Galicia. IRP is supported by Contrato Miguel Servet-II Fondo de Investigacion Sanitaria (CPII17/00026) SERGAS-stabilized. JVG is supported by grant IN606A 2022/048 from Xunta de Galicia, Spain.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
We used data from the Osteoarthritis Initiative (OAI), a well-characterized prospective cohort of knee OA patients with publicly available data and biospecimens (https://nda.nih.gov/oai).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
Data are available upon reasonable request.
Abbreviations
- AUC: Area Under the Curve
- aMED: Alternate Mediterranean Diet Score
- BMI: Body Mass Index
- CI: Confidence Interval
- dbGAP: Database of Genotypes and Phenotypes
- GBM: Gradient Boosting Machine
- GeCKO: Genetic Components of Knee Osteoarthritis
- glmnet: Generalized Linear Model Net
- GWAS: Genome-Wide Association Study
- KL: Kellgren & Lawrence
- MAF: Minor Allele Frequency
- MD: Mediterranean Diet
- ML: Machine Learning
- MRI: Magnetic Resonance Imaging
- mtDNA: Mitochondrial Deoxyribonucleic Acid
- NSAIDs: Non-Steroidal Anti-Inflammatory Drugs
- OA: Osteoarthritis
- OAI: Osteoarthritis Initiative
- OR: Odds Ratio
- PCA: Principal Component Analysis
- PRS: Polygenic Risk Score
- QC: Quality Control
- RF: Random Forest
- ROC: Receiver Operating Characteristic
- SHAP: SHapley Additive exPlanations
- SNP: Single Nucleotide Polymorphism
- SMOTE: Synthetic Minority Oversampling Technique
- SVM: Support Vector Machine
- TKR: Total Knee Replacement
- VCF: Variant Call Format
- WOMAC: Western Ontario and McMaster Universities Osteoarthritis Index
- XGBoost: Extreme Gradient Boosting





