Abstract
Background With the improved life expectancy in people living with HIV, predicting individual frailty is important for clinical care. DNA methylation (DNAm) has been previously linked to aging and mortality in non-HIV populations. We aim to establish a panel of DNAm biomarkers to predict frailty in HIV-positive population.
Methods Our samples consist of HIV-positive participants (Ntotal=1,081) from the Veterans Aging Cohort Study (VACS) and were divided into the training set (Ntraining=460), the validation set (Nvalidation=114), and the testing set (Ntesting=507). We used a well-established score, VACS Index, as a measure of frailty. Model training and fine-tuning were conducted using the ensemble method that aggregated prediction results of four base models. The final number of features were determined in the validation set and the model performance was assessed in the testing set. We conducted a survival analysis to assess whether the predicted frailty was associated with 10-year mortality and gene ontology enrichment analysis of the predictive CpG sites was performed.
Results We selected a panel of 393 CpGs to build the ensemble machine learning model. The prediction model showed an excellent performance on predicting frailty with Area Under Curve of 0.809 (95%CI: 0.767-0.851) and balanced accuracy of 0.653 in the testing set. The predicted frailty based on our model was significantly associated with 10-year mortality (hazard ratio=1.79, p=4E-05). These 393 CpGs were located within or near 280 genes that were enriched in biological pathways including immune and inflammation response.
Conclusions We identified a panel of predictive DNAm features associated with frailty and mortality among HIV-positive population. These DNAm features may serve as biomarkers to predict frailty people who live with HIV.
Introduction
Combination antiretroviral therapy has significantly improved life expectancy among HIV-positive individuals. The prolonged life expectancy increases frailty risk in the HIV population. The prevalence of frailty in people living with HIV is significantly higher and onset of frailty occurs at an earlier age compared to the general population (1). Frailty is characterized by marked vulnerability and is associated with increased mortality. Thus, prediction of frailty is important to identify vulnerable group and deliver clinical care to highly vulnerable HIV-positive patients. At present, frailty is usually defined by clinical symptoms (e.g., Fried’s frailty phenotype, Rockwood and Mitnitski’s Frailty Index) or a combination of lab tests that indicate organ damage such as the Veterans Aging Cohort Study index (VACS index) (2). No biomarkers available to capture the early stage of the frailty to predict individual vulnerability in HIV-positive patients.
A large body of evidence has demonstrated that epigenetic modification adapts internal and external environmental changes and that is associated with early stage of pathophysiological processes (3-6). DNAm, one type of the most widely studied epigenetic marks, is strongly related with aging (7-9), substance use (e.g. cigarette smoking and alcohol consumption) (10-16), and a variety of diseases (3-6, 17, 18). Because DNAm is relatively stable and easy to detect from body fluids biospecimen through non-invasive procedure, DNAm marks have emerged as robust biomarkers for cancer diagnosis (19), disease subtype classification (20, 21) and treatment response monitoring (22, 23).
DNAm may play an important role in frailty among HIV-positive individuals. Frailty is associated with elevated inflammation markers such as IL-6 and hsCRP in HIV-positive individuals (24). Genes involved in immune and inflammation processes are also subject to epigenetic modification in immune cells. We previously reported association of two CpG sites in the promoter region of NLRC5 with HIV infection (25), and NLRC5 is a major transcriptional activator of MHC class I gene. DNAm was also linked to HIV comorbid medical diseases such as HIV-positive diabetes and kidney function (26, 27). Furthermore, aging, a critical contributor to frailty, is significantly associated with thousands of CpGs in the epigenome, and the epigenetic clock and DNAm age are becoming widely recognized (7-9). DNAm age is significantly correlated with physical frailty in HIV-negative individuals (28) and HIV-positive individuals (29). In HIV-positive individuals, the average DNAm age is accelerated 5 to 10 years faster than HIV-negative individuals (30-33). DNAm accelerated 10 years faster in frailty individuals in comparing to non-frailty individuals in HIV-positive population (29). These age-related DNAm signatures are predictive of mortality (34-39). Therefore, we hypothesize that DNAm is associated with frailty and that DNAm signatures in blood can serve as biomarkers to predict frailty and mortality among HIV-positive individuals.
Machine learning methods have been applied to select informative DNAm features for clinical diagnosis and classification of complex diseases(15, 40, 41). Ensemble machine learning method can aggregate multiple machine learning models (base model) usually provide better prediction outcomes than single base model (42, 43). Ensemble method has been shown to perform well in personalized medicine and disease outcomes such as cancer and diabetes (44, 45).
In this study, by applying an ensemble learning approach, we aimed to identify DNAm features that can serve as biomarkers on frailty and to link frailty-associated DNAm to mortality among HIV-positive individuals. Here, VACS index, a well-established frailty score, was used as a measure of frailty in the HIV-positive population(46-48). Findings from this study provide a set of DNAm biomarkers to predict frailty for potential future clinic use and new insights into the epigenetic mechanism of frailty and mortality in HIV pathology.
Methods
As an overview, our analytical approach is shown in the flowchart in Figure 1. Briefly, we first applied an ensemble learning to improve machine learning results on predicting frailty, then examined the association of the selected CpG features with mortality by a survival analysis. Then, we conducted a gene-ontology enrichment analysis of the predictive features. Lastly, we tested a meta-analysis of epigenome-wide association (EWA) on frailty on the entire sample.
Study population
All samples in the sample set 1 and sample set 2 were from the VACS. The VACS is a prospective cohort study of veterans focusing on clinical outcomes in HIV infection (46). DNA samples were extracted from peripheral blood of 1,081 HIV-positive men from the VACS. Samples in sample set 1 was randomly partitioned into training set (80%, N=460) and validation set (20%, N=114), and samples in sample set 2 was used as independent testing set (N=507). Table 1 presents demographic and clinical information on age, race, smoking status, CD4 count, viral load, HIV medication adherence, VACS index, and mortality in the training set, the validation set, and the testing set. Comparing to the testing set, the training and validation sets included slightly older individuals, more African Americans, and frailty individuals. There were no significant differences in sex, smoking, HIV medication adherence, CD4 count, log10 HIV-1 viral load or 10-year mortality across three sample sets.
Frailty measurement
VACS index is a well-established measure for frailty among HIV population (46-48). The VACS index is a composite score constructed by a sum of pre-assigned points on age, CD4 count, HIV-1 RNA, hemoglobin, platelets, aspartate and alanine transaminase (AST and ALT), creatine, estimated glomerular filtration rate (eGFR), and viral hepatitis C infection (48). Frailty was defined as VACS index > 40 since the predicted 3-year mortality of this group was 10% based on the previous report, (49). Prediction models were developed by machine learning methods to predict frailty (VACS index > 40) and non-frailty group (VACS index ≤ 40).
Genome-wide DNAm profiling and quality control
DNA samples in the sample set 1 were profiled by Infinium Human Methylation 450K BeadChip (HM450K), and DNA samples in sample set 2 were profiled by the Infinium Human Methylation EPIC BeadChip. DNAm for the training and validation sets were evaluated using the same quality control (QC) protocol (50) in the R package minfi (51). In detail, CpG sites on sex chromosomes and within 10 base pairs of a single nucleotide polymorphism were removed. The detection p-value threshold was set at 10-12 for both the training and validation sets. After QC, 408,583 common CpG sites between HM450K and EPIC arrays were used for analysis to ensure the same coverage between two sets. Proportions of 6 cell types (CD4+ T cells, CD8+ T cells, Natural Killer T cells, B cells, monocytes and granulocytes) were estimated using the established method(52) for all samples in the sample set 1 and sample set 2.
Feature selection of CpG sites in the training set
We first pre-selected a panel of CpG sites associated with frailty based on EWA on frailty in the training set. CpG sites with p<0.001 were preselected for building predictive models. The marginal significance cutoff of p=0.001 was arbitrarily set to ensure sufficient number of CpG sites for building a model. The variable importance of each CpG in the preselected CpG sites was ranked by using Elastic-Net Regularized Generalized Linear Models (GLMNET) variable importance ranking (a score between 0-100) in caret based on 100 bootstraps (each bootstrap includes 70% of all samples). In GLMNET, importance of zero means the feature is neglected in model construction. CpG sites with zero variable importance for 80% of the bootstraps were removed from model development. The remaining CpG sites were ranked based on the median importance ranking among 100 bootstraps and were divided as different groups. The CpG sites were grouped based on importance rank of CpG sites. Each CpG group was used to build ensemble machine learning models.
Developing machine learning prediction models for frailty
Model development in the training set: We developed an ensemble classification model that aggregated the prediction models from four base machine learning models: random forest (RF), GLMNET, Support Vector Machines (SVM) and k-Nearest Neighbors (KNN). These four different methods are selected to enlarge the diversity of algorithms (53-56). Ten-fold cross-validation was used in the model training process to minimize data overfitting. The four Models based on the number of CpG groups were separately trained to predict frailty in the training set and then aggregated with greedy ensemble method by using R package caretEnsemble (ver. 2.0.1) (57). The prediction performance of each ensemble model was evaluated by using area under receiver operating characteristic curve (auROC) and area under precision-recall curve (auPRC).
Select final CpG group in the validation set: The CpG group with both highest auROC and auPRC in the validation set was selected as the final feature group for the ensemble model.
Independent evaluation in testing set: using ensemble model and final feature group, we predicted the frailty group in the testing set and evaluated by using auROC and balanced accuracy. Balanced accuracy is defined as the average accuracy obtained on each class as shown in the following formula (58). Balanced accuracy was used in this study to avoid biased accuracy due to imbalanced samples (58).
Survival analysis
By using the final ensemble model we predicted each individual with frailty or non-frailty in the entire samples. Kaplan-Meier survival curves presented 10-year follow-up to visualize the survival differences by participants with predicted frailty or non-frailty. Survival analysis was conducted by cox proportional hazards model on 10-year mortality comparing frailty and non-frailty groups. We used age as time scale t, and our model was adjusted for race, smoking, self-reported HIV medication adherence, log 10 of HIV viral load and CD4 count.
Biological interpretation of the predictive panel of CpG sites on frailty
We performed gene ontology (GO) enrichment analysis using missMethyl to adjust a bias from differing numbers of CpG site per gene(59, 60). Genes contained at least one predictive CpG site were used for GO analysis.
Epigenome-wide association analysis on frailty in all samples
Since DNA methylation of two sample sets was measured by two different platforms, we performed EWAs on frailty in sample set 1 and sample set 2 separately and then conducted a meta-analysis to detect epigenome-wide signals in the entire samples. In each EWA, we used a two-step linear model approach as previously described (50). The EWA model adjusted for confounding factors including age, race, smoking, cell type proportions and control principle components. Meta-analysis of sample set 1 and sample set 2 EWA was conducted using METAL(61). The weights of effect size estimates were the inverse of the corresponding standard errors for the meta-analysis(61). CpG sites with Bonferroni corrected p<0.05 were considered statistically significant.
Results
Feature selection and ensemble model training for frailty
A panel of 856 CpGs with p<0.001 was pre-selected based on EWA in the training set. We ranked these candidate predictors by median GLMNET importance ranking among 100 bootstraps using 70% of the training sample. We excluded 178 CpG sites that had variable importance of zero among 80% of the bootstraps. A final panel of 678 CpG sites were selected and were formed into 20 groups based on importance ranking to determine the best performing CpG group for the ensemble model (Figure 2).
In the training set, we developed our base models independently for the 20 groups of CpGs using four commonly used machine learning classification models (53-56): RF, GLMNET, SVM, and KNN. The performances of GLMNET, RF and SVM were mostly comparable in terms of auROC and auPRC plateaued to 1 with increasing number of CpGs in the training set (Figure 3, Figure 4). The performance of KNN was not as good as the other three methods, but its auROC and auPRC remained above 0.9 in the training set. Thus, an ensemble model combined the prediction results of all four base models was used to address imbalance performances from 4 models (Figure 3, Figure 4).
In the validation set, the performances of four base models varied. Three models of GLMNET, RF, and SVM showed good performance with both auROC and auPRC greater than 0.8, but the performance of KNN was poor with auROC or auPRC less than 0.8 (Figure 3, Figure 4). The ensemble model showed the best performance at 393 CpG sites with excellent performance of auROC (0.829) and auPRC (0.830). The accuracy of this prediction model was 0.807, and the balanced accuracy that took into account of class imbalance between frailty and non-frailty participants was 0.782. Thus, the ensemble model with 393 features was determined as the final prediction model (Table S1).
In the testing set, the ensemble model with 393 features showed excellent performance with auROC of 0.809 (95%CI: 0.767-0.851), prediction accuracy of 0.761 and balanced accuracy of 0.653 (Figure 5), suggesting that an ensemble model of a panel of 393 features enable to predict frailty.
Survival analysis
By using the ensemble model to predict participants with frailty and non-frailty groups, we found that 10-year Kaplan-Meier curves differed between frailty and non-frailty participants (Figure 6). The Kaplan-Meier curves showed that participants with frailty showed significantly lower 10-year survival probability compared with non-frailty participants. Using Cox proportional-hazards model to control for sex, baseline age, race, viral load, CD4 count and antiviral medication adherence, this difference remained significant with a hazard ratio (HR) of 1.79 (95%CI: 1.35-2.37, p=4E-05).
Biological interpretations of predictive CpGs by gene ontology enrichment analysis
The 393 predictive CpGs were located in or near 280 genes. The top 8 enriched pathways based these 280 genes included response to virus (p= 4.26E-05), defense response (p=1.29E-04), cytokine receptor binding (p=1.48E-04) and regulation of response to interferon-gamma (p= 4.15E-04) (Figure 7, Table 2). Our findings suggested that the selected 393 CpG sites are biologically relevant to HIV pathogenesis and progression.
Epigenome-wide association (EWA) on frailty
Meta-analysis on EWA of sample set 1 and sample set 2 identified 208 epigenome-wide significant CpGs after Bonferroni correction. These significant CpG sites were located in the 112 genes and included gene previously reported association with HIV pathogenesis. For example, cg07839457 in NLRC5 had been previously reported to be associated with HIV infection(25). Interestingly, 30 of 208 CpG sites were also in the panel of machine-learning predictive CpGs (Table 3). Some of the overlapped CpGs were located in viral response genes such as IFITM1 and PARP9.
Discussion
In this study, we present evidences that DNAm marks in blood are predictive of frailty and are associated with mortality in an HIV-positive population. We identified a panel of 393 CpG sites that were highly predictive for participant with frailty. We also found that ensemble predicted frailty was strongly associated with mortality in the HIV-positive population in our cohort. In addition, these machine learning-selected 393 DNAm features lied in genes involved in HIV pathogenesis and progression. Thus, we identified a panel of 393 DNAm biomarkers that add knowledge to the epigenetic mechanisms underlying frailty and mortality among people living with HIV.
We demonstrated that machine learning approach is robust to identify biologically relevant DNAm marks to predict frailty in HIV-positive individuals. One of the challenges for machine learning method is overfitting model that leads to a bias of prediction. We took the several steps to address potential overfitting in developing the ensemble prediction models: 1) model development and final model selection were conducted separately in training set and validation set; 2) 10-fold cross-validation was performed during the training; 3) the use of ensemble model that aggregate prediction results from multiple base models instead of arbitrary choice of a specific machine learning prediction model; 4) the final model was evaluated in an independent test set. We further observed that performance of different model generated by individual machine learning method varied. Ensemble-based modeling method appears outperform than individual modeling method and can adjust the imbalanced performance among individual models. Our results suggested that ensemble-selected panel of CpG sites was relatively stable and robust.
Our results show that the frailty predicted by the selected CpG biomarker is strongly associated with mortality in a HIV-positive population. The finding is consistent with previous literature showing that DNAm marks in blood predict mortality in a HIV-negative population (39). Machine learning enables to select informative CpGs under epigenome-wide significant threshold that have predictive values to a phenotype. This panel of 393 CpG sites are biologically meaningful and may shed light on our understanding of the biological mechanisms of frailty for HIV-positive individuals. We found that 30 out of 393 CpGs reached epigenome-wide significance. The majority of the 393 CpG sites are located within or near genes that are involved in known HIV pathology and progression. For example, cg22930808 on PARP9 and cg07107453 on IFI44L were selected by machine learning and also reached epigenome-wide significance, which both genes are involved in HIV pathogenesis. cg12359279 is located in the MX1 gene (Interferon-Induced GTP-Binding Protein Mx1). MX1 encodes a GTP-metabolizing protein induced by interferon I and II and is involved in interferon gamma signaling and Toll-like signaling pathway. The biological relevance of these 393 CpG sites was further supported by the gene ontology enrichment analysis. The top enriched pathways such as response to virus and cytokine receptor binding may point out important biological pathways that leads to frailty and increased risk of mortality among people living with HIV.
We acknowledge several limitations in this study. The generalizability of the selected CpG sites may be limited because our samples were predominantly middle-aged men, which may not generalize to frailty in a different age group. All samples in our study are HIV-positive and the identified CpGs have a limited application to HIV-negative population as we discussed above. Future studies in diverse populations is warranted to validate the selected methylation features.
Conclusions
We identified a panel of 393 predictive DNAm features in blood that was predictive of frailty and mortality among people living with HIV. These DNAm features may serve as biomarkers to identify frailty individuals and may help to prioritize genes to better understand the mechanisms of frailty in HIV-positive population. These DNAm features have potential for monitor HIV progression in future clinical care.
Data Availability
Data has been deposited in GEO
Figure 8: Epigenome-wide association (EWA) on frailty: quantile-quantile plot, Manhattan plot and overlapping CpG sites between EWA significant CpGs and predictive CpGs.
Table S1: Predictive CpG sites by variance importance
Table S2: Epigenome-wide significant CpG sites in meta-analysis of EWAS on frailty
- List of abbreviations
- AUC
- :Area Under Curve
- CI
- :Confidence interval
- DMR
- :differentially methylated region
- DNA
- :Deoxyribonucleic acid
- DNAm
- :DNA methylation
- DAVID
- :Database for Annotation, Visualization and Integrated Discovery
- EWA
- :epigenome-wide association
- FDR
- :False discovery rate
- FWER
- :Family-wise error rate
- GLMNET
- :Lasso and Elastic-Net Regularized Generalized Linear Models
- GO
- :Gene ontology
- HIV
- :Human immunodeficiency virus
- HM450K
- :Human Methylation 450K BeadChip
- NK
- :Natural killer
- PC
- :Principal component
- QC
- :Quality control
- SVM
- :Support Vector Machines
- VACS
- :Veterans Aging Cohort Study
- VACS index
- :Veterans Aging Cohort Study index
- XGBoost
- :Extreme Gradient Boosting Tree
Declarations
Ethics approval and consent to participate
The study was approved by the committee of the Human Research Subject Protection at Yale University and the Institutional Research Board Committee of the Connecticut Veteran Healthcare System. All subjects provided written consents.
Availability of data and materials
Demographic and clinical variables and DNAm data for the VACS samples were submitted to GEO dataset (GSE117861) and are available to the public. All codes for analysis are also available upon a request to the corresponding author.
Competing interests
The authors declare that they have no competing interests.
Funding
The project was supported by the National Institute on Drug Abuse (R03DA039745, R01DA038632, R01DA047063, R01DA047820).
Authors’ contributions
CS was responsible for data analysis and manuscript preparation. ACJ provided DNA samples, clinical data, and contributed to manuscript preparation. XZ was responsible for the bioinformatics data processing. VM involved clinical data collection and manuscript preparation. DH and EJ contributed to analytical approach and the manuscript preparation. KX was responsible for the study design, study protocol, sample preparation, data analysis, interpretation of findings, and manuscript preparation.
Acknowledgements
The authors appreciate the support of the Veteran Aging Study Cohort Biomarker Core and Yale Center of Genomic Analysis.