ABSTRACT
Prostate cancer (PCa) poses a significant global health challenge, particularly due to its progression into aggressive forms like neuroendocrine prostate cancer (NEPC). This study developed and validated a stemness-associated gene signature using advanced machine learning techniques, including Random Forest and Lasso regression, applied to large-scale transcriptomic datasets. The resulting 7-gene signature (KMT5C, MEN1, TYMS, IRF5, DNMT3B, CDC25B and DPP4) was validated across independent cohorts and patient-derived xenograft (PDX) models. The signature demonstrated strong prognostic value for progression-free, disease-free, relapse-free, metastasis-free, and overall survival. Importantly, the signature not only identified specific NEPC subtypes, such as large-cell neuroendocrine carcinoma, which is associated with very poor outcomes, but also predicted a poor prognosis for PCa cases that exhibit this molecular signature, even when they were not histopathologically classified as NEPC. This dual prognostic and classifier capability makes the 7-gene signature a robust tool for personalized medicine, providing a valuable resource for predicting disease progression and guiding treatment strategies in PCa management.
INTRODUCTION
Prostate cancer (PCa) remains one of the most significant health challenges for men globally, with a high incidence and mortality, particularly in advanced stages of the disease [1]. Despite advancements in early detection and treatment, accurately predicting which patients will experience aggressive disease progression remains a major challenge. A critical gap in the management of PCa is the lack of reliable prognostic biomarkers capable of identifying patients at the highest risk of developing more aggressive forms of PCa, such as neuroendocrine prostate cancer (NEPC), a subtype associated with poor prognosis [2,3]. Addressing this gap is essential for improving patient outcomes and guiding effective therapeutic strategies.
To overcome this challenge, we propose the identification of stem-like characteristics within prostate tumors. Cancer stem cells (CSCs) are a subpopulation of cells within tumors that possess the ability to self-renew, differentiate, and drive tumor growth, metastasis, and resistance to conventional therapies [2,4]. These cells have been implicated in the recurrence and progression of PCa, making them critical targets for both prognostic and therapeutic interventions [5]. Moreover, CSCs are believed to contribute to the heterogeneity of PCa, which complicates treatment and highlights the need for more refined biomarkers [6]. However, despite the recognized importance of CSCs, there is still a need for concise and clinically applicable biomarkers, such as transcriptomic signatures, that can reliably point out the presence of stemness traits in prostate tumors and their associated risk of progression.
In addition to their role in driving tumor growth, CSCs are more abundant after the neuroendocrine differentiation of prostate cancer, which results in NEPC [7]. This PCa subtype can arise either de novo or through the transdifferentiation of adenocarcinoma under selective pressures such as androgen deprivation therapy (ADT) [8,9]. This transdifferentiation process, driven by cellular plasticity and epigenetic changes, results in a highly aggressive cancer subtype that is associated with poor outcomes and limited treatment options [10,11]. Identifying biomarkers that can detect early shifts toward a neuroendocrine phenotype is crucial for managing treatment-resistant cases. However, existing NEPC-related gene signatures are often complex, including a large number of genes, limiting their practical use in clinical settings [12,13].
In this study, we aimed to address these challenges by developing a concise and robust stemness-associated gene signature using machine learning techniques. By analyzing large-scale transcriptomics data from multiple cohorts, we identified a 7-gene signature that predicts multiple PCa disease progression events. This signature was rigorously validated across independent datasets and further substantiated using patient-derived xenograft (PDX) models and a NEPC dataset, where we observed that our signature is able to classify samples as NEPC and, particularly, the large cell neuroendocrine carcinoma subtype. Our comprehensive approach provides a novel and clinically applicable tool for patient stratification and treatment personalization, offering new insights into the role of stem-like traits in PCa and their association with neuroendocrine differentiation.
MATERIALS AND METHODS
Stemness-associated genes
We gathered 144 stemness-associated genes from PCa literature [6,14–18]. We conducted transcriptomics analyses using publicly available PCa datasets (see below) to study differential gene expression across multiple comparisons, including normal/benign tissues, primary PCa tumors, CRPC tumors and metastatic samples. We performed univariable survival analysis to study the association between gene expression and different endpoints (progression, disease-free time biochemical-relapse, metastasis, and death). We also performed multivariable survival analyses that included clinico-pathological features as covariables.
Transcriptomics analyses
Dataset selection criteria
To study differential gene expression across different PCa datasets, we searched Gene Expression Omnibus (GEO) and the Genomic Data Commons Data Portal to identify eligible datasets that met the following criteria: (1) PCa tissue samples with available transcriptomic and clinico-pathological data; (2) The datasets must have ≥2 different tissue sample types (Table 1).
Differential Gene Expression Analyses
We used the limma package (Linear Models for Microarray Analysis) [27] to study differential gene expression from both microarrays and RNA-sequencing (RNA-seq). In the case of non-normalized data, quantile normalization was applied [27]. For RNA-seq data, the voom function in the limma package was used for processing [28]. We conducted pair-wise differential expression analyses within each dataset. For each available probe or gene, the fold changes (FC) between conditions were calculated, and expressed as log2FC. To correct for multiple testing, we used the Benjamini-Hochberg method to control the type I error, and reported adjusted p-values.
Survival analyses
Dataset selection criteria
To perform survival analysis, we searched Gene Expression Omnibus (GEO), cBioPortal and the Genomic Data Commons Data Portal to identify eligible datasets that met the following criteria: (1) PCa cases with available gene expression data and (2) available clinico-pathological features with ≥5 years of follow-up. Gene expression and clinical data were downloaded and analyzed for the resulting selected datasets. Samples with incomplete gene expression data or missing essential clinico-pathological metadata were not included. Datasets were randomly distributed in training (5 datasets, 7 survival analyses) and validation cohorts (4 datasets) (Table 2).
Survival analyses
We used the log-rank test to analyze differences in the risk of disease progression-events between different groups of patients [35]. To stratify patients according to high-or low-expression, we used the Cutoff Finder tool to find the optimal cutoff point for each gene [36]. The Cox proportional hazards model was used to estimate the risk of the disease progression-event for the different groups [37]. Multivariable analyses included clinico-pathological features as covariables. All modeling, calculations, and graphs were performed with the R packages survival [38] and survminer [39].
Selection of candidate genes for modeling a risk score
To identify the 15 most important genes for predicting events we used a machine learning ensemble based approach (i.e Random Forest Classifier) as implemented in the randomForestSRC R package [40]. The mtry and nodesize parameters were optimized through a grid search approach to minimize the out-of-bag error. We used the Breiman-Cutler variable importance (VIMP) measure to estimate the relative importance of each variable in predicting event-free survival within the training datasets. We applied the subsampling method [41] to estimate the standard error of the VIMP and to calculate the confidence intervals. Genes were ranked according to their variable importance. To facilitate the comparison across datasets, VIMP values were converted into fractions, with 1 representing the most important variable and 0 representing the least important variable within a given dataset.
Gene Signature and Risk Score Calculation
We modelled a risk score based on the gene expression of the 15 most important genes identified across training datasets using Random Forest. To develop this risk score, we calculated model coefficients through Lasso regression using TCGA-PRAD data. Patient scores were then calculated based on the expression of the selected genes following Lasso regression. The performance of this risk score was evaluated within each training dataset. Univariable Cox regression was used to estimate the risk of poor survival in patients with high-risk scores. Patients were stratified either by a dichotomized risk score (with the median as the cutpoint) or by a continuous risk score. The concordance index (CI) was used to measure the performance of the signature within each dataset.
In the validation stage, those same coefficients were used in all additional datasets. For each patient, the score was calculated, and its association with event-free survival were studied using univariable and multivariable Cox regressions.
Transcriptome analysis of MDA PCa PDXs
To assess the association of the stemness-signature and other clinico-pathological characteristics in an extensively annotated cohort, we used the MDA PCa PDX series, which was previously developed in the “Prostate Cancer Patient Derived Xenograft Program” at MD Anderson Cancer Center and the David H. Koch Center for Applied Research of Genitourinary Cancers [42]. Briefly, PCa tissue samples used were derived from various procedures and small pieces were then implanted into subcutaneous pockets of 6 to 8 week-old male CB17 SCID mice (Charles River Laboratories) [42]. RNA-Seq and transcriptome analysis on these samples was performed as previously described [43].
Unsupervised Clustering and Principal Component Analysis (PCA)
Unsupervised clustering analysis including the expression data of the stemness-associated genes included in the signature was performed using the pheatmap [44] package and Principal Component Analysis was performed using the factoextra package in R [45].
Receiver Operating Characteristic (ROC) curve for NEPC classification
pRoc package [46] was used for the estimation of receiver operating characteristic (ROC) curve and area under ROC curve (AUC).
NEPC patients samples dataset
To assess gene expression in NEPC samples, we downloaded the data from the Neuroendocrine Prostate Cancer (Multi-Institute, Nat Med 2016) dataset published by Beltran et al. [12] from cBioPortal [47–49]. Briefly, this dataset contains transcriptomics and histopathological data from 49 PCa samples (34 CRPC-Adeno and 15 CRPC-NE) obtained by RNA-Seq.
Statistical analyses
All bioinformatics analyses were performed using the R programming language [50] through the RStudio platform (RStudio, PBC, Boston, MA, USA) [51]. The tidyverse package was used for general data analysis and manipulation [52]. For graphics, the packages ggplot2 [53], ggpubr [54], and RColorBrewer [55] were used. Datasets available in GEO were downloaded with GEOquery [56]. All heatmaps were created with the pheatmap package [44]. Forest plots were created using GraphPad Prism (La Jolla, CA, USA). Student’s t test and ANOVA followed by Tukey’ test were used to assess differences in risk score values across groups. We used the log-rank test and Cox proportional hazard model regression to study the association between gene expression and patients’ survival. Multivariable analyses were performed in R and plotted in GraphPad Prism software (La Jolla, CA, USA). Statistical significance was set at p ≤ 0.05.
RESULTS
Dysregulation of stemness-associated genes across multiple PCa comparisons
We gathered 144 stemness-associated genes in PCa from literature [6,14–18] (Supplementary Table 1) and analyzed their expression and association with multiple survival endpoints (Figure 1A). First, we performed pair-wise differential gene expression analyses using 7 PCa datasets (n=1,259), which included 11 comparisons between normal prostate, primary tumor, metastatic and castration-resistant PCa (CRPC) samples (Figure 1A). Volcano plots evidenced dysregulation of 139 stemness-associated genes, with both up-(red, adjusted p<0.05, log2FC>0) and down-regulation (blue, adjusted p<0.05, log2FC<0) in all comparisons (Primary PCa vs. Benign/Normal/Adjacent, Metastatic PCa vs. Benign, Metastatic PCa vs. Primary PCa, CRPC vs. Benign; CRPC vs. Primary PCa) (Figure 1Bi). Figure 1Bii summarizes these results across comparisons. We observed that all stemness-genes were dysregulated in at least one dataset, with 29 genes consistently up-regulated and 26 genes consistently down-regulated (Figure 1Bii, Supplementary Table 2).
A) Schematic representation of gene selection, transcriptomics and survival analyses to define potential prognostic biomarkers. B) i) Volcano plots showing the results of the differential expression analysis of all available genes within the included transcriptomics datasets. Red = significantly upregulated stemness-associated gene. Blue = significantly downregulated stemness-associated gene. Dark gray = Non-significantly dysregulated stemness-associated genes. Light gray = other genes available in the dataset. ii) Summary heatmap of the transcriptomics analyses performed in multiple publicly available datasets (n=1259). Genes of interest and the results of the differential expression analysis for each dataset are displayed. Each row represents the results of a specific comparison. Annotation depicts the absolute number of comparisons in which each gene is up (red) or downregulated (blue). Red = significantly upregulated gene. Blue = significantly downregulated gene. White = not significant changes. Gray = non available. Datasets: GSE35988 (n=122); GSE3933 (n=103); GSE46602 (n=50); GSE6956 (n=87); GSE70768 (n=179); TCGA-PRAD (n=548); GSE21034 (n=150). Statistical significance was set to adjusted p value<0.05.
Association of stemness markers with PCa patients’ survival
We evaluated the association with different events in PCa patients, including progression-free survival (PFS), biochemical relapse-free survival (RFS), metastasis-free survival (MFS), overall survival (OS), and disease specific survival (DSS) for the 139 differentially expressed genes. Figure 2Ai shows representative Kaplan-Meier plots for three example genes (DBNL, UBTD2, and MBNL2) in the TCGA-PRAD dataset (n=497, PFS). Results showed that high expression of DBNL, and low expression of MBNL2 were significantly associated with poor PFS (HR=2, Log-rank P=0.0011 and HR=0.39, Log-rank P<0.0001, respectively. Figure 2Ai, left and right panels). No significant associations were observed for UBTD2 (Log-rank P=0.1062, Figure 2Ai, middle panel). Figure 2Aii shows a heatmap summarizing the results of the univariable survival analysis for each of the 139 candidate genes performed across the 5 training datasets including 5 different types of events (n=1,229; detailed in Supplementary Table 3). The results are color-coded as follows: red squares represent genes with high expression significantly associated with shorter times to the event, white squares indicate genes with no significant associations to the event, and blue squares represent genes with high expression significantly associated with a better outcome. Of note, there was a group of genes whose high expression was consistently associated with poor prognosis (Figure 2Aii, left, in red), while others were associated with a better outcome (Figure 2Aii, right, in blue).
i) Examples of Kaplan-Meier (KM) curves depicting the association of each gene to the risk of event (purple = high expression of a gene; green: low expression of a gene). HR: Hazard Ratio; Cox P: p-value from the Cox proportional hazards model. Log-rank P: p value of the log-rank test. ii) Summary heatmap of the univariable survival analyses performed on multiple datasets. The red box indicates that high gene expression is associated with a high risk of an event (HR>1 and Cox P<0.05), blue boxes indicate that high gene expression is associated with a low risk of survival-related events (HR<1 and Cox P<0.05) and white boxes indicate that there are no significant associations between gene expression and risk of an event. Gray = gene no available. Patients were stratified by the median expression of each gene. B) i) Examples of forest plots depicting the association of each gene to the risk of event adjusted for all available covariates using the TCGA-PRAD dataset. ii) Summary heatmap of the multivariable survival analyses performed on multiple datasets. The red box indicates that high gene expression is associated with a high risk of an event (HR>1 and Cox P<0.05), blue boxes indicate that high gene expression is associated with a low risk of survival-related events (HR<1 and Cox P<0.05) and white boxes indicate that there are no significant associations between gene expression and risk of an event. Gray = gene no available. All comparisons consider low-expression patients as the reference group. Annotation depicts the absolute number of comparisons in which high expression of each gene is associated with high (red) or low (blue) risk. OS: Overall Survival; DSS: Disease-Specific Survival; PFS: Progression-Free Survival; RFS: Relapse-Free Survival; MFS: Metastasis-Free Survival. Datasets: TCGA-PRAD (n=497 PFS, n=337 DFS); GSE70768 (n=111 RFS); GSE70769 (n=92 RFS); GSE116918 (n=248 RFS and MFS); GSE16560 (n=281 OS). Statistical significance was set at Cox P<0.05. **Cox P<0.01; ***Cox P<0.001.
Next, we performed multivariable Cox regression analyses for each of the 139 previously mentioned genes to evaluate their independence from other known risk factors for PCa progression in predicting an event (Table 2). For the three examples mentioned above, DBNL (HR=2.61, 95% CI 1.40-4.86, Cox P=0.002) and MBNL2 (HR=0.69, 95% CI 0.54-0.88, Cox P=0.003) displayed a significant association with high and low risk of PFS, respectively, independently from the other covariates available in the TCGA-PRAD dataset (PSA levels, ISUP grade, Clinical T Stage, and Targeted Molecular/Radiation Therapy; Figure 2Bi). No significant associations were observed for UBTD2 (Figure 2Bi). The overall results for the multivariable analyses are summarized as a heatmap in Figure 2Bii and detailed in Supplementary Table 4. Most associations observed in the univariable analysis (Figure 2Aii) lost statistical significance after adjusting for clinical covariates (Figure 2Bii).
Modeling a stemness-associated signature with prognostic value
We used a machine learning algorithm to identify the most relevant prognostic candidate genes to model a gene-expression signature that could stratify patients into risk groups of disease progression and death. We used a Random Forest algorithm to rank genes according to their relevance for event prediction in the training datasets and calculated the mean relative importance score for each gene (Figure 3A). The top 15 genes were: ALDH1A1, KMT5C, DPP4, RPS6KB1, TYMS, CCT3, IL1RAP, MICAL3, CDC25B, IRF5, MEN1, DNMT3B, CD24, RND3, and CASP9 (Figure 3A, purple square). Next, we used these genes to develop our stemness-associated risk signature. Model coefficients were calculated on the TCGA-PRAD cohort by Lasso regression, a feature selection method that keeps the most important predictors by shrinking the coefficients of less significant genes to zero. This analysis resulted in a signature of 7 significant genes, generating the following weighted linear model: 0.284×KMT5C + 0.2723×MEN1 + 0.2178×TYMS + 0.09×IRF5 + 0.0827×DNMT3B + 0.048×CDC25B – 0.0597×DPP4, where gene expressions are considered a continuous variable (Figure 3Bi).
Machine learning Random Forest algorithm for prognostic candidates’ selection. A) Heatmap summarizing the relative importance of the variables (genes) for all training datasets. The relative importance was converted into percentiles, where 1 represents maximum relative importance (red) and 0 indicates minimum relative importance (blue). Gray = gene not available in the dataset. The 15 top-ranked genes (purple) were selected as candidates for our stemness-associated risk signature. B) i) Example of Kaplan-Meier (KM) curve using the TCGA-PRAD dataset depicting the association of the 7-gene score to the risk of progression (purple = high 7-gene score; green: low 7-gene score). The coefficients for each gene were calculated by Lasso regression using TCGA-PRAD data, and the 7-gene score was constructed as follows: 0.284×KMT5C – 0.0597×DPP4 + 0.2178×TYMS + 0.048×CDC25B + 0.09×IRF5 + 0.2723×MEN1 + 0.0827×DNMT3B. Patients were stratified by the median of the score. HR: Hazard Ratio; p-value: p-value from the Cox proportional hazards model. Log-rank P: p value of the log-rank test. ii) Summary forest plot displaying the survival analysis of the association of the 7-gene signature with the risk of disease progression-events in the training datasets. Patients survival was analysed by either stratification by the median of the 7-gene score (circles) or taking the 7-gene score as a continuous variable (squares). On the right, heatmap depicting the concordance index value for each of the analyses. The concordance index is a performance measure of the signature within each dataset. Cox P = p-value of the Cox regression coefficient. HR = Hazard Ratio. (95% CI) = 95% Confidence Interval. PFS: Progression-Free Survival; DFS: Disease-Free Survival; RFS: Relapse-Free Survival; OS: Overall Survival; MFS: Metastasis-Free Survival. Statistical significance was set at Cox P<0.05. *Cox P<0.05; **Cox P<0.01; ***Cox P<0.001; ****Cox P<0.0001.
Next, in order to evidence this 7-gene signature prognostic performance, we calculated the risk score for each patient on the TCGA-PRAD dataset and stratified them into high-risk and low-risk groups using the median score as the cutpoint. As expected, we evidenced a shorter progression time for the high-risk group compared to the low-risk group (HR=3.36, 95% CI 2.11-5.35, Log-rank P<0.0001, Figure 3Bi). When we considered the score as a continuous variable, we observed a HR=4.34 (95% CI 2.95-6.37, Cox P<0.0001) for each unit increase in the score (Figure 3Bii, Supplementary Table S5). We then extended the score to the other training datasets and corroborated its prognostic significance within these cohorts. When analyzing the event-free survival in all the other training datasets, we observed significant associations with our model in 5 out of 6 analyses, both using the dichotomous and continuous score, suggesting our 7-gene signature is able to predict risk of multiple disease progression-events across our training cohorts (Figure 3Bii, Supplementary Table S5). The identified genes and the developed risk score model effectively stratify patients based on their risk of adverse outcomes, suggesting their potential as prognostic biomarkers.
Consistent performance across validation datasets
Next, we validated our model using datasets from independent cohorts (n=501). We calculated the 7-gene score for all patients in the different datasets and categorized them into high or low risk using the median as a cutoff. Interestingly, the risk score was significantly associated with event-free survival in all validation cohorts (Figure 4Ai-ii). Of note, in the SU2C dataset, which comprises metastatic PCa samples, patients with high score had near 2-fold risk of death compared to patients with low score (Figure 4Aii). This demonstrates that the 7-gene signature is a robust predictor of the risk of death even in advanced stages. Moreover, when analyzing the 7-gene signature as a continuous variable, all datasets presented significant results, with higher concordance indexes than the dichotomized analysis (Figure 4Aiii). Multivariable analyses demonstrated that our score predicts disease progression-events independently of the other clinico-pathological variables (Figure 4B), which highlights its potential utility in clinical decision-making.
i) Kaplan-Meier curves depicting the association of the 7-gene score to the risk of disease progression-events included in the validation datasets. The coefficients for each gene were calculated by Lasso regression using TCGA-PRAD data, and the 7-gene score was calculated as follows: 0.284×KMT5C + 0.2723×MEN1 + 0.2178×TYMS + 0.09×IRF5 + 0.0827×DNMT3B + 0.048×CDC25B – 0.0597×DPP4. Patients were stratified by the median of the score. HR: Hazard Ratio; Cox P: p-value from the Cox proportional hazards model. Log-rank P: p value of the log-rank test. ii) Summary forest plot displaying the survival analysis of the association of the 7-gene signature with the risk of disease progression-events in the validation datasets. Patients survival was analysed by either stratification by the median of the 7-gene score (circles) or taking the 7-gene score as a continuous variable (squares). On the right, heatmap depicting the concordance index value for each of the analyses. The concordance index (CI) is a performance measure of the signature within each dataset. RFS: Relapse-Free Survival; OS: Overall Survival. B) Forest plots depicting the association of each gene to the risk of event adjusted for all available covariates within each validation dataset. Cox P = p-value of the Cox regression coefficient. HR = Hazard Ratio. [95% CI] = 95% Confidence Interval. Datasets: GSE54460 (n=106); GSE94767 (n=233); DKFZ (n=81); SU2C-PCF (n=81).Statistical significance was set at Cox P<0.05. *Cox P<0.05; **Cox P<0.01; ***Cox P<0.001; ****Cox P<0.0001.
The stemness-associated gene signature captures neuroendocrine disease heterogeneity in the MDA PCa PDX series
Next, we sought to analyze the association between the 7-gene signature and other clinico-pathological characteristics available in the MDA PCa PDX series, which was developed in the Laboratory of Dr. Navone within the “Prostate Cancer Patient Derived Xenograft Program” at MD Anderson Cancer Center and the David H. Koch Center for Applied Research of Genitourinary Cancers. PCa tissue samples used for PDX development were derived from therapeutic or diagnostic procedures, namely, radical prostatectomies, orthopedic, and neurosurgical procedures to palliate complications, and biopsies of metastatic lesions [42] (Figure 5A). We analyzed the expression of the 7 stemness-associated genes selected in the present study using previously generated RNA-Seq data from the 44 MDA PCa PDXs [43]. Surprisingly, the expression of this signature was able to accurately cluster PDXs according to their histopathological classification (adenocarcinoma or sarcomatoid vs. neuroendocrine tumors) in an unsupervised clustering analysis (Figure 5Bi). Moreover, NEPC PDXs displayed significantly higher scores (Figure 5Bii). Specifically, CDC25B, TYMS, KMT5C and DNMT3B were significantly upregulated in NEPC vs. no-NEPC PDXs, while IRF5 and DPP4 were significantly downregulated (Figure 5Biii).
A) Schematic representation of the MDA PCa PDX series establishment and transcriptome analysis (n=44) (created with BioRender.com). B) i) Heatmap depicting unsupervised clustering analysis of RNAseq data from the 44 MDA PCa PDXs considering the expression of the 7-gene signature (KMT5C, MEN1, TYMS, IRF5, DNMT3B, CDC25B and DPP4). Red, white, and blue represent greater, intermediate, and lower gene expression levels. ii) Violin plot showing the 7-gene score levels in no-NEPC and NEPC samples from the MDA PCa PDX series. iii) Violin plots showing the expression levels (FPKM) of the genes included in the 7-gene score in no-NEPC and NEPC samples from the MDA PCa PDX series. C) i) PCA biplot considering the expression of the 7-gene signature using the MDA PCa PDX data assessed by RNA-seq. Each point represents one PDX. Samples are coloured according to the histopathological classification: adenocarcinoma (red), sarcomatoid (beige) and neuroendocrine (purple). ii) Bar plot showing the contribution (%) of each gene in the signature to the variance in the PC1 from the PCA. D) ROC curve showing the performance of the 7-gene score in classifying MDA PCa PDXs as NEPC. Statistical significance was calculated using Student’s t test and was set at p<0.0.5. *p<0.05; **p<0.01; ***p<0.001; ****p<0.0001.
These results were also observed in a Principal Component Analysis (Figure 5Ci), which highlighted KMT5C as the main gene in the signature contributing to the variance (PC1) between samples of different histopathological profiles (Figure 5Cii), followed by CDC25B and DNMT3B. Of note, KMT5C is also the gene that weighs higher in our score (Figure 3Ci). To evaluate the power of the signature in predicting whether a tumor is NEPC, we performed ROC analysis. The AUC of our 7-gene score was 0.92 (Figure 5D), highlighting its high performance for classifying NEPC samples.
Our stemness-score adds value to pre-existing NEPC score
To compare our risk score performance with a pre-established NEPC classification score, we analyzed the expression of the genes from the 70-gene signature by Beltran et al. [12], in the MDA PCa PDX series. We observed a good segregation of the PDXs according to their histopathological classification when using the 70 genes from Beltran et al. NEPC score; however, the 2 double-negative tumors (negative for AR and NE features) were clustered within the NEPC tumors group (Figure 6Ai). Nonetheless, when also including the expression of the 7 genes identified in this work alongside the genes from the NEPC classification score [12], clustering of the PDX was more accurate, not only grouping adenocarcinomas vs. NEPC tumors, but also sarcomatoid samples (Figure 6Aii).
i) Heatmap depicting an unsupervised clustering analysis of RNAseq data from the MDA PCa PDX series considering the expression of the 70-gene signature proposed by Beltran et al. [12] ii) Heatmap depicting an unsupervised clustering analysis of RNAseq data from the MDA PCa PDX series considering the expression of the 70-gene signature proposed by Beltran et al. plus the 7 genes (KMT5C, MEN1, TYMS, IRF5, DNMT3B, CDC25B and DPP4) from the risk score model propose in our work. B) i) Heatmap depicting an unsupervised clustering analysis of RNAseq data from human patients in Beltran et al., dataset (n=49) [12] considering the expression of the 7-gene signature. Red, white, and blue represent greater, intermediate, and lower gene expression levels. Expression values are presented as z-scores. ii) Violin plot showing 7-gene score levels in CRPC-Adeno and CRPC-NE samples from the Beltran et al., dataset. C) i) Violin plot showing risk score levels in samples from the Beltran et al., dataset according to the histological classification: prostate adenocarcinoma without neuroendocrine differentiation, prostate adenocarcinoma with neuroendocrine differentiation >20%, small-cell carcinoma, large-cell neuroendocrine carcinoma, and mixed small-cell carcinoma–adenocarcinoma. ii) ROC curve showing the performance of the 7-gene score in classifying PCa patient samples from Beltran et al. dataset as Large-Cell NEPC. Statistical significance was calculated using Student’s t test or ANOVA followed by Tukey’s test, and was set at p<0.05. **p<0.01; ****p<0.0001.
The 7-gene signature effectively classifies large-cell neuroendocrine carcinomas
To validate the association of our risk score model with NEPC, we analyzed the transcriptomics dataset from Beltran et al. (n=49) [12], which includes 15 samples from CRPC-neuroendocrine (NE) and 34 CRPC-adenocarcinomas tumors. Our signature was able to distinguish CRPC-NE tumors to a limited extent (Figure 6Bi), while, overall, our risk score was significantly higher in CRPC-NE compared to CRPC-Adeno (p<0.01, Figure 6Bii). However, we looked further into the available pathology classification (prostate adenocarcinoma with no neuroendocrine differentiation, n=34; prostate adenocarcinoma with neuroendocrine differentiation >20%, n=2; small-cell carcinoma n=4; large-cell neuroendocrine carcinoma, n=7; mixed small-cell carcinoma– adenocarcinoma, n=2) and observed that 6/7 samples of the large-cell NEPC clustered together (Figure 6Bi), while the 7-gene signature was particularly higher in that subtype (Figure 6Ci). Strikingly, the AUC=0.99 suggests that the signature of 7 stemness-associated genes proposed in this work is accurate in classifying samples as large-cell NEPC (Figure 6Cii). Since large-cell NEPC molecular characterization remains elusive [57], our findings set ground for future research on the implications of these genes in this subtype pathogenesis.
DISCUSSION
In this study we identified and validated a novel 7-gene signature that represents a significant advancement in the prediction of poor outcomes and molecular detection of NEPC. Our findings demonstrate that this signature not only reliably stratifies PCa patients based on their risk of progression but also reveals a crucial link between stemness-associated pathways and neuroendocrine characteristics. Importantly, this signature is particularly adept at identifying tumors within the Prostate Cancer Foundation (PCF) and World Health Organization (WHO)-defined large-cell neuroendocrine carcinoma [58,59], which is regarded as very rare and associated with very poor outcomes (mean survival of 7 months) [59].
Large-cell NEPC are high grade tumors that usually develop from treatment-resistant clones [60]; they are mainly diagnosed histopathologically, thus remaining a challenge and underrecognized [57,58,61]. Hence, there is a need for molecular biomarkers that could subclassify NEPC tumors for better clinical management [57]. The ability of our 7-gene signature to pinpoint this specific aggressive and challenging NEPC subtype underscores the clinical utility of our model in guiding more precise therapeutic interventions.
Our stemness-associated signature addresses a critical need for improving PCa prognosis, while also offering precise stratification of NEPC, which is often characterized by poor clinical outcomes and high proliferative indices [62]. NEPC is recognized as one of the most aggressive and treatment-resistant forms of PCa, often arising in the context of advanced CRPC after multiple rounds of ADT [63]. While most NEPC cases develop in patients with a history of extensive anti-androgen treatment, the disease can also manifest de novo, albeit rarely, in treatment-naïve patients [9,12]. Further, ADT-induced NE transdifferentiation could be explained by altered mast cell infiltration [64,65]. Maimaitiyiming et al. have established a mast cell gene signature with prognostic efficacy in PCa [66], and, interestingly, mast cells have been reported to support the stem phenotype of cancer cells [67]. Altogether, focusing on stemness-associated genes could offer insights into NEPC biology and potential targets.
The molecular landscape of NEPC has been increasingly clarified in recent years, with significant contributions from studies like those of Beltran et al., who have delineated the heterogeneity within NEPC and highlighted distinct molecular subtypes [12,58,68]. Their research highlights the genetic, epigenetic and molecular diversity of NEPC, particularly noting alterations such as RB1 and TP53 loss, MYCN overexpression, and the activation of the PI3K/AKT pathway, which contribute to the aggressive nature of these tumors [12,58,68]. Our study builds on these findings by focusing on a 7-gene stemness signature. Unlike previous signatures that include a broad array of genes, our streamlined 7-gene model achieves comparable or superior predictive accuracy, underscoring its practical utility in diverse clinical contexts.
The biological relevance of the genes in our signature — KMT5C, MEN1, TYMS, IRF5, DNMT3B, CDC25B and DPP4 (also known as CD26)— lies in their involvement in critical processes such as chromatin modification, DNA methylation, DNA repair, cell cycle regulation, immune escape and extracellular matrix remodelling [69–75]. These processes are fundamental to maintaining the plasticity and adaptability of cancer stem cells (CSCs) [5,76], which are enriched after transdifferentiation of prostate adenocarcinoma into more aggressive neuroendocrine phenotypes [77]. For example, KMT5C, DNMT3B and MEN1 play pivotal roles in chromatin remodeling and methylation [69–71], processes that are crucial for the epigenetic reprogramming observed in NEPC [78]. Additionally, TYMS has been previously associated with neuroendocrine differentiation in other types of cancer [79,80]. The integration of these stemness-associated genes into our model highlights the potential for characterizing NEPC-like tumors.
One of the key strengths of our study is the extensive validation of our signature across patient datasets and PDXs models. The latter, which faithfully replicate the histological and genetic features of human tumors, are widely regarded as the gold standard for preclinical studies [81]. Our findings demonstrate that the 7-gene signature consistently distinguishes NEPC from other PCa subtypes in these models, underscoring its clinical utility and its potential for identifying NEPC-like tumors. This aspect of our research not only validates the predictive power of the signature but also highlights its potential utility in translational research, particularly in the development of novel therapeutic strategies aimed at targeting the molecular underpinnings of NEPC.
LIMITATIONS
Despite the robustness of our findings, there are several limitations that must be acknowledged. Our study primarily relies on transcriptomic data from publicly available repositories, while comprehensive, may not fully represent the genetic diversity of PCa patients globally. Future research should focus on further validating our signature in ethnically and genetically diverse cohorts to ensure its broad applicability. Additionally, while our focus on transcriptomic data has provided valuable insights into NEPC biology, integrating multi-omics data, including proteomics and metabolomics, could enhance the predictive power of our model. Moreover, the scarce number of NEPC samples with transcriptomics data and, particularly, of large-cell NEPC (probably due to under-recognition and underreporting [57]) requires further validation in larger cohorts. Functional validation of the identified genes through in vivo studies will also be critical for determining their role in disease and translating findings into clinical practice.
CONCLUSION
This study presents a significant advancement in PCa prognosis and classification of NEPC, particularly for the challenging large-cell subtype. Importantly, PCa cases presenting this molecular signature, even when not histopathologically identified as NEPC, also exhibit a poor prognosis. This reinforces the clinical relevance of our model, which is capable of identifying aggressive tumor subtypes that may not yet display overt NE differentiation but still represent a high risk for adverse outcomes. Through the development of this novel stemness-associated 7-gene signature, our model offers a robust and practical tool with potential clinical application, paving the way for more personalized and effective therapeutic strategies in PCa.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
DECLARATION OF COMPETING INTEREST
The authors declare no conflict of interest.
AUTHOR CONTRIBUTIONS
Conceptualization: AS, PS, EV, DA, FC, MM, JC, AT, EL, JB and GG.
Methodology: AS, PS, RS, GP, MM, JC, JB and GG.
Software: AS, PS, MM, JC, and JB.
Validation: AS, PS, RS, GP, EV, MM, JC, AT, JB and GG.
Formal Analysis: AS, PS, NA, DA, FC, EV, MM, JC, AT, JB and GG.
Investigation: AS, PS, RS, GP, AT, JB and GG.
Resources: EV, JC, AT, EL and GG. Data Curation: AS, PS and JB.
Writing – Original Draft Preparation: AS, PS, NA, DA, FC, EV, MM, JC, AT, EL, JB and GG.
Writing – Review & Editing: AS, PS, NA, DA, FC, EV, MM, JC, AT, EL, JB and GG.
Visualization: AS, JB, GG.
Supervision: EV, MM, JC, AT, JB and GG.
Project Administration: JB and GG.
Funding Acquisition: EV, JC, AT, EL and GG.
SUPPLEMENTARY TABLES
Supplementary Table S1. List of 144 stemness-associated genes gathered from PCa literature.
Supplementary Table S2. Results of differential expression analyses of the 144 stemness-associated genes across transcriptomics datasets.
Supplementary Table S3. Results of univariable survival analyses of the 139 dysregulated stemness-genes.
Supplementary Table S4. Results of multivariable survival analyses of the 139 dysregulated stemness-genes.
Supplementary Table S5. Score levels across training and validation datasets.
ACKNOWLEDGEMENTS
Funding: The present study was supported by Agencia Nacional de Promoción de la Investigación el Desarrollo Tecnológico y la Innovación (ANPCyT) PICT-RAICES-2021-III-A-00080; David H. Koch Center for Applied Research in Genitourinary Cancers at MD Anderson (Houston, TX); and NIH/NCI U01 CA224044. The funders of the study had no role in study design, data collection, data analysis, data interpretation, writing or decision to submit.