Abstract
Dementia syndromes are complex sequelae whose multifaceted nature poses significant challenges in the diagnosis, prognosis, and treatment of patients. Despite the availability of large open-source data fueling a wealth of promising research, effective translation of preclinical findings to clinical practice remains difficult. This barrier is largely due to the complexity of unstructured and disparate preclinical and clinical data, which traditional analytical methods struggle to handle. Novel analytical techniques involving Deep Learning (DL), however, are gaining significant traction in this regard. Here, we have investigated the potential of a cascaded multimodal DL-based system (TelDem), assessing the ability to integrate and analyze a large, heterogeneous dataset (n=7159 patients), applied to three clinically relevant use cases. Using a Cascaded Multi-Modal Mixing Transformer (CMT), we assessed TelDem’s validity and (using a Cross Modal Fusion Norm - CMFN) model explainability in (i) differential diagnosis between healthy individuals, AD, and three sub-types of frontotemporal lobar degeneration (ii) disease staging from healthy cognition to mild cognitive impairment (MCI) and AD, and (iii) predicting progression from MCI to AD. Our findings show that the CMT enhances diagnostic and prognostic accuracy when incorporating multimodal data compared to unimodal modeling and that cerebrospinal fluid (CSF) biomarkers play a key role in accurate model decision making. These results reinforce the power of DL technology in tapping deeper into already existing data, thereby accelerating preclinical dementia research by utilizing clinically relevant information to disentangle complex dementia pathophysiology.
Introduction
Dementia significantly reduces patient quality of life and represents a major source of economic and societal burden worldwide1. Despite an expected increase in the prevalence of dementia over the coming decades2, accurate diagnosis, prognosis, and identification of novel treatment avenues continue to pose significant challenges. After decades of clinical trial setbacks, the emergence of the first disease-modifying therapies (DMTs) targeting amyloid accumulation for Alzheimer’s Disease (AD - which represents a significant proportion of people with dementia3) have been reported4–8. However, challenges persist as these DMTs show only modest delays in disease progression, need real-world replication of clinical effects, and have a high prevalence of adverse effects9. While the roles of amyloid and tau in AD pathophysiology are widely recognized10 evidence suggests that the disease extends far beyond a simplistic interaction of these two proteins11,12. This multifactorial nature, which involves inflammation13, lifestyle factors14,15, and apolipoprotein E (APOE) 4 genotype and its complex interactions with brain function16,17 has contributed to the high rate of trial failure18 and the occurrence of significant adverse reactions to therapy19,20. As a result, the efficacy of these therapies remains controversial.
In this evolving landscape, quantifiable biomarkers have emerged as an indispensable resource for AD and dementia research21, offering insights into disease progression, differential diagnosis, and therapeutic response22. Nevertheless, a noticeable disparity exists between clinical endpoints such as cognitive decline and the preclinical indicators of these biomarkers. This disparity underscores the challenges of translating actionable biomarkers into tangible clinical benefits23. However, with more comprehensive and diverse data, new and more precise biomarkers can be identified. This, in turn, could allow for the development of more effective therapeutic targets and more accurate predictors of clinical outcomes. Recent decades have seen a proliferation of large-scale initiatives aimed at achieving this goal through datasets such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and its various incarnations24, including the Australian Imaging Biomarkers and Lifestyle Study (AIBL)25. These and similar large multimodal datasets contain a wealth of clinically relevant information that, if harnessed effectively, could revolutionize dementia research, enhance clinical trial design and execution, and ultimately improve standard of care and patient outcomes.
The full potential of these datasets, however, remains somewhat unexploited. This is due to challenges in analyzing data obtained from numerous sources, each having unique issues such as data-missingness and comparability26,27. Standard analytical techniques, although sophisticated and well-validated, are often ill-equipped to handle this lack of data harmonization and structure, particularly the multimodality 28 associated with large datasets. This absence of uniformity thus undermines the fruitful translation of preclinical research to clinical application. Recent years, however, have seen the advent of new technologies that could address these barriers. The rapidly advancing field of Deep Learning (DL) for example, presents such an opportunity and is gaining significant traction within regulatory frameworks in North America29, Europe30, and Asia31. Recent innovations, particularly in multimodal DL models, offer an unprecedented opportunity for more holistic analyses of these heterogeneous dementia datasets32, even in the presence of missing and diverse forms of data33 As a result, DL-powered analytical technologies could enhance how we approach, understand, and analyze large, complex, and disorganized data.
While promising in this respect, such multimodal DL systems require diligent validation due to (i) the often-subtle nature of novel biomarker identification, (ii) the need for alignment of model predictions with existing medical knowledge, and (iii) the “black box” nature of model decision-making. In this paper, we aimed to provide this validation of a DL-powered clinical decision support system designed to enhance dementia research and patient care. Termed the TelDem system, we evaluated the power of this DL approach in integrating multimodal data from disparate sources, handling missing data, and incorporating diverse modalities. Specifically, we focused on six datasets from the ADNI, AIBL and the Frontotemporal Lobar Degeneration Neuroimaging Initiative (FTLDNI) studies, comprising of over 7,000 patients with AD and three subtypes of frontotemporal lobar degeneration (FTLD). Additionally, we defined three distinct use cases, each reflecting real-world clinical scenarios based on expert opinion. In the first use case (UC1), we assessed TelDem in the context of the often-difficult differential diagnosis of AD and FTLD34,35 by evaluating the architecture’s ability to successfully identify cognitively normal individuals (CN), AD, and three FTLD subclasses (behavioral variant frontotemporal dementia – bvFTD, semantic variant Primary Progressive Aphasia – svPPA, and non-fluent agrammatic variant Primary Progressive Aphasia – nfvPPA). In the second use case scenario (UC2), we focused on AD and its stages including risk-states, evaluating whether the architecture could successfully classify participants into CN, mild cognitive impairment (MCI), or AD. Finally, in the third use case scenario (UC3), we evaluated the system in predicting the conversion of MCI to AD. Each use case was designed to reflect current challenges facing biomarker identification, with the aim of testing TelDem’s potential in accelerating dementia research and patient care.
Methods
Data used in this article were obtained from six open-source datasets; the Alzheimer’s Disease Neuroimaging Initiative (ADNI)36, releases 1, 2, 3, and ADNI-GO, the Australian Imaging, Biomarker & Lifestyle Flagship Study of Ageing (AIBL)25,37, and the Frontotemporal Lobar Degeneration Neuroimaging Initiative – (FTLDNI). Each dataset contains a variation of cognitively normal individuals (CN), people with MCI, AD, and FTLD, resulting in 7,159 participants included in our study (Table 1).
ADNI (adni.loni.usc.edu) was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). ADNI is a longitudinal multicenter neuroimaging study consisting of CN individuals, MCI, and AD across four separate studies: ADNI-1, ADNI-GO, ADNI-2, and ADNI-3. Participants underwent physical and neurological examinations, standardized neuropsychological tests, and provided blood, urine, and in a minority of cases (∼20%), cerebrospinal fluid (CSF) samples. All participants underwent either a 1.5T or 3T MRI scan, a Mini-Mental State Examination (MMSE), the Clinical Dementia Rating (CDR), and memory performance measured on the Wechsler Memory Scale (WMS-R) Logical Memory (LM-II subscale of the WMS-R) at screening. Eligibility criteria relied on a set of cognitive scores and included, for instance, an MMSE score between 24 and 30 in cognitively normal individuals and MCI, and between 20 and 26 in AD36.
AIBL is a two-site longitudinal study by the AIBL study group, in which participants were initially classified as AD, MCI, and CN and assessed over approximately 10 years. Participants were at least 60 years of age and were screened for neurological conditions including non-AD dementia and psychiatric illnesses including schizophrenia, depression, Parkinson’s disease, or stroke. AIBL study methodology has been reported previously37.
The FTLDNI study investigates sporadic and familial forms of FTLD, in particular bvFTD, svPPA, and nfvPPA. This includes both symptomatic individuals and asymptomatic or ambiguously symptomatic family members. The study aims to delineate brain function changes attributable to these disorders and differentiate them from normal aging. Participants were categorized after a comprehensive series of physical exams, cognitive evaluations, and interviews according to international consensus criteria38,39. Notably, FTLDNI includes two extra CDR readouts that pertain to a modified version of the assessment: the CDR Dementia Staging Instrument PLUS National Alzheimer’s Coordinating Center (NACC) Behavior and Language Domains (CDR plus NACC FTLD40). This assessment was specifically aimed at discriminating FTLD from AD. For a detailed breakdown of the compiled dataset endpoints, including full sample demographics, see supplementary table 1.
Clinical Use Cases
We identified three clinical settings that mirror realistic use cases (UCs) based on clinical expert opinion. The UCs were evaluated by in turn, each time increasing the number of input modalities for the model, thereby gradually transitioning from a unimodal to a multimodal model. We designed the unimodal application to simulate a real-world situation in which data are limited, inputting only MRI to the model. We then expanded the information set to data that is accessible in most clinical environments and does not require any invasive procedures. This clinical standard setting includes the patient’s demographics and behavioral and cognitive assessments, such as the MMSE. Finally, we evaluated the system in the most comprehensive UC (referred to hereafter as the Invasive/Research setting) where we included biomarkers or variables not acquired routinely. These included CSF biomarkers, plasma biomarkers, and APOE genotyping. In UC1, we tested the systems’ ability to distinguish between CN, AD, and the FTLD subclasses – bvFTD, svPPA, and nfvPPA). For UC1, we tested the system fits in an unimodal configuration with only T1-w MRI as an input, and subsequently in a multimodal configuration in which demographics, cognitive assessments, and APOE status were added. Secondly, UC2 assessed whether the system could stage AD disease, first in CN participants, MCI, and AD, and, in UC3, the progression from MCI to AD. For each UC, we conducted 5-fold cross-validation experiments by splitting the folds on a subject level to avoid data leakage when evaluating our models (see supplementary Fig. 1).
Data Selection
We applied quality control on the assigned labels in the ADNI and AIBL datasets by removing participants who had at least one diagnosis not related to AD. Notably, this step allowed us to exclude many MCI participants who were indicated to have FTLD or Lewy Body Dementia, which have very different features from MCI and AD (these participants appear as “Other” in Table 1, alongside people without a diagnosis and conditions outside of our scope).
Use Case 1 - Dementia Differential Diagnosis
Data selected for this task includes CN participants from all studies, AD patients from AIBL and ADNI, and FTLD subclasses from the FTLDNI dataset. We used structural imaging, demographics, behavioral assessments, and cognitive scores as modalities in modeling. Among these, we included the CDR, ensuring the score formulation was consistent across the FTLDNI Dataset, and ADNI/AIBL (see supplementary Fig. 2).
Use Case 2 - AD Staging
Dementia patients came from the AIBL and the ADNI studies while the CN group also included patients from FTLDNI. Modalities included were demographic, behavioral assessments, CSF and plasma biomarkers, and T1-weighted (T1w) Magnetization-Prepared Rapid Gradient-Echo (MPRAGE) MRI. We did not include any cognitive scores since they were used as screening tools and outcome metrics to define the conditions at baseline.
Use Case 3 - MCI conversion
Only ADNI and AIBL participants were included for this UC. We defined as progressive MCI (pMCI) all MCI patients who also received an AD diagnosis during the duration of the study, independently of the progression time. Data used for this application was demographics, T1w MRI, CSF and plasma biomarkers, behavioral scores, and cognitive scores since their adoption to define the condition only applied to the baseline encounter.
Preprocessing of Tabular Data
Cerebrospinal Fluid and Plasma Biomarkers
We obtained all data from the Imaging and Data Archive (IDA) from the Laboratory of Neuroimaging (LONI) portal and data processing is as described by ADNI. Briefly, CSF Gap43 and Neurofilament Light Chain (NFL) were preprocessed with an ELISA assay while plasma NFL was processed with a single molecule array (Simoa) technique. Plasma phosphorylated tau (p-Tau181) was assessed with an in-house Simoa array while plasma NT1-Tau was processed through the Quanterix Simoa Platform HD-1. CSF Amyloid β1-40 (Aβ1-40), Amyloid β1-42 (Aβ1-42), Tau, and p-Tau were processed by the University of Pennsylvania. We removed outliers by filtering out all samples beyond the 95th percentile of the respective distributions as their variability was reported to be considerable41, especially for older studies. Moreover, while ADNI relied on predefined cutoffs, this procedure was not applied consistently across biomarkers. Finally, we computed an additional variable, the ratio between Aβ1-42 and Aβ1-40 and rescaled all features between 0 and 1 to convert data to a suitable scale for machine learning.
Neurological and Behavioral Assessments
We selected the total scores from all relevant neurological assessments. We subsequently applied min-max rescaling as described for CSF data preprocessing above. For the Neuropsychiatric Inventory Questionnaire (NPI-Q), we identified and marked all questionnaires that were left completely unanswered as missing. When a screening question was answered negatively, we assigned a severity score of zero to that answer. We then extracted only the 12 severity scores for each questionnaire, excluding the screening questions. We normalized the scores by dividing them by 3 (highest severity score) and constructed dense vectors as input to our models. An overview of all included data for each UC is shown in Table 2.
T1w MRI preprocessing
We included T1w MRI images from the ADNI 1-3-Go, AIBL, and NIFD databases in our analyses by querying all relevant images (including both 1.5 and 3 Tesla acquisitions). We applied a deep neural network from ANTsPyNet to each T1w image to perform brain extraction. Each extracted brain was bias-corrected using ANTsPy and cropped to remove empty space. We then resampled all images and centered cropped images to 1283 voxels to fit all images to a uniform size and isotropic voxel resolution of 1.7 mm3.
Cascaded Multimodal Mixing Transformers
To overcome data-missingness, we used PyTorch 2.0.1 to implement a Cascaded Multimodal Mixing Transformer (CMT), adapted from a previously described architecture42. Unlike other DL models that process all inputs simultaneously, CMT builds its final representation sequentially, enriching a learnable representation with information from different modalities step-by-step (Fig.1). In our implementation, we used this sequential nature to effectively exclude missing modalities from the processing chain instead of feeding zero-like embeddings to the blocks.
Modular Architecture
Our CMT consisted of blocks designed to convert data from different sources into a unified latent representation with a dimension of 256. Each block processed the output of the previous one by integrating additional information42. Each block contained three main components: an embedder, a cross-attention layer, and a transformer encoder. The embedders transformed our input modalities into modality embeddings with unique encodings that could be processed by the model. This allowed for the independent embedding of each modality separately. We encoded categorical data (i.e., APOE status and sex) with a lookup table, while ordinal data (such as assessment scores and age) were linearly projected to the latent dimension. Finally, we adopted a convolutional feature extractor as encoding architecture for volumetric MRI data. We applied cross-attention to merge each modality embedding with the latent one, sequentially. In this step, we translated the individual modality information (e.g. T1w MRI) to the transformer’s space using the modality embedding as the Key and Value and specified the Query as the model’s latent embedding to attend to the incoming data. Following each fusion, we applied self-attention and a forward network with a dimension of 1024, consistent with a standard transformer encoding layer43 (Fig. 1).
Modality Dropout
We leveraged the original modality dropout technique42 to account for bias originating from different data availabilities across modalities and to improve model robustness to missing data. Here, we randomly masked batches of data with Not a Number (NaN) according to a well-defined dropout rate (see further sections), thereby applying missingness upon existing data to make the model more robust to unavailable information, independently of the modality.
Data Sparsity and Bias
To account for possible biases common to models trained on complex, incomplete, and multimodal data, we took bias-correcting quality control steps. To account for biases resulting in discrepancies in data availability across studies (such as CSF biomarkers only being measured in ADNI), we used variables that were available for all the conditions selected for each use case (for example, we included CSF in UC2 but not in UC1). We also developed a new training framework referred to as Cascaded Training to deal with unbalanced classes with different amounts of missing data for each modality. Additionally, we used an established balanced accuracy metric to evaluate model decision-making better44.
Cascaded Training Framework
We developed the Cascaded Training Framework to address suboptimal learning outcomes caused by different data-missingness patterns across modalities and frequency of the diagnoses. The model’s blocks were sequentially and individually trained with a modality-specific loss weight computed on the class distribution within each modality. Once a block was trained, its weights were frozen, allowing for the next block to be added in sequence (Fig. 2). We conceptually separated the training block CMTi, which is the one being trained at a given iteration, from the trailing blocks CMTj|j<i that appeared before CMTi and provided context to the block being trained. In our experiments, for simplicity, we trained all categorical and ordinal blocks with a learning rate of 1E-5 for 20 epochs. To train the T1w MRI block, we raised the learning 5E-5 and used a linear learning rate decay schedule for 70 epochs. We used the Medical Open Network for Artificial Intelligence (MONAI) libraries45 to load NIFTI, rescale intensities between 0 and 1, apply random affine transformations and flip along the z-axis (transversal) to improve robustness and generalizability46. Data augmentation can result in data leakage between train and validation splits. Therefore, we used MONAI augmentation pipelines that work in a streamlined fashion to avoid this.
Representation of cascaded training showing trailing blocks that are “locked” as their weights are frozen. Stage 1 shows the integration of “Age” to the model, whose weight is subsequently frozen for the following integration of “Sex” (Stage 2). This process of integration and preceding freezing of weights is repeated with the inclusion of all further features to the model (Stage N). Despite the freezing of feature weights, they remain capable of processing the input and providing context to the training block, which is shown placed at the end of the chain.
Modality Dropout Computation
We used modality dropout on the trailing blocks to mitigate the bias associated with different data availabilities. Let’s denote with CMTi as the CMT’s Block that processes the modality mi. Before CMTi began training, we filtered out from the original dataset all the observations where mi was missing, leading to a dataset Dmi where the training modality was complete. We then computed the dropout to be applied on the trailing blocks in a way to have homogeneous amounts of missing data. Let’s denote with rtarget the missing rate to be achieved in all trailing modalities. We removed from the target dropout the natural missing rate of the data (rdata) in the identified subset. This strategy allowed us to avoid over-dropping of sparse modalities since their missing rate would have been close to the target dropout, hence resulting in a minimal additional dropout.
We adopted an expected dropout rate of 70% for tabular data, and subsequently raised it to 90% when training T1w MRI, in order to boost unimodal performance. Additionally, we implemented a label-specific modality dropout strategy to address potential biases that could arise from uneven data distributions across different classes, ensuring that no label was appearing more frequently than the others within the same trailing modality. This, in turn, prevents the model from associating data patterns too closely with specific classes, which can occur if certain modalities are disproportionately linked to specific classes (for example, in ADNI, CSF was acquired more often in MCI and CN). To achieve this, for each epoch, we determined the extent of missing data for each modality within each label. For each modality, we identified the highest rate of missing data across all labels. Then, for labels with lower rates of missing data for a given modality, we artificially increased the amount of missing data to match the highest rate observed. We accomplished this by randomly selecting a subset of data points to be masked as if they were missing. In the hypothetical situation where a modality was present exclusively in one diagnosis, this dropout would have assumed the value of 1 (total dropout) for the other classes, resulting in no learning regarding that modality.
Explainability
We relied on graphical and quantitative methods to establish the origin of our model’s errors and to characterize the response of our model to different modalities. We adopted t-distributed stochastic neighbor embedding (t-SNE)47 to visualize high-dimensional features learned by the embedders of each block in a 2-dimensional space. We also observed that the input-output relationship of cross-attention carried value in explaining how the model was responding to different modalities, independently of their nature. We propose then a new cross-attention-based explainability metric, Cross-Modal Fusion Norm (CMFN), as follows:
In this formulation, Qin is the multimodal embedding, Qmodality is the embedding about to be merged and d is the latent dimension. We hypothesize that Qin encapsulates information about the patient, which is progressively enriched through the information flow, while Qmodality provides contextual information about the input modality. As a result, the proposed metric aims to describe the modality-specific cross-attention response given the participant’s characteristics which are comprehensively stored in Qin. We computed the average CMFN for each class to understand important mechanisms across features, and we plotted how this metric was reacting at different values within the same modality to get insights on what was considered meaningful. For high-dimensional modalities, we relied on t-SNE to reduce the dimensionality of the embeddings and highlighted the coordinates by CMFN magnitude as shown in Fig. 5.
Post-Hoc Analysis
Following model training on different folds, we evaluated each on its respective validation fold. This cross-validation process provided performance metrics and confidence intervals, which we aggregated to evaluate the model’s overall performance across the dataset, enabling global confusion matrix creation, post hoc analysis, and attention analysis. We also analyzed misclassified data to uncover sources of error. Where we could not plot the input values to interpret where the model misclassified, we again adopted t-SNE as a dimensionality reduction tool and colored correct and incorrect predictions with different markers.
Results
Comparison of our model performance to state-of-the-art results obtained from other literature (see supplementary section 3) show varying results for each UC (Fig. 3), with poorer performance in the unimodal approach of UC1 and equal or improved performance in UC2 and UC3, particularly in the Clinical Standard and Invasive/Research multimodal modeling approaches. For UC1, we report two performances: for the full differential diagnosis, 5-classes prediction (classes being: CN, AD nfvPPA, bvFTD, and svPPA; chance level 20%) and, when comparing with the literature, for a 3-classes prediction (CN, AD, FTLD; chance level 33%) obtained by grouping the FTLD subclasses predictions to one new class. This aggregation was necessary to provide a robust comparison with existing work.
Use Case 1: Differential Dementia Diagnosis
Unimodal
Using only T1-w MRI, our model achieved a balanced accuracy of 61.1±4.2% (confusion matrix in Fig. 4A; chance level 20%). As previously described, predictions for FTLD subclasses were aggregated into one class to facilitate comparison with the literature.
This resulted in 69.5± 2.6%. The Receiver Operating Characteristic (ROC) showed a robust area under the curve (AUC) for all classes, with the minimum AUC being 0.79 for AD versus all the other classes. However, this was accompanied by large confidence intervals (ROC curves color bands in Figure 4A). A t-SNE reduction of the generated features allowed the identification of clear clusters for svPPA and bvFTD. Still, nfvPPA was not well defined in the geometric space with major overlaps between bvFTD and CN. A partial overlap was also observed between AD and CN features.
Clinical Standard
When adding cognitive, behavioral, and demographic Clinical Standard information, the model achieved a balanced accuracy of 75±3.8% (confusion matrix in Figure 4B). Aggregation of predictions resulted in 90.1±2.0% balanced accuracy, marking a significant increase in disease classification accuracy compared to the unimodal approach. Sensitivity towards nfvPPA decreased due to substantial difficulty in distinguishing it from bvFTD. ROC analysis for the multimodal approach showed a marked improvement in AUC across all predicted classes, indicating better overall performance (Figure 4B). Note that nfvPPA retained a high AUC but in comparison with other syndromes, yielded the lowest AUC improvement. Additionally, the variability of performance across folds was qualitatively reduced. A t-SNE dimensionality reduction of the multimodal embeddings showed a complex space organization resulting from different modality values. The space showed clusters characterized by inferior variability compared to the unimodal scenario, particularly with AD being significantly better separated from CN.
Error Analysis
We analyzed model misclassifications and discovered the model generated false negatives when cognitive performance was better in AD, nfvPPA, and bvFTD. Likewise, the opposite was also true (Fig. 4C). When age was considered, we found most of the AD misclassifications happening in subjects above 70 years old while most of the errors below this threshold were happening with FTLD (Fig. 4D). In terms of psychological assessments, we observed abnormal depression levels in cognitively normal volunteers, thereby resulting in misclassification as AD, nfvPPA, and bvFTD.
Explainability
The Cross-Modal Fusion Norm (CMFN) analysis provided insights into how different modalities contributed to the classification (Fig. 4E-G, 5A). For cognitive assessment scores such as MoCA and MMSE, the model placed higher attention on lower scores. Higher scores were also attended to in cognitively normal volunteers (Figure 4F, 4G). Geriatric Depression Scale (GDS) scores also showed complex responses in that cognitively normal people received more attention for low scores but, as the score increased, other dementia classes overtook the attention over CN (Fig. 4D) showing that the metric can capture a stratified response exhibited by the model. The NPI-Q and T1-w MRI data showed similar patterns, with higher attention given to svPPA and bvFTD classes (Fig. 5A). Furthermore, when CMFN was inspected in MRI features, we found that elevated attention was placed on features belonging to FTLD conditions while AD features received much less attention in comparison (Fig. 5A).
Use Case 2: AD-Staging
Unimodal Approach
When the model input was limited to T1-weighted MRI data, the achieved balanced accuracy was 55.6±1.4% (chance level 33.3%), as reflected in the ROC curves (Fig. 6A), with specific challenges in detecting MCIs reliably. This was especially evident when looking at the t-SNE plots showing MCI overlapping with the other two classes (Fig. 6A). Note that MCI was not stratified into converters and non-converters to AD when training the model for this use case.
Clinical Standard
With the further inclusion of behavioral assessments and demographic data, the balanced accuracy increased to 60.9±2.6% (Fig. 6B). MCI sensitivity showed the biggest change since it improved from 10% to 40%. ROC analysis for this multimodal approach showed an improvement in AUC across all predicted stages, indicating a better overall performance (Fig. 6B).
Invasive/Research
In the most comprehensive setup including all multimodal data available, the balanced accuracy reached 64.1±2.9%. This setup provided the highest accuracy and sensitivity (Fig 6C). ROC analysis further supported this finding, with the highest AUC values observed in this setup, indicating superior classification performance. It is important to frame the improvement within the context of the sparsity of CSF data, which averaged 80.6% missingness in our dataset. Interestingly, we found that the sensitivity to AD and CN remained nearly unchanged across unimodal and multimodal settings since, also in this last case, the primary driver in accuracy was found in MCI sensitivity which improved by 6% compared to the Clinical Standard (as can be inferred from the confusion matrices in Fig. 6A-C).
Error Analysis
Though we trained our model in a three-class setting, we analyzed the differences in misclassifications in the pMCI and stable MCI (sMCI) populations separately to understand better when the model was failing (note that pMCI and sMCI correspond to converters and non-converters from MCI to AD in the literature48). We found differences in misclassification between the two groups: pMCI were overall more readily detected as MCI, with a sensitivity of 56%, however, 75% of misclassifications happened towards AD. On the other hand, sMCI proved to be harder to recognize, with a sensitivity of 47% and a disproportion in misclassifications (66%) towards the normal class. Misclassified cognitively normal participants were generally older and less educated in the case of MCI and AD predictions. Behaviorally, misclassified CN individuals had more severe GDS scores when misclassified as MCI and AD. Important differences across distinct prediction groups were especially found in biomarkers where observations predicted as AD had higher Tau, p-Tau, Plasma p-Tau 181, Plasma NFL and lower Aβ1-42, and CSF Aβ1-42/Aβ1-40; analogous but opposite patterns were observed for mistakes towards CN (lower Tau, p-Tau, higher Aβ1-42 and CSF Aβ1-42/Aβ1-40) (Figures 6D, 6E). In all biomarkers, pMCI distributions were more similar to AD than the stable group. This behavior was also mirrored by the sMCI group, where mistake clusters for the CN and the AD groups showed opposite patterns in CSF and plasma biomarkers (Figures 6D, 6E).
Explainability
The CMFN indicated plasma and CSF biomarkers as the most impactful information, with Aβ1-42 being the most important. The CMFN analysis revealed distinct patterns for various biomarkers. For Aβ1-42, the CMFN was higher for AD and MCI at lower levels and higher for CN at higher levels (Figure 6F). CSF Tau, p-Tau, NFL, and plasma NFL showed higher attention for cognitively normal individuals at lower biomarker levels compared to AD and MCI, with the trend reversed at higher values (Figures 6G, 6H). The model also showed significant attention to the APOE genotypes, particularly (2,3) and (4,4). High-dimensional attention analysis through t-SNE revealed that MRIs belonging to AD featured high CMFN, while the metric dropped in the middle region where the overlap between AD and CN was higher (Figure 5C).
Use Case 3: MCI Conversion
Unimodal Approach
Using only T1-w MRI for prognosis yielded moderate results regarding balanced accuracy (63±1.5%; chance level 50%) (Fig. 7A) and AUC (0.673) which were reflected in the t-SNE feature space where no clear separation emerged.
Clinical Standard
Incorporating cognitive and behavioral scores alongside demographics notably boosted the model performance, yielding a 69±1.8% balanced accuracy (Fig. 7B). We also observed this improvement in the ROC Curve, which showed an improved AUC of 0.756.
Invasive/Research
In line with other UCs, the accuracy and AUC peaked in the most comprehensive setting (Fig. 7C). Incorporating CSF, plasma, and APOE biomarkers allowed for improved confidence in the predictions which was reflected in the AUC (0.774) and balanced accuracy (70.8±2%).
Error Analysis
Analyzing where the model struggled in predicting the conversion to future AD highlighted many possible reasons for misclassification. Regarding demographics, we observed that persons misclassified as pMCI had significantly higher age, and lower education and that cognition played a key role given that misclassified pMCI had cognitive scores towards the lower end of the distribution. CSF and plasma biomarkers revealed similar results since the biomarker profile of the misclassified subgroup overlapped with the one from the antagonist group (Fig. 7D, 7E). Finally, in all settings, we discovered a longer time to conversion between the observations of false negative pMCI, as shown in Figure 8.
Explainability
We observed an anomalous behavior in the CMFN for demographic modalities where the model did not effectively make use of the information, causing local divergence in the early blocks. This resulted in an inflated CMFN that made the interpretation of important features impossible. The inflated CMFN issue primarily affected the modalities placed at the beginning of the CMT chain but gradually dissipated in later blocks. In the latter blocks, however, we observed that the CMFN was elevated in sMCI compared to pMCI. We observed the opposite pattern, however, for Aβ1-42 whereby lower levels were more attended in sMCI (Fig. 7 F-H).
Discussion
Both preclinical and clinical data are complex in nature and are often siloed and difficult to integrate. To assess the validity of emerging Deep Learning (DL) technologies in addressing this in the context of dementia research and healthcare, we have applied a CMT architecture to three neurodegenerative diseases Use Cases (UCs) using data from the six open-source data sets. In UC1, we evaluated TelDem, first using an unimodal and then a multimodal modeling approach, in the differential diagnoses of cognitively normal adults, AD, and three subclasses of FTLD. In UC2, we evaluated the model in the prognostic staging of AD, comparing model performance firstly in an unimodal approach, secondly in a clinical standard approach, and finally in a multimodal approach. In UC3, we investigate disease progression, testing the model’s ability to distinguish between progressive and stable MCI. Overall, our results show that the addition of multiple modalities improves model accuracy compared to the modelling of unimodal data.
Use Case 1 – Dementia Differential Diagnosis
Compared to other unimodal automated solutions49,50, TelDem achieved superior balanced accuracy (0.901) (CN vs AD vs FTLD) despite underperforming when only using MRI compared to previous studies51–53. The incorporation of multiple data modalities enhanced diagnostic accuracy by 14% and improved confidence of the predictions, with notable increases in the AUC across most diagnostic groups. This improvement highlights the system’s ability to leverage diverse diagnostic data effectively, though it exhibited a multimodal trade-off, particularly in nfvPPA, where a more frequent confusion with the behavioral trait suggested an over-reliance on behavioral assessments for this diagnosis. While sensitivity decreased, the AUC remained the same however, which indicates that the model is still capable of ranking positive cases with the same decision threshold. Additionally, not all participants underwent MRI at all encounters, which could moderate sensitivity in situations where MRI is critical, such as nfvPPA.
Several hypotheses could explain why the model struggled with the nfvPPA diagnosis. First, despite broad clinical and anatomical differences, both bvFTD and nfvPPA show focal neurodegeneration in the insulae54, longitudinal atrophy in dorsolateral and prefrontal regions55, and show elements of tau histopathology56. Secondly, nfvPPA shows locally unspecific and limited brain atrophy in comparison to svPPA, making detection with MRI challenging57. Including assessment of language comprehension could possibly improve nfvPPA classification. Additionally, nfvPPA had the least available data and while our model can handle this, it is possible that a limited training size in conjunction with these other ambiguating factors played a role in model underperformance.
In the context of explainability, CMFN analyses revealed that T1-w MRI and the NPI-Q, which is specifically designed to assess psychopathology in dementia58, were the most critical in model decision-making for UC1. One explanation for this relates to the variety of behavioral and psychological symptoms that occur in dementia (BPSD)59,60 and how these characterize different dementia disorders61. For example, changes in eating habits are reported to be more frequent in FTLD than in AD62 while delusions occur more in AD63. However, it’s also important to highlight that BPSD symptomatology is not static64 and there may be overlap across FTLD subtypes. This dynamic nature of BPSD could have contributed the conflation of nfvPPA with bvFTD.
Secondly, MRI was critical in differentiating FTLD subtypes, which is unsurprising given that MRI allows for the identification of AD-specific patterns of atrophy or of different subtypes of FTLD65,66. However, the model may have also relied on demographics and cognitive scores to classify AD as less attention was attributed to MRI for this diagnosis compared to FTLD. Interestingly, CMFN also showed that the CDR received less attention at intermediate values, which suggests that the model relied on this assessment to rule out dementia, rather than to distinguish between different subtypes. This is consistent with the literature, given evidence that the CDR has a limited utility in discriminating FTLD from AD67. For this reason, efforts to integrate more extensive neurological domains to the CDR are underway68. Moreover, model responses to the same information varied depending on the diagnosis. For example, CN participants and persons with dementia showed different response patterns in cognitive scores, suggesting the model can downweigh outlying pieces of information, such as low cognitive scores. For GDS ratings, we observed the opposite pattern where, for a total score of zero, CN participants received the most attention. Attention at higher scores was dominated by svPPA, bvFTD, and AD.
Use Case 2 – AD Staging
Regarding UC2, our system underperformed in unimodal scenarios when solely using MRI to stage AD. While this could be attributable to the limited image resolution, multimodal modelling including demographics and behavioral scores boosted performance, reinforcing the importance of including broader clinical metrics in the diagnostic process. Including these biomarkers also improved both AUC and sensitivity. However, errors with this diagnosis were still a major source of inaccuracy. This outcome was nevertheless anticipated due to the heterogeneous nature of MCI, which does not always manifest clearly in biomarker profiles69, possibly leading to the observed misclassification of pMCI as AD. While the inclusion of CSF and plasma biomarkers did show improvements in model performance, these gains were limited by the scarcity of these biomarkers in our dataset. While this would suggest that addition of these more fine-grained measures could help with classification, it is possible that increased availability could yield only minor improvements, given that diagnostic criteria for AD often rely heavily on cognitive readouts only70. Hence, while biomarkers provide valuable information that help with predicting cognitive decline, their potential to fully resolve misclassification remains constrained by both availability and the limitations of the current AD diagnosis71.
Similar to UC1, we also aimed to explore model explainability. Our analysis through CMFN revealed that CSF data received, on average, the most attention with Aβ1-42 being the most discriminative along with its ratio. Conversely, Aβ1-40, tau, NFL, and gap43 were considered less. This aligns with evidence showing that (i) Aβ1-42 is more diagnostically relevant compared to Aβ 72, that changes in CSF Aβ occur early in the course of preclinical AD73,74, and that Aβ is a reliable CSF-based diagnostic marker of AD75,76. MRI was also highlighted important with regards to model decision making but primarily in AD, suggesting that the model was relying on other information to exclude pathology. CMFN analyses also showed lower explainability with larger overlap between AD and CN. This may be due to MCI being an intermediate stage between normal cognition and AD77. Similar to cognitive scores in UC1, the model’s attention responded variably to CSF biomarker levels, suggesting an ability to contextualize biomarker data within the broader diagnostic picture. The model also focused more on certain genotypes (e.g., APOE 2,3 and 4,4), which again is consistent with literature showing the role of APOE genotype, in particular, in influencing AD risk78,79.
Use Case 3 – MCI Conversion
Predicting the progression of MCI to AD (UC3), our results suggest an enhanced prediction accuracy when using multimodal DL to integrativley model all available diagnostic data. Although the unimodal approach underperformed relative to benchmarks cited in other literature, our multimodal model returned an accuracy of over 60%. The integration of cognitive scores was particularly effective, boosting prediction confidence and overall accuracy from 61% to 69%. The addition of CSF markers and APOE genotyping further increased the sensitivity for detecting progression to AD, achieving a global balanced accuracy of 71%. This observation would again buttress the diagnostic role of CSF biomarkers and could explain the small improvement from the clinical standard model to the full multimodal model as being a result of large amounts of unavailable CSF data.
We also observed higher sensitivity for faster conversion using all multimodal data compared to unimodal MRI. In fact, our model proved able to sustain above 75% sensitivity and up to 80% during the 35 months before receiving an AD diagnosis and stabilizing at around 30% six years before conversion. However, in the clinical standard scenario, sensitivity dropped faster suggesting that biomarkers indeed drove the gain. Interestingly, we did not observe the same drop in sensitivity when relying on unimodal MRI. We interpret this finding with caution, however, as it is likely due to an increased false-positive rate. Specifically, while the model demonstrated better sensitivity over longer periods, it did not perform as well over shorter periods, indicating that the increased sensitivity might be accompanied by a higher rate of incorrect predictions. In other words, the model identified stable individuals as likely to progress to AD when they were not, thus inflating the sensitivity metric at the cost of specificity. This is a known issue when relying solely on MRI data, for example, as many individuals with MCI exhibit brain atrophy suggestive of AD progression without progression within the expected timeframe80. This discrepancy could explain why the model did not achieve exceptionally high sensitivity at shorter prediction intervals.
Misclassification
Additionally, we also aimed to understand cases in which our model did not accurately classify participants. For example, in UC3 we observed that some sMCI observations were erroneously predicted as pMCI. This observation would seem to challenge the accuracy of our model, these participants exhibited CSF Aβ1-42 levels and other biomarker profiles more closely resembling those of pMCI participants. This is notable given that diagnoses in the ADNI sample were made without reference to biofluid markers81. Previous research suggests that use of only cognitive scores can generate inconsistent results, with low memory scores, for example, being common in older individuals though varying significantly across different populations82,83. Our CMFN analysis shows that our model attributed high importance to CSF markers in its decision making, suggesting that these misclassifications may represent a more fine-grained labeling than that provided by the ADNI data set, rather than poor model performance per se.
Limitations
These interpretations, however, should be considered in the light of several limitations. First, the absence of multimodal neuroimaging in our approach could limit the power of our modelling, particularly given recent landmark studies that have made impressive strides forward in integrating and modelling multimodal imaging data84. Moreover, both Aβ and tau PET could be particularly informative in this regard given the associations of these metrics with cognitive decline and the robust predictive ability of PET-assessed tau accumulation for disease progression85–88. We chose only T1-w, however, first for feasibility, and second to better match the most widely available clinical routines. Second, we applied only cross-sectional predictions. Modeling of biomarkers longitudinally and capturing their multimodal interactions would likely enhance model performance, particularly in UC2 and UC3. We focused on cross-sectional data, however, as longitudinal modeling would unavoidably expand model complexity, thus obscuring interpretability and explainability. Third, CMFN analysis is possibly misleading when models are trained on less informative data, such as with the use of demographics in the early training stages of UC3. This likely resulted in a lack of divergence of initial blocks, inflating the CMFN norm. Hence, caution is needed when associating this metric with “feature importance”. Placing such variables later in the data integration sequence could reveal clearer patterns, suggesting that the order of data integration obscures model interpretability.
Fourth, we observed variability in reported metrics compared to other studies which could be due to data leakage. Leakage can bias evaluation of real-world model performance and recent literature suggests data leakage may explain why DL models achieve exceptional performance on one dataset but fail to replicate on another89. However, it must be noted that we deliberately specified our unimodal baselines to include features that showed less leakage through more robust methods90 or provided results based on independent datasets. Fifth, we did not account for medication use, which has been shown to modulate cognitive responses in ADNI and other similar data91. Future studies aiming to replicate our results should include this information to assess the degree to which medication use may affect model performance. Finally, we did not account for patient race or ethnicity. This is of critical importance in ensuring inclusion of people that are typically underrepresented in clinical research92. To ensure that TelDem is applicable to all patients, future studies should avail of ongoing efforts93 to include patients spanning a spectrum of racial, linguistic, and geographic backgrounds.
Conclusion
DL applications represent an unprecedented opportunity to accelerate dementia research and patient care. Nevertheless, stringent validation of DL-based systems is required. Here, we have deployed, evaluated, and assessed a multimodal DL architecture in the context of three UCs. Our results show that multimodality (i.e. the addition of diverse modalities) significantly improves disease classification, staging, and progression from MCI to AD, over unimodal modeling (the sole use of T1-w imaging). Moreover, model explainability revealed that CSF markers of Aβ contributed heavily to model decision-making, thus further supporting model validity. While additional research including Aβ and tau PET and more diverse patient data are needed, our results take a much-needed step in showing the advantages inherent to implementing DL research and clinical care. In conclusion, our results represent a new horizon for the efficient implementation of personalized treatments in dementia, thus providing a promising platform for enhancing research and patient care.
Data Availability
All used data are publicly available through the https://ida.loni.usc.edu/
*Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
** Data used in the preparation of this article was obtained from the Australian Imaging Biomarkers and Lifestyle flagship study of ageing (AIBL) funded by the Commonwealth Scientific and Industrial Research Organisation (CSIRO) which was made available at the ADNI database (www.loni.usc.edu/ADNI). The AIBL researchers contributed data but did not participate in analysis or writing of this report. AIBL researchers are listed at www.aibl.csiro.au
*** Data used in preparation of this article were obtained from the Frontotemporal Lobar Degeneration Neuroimaging Initiative (FTLDNI) database (4rtni-ftldni.ini.usc.edu). The investigators at NIFD/FTLDNI contributed to the design and implementation of FTLDNI and/or provided data but did not participate in analysis or writing of this report (unless otherwise listed).
Funding
This work was supported by Row Fogo Charitable Trust (grant no. BRO-D.FID3668413, MVH). This study was supported by grants from the German Research Foundation (SCHR 774/5-1 to MLS), and the eHealthSax Initiative of the Sächsische Aufbaubank (Project TelDem). Accordingly, this study was co-financed with tax revenue based on the budget approved by the Saxon state parliament.
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U19 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Author contributions
Conceptualization: G.G., J.R, K.T, M.L.S
Methodology: G.G., J.R, P.G.M., P.E., K.T, M.L.S
Investigation: G.G., J.R, P.E., K.T, M.L.S
Visualization: G.G., K.T., N.S., E.N.M.
Supervision: J.R, K.T., E.N.M., M.L.S
Writing—original draft: G.G., J.R., E.N.M.
Writing—review & editing: All authors
Competing interests
All authors declare they have no competing interests.
Supplementary Materials
3. State of the art (SOTA) - Comparisons to existing literature
To compare our results to existing literature, we collected the balanced accuracies reported by the below studies. When the balanced accuracy score was not provided, we relied on sensitivity and specificity to compute the metric by summing the sensitivity and specificity metrics and dividing number that by 2.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.
- 6.
- 7.
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.
- 87.
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵