ABSTRACT
A large battery of neurocognitive tests are used to detect cognitive impairment and classify early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI), and Alzheimer’s Disease (AD) from cognitive normal (CN). The goal of this study was to develop a deep learning algorithm to identify a few top neurocognitive tests that can accurately classify these four groups. We also derived a simplified risk-stratification score model for classification. Over 100 variables that included neuropsychological/neurocognitive tests, demographics, genetic factors, and blood biomarkers were collected from 383 EMCI, 644 LMCI, 394 AD patients, and 516 cognitive normal from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. A neural network algorithm was trained on data split 90% for training and 10% testing. We evaluated five different feature selection methods. Prediction performance used area under the curve (AUC) of the receiver operating characteristic analysis. The five different feature selection methods consistently yielded the top classifiers to be the Clinical Dementia Rating Scale - Sum of Boxes (CDRSB), Delayed total recall (LDELTOTAL), Modified Preclinical Alzheimer Cognitive Composite with Trails test (mPACCtrailsB), the Modified Preclinical Alzheimer Cognitive Composite with Digit test (mPACCdigit), and Mini-Mental State Examination (MMSE). The best classification model yielded an AUC of 0.984, and the simplified risk-stratification score yielded an AUC of 0.962 on the test dataset. The deep-learning algorithm and simplified risk score derived from our deep-learning algorithm accurately classifies EMCI, LMCI, AD and CN patients using a few commonly available neurocognitive tests.
Introduction
Dementia is a neurodegenerative disease characterized by progressive memory loss as a result of neuronal cell death. More than 47 million people worldwide live with dementia and by 2050 that number is expected to increase to 131 million [1]. The most common type of dementia is Alzheimer’s Disease (AD) and Mild Cognitive Impairment (MCI) is often seen as risk state of progression to AD. The latter can be subdivided into early mild cognitive impairment (EMCI) and late mild cognitive impairment (LMCI), as defined in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database [2]. While there is no cure for dementia, early diagnosis may enable lifestyle changes (such as diet and exercise), neurocognitive enrichment, and therapeutic treatment that may temporarily improve symptoms or slow the rate of decline of symptoms, thereby improving the quality of life [3].
The core clinical criteria for the diagnosis of MCI and AD are neuropsychological tests [4,5]. Cerebrospinal fluid and imaging biomarker tests may in some cases supplement standard clinical tests in specialized clinical settings [6]. A large array of neurocognitive tests are currently used to detect cognitive impairment and classify amongst normal controls (CN), EMCI, LMCI and AD [7-8]. Many studies have identified a few top classifiers using logistic regression and machine learning methods [9–18]. Some studies have also used MRI and genetic data in conjunction with neurocognitive measures for classification [19-20]. However, most of these studies to date performed binary classification (i.e., between CN and AD or CN and MCI) [17,21]. Classifying patients amongst CN, EMCI, LMCI and AD remains challenging.
Deep learning is increasingly being used in medicine, including classification of diseases to aid diagnosis [22-24]. Deep learning uses algorithms to learn the relationship amongst different data elements to inform outcomes. In contrast to traditional analysis methods (such as logistic regression), the specific relationships amongst different input variables with outcome variables do not need to be explicitly specified a priori. Neural networks, for example, are made up of a collection of connected nodes that model the neurons present in a human brain [25]. Each connection, like the synapses in a brain, transmits and receives signals to other nodes. Each node and the connections it forms are initialized with weights which are adjusted throughout training and create mathematical relationships between the input data and the outcomes. Deep learning is well-suited to analyze complex and large datasets where input and output variables cannot be readily parameterized. There are no published studies that employ deep learning to reduce the large array of neurocognitive tests for classifying amongst normal controls (CN), EMCI, LMCI and AD.
The goal of this study was to develop a deep-learning algorithm to identify the top neurocognitive test scores that accurately classify normal control, early MCI, late MCI and Alzheimer’s disease. We also evaluated different feature-selection algorithms. From these findings, we further constructed a simplified risk score model to classify normal, EMCI, LMCI and AD for clinical use.
Methods
Study Population
Data used in this study was obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). Patients were taken from the ADNI1, ADNIGO, ADNI2, and ADNI3 patient sets. Figure 1 shows the flowchart for patient selection. The inclusion criteria were a confirmed diagnosis from screening to the baseline visit and the exclusion criteria were greater than 20% of patient data missing. Of the 1937 participants that met the inclusion criteria, 516 patients were diagnosed as CN, 383 were diagnosed as EMCI, 644 were diagnosed as LMCI, and 394 were diagnosed as AD.
Data Preprocessing
We evaluated about 100 input variables (i.e., test scores, demographic information, and biomarkers). Correlation matrix analysis showed that 47 variables had a correlation coefficient above 0.5 and were determined to be correlated, which merited exclusion from further analysis. In addition, 29 variables were missing in >20% of patients and they were also excluded from analysis. For the rest of the variables, missing data (most of which had < 10% data missing) was imputed with Classification and Regression Trees (CART) using Multivariate Imputation by Chained Equations (MICE) in R, a statistical analysis software (version 4.0.0) [26]. Although regional volumes were available through the FreeSurfer pipeline in ADNI database, >30% were missing and regional volumes were thus not included in the analysis. Only intracranial volume was included because <20% was missing and imputation was used to fill in missing data.
The following neurocognitive tests, demographics, comorbidities and other variables were used in our analysis. The neuropsychological scores included ADAS11, ADAS13, ADASQ4, CDRSB, FAQ, LDELTOTAL, MMSE, RAVLT_forgetting, RAVLT_immediate, RAVLT_learning, RAVLT_percetange_forgetting, TRABSCOR, and mPACCdigit, and mPACCtrailsB. The extracted imaging parameters that were used included intracranial volume (ICV), volume of ventricles, and whole brain volume. APOE4 status was also included. The demographics included age, sex, race, ethnicity, education level. The outcomes were 4 diagnosis classes: AD, LMCI, EMCI, and CN based on comprehensive clinical diagnosis as provided in the dataset.
Neural Network Model
Ranking of feature importance was first conducted among cognitive tests, demographic information, genetic tests, and extracted biomarkers. Five different feature selection methods were utilized to identify the most predictive variables: Information Gain, Boruta Random Forest, Recursive Feature Elimination with the Random Forest Classifier, Logistic Regression with LASSO/L1 regularization, and Permutation Importance in Keras. We used the scikit-learn library for Information Gain, Recursive Feature Elimination, and Logistic Regression analyses. We used the Boruta package in R for Boruta Random Forest. To conduct Permutation Importance analysis, we used a separate neural network that used all the features available rather than just the top few variables for training [27]. This network consisted of 4 layers, 2 fully connected dense layers followed by a dropout layer and another fully connected dense layer. The first two layers consisted of 16 neurons with the ReLU activation function. The dropout layer had a dropout rate of 0.25. The last layer consisted of 4 neurons with the Softmax activation function for multiclass classification. The model was compiled with categorical cross entropy loss with the ADAM optimizer and a learning rate of 0.001 [28]. The top predictors were those that demonstrated statistical significance.
Multi-layer Perceptron (MLP) neural network was constructed with two fully connected dense layers for classification followed by a dropout layer and finally a fully connected dense layer. The first two layers contained eight neurons along with the ReLU activation function. The next layer was a dropout layer with dropout rate of 0.05. The last layer contained four neurons and used the Softmax activation function for multiclass classification. The model was compiled with categorical cross entropy loss with the ADAM optimizer and a learning rate of 0.001. The top predictors extracted from the global feature selection analysis were used as input for the neural network and the output was the diagnosis class. The dataset was split into 90% training data and 10% testing data once before running the neural network. Diagnosis results were categorized by multiclass classification.
Risk Score Model Development
A simplified risk score model was constructed using the top five global variables (ca. cognitive test scores) identified by the different feature selection methods as followed: 1) For each variable, scores were plotted for the four classes of diagnosis and cutoff points were chosen to maximize separation amongst the 4 classes. 2) The cutoff points were then used to construct a point value system for each cognitive test’s score range. This was done by fitting the top cognitive tests in the training dataset against the diagnosis outcome using a Generalized Linear Model (GLM) [29]. 3) The GLM then assigned risk score points for each of the cognitive tests score ranges. A higher number of points for a given score range means that the patient is more likely to have AD. 4) A composite risk score from the sum of the top five variables’ risk scores was constructed for each patient. 5) The risk score model was then tested on an independent testing dataset and evaluated using ROC analysis. Risk scores of the testing dataset were plotted for the 4 classes with interpolation smoothing.
ROC analysis was used to evaluate the performance of the neural network and the risk score model, in which data were split into 90% for training and 10% for testing once. The AUC calculation was binary, in which one class was contrasted with the rest of the classes (one versus rest) and this was repeated for each of the four classes. The sensitivity and specificity reported were taken as an average of the binary sensitivity and specificity of each class. The 95% Confidence Interval (CI) for the AUC was obtained through bootstrapping the neural network’s predictions 1000 times.
Statistical Analysis
Statistical analyses were performed using SPSS v26.Frequencies and percentages for categorical variables between the stages of AD were compared in a pair-wise fashion using χ2 tests. Continuous variables, which were denoted as median (IQR), were first tested for normality with the Lilliefors corrected Kolmogorov-Smirnov test. If they were shown to not have a normal distribution, further comparison was done in a pair-wise fashion between groups using the nonparametric Kruskal-Wallis test. P-values < 0.05 were considered statistically significant.
RESULTS
Table 1 shows the demographic data for CN (n=516), EMCI (n=383), LMCI (n=644), and AD (n=394) groups. Age was not significantly different between groups except between CN and LMCI and between the LMCI and AD groups. Race and ethnicity did not differ significantly between groups. The median education level did not differ significantly between any pair except between AD versus the other classes.
With few exceptions, all neurocognitive test scores and mPACC tests showed pairwise differences between groups. The MRI-extracted parameters were significant between CN and AD and between CN and LMCI, but not significant between the other pairwise comparison. APOE4 was significant different in all pairwise comparisons. Sleep apnea and depression were the only significant comorbidities between groups.
Figure 2 shows the results of the rankings by importance from the 5 feature selection methods performed on the training dataset and Table 2 lists the top 9 features. CDRSB was the most frequently identified top feature amongst the top 5 feature selection methods (5 out of 5), followed by LDELTOTAL (4 out of 5), mPACCdigit (4 out of 5), mPACCtrailsB (4 out of 5), and MMSE (2 out of 5).
Neural Network Model for Classification
Classification was performed using the top 5 features. The performances on the testing data for the 5 methods are summarized in Table 3. AUCs for the Information Gain, Boruta Random Forest, Recursive Feature Elimination with the Random Forest Classifier, Logistic Regression with LASSO/L1 regularization, and Permutation Importance were 0.978, 0.984, 0.983, 0.906 and 0.982, respectively, on the testing dataset. The classifier selected by Boruta Random Forest performed the best in terms of AUC, but the classifier selected by Recursive Feature Elimination performed better in terms of accuracy, sensitivity, and specificity. By comparison, classification using CDRSB, LDELTOTAL, mPACCdigit, mPACCtrailsB, and MMSE individually yielded an AUC of 0.8899, 0.8957, 0.8619, 0.8624, and 0.7808 respectively, on the testing dataset.
Risk Score Model
We then developed a simplified risk score model using the same top 5 variables from our deep-learning analysis. Figure 3 shows an example of the CDRSB cognitive test scores for 4 different classes. The cutoff points that maximized separation between the 4 classes were 0.1, 1.3, and 3.5, between CN and EMCI, between EMCI and LMCI, and between LMCI and AD group, respectively. The point values for individual test score ranges are summarized in Table 4. The composite score from the top 5 cognitive tests (CDRSB, LDELTOTAL, mPACCdigit, mPACCtrailsB, and MMSE) were constructed using a GLM. The classification results on the testing dataset are shown in Figure 4. The risk score system classified the four groups accurately. The performance of the risk score model yielded an AUC of 0.962 [95% CI: 0.945-0.975], sensitivity of 88.06% and specificity of 96.16%, and accuracy of 89.18% for the testing dataset.
Discussion
This study developed a deep-learning algorithm to identify the top neurocognitive test scores that accurately classify normal control, early MCI, late MCI and Alzheimer’s disease. Multiple feature selection methods identified essentially the same set of top variables, providing further corroboration. CDRSB was identified to be a top feature, followed by LDELTOTAL, mPACCdigit, mPACCtrailsB, and MMSE for classification of disease subtypes. Performance indices of the deep-learning model and the simplified score system were highly accurate in classifying the four groups. The best model yielded an AUC of 0.984, and the simplified risk stratification score yielded an AUC of 0.962 for classification on the test dataset. We concluded that only a few neurocognitive tests are needed to accurately classify normal control, early MCI, late MCI and AD.
The top 5 classifiers were all neurocognitive test scores. The CDRSB is rated along 6 domains of functioning, with each domain being rated on a 5-point scale, and the global CDRSB being a function of the scores from these 6 domains [30]. A higher CDRSB indicates more severe impairment. LDELTOTAL measures episodic memory and performance is measured primarily through the amount of a story that is remembered [31]. A lower LDELTOTAL score indicates more severe impairment. The mPACCdigit test measures working memory by asking the patient to repeat back a sequence of digits of increasing length, until they are not able to. The mPACCtrailsB test determines performance of processing speed with a smaller score indicating more severe impairment. Lastly, the MMSE, one of the most clinically used battery tests, is a 30-question questionnaire that is used to screen for dementia and includes tasks that involve registration, recall, and attention. Individually, these tests all perform well in separating AD from CN individuals, but struggle to diagnose MCI subtypes, Indeed, we found classification using all top 5 variables and the derived risk score system outperformed classification using CDRSB, LDELTOTAL, mPACCdigit, mPACCtrailsB, and MMSE individually. Although there are a large battery of neurocognitive tests being used to classify amongst normal controls (CN), EMCI, LMCI and AD, there are no published studies that employ deep learning to reduce the large array of neurocognitive tests to classify amongst normal controls (CN), EMCI, LMCI and AD.
It is interesting to note that intracranial volume (ICV), volume of ventricles, and whole brain volume by MRI were not highly ranked as classifiers of the 4 classes. Volumetric differences could readily distinguish between CN and AD groups but might not readily differentiate between CN and MCI or between MCI subclasses [32]. Other studies investigating hippocampal volume for classifying dementia subtypes showed promise, but it is still challenging for hippocampal volume to accurately classify between CN and MCI patients (Chincarini et al., 2016). We did not include hippocampal volume because it was not readily available in the dataset.
Many studies have previously examined neurocognitive tests and identified a few top neurocognitive classifiers using non-machine learning methods, but they are not discussed here [9,11-13] (see reviews by Abd Razak et al. and De Roeck et al.) By comparison, only few studies utilized (mostly supervised learning) machine learning methods and some of these studies combined neurocognitive tests and MRI regional brain volumes as input variables [10,14-18] (Table 5). So et al. used a two-stage approach to classification with the first stage identifying the most important subsections of the MMSE and the second stage used subsections of the Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) assessment [14]. They achieved up to 97% classification accuracy in Stage 1 with an MLP and 75% classification accuracy in Stage 2 with support vector machine. Lins et al. investigated a Brazilian dataset and utilized gender, age, study time (in years), AD8, MMSE, Clinical Dementia Rating (CDR), and SVFT scores, and two genetic markers (CYP46A1 and APOE4) [15]. They tested the predictive power using the Random Forest, support vector machine, and Stochastic Gradient Boosting classifiers along with an MLP neural network. They achieved a maximum binary classification accuracy between dementia and CN patients of 96% using the CDR and CYP46 features. Stamate et al. identified mPACCdigit, mPACCtrailsB, LDELTOTAL as the top classifiers [10]. They combined these scores with PET and MRI data and achieved an AUC of 0.88 for the binary classification of NC versus dementia. Chiu et al. developed NMD-12, a 12-question questionnaire that was shortened from the original 45 question questionnaire from the HAICDDS project by the Information Gain algorithm [16]. They showed that this test performed better than the commonly used MMSE and MoCA tests with an AUC of 0.94 for discriminating between CN and MCI patients and 0.97 for MCI and dementia patients. Zhu et al. analyzed a Taiwan cohort and ranked the relative importance of neuropsychological tests using Information Gain, Random Forest, and the Relief algorithm [18]. They classified normal, MCI, VMD, and dementia. Their top ranked features were C05 and J03. It is unclear what their features (C05 and J03, etc) meant, but their algorithm had an accuracy of 0.81 using Relief feature selection followed by classification with MLP method. Gill et al. investigated an MRI-based feature and Modified Barthel Index Score (activities of daily living) for binary classification between CN and MCI [17]. They used supervised machine learning and found the AUC to be 0.86.
In sum, our results are comparable or compared favorably with previous studies although comparisons were not made on the same datasets. Our study is novel in that we employed a non-supervised learning (as opposed to supervised learning) method, applied to a large and multi-center ADNI dataset with commonly used measures. We also classified amongst four groups instead of commonly used binary classification in most previous studies (i.e., between normal controls vs AD, or normal controls versus MCI).
This study has several limitations. In this cohort white and non-Hispanic/Latino ethnicities are overrepresented. Some of the comorbidities, such as sleep apnea and depression, showed significant differences between the groups. We also did not include imaging variables. Further evaluation of additional independent datasets, including prospective studies, would improve generalizability of these findings.
Conclusion
This study developed a deep-learning algorithm and a simplified risk score to identify the top neurocognitive test scores to classify normal control, early MCI, late MCI and Alzheimer’s disease. We concluded that only a few neurocognitive tests are needed to accurately classify normal control, early MCI, late MCI and AD. Accurate and early diagnosis may lead to better management of the diseases, including interventions that improve symptoms or slow the rate of decline of symptoms.
Data Availability
All data used in this study is publicly available on the ADNI (Alzheimer's Disease Neuroimaging Initiative website. Due to the public and de-identified nature of the data, an IRB was not needed. This is non-human subject research.
Footnotes
Included tables in manuscript.