ABSTRACT
Background Alzheimer’s disease (AD) exhibits considerable phenotypic heterogeneity, suggesting the potential existence of subtypes. AD is under substantial genetic influence, thus identifying systematic variation in genetic risk may provide insights into disease origins.
Objective We investigated genetic heterogeneity in AD risk through a multi-step analysis.
Methods We performed principal component analysis (PCA) on AD-associated variants in the UK Biobank (AD cases=2,739, controls=5,478) to assess structured genetic heterogeneity. Subsequently, a biclustering algorithm searched for distinct disease-specific genetic signatures among subsets of cases. Replication tests were conducted using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (AD cases=500, controls=470). We categorized a separate set of ADNI individuals with mild cognitive impairment (MCI; n=399) into genetic subtypes and examined cognitive, amyloid, and tau trajectories.
Results PCA revealed three distinct clusters (‘constellations’) driven primarily by different correlation patterns in a region of strong LD surrounding the MAPT locus. Constellations contained a mixture of cases and controls, reflecting disease-relevant but not disease-specific structure. We found two disease-specific biclusters among AD cases. Pathway analysis linked bicluster-associated variants to neuron morphogenesis and outgrowth. Disease-relevant and disease-specific structure replicated in ADNI, and bicluster 2 exhibited increased CSF p-tau and cognitive decline over time.
Conclusions This study unveils a hierarchical structure of AD genetic risk. Disease-relevant constellations may represent haplotype structure that does not increase risk directly but may alter the relative importance of other genetic risk factors. Biclusters may represent distinct AD genetic subtypes. This structure is replicable and relates to differential pathological accumulation and cognitive decline over time.
BACKGROUND
Alzheimer’s disease (AD) diagnostic criteria have evolved over the years, but typically diagnosis has been characterized by predominant amnestic impairment that progressively impacts other cognitive domains and everyday functioning. However, clinical presentation is heterogeneous, and non-amnestic predominant variants of AD, often termed “atypical AD”, were specifically acknowledged in a 2011 update to AD diagnostic criteria and guidelines [1]. Pathological spread and neurodegeneration also tend to proceed in a stereotypical pattern, but mirroring the clinical diversity of AD, AD neuropathological exams and imaging studies have identified marked heterogeneity as well [2–8]. Despite this variability, AD is considered a distinct entity due to the overarching clinicopathologic characteristics observed across individuals even though the etiological basis of AD remains unclear. That is, it may have a unitary origin with diverse presentation, a highly heterogenous etiology that converges on a common disease phenotype, or it may be consistent with some intermediate scenario.
Sporadic AD is under considerable genetic influence, with an estimated heritability of 60-80% [9]. Thus, examining the genetic architecture of AD risk and the myriad ways in which combinations of variants associate with AD provides a useful foundation to understand its complex etiology. Although APOE represents the single largest source of genetic risk for the disease [10] recent GWAS have identified upwards of 75 different risk loci [11–15]. Considering variants that do not reach the level of genome-wide-significance but still suggest an association with AD could provide additional insight above and beyond APOE [16] and genome-wide significant variants. The highly oligo or polygenic nature of AD risk could reflect underlying etiological heterogeneity across individuals. The evolving definition of AD and subsequent debate over its origins further highlights the complex nature of the disease [17–19]. Given this genetic complexity yet commonality among aspects of AD clinical presentations, there is little reason to expect that phenotype-based classifications of sporadic AD (and its subtypes) will cleanly delineate homogeneous subgroups of genetic risk, providing motivation for a using “genotype-first” approach to identifying subtypes [20–22].
Identifying genetic subtypes or heterogeneity typically involves cluster analysis of some sort. There are many clustering algorithms one can exploit, and these techniques can be applied to various sources of data, including gene expression, GWAS summary statistics, and individual genotype data [23–25]. However, clustering approaches to identify genetic subtypes of disease face several difficulties (see Dahl et al. [26] for detailed discussion). First, a fundamental aspect of most clustering algorithms is that they will tend to identify clusters, even if no true clusters are present (i.e., false positive clusters). Second, a common approach is to search for clusters of genes or variants within a group of cases and consider them disease-related when this might not be the case, as it is often unknown whether these same clusters would be found in controls, which could reflect pathways or structures unrelated to disease status. The biclustering method described in Rangan et al. [27] addresses both issues. First, the biclustering method does not assume that a bicluster exists in the data, but rather tests the null hypothesis that one does not exist. Second, the technique searches for subsets of SNPs that express correlations within the cases that are not similarly expressed within the controls, providing evidence for disease-specificity.
Here we investigated genetic heterogeneity of Alzheimer’s disease using a combination of approaches. We first applied a principal component analysis to AD cases and controls from the UK Biobank to identify potential clusters among AD-associated SNPs that may indicate disease-relevant vulnerability across all individuals, regardless of disease status. We then applied a biclustering method to each of these likely vulnerability clusters to search for subsets of cases that harbor distinct genetic signatures that significantly increase AD risk. To validate our findings, we investigate whether the same clusters or patterns of heterogeneity also appear in an independent group of cases and controls from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Taken together, we find a hierarchical structure to an underlying heterogeneity of AD genetic risk, providing further insight into the complex etiology of the disease.
METHODS
Participant characteristics
We used imputed genotyping data from the UK Biobank (UKB) as the discovery dataset (Table 1). The UKB is a large-scale biomedical database and research resource containing genetic, lifestyle and health information from half a million UK participants [28, 29]. Data from 2,739 Alzheimer’s disease cases [30] and 5,478 age- and sex-matched controls with White British ancestry as determined by PCA [29] were included in these analyses.
A replication dataset was obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu; Table 2). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. Our biclustering analyses included genotyping data from 500 individuals with Alzheimer’s disease cases and 470 controls from the ADNI-1 (n=520), ADNI-GO/2 (n=225), and ADNI-3 (n=225). Case-control status was based on the ADNI diagnosis given at each participant’s last visit available. An additional 399 individuals (ADNI-1=166, ADNI-GO/2=175, ADNI-3=58) diagnosed with MCI at their latest visit diagnosed with MCI at their latest visit were included for examination of cognitive and biomarker change. Participants were restricted to those with primarily European ancestry (>80%) as determined by SNPweights [31].
Genotyping data and quality control
Genotyping data imputed to the Haplotype Reference Consortium plus UK10K reference haplotype resource were downloaded from the UKB database along with genetic principal components (hereafter referred to as “genome-wide” principal components). Sample QC information provided by the UKB were used to apply exclusionary criteria. Individuals with excess relatedness were removed. This was defined by the UK Biobank as samples with more than 10 putative third-degree relatives (KING coefficient between 0.0442 and 0.0884). Imputed data used in these analyses included only samples with <10% missingness and biallelic SNPs with >1% minor allele frequency.
Individuals in the ADNI cohorts were genotyped using the following chips: Illumina Human610-Quad BeadChip (ADNI-1), Illumina HumanOmniExpress BeadChip (ADNI-GO/2), and Illumina Infinium Global Screening Array v2 (ADNI-3). Genetic principal components (hereafter referred to as “genome-wide” principal components) were calculated from linkage disequilibrium (LD)-pruned variants in combination with 1000 Genomes data [32] for use as covariates in later analyses. Following standard genotyping QC, imputation was performed on the Michigan Imputation Server (https://imputationserver.sph.umich.edu/) [33] using the 1000 Genomes phase 3 EUR reference panel. Imputed data from all phases were then merged. Imputed data used in this analysis included only samples with <10% missingness and biallelic SNPs with >1% minor allele frequency.
Note that, at this stage we retain all the SNPs for our primary analysis, regardless of the linkage disequilibrium (LD) relationships that might exist between them. The reason we retain all the SNPs is that our heterogeneity analysis (described below) will specifically search for combinations of SNPs that preferentially exhibit correlations across the case-subjects in contrast to the controls. These correlations drive the heterogeneous structures we are trying to find and can themselves be thought of as a form of LD that interacts with the disease. Thus, to carry out our heterogeneity analysis we retain all the relevant SNPs (regardless of LD), and later correct for ‘population-wide’ LD (e.g., SNP-correlations that are not disease-specific, but putatively ancestry-related) within our biclustering analysis below.
Assessing high-level structure with principal components analysis
We restricted analyses to 486,823 variants in the UKB data that were associated with AD with an uncorrected p-value of <0.05 from the AD GWAS by Kunkle et al. [12] to accommodate potential association signals beyond those exhibiting genome-wide significance. After applying MAF (>1%) and genotype missingness filters (<10%), a principal components analysis (PCA) was applied to the allele combinations (dummy-coded according to the 3 possible allele combinations for each) of these AD-associated variants across all UKB cases and controls to assess the presence of high-level structure in the data (i.e., a non-Gaussian distribution). For more details on the choice of allele-coding, see Supplemental Methods section Rationale for allele-coding. The PCA was implemented in C using the normalized power iteration method calculated to a relative error of 1e-6, or single-digit precision. Individuals were assigned to clusters, hereafter referred to as “constellations” to disambiguate from the term “bicluster” used in subsequent steps, by applying the ISO-SPLIT algorithm [34] implemented in MATLAB (https://github.com/flatironinstitute/isosplit5) to participant loadings on the first PC.
Classification was based only on the first PC because it most robustly separated constellations across a range of p-value thresholds. Although the sample was restricted to White British participants, it is possible that these loadings reflect remaining ancestry-related population substructure. We therefore determined whether the resulting structure was specific to the set of AD-associated variants in several ways. First, PCA was applied to sets of variants restricted to p-value thresholds ranging from p<0.05 to p<1.0 in increments of 0.05. Loadings on the first two PCs were plotted at each increment, with individuals labelled according to cluster membership at the p<0.05 threshold to visually assess stability of clustering. Second, to determine whether the resulting structure was simply a function of the number of variants analyzed, a PCA was applied to a randomly chosen set of variants with the same size as those included in the analysis restricted to variants with p<0.05. Third, we examined loadings of allele combinations on the primary PC ordered by genomic position as suggested by Privé et al. [35] to assess whether it was driven by regions of high population-wide LD.
Bicluster analysis of disease-specific structure
Heterogeneity exhibited at the top level of the constellations strongly suggests that there may be other heterogeneous substructures contained within each. This possibility seems all the more likely due to the non-Gaussian distribution of subjects within each constellation, and that the distribution of cases and controls appears to be in different directions across constellations.
To search for biclusters we use the half-loop method described in Rangan et al. [36]. This biclustering strategy (along with similar spectral-biclustering strategies) works best when the combined population of cases and controls is relatively homogenous. Given the clear structure evident from the PCA and the differential distribution of cases relative to controls across constellations, we therefore searched for disease-specific heterogeneity within each constellation separately.
Our biclustering strategy involves an iterative process which starts with all the participants (in this case, all participants in a given constellation) and SNPs, and then sequentially removes AD cases and allele-combinations from consideration. The Supplementary Methods contains a detailed description of this procedure but is summarized here. Briefly, we can measure the fraction of allele combinations that are shared between cases and subtract from this the fraction that are shared with controls to obtain a disease-related signal strength of remaining AD cases and controls. This difference can be thought of as a measure of disease-specific LD within the remaining AD-cases. The subtraction of the control-signal above is a form of “control-correction”. This control-correction is a critical step because it reduces the influence of structure present in both cases and controls on clustering. For example, disease non-specific LD or technical artifacts that are present in the full sample will be controlled for by this step. At each iteration the cases and allele-combinations with the smallest contributions to this signal are removed and the process repeats as described above. Recording this value at each step produces a ‘trace’ which indicates the disease-related signal-strength associated with the remaining AD cases and variants at that iteration.
In pursuing this process, we controlled for the first two principal-components extracted from a GRM of all individuals in our sample (and without thresholding for AD-associated SNPs) to control for ancestry. When calculating these genome-wide principal-components we did correct for LD, as our goal was to estimate the principal-components across the whole population (i.e., assuming homogeneity across subjects).
We can use the peak of the trace to delineate the membership of the dominant bicluster (i.e., which AD cases and allellic-combinations contribute to the disease-specific signal). We identify the peak by finding the internal maximum, ignoring the initial and final iterations that include >95% or <5% of AD cases.
As a null-hypothesis, we assume that the disease-label (i.e., case vs control) is not associated with the genetic profile of each subject. Therefore, we randomly permuted the case- and control-labels across subjects with similar genome-wide principal components (see Rangan et al. [27] for details) and re-calculated the traces as described above. Here, we use 500 permutations. By comparing the original trace with the distribution of traces drawn from the null-hypothesis, we can assign a p-value to the observed trace at each iteration. The null distribution will retain any structure that is uncorrelated with disease label, so provides a second type of correction against identifying clusters driven by non-disease-related LD. In this case, we only assess iterations that include >5% of the cases (i.e., the final iterations with very few cases and/or variants are ignored). To determine whether we have found a statistically significant bicluster within the original data, termed the ‘dominant’ bicluster, we can examine the p-value of the highest peak (pmax) and the average p-value across iterations (pavg). Depending on the structure of the bicluster, the various p-values may be quite different, but each may be a useful metric for identifying significant biclusters. For example, if the original trace has one or more clear peaks, then there are statistically robust ‘cutpoints’ which can be used to delineate bicluster membership and pmax is likely to be very small. On the other hand, if the original trace has a very broad peak or a long plateau, then the bicluster is quite ‘fuzzy’, corresponding to a continuum of membership. In this case, we might expect pavg to be small, but pmax may be relatively large. If the dominant bicluster within a data-set is statistically significant, we can extract it and then search for a secondary bicluster. This is done by scrambling the entries of the submatrix associated with the bicluster (i.e., entries corresponding to the participants and allele-combinations that were retained in the bicluster) and running the search algorithm again [27, 37].
Gene set enrichment analysis of bicluster-associated variants
We conducted a GWAS across AD-associated SNPs comparing bicluster cases with controls belonging to the same disease-relevant constellation in which the bicluster was found (e.g., bicluster 1 cases versus controls for constellation 1) using PLINK2 [38]. Analyses were adjusted for the first two genome-wide principal components. As with the analysis of constellations, variants positively associated with each bicluster (i.e., positive regression coefficient and uncorrected p<0.05) were mapped to genes using the g:Profiler package in R [39]. A gene set enrichment analysis was then applied to annotated genes from each bicluster using the clusterProfiler package in R [40, 41]. The analysis was restricted to gene sets from the Gene Ontology resource [42, 43] containing between 10 and 1,000 genes. The Cytoscape app EnrichmentMap v3.3 [44] was used to visualize gene sets significantly enriched (FDR corrected p<0.05) in each bicluster. We constructed a network in which nodes represented gene sets significantly associated with each bicluster (gene sets can be associated with either or both).
Edges were defined by the proportion of overlapping genes between gene sets (using a minimum overlap threshold of 0.5). Next, the AutoAnnotate app (http://baderlab.org/Software/AutoAnnotate) was used to cluster and annotate nodes based on degree of overlapping genes using the MCL Cluster algorithm.
Validation of genetic heterogeneity in ADNI
We next investigated whether the disease-relevant and disease-specific heterogeneity found in the UKB would generalize to Alzheimer’s disease cases and controls from the ADNI (see Supplementary Methods section Replication of disease-relevant constellations in ADNI for full details). First, we determined the subset of AD-associated SNPs common to both the UKB and ADNI datasets. Mirroring the original analysis, we applied a PCA to all cases and controls across this set of intersecting SNPs. ADNI participants were projected into the same principal components space using the allele combination loadings defined in UKB data, allowing us to assess overlap between datasets. ADNI participants were assigned to the nearest constellation based on participant loadings on the first principal component.
After participants were classified into constellations, we assessed replication of the dominant biclusters separately in the constellations in which they were found (see Supplementary Methods section Replication of disease-specific biclusters in ADNI for full details). Thus, we tested replication of bicluster 1 only among individuals belonging to constellation 1, and tested replication of bicluster 2 only among individuals belonging to constellation 2. For a given bicluster, we calculated the first 2 dominant SNP-wise principal components (using only SNPs in common to both datasets) among the UKB cases belonging to the bicluster. The resulting SNP loadings were used to project UKB participants in the constellation (including controls, bicluster cases, and non-bicluster cases). Similarly, we calculated projections for each of the ADNI participants belonging to a given constellation using these UKB bicluster-defined loadings. If the bicluster structure is present in the ADNI data, then the distribution of cases and controls in this space should be similar between datasets. That is, for a given participant in the testing set, participants from the training set located nearby should tend to have the same label.
We can choose what fraction of the training set participants to compare, denoted as f, and for a given choice of f we calculated the proportion of nearest neighbors in the training set that have the same label (i.e., either case or control) as each of the individuals in the test set. We assessed the quality of each choice of f using a permutation test in which the average “label match” across participants in the test set is compared to a null distribution constructed from label-shuffled data. We ran 500 permutations for these analyses. We calculate the average z-score across the range of f in (flo, fhi), with flo corresponding to the lower end of the 95% confidence-interval for affine-point-matching (i.e., flo∼ 6%) and fhiat 50%. This average is calculated using a normalizing factor to correct for heteroskedasticity [45]. The variance determining the normalizing factor is calculated from the analogous Z-scores obtained after alignment of the projections onto principal-components calculated from randomly selected biclusters (i.e., subsets of cases and allele-combinations) of the same size as the bicluster of interest. We calculated a global empirical p-value by comparing the average Z-score of the observed data across the range of parameter choices of f in (flo, fhi) to the null distribution across the same range. Given that a particular bicluster is globally significant, values of f corresponding to high Z-scores indicate reasonable values of f to use when labelling new data, as described below.
Association of bicluster groups with cognitive and biomarker trajectories
The procedures described above were repeated on the full set of ADNI genotyping data, this time including individuals with MCI in addition to AD cases and controls. This included applying the previously described genotyping filters, projecting ADNI data onto PC space defined by the UKB data and assigning to the nearest constellation, and then projecting individuals belonging to constellations 1 and 2 using the bicluster-defined loadings from biclusters 1 and 2, respectively. The number of nearest neighbors with each label (i.e., bicluster case, non-bicluster case, and controls) was then recorded for each ADNI participant. The fraction of training set participants used as nearest neighbors was determined by the fraction with the highest Z-score in the validation step described above. Each ADNI participant was assigned a soft label, calculated as the proportion of nearest neighbors in the UKB training set that were bicluster cases. To assign individuals to biclusters 1 and 2, k-means clustering with k=2 was applied to the soft labels (i.e., proportions of nearest neighbors that were bicluster cases) of each ADNI participant belonging to constellations 1 and 2 separately.
Phenotypic analyses were restricted to the ADNI MCI participants who were not included in the bicluster validation analysis. We examined change of cognition, amyloid, and phosphorylated tau (p-tau) over time between individuals labelled as bicluster 1, bicluster 2, or non-bicluster. Cognition was assessed using scores on the Preclinical Alzheimer’s Cognitive Composite (PACC) [46, 47]. Amyloid was assessed with florbetapir PET data processed according to previously published methods (http://adni.loni.usc.edu/methods) [48, 49].
Specifically, we downloaded mean standardized uptake value ratios (SUVR) from a set of regions including frontal, temporal, parietal and cingulate cortices using whole cerebellum as a reference region. Cerebrospinal fluid p-tau CSF samples were collected on cohort participants and processed as previously described [50]. CSF p-tau was measured with the fully automated Elecsys immunoassay (Roche Diagnostics) by the ADNI biomarker core (University of Pennsylvania). Florbetapir PET and CSF p-tau were chosen as the measures of amyloid and p-tau because they provided the highest number of visits with relevant data across our MCI participant group.
Group differences between individuals with MCI labelled as bicluster 1, bicluster 2, or non-bicluster were assessed with linear mixed effects models using the lme4 [51] and lmerTest [52] R packages. For each outcome of interest (cognition, amyloid, or p-tau), all timepoints with available data were included. An interaction between age and group was used to assess differences in the trajectories of cognitive and biomarker measures over time. A random intercept was included for participant.
RESULTS
Identification of disease-relevant constellations
The PCA of AD-associated variants that were retained after applying QC filters (n=446,700 SNPs; 1,340,100 allele-combinations) in the UKB data revealed three distinct clusters, or “constellations”, each containing a mixture of AD cases and controls (Figure 1A). The plots in Figure 1B indicate that this structure only emerges when restricting to AD-associated SNPs. When all variants are included in the PCA (e.g., no p-value criterion is used) there is also evidence for substructure, but it is quite different than what is observed when a threshold of p<0.05 is used, with individuals from each of the constellations being highly interspersed. The separation of constellations along the dominant component emerges around p<0.25, and the constellations begin to further separate along the second PC at p<0.05. Results of a PCA applied to a random set of SNPs of the same size as were included at the p<0.05 level again found that individuals from each constellation were interspersed, and the overall structure closely mirrored that seen when all SNPs were included (Supplementary Figure 1). Three similarly distinct clusters emerge if a PCA is applied to additively-coded data as opposed to allele-coding (Supplemental Figure 2).
Plotting the SNP-wise loadings of PC1 revealed strong contributions from variants in the region of chromosome 17q21 (Figure 2). This is a known region of extended LD with complex genomic architecture, including a 900-kb inversion polymorphism surrounding the MAPT gene that defines two haplotyes, H1 and H2, with H1 containing multiple sub-haplotypes [53, 54].
Examining the genotypes of each constellation at the H1/H2 tagging SNP rs8070723, there is a strong (but not perfect) correspondence between constellations and haplotypes. Constellation 1 is largely homozygous for the H1-associated allele, constellation 3 is largely homozygous for the H2-associated allele, and constellation 2 is heterozygous (Supplemental Table 2). Thus, while variation from across the genome contributes to this constellation structure, it is driven primarily by several correlation patterns within this LD block, likely corresponding to H1/H2 haplotype status.
Taken together, these findings suggest that the constellations are disease-relevant because they only emerge as dominant modes of variation when restricting consideration to AD-associated variants, and because the distribution of cases and controls falls along different directions across constellations. However, they are not disease-specific because constellations contain both cases and controls, suggesting cluster membership alone does not discriminate between high and low disease risk. The three constellations were not significantly different on demographic characteristics, including sex, age, education, percent with AD dementia, or number of APOE-e4 alleles (Supplementary Table 1).
Identification of disease-specific biclusters
The stark heterogeneity exhibited at the level of the disease-relevant constellations strongly suggests that there may be other heterogeneous substructures contained within each. This possibility seems all the more likely due to the non-Gaussian distribution of subjects within each constellation. Additionally, the distribution of cases and controls appears to be different across constellations. Examining Figure 1A, it appears that the bias or shift in cases relative to controls does not fall in the same direction across constellations, thus requiring that the search for biclusters be carried out separately for each constellation. The results of our bicluster searches are shown in Figure 3. Each subplot corresponds to the search in a different constellation. Within each subplot the red curve corresponds to the signal-strength of the dominant bicluster within the data, referred to as a ‘trace’ in the Methods section. The black curves indicate the distribution of traces drawn from the null-hypothesis via a permutation-test. Generally speaking, a red trace that has either a high peak or a high average (relative to the distribution of black traces) indicates a statistically significant bicluster. For this data set we detected a statistically significant bicluster in constellation 1 (termed “bicluster 1”; pavg=0.002, pmax=0.008) and constellation 2 (termed bicluster 2; pavg=0.028, pmax=0.244), but not in constellation 3 (pavg=0.484, pmax=0.102), which is the smallest of the 3 constellations. The shape of the traces conveys additional information about the structure of the biclusters. For example, the sharp peaks defining bicluster 1 in Figure 3A indicate clear cut points at which membership can be clearly delineated (i.e., a disease subtype). In contrast, the broader plateau of the trace in Figure 3B reflects a bicluster with “fuzzy” boundaries whereby there is a smoother continuum of membership in the bicluster. This is also reflected in the different pattern of p-values between biclusters (i.e., whether the peak or average of the trace tends to be more significant). Thus, examining the traces informs us whether disease heterogeneity takes the form of distinct subtypes, a continuous spectrum of risk, or some intermediate structure. After delineating and removing the dominant bicluster within constellations 1 and 2 (see Methods) we searched once again for any additional biclusters but did not find a second bicluster that was statistically significant in either constellation (Supplementary Figure 3). However, when searching for these secondary biclusters, the clearly non-Gaussian distribution of traces drawn from the label-shuffled null distribution (visualized as clumped strands among the black traces) suggests there is residual heterogeneity, but we are not powered to identify it in the current sample. The bicluster participants were not significantly different on demographic characteristics, including sex, age, education, or number of APOE-e4 alleles (Supplementary Table 1). Supplementary Figure 4 presents a plot of the constellations shown in Figure 1 with bicluster cases highlighted.
Gene set enrichment results of disease-specific biclusters
Bicluster 1 was enriched for a number of gene sets, including those related to receptor activity, calcium and sodium ion transport, dendritic structure, GTPase activity, and regulation of the MAPK cascade. Bicluster 2 was enriched for gene sets related to the MHC protein complex, regulation of cell size, lipid transport, and tyrosine kinase activity. Both biclusters showed enrichment for cell-projection morphogenesis, and synaptic transmission. Figure 4 shows network plots of gene sets enriched in each bicluster. The gene sets are grouped by proportion of overlapping genes and labelled with dominant terms to illustrate the shared and unique functions associated with variants in each bicluster.
Although many of the enriched gene sets identified in this analysis relate to basic functions and thus do not appear to be AD-specific, it should be noted that the variants considered in these analyses were already selected for their association with AD. That is, the gene sets shown in Figure 4 are those which are enriched in bicluster variants relative to the other AD-associated variants, rather than with respect to all other variants, including AD-nonspecific variants. Pathways that are typically found to be enriched for AD-associated variants constitute a background signal that is likely shared among many of the cases, including those that are not part of either bicluster.
There is also a degree of overlap in the gene sets enriched for each bicluster. However, the proportion of overlap seen at the level of gene sets is reduced when looking at the overlap among constituent genes or SNPs (Supplementary Figure 5). This indicates that the variants associated with each bicluster may have impacts that converge on similar downstream pathways, but the specific perturbations encompassed by each may not be the same.
Disease-relevant constellations are evident in ADNI data
Using only SNPs common to both datasets, we re-calculated the top principal components of AD-associated variants in the UKB data. These loadings were used to project both the UKB and ADNI data into the same PC space. As seen in Figure 5, the grouping of UKB participants into 3 constellations was almost perfectly mirrored in the ADNI data, albeit with a slight shift in positioning. As in the UKB data, each constellation contained a mixture of AD cases and controls, reflecting disease-relevant but not disease-specific clustering.
Disease-specific biclusters replicate in ADNI
Overall, we found evidence for significant replication of bicluster 1 in the ADNI data (global p=0.006), and weaker yet significant replication of bicluster 2 (global p=0.048). This pattern is consistent with the initial bicluster search, in which bicluster 1 seemed to have more distinct boundaries whereas the boundaries of bicluster 2 were more diffuse. The best label similarity ranking in comparison to the null distribution was achieved using 31.6% of the sample as nearest neighbors for bicluster 1 and using 27.3% of the sample as nearest neighbors for bicluster 2. Figure 6 displays the labeling accuracy across a range of nearest neighbors. Note that the shape of these curves can be influenced by the structure of the bicluster. For example, a bicluster with clearly delineated boundaries may be less sensitive to altering the nearest neighbor parameter compared to one with blurrier boundaries. Consistent with this, we see high accuracy of label matching in bicluster 1 across a large range of nearest neighbor fractions, ranging from about 17% up to our maximum threshold of 50%. In contrast, the accuracy of label matching bicluster 2 is more sensitive to this parameter choice, with high accuracy using nearest neighbor fractions from about 26% to 36%. Importantly, these plots provide a range of sensible values one can use to label new data.
Cognitive and biomarker trajectories across bicluster groups in ADNI MCI participants
We found evidence for differential cognitive and biomarker trajectories across bicluster groups in a separate set of ADNI individuals diagnosed with MCI. Individuals assigned to bicluster 2 demonstrated significantly greater decline on the PACC compared to the non-bicluster (β=-0.40, t-value=-2.29, p=0.022) and bicluster 1 groups (β=-0.60, t-value=2.92, p=0.004) (Figure 7A). Bicluster 1 demonstrated a somewhat greater increase in florbetapir over time compared to the non-bicluster group, but this difference was not significant (β=0.32, t-value=1.52, p=0.129) (Figure 7B). Bicluster 2 demonstrated a significantly greater increase of CSF p-tau over time compared to the non-bicluster group (β=0.45, t-value=2.05, p=0.041) (Figure 7C).
DISCUSSION
Genetic analyses suggest the etiology of AD is multifactorial [55], but the extent to which the composition of genetic influences on AD risk varies across individuals and whether it is replicable across different data sets has been unclear. We identified evidence for several subsets of individuals that exhibit different patterns of genetically-mediated vulnerability to the risk for AD. Among AD-associated SNPs, there is clear heterogeneity across individuals, and this heterogeneity seemed to follow a hierarchical structure. Importantly, this structure was observed in two independent datasets, indicating that it is not sample-specific, but rather appears to be a generalizable feature of AD genetic risk.
The first level of heterogeneity emerged as 3 clusters, which we termed “constellations”, from a PCA of the UKB data confined to variants associated with AD based on prior GWAS. Similar analyses (i.e., PCA) are often used to identify ancestry-related population structure [56], but we found that the constellations only emerged when examining AD-associated SNPs and individuals from each constellation were highly intermixed when analyzed using all SNPs or a random selection of SNPs. These results indicate that the constellations do not simply reflect ancestry-related substructure. We further found a spike in loadings on the first PC used to define the constellations within the region of chromosome 17q21.31. This is a region of strong extended LD driven by a 900-kb inversion polymorphism surrounding the MAPT gene that defines two haplotypes, H1 and H2, with the H1 haplotype further dividing into several sub-haplotypes [53, 54]. The MAPT gene codes for tau, the primary component of neurofibrillary tangles, so it is clearly relevant to AD. However, as a risk gene it is more strongly associated with primary tauopathies, and investigations of the MAPT locus and the H1/H2 haplotypes find inconsistent evidence for a specific association with AD risk [54, 57]. A stratified GWAS of AD by Strickland, et al. [58] found that there were several variants with haplotype-dependent associations, which may explain these inconsistencies.
It may be that our constellations at least partially reflect the H1/H2 haplotypes (along with H1 sub-haplotypes) defined by variation in this region. Using the H1/H2 haplotype-tagging SNP rs8070723, Strickland, et al. [58] found that 39% of their sample were H2 carriers (H1H2 + H2H2). Based on the genotypes of individuals across our constellations, individuals in constellations 2 and 3 (who largely have genotypes corresponding to H1H2 and H2H2 carriers, respectively) accounted for an identical 39% in our sample as well. A block of complete LD in this region and minimal recombination may further explain the surprising degree of separation between these constellations [53, 54, 59]. We note an important aspect of these constellations: they contain both cases and controls and therefore membership is not associated with increased AD risk. However, the overall distribution of cases and controls is different in each. Taken together with results from Strickland, et al. [58] showing haplotype-dependent associations, we refer to the constellation structure are “disease-relevant” in that it may alter the relative importance of other risk variants (e.g., by associating with protection or vulnerability to these other risk variants) but is not directly associated with increased AD risk.
Cases belonging to the disease-specific biclusters harbored distinct genetic signatures when compared to other AD cases and may therefore reflect “subtypes” of AD genetic risk. The variants defining the two disease-specific biclusters were enriched for both similar and unique pathways when compared to each other and the disease-relevant constellations in which they were found. It is important to re-emphasize the fact that these analyses were restricted to AD-associated variants, which are enriched for lipid processing, cholesterol transport, amyloid precursor (APP) processing and Aβ formation, tau protein binding, and immune response [12]. The bicluster-associated pathways reported here thus exist within this broader context of other AD-associated pathways. Bicluster 1 was enriched for gene sets involved in regulating the MAPK cascade, which may contribute to AD progression in several cell-type-dependent ways, such as increasing neuroinflammation, promoting neurofibrillary tangle formation, and depressing synaptic plasticity [60]. Bicluster 2 demonstrated enrichment in a cluster of gene sets related to the major histocompatibility complex (MHC), which is consistent with findings that implicate microglia-mediated immune response as a key player in AD [12, 15, 61, 62]. Both biclusters – and bicluster 1 in particular – were enriched for gene sets related to synaptic signaling and signal transduction. Similar enrichment was found among newly prioritized genes in the most recent largescale GWAS of AD [15]. We also found both biclusters were strongly enriched for a number of pathways that interact to influence the morphogenesis and outgrowth of neurons. This includes gene sets related to the cellular components themselves (e.g., cell bodies, dendritic spines, synapses, axonal projections, cystoskeletal components), but also factors that modulate their development such as polymerization, β-catenin binding, and GTPase activity [63–67]. Disruption of these processes may leave structures more vulnerable to insult resulting in, for example, greater Aβ-mediated synapse loss. APP (the precursor of Aβ) is important for regulating axonal and synaptic growth through influencing cytoskeletal remodelling [68] and tau plays a critical role in stabilizing microtubules [69, 70]. It is therefore possible that the hallmark pathologies of AD, i.e., Aβ and tau protein build-up, may exert impacts through toxic effects, but also reflect a loss of function that affects cell structure and growth. Further work is needed to comprehensively characterize the impacts of these bicluster-associated variants.
We found that both the disease-relevant constellation structure and the disease-specific biclusters replicated in the ADNI sample. There are several important differences between the ADNI and UKB cohorts. First, although analyses were restricted to individuals of European ancestry, the UKB was further restricted to a White British subset, so the ADNI data may contain relatively more ancestry-related heterogeneity. Second, the studies used different genotyping chips (and ADNI genotypes were obtained with 3 different chips across phases) and imputation panels. Third, ADNI participants were recruited with the intent of mirroring a clinical trial population focused on Alzheimer’s disease, whereas the UKB attempted to recruit a broader base of individuals to study multiple outcomes. Replicable structure across these datasets is thus unlikely to reflect dataset-specific or artifactual confounds that should differentially affect these datasets.
Supporting the relevance of the genetic structure found here, we found that a separate set of ADNI participants diagnosed with MCI assigned to different bicluster groups demonstrated differential cognitive and biomarker trajectories. Individuals with genetic signatures resembling bicluster 2 exhibited greater accumulation of p-tau and a corresponding steeper decline in cognitive performance over time. The variants associated with this genetic subtype may contribute to a more aggressive form of AD. We previously found that genetic risk affecting different biological pathways can preferentially relate to amyloid or tau accumulation [71]. Thus, it may be that the balance of pathological accumulation is shifted towards p-tau in these individuals, and p-tau has been shown to be more closely linked to subsequent neurodegeneration [72] and cognitive decline [73] compared to amyloid. The genetic heterogeneity observed here may impact not only level but distribution of pathology, which may be particularly important in the case of tau [3, 7, 74]. Heterogeneity in genetic risk may also relate to comorbid conditions that exacerbate AD progression. It is therefore clear that more work is needed to characterize the downstream impacts of genetic heterogeneity on disease outcomes.
Our findings may indicate that there are likely multiple genetically-mediated pathologies underlying AD that converge on common clinical manifestations. This is not a unique phenomenon. For example, Charcot-Marie-Tooth disease involves a common set of clinical symptoms that can arise from separate genetic origins [75]. We did find evidence for overlap in the pathways enriched across the different constellations and biclusters identified in our analysis, and this may indicate where convergence begins. However, there are a few caveats. First, receiving the same clinical diagnosis does not rule out the possibility of meaningful differences across cases. There are atypical forms of clinically-defined AD [76–79] and there is growing evidence for several biological subtypes of AD characterized by distinct patterns of pathological spread or neurodegeneration [2, 3, 5]. Further work is needed to determine whether the genetic heterogeneity identified here is associated with more subtle phenotypic differences. Second, the degree of overlap among pathways that we observed is driven by SNPs that were assigned to genes and pathways, but not all SNPs were assigned to genes.
Many AD-associated SNPs are located in non-coding regions, so the overlap of enriched gene sets ignores substantial numbers of SNPs uniquely associated with each. Differences between constellations and biclusters may arise from the functional effects of the variants themselves, including those not assigned to genes. Third, it is possible that the subsets of cases identified by the biclustering represent misdiagnosed AD (assuming the underlying disease cause is driven by partially distinct genetic factors). Identifying misdiagnoses or undiagnosed cases in a heterogenuous population is a potential use-case of the method. However, the number of cases in each bicluster would require a higher rate of misdiagnosis than one might expect, especially in ADNI where other dementias were specifically excluded.
CONCLUSIONS
In sum, we found evidence of a hierarchical structure underlying heterogeneity in the genetic risk of AD. The disease-relevant constellation structure is driven to a large degree by an extended region of LD on chromosome 17q.21, with constellations potentially reflecting MAPT haplotypes. Membership in a given constellation did not directly increase risk for AD, but may alter the relative importance of other genetic risk variants for AD. On the other hand, the biclusters may be considered disease subtypes with distinct genetic signatures compared to the broader population of AD cases, which may have important implications for treatment efforts. Despite all cases presenting with a common clinical syndrome, the etiology and path taken to clinical manifestation may vary across patients. The differential cognitive and biomarker trajectories between genetic subtypes provides some evidence that this heterogeneity has consequences for downstream disease processes. Identifying subtypes of AD could facilitate precision medicine approaches that tailor treatment strategies to the individual for increased effectiveness.
Data Availability
Data used in preparation of this article were obtained from the UK Biobank (application ID 63648) database at http://www.ukbiobank.ac.uk/ and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database at https://adni.loni.usc.edu/. These data may be obtained upon application to the respective studies and completion of data use and/or material transfer agreements. Code implementing the bicluster search is available at https://github.com/adirangan/lakcluster_matlab. Additional code used in the analysis is available upon request to the corresponding author (J.A.E.).
Funding
This work was supported by National Institute on Aging grants K01 AG063805 (JAE), UH3-AG064706 (NJS), U19 AG023122 (NJS), U19 AG065169-01A1 (NJS) and U24AG078753 (NJS).
Conflict of interest
Jeremy Elman is an Editorial Board Member of this journal but was not involved in the peer-review process of this article nor had access to any information regarding its peer-review. All other authors have no conflict of interest to report.
Acknowledgments
We thank the anonymous reviewers whose insightful comments and suggestions helped improve the manuscript. This research has been conducted using data from UK Biobank, a major biomedical database (http://www.ukbiobank.ac.uk/). Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Footnotes
↵† Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
Revised main text and figures; Revised Supplemental Files
REFERENCES
- [1].↵
- [2].↵
- [3].↵
- [4].
- [5].↵
- [6].
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].
- [14].
- [15].↵
- [16].↵
- [17].↵
- [18].
- [19].↵
- [20].↵
- [21].
- [22].↵
- [23].↵
- [24].
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- 54.↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- 64.
- [65].
- [66].
- [67].↵
- [68].↵
- [69].↵
- [70].↵
- 71.↵
- 72.↵
- [73].↵
- 74.↵
- [75].↵
- [76].↵
- [77].
- [78].
- [79].↵
- [80].↵