Abstract
Normal and pathologic neurobiological processes influence brain morphology in coordinated ways that give rise to patterns of structural covariance (PSC) across brain regions and individuals during brain aging and brain diseases. The genetic underpinnings of these patterns remain largely unknown. We apply a stochastic multivariate factorization method to a diverse population of 50,699 individuals (12 studies, 130 sites) and derive data-driven, multi-scale PSCs of regional brain size. PSCs were significantly correlated with 915 genomic loci in the discovery set, 617 of which are novel, and 72% were independently replicated. Key pathways influencing PSCs involved reelin signaling, apoptosis, neurogenesis, and appendage development, while pathways of breast cancer indicate potential interplays between brain metastasis and PSCs associated with neurodegeneration and dementia. Using machine learning, multi-scale PSCs effectively derive imaging signatures of several brain diseases. Our results elucidate new genetic and biological underpinnings that influence structural covariance patterns in the human brain.
Main
Brain structure and function are interrelated via complex networks that operate at multiple scales, ranging from cellular and synaptic processes, such as neural migration, synapse formation, and axon development, to local and broadly connected circuits.1 Due to a fundamental relationship between activity and structure, many normal and pathologic neurobiological processes, driven by genetic and environmental factors, collectively cause coordinated changes in brain morphology. Structural covariance analyses investigate such coordinated changes by seeking patterns of structural covariation (PSC) across brain regions and individuals.1 For example, during adolescence, PSCs derived from magnetic resonance imaging (MRI) have been considered to reflect a coordinated cortical remodeling as the brain establishes mature networks of functional specialization.2 Structural covariance is not only related to normal brain development or aging processes but can also reflect coordinated brain change due to disease. For example, individuals with motor speech dysfunction may develop brain atrophy in Broca’s inferior frontal cortex and co-occurring brain atrophy in Wernicke’s area of the superior temporal cortex.3 See Fig. 1C for an example.
The human brain develops, matures, and degenerates in coordinated patterns of structural covariance at the macrostructural level of brain morphology.1 However, the mechanisms underlying structural covariance are still unclear, and their genetic underpinnings are largely unknown. We hypothesized that brain morphology was driven by multiple genes (i.e., polygenic) collectively operating on different brain areas (i.e., pleiotropic), resulting in connected networks covaried by normal aging and various disease-related processes. Along the causal pathway from underlying genetics to brain morphological changes, we sought to elucidate which genetic underpinnings (e.g., genes), biological processes (e.g., neurogenesis), cellular components (e.g., nuclear membrane), molecular functions (e.g., nucleic acid binding), and neuropathological processes (e.g., Alzheimer’s disease) might influence the formation, development, and changes of structural covariance patterns in the human brain.
Previous neuroimaging genome-wide association studies (GWAS)4, 5 have partially investigated the abovementioned questions and expanded our understanding of the genetic architecture of the human brain. However, their focus was on conventional neuroanatomical regions of interest (ROI) instead of data-driven PSCs. In brain imaging research, prior studies have applied structural covariance analysis to elucidate underlying coordinated morphological changes in brain aging and various brain diseases,1 but have had several limitations. They often relied on pre-defined neuroanatomical ROIs to construct inter- and intra-individual structural covariance networks. These a priori ROIs might not optimally reflect the molecular-functional characteristics of the brain. In addition, most population-based studies have investigated brain structural covariance within a relatively limited scope, such as within relatively small samples, over a relatively narrow age window (e.g., adolescence2), within a single disease (e.g., Parkinson’s disease6), or within datasets lacking sufficient diversity in cohort characteristics or MRI scanner protocols. These have been imposed, in part, by limitations in both available cohort size and in the algorithmic implementation of structural covariance analysis, which has been computationally restricted to modest sample sizes when investigated at full image resolution. Lastly, prior studies have examined brain structural covariance at a single fixed ROI resolution/scale/granularity. While the optimal scale is unknown and may differ by the question of interest, the highly complex organization of the human brain may demonstrate structural covariance patterns that span multiple scales.7, 8
We examined structural covariance of regional cortical and subcortical volume in the human brain using MRI from a diverse population of 50,699 people from 12 studies, 130 sites, and 12 countries, comprised of cognitively healthy individuals, as well as participants with various diseases/conditions over their lifespan (ages 5 through 97). In order to enable such a large-scale study, we developed a stochastic adaptation of the orthogonally projective non-negative matrix factorization9 to derive multi-scale PSCs. This method allowed us to derive PSCs at any desirable scale, which is defined by the number, C, of derived PSCs. Herein we present results from coarse to fine scales corresponding to C = 32, 64, 128, 256, 512, and 1024. We hypothesized that PSCs at multiple scales could delineate the human brain’s multi-factorial and multi-faceted morphological landscape and genetic architecture in healthy and diseased individuals. We examined the associations between these multi-scale PSCs and common genetic variants at different levels (N=8,469,833 SNPs). In total, 617 novel genomic loci were identified; key pathways (e.g., neurogenesis and reelin signaling) contributed to shaping structural covariance patterns in the human brain. In addition, we leveraged PSCs at multiple scales to better derive individualized imaging signatures of several diseases than any single scale PSCs using machine learning. All experimental results and the multi-scale PSCs were integrated into the MuSIC (Multi-scale Structural Imaging Covariance) atlas and made publicly accessible through the BRIDGEPORT (BRaIn knowleDGE PORTal) web portal: https://www.cbica.upenn.edu/bridgeport/.
Results
We summarize this work in three units (I to III) outlined in Fig. 1. In Unit I (Fig. 1A), we present the stochastic orthogonally projective non-negative matrix factorization (sopNMF) algorithm (Method 1), optimized for large-scale multivariate structural covariance analysis. The sopNMF algorithm decomposes large-scale imaging data through online learning to overcome the memory limitations of opNMF. A subgroup of participants with multiple disease diagnoses and healthy controls (ages 5-97, training population, N=4000, Method 2) were sampled from the discovery set (N=32,440, Method 2); their MRI underwent a standard imaging processing pipeline (Method 3A). The processed images were then fit to sopNMF to derive the multi-scale PSCs (N=2003) from the loadings of the factorization (Method 1). We incorporate participants with various disease conditions because previous studies have demonstrated that inter-regional correlated patterns (i.e., depicting a network) show variations in healthy and diseased populations, albeit to a differing degree.10 Multi-scale PSCs were extracted across the entire population and statistically harmonized11 (Method 3B). Unit II (Fig. 1B) investigates the harmonized data for 2003 PSCs (13 PSCs have vanished in this process for C=1024; see Method 1) in two brain structural covariance analyses. Specifically, we performed i) GWAS (Method 4) that sought to discover associations of PSCs at single nucleotide polymorphism (SNP), gene, or gene set-level; and ii) pattern analysis via machine learning (Method 5) to derive individualized imaging signatures of several brain diseases and conditions. Unit III (Fig. 1C) presents BRIDGEPORT, making these massive analytic resources publicly available to the imaging, genomics, and machine learning communities.
Patterns of structural covariance via stochastic orthogonally projective non-negative matrix factorization
We first validated the sopNMF algorithm by showing that it converged to the global minimum of the factorization problem using the comparison population (N=800, Method 2). The sopNMF algorithm achieved similar reconstruction loss and sparsity as opNMF but at reduced memory demand (Supplementary eFigure 1). The lower memory requirements of sopNMF made it possible to generate the multi-scale PSCs by jointly factorizing 4000 MRIs in the training population. The results of the algorithm were robust and obtained a high reproducibility index (RI) (Supplementary eMethod 2) in reproducibility analyses: split-sample analysis (RI = 0.76±0.27) and split-sex analysis (RI = 0.79±0.27) (Supplementary eFigure 2). We then extracted the multi-scale PSCs in the discovery set (N=32,440) and the replication set (N=18,259, Method 2) for Unit II. These PSCs succinctly capture underlying neurobiological processes across the lifespan, including the effects of typical aging processes and various brain diseases. In addition, the multi-scale representation constructs a hierarchy of brain structure networks (e.g., PSCs in cerebellum regions), which models the human brain in a multi-scale topology.7, 12
Patterns of structural covariance are highly heritable
The multi-scale PSCs are highly heritable (0.05< h2 <0.78), showing high SNP-based heritability estimates (h2) (Method 4B) for the discovery set (Fig. 2). Specifically, the h2 estimate was 0.49±0.10, 0.39±0.14, 0.29±0.15, 0.25±0.15, 0.27±0.15, 0.31±0.15 for scales C=32, 64, 128, 256, 512 and 1024 of the PSCs, respectively. The Pearson correlation coefficient between the two independent estimates of h2 was r = 0.94 (p-value < 10-6, between the discovery and replication sets) in the UK Biobank (UKBB) data. The scatter plot of the two sets of h2 estimates is shown in Supplementary eFigure 3. The h2 estimates and p-values for all PSCs are detailed in Supplementary eFile 1 (discovery set) and eFile 2 (replication set). Our results confirm that brain structure is heritable to a large extent and identify the spatial distribution of the most highly heritable regions of the brain (e.g., subcortical gray matter structures and cerebellum regions).13
617 novel genomic loci of patterns of structural covariance
We discovered genomic locus-PSC pairwise associations (Method 4C, Supplementary eMethod 5) within the discovery set and then independently replicated these associations on the replication set. We found that 915 genomic loci had 3791 loci-PSC pairwise significant associations with 924 PSCs after Bonferroni correction (Method 4G) for the number of PSCs (p-value threshold per scale: 10.3 > -log10[p-value] > 8.8) (Supplementary eFile 3, and Fig. 3A). Our results showed that the formation of these PSCs is largely polygenic; the associated SNPs might play a pleiotropic role in shaping these networks.
Compared to previous literature, out of the 915 genomic loci, the multi-scale PSCs identified 617 novel genomic loci not previously associated with any traits or phenotypes in the GWAS Catalog14 (Supplementary eFile 4, Figure 3B). These novel associations might indicate subtle neurobiological processes that are captured thanks to the biologically relevant structural covariance expressed by sopNMF. The multi-scale PSCs identified many novel associations by constraining this comparison to previous neuroimaging GWAS12, 13 using T1w MRI-derived phenotypes (e.g., regions of interest from conventional brain atlases) (Fig 3B, Supplementary eTable 3, eFile 5, 6, and 7).
To replicate these genomic loci, our UKBB replication set analysis (Method 4H) demonstrated that 3638 (96%) exact genomic locus-PSC associations were replicated at nominal significance (-log10[p-value] > 1.31), 2705 (72%) of which were significant after correction for multiple comparisons (Method 4G, -log10[p-value] > 4.27). We present this validation in Supplementary eFile 8 from the replication set. The summary statistics, Manhattan, and QQ plots derived from the combined population (N=33,541) are presented in BRIDGEPORT.
Gene set enrichment analysis highlights pathways that shape patterns of structural covariance
For gene-level associations (Method 4D), we discovered that 164 genes had 2489 gene-PSC pairwise associations with 445 PSCs after Bonferroni correction for the number of genes and PSCs (p-value threshold: 8.6 > -log10[p-value] > 7.1) (Supplementary eFile 9).
Based on these gene-level p-values, we performed hypothesis-free gene set pathway analysis using MAGMA15(Method 4E): a more stringent correction for multiple comparisons was performed than the prioritized gene set enrichment analysis using GENE2FUN from FUMA (Method 4F, Fig. 4). We identified that six gene set pathways had 18 gene set-PSC pairwise associations with 17 PSCs after Bonferroni correction for the number of gene sets and PSCs (N=16,768 and C from 32 to 1024, p-value threshold: 8.54 > -log10[p-value] > 7.03) (Fig. 3C, Supplementary eFile 10). These gene sets imply critical biological and molecular pathways that might shape brain morphological changes and development. The reelin signaling pathway exerts vital functions in regulating neuronal migration, dendritic growth, branching, spine formation, synaptogenesis, and synaptic plasticity.16 The appendage morphogenesis and development pathways indicate how the anatomical structures of appendages are generated, organized and progressed over time, which are often related to the cell adhesion pathway. These pathways elucidate how cells or tissues can be organized to create a complex structure like the human brain.17 In addition, the integral component of the cytoplasmic side of the endoplasmic reticulum membrane is thought to form a continuous network of tubules and cisternae extending throughout neuronal dendrites and axons.18 The DSCAM (Down syndrome cell adhesion molecule) pathway likely functions as a cell surface receptor mediating axon pathfinding. Related proteins are involved in hemophilic intercellular interactions.19 Lastly, Nikolsky et al.20 defined genes from the breast cancer 20Q11 amplicon pathway were involved in the brain might indicate the brain metastasis of breast cancer, which is usually a late event with deleterious effects on the prognosis.21 In addition, previous findings22, 23 revealed an inverse relationship between Alzheimer’s disease and breast cancer, which might indicate a close genetic relationship between the disease and brain morphological changes mainly affecting the entorhinal cortex and hippocampus (PSC: C128_3 in Fig. 4).
Illustrations of genetic loci and pathways forming two patterns of structural covariance
To illustrate how underlying genetic underpinnings might form a specific PSC, we showcased two PSCs: C32_4 for the superior cerebellum and C128_3 for the hippocampus-entorhinal cortex. The two PSCs were highly heritable and polygenic in our GWAS using the entire UKBB data (Fig. 4, N=33,541). We used the FUMA24 online platform to perform SNP2GENE for annotating the mapped genes and GENE2FUNC for prioritized gene set enrichment analyses (Method 4F). The superior cerebellum PSC was associated with genomic loci that can be mapped to 85 genes, which were enriched in many biological pathways, including psychiatric disorders, biological processes, molecular functions, and cellular components (e.g., apoptotic process, axon development, cellular morphogenesis, neurogenesis, and neuro differentiation). For example, apoptosis – the regulated cell destruction – is a complicated process that is highly involved in the development and maturation of the human brain and neurodegenerative diseases.25 Neurogenesis – new neuron formation – is crucial when an embryo develops and continues in specific brain regions throughout the lifespan.26 All significant results of this prioritized gene set enrichment analysis are presented in Supplementary eFile 11.
For the hippocampus-entorhinal cortex PSC, we mapped 45 genes enriched in gene sets defined from GWAS Catalog, including Alzheimer’s disease and brain volume derived from hippocampal regions. The hippocampus and medial temporal lobe have been robust hallmarks of Alzheimer’s disease.27 In addition, these genes were enriched in the breast cancer 20Q11 amplicon pathway20 and the pathway of metastatic breast cancer tumors28, which might indicate a specific distribution of brain metastases: the vulnerability of medial temporal lobe regions to breast cancer, 21 or highlight an inverse association between Alzheimer’s disease and breast cancer.22 Lastly, the nuclear membrane encloses the cell’s nucleus – the chromosomes reside inside – which is critical in cell formation activities related to gene expression and regulation. To further support the overlapping genetic underpinnings between this PSC and Alzheimer’s diseases, we calculated the genetic correlation (rg = −0.28; p-value=0.01) using GWAS summary statistics from the hippocampus-entorhinal cortex PSC (i.e., 33,541 people of European ancestry) and a previous independent study of Alzheimer’s disease29 (i.e., 63,926 people of European ancestry) using LDSC.30 All significant results of this prioritized gene set enrichment analysis are presented in Supplementary eFile 12.
Multi-scale patterns of structural covariance derive disease-related imaging signatures
The diversity of the population we used to derive the PSCs allowed us to derive PSCs that reflect brain development, aging, and the effects of several brain diseases. We investigate the added value of the multi-scale PSCs as building blocks of imaging signatures for several brain diseases and risk conditions using machine learning methods (herein, we opted for linear support vector machines (SVM)) (Method 5).31 The aim is to harness machine learning to drive a clinically interpretable metric for quantifying an individual-level risk to each disease category. To this end, we define the signatures as SPARE-X (Spatial PAtterns for REcognition) indices, where X is the disease. For instance, SPARE-AD captures the degree of expression of an imaging signature of AD-related brain atrophy, which has been shown to offer diagnostic and prognostic value in prior studies.32
In our samples, the most discriminative indices were SPARE-AD and SPARE-MCI (Fig. 5, Supplementary eTable 4, eFigure 4). C=1024 achieved the best performance for the single-scale analysis (e.g., AD vs. controls; balanced accuracy: 0.90±0.02; Cohen’s d: 2.50). Multi-scale representations derived imaging signatures that showed the largest effect sizes to classify the patients from the controls (Fig. 5) (e.g., AD vs. controls; balanced accuracy: 0.92±0.02; Cohen’s d: 2.61). PSCs obtained better classification performance than both AAL (e.g., AD vs. controls; balanced accuracy: 0.82±0.02; Cohen’s d: 1.81) and voxel-wise regional volumetric maps (RAVENS)33 (e.g., AD vs. controls; balanced accuracy: 0.85±0.02; Cohen’s d: 2.04) (Supplementary eTable 4 and eFigure 6). Our classification results were higher than previous baseline studies,34, 35 which provided an open-source framework to objectively and reproducibly evaluate AD classification using machine learning. Using the same cross-validation procedure and evaluation metric, they reported the highest balanced accuracy of 0.87±0.02 to classify AD from healthy controls. Notably, our machine learning experiments followed good practices, employed rigorous cross-validation procedures, and avoided critical methodological flaws, such as data leakage or double-dipping (see critical reviews on this topic elsewhere34, 36).
BRIDGEPORT: bridging knowledge across patterns of structural covariance, genomics, and clinical phenotypes
We integrated our experimental results and the MuSIC atlas into the BRIDGEPORT online web portal. This online tool allows researchers to interactively browse the MuSIC atlas in 3D, query our experimental results via variants or PSCs, and download the GWAS summary statistics for further analyses. In addition, we allow users to search via conventional brain anatomical terms (e.g., the right thalamus proper) by automatically annotating traditional anatomic atlas ROIs, specifically from the MUSE atlas37 (Supplementary eTable 5), to MuSIC PSCs based on their degree of overlaps (Supplementary eFigure 5). Open-source software dedicated to image processing,37 genetic quality check protocols, MuSIC generation with sopNMF, and machine learning34 is also publicly available (see Code Availability for details).
Discussion
The current study investigates patterns of structural covariance in the human brain at multiple scales from a large population of 50,699 people and, importantly, a very diverse cohort allowing us to capture patterns of structural covariance emanating from normal and abnormal brain development and aging, as well as from several brain diseases. Through extensive examination of the genetic architecture of these multi-scale PSCs, we confirmed genetic hits from previous T1-weighted MRI GWAS and, more importantly, identified 617 novel genomic loci and molecular and biological pathways that collectively influence brain morphological changes and development over the lifespan. Using a hypothesis-free, data-driven approach, we elucidated that underlying genetics and biological pathways can form multi-scale structural covariance patterns, which can be further used as building blocks to predict various diseases. All experimental results and code are encapsulated and publicly available in BRIDGEPORT for dissemination: https://www.cbica.upenn.edu/bridgeport/, in order to enable various neuroscience studies to investigate these patterns of structural covariance in diverse contexts. Together, the current study highlighted the adoption of machine learning methods in brain imaging genomics and deepened our understanding of the genetic architecture of the human brain.
Our findings reveal new insights into genetic underpinnings that influence patterns of structural covariance in the human brain. Brain morphological development and changes are largely polygenic and heritable, and previous neuroimaging GWAS has not fully uncovered this genetic landscape. In contrast, genetic variants, as well as environmental, aging, and disease effects, exert pleiotropic effects in shaping morphological changes in different brain regions through specific biological pathways. The mechanisms underlying brain structural covariance are not yet fully understood. They may involve an interplay between common underlying genetic factors, shared susceptibility to aging, and various brain pathologies, which affect brain growth or degeneration in coordinated brain morphological changes.1 Our data-driven, multi-scale PSCs identify the hierarchical structure of the brain under the principle of structural covariance and are associated with genetic factors at different levels, including SNPs, genes, and gene set pathways. These 617 novel genomic loci, as well as those previously identified, collectively shape brain morphological changes through many key biological and molecular pathways. These pathways are widely involved in reelin signaling, apoptotic processes, axonal development, cellular morphogenesis, neurogenesis, and neuro differentiation,25, 26 which may collectively influence the formation of structural covariance patterns in the brain. Strikingly, pathways involved in breast cancer shared overlapping genetic underpinnings evidenced in our MAGMA-based and prioritized (GENE2FUNC) gene set enrichment analyses (Fig. 3C and Fig. 4), which included specific pathways involved in breast cancer and metastatic breast cancer tumors. One previous study showed that common genes might mediate breast cancer metastasis to the brain,21 and a later study further corroborated that the metastatic spread of breast cancer to other organs (including the brain) accelerated during sleep in both mouse and human models.38 We further showcased that this brain metastasis of breast cancer might be associated with specific neuropathologic processes, which were captured by PSCs data driven by Alzheimer’s disease-related neuropathology. For example, the hippocampus-entorhinal cortex PSC (C128_3, Fig. 4) connected the bilateral hippocampus and medial temporal lobe – the salient hallmark of Alzheimer’s disease. Our gene set enrichment analysis results further support this claim: the genes were enriched in the gene sets of Alzheimer’s disease and breast cancer (Fig. 4). Previous research22, 23 also found an inverse association between Alzheimer’s disease and breast cancer. In addition, PSCs from the cerebellum were the most genetically influenced brain regions, consistent with previous neuroimaging GWAS.4, 5 The cerebral cortex has been thought to largely contribute to the unique mental abilities of humans. However, the cerebellum may also be associated with a much more comprehensive range of complex cognitive functions and brain diseases than initially thought.39 Our results confirmed that many genetic substrates might support different molecular pathways, resulting in the cerebellar functional organization, high-order functions, and dysfunctions in various brain disorders.
The current work demonstrates that appropriate machine learning analytics can be used to shed new light on brain imaging genetics. Previous neuroimaging GWAS leveraged multimodal imaging-derived phenotypes from conventional brain atlases4, 5 (e.g., ROIs from the AAL atlas). In contrast, multi-scale PSCs are purely data-driven and likely to reflect the dynamics of underlying normal and pathological neurobiological processes giving rise to structural covariance. The diverse training sample from which the PSCs were derived, including healthy and diseased individuals of a wide age range, enriched the diversity of such neurobiological processes influencing the PSCs. In addition, modeling structural covariance at multiple scales (i.e., multi-scale PSCs) indicated that disease effects could be robustly and complementarily identified across scales (Fig. 6), concordant with the paradigm of multi-scale modeling of the brain.12 Imaging signatures of brain diseases, derived via supervised machine learning models, were consistently more distinctive when formed from multi-scale PSCs compared to single-scale PSCs.
MuSIC – with the strengths of being data-driven, multi-scale, and disease-effect informative – contributes to the century-old quest for a “universal” atlas in brain cartography40 and is highly complementary to previously proposed brain atlases. For instance, Chen and colleagues41 used a semi-automated fuzzy clustering technique with MRI data from 406 twins and parcellated the cortical surface area into a genetic covariance-informative brain atlas; MuSIC was data-driven by structural covariance. Glasser and colleagues42 adopted a semi-automated parcellation procedure to create a multimodal cortex atlas from 210 healthy individuals. Although this method was successful in integrating multimodal information from cortical folding, myelination, and functional connectivity, this semi-automatic approach requires significant resources, some with limited resolution. MuSIC allows flexible, multiple scales for delineating macroscopic brain topology; the inclusion of patient samples exposes the model to sources of variability that may not be visible in healthy controls. Another pioneering endeavor is the Allen Brain Atlas project,43 whose overarching goals of mapping the human brain to gene expression data via existing conventional atlases, identifying local gene expression patterns across the brain in a few individuals, and deepening our understanding of the human brain’s differential genetic architecture, are complementary to ours – characterizing the global genetic architecture of the human brain, emphasizing pathogenic variability and morphological heterogeneity.
Bridging knowledge across the brain imaging, genomics, and machine learning communities is another pivotal contribution of this work. BRIDGEPORT provides a platform to lower the entry barrier for whole-brain genetic-structural analyses, foster interdisciplinary communication, and advocate for research reproducibility.34, 44–47 The current study demonstrates the broad applicability of this large-scale, multi-omics platform across a spectrum of neurodegenerative and neuropsychiatric diseases.
Methods
Method 1: Structural covariance patterns via stochastic orthogonally projective non-negative matrix factorization
The sopNMF algorithm is a stochastic approximation built and extended based on opNMF9, 48. We consider a dataset of n MR images and d voxels per image. We represent the data as a matrix X where each column corresponds to a flattened image: . The sopNMF algorithm factorizes X into two low-rank (r) matrices and under the constraints of non-negativity and column-orthonormality. Using the Frobenius norm, the loss of this factorization problem can be formulated as where I stands for the identity matrix. The columns wi ∈ ℝd, ‖wi‖2 = 1, ∀ i ∈ {1. r} of the so-called component matrix W = [w1, w2, … , wr] are part-based representations promoting sparsity in data in this lower-dimensional subspace. From this perspective, the loading coefficient matrix H represents the importance (weights) of each of the features above for a given image. Instead of optimizing the non-convex problem in a batch learning paradigm (i.e., reading all images into memory) as opNMF,9 sopNMF subsamples the number of images at each iteration, thereby significantly reducing its memory demand, by randomly drawing data batches of b ≤ n images (b is the batch size); this is done without replacement so that all data goes through the model once (⌈n/b⌉). In this case, the updating rule can be rewritten as We calculate the loss on the entire dataset at the end of each epoch (i.e., the loss is incremental across all batches) with the following expression We evaluated the training loss and the sparsity of W at the end of each iteration. Moreover, early stopping was implemented to improve training efficiency and alleviate overfitting. We summarize the sopNMF algorithm in Supplementary Algorithm 1. An empirical comparison between sopNMF and opNMF is detailed in Supplementary eMethod 1.
We applied sopNMF to the training population (N=4000). After the algorithm converged, the component matrix W was sparse and the loading coefficient matrix was used as the multi-scale PSCs. To build the MuSIC atlas, we clustered each voxel (row-wise) into one of the r features/PSCs as follows: where M is a d-dimensional vector and j ∈ {1. d}. The j-th element of M equals k if Wj,k is the maximum value of the j-th row. We finally projected the vector into the original image space to visualize each PSC of the MuSIC atlas (Fig. 1). Of note, 13 PSCs have vanished in this process for C=1024: all 0 for these 13 vectors.
Method 2: Study population
We consolidated a large-scale multimodal consortium (N=50,699) consisting of imaging, cognition, and genetic data from 12 studies, 130 sites, and 12 countries (Supplementary eTable 1): the Alzheimer’s Disease Neuroimaging Initiative49 (ADNI) (N=1765); the UK Biobank50 (UKBB) (N=39,564); the Australian Imaging, Biomarker, and Lifestyle study of aging51 (AIBL) (N=830); the Biomarkers of Cognitive Decline Among Normal Individuals in the Johns Hopkins University52 (BIOCARD) (N=288); the Baltimore Longitudinal Study of Aging53, 54 (BLSA) (N=1114); the Coronary Artery Risk Development in Young Adults55 (CARDIA) (N=892); the Open Access Series of Imaging Studies56 (OASIS) (N=983), PENN (N=807); the Women’s Health Initiative Memory Study57 (WHIMS) (N=995), the Wisconsin Registry for Alzheimer’s Prevention58 (WRAP) (N=116); the Psychosis Heterogeneity (evaluated) via dimEnsional NeurOiMaging59 (PHENOM) (N=2125); and the Autism Brain Imaging Data Exchange60 (ABIDE) (N=1220). All studies were approved by the corresponding Institutional Review Boards. Each participant consented to be part of the imaging, cognition, and/or genetic biobanks. Data from the UKBB for this project pertains to application 35148.
We present the demographic information of the population under study in Supplementary eTable 1. This large-scale consortium reflects the diversity of MRI scans over different races, disease conditions, and ages over the lifespan. To be concise, we defined four populations or data sets per analysis across the paper:
Discovery set: It consists of a multi-disease and lifespan population that includes participants from all 12 studies (N=32,440). Note that this population does not contain the entire UKBB population but only our first download (July 2017, N=21,305).
Replication set: We held out 18,259 participants from the UKBB dataset to replicate the GWAS results. We took these data from our second download of the UKBB dataset (November 2021, N=18,259).
Training population: We randomly drew 250 patients (PT), including AD, MCI, SCZ, ASD, MDD, HTN (hypertension), DM (diabetes mellitus), and 250 healthy controls (CN) per decade from the discovery set, ensuring that the PT and CN groups have similar sex, study and age distributions. The resulting set of 4000 imaging data was used to generate the MuSIC atlas with the sopNMF algorithm. The rationale is to maximize variability across a balanced sample of multiple diseases or risk conditions, age, and study protocols rather than overfit the entire data by including all images in training.
Comparison population: To validate sopNMF by comparison to the original opNMF algorithm, we randomly subsampled 800 participants from the training population (100 per decade for balanced CN and PT). For this scale of sample size, opNMF can load all images into memory for batch learning.61
Method 3: Image processing and statistical harmonization
(A) : Image processing. Images that passed the quality check (Supplementary eMethod 4) were first corrected for magnetic field intensity inhomogeneity.62 Voxel-wise regional volumetric maps (RAVENS)33 for each tissue volume were then generated by using a registration method to spatially align the skull-stripped images to a template in MNI-space.63 We applied sopNMF to the RAVENS maps to derive MuSIC.
(B) : Statistical harmonization of MuSIC PSCs: We applied MuSIC to the entire population (N=50,699) to extract the multi-scale PSCs. Specifically, MuSIC was applied to each individual’s RAVENS gray matter map to extract the sum of brain volume in each PSC. Subsequently, the PSCs were statistically harmonized by an extensively validated approach, i.e., ComBat-GAM 11 (Supplementary eMethod 3). After harmonization, the PSCs were considerably normally distributed (skewness = 0.11±0.17, and kurtosis = 0.67±0.68) (Supplementary eFigure 6A and B). To alleviate the potential violation of normal distribution in downstream statistical learning, we quantile-transformed all PSCs. In agreement with the literature,64, 65 males were found to have larger brain volumes than females on average (Supplementary eFigure 6C). The AAL ROIs underwent the same statistical harmonization procedure.
Method 4: Genetic analyses
Genetic analyses were restricted to the discovery and replication set from UKBB (Method 2). We processed the array genotyping and imputed genetic data (SNPs). The two data sets went through a “best-practice” imaging-genetics quality check (QC) protocol (Method 4A) and were restricted to participants of European ancestry. This resulted in 18,052 participants and 8,430,655 SNPs for the discovery set, and 15,243 participants and 8,470,709 SNPs for the replication set. We reperformed the genetic QC and genetic analyses for the combined populations for BRIDGEPORT, resulting in 33,541 participants and 8,469,833 SNPs. Method 4G details the correction for multiple comparisons throughout our analyses.
(A) : Genetic data quality check protocol. First, we excluded related individuals (up to 2nd-degree) from the complete UKBB sample (N=488,377) using the KING software for family relationship inference.66 We then removed duplicated variants from all 22 autosomal chromosomes. We also excluded individuals for whom either imaging or genetic data were not available. Individuals whose genetically-identified sex did not match their self-acknowledged sex were removed. Other excluding criteria were: i) individuals with more than 3% of missing genotypes; ii) variants with minor allele frequency (MAF) of less than 1%; iii) variants with larger than 3% missing genotyping rate; iv) variants that failed the Hardy-Weinberg test at 1×10-10. To adjust for population stratification,67 we derived the first 40 genetic principle components (PC) using the FlashPCA software68. The genetic pipeline was described elsewhere,69 and documented online: https://www.cbica.upenn.edu/bridgeport/data/pdf/BIGS_genetic_protocol.pdf.
(B) : Heritability estimates and genome-wide association analysis. We estimated the SNP-based heritability explained by all autosomal genetic variants using GCTA-GREML.70 We adjusted for confounders of age (at imaging), age-squared, sex, age-sex interaction, age-squared-sex interaction, ICV, and the first 40 genetic principal components (PC). One-side likelihood ratio tests were performed to derive the heritability estimates. In GWAS, we performed a linear regression for each PSC and included the same covariates as in the heritability estimates using PLINK.71
(C) : Identification of novel genomic loci. Using PLINK, we clumped the GWAS summary statistics based on their linkage disequilibrium to identify the genomic loci (see Supplementary eMethod 5 for the definition of the index, candidate, independent significant, lead SNP, and genomic locus). In particular, the threshold for significance was set to 5×10-8 (clump-p1) for the index SNPs and 0.05 (clump-p2) for the candidate SNPs. The threshold for linkage disequilibrium-based clumping was set to 0.60 (clump-r2) for independent significant SNPs and 0.10 for lead SNPs. The linkage disequilibrium physical-distance threshold was 250 kilobases (clump-kb). Genomic loci take into account linkage disequilibrium (within 250 kilobases) when interpreting the association results. The GWASRAPIDD72 software was then used to query the genomic loci for any previously-reported associations with clinical phenotypes documented in the NHGRI-EBI GWAS Catalog14 (p-value < 1.0×10-5, default inclusion value of GWAS Catalog). We defined a genomic locus as novel when it was not present in GWAS Catalog.
(D) : Gene-level associations with MAGMA. We performed gene-level association analysis using MAGMA.15 First, gene annotation was performed to map the SNPs (reference variant location from Phase 3 of 1,000 Genomes for European ancestry) to genes (human genome Build 37) according to their physical positions. The second step was to perform the gene analysis based on the GWAS summary statistics to obtain gene-level p-values between the pairwise of 2003 PSCs and the 18,097 protein-encoding genes containing valid SNPs.
(E) : Hypothesis-free gene set enrichment analysis with MAGMA. Using the gene-level association p-values, we performed gene set enrichment analysis using MAGMA. Gene sets were obtained from Molecular Signatures Database (MsigDB, v7.5.1),73 including 6366 curated gene sets and 10,402 Gene Ontology (GO) terms. All other parameters were set by default for MAGMA. This hypothesis-free analysis resulted in a more stringent correction for multiple comparisons (i.e., by the total number of tested genes and the number of PSCs) than the FUMA prioritized gene set enrichment analysis (see below F).
(F) : FUMA analyses for the illustrations of specific PSCs. In SNP2GENE, three different methods were used to map the SNPs to genes. First, positional mapping maps SNPs to genes if the SNPs are physically located inside a gene (a 10 kb window by default). Second, expression quantitative trait loci (eQTL) mapping maps SNPs to genes showing a significant eQTL association. Lastly, chromatin interaction mapping maps SNPs to genes when there is a significant chromatin interaction between the disease-associated regions and nearby or distant genes.24 In addition, GENE2FUNC studies the expression of prioritized genes and tests for enrichment of the set of genes in pre-defined pathways. We used the mapped genes as prioritized genes. The background genes were specified as all genes in FUMA, and all other parameters were set by defaults. We only reported gene sets with adjusted p-value < 0.05.
(G) : Correction for multiple comparisons. We practiced a conservative procedure to control for the multiple comparisons. In the case of GWAS, we chose the default genome-wide significant threshold (5.0×10-8, and 0.05 for all other analyses) and independently adjusted for multiple comparisons (Bonferroni methods) at each scale by the number of PSCs. We corrected the p-values for the number of phenotypes (N=6) for genetic correlation analyses. For heritability estimates, we adjusted the p-values for the number of PSCs at each scale. For gene analyses, we controlled for both the number of PSCs at each scale and the number of genes. We adopted these strategies per analysis to correct the multiple comparisons because PSCs of different scales are likely hierarchical and correlated – avoiding the potential of “overcorrection”.
(H) : Replication analysis for genome-wide association studies. We performed GWAS by fitting the same linear regressing models as the discovery set. Also, following the same procedure for consistency, we corrected for the multiple comparisons using the Bonferroni method. In particular, we corrected it for the number of genomic loci (N=915) found in the discovery set with a nominal p-value of 0.05, which thereby resulted in a stringent test with an equivalent p-value threshold of 3.1×10-5 (i.e., (-log10[p-value] = 4.27). We performed a replication for the 915 genomic loci, but, in reality, SNPs in linkage disequilibrium with the genomic loci are likely highly significant.
Method 5: Pattern analysis via machine learning for individualized imaging signatures
SPARE-AD captures the degree of expression of an imaging signature of AD, and prior studies have shown its diagnostic and prognostic values.32 We generalized the SPARE imaging signature to multiple diseases (SPARE-X, X represents disease diagnoses). Following our reproducible open-source framework35, we performed nested cross-validation (Supplementary eMethod 6) for the machine learning models and derived imaging signatures to quantify individualized disease vulnerability.
SPARE indices
MuSIC PSCs were fit into a linear support vector machine (SVM) to derive SPARE-AD, MCI, SCZ, DM, HTN, MDD, and ASD. Specifically, the SVM aims to classify the patient group (e.g., AD) from the control group and outputs a decision function that indicates how close each participant is to the hyperplane. The samples selected for each task are presented in Supplementary eTable 2.
No statistical methods were used to predetermine sample size. The experiments were not randomized, and the investigators were not blinded to allocation during experiments and outcome assessment.
Data Availability
The GWAS summary statistics corresponding to this study are publicly available on the BRIDGEPORT web portal (https://www.cbica.upenn.edu/bridgeport/). The GWAS summary statistics used in the genetic correlation analyses were fetched from the GWAS Catalog platform (https://www.ebi.ac.uk/gwas), although each study provided the original links; The GWAS Catalog platform was used to query if the SNPs identified by MuSIC were previously reported.
Data Availability
The GWAS summary statistics corresponding to this study are publicly available on the BRIDGEPORT web portal (https://www.cbica.upenn.edu/bridgeport/). The GWAS summary statistics used in the genetic correlation analyses were fetched from the GWAS Catalog platform (https://www.ebi.ac.uk/gwas), although each study provided the original links; The GWAS Catalog platform was used to query if the SNPs identified by MuSIC were previously reported.
Code Availability
The software and resources used in this study are all publicly available:
sopNMF: https://pypi.org/project/sopnmf/, MuSIC, and sopNMF (developed for this study)
BRIDGEPORT: https://www.cbica.upenn.edu/bridgeport/, (developed for this study)
BIGS: https://www.cbica.upenn.edu/bridgeport/data/pdf/BIGS_genetic_protocol.pdf, genetic processing protocol (developed for this study)
MLNI: https://pypi.org/project/mlni/, machine learning (developed for this study)
MUSE: https://www.med.upenn.edu/sbia/muse.html, image preprocessing
Clinica: http://www.clinica.run/, machine learning and image preprocessing
PLINK: https://www.cog-genomics.org/plink/, GWAS
GCTA: https://yanglab.westlake.edu.cn/software/gcta/#Overview, heritability estimates
LDSC: https://github.com/bulik/ldsc, genetic correlation estimates
MAGMA: https://ctg.cncr.nl/software/magma, gene analysis
GWASRAPIDD: https://rmagno.eu/gwasrapidd/articles/gwasrapidd.html, GWAS Catalog query
MsigDB: https://www.gsea-msigdb.org/gsea/msigdb/, gene sets database
Competing Interests
DAW served as Site PI for studies by Biogen, Merck, and Eli Lilly/Avid. He has received consulting fees from GE Healthcare and Neuronix. He is on the DSMB for a trial sponsored by Functional Neuromodulation. AJS receives support from multiple NIH grants (P30 AG010133, P30 AG072976, R01 AG019771, R01 AG057739, U01 AG024904, R01 LM013463, R01 AG068193, T32 AG071444, and U01 AG068057 and U01 AG072177). He has also received support from Avid Radiopharmaceuticals, a subsidiary of Eli Lilly (in-kind contribution of PET tracer precursor); Bayer Oncology (Scientific Advisory Board); Eisai (Scientific Advisory Board); Siemens Medical Solutions USA, Inc. (Dementia Advisory Board); Springer-Nature Publishing (Editorial Office Support as Editor-in-Chief, Brain Imaging, and Behavior). OC reports having received consulting fees from AskBio (2020) and Therapanacea (2022), having received payments for writing a lay audience short paper from Expression Santé (2019), and that his laboratory has received grants (paid to the institution) from Qynapse (2017-present). Members of his laboratory have co-supervised a Ph.D. thesis with myBrainTechnologies (2016-2019) and with Qynapse (2017-present). OC’s spouse is an employee and holds stock options of myBrainTechnologies (2015-present). OC has a patent registered at the International Bureau of the World Intellectual Property Organization (PCT/IB2016/0526993, Schiratti J-B, Allassonniere S, Colliot O, Durrleman S, A method for determining the temporal progression of a biological phenomenon and associated methods and devices) (2017). ME receives support from multiple NIH grants, the Alzheimer’s Association, and the Alzheimer’s Therapeutic Research Institute. MZ serves as a consultant and/or speaker for the following pharmaceutical companies: Eurofarma, Lundbeck, Abbott, Greencare, Myralis, and Elleven Healthcare.
Authors’ contributions
Dr. Wen takes full responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Wen, Nasrallah, Davatzikos
Acquisition, analysis, or interpretation of data: Wen, Nasrallah, Davatzikos, Abdulkadi, Satterthwaite, Dazzan, Kahn, Schnack, Zanetti, Meisenzahl, Busatto, Crespo-Facorro, Pantelis, Wood, Zhuo, Koutsouleris, Wittfeld, Grabe, Marcus, LaMontagne, Heckbert, Austin, Launer, Espeland, Masters, Maruff, Fripp, Johnson, Morris, Albert, Resnick, Saykin, Thompson, Li, Wolf, Raquel Gur, Ruben Gur, Shinohara, Tosun-Turgut, Fan, Shou, Erus, Wolk
Drafting of the manuscript: Wen, Nasrallah, Davatzikos
Critical revision of the manuscript for important intellectual content: all authors
Statistical and genetic analysis: Wen
Study supervision: Davatzikos
Acknowledgments
The iSTAGING consortium is a multi-institutional effort funded by NIA by RF1 AG054409. The Baltimore Longitudinal Study of Aging neuroimaging study is funded by the Intramural Research Program, National Institute on Aging, National Institutes of Health and by HHSN271201600059C. The BIOCARD study is in part supported by NIH grant U19-AG033655. The PHENOM study is funded by NIA grant R01MH112070 and by the PRONIA project as funded by the European Union 7th Framework Program grant 602152. Other supporting funds are 5U01AG068057, 1U24AG074855, and S10OD023495. Data were provided [in part] by OASIS OASIS-3: Principal Investigators: T. Benzinger, D. Marcus, J. Morris; NIH P50 AG00561, P30 NS09857781, P01 AG026276, P01 AG003991, R01 AG043434, UL1 TR000448, R01 EB009352. AV-45 doses were provided by Avid Radiopharmaceuticals, a wholly owned subsidiary of Eli Lilly. OC has received funding from the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and reference ANR-10-IAIHU-06 (Agence Nationale de la Recherche-10-IA Institut Hospitalo-Universitaire-6). This research has been conducted using the UK Biobank Resource under Application Number 35148. Data used in preparation of this article were in part obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wpcontent/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf. ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California. Dr. Wen and Dr. Davatzikos had full access to all the data in the study. They took responsibility for the integrity of the data and the accuracy of the data analysis.