BrainXcan identifies brain features associated with behavioral and psychiatric traits using large scale genetic and imaging data

Advances in brain MRI have enabled many discoveries in neuroscience. Comparison of brain MRI features between cases and controls have highlighted potential causes of psychiatric and behavioral traits. However, due to the cost of collecting MRI data and the difficulty in recruiting particular patient groups, most studies have small sample sizes, limiting their reliability. Furthermore, interpretation is complicated by reverse causality, where many observed brain differences are the result of disease rather than the cause. Here we propose a method (BrainXcan) that leverages the power of large-scale genome-wide association studies (GWAS), reference brain MRI data, and methodological advances in causal inference using genetic instruments to discover new mechanisms of disease etiology and validate existing ones. BrainXcan tests complex traits for association with genetic predictors of brain MRI derived phenotypes to pinpoint relevant region-specific and cross-brain features. It also evaluates consistency and direction of the causal flow with Mendelian Randomization. As this approach requires only genetic data, BrainXcan allows us to test a host of hypotheses on mental illness, across many disorders and MRI modalities, using existing public data resources. Our method shows that reduced axonal density across the brain is associated with the risk of schizophrenia, consistent with the disconnectivity hypothesis. We also find structural features hippocampus, amygdala, and anterior cingulate cortex among others associated with schizophrenia risk highlighting the potential of our approach to bring orthogonal lines of evidence to inform the biology of complex traits.

subytpe's PC (Y ∼ PC) (see details in Supplementary Notes 1). 153 The overall noise level within IDP k led to what is known as attenuation bias whereas adjusting for PCs (weighted sum of IDPs) led to what is known as collider bias (Dahl et al., 2019).
where m is the number of IDPs in the subtype (ranging from 14 to 96) and n is the sample size of the GWAS 154 study or equivalently of the BrainXcan association study (typically ∼ 100K − 1M ). The attenuation bias 155 reduced the estimate of β k by a factor of (1− t 2 s 2 +t 2 ). The third term represents the collider bias, proportional 156 to the average effect size across brain regions, which can be expected to be relatively small if the effect is 157 specific to a region.

158
Regression on the subtype's PC instead of the latent variable L yields a biased coefficient: This coefficient is the sum of the latent brain-wide effect (α) and the effects of individual regions. The 159 interpretation of this coefficient will depend on the significance of the individual region effects estimated in 160 Eq.(1). If all region-specific effects are small and not significant, then we can assume that the brain-wide for canonical examples of polygenic traits such as height or BMI. The estimates are shown in Fig. 3b with 177 common human traits added for reference. 178 Ridge regression predicts brain features better than elastic net 179 We trained genetic predictors of the original brain IDPs, the derived principal components, and the residual 180 IDPs using penalized regression approaches, ridge regression and elastic net. We restricted our search to 181 linear models so that BrainXcan could be applied using solely GWAS summary statistics even when the 182 individual level genotype and phenotype data were not available. Given the high polygenicity of IDPs, we 183 anticipated that many of the predictors would be based on ridge regression with all the SNPs having a nonzero 184 weight. Therefore, to keep the computation manageable and to make the prediction models applicable to a 185 broader set of GWAS studies where access to individual level data may not be possible, we restricted the 186 training to HapMap3 SNPs, which tend to be imputed with high quality, with MAF > 0.01 in European 187 samples. To avoid issues with strand mis-specification, we excluded the ambiguous SNPs (e.g. AT, CG).

188
These restrictions left us with a total of 1.07 million SNPs (Methods) for the subsequent procedure. 189 We trained two sets of models, one with a ridge and the other one with a elastic net penalty. Recall that 190 ridge regression uses l 2 penalty and yields highly polygenic predictors (all non zero weights) whereas elastic 191 net yields sparse models, setting the weights of most variants to 0. Given the high polygenicity of IDPs, we 192 expected ridge regression to perform better than the sparsity-inducing elastic net penalty (Methods).
193 9 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 3, 2021.

10
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 3, 2021.  Fig. 3c.) 198 We also found that the improved performance of ridge regression over elastic net predictions was correlated IDPs using genotype alone and correlated these predicted values with complex traits.

205
As expected, the prediction performance increased with the heritability of the IDP, i.e. more heritable traits 206 were predicted better as shown in Fig. 3d. However, we also noted that the median prediction R 2 was lower 207 than 8% of the heritability (Fig. 3c), an upper bound of how the performance of linear predictors. This low 208 proportion of heritability captured by the genetic predictors highlights the need to increase the sample size 209 of reference image data to reach the upper bound of the performance. 210 We filtered out unreliable predictors by keeping only the ones that showed prediction performance correlation  14 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. There are two caveats that need to be considered when interpreting Mendelian randomization results. One 291 is that we first select the IDP-trait pair based on their association. Therefore, the p-values of the Mendelian 292 randomization will not be well-calibrated. Therefore, we propose using the p-values to discern between 293 possible direction of the causal flow, not to quantify significance.

294
The second caveat relate to the power difference between reference image and GWAS studies. Currently, 295 reference image data have much smaller sample sizes (n∼ 30K) compared to GWAS studies of complex traits 296 (n∼ 100K − 1M ). We consider three main mediating scenarios depicted in Fig. 2b). In scenario i) the brain 297 feature mediates the genetic association with the phenotype, i.e. genetic risk factors alter the brain feature 298 which in turn alters the risk for the phenotype. In scenario ii) genetic factors affect the phenotype which 299 in turn alter the brain feature. In scenario iii) genetic factors affect an underlying latent factor which alter 300 both the phenotype and the brain feature. 301 A significant i) and not significant ii) can be interpreted as evidence that the brain feature alteration is 302 affecting the phenotype given the higher power of GWAS studies in general. However, a significant scenario 303 ii) and non significant scenario i) could simply mean that the instruments (strongly associated SNPs and 304 their effect sizes) for the brain feature are not reliable enough to yield significance. In this case, scenario ii) 305 should be considered supported by the data but scenarios i) and iii) should not be ruled out.  CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

19
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

20
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

21
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 3, 2021. ; https://doi.org/10.1101/2021.06.01.21258159 doi: medRxiv preprint the anterior cingular cortex. Our results add robust orthogonal lines of evidence to existing literature since 380 our associations are based on the genetic components of the traits and are less likely to be confounded than 381 studies with observed traits. 382 We provide a user-friendly software package to attract investigators less familiar with the genetic field. Our 383 software is implemented in R and python with a streamlined pipeline coded in snakemake. An automated 384 report in html format with pre-defined figures summarizing the results facilitates interpretation. 385 We note that there are several limitations in the current work. First, the prediction performance of the 386 current genetic predictors are largely limited by the size of the training cohort (Fig. 3d). As the UK Biobank 387 is gradually collecting more brain imaging data (Littlejohns et al., 2020), we expect the training cohort size 388 will increase to at least 100,000 individuals. We will be updating the the brain IDP predictors as the data 389 becomes more available. Second, the S-BrainXcan calculation relies on the genotype covariance which is 390 approximated as a banded matrix (Methods) here. This approximation may, particularly, affect the stability 391 of the joint analysis since the jointly analysis relies heavily on the predicted IDP covariance which is derived 392 from the genotype covariance. This was one of the reasons we decided not to pursue the joint model since Mendelian randomization, the Mendelian randomization p-value is not well-calibrated; iv) causality is valid 400 only when the Mendelian randomization assumptions are hold. Given these limitations, we consider the 401 Mendelian randomization results to be quantification of the evidence for the direction of the putative causal 402 flow. Fourth, only linear prediction models are used in our implementation. More sophisticated models 403 could be used for prediction but that will limit the application to cases where the full individual level data 404 is available. Given the current availability of summary results and the need to use very large sample sizes 405 to reliably estimate genetic effects makes the use of non-linear models less than optimal. Despite these 406 limitations, we anticipate that BrainXcan, a user-friendly analysis tool, will be broadly adopted and help in 407 identifying brain features important in the pathogenesis as well as diagnosis of complex phenotypes.

22
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 3, 2021. ; https://doi.org/10.1101/2021.06.01.21258159 doi: medRxiv preprint We queried from UK Biobank database using ukbREST (Pividori and Im, 2019) to retrieve the list of 459 The elastic net models were fitted using the R package snpnet which implements the BSAIL algorithm proposed in Qian et al. (2020). Specifically, we fit the following optimization problem: When the individual-level information is not available, we calculated the BrainXcan statistic approximately using the GWAS summary statistic and the genotype covariance from a reference panel. The formulas are similar to the ones proposed in Barbeira et al. (2018). Let b j and se( b j ) be the estimated effect size and the corresponding standard error for variant j from the GWAS. And z GWAS,j is the GWAS z-score of SNP j. Let R represent the genotype sample covariance matrix where R jj = Cov(X j , X j ), namely the sample covariance between variant j and j . We can calculate the marginal test statistics using the following results (see derivations in Supplementary Notes 3):β