Abstract
Augmenting traditional genome wide association studies (GWAS) with advanced machine learning algorithms can allow the detection of novel signals in available cohorts. We introduce “Genome wide association neural networks (GWANN)”, a novel approach that uses neural networks (NNs) to account for nonlinear and SNP-SNP interaction effects. We applied GWANN to family history of Alzheimer’s disease (AD) in the UK Biobank. Our method identified 25 known AD genes, 2 target nominations and 68 potentially novel genes, and validated the results against brain eQTLs, AD phenotype associations, biological pathways, disease associations and differentially expressed gene sets in the AD brain. Some drugs targeting novel GWANN hits are currently in clinical trials for AD. Applying NNs for GWAS illustrates their potential to complement existing algorithms and methods, and enable the discovery of novel and tractable targets for AD.
1 Introduction
Alzheimer’s disease (AD) affects approximately 30 million people in the world, making it the most common form of dementia1. It is characterised by the build-up of Aβ and tau proteins in the brain, leading to neuronal death and impaired cognitive function2. In the last 10 years, genome wide association studies (GWAS) have revolutionised our understanding of the inherited basis of disease and have been critical in identifying multiple risk loci and disease pathways associated with AD3. However, some limitations of current GWAS approaches may be hindering their ability to uncover the complex genetic landscape of AD without extending the sample size. Despite the number of SNPs identified to date, they still explain only a small fraction of the heritability of the disease4; the hits often appear to be unrelated to each other5, 6; they have limited biological relevance to the disease7; and it is often challenging to ascertain the genes and the biological mechanisms underlying each SNP association4–7. Along with the modern availability of large datasets8–10, an enhancement that can complement current GWAS methods is the introduction of machine learning analysis methods that can unveil more complex patterns in genomic data that would otherwise have been missed by the traditional linear models.
Machine learning methods, and Neural Networks (NNs) in particular, have been instrumental in the advancement of multiple engineering industries due to their efficacy in analysing complex data patterns11, especially where large amounts of data are available. Specifically, in the task of finding disease associated loci, NNs have recently been employed and tested on a list of complex traits and diseases such as eye colour and schizophrenia12. Given the recent success of NNs in a myriad of fields including genomics, our aim was to develop NNs specialised to perform a gene-level GWAS using SNP data available in the UK Biobank (UKBB)13. Our method is gene-based and considers groups of SNPs within and around a gene to establish the association of the gene with the phenotype of interest. In this work we demonstrate the application of our new GWANN method to identify associations with familial history of AD, a proxy for AD8, present the genetic associations found by the method, and systematically validate the results against brain eQTLs, AD phenotype associations, biological pathways, disease associations and differentially expressed gene sets in the AD brain.
2 Results
2.1 Identification of novel genes related to AD using GWANN
GWANN identified 69 genes significantly associated with maternal history, 29 genes associated with paternal history, and 95 genes in the parental (i.e. maternal and paternal) meta-analysis (Figure 3, Supplementary Table 1). All these were significant after multiple-comparison correction with a Bonferroni threshold of P value<6.82x10-7 (Figure 2, Supplementary Table 2). Table 1 lists the genes with their P values. Supplementary Table 2 additionally lists the top three input features that were most important to the NNs of GWANN in the maternal and paternal history datasets. In terms of the NN architecture, each of these features can be a SNP or a covariate encoding. Supplementary Table 3 contains the genes in linkage disequilibrium (LD), at thresholds of r2 ≥ 0.1 and r2 ≥ 0.25, based on the top three maternal and paternal SNPs in the windows for the genes. A more conservative set of hits considering only those genes that passed P value thresholds of 0.05, 10-2, 10-3, 10-4 and 10-5 can be seen in Supplementary Figure 1. A standard SNP-level GWAS performed using PLINK 2.0 with a logistic regression model, on the same set of SNPs and individuals as used in GWANN, identified 13 genes for maternal history, four genes for paternal history and 14 genes in the meta-analysis. Five of these genes (APOE locus – APOE, APOC1, TOMM40, BCAM and EXOC3L2) overlapped with the findings of GWANN. When compared with other GWAS on familial history of AD and AD diagnosis listed in the GWAS catalogue3, there was an overlap of 25 genes with the GWANN hits. Twelve of these have been previously associated with AD (ANK3, APH1B, GLIS3, EPHA1, WWOX in addition to the APOE locus). Seven others are listed in the GWAS catalogue as suggestive associations (P value<1x10-5) with AD (EXOC4, NRXN1, NRXN3, PAX5, CADM2, LUZP2 and RBMS3).
In a recent analysis on the pleiotropic predisposition to AD and educational attainment14, DCC and CADM2 were identified as associations, and EFNA5 and LRP1B as suggestive associations.
Figure 1. NN architecture used in the GWANN method. The top-left branch generates a 1D encoding from the SNPs input (green), while the bottom-left branch does so for the covariate input (red). The right trunk merges the encodings of both branches to output whether the input belongs to cases (blue output) or controls (red output).
Figure 2. Manhattan plot of the parental meta-analysis. The threshold for genome wide significance (P value<6.82x10-7) is indicated by the red dotted line. GWANN identified 21 known hits, 2 nominated AD targets and 72 novel gene hits. The black dotted line indicates the suggestive association threshold (P value<1x10-5).
Figure 3. (a) Venn diagram of 69 maternal, 29 paternal and 95 meta-analysis genes with P value<6.82x10-7 in each analysis. (b) Analysis of the intersection of hit genes between the methods on the x axis and y axis. The number of genes in the sets and intersections are specified within the brackets and the intensity of each block (darker is more significant) represents the significance of the size of the intersection set, given the 2 individual sets. Blocks in grey indicate a non-significant number of genes in the intersection. The EADB set is without the APOE locus, and hence the intersection with the traditional UKBB GWAS (using the same data set as used for GWANN) results in a null set.
Table 1. GWANN hit genes. Pm+p represents the METAL meta-analysis P value, Pm the GWANN maternal analysis P value and Pp the GWANN paternal analysis P value.
DAB1 has previously been associated with psychosis in AD15, and MACROD2 with neurofibrillary tangles16.
EFNA5, NRXN3, BDNF and ROBO1 have been nominated as potential targets for AD. Among the other novel genes identified by our method, PITPNM2, RNF182 and SAMD4A have been reported to be significantly associated with Braak; RNF182 and SAMD4A are associated with CERAD; and CPNE5, SAMD4A, ATP1B3, MAGI2 and ARSG are associated with COGDX. These are three measures of AD pathology or cognitive phenotypes (Table 2). A total of 69 genes were significantly enriched with eQTLs in brain regions and 51 genes show differential RNA expression in post-mortem AD brains (Supplementary Table 4). There is evidence in the GWAS catalogue for 42 genes which have previously been associated with diseases in the category of “nervous system disease” (EFO:0000618).
Table 2. Genes that had significant association with at least one of BRAAK, CERAD or COGDX in data obtained from the Agora AD knowledge portal (https://agora.adknowledgeportal.org/).
2.2 Enriched protein-protein interaction (PPI) network of GWANN hits
The PPI network of the 95 associated genes identified post meta-analysis was significantly enriched (P value=3.33x10-16, Figure 4a). Given a network of 95 possible nodes, the expected number of connections by chance would be 23, but the network of the GWANN hits shows 72 connections, and is therefore significantly enriched. On retaining only the novel GWANN genes, the PPI network remains enriched (P value=0.0285). The drop in significance is possibly due to the absence of indirect interactions through the old hits, indicated by the green dotted lines in Figure 4b.
Figure 4. PPI network of GWANN hits from STRING. Red - old hits/suggestive hits, orange - nominated hits, blue - novel hits. The grey solid lines indicate a direct connection between two genes and the green dotted lines indicate an indirect connection through a known hit. In each of the networks, the node with the highest degree (most interactions) is the bottommost node followed by decreasing degree in the anti-clockwise direction. (a) PPI network containing the 46 (inclusive of old, nominated and novel hits) out of the 95 hits that had connections with each other. (b) The same network as in (a) after removing all known hits.
2.3 Enrichment of GWANN hits using transcriptomic data from AD post mortem brains
Using the results of a recent large meta-analysis of AD brain transcriptomic data17, we verified whether the results of our NN analyses show significant representation in the set of differentially expressed genes (DEGs) in the cerebellum, frontal lobe, parietal lobe, and temporal lobe. The authors reported results for DEGs in different groups of patients, denoting these as S1=AD (patients with AD) and S2=non-AD (i.e. patients with mental disorders different from AD). We ran our analysis for four different sets of DEGs, (i) S1, (ii) S2, (iii) S1 - S2 (patients with only AD and no other mental disorder), and (iv) S1 ∩ S2 (patients with any mental disorders). For each set, the DEGs in the cerebellum and parietal lobe, and for sets i, ii and iii, the frontal lobe was enriched at a significance level of 1.25x10-2 (=.05/4). For the frontal lobe, there were no DEGs reported by the authors for set iv, and there was no enrichment in sets i, ii or iii (Table 3).
Table 3. (a) Enrichment P values of DEGs in four different brain regions for AD and other mental disorders analysed by Patel et al. [CITE] (b) Enrichment P values of DEGs in four different brain regions for the analysis between AD, asymptomatic AD and controls by Patel et al. [CITE] * P value<1.25x10-2
Another study compared gene expression between controls, asymptomatic AD cases and AD cases (GSE118553)18. The studied brain regions were the cerebellum, entorhinal cortex, frontal cortex and temporal cortex. None of the regions showed enrichment in the case of asymptomatic AD vs controls. DEGs for asymptomatic AD vs AD and AD vs controls in the entorhinal cortex showed enrichment, and DEGs for AD vs controls in the temporal cortex were enriched at a significance level of 1.25x10-2 (=.05/4). Supplementary Figures 2 and 3, respectively, contain the cumulative distribution plots used in the enrichment of each dataset.
2.4 Enriched diseases, tissues, biological pathways and GO terms using GWANN summary statistics and hit genes
The gene set enrichment analysis (GSEA) resulted in enrichment of the pathways for amyotrophic lateral sclerosis (P value=3.76x10-4) and Alzheimer’s disease (2.98x10-3) amongst the set of KEGG pathways. Although both of these are disease-specific pathways, all KEGG pathways (disease specific and non-disease specific) were included in the analysis. To further ascertain possible disease associations, we also calculated enrichment in DisGeNET (Table 4, Figure 5c), obtaining significance in schizophrenia (1.38x10-6), autistic disorder (6.67x10-7) and neurodevelopmental disorders (5.72x10-10), to name a few. The most enriched tissue groups were the brain, blood vessel and nerve (Figure 5d).
Figure 5. (a) GO term over representation analysis for GWANN meta-analysis hits using ClueGO v2.5.8. For network visualisation purposes, GO child terms spanning more than two were grouped and collapsed into a node (white nodes). V-shaped gene nodes correspond to either antisense, ’BDNF-AS’ as opposed to BDNF; or flanking sequence regions, between ’PDCD10’ and ’SERPINI1’. GO node size is proportional to the number of genes enriched in the group. (b) Pie chart showing the percentage of GWANN hits in each GO term. (c) Dotplot of the disease enrichment analysis with DisGeNET, with gene ratio (x-axis, #input genes / #genes in disease term); disease terms from a multitude of disease-gene databases (y-axis); FDR for each disease enriched term; and number of genes (#genes found in the disease term). (d) Tissue specificity plot (up and down regulated) for GTEx v8 54 tissue types using GWANN genes with P value<1x10-5 in the parental meta-analysis. Significantly enriched tissues (Pbon<0.05) are highlighted in red.
Table 4. Disease enrichment. Disease gene sets from DisGeNET that were significantly enriched after applying FDR correction. Column “GWANN hit count” indicates the number of genes that were part of the disease gene set among the 95 significant genes from GWANN.
Novel GWANN hits with known associated drugs and the main category of diseases they are used for.
The enriched pathways include statin inhibition of cholesterol production (3.69x10-4); disruption of postsynaptic signalling by copy number variations (CNV) (8.38x10-3); regulation of commissural axon pathfinding by SLIT and ROBO (3.69x10-3); NTRK2 signalling through RAS and CDK5 (5.41x10-3, 6.03x10-3); RUNX1 regulation of transcription of genes involved in BCR signalling (6.32x10-3); and Reelin signalling (8.99x10-3). A complete list of pathways enriched in the GSEA for maternal history, paternal history and parental meta-analysis can be found in Supplementary Table 5, and disease enrichment in Supplementary Table 6.
In the over-representation analysis (ORA) (Figure 5a) using gene ontology (GO) terms, 56% (Figure 5b) of the terms were related to the cellular response to nerve growth factor stimulus (GO:1990090). The binding of BDNF (neurotrophins family) to NTRK2 (TRK family) initiates a cascade of intracellular signalling events including Rho protein signal transduction (GO:0035023) triggering neuron recognition (GO:0008038), amyloid-beta metabolic process (GO:0050435), cellular component of the Schaffer collateral - CA1 synapse (GO:0098685), Golgi to plasma membrane transport (GO:0006893) and ephrin receptor signalling pathway (GO:0048013). The results for GO enrichment of maternal and paternal histories can be found in Supplementary Figure 4 and Supplementary Table 7.
2.5 Potential of novel targets for AD drug discovery
Seventeen novel hits were reported as tractable targets for drug discovery (Supplementary Table 8). NTRK2, FSHR, BCR, PDE1C and EPHA6 are five novel GWANN hits that have known approved drugs (Table 4). Among these, BCR has two associated drugs, nilotinib and dasatinib, one of which has undergone, and the other of which is currently undergoing, clinical trials for AD. From the list of enriched pathways, NTRK2 signalling plays an important role in synaptic transmission and neuronal development (R-HSA-9032845 and R-HSA-9026519) and may be a potential therapeutic target for AD. DCC, another GWANN hit that was also recently associated with pleiotropic predisposition to Alzheimer’s disease and educational attainment14, is not currently associated with any approved drug but received the highest tractability probability of 98.47%. It has a medical-quality pocket for small molecule development and, given its importance in axon guidance (enriched pathway R-HSA-428542), could potentially be another therapeutic target to explore.
3 Discussion
We applied GWANN to maternal and paternal histories of AD using data from the UKBB, followed by a parental meta-analysis. In doing so, we identified 23 known or nominated AD hits and 72 potentially novel hits. The association analysis was further supported by post-hoc enrichment analyses which highlighted enriched biological and disease pathways relevant to AD and neurodegeneration. Several GWANN hits were also identified to have known drugs that could possibly be repurposed, or a good probability for intervention with various modalities like small molecules and antibodies.
3.1 GWANN identifies hits associated with AD diagnosis
Of the genes that overlap between GWANN’s discovery and discovery from previous GWAS, most of them have also been identified previously using familial history data in the UKBB. However, GWANN also identified previously associated or suggestively associated genes from GWAS performed on cohorts of diagnosed AD patients that have not been identified in the UKBB. GLIS3 was previously identified in a GWAS performed on cerebrospinal fluid tau levels20; WWOX was previously identified in the IGAP data21; and IMMP2L was previously identified for CNVs associated with AD22. Among the suggestive associations, other than the genes possibly linked to AD diagnosis, EFNA5 was suggestively associated with psychosis in AD15 and MACROD2 with neurofibrillary tangles16. ANK3, a novel discovery in the latest GWAS conducted on the European Alzheimer’s Disease BioBank (EADB)19, was also identified by GWANN. Furthermore, observing the neighbourhood of the hits in the EADB analysis, we see a greater number of genes reaching nominal significance in the GWANN analysis as compared to the neighbourhood of genes that were insignificant in the EADB analysis (Supplementary Figure 5). The ability of GWANN to identify these genes from the UKBB data suggests the presence of subtle patterns in the data that were possibly missed by the traditional linear models but identified by our method.
3.2 Role of hit genes and enriched biological pathways in neurodegeneration and AD
The enriched pathways of the novel hits and known suggestive hits play important roles in the pathogenesis of AD. For example, axon guidance molecules such as netrin-1 play an important role in the regulation of Aβ levels and Reelin levels23. Two enriched pathways, R-HSA-428542 and R-HSA-8866376, are child-pathways of axon guidance. DCC, a GWANN hit, is one of the main receptors of netrin-1 and an integral component of the enriched pathway R-HSA-428542. (i) Through its interaction with ROBO1 (GWANN hit and AD target nomination) in the presence of SLIT, it plays an important role in controlling commissural axon growth; and (ii) its interaction with APP mediates axon guidance by enhancing intracellular signalling and may have a role in the negative regulation of Aβ formation24.
Furthermore, the phosphorylation of DAB1, another GWANN hit, plays a role in Reelin signalling (R-HSA-8866376). The loss of Reelin function in humans has previously been associated with AD25.
Another GWANN hit with previous suggestive association to AD, PAX5, is a transcription co-factor along with the RUNX1 complex that plays an important role in BCR signalling and B cell development (enriched pathway R-HSA-8939245). The role of B cells in the pathogenesis of central nervous system diseases is well established and B cell depleting therapies have shown success in patients with disorders such as multiple sclerosis26. A recent study in mice has also suggested the involvement of B cells in neurodegenerative diseases like AD by causing immunoglobulin deposits around Aβ plaques27.
AD is often characterised by synaptic failure (2) and disruption of postsynaptic signalling by CNVs (WP4875) was among the enriched pathways in the GSEA. Two GWANN hits, NRXN1 and NRXN3, belong to the neurexin family of proteins and contribute significantly to synaptogenesis28. The disruption of the functioning of these genes due to CNVs affects synapse formation and neurodevelopment. They have also previously been associated with autism29 and schizophrenia30. Another integral component of synaptic transmission is controlled by neurotrophins like BDNF (GWANN hit and nominated AD target), which play a very important part in the survival, differentiation, and plasticity of neurons. Low levels of BDNF mRNA in patients with AD and mild cognitive impairment are an experimentally reproduced finding, and BDNF treatments in rodent and primate models of AD have shown success previously (2). NTRK2 (or TRKB), a novel GWANN hit, is the main receptor that BDNF binds to; NTRK2 signalling plays a major role in neuronal development and possibly affects hippocampal long-term potentiation (R-HSA-9620244) (2) through CDK5 catalytic activity (R-HSA-9032845), two pathways that were also enriched in the GSEA.
3.3 Repurposing known drugs of novel hits for AD
A large proportion of the known drugs associated with the novel GWANN hits belong to a broader group of drugs known as tyrosine kinase inhibitors (TKI). Drugs like nilotinib and dasatinib, among other TKIs, are being tested to be repurposed for AD due to their ability to reduce tau hyperphosphorylation and reverse Aβ-induced synaptic dysfunction and synapse loss in mouse models of AD-related pathology31. A previous study using mouse models also identified the effect of TKIs in reducing Aβ levels and astrocyte and dendritic cell number after treatment with nilotinib and bosutinib, two drugs associated with the novel GWANN hit NTRK2. Other than TKIs, dipyridamole, a drug associated with the novel hit PDE1C, was previously shown to prevent Aβ-induced microglial inflammation, thereby making it a possible therapeutic intervention for AD32. Pentoxifylline, a methylxanthine associated with PDE1C, has been suggested to improve cognitive function in patients with vascular dementia33, reduce the odds of occurrence of Parkinson’s disease34 and reduce Aβ levels34. This further supports the quality of the novel GWANN hits and (i) suggests the possibility of repurposing the existing drugs for AD, and (ii) provides evidence that several of these drugs are already being explored for the treatment of AD.
3.4 Limitations and considerations
Given that the method involves training a very large number of NNs, we acknowledge that it requires a fairly large amount of computational time and resources. However, we believe that with the rapidly advancing field of ML and AI, it is possible to further optimise this method to achieve a speed up in computational time and reduction in resource usage. For example, the method of “knowledge distillation” uses the concept of a larger teacher NN teaching a much smaller student NN to learn the same task, which allows a significant speed up in time35. Secondly, NNs are more difficult to interpret than traditional linear models, which renders them “black boxes”. In this work we explored the gradients of the NNs to identify the SNPs that were most important in driving the prediction, but more methods to increase their interpretability and identify interactive effects could be very beneficial in further understanding the hits identified and those missed by GWANN. Finally, we acknowledge that the discovery in this work warrants replication in different cohorts containing diagnosis of AD.
3.5 Conclusion
We applied our method to family history of AD using data from the UKBB, but it can potentially be extended to other data sources, as well as be applied to other diseases or groups of diseases to understand the common associations, if any, between them. The ability of GWANN to identify a set of possible targets that are part of meaningful biological pathways associated with AD and neurodegeneration, opens the world of GWAS to a new analysis method that could augment the success of existing methods in understanding the pathogenesis of AD and other diseases.
4 Methods and Materials
4.1 Data source
We utilised data from the UKBB (http://www.ukbiobank.ac.uk). The data comprises health, cognitive and genetic data collected from ∼500,000 individuals aged between 37 and 73 years from the United Kingdom at the study baseline (2006–2010)13, 36. We used imputed SNP data as input to GWANN. In addition, the covariates used were age (field 21003), sex (field 31), the first six genetic principal components (PCs) obtained from UKBB variables (field 22009), education qualification (field 6138) and a binary variable if this data was not available (1 if missing and 0 if not). Full UKBB cohort and variable descriptions are provided in the Supplementary material.
4.2 Cohort selection
The case groups consisted of individuals with maternal and paternal histories of AD, some of whom also had AD diagnosis. Individuals with diagnosed AD but no familial history of AD, and those with other neurological disorders, were removed from the control groups37. We divided the entire range of ages into three groups (age-group1: 38-52, age-group2: 53-61, age-group3: 62-73 years) and paired them with the 2 sexes (male and female) to obtain six broad groups: (age-group1, male), (age-group1, female) etc. To deal with the problem of imbalanced classes, which affects the training of machine learning algorithms, we randomly downsampled the controls, while balancing for the six groups, to match the number of cases and obtain a 1:1 ratio. The sets of individuals in the datasets for paternal history and maternal history analyses did not overlap. After these steps, maternal history had 26,133 cases and 26,133 controls, and paternal history had 12,680 cases and 12,680 controls.
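The downsampling step described above can be sketched as follows. This is a minimal illustration and not the code used in this work: it assumes that "balancing for the six groups" means drawing, within each (age-group, sex) stratum, as many controls as there are cases, and the record layout (dictionaries with hypothetical `age` and `sex` keys) is chosen purely for the example.

```python
import random
from collections import defaultdict

AGE_BINS = [(38, 52), (53, 61), (62, 73)]  # age groups as stated in the text

def stratum(person):
    """Return the (age-group index, sex) stratum of one individual."""
    for idx, (lo, hi) in enumerate(AGE_BINS):
        if lo <= person["age"] <= hi:
            return (idx, person["sex"])
    raise ValueError(f"age {person['age']} is outside the defined groups")

def downsample_controls(cases, controls, seed=0):
    """Randomly pick, per stratum, as many controls as there are cases.

    Assumption: per-stratum 1:1 matching; the text only states that the
    six groups were balanced and that the overall ratio was 1:1.
    """
    rng = random.Random(seed)
    pool = defaultdict(list)
    for person in controls:
        pool[stratum(person)].append(person)
    needed = defaultdict(int)
    for person in cases:
        needed[stratum(person)] += 1
    selected = []
    for key, n_cases in needed.items():
        if len(pool[key]) < n_cases:
            raise ValueError(f"not enough controls in stratum {key}")
        selected.extend(rng.sample(pool[key], n_cases))
    return selected
```

Sampling without replacement inside each stratum keeps the age and sex composition of the selected controls identical to that of the cases, so neither covariate can separate the two classes on its own.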
4.3 Neural network model
GWANN follows an architecture with 2 branches that later merge into a single trunk (Figure 1). One of the branches reads contiguous SNPs within a genomic region involving each gene, while the other reads the covariates. The common trunk combines this information to predict family history of AD. Each sample consists of SNPs and covariates for a group of 10 individuals, with all 10 being either cases (people with family history of AD) or controls. The NN is trained to predict whether the sample is formed by cases or controls. The rationale behind using convolutional layers in our architecture (Figure 1) was to implement “group training”, which allows the NNs of GWANN to consider the group of 10 cases or controls as a single sample, enabling them to identify similar patterns across the individuals in the group. Before the output of the convolutional section is passed to the densely connected section of the model, it is passed through an “attention” block to focus on important features and ignore features without much information. The final feature vector, obtained from the densely connected portion of the NN focussing on the SNPs, is concatenated with a feature vector or encoding generated from the covariates (bottom-left branch of NN in Figure 1) and finally passed through the densely connected end layers of the NN to obtain the final prediction.
The covariate encodings are obtained from the penultimate layer of the bottom-left branch. Further information about the NN architecture can be found in the Supplementary material.
4.4 Training the neural network
The NNs were trained to predict the status of maternal and paternal history of AD in two separate analyses. The dataset for each phenotype was split into case-control balanced training and testing sets. For maternal history, this resulted in a training set of 44,426 and a testing set of 7,840; paternal history was split into a training set of 21,556 and a testing set of 3,804. In order to implement “group training”, we separately upsampled the individuals in the training and testing sets by 10 and randomly grouped them into groups of 10. We ensured no overlap of individuals between the training and testing sets. All missing SNP values were replaced with -1 and all variables were min-max scaled prior to training. To optimise the training process, (i) we pre-trained the covariate branch (bottom-left branch in Figure 1); and (ii) saved and reused the encodings from the penultimate layer, instead of passing the covariates through the pre-trained branch during each training to obtain the same encodings.
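The group construction and preprocessing described above can be illustrated with a short sketch. It is an assumption-laden illustration rather than the pipeline used here: missing SNP values are represented as `None`, grouping is applied to one class (cases or controls) at a time, and nothing prevents the same individual from appearing more than once in a group, since the text does not say whether this was avoided.

```python
import random

GROUP_SIZE = 10  # group size stated in the text

def make_groups(individual_ids, seed=0):
    """Repeat every ID GROUP_SIZE times, shuffle, and cut into groups of GROUP_SIZE.

    Call once per class and per split so that every group is all cases or
    all controls and no individual crosses the train/test boundary.
    """
    rng = random.Random(seed)
    pool = list(individual_ids) * GROUP_SIZE
    rng.shuffle(pool)
    return [pool[i:i + GROUP_SIZE] for i in range(0, len(pool), GROUP_SIZE)]

def fill_missing(values, fill=-1):
    """Replace missing SNP values (None in this sketch) with -1."""
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Scale one variable to [0, 1]; a constant variable maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Under this scheme the number of groups equals the number of individuals in the class, because each individual contributes 10 copies and each group consumes 10 slots.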
For every gene, SNPs within the gene and 2500 bp upstream and downstream of it were considered. Since NNs are computationally more intensive than linear models, we set the limit to 2500 bp as a trade-off between increased computational time and including downstream and upstream SNPs in the analysis. This also minimised the chances of overlap between genes which are very close to each other. We divided every gene into windows of maximum 50 SNPs and the final analysis was done on all windows of all genes. A different NN was trained for each window per gene in the entire genome. This resulted in having to train a total of 73,310 models for maternal history and 73,299 models for paternal history.
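As an illustration of the windowing described above, the following sketch selects the SNPs of one gene and splits them into windows. It assumes inclusive region boundaries and consecutive, non-overlapping windows; the text specifies only the 2500 bp flank and the maximum of 50 SNPs per window, so these details and the function name are hypothetical.

```python
FLANK_BP = 2500        # flank on each side of the gene, from the text
MAX_WINDOW_SNPS = 50   # maximum number of SNPs per window, from the text

def gene_windows(gene_start, gene_end, snp_positions):
    """Return one list of SNP positions per window for a single gene.

    Assumptions: positions are base-pair coordinates on the gene's
    chromosome, and windows are consecutive chunks of the sorted SNPs.
    """
    lo, hi = gene_start - FLANK_BP, gene_end + FLANK_BP
    in_region = sorted(p for p in snp_positions if lo <= p <= hi)
    return [in_region[i:i + MAX_WINDOW_SNPS]
            for i in range(0, len(in_region), MAX_WINDOW_SNPS)]
```

For example, a gene whose extended region contains 121 SNPs would yield three windows of 50, 50 and 21 SNPs, and therefore three separately trained NNs.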
The NN models were built and trained using PyTorch38, a Python library built and optimised for machine learning and deep learning. Code will be available upon request.
4.5 Identifying significantly associated genes
A null distribution of accuracies, Anull, was obtained from a set of NNs trained on the same covariate encodings along with simulated SNP data generated using the “dummy” command of PLINK 2.039, 40. The P value of window i was obtained as 1 - CDFnull(ai), where CDFnull is the cumulative distribution function of the distribution fit to Anull.
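The P value calculation above can be sketched as follows. The family of distribution fitted to Anull is not stated in this section, so a normal distribution is used here purely as a stand-in; the function names are hypothetical. The sketch also checks an observation rather than a stated fact: the genome wide threshold of 6.82x10-7 used in the Results is consistent with dividing 0.05 by the roughly 73,300 windows tested per phenotype.

```python
from statistics import NormalDist, mean, stdev

def window_p_value(accuracy, null_accuracies):
    """P = 1 - CDF_null(accuracy).

    Assumption: the null accuracies are summarised by a normal fit;
    the distribution actually fitted may differ.
    """
    null = NormalDist(mean(null_accuracies), stdev(null_accuracies))
    return 1.0 - null.cdf(accuracy)

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

# 73,310 is the number of maternal-history models given in the text.
threshold = bonferroni_threshold(0.05, 73310)
```

A window whose test accuracy lies far in the upper tail of the null accuracies receives a small P value, while an accuracy equal to the null mean gives a P value of about 0.5.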
4.6 Meta-analysis for parental histories of AD
After conducting separate analyses for maternal and paternal histories of AD, for each phenotype, we assigned a gene-level P value as the most significant P value amongst all windows for that gene. Following this, we used METAL41 to perform a gene-level meta-analysis of the parental histories of AD (maternal and paternal). Since NNs do not produce beta values or standard errors, the “sample size weighted” method was used and the “weight” parameters for maternal and paternal histories were set to 7,840 and 3,804 respectively, the number of samples in the testing set for each analysis.
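The two steps above can be sketched in a few lines. This is an illustration of the sample size weighted scheme as it is generally described for METAL (P values converted to Z scores, weighted by the square root of the sample size), not a reimplementation of METAL itself. It assumes that both analyses are given the same direction of effect, which the text does not state, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def gene_p_value(window_p_values):
    """Gene-level P value: the most significant (smallest) window P value."""
    return min(window_p_values)

def sample_size_weighted_meta(p_values, sample_sizes):
    """Combine P values using Z scores weighted by sqrt(N).

    Assumptions: P values are treated as two-sided and every study has
    the same direction of effect.
    """
    z_scores = [-_STD_NORMAL.inv_cdf(p / 2.0) for p in p_values]
    weights = [sqrt(n) for n in sample_sizes]
    z_meta = (sum(z * w for z, w in zip(z_scores, weights))
              / sqrt(sum(w * w for w in weights)))
    return 2.0 * _STD_NORMAL.cdf(-abs(z_meta))

# Hypothetical gene that reaches P = 1e-6 in both analyses,
# combined with the test-set sizes given in the text.
combined = sample_size_weighted_meta([1e-6, 1e-6], [7840, 3804])
```

With these weights the maternal analysis contributes more to the combined statistic than the paternal one, in proportion to the square roots of 7,840 and 3,804.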
4.7 Enrichment and post-hoc analyses
LDLink42 was used to find the LD between SNPs using the “British in England and Scotland” population. We used information from the Agora AD knowledge portal (https://agora.adknowledgeportal.org) to identify genes that have (i) significant eQTLs in the brain; (ii) change in RNA expression in post-mortem AD brains; (iii) AD target nominations; and (iv) association with AD cognitive phenotypes (Braak, CERAD and COGDX). STRING V11.543 was used to perform PPI analysis of the genes that were significantly associated after the parental meta-analysis. Pathway enrichment was performed using GSEA44. This enrichment was performed using the gene-level statistics for the parental meta-analysis for all analysed genes. The enrichment was performed for KEGG, Wiki and Reactome pathways present in the canonical pathways of MSigDB v7.5.145. We also performed an enrichment analysis of DEG sets in AD post-mortem brain regions (the meta-analysis of AD brain transcriptomic data17 and GSE11855318) using the same method, except for the multiple testing correction: here we applied a Bonferroni correction for the number of DEG sets that were analysed. In addition to the GSEA, we performed a GO ORA of the GWANN hits using ClueGO v2.5.846 and a disease enrichment analysis with DisGeNET47. FUMA48 was used to identify tissues enriched using the GWANN genes with P value<1x10-5. Finally, we used TargetDB49 to get a picture of the tractability or suitability of the novel GWANN hits for intervention by modalities such as small molecules or antibodies. The details and parameters of implementation for the different analyses can be found in the Supplementary material.
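For readers unfamiliar with GSEA, the statistic at its core can be sketched as a weighted running sum over a ranked gene list. The sketch below follows the standard published definition of the enrichment score and is not the GSEA software used in this work. The choice of ranking statistic (for instance the negative log10 of the gene-level P value) is an assumption, and the permutation step that turns the score into a P value is omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes and scores must already be sorted from the most to the
    least significant gene. Each gene in the set raises the running sum in
    proportion to |score|**p; each gene outside it lowers the sum by a
    constant. The score is the largest deviation of the sum from zero.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    hit_total = sum(abs(s) ** p for s, hit in zip(scores, in_set) if hit)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for s, hit in zip(scores, in_set):
        running += (abs(s) ** p / hit_total) if hit else (-1.0 / n_miss)
        if abs(running) > abs(best):
            best = running
    return best
```

A gene set concentrated at the top of the ranking gives a score close to 1, one concentrated at the bottom gives a score close to -1, and a set scattered evenly through the list stays near 0.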
Data Availability
All data produced in the present work are contained in the manuscript and supplementary information and files.
Funding
This work was supported by the Centre for Artificial Intelligence in Precision Medicines; Johnson and Johnson; the John Fell Foundation [grant ID 0010659]; and the Virtual Brain Cloud from European commission [grant number H2020-SC1-DTH-2018-1]. C.A. is funded by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.
Conflicts of Interest / Competing Interests
W.S. is funded by Johnson and Johnson. A.N.H. received funding from Johnson and Johnson, GlaxoSmithKline and Ono Pharma.
Author Contributions
U.G., W.S. and A.N.H. theorised and designed the method. W.S. carried out the pre-processing and quality control of the genomic data. U.G. developed the method and performed the experiments. U.G., W.S., L.W. and M.F. structured the post-hoc analyses. M.F. and U.G. carried out the different post-hoc analyses. M.F., D.N. and U.G. carried out the identification of drugs associated with the GWANN hits. B.U. performed the comparison of GWANN summary statistics with the EADB GWAS summary statistics. Q.L. helped with optimising the code pipeline for training the NNs. U.G., W.S., L.W. and M.F. wrote the manuscript. L.S., C.A., A.A., M.A., H.C. and C.V.D. provided critical feedback that helped revise and finalise the manuscript. A.N.H. was the principal supervisor for the work carried out in this manuscript.
Acknowledgments
We thank the UK Biobank participants and the UK Biobank team for their work in collecting, processing, and disseminating these data for analysis. This research was conducted using the UK Biobank Resource under the approved project 15181. The results published here are in part based on data obtained from Agora (https://agora.adknowledgeportal.org), a platform initially developed by the NIA-funded AMP-AD consortium that shares evidence in support of AD target discovery.
Footnotes
upamanyu.ghose{at}psych.ox.ac.uk, william.sproviero{at}psych.ox.ac.uk, laura.winchester{at}psych.ox.ac.uk, marco.fernandes{at}psych.ox.ac.uk, danielle.newby{at}psych.ox.ac.uk, brittany.ulm{at}ndph.ox.ac.uk, liu.shi{at}psych.ox.ac.uk, qiang.liu{at}psych.ox.ac.uk, cassandra.adams{at}cmd.ox.ac.uk, aalbukhari{at}kau.edu.sa, malmansouri{at}kau.edu.sa, hchoudhry{at}kau.edu.sa, cornelia.vanduijn{at}ndph.ox.ac.uk, alejo.nevado-holgado{at}psych.ox.ac.uk
ABBREVIATIONS: NN - Neural Network; GWANN - Genome Wide Association Neural Networks; UKBB - UK BioBank; PPI - Protein-Protein Interaction; GSEA - Gene Set Enrichment Analysis; ORA - Over Representation Analysis; CDF - Cumulative Distribution Function; GO - Gene Ontology