Systematic single-variant and gene-based association testing of 3,700 phenotypes in 281,850 UK Biobank exomes

Genome-wide association studies have successfully discovered thousands of common variants associated with human diseases and traits, but the landscape of rare variation in human disease has not been explored at scale. Exome sequencing studies of population biobanks provide an opportunity to systematically evaluate the impact of rare coding variation across a wide range of phenotypes to discover genes and allelic series relevant to human health and disease. Here, we present results from systematic association analyses of 3,700 phenotypes using single-variant and gene tests of 281,850 individuals in the UK Biobank with exome sequence data. We find that the discovery of genetic associations is tightly linked to frequency as well as correlated with metrics of deleteriousness and natural selection. We highlight biological findings elucidated by these data and release the dataset as a public resource alongside a browser framework for rapidly exploring rare variant association results.


Introduction
Coding variation has been the most readily interpretable class of genomic variation since the development of the gene model and mapping of the human genome. As such, it has facilitated the mapping and interpretation of variants with immediate clinical importance such as the American College of Medical Genetics actionable variant list (1). More recently, exome sequencing has yielded the discovery of specific causal variants for hundreds of rare diseases, particularly dominant acting de novo variants for severe diseases (2).
As the sample sizes of exome sequencing datasets continue to grow, so do the opportunities to identify associations between rare variants and phenotypes (both complex traits and diseases). In complex diseases, identifying causal genetic factors for a given disease can provide direct insight into the potential for therapeutic avenues. For instance, gain-of-function variants in PCSK9 have been demonstrated to increase LDL levels and thus risk for cardiovascular disease (3). Accordingly, loss-of-function (LoF) variants are protective for cardiovascular disease (4), and less than 15 years after the discovery of this effect, therapeutic approaches to inhibit PCSK9 have been brought to market (5).
Deeply phenotyped biobanks present a unique opportunity to simultaneously analyze multiple diseases and traits within a single cohort. These datasets enable the discovery of new disease genes with therapeutic potential at a large scale across phenotypes. For instance, a recent joint association analysis of the UK Biobank and the FinnGen study cohorts revealed rare variants in ANGPTL7 that protect against glaucoma (6). The UK Biobank is a collection of over 500,000 participants with standardized, detailed phenotypic data on which GWAS have been run extensively. Recent studies have leveraged previous releases of the exome sequence data to explore various aspects of rare variant associations in this dataset (7)(8)(9). Here, we present and publicly release results from a systematic, large-scale rare variant association analysis of 3,700 phenotypes in more than 300,000 exome sequenced individuals.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

Generating high-quality exome data for rare variant associations
We built an end-to-end pipeline for read mapping, processing, joint variant calling, quality control, and mixed model association analysis, and applied this pipeline to 302,325 individuals with exome sequence data from the UK Biobank. The read mapping and processing pipeline adopted the GATK Best Practices pipeline (GRCh38), and the resulting gVCF files were joint-called using a scalable implementation in Hail ( Supplementary Information; Fig. S1) (10).
We processed a set of 3,700 phenotypes including 1,117 quantitative traits as well as 2,583 binary traits with at least 200 cases, which included 681 disease endpoints based on ICD-10 codes (Fig. S2).
After performing quality control (QC) in a similar but augmented (e.g. array concordance; see Supplementary Information) manner as for the Genome Aggregation Database (gnomAD) (11), we generated a high-quality dataset of 286,310 individuals (Figs. S3 to S5; table S1), including 281,850 individuals of European ancestry in which we find 20,343,543 high-quality variants (Fig. S6). For each of 19,591 protein-coding genes, we considered up to three functional annotation categories: predicted LoF (pLoF), missense (including low-confidence pLoF variants and in-frame indels), and synonymous, resulting in 57,650 groups for association testing (i.e., one group per gene and functional annotation category).

Creating a high-quality set of rare variant associations
We performed group tests using the mixed model framework SAIGE-GENE (12), which includes single-variant tests and gene-based burden (mean) and SKAT-O (hybrid variance/mean) tests (Fig. S7). In total, we performed up to 7,575,993 single-variant tests and 57,650 group tests for each of 3,700 phenotypes ( Fig. 1). Additionally, we randomly generated 314 heritable phenotypes to test the asymptotic properties of the mixed-model association testing framework (Figs. S8 to S9), and to determine empirical p-value thresholds for Type I error control. Based on this analysis, for each phenotype, we consider genome-wide p-value . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  , and groups (i.e., gene-annotation pairs; C) before and after quality control. After quality control, the number of variants (D) and genes (E) are broken down by annotation and frequency bin (alternate allele frequency for variants, cumulative allele frequency for genes).
We performed extensive quality control on these summary statistics ( Fig. 1; Table S2; Supplementary Information), including an 0.01% minor and cumulative allele frequency filter for variants and genes, respectively, as well as genomic control (lambda GC) for each phenotype and each gene (Figs. S11 to S15). Further, we pruned to a set of 1,362 high-quality independent phenotypes encompassing 559 continuous traits and 803 binary traits, including 310 ICD codes (Figs. 1A, S16; Table S2). We confirm the robustness of our results by comparing them to a previous large-scale study of height (Tables S3 to S5, Fig. S17) and red . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
We filtered to 349,941 variants, including 8,865 pLoF variants, 208,723 missense variants, and 132,353 synonymous variants with a cohort frequency of at least 0.01% (corresponding to an allele count of approximately 60; Fig. 1B). For group tests, we filter to a high-quality set of 40,492 gene tests with at least 20X coverage (Fig. S13) and an aggregate allele frequency of at least 0.01% for pLoF (N=9,558 genes), missense (N=15,457), and synonymous (N=15,477) (Fig. 1C).
Using these criteria, we identified a total of 27,421, and 4,560 associations meeting our p-value threshold, with a mean of 20.1 and 3.35 associations per phenotype, for single-variant tests and group tests, respectively (disease results shown in Fig. 2A-B). Comparing the group test results to single-variant association test results, we find 1,069 associations (on average 0.8 per phenotype) from group tests where no association reached our p-value threshold for any single variant in the corresponding gene (Fig. 2C).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 23, 2021. ; The number of gene-level associations per phenotype is shown as a barplot, broken down by trait type (left) and normalized within each trait type (right). The single-variant tests are grouped into genes where at least one associated variant is necessary to be "Significant by variant" which is shown alongside group tests ("Significant by gene") as well as genes where an association is found both for group and single-variant tests.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Displaying rare variant associations
The utility of human genetic variation datasets are substantially enhanced when made accessible in the form of online portals that enable non-technical domain experts to quickly browse, interpret, and export results for downstream follow-up (15). We extended our gnomAD browser toolkit to create the genebass (gene-biobank association summary statistics) browser

Frequency and selection affect the landscape of rare variant associations
A major complicating factor in the analysis and interpretation of association statistics, particularly from rare variants, is the relationship between natural selection, allele frequency, effect size, and power for discovery. Sham et al. showed that the power to detect association is proportional to the variance explained of a biallelic variant (16). Specifically, for a continuous trait the variance explained of a biallelic variant that is purely additive is 2pqa 2 where p is the allele frequency, q = 1-p and a is the allelic effect of the variant. Thus, for a fixed effect size, a more common variant will capture more variance and by extension show stronger association.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 23, 2021. ; However, the process of negative selection will tend to decrease the frequency of functional damaging variants, suggesting that variants with large effect sizes are more likely to be rare. Indeed, partitioned heritability analyses for common variants support the presence of these countervailing forces, as comparatively lower frequency variants have larger absolute effect sizes but this growth in effect size is slower than the loss in variance explained from their lower frequency (17). In evaluating the landscape of rare variant association, we observe a similar pattern with increasing proportion of variants associated with at least one phenotype as frequency increases (Fig. 3A); however, within each frequency category, we observe the effect of functional annotation, a known correlate for deleteriousness, on the proportion of variants with at least one association.
Comparing the number of associations by variant annotation within each allele frequency category, we find that pLoF variants have a larger number of associations than missense variants, followed by synonymous variants for single-variant tests (Fig. 3A) as well as group tests (Fig. 3B). Within missense variants, variant deleteriousness as predicted by PolyPhen2 (18) is correlated with the number of associations meeting our p-value threshold (Fig. S18). For splice donor variants, we find a correlation between the proportion expressed across transcripts (pext) (19) and the number of associations (Fig. 3C). Additionally, the pathogenicity level of ClinVar variants is correlated with phenotypic association (Fig. 3D).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 23, 2021. ; . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Gene function influences association statistics
We examined the phenotypic impact of gene categories previously known to have functional relevance and/or a role in disease. In particular, we find that 470 genes previously implicated in developmental delay (20) are more likely to be associated with a phenotype in the UK Biobank (Fisher's exact p = 2 x 10 -3 ; Fig. 4). Further, we observe a correlation between selection against pLoFs in a gene and the phenotypic impact of pLoFs in that gene: specifically, constrained genes (i.e., those in the highest decile of LOEUF, a metric of loss-of-function intolerance) are more likely to be associated with a phenotype (5.9%) than a frequency-matched set of genes in the genome (2.0%; Fisher's exact p = 2.1 x 10 -6 ; Fig. 4). Similarly, autosomal dominant and autosomal recessive genes, as well as genes with previously established hits in the GWAS catalog and FDA approved drug targets, show an increased phenotypic impact of pLoFs and missense variants. Finally, cell essential genes show an increased proportion of associated phenotypes through a pLoF mechanism, while non-essential genes do not show an enrichment.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 23, 2021. ; Figure 4 | The effect of gene function on the landscape of rare variant associations. The proportion of gene-annotation pairs with at least one association (SKAT-O p < 2.5 x 10 -8 ) is shown for a number of gene categories, each compared to a background set of genes matched on cumulative allele frequency. Error bars represent 95% confidence intervals. Asterisks denote a significant difference between the background set and test set (* and ** indicate p < 0.05 and p < 0.001, respectively).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Biological insights from rare variant association results
The biological information encapsulated in this dataset is extremely high-dimensional, and we release the full dataset of results for the benefit of the community. Here, we highlight a set of known and putative associations as examples of the power of this dataset. First, we recapitulate many known associations from previous studies, including associations between PCSK9 and LDL cholesterol (pLoF burden p = 3 x 10 -94 ), COL1A1 and bone density (pLoF burden p = 1.5 x 10 -8 ) (21), KLF1 and several red blood cell traits (pLoF burden < 2 x 10 -8 ) (22), and LRP5 (Wnt coreceptor) and bone density and osteoporosis phenotypes (pLoF burden < 5 x 10 -7 ) (23).
Finally, we highlight novel biological signals identified in the exome dataset, enabled by the results browser. In particular, we find an association between predicted loss-of-function of SCRIB and white matter integrity of tapetum (Fig. 5). Notably, this association is not significant at any single pLoF variant, but when aggregated into a SKAT-O or burden group test, the overall ablation of the transcript is associated at a p-value of 6 x 10 -13 (Fig. 5A). This provides additional context to a signal observed in a recent GWAS of white matter integrity (24) averaged across regions of the brain, as well as in the body of corpus callosum (Fig. 5B). To our knowledge, this gene has not been associated in previous genome-wide association studies, although it is a constrained gene (pLI = 0.93) that shows evidence for neural tube defects in mice (25) with ultra-rare occurrences in humans (26).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 23, 2021. ;

Discussion
We have generated rare variant association analysis summary statistics for 3,700 phenotypes and made these data available to the public, via bulk data downloads as well as a public-facing browser (https://genebass.org).
There are a number of limitations to our analysis. Although we have performed extensive QC to improve the reliability of these results, we urge caution in interpreting association results, particularly for the rarest binary traits (prevalence < 10 -4 ) and for ultra-rare variants (frequency < . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 23, 2021. ; https://doi.org/10.1101/2021.06.19.21259117 doi: medRxiv preprint 10 -4 ). For pLoF variants, the median cumulative allele frequency across genes is approximately 1.5 x 10 -4 , suggesting that group tests at current sample sizes are only powered to detect individual gene effects for quantitative traits that capture at least 0.02% of variance, as well as diseases and traits that have a high prevalence (well above 10%; Fig. S10). This is further observed in the lack of asymptotic properties of the mixed-model tests for rarer binary traits (Fig.   S9). Nonetheless, global biological trends are apparent, such as the relative ordering of functional impact (pLoF > missense > synonymous; Fig. 3), highlighting that the ability to accurately annotate variants with the functional consequences on a gene is critical to powering discovery in rare variant analysis. Further, measures of natural selection at the gene level continue to highlight that certain classes of genes, such as LoF-intolerant genes, are clearly enriched for phenotypic associations.
Finally, these association analyses were only performed for individuals of European ancestry, the largest group in the dataset. Notably, these analyses only interrogate a slice of human genetic diversity, and expanding to additional ancestries has been shown to increase power and resolution for genetic discovery (27-29); however, as the sample sizes of non-European individuals in the UK Biobank dataset are very limited, these analyses would be underpowered for most binary traits including many disease outcomes. Additional datasets with large sample sizes of diverse individuals will be required to overcome these limitations.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 23, 2021.