Abstract
Exome-sequencing studies have generally been underpowered to identify deleterious alleles with a large effect on complex traits as such alleles are mostly rare. Because the population of northern and eastern Finland has expanded considerably and in isolation following a series of bottlenecks, individuals of these populations have numerous deleterious alleles at a relatively high frequency. Here, using exome sequencing of nearly 20,000 individuals from these regions, we investigate the role of rare coding variants in clinically relevant quantitative cardiometabolic traits. Exome-wide association studies for 64 quantitative traits identified 26 newly associated deleterious alleles. Of these 26 alleles, 19 are either unique to or more than 20 times more frequent in Finnish individuals than in other Europeans and show geographical clustering comparable to Mendelian disease mutations that are characteristic of the Finnish population. We estimate that sequencing studies of populations without this unique history would require hundreds of thousands to millions of participants to achieve comparable association power.
Similar content being viewed by others
Main
Most alleles with demonstrated deleterious effects on phenotypes directly alter the structure or function of a protein1,2. Exome-sequencing studies aim to discover such alleles and demonstrate their association to common diseases and disease-related quantitative traits. However, exome-sequencing studies to date generally have identified few newly associated rare variants or genes3,4. The sample size that is required for such discoveries remains uncertain and theoretical analyses indicate that studies to date have been underpowered, as most deleterious variants are expected to be rare owing to purifying selection5. These previous analyses also suggest that the power to detect associations to deleterious alleles is highest in populations that have expanded in isolation after recent bottlenecks, as alleles passing through the bottlenecks may increase to much higher frequencies than in other populations6,7,8.
Finland exemplifies such a history. Bottlenecks occurred at the founding of early-settlement regions (southern and western Finland) 2,000–4,000 years ago and again with internal migration to late-settlement regions (northern and eastern Finland) in the fifteenth and sixteenth centuries9. Finland’s subsequent population growth (to approximately 5.5 million) generated sizable geographical sub-isolates in late-settlement regions.
This unique population history has resulted in ‘the Finnish Disease Heritage’10, 36 Mendelian diseases that are much more common in Finnish individuals than in other Europeans. These disorders concentrate in late-settlement regions of Finland10, and the genes responsible for them exhibit extreme enrichment of deleterious variants11,12,13. We created the Finnish Metabolic Sequencing (FinMetSeq) study to capitalize on the population history of late-settlement Finland to discover rare-variant associations with cardiovascular and metabolic disease-relevant quantitative traits through exome sequencing of two extensively phenotyped population cohorts, FINRISK and METSIM (Methods).
We successfully sequenced 19,292 FinMetSeq participants and tested the identified variants for association with 64 clinically relevant quantitative traits, discovering 43 novel associations with deleterious variants14,15: 19 associations (11 traits) in FinMetSeq alone and 24 associations (20 traits) in a combined analysis of FinMetSeq with 24,776 Finns from three cohorts with imputed genome-wide genotypes. Of the 26 variants that underlie these 43 associations, 19 were unique to Finland or enriched more than 20-fold in FinMetSeq compared to non-Finnish Europeans (NFE). These enriched alleles cluster geographically like Finnish Disease Heritage mutations, indicating that the distribution of trait-associated rare alleles may vary significantly between locations within a country.
We demonstrate that exome sequencing in a historically isolated population that expanded after recent population bottlenecks is an efficient strategy to discover alleles with a substantial effect on quantitative traits. As most of the novel, putatively deleterious trait-associated variants that we identified are unique to or highly enriched in Finland, we estimate that similarly powered studies of these variants in non-Finnish populations would require hundreds of thousands or millions of participants.
Genetic variation
In 19,292 successfully sequenced exomes, we identified 1,318,781 single-nucleotide variants and 92,776 insertion or deletion variants (Supplementary Tables 1–3 and Supplementary Information). Compared to NFE control exomes (gnomAD v.2.1, Extended Data Fig. 1a), FinMetSeq exomes showed depletion of singletons and doubletons and excess variants with minor allele count (MAC) ≥ 5, particularly for predicted-deleterious alleles (Extended Data Fig. 1b).
Association analyses
We tested for association between genetic variants in FinMetSeq and 64 clinically relevant quantitative traits after standard adjustments for medications and covariates, and transformation to normality for analyses (Methods, Supplementary Tables 4, 5). Out of 64 traits, 62 exhibited significant heritability with common single-nucleotide variants (P < 0.05; 5% < h2 < 53%; Extended Data Fig. 2a, Supplementary Table 6), with substantial phenotypic and genetic correlations between traits (Extended Data Fig. 2b).
Single-variant association tests with genetic variants with MAC ≥ 3 among the 3,558 to 19,291 individuals measured for each trait (Supplementary Tables 4, 5) identified 1,249 associations (P < 5 × 10−7) at 531 variants (Supplementary Table 7); 53 traits were associated with at least one variant (Fig. 1a). All 1,249 associations remained significant after adjustment for multiple testing (exome-wide and across the 64 traits using a hierarchical procedure setting average the false discovery rate (FDR) to 5%; see Methods). Using this procedure on the 531 associated variants, we detected 287 more associations (Supplementary Table 8), most of which reflected a high correlation between lipid traits. Of the 531 variants, those with a greater than 10× frequency in FinMetSeq compared to NFE were more likely to be trait-associated (odds ratio = 4.92, P = 2.6 × 10−5; Extended Data Fig. 1c).
After clumping associated variants within 1 megabase (Mb) and with r2 > 0.5 into single loci (Methods), the 531 associated variants represented 262 distinct loci (597 trait–locus pairs; Supplementary Table 7). The number of associated loci per trait correlated positively with trait heritability (r = 0.38, P = 8.8 × 10−4), although height was a notable outlier (Fig. 1b).
Most variants and loci (61%) were associated with a single trait; 4% were associated with ≥10 traits. Overlapping associations (Extended Data Fig. 3a) reflect both phenotypic and genetic correlations and the estimated genetic correlation of trait pairs predicts shared loci between traits (Extended Data Fig. 3b). Gene-based association tests revealed 54 associations with P < 3.88 × 10−6 and multi-trait FDR-corrected P < 0.05 (Methods and Supplementary Table 9), including 10 traits associated with APOB (Extended Data Fig. 4) and a novel association of SECTM1 with high density lipoprotein cholesterol subfraction 2 (HDL2-C) (Extended Data Fig. 5).
To determine which of the 1,249 single-variant associations are distinct from previous GWAS findings, we repeated the association analysis for each trait conditioning on published associated variants in the EBI GWAS Catalog (as per December 2016, Methods); 478 associations at 126 loci remained significant (P < 5 × 10−7), including at least one association for 48 traits (Supplementary Table 10). Conditionally associated variants were more often rare (24% versus 11%), more likely protein-altering (31% versus 22%) and more frequently >10× enriched in FinMetSeq relative to NFE (19% versus 10%) than associated variants overall.
Replication and follow-up
We attempted to replicate the 478 single-variant associations (unconditional and conditional P ≤ 5 × 10−7) and follow up on 2,120 sub-threshold associations from FinMetSeq (unconditional 5 × 10−7<P ≤ 5 × 10−5 and conditional P ≤ 5 × 10−5) in 24,776 participants from three Finnish cohort studies: FINRISK16,17 participants not in FinMetSeq (n = 18,215), Northern Finland Birth Cohort 196618 (n = 5,139) and Helsinki Birth Cohort19 (n = 1,412), all imputed using the Finnish SISu v.2 reference panel (www.sisuproject.fi). Following association analysis within each cohort, we conducted a meta-analysis of the three imputation-based studies to test for replication of FinMetSeq variants (replication analysis) and a four-study meta-analysis with FinMetSeq to follow up on suggestive associations (combined analysis).
Of 448 significant variant–trait associations with replication data, 392 (87.5%) replicated at P < 0.05 (Supplementary Table 11). Of the 1,417 sub-threshold associations, 431 reached P < 5 × 10−7 in the combined analysis (Supplementary Table 12); more than 60% of the variants were absent from the reference panel and thus could not be tested further.
Among the significant associations from FinMetSeq or the combined analysis, 43 associations were with 26 predicted deleterious variants (6 protein truncating variants (PTVs) and 20 missense variants) that conditional analysis and literature review suggest are novel (Table 1). Of those, 19 associations (15 variants) were significant in FinMetSeq (Table 1 and Supplementary Table 11); another 24 associations (16 variants) reached significance in the combined analysis (Table 1 and Supplementary Table 12). Furthermore, 34 out of 43 associations were with 19 variants either found only in Finland or enriched more than 20-fold in FinMetSeq compared to NFE. The identification of associations for these 19 variants would have required much larger samples in NFE populations than in FinMetSeq (Fig. 2a, b). We provide brief summaries relating some of these associations to known biology and previously described genetic evidence (Table 1, expanded version in Supplementary Table 13; see Supplementary Information), highlighting here the most notable findings.
Anthropometric traits
A predicted damaging missense variant (Arg94Cys) in THBS4, which was 45× more frequent in FinMetSeq than in NFE, was associated in the combined analysis with a mean 5.9 kg decrease in body weight. THBS4 encodes thrombospondin 4, a matricellular protein that is found in blood vessel walls and highly expressed in heart and adipose tissues20. THBS4 may regulate vascular inflammation21 and has been implicated in the risk of heart disease22.
A predicted damaging missense variant (Val104Met) in DLK1, which was 177× more frequent in FinMetSeq than in NFE, was associated in the combined analysis with a mean 1.3 cm decrease in height. DLK1 encodes delta-like notch ligand 1, an epidermal growth factor that interacts with fibronectin and inhibits adipocyte differentiation. Uniparental disomy of DLK1 causes Temple and Kagami–Ogata syndromes, which are characterized by growth restriction, hypotonia, joint laxity, motor delay and early onset of puberty23. Paternally inherited common variants near DLK1 are associated with childhood obesity, type 1 diabetes, age at menarche and precocious puberty24,25,26. Homozygous null mutations in the mouse orthologue Dlk1 lead to embryos with reduced size, skeletal length and lean mass27; in Darwin’s finches, single-nucleotide variants at this locus have a strong effect on beak size28.
High-density lipoprotein cholesterol
A predicted deleterious missense variant (Arg112Trp) in CD300LG is associated in FinMetSeq with a mean 0.95 mmol l−1 increase in high-density lipoprotein cholesterol (HDL-C) and is associated with increased HDL2-C and ApoA1. This variant, which is absent from NFE, has an opposite direction of effect from a previously reported deleterious missense variant in this gene29, which encodes a type-I cell-surface glycoprotein.
Amino acids
A stop gain variant (Arg722X) in ALDH1L1 is associated in FinMetSeq with reduced serum glycine levels and is absent from NFE; this trait may increase risk for cardiometabolic disorders30,31. ALDH1L1 encodes 10-formyltetrahydrofolate dehydrogenase, which competes with serine hydroxymethyltransferase to alter the ratio of serine to glycine in the cytosol. Gene-based tests suggest that additional PTVs and missense variants in ALDH1L1 alter glycine levels (P = 1.4 × 10−20; Extended Data Fig. 6 and Supplementary Table 9).
Ketone bodies
A predicted damaging missense variant (Phe517Ser) in ACSS1 is associated in the combined analysis with increased serum acetate levels and is absent from NFE. ACSS1 encodes an acyl-coenzyme A synthetase and has a role in the conversion of acetate to acetyl-CoA. In rodents, increased acetate levels lead to obesity, insulin resistance and metabolic syndrome32.
Trait-associations and disease end points
Genotype data from FinnGen33 enabled us to test whether deleterious variants responsible for our novel trait associations contributed to related disease end points. We examined 22 diseases for the 25 available variants shown in Table 1; 3 variants were associated with diseases in FinnGen at a Bonferroni threshold value of P < 0.05/(22 × 25) = 9.0 × 10−5 (Supplementary Table 14).
A predicted damaging missense variant (Ser328Pro) in KRT40, which is associated in FinMetSeq with elevated HDL-C but is absent in NFE, is associated in FinnGen with increased risk of pancreatitis. Although this is the first disease association reported for KRT40, type-I keratins regulate exocrine pancreas homeostasis34. A 29-bp deletion that causes a frameshift in FAM151A is associated in FinMetSeq with decreased total cholesterol in intermediate-density lipoproteins (IDL-C) and decreased concentration of IDL particles, is 6.7× more frequent in FinMetSeq than NFE and is associated in FinnGen with decreased risk of myocardial infarction. Interpretation of this association is complicated as the variant is also situated in an overlapping gene (ACOT11), which is involved in fatty acid metabolism and lies <1Mb from a cardioprotective variant in PCSK9. Finally, a predicted damaging missense variant (Arg65Trp) in DBH, which is associated with a mean 1.0 mm Hg decrease in diastolic blood pressure in the combined analysis, is 23.8× more frequent in FinMetSeq than in NFE, and is associated in FinnGen with decreased risk of hypertension. Distinct loci in this gene and gene-based tests are associated with mean arterial pressure35,36.
Replication outside Finland
To assess the generalizability of these novel associations, we attempted to replicate associations from our combined analysis with data from the UK Biobank. Across 8 anthropometric and blood pressure traits for which UK Biobank data are publicly available, our combined analysis identified 31 trait–variant associations, of which 23 were present in the UK Biobank. Of the 23 associations, 20 were to variants with a minor allele frequency (MAF) > 1% in FinMetSeq and a comparable frequency in UK Biobank; 15 (75%) showed association in UK Biobank at P < 0.05/23 = 2.2 × 10−3. The three rare variants in this analysis were all more than 10× more frequent in FinMetSeq than in UK Biobank; none were associated in UK Biobank (Supplementary Table 15). However, even after adjusting for winner’s curse37, we had <50% power to detect these associations in UK Biobank, consistent with the argument that extremely large samples will be needed in other populations to achieve the power for rare-variant association studies that we observed in Finland.
Enriched variants cluster geographically
Given the concentration of Finnish Disease Heritage mutations within regions of late-settlement Finland38, we hypothesized that trait-associated variants discovered through FinMetSeq would also cluster geographically. Principal component analysis supported this hypothesis, revealing a broad-scale population structure within late-settlement regions among 14,874 unrelated FinMetSeq participants with known parental birthplaces (Extended Data Fig. 7). Carriers of PTVs and missense alleles showed more clustering of parental birthplaces than carriers of synonymous alleles, even after adjusting for MAC (Supplementary Table 16a, b).
To analyse the distribution of variants within late-settlement Finland, we delineated geographically distinct population clusters using haplotype sharing among 2,644 unrelated individuals with both parents born in the same municipality (Methods and Extended Data Fig. 8). We compared variant counts across functional classes and frequencies between an early-settlement reference cluster and 12 clusters containing ≥100 individuals (Extended Data Fig. 9 and Supplementary Tables 17, 18). Clusters that represent the most heavily bottlenecked late-settlement regions (Lapland and Northern Ostrobothnia) displayed a deficit of singletons and enrichment of intermediate frequency variants compared to other clusters.
Variants that were more than 10× enriched in FinMetSeq compared to NFE displayed particularly strong geographical clustering (Supplementary Table 19). We further characterized clustering for FinMetSeq-enriched trait-associated variants, by comparing mean distances between birthplaces of parents of minor allele carriers to those of non-carriers (Supplementary Table 20). Most of these variants were highly localized. For example, for rs780671030 in ALDH1L1, the mean distance between parental birthplaces is 135 km for carriers and 250 km for non-carriers (P < 1.0 × 10−7, Fig. 3a).
Finally, we identified comparable geographical clustering between carriers of 35 Finnish Disease Heritage mutations and carriers of FinMetSeq-enriched trait-associated variants (Fig. 3b and Methods). Clustering was considerably greater in carriers than clustering observed for non-carriers of both sets of variants, suggesting that rare trait-associated variants may be much more unevenly distributed geographically than has previously been appreciated.
Discussion
We demonstrate that a well-powered exome-sequencing study of deeply phenotyped individuals can identify numerous rare variants that are associated with medically relevant quantitative traits. The variants that we identified provide a useful starting point for studies aimed at uncovering biological mechanisms and fostering clinical translation. The power of this study to discover rare-variant associations derives from the numerous deleterious variants that are enriched in or unique to Finland. Prioritizing the sequencing of multiple population isolates that have expanded from recent bottlenecks is a strategy for increasing the scale of the discovery of rare-variant associations7,39,40,41. Because genetic drift results in a different set of alleles to pass through population-specific bottlenecks, thus enriching some variants and depleting others, the numerous rare-variant associations that could be identified by sequencing of well-phenotyped samples across multiple isolates could rapidly increase our understanding of the genetic architecture of complex traits.
Our results support recent suggestions of continuity between the genetic architectures of complex traits and disorders that are classically considered monogenic42,43, by identifying numerous deleterious variants with large effects on quantitative traits that demonstrate geographical clustering comparable to the clustering of the mutations responsible for the Finnish Disease Heritage.
Using a Finland-specific reference panel44 to impute FinMetSeq variants into array-genotyped samples from three other Finnish cohorts enabled us to identify additional novel associations. However, the clustering in FinMetSeq of deleterious trait-associated variants within limited geographical regions and our inability to follow up on more than 700 sub-threshold associations from FinMetSeq for which the associated variants were absent in the Finnish imputation reference panel, emphasize the importance of representing regional subpopulations in such reference panels, to account for fine-scale population structures.
The value of rare-variant studies in population isolates will depend on the richness of phenotypes in sequenced cohorts from these populations. For example, we associated fewer than 100 of the more than 24,000 deleterious, highly enriched variants identified in FinMetSeq with any of the 64 quantitative traits studied here. The associations that we identified to disease end points in FinnGen hint at the discoveries that will be possible when that database reaches its full size of 500,000 participants. The insights gained from such efforts will accelerate the implementation of precision health, informing projects in more heterogeneous populations that are still at an early stage45.
Methods
Data reporting
No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.
Study designs, phenotypes, and sequenced participants of the METSIM and FINRISK studies
METSIM is a single-site study investigating cardiometabolic disorders and related traits in 10,197 men randomly selected from the population register of Kuopio, Eastern Finland, aged 45 to 73 years at initial examination from 2005 to 2010. We attempted exome sequencing of all METSIM study participants15,46.
FINRISK is a series of health examination surveys47 based on random population samples from five (six in 2002) geographical regions of Finland, carried out every five years beginning in 1972. For exome sequencing, we chose 10,192 participants in the 1992–2007 FINRISK surveys from northeastern Finland (former provinces of North Karelia, Oulu and Lapland).
All participants in both studies provided informed consent, and study protocols were approved by the Ethics Committees at participating institutions (National Public Health Institute of Finland; Hospital District of Helsinki and Uusimaa; Hospital District of Northern Savo). All relevant ethics committees approved this study.
Selection of traits, harmonization, exclusions, covariate adjustment and transformation
Of the 257 quantitative traits measured in both METSIM and FINRISK, we selected 64 for association analysis in FinMetSeq based on clinical relevance for cardiovascular and metabolic health (Supplementary Tables 4, 5). We excluded individuals with type 1 diabetes and women who were pregnant at the time of phenotyping from all analyses; individuals with type 2 diabetes from analyses of glycaemic traits; and individuals who had not fasted for at least 8 h after their last meal for traits influenced by food consumption. A complete list of exclusions can be found in Supplementary Table 5. We adjusted measured values of systolic and diastolic blood pressures for individuals on antihypertensive medication at the time of testing48,49, and serum lipid measures for individuals on lipid-regulating medications50,51. Trait adjustments are listed in Supplementary Table 5.
We prepared quantitative traits for association analysis separately for METSIM and FINRISK by linear regression on trait-specific covariates after log-transforming skewed variables. Covariates for regression analyses included: age and age2 (METSIM); sex, age, age2 and cohort year (FINRISK). Trait transformations and trait-specific covariates are listed in Supplementary Table 5. Several traits were adjusted for sex hormone treatment, which included women on contraceptives or hormone-replacement therapy. We transformed residuals from these initial regression analyses to normality using inverse normal scores.
Exome sequencing
We carried out exome sequencing in two phases.
Phase 1. We quantified 10,379 DNA samples with PicoGreen (ThermoFisher Scientific) and randomly parsed samples with adequate DNA (>250 ng) into cohort-specific files. We then re-arrayed samples to ensure equal numbers of METSIM and FINRISK samples on each 96-well plate, alternating samples between studies in consecutive positions within and across plates, to minimize between-study batch effects.
Using 100–250 ng input DNA, we constructed dual-indexed libraries using the HTP Library Kit (KAPA Biosystems, target insert size of 250 bp), pooling 12 libraries before hybridization to the SeqCap EZ HGSC VCRome (Roche) exome reagent. After estimating the concentration of each captured library pool by qPCR (Kapa Biosystems) to produce appropriate cluster counts for the HiSeq2000 platform (Illumina), we generated 2× 100-bp paired-end sequencing data, yielding approximately 6 Gb per sample to achieve a coverage depth of ≥20× for ≥70% of targeted bases for every sample.
Phase 2. We quantified, prepared, pooled and captured 9,937 samples as described for phase 1. We generated 2× 125-bp paired-end sequencing reads on the HiSeq2500 1T to achieve the same coverage as described for phase 1.
Contamination detection, sequence alignment, sample quality control and variant calling
We aligned sequence reads to the human genome reference build 37 (bwa-mem, v.0.7.7), realigned insertions or deletions (indels) (GATK52 IndelRealigner v.2.4) and marked duplicates (Picard MarkDuplicates, v.1.113; http://broadinstitute.github.io/picard) and overlapping bases (BamUtil clipOverlap v.1.0.11; http://genome.sph.umich.edu/wiki/BamUtil:_clipOverlap).
For each sample, we required single-nucleotide variant (SNV) genotype array concordance >90% if SNV array data were available, excluding samples with estimated contamination >3% or sample swaps compared to existing genotype data (verifyBamID53 v.1.1.1; Supplementary Table 1).
We called SNVs and short indels with GATK52 (v.3.3, using recommended best practices) for all targeted exome bases and 500 bp of sequence up and downstream of each target region using HaplotypeCaller. We merged calls in batches of 200 individuals using CombineGVCFs and recalled genotypes for all individuals at all variable sites with GenotypeGVCFs.
After merging genotypes for the 19,378 samples that passed preliminary quality-control checks, we filtered SNVs and indels separately using the recommended best practices for variant quality score recalibration (VQSR). We used the true-positive variants in the GATK resource bundle (v.2.5; build37) to train the VQSR model after restricting to sites in targeted exome regions. After assessment with VQSR, we retained variants for which we identified ≥99% of true-positive sites used in the training model for both SNVs and indels.
Following initial variant filtering, we decomposed multi-allelic variants into bi-allelic variants, left-aligned indels and dropped redundant variants using vt54 (v.0.5). We filtered variants with >2% missing calls and/or Hardy–Weinberg P< 10−6. We additionally removed variants with an overall allele balance (alternate allele count/sum of total allele count) < 30% in genotyped samples. We excluded 86 individuals with >2% missing variant calls yielding a final analysis set of 19,292 individuals.
Array genotypes, genotype imputation and integrated exome + imputation panel
For all except 1,488 participants (57 METSIM, 1,431 FINRISK), previously generated array genotypes were available17,55, with which we generated three datasets: (1) a merged array-based call set of all variants present in ≥90% of array-genotyped individuals across both cohorts; (2) a merged array-based Haplotype Reference Consortium (HRC) v.1.1 imputed dataset using the Michigan Imputation Server56,57; (3) an integrated dataset containing HRC imputed genotypes and exome-sequence variants (excluding all individuals without array data, and using the sequence-based genotypes in cases in which there was overlap between sequenced and imputed genotypes).
Annotation
We annotated the final set of sequence variants that passed quality control using variant effect predictor (VEP v.76)58 of Ensembl using five in silico algorithms to predict the functional impact of missense variants: PolyPhen2 HumDiv and HumVar59, LRT60, MutationTaster61 and SIFT62.
Association testing
Single variants
We carried out single-variant association tests for transformed trait residuals with genotype dosages for variants with MAC ≥ 3 assuming an additive genetic model, using the EMMAX63 linear mixed model approach, as implemented in EPACTS (v.3.3.0; http://genome.sph.umich.edu/wiki/EPACTS), to account for relatedness between individuals. We used genotypes for sequenced variants with MAF ≥ 1% to construct the genetic relationship matrix.
Conditioning on associated variants from previous GWAS
To differentiate association signals identified here from known associations, we performed exome-wide association analysis for each trait conditioning on variants previously associated (P < 10−7) with that trait in the EBI GWAS catalogue (https://www.ebi.ac.uk/gwas/downloads; 4 December 2016 version)64, publications55,65,66,67 or manuscripts in preparation. The keywords from the GWAS catalogue that we used to assign known variants to each trait can be found in Supplementary Table 21. We also manually curated published associations for specific metabolites65,68.
Using the combined HRC and exome panel, we pruned each trait-specific list of associated variants (GWAS variants) based on linkage disequilibrium (r2 > 0.95). Of the 23 GWAS variants that were absent from the HRC and exome panel, we identified a proxy (r2 > 0.80) variant for 17; we excluded the remaining 6 variants from the conditional analysis. The variants included in the conditional analysis are listed in Supplementary Table 22. We extracted genotypes for variants used in conditional analysis from the HRC and exome panel and converted dosages to alternate allele counts by rounding to the nearest integer (0, 1 or 2). For conditional analyses, we imputed missing genotypes for the individuals without array data using the mean genotype. We then ran association analysis using the same linear mixed model approach as in unconditional analysis but including the complete set of pruned GWAS variants as covariates in the association test. We then evaluated the novelty of conditional associations by searching OMIM, ClinVar, and the literature.
Defining loci
To identify the number of distinct associations for each trait, we performed linkage disequilibrium clumping using Swiss (https://github.com/welchr/swiss) of variants with unconditional P < 5 × 10−7 or both unconditional and conditional P < 5 × 10−5 for at least one trait. For each variant in this subset, we provided Swiss with the minimum unconditional P value across all traits. The clumping procedure starts with the variant with the smallest P value, merges into one locus all variants within ±1Mb that have r2 > 0.5 with the index variant and iterates this process until no variants remain.
Calculating effects and variance explained of individual variants
For novel variants highlighted in Table 1, we evaluated the effect of each variant on the trait values by calculating the mean trait value in carriers and non-carriers. As the effect estimates from our association tests are standardized, we calculated variance explained for a given variant with the equation var. exp. = \(2f(1-f){\hat{\beta }}^{2}\), where f is the MAF and \(\hat{\beta }\) is the estimated effect size. The variance explained is included in Supplementary Table 10.
Gene-based testing
We carried out gene-based association tests using the mixed model implementation of SKAT-O69, considering three different, but nested, sets of variants (variant ‘masks’): (1) PTVs at any allele frequency with VEP annotations: frameshift_variant, initiator_codon_variant, splice_acceptor_variant, splice_donor_variant, stop_lost, stop_gained; (2) PTVs included in (1) plus missense variants with MAF < 0.1% scored as damaging or deleterious by all five functional prediction algorithms; (3) PTVs included in (1) plus missense variants with MAF < 0.5% scored as damaging or deleterious by all five algorithms.
For each trait and mask, we only tested genes with at least two qualifying variants. Each mask contained a different number of genes with at least two qualifying variants: up to 7,996, 12,795 and 12,890 for the three masks, respectively. The exact number of genes tested varied by trait owing to sample size. We first used a Bonferroni-corrected exome-wide threshold for 12,890 genes, which corresponds to a threshold of P < 3.88 × 10−6. Analogous to single-variant association, we passed genes that met this association threshold for additional consideration with hierarchical false-discovery rate (FDR) correction, as described below.
Hierarchical FDR correction for testing multiple traits and variants
To control for multiple testing across 64 traits, we adopted an FDR controlling procedure70, using a two-stage hierarchical strategy (described in the Supplementary Information). Stage 1 identifies the set of R variants (or genes) associated with at least one trait (P < 5 × 10−7 for single-variant unconditional results and P < 3.88 × 10−6 for gene-based results), controlling genome-wide FDR across all variants at P = 0.05. Stage 2 identifies all traits associated with the discovered variants in a manner that guarantees an average FDR P < 0.05.
Genotype validation
We validated exome-sequencing-based genotype calls using Sanger sequencing for METSIM carriers of 13 trait-associated very rare variants with MAF < 0.1% in seven genes, finding concordance for 107 out of 108 (99.1%) non-reference genotypes evaluated.
Replication in additional Finnish cohorts
We attempted to replicate significant single-variant associations (P < 5 × 10−7) and follow up suggestive single-variant associations (P < 5 × 10−5) using imputed array data from up to 24,776 individuals from three cohort studies: Northern Finland Birth Cohort 196618, the Helsinki Birth Cohort Study19 and FINRISK study participants not included in FinMetSeq16,17.
For each cohort, before phasing we performed genotype quality control batch-wise using standard quality thresholds. We pre-phased array genotypes with Eagle71 (v.2.3) and imputed genotypes genome-wide with IMPUTE72 (v.2.3.1) using 2,690 sequenced Finnish genomes and 5,092 sequenced Finnish exomes. We assessed imputation quality by confirming sex, comparing sample allele frequencies with reference population estimates and examining imputation quality (INFO score) distributions. We excluded any variant with INFO < 0.7 within a given batch from all replication/follow-up analyses.
For each cohort, we matched, harmonized, covariate adjusted and transformed available phenotypes as described above for FinMetSeq, and ran single-variant association using the EMMAX linear mixed model implemented in EPACTS, after generating kinship matrices from linkage disequilibrium-pruned (command: plink –indep-pairwise 50 5 0.2) directly genotyped variants with MAF > 5%.
Association to disease end points
From >1,100 disease end points available for analysis in FinnGen, we selected 22 that we considered most relevant to the traits analysed in FinMetSeq, identifying variant associations as described previously33.
Association replication in UK Biobank
For eight FinMetSeq anthropometric and blood pressure traits available in UK Biobank (height, weight, body mass index, hip circumference, waist circumference, fat percentage, systolic blood pressure and diastolic blood pressure), we extracted, for variants reaching P < 5 × 10−7 in our combined analysis, trait-variant association statistics from http://www.nealelab.is/uk-biobank. Of the 8 traits, 7 had at least one associated variant and 23 of the total of 31 variants were available in UK Biobank. A comparison of association results is in Supplementary Table 15.
Population genetic analyses
Identifying unrelated individuals
To identify nearly independent common SNVs, we removed SNVs with MAF < 5% and pruned the remaining SNVs in windows of 50 SNVs, in steps of 5 SNVs, such that no pair of SNVs had r2 > 0.2. We used KING73 to estimate pairwise relationships among the exome-sequenced individuals, removing one individual from each pair inferred by KING to have a relationship of third degree or closer, yielding 14,874 unrelated individuals for population genetic analyses.
Enrichment of predicted-deleterious alleles in Finland
We assessed enrichment of predicted-deleterious alleles in Finland by comparing the 14,874 nearly unrelated FinMetSeq individuals to the 14,944 NFE control exomes in gnomAD (after removing NFE individuals from countries with substantial Finnish populations, Estonia and Sweden). We analysed the two most common alleles at each site with base quality score >10, mapping quality score >20, and coverage equal to or greater than that found in ≥80% of variable sites (17.73× in FinMetSeq, 32.27× in gnomAD), resulting in around 38.6 Mb for comparisons. We contrasted the proportional site frequency spectra for FinMetSeq and NFE for five functional variant categories (PTVs, missense, synonymous, untranslated regions and intronic variants) after down-sampling both datasets to 18,000 chromosomes.
We also assessed the enrichment of deleterious alleles within subpopulations of the FinMetSeq dataset. We applied Chromopainter and fineSTRUCTURE to 2,644 unrelated FinMetSeq individuals whose parents were both born in the same municipality to identify 16 subpopulation clusters74 (Supplementary Information). Of the 16 clusters, we used as the reference population a cluster for which the highest proportion of the parents of its members were from early-settlement Finland (Northern Savonia population 3 (NSv3), Supplementary Table 17). We used the twelve clusters with >100 members in subsequent analyses (Supplementary Table 17). We then compared the ratio of the site frequency spectra to the reference for PTVs, missense and synonymous variants, down-sampling both datasets to 200 haploid chromosomes. For each comparison, we computed statistical evidence for enrichment or depletion at a given allele count bin by exact binomial test against a null of equal number of variants found in both the test and reference cluster.
Geographical clustering of predicted functionally deleterious alleles
We first generated a distance matrix tabulating the pairwise geographical distance between the birthplaces of all available parents of unrelated sequenced individuals. For each variant of interest, we computed for the minor allele carriers in FinMetSeq the mean distance among all parent pairs. We evaluated statistical significance of geographical clustering by comparing the observed mean distance to mean distances for up to 10,000,000 sets of randomly drawn non-carrier individuals matched by cohort status and number of parents with birthplace information available.
To assess whether PTVs or missense variants may be more geographically clustered than synonymous variants, we first identified a set of near-independent variants (r2 > 0.02) with MAC ≥ 3 and MAF ≤ 5% among the 14,874 unrelated individuals. For each variant, we computed the mean pairwise geographical distance between the birthplaces across all pairs of the available parents of carriers of the minor allele and regressed this mean distance on variant class (PTVs, missense or synonymous) and MAC, MAC2 and MAC3 (Supplementary Table 16). For those variants in gnomAD, we also assessed whether variants enriched in FinMetSeq compared to NFE are more likely to be geographically clustered. As above, we computed the mean pairwise distances among parents of carriers of the minor allele and regressed mean distance on the logarithm of enrichment and MAC, MAC2 and MAC3 (Supplementary Table 19). In both analyses, we assessed a model with the interaction terms but report only the model without interactions if the interactions were not significant.
Heritability estimates and genetic correlations
We used genome-wide array genotype data on the 13,326 unrelated individuals for whom both exome sequencing and array data were available to estimate heritability and genetic correlations for the 64 traits. We constructed a genetic relationship matrix with PLINK75 (v.1.90b, https://www.cog-genomics.org/plink2) by applying additional filters for MAF > 1% and genotype missingness rate < 2% to the set of previously used genotyped SNVs, leaving 205,149 SNVs for genetic relationship matrix calculation. We used the exact mixed model approach of biMM76 (v.1.0.0, http://www.helsinki.fi/~mjxpirin/download.html) to estimate the heritability of our 64 traits and the genetic correlation of the 2,016 trait pairs.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.
Data availability
The sequencing data can be accessed through dbGaP (https://www.ncbi.nlm.nih.gov/gap/) using study numbers phs000756 and phs000752. Association results can be accessed at http://pheweb.sph.umich.edu/FinMetSeq/ and are searchable via the Type 2 Diabetes Knowledge Portal (http://www.type2diabetesgenetics.org/). Summary statistics are also available through the NHGRI-EBI GWAS Catalog at https://www.ebi.ac.uk/gwas/downloads/summary-statistics.
Change history
04 November 2019
An Amendment to this paper has been published and can be accessed via a link at the top of the paper.
References
Samocha, K. E. et al. Regional missense constraint improves variant deleteriousness prediction. Preprint at https://www.bioRxiv.org/content/10.1101/148353v1 (2017).
Marouli, E. et al. Rare and low-frequency coding variants alter human adult height. Nature 542, 186–190 (2017).
Flannick, J. et al. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls. Nature 570, 71–76 (2019).
Timpson, N. J., Greenwood, C. M. T., Soranzo, N., Lawson, D. J. & Richards, J. B. Genetic architecture: the shape of the genetic contribution to human traits and disease. Nat. Rev. Genet. 19, 110–124 (2018).
Zuk, O. et al. Searching for missing heritability: designing rare variant association studies. Proc. Natl Acad. Sci. USA 111, E455–E464 (2014).
Xue, Y. et al. Enrichment of low-frequency functional variants revealed by whole-genome sequencing of multiple isolated European populations. Nat. Commun. 8, 15927 (2017).
Southam, L. et al. Whole genome sequencing and imputation in isolated populations identify genetic associations with medically-relevant complex traits. Nat. Commun. 8, 15606 (2017).
Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
Jakkula, E. et al. The genome-wide patterns of variation expose significant substructure in a founder population. Am. J. Hum. Genet. 83, 787–794 (2008).
Polvi, A. et al. The Finnish disease heritage database (FinDis) update—a database for the genes mutated in the Finnish disease heritage brought to the next-generation sequencing era. Hum. Mutat. 34, 1458–1466 (2013).
Manning, A. et al. A low-frequency inactivating AKT2 variant enriched in the Finnish population is associated with fasting insulin levels and type 2 diabetes risk. Diabetes 66, 2019–2032 (2017).
Lim, E. T. et al. Distribution and medical impact of loss-of-function variants in the Finnish founder population. PLoS Genet. 10, e1004494 (2014).
Service, S. K. et al. Re-sequencing expands our understanding of the phenotypic impact of variants at GWAS loci. PLoS Genet. 10, e1004147 (2014).
Würtz, P. et al. Quantitative serum nuclear magnetic resonance metabolomics in large-scale epidemiology: a primer on -omic technologies. Am. J. Epidemiol. 186, 1084–1096 (2017).
Laakso, M. et al. The Metabolic Syndrome in Men study: a resource for studies of metabolic and cardiovascular diseases. J. Lipid Res. 58, 481–493 (2017).
Borodulin, K. et al. Forty-year trends in cardiovascular risk factors in Finland. Eur. J. Public Health 25, 539–546 (2015).
Abraham, G. et al. Genomic prediction of coronary heart disease. Eur. Heart J. 37, 3267–3278 (2016).
Sabatti, C. et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat. Genet. 41, 35–46 (2009).
Pulizzi, N. et al. Interaction between prenatal growth and high-risk genotypes in the development of type 2 diabetes. Diabetologia 52, 825–829 (2009).
Fagerberg, L. et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol. Cell. Proteomics 13, 397–406 (2014).
Corsetti, J. P. et al. Thrombospondin-4 polymorphism (A387P) predicts cardiovascular risk in postinfarction patients with high HDL cholesterol and C-reactive protein levels. Thromb. Haemost. 106, 1170–1178 (2011).
Zhang, X. J. et al. Association between single nucleotide polymorphisms in thrombospondins genes and coronary artery disease: a meta-analysis. Thromb. Res. 136, 45–51 (2015).
Beygo, J. et al. New insights into the imprinted MEG8-DMR in 14q32 and clinical and molecular description of novel patients with Temple syndrome. Eur. J. Hum. Genet. 25, 935–945 (2017).
Wallace, C. et al. The imprinted DLK1-MEG3 gene region on chromosome 14q32.2 alters susceptibility to type 1 diabetes. Nat. Genet. 42, 68–71 (2010).
Day, F. R. et al. Genomic analyses identify hundreds of variants associated with age at menarche and support a role for puberty timing in cancer risk. Nat. Genet. 49, 834–841 (2017).
Perry, J. R. et al. Parent-of-origin-specific allelic associations among 106 genomic loci for age at menarche. Nature 514, 92–97 (2014).
Cleaton, M. A. et al. Fetus-derived DLK1 is required for maternal metabolic adaptations to pregnancy and is associated with fetal growth restriction. Nat. Genet. 48, 1473–1480 (2016).
Chaves, J. A. et al. Genomic variation at the tips of the adaptive radiation of Darwin’s finches. Mol. Ecol. 25, 5282–5295 (2016).
Surakka, I. et al. The impact of low-frequency and rare variants on lipid levels. Nat. Genet. 47, 589–597 (2015).
Ding, Y. et al. Plasma glycine and risk of acute myocardial infarction in patients with suspected stable angina pectoris. J. Am. Heart Assoc. 5, e002621 (2015).
Wittemans, L. B. L. et al. Assessing the causal association of glycine with risk of cardio-metabolic diseases. Nat. Commun. 10, 1060 (2019).
Perry, R. J. et al. Acetate mediates a microbiome–brain–β-cell axis to promote metabolic syndrome. Nature 534, 213–217 (2016).
Tabbassum, R. et al. Genetics of human plasma lipidome: understanding lipid metabolism and its link to diseases beyond traditional lipids. Preprint at https://www.biorxiv.org/content/10.1101/457960v1 (2018).
Casanova, M. L. et al. Exocrine pancreatic disorders in transsgenic mice expressing human keratin 8. J. Clin. Invest. 103, 1587–1595 (1999).
Surendran, P. et al. Trans-ancestry meta-analyses identify rare and common variants associated with blood pressure and hypertension. Nat. Genet. 48, 1151–1161 (2016).
Liu, C. et al. Meta-analysis identifies common and rare variants influencing blood pressure and overlapping with metabolic trait loci. Nat. Genet. 48, 1162–1170 (2016).
Palmer, C. & Pe’er, I. Statistical correction of the winner’s curse explains replication variability in quantitative trait genome-wide association studies. PLoS Genet. 13, e1006916 (2017).
Norio, R. Finnish Disease Heritage I: characteristics, causes, background. Hum. Genet. 112, 441–456 (2003).
Service, S. et al. Magnitude and distribution of linkage disequilibrium in population isolates and implications for genome-wide association studies. Nat. Genet. 38, 556–560 (2006).
Chiang, C. W. K. et al. Genomic history of the Sardinian population. Nat. Genet. 50, 1426–1434 (2018).
Rivas, M. A. et al. Insights into the genetic epidemiology of Crohn’s and rare diseases in the Ashkenazi Jewish population. PLoS Genet. 14, e1007329 (2018).
Bastarache, L. et al. Phenotype risk scores identify patients with unrecognized Mendelian disease patterns. Science 359, 1233–1239 (2018).
Niemi, M. E. K. et al. Common genetic variants contribute to risk of rare severe neurodevelopmental disorders. Nature 562, 268–271 (2018).
Surakka, I. The rate of false polymorphisms introduced when imputing genotypes from global imputation panels. Preprint at https://www.biorxiv.org/content/10.1101/080770v1 (2016).
Collins, F. S. & Varmus, H. A new initiative on precision medicine. N. Engl. J. Med. 372, 793–795 (2015).
Stancáková, A. et al. Changes in insulin sensitivity and insulin release in relation to glycemia and glucose tolerance in 6,414 Finnish men. Diabetes 58, 1212–1221 (2009).
Borodulin, K. et al. Cohort profile: the National FINRISK Study. Int. J. Epidemiol. 47, 696–696i (2017).
Wu, J. et al. A summary of the effects of antihypertensive medications on measured blood pressure. Am. J. Hypertens. 18, 935–942 (2005).
Tobin, M. D., Sheehan, N. A., Scurrah, K. J. & Burton, P. R. Adjusting for treatment effects in studies of quantitative traits: antihypertensive therapy and systolic blood pressure. Stat. Med. 24, 2911–2935 (2005).
Liu, D. J. et al. Exome-wide association study of plasma lipids in >300,000 individuals. Nat. Genet. 49, 1758–1766 (2017).
Friedewald, W. T., Levy, R. I. & Fredrickson, D. S. Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge. Clin. Chem. 18, 499–502 (1972).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
Jun, G. et al. Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. Am. J. Hum. Genet. 91, 839–848 (2012).
Tan, A., Abecasis, G. R. & Kang, H. M. Unified representation of genetic variants. Bioinformatics 31, 2202–2204 (2015).
Davis, J. P. et al. Common, low-frequency, and rare genetic variants associated with lipoprotein subclasses and triglyceride measures in Finnish men from the METSIM study. PLoS Genet. 13, e1007079 (2017).
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
The Haplotype Reference Consortium. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).
McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
Chun, S. & Fay, J. C. Identification of deleterious mutations within three human genomes. Genome Res. 19, 1553–1561 (2009).
Schwarz, J. M., Cooper, D. N., Schuelke, M. & Seelow, D. MutationTaster2: mutation prediction for the deep-sequencing age. Nat. Methods 11, 361–362 (2014).
Kumar, P., Henikoff, S. & Ng, P. C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 4, 1073–1081 (2009).
Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42, 348–354 (2010).
Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Kettunen, J. et al. Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA. Nat. Commun. 7, 11122 (2016).
Kettunen, J. et al. Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nat. Genet. 44, 269–276 (2012).
Teslovich, T. M. et al. Identification of seven novel loci associated with amino acid levels using single-variant and gene-based tests in 8545 Finnish men from the METSIM study. Hum. Mol. Genet. 27, 1664–1674 (2018).
Inouye, M. et al. Novel loci for metabolic networks and multi-tissue expression studies reveal genes for atherosclerosis. PLoS Genet. 8, e1002907 (2012).
Lee, S. et al. Optimal unified approach for rare-variant association testing with application to small-sample case–control whole-exome sequencing studies. Am. J. Hum. Genet. 91, 224–237 (2012).
Peterson, C. B., Bogomolov, M., Benjamini, Y. & Sabatti, C. Many phenotypes without many false discoveries: error controlling strategies for multitrait association studies. Genet. Epidemiol. 40, 45–56 (2016).
Loh, P. R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448 (2016).
Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009).
Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
Lawson, D. J., Hellenthal, G., Myers, S. & Falush, D. Inference of population structure using dense haplotype data. PLoS Genet. 8, e1002453 (2012).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Pirinen, M. et al. biMM: efficient estimation of genetic variances and covariances for cohorts with high-dimensional phenotype measurements. Bioinformatics 33, 2405–2407 (2017).
Acknowledgements
We thank T. Teshiba for coordinating ethical permissions and samples; S. Kerminen, D. Lawson and G. Busby for discussions and providing scripts to run fineSTRUCTURE. S.R. was supported by the Academy of Finland Center of Excellence in Complex Disease Genetics (312062), Academy of Finland (285380), the Finnish Foundation for Cardiovascular Research, the Sigrid Juselius Foundation, Biocentrum Helsinki and University of Helsinki HiLIFE Fellow grant. V.R. acknowledges support by RFBR, research project 18-04-00789 A. V.S. was supported by the Finnish Foundation for Cardiovascular Research. C.S. and L.S. received funding from HG006695, HL113315 and MH105578. M.A.-K. is supported by a Senior Research Fellowship from the National Health and Medical Research Council (NHMRC) of Australia (APP1158958) and works in a unit that is supported by the University of Bristol and UK Medical Research Council (MC_UU_12013/1). The Baker Institute is supported in part by the Victorian Government’s Operational Infrastructure Support Program. A.U.J., D.R., L.J.S., H.M.S., R.W., P.Y., X.Y. and M.B. received funding from DK062370. S.K.S., C.W.K.C. and N.B.F. received funding from HL113315 and NS062691. The METSIM study was supported by grants from Academy of Finland (321428), the Sigrid Juselius Foundation, the Finnish Foundation for Cardiovascular Research, Kuopio University Hospital and the Centre of Excellence of Cardiovascular and Metabolic Diseases is supported by the Academy of Finland (M.L.). Sequencing was funded by 5U54HG003079. A.E.L., K.M.S., H.J.A., C.C.C., C.J.K., K.L.K., D.C.K., D.E.L., J.N., T.J.N., S.K.D., N.O.S., I.M.H. and R.K.W. were funded by 5U54HG003079 and 5UM1HG008853-03.
Author information
Authors and Affiliations
Consortia
Contributions
A.E.L., L.J.S., R.K.W., A. Palotie, V.S., M.L., S.R., M.B. and N.B.F. designed the study. A.E.L., K.M.S., H.J.A., R.S.F., D.C.K., D.E.L., J.N., T.J.N. and J.V. produced and quality-controlled the sequence data. A.E.L., A.S.H., A.U.J., A. Pietilä, H.M.S., M.A.-K., V.S. and M.L. collected, quality-controlled and/or prepared the clinical data for association analysis. A.E.L., K.M.S., C.W.K.C., S.K.S., A.S.H., L.S., M.P., C.C.C., A.U.J., C.J.K., K.L.K., V.R., D.R., J.V., R.W., P.Y. and X.Y. analysed data. A.S.H., J.G.E., M.A.-K., M.-R.J. and M.M. collected, quality-controlled and analysed replication data. H.L., S.K.D., N.O.S., I.M.H., C.S., S.R., M.B. and N.B.F. supervised experiments and analyses. A.E.L., K.M.S., C.W.K.C., S.K.S., C.S., M.B. and N.B.F. wrote the paper.
Corresponding authors
Ethics declarations
Competing interests
: V.S. has participated in a conference trip sponsored by Novo Nordisk and received a honorarium from the same source for participating in an advisory board meeting. He also has ongoing research collaboration with Bayer. H.L. is a member of the Nordic Expert group unconditionally supported by Gedeon Richter Nordics and has received an honorarium from Orion. All other authors have no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Peer review information Nature thanks Timothy Frayling, Alan Shuldiner, André G. Uitterlinden, Daniel E. Weeks for their contribution to the peer review of this work.
Extended data figures and tables
Extended Data Fig. 1 Allele frequency comparisons between FinMetSeq and NFE from gnomAD.
a, Distribution of allelic frequencies between FinMetSeq and gnomAD NFE. The comparison of allele frequencies shows the excess of variants at higher frequency in Finland as a result of the multiple bottlenecks experienced in Finnish population history. b, Proportional site frequency spectra between FinMetSeq and gnomAD NFE by variant annotation class. In general, we find a depletion of the variants in the rarest frequency class, as well as enrichment of variants in the intermediate to common frequency range. The site frequency spectra were down-sampled to 18,000 chromosomes for each data set. c, Comparison of MAFs for trait-associated variants in FinMetSeq and NFE gnomAD. Plotted in the grey background is a two-dimensional histogram of variants with non-zero allele frequencies in both gnomAD and FinMetSeq but no trait associations. Variants associated with at least one trait are coloured and scaled inversely proportional to the logarithm of the association P value. Variants >10× enriched in FinMetSeq compared to NFE are pink, those <10× enriched are in blue. The dashed line is the line of equal frequency. Two-sided uncorrected P values are from a regression of trait on the count of alternative allele at each variant. The number of independent individuals used in each point is listed in Supplementary Table 5.
Extended Data Fig. 2 Heritability of and correlations between traits.
a, b, Traits are in the same order, clockwise in a, and left to right and top to bottom in b, following the trait group colour key. a, Heritability estimated in 13,342 unrelated individuals (for abbreviations see Supplementary Table 4; for details see Supplementary Table 6). b, Heat map of the absolute Pearson correlations of standardized trait values (top right triangle) and the absolute values of estimated pairwise genetic correlations (bottom left triangle). Genetic correlations are estimated in 13,342 unrelated individuals. Values in grey below the diagonal had trait heritability less than 1.5× the s.e. of heritability.
Extended Data Fig. 3 Properties of associations shared between traits.
a, Shared genomic associations by pairs of traits. For traits x and y, colour in row x and column y reflects the number of loci associated with both traits divided by the number of loci associated with trait x. Traits are presented in the same order as in Extended Data Fig. 2a, and the side and top colour bars reflect trait groups. b, Relationship between estimated genetic correlation and extent of sharing of genetic associations. For each trait pair, the extent of locus sharing is defined as the number of loci associated with both traits divided by the total number of loci associated with either trait. Analysis using the absolute value of the Pearson correlation of the residual series results in a very similar pattern. The number of trait pairs in each x-axis category is as follows: 0–1%, 819; 1–10%, 204; 11–20%, 102; 21–30%, 41; 31–40%, 29; 41–50%, 16; >50%, 13. The bar within each box is the median, the box represents the upper and lower quartiles, whiskers extend to 1.5× the interquartile range and points represent outliers.
Extended Data Fig. 4 Gene-based association of extremely rare variants in APOB with serum total cholesterol.
Top, the distribution of the covariate-adjusted and inverse-normal transformed phenotype. Bottom, the association statistics for each variant included in the gene-based test along with the trait value for minor allele carriers of each variant (orange triangles). SV.P is the P value from the analysis of each variant in a single-variant analysis. The number of independent individuals in the analysis is 19,291.
Extended Data Fig. 5 Gene-based association of rare variants in SECTM1 with HDL2 cholesterol.
Top, the distribution of the covariate-adjusted and inverse-normal transformed phenotype. Bottom, the association statistics for each variant included in the gene-based test, along with the trait value for minor allele carriers of each variant (orange triangles). SV.P is the P value from the analysis of each variant in a single-variant analysis. The number of independent individuals in the analysis is 10,984.
Extended Data Fig. 6 Gene-based association of extremely rare variants in ALDH1L1 with glycine levels.
Top, the distribution of the covariate-adjusted and inverse-normal transformed phenotype. Bottom, the association statistics for each variant included in the gene-based test, along with the trait value for minor allele carriers of each variant (orange triangles). SV.P is the P value from the analysis of each variant in a single-variant analysis. The number of independent individuals in the analysis is 8,206.
Extended Data Fig. 7 Population structure of the FinMetSeq dataset, by region.
Population structure, by region, from a principal component analysis of exome-sequencing variant data (MAF > 1%) for 14,874 unrelated individuals with known parental birthplaces. Colour indicates individuals with both parents born in the same region; grey indicates individuals with different parental birth regions or missing information for one parent. Ctf, Central Finland; COs, Central Ostrobothnia; Kai, Kainuu; Khm, Kanta-Hame; Kyl, Kymenlaakso; Lap, Lapland; Nka, Northern Karelia; NOs, Northern Ostrobothnia; NSv, Northern Savonia; Nfi, individuals born outside Finland and lacking data on parental birthplace; Osb, Ostrobothnia; Phm, Paijat-Hame; Prk, Pirkanmaa; SKa, Southern Karelia; SuK, surrendered Karelia; SOs, Southern Ostrobothnia; SSv, Southern Savonia; Stk, Satakunta; Swf, Southwest Finland; Usm, Uusimaa; X, split parental birthplaces. Large solid circles represent the centre of each region. A map of Finland with regions labelled is supplied for reference.
Extended Data Fig. 8 Hierarchical clustering tree produced by fineSTRUCTURE.
We identified 16 subpopulations within the FinMetSeq dataset by applying a haplotype-based clustering algorithm, fineSTRUCTURE, on 2,644 unrelated individuals born by 1955 whose parents were both born in the same municipality (Methods). Each subpopulation is named based on the most common parental birth location among its members. Kai, Kainuu; Lap, Lapland; NKa, North Karelia; NOs, North Ostrobothnia; NSv, North Savonia; SOs, South Ostrobothnia; SuK, Surrendered Karelia. A map of Finland with regions labelled is supplied for reference. If multiple subpopulations share the same location label, the subpopulation is further distinguished with a numeral. NSv3 is used as an internal reference for the enrichment analysis. See Supplementary Table 17 for more detailed demographic descriptions of each subpopulation.
Extended Data Fig. 9 Regional variation in allele frequencies by functional annotation.
Enrichment of variants by allelic class in regional subpopulations of late-settlement Finland (defined in Supplementary Table 17). Each bin represents the ratio of variants in the subpopulation compared to the reference subpopulation (NSv3), after down-sampling the frequency spectra of all populations to 200 chromosomes. Pink cells represent enrichment (ratio >1), blue cells represent depletion (ratio <1). Sample sizes and confidence intervals for each enrichment ratio and the associated P values are presented in Supplementary Table 18. The results are consistent with multiple bottlenecks in late-settlement Finland, particularly for populations in Lapland and Northern Ostrobothnia. *P < 0.05; **P < 0.01; ***P < 0.005.
Supplementary information
Supplementary Information
This file contains the Supplementary Results, Supplementary Methods, Supplementary References and a full list of members of FinnGen.
Supplementary Tables
This file contains Supplementary Tables 1–22 with a full guide.
Rights and permissions
About this article
Cite this article
Locke, A.E., Steinberg, K.M., Chiang, C.W.K. et al. Exome sequencing of Finnish isolates enhances rare-variant association power. Nature 572, 323–328 (2019). https://doi.org/10.1038/s41586-019-1457-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41586-019-1457-z
This article is cited by
-
Deciphering the genetic structure of the Quebec founder population using genealogies
European Journal of Human Genetics (2024)
-
Mineral Metabolism and Polycystic Ovary Syndrome and Metabolic Risk Factors: A Mendelian Randomization Study
Reproductive Sciences (2024)
-
Genome-wide characterization of circulating metabolic biomarkers
Nature (2024)
-
Causal associations between type 1 diabetes mellitus and cardiovascular diseases: a Mendelian randomization study
Cardiovascular Diabetology (2023)
-
KIF15 missense variant is associated with the early onset of idiopathic pulmonary fibrosis
Respiratory Research (2023)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.