FinnGen: Unique genetic insights from combining isolated population and national health register data

Population isolates such as Finland provide benefits in genetic studies because the allelic spectrum of damaging alleles in any gene is often concentrated on a small number of low-frequency variants (0.1% ≤ minor allele frequency < 5%), which survived the founding bottleneck, as opposed to being distributed over a much larger number of ultra-rare variants. While this advantage is well-established in Mendelian genetics, its value in common disease genetics has been less explored. FinnGen aims to study the genome and national health register data of 500,000 Finns, already reaching 224,737 genotyped and phenotyped participants. Given the relatively high median age of participants (63 years) and dominance of hospital-based recruitment, FinnGen is enriched for many disease endpoints often underrepresented in population-based studies (e.g., rarer immune-mediated diseases and late-onset degenerative and ophthalmologic endpoints). We report here a genome-wide association study (GWAS) of 1,932 clinical endpoints defined from nationwide health registries. We identify genome-wide significant associations at 2,491 independent loci. Among these, fine-mapping implicates 148 putatively causal coding variants associated with 202 endpoints, 104 with low allele frequency (AF < 10%), of which 62 were over two-fold enriched in Finland. We studied a benchmark set of 15 diseases that had previously been investigated in large genome-wide association studies. FinnGen discovery analyses were meta-analysed with the Estonian and UK biobanks. We identify 30 novel associations, primarily low-frequency variants strongly enriched in, or specific to, the Finnish population and Uralic language family neighbors in Estonia and Russia. These findings demonstrate the power of bottlenecked populations to find unique entry points into the biology of common diseases through low-frequency, high-impact variants. Such high-impact variants have the potential to contribute to medical translation, including drug discovery.

medRxiv preprint doi: https://doi.org/10.1101/2022.03.03.22271360; this version posted March 6, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Main
Large biobank studies have become an important source for genetic discoveries. The FinnGen study aims to construct a resource combining the power of nationwide biobanks, structured national healthcare data and a unique, isolated population. Due to increased genetic drift, isolated populations with recent bottlenecks can have deleterious, disease-predisposing alleles at considerably higher frequencies than permitted by selection in larger and older outbred populations. Counterbalancing this enrichment of specific low-frequency alleles, the other consequence of a recent bottleneck is that isolated populations have considerably fewer very rare variants overall 1-3. As a result, isolated populations provide an opportunity to identify high-impact disease-causing variants that are extremely rare in other populations 1,4-7. In Finland, a strong founding bottleneck occurred ~120 generations ago, followed by rapid population expansion. This bottleneck effect has resulted in numerous strongly deleterious alleles that are more frequent in Finland compared to other Europeans. This is manifested in the Finnish Disease Heritage, a set of 36 mostly recessive diseases that are more prevalent in Finland than elsewhere in the world 8. This population history (facilitating identification of low-frequency deleterious alleles), combined with longitudinal information from registers recording hospital in-patient and outpatient diagnoses, purchases of prescription medications and many other national health registries centrally collected for decades, provides unique opportunities for understanding the genetic basis of health and disease.
FinnGen is a public-private partnership research project combining imputed genotype data generated from newly collected and legacy samples of Finnish biobanks and digital health record data from Finnish health registries (https://www.finngen.fi/en), aiming to provide new insights into disease genetics. It includes nine Finnish biobanks, universities and university hospitals, 13 international pharmaceutical industry partners and the Finnish biobank cooperative (FINBB) in a precompetitive partnership. As of August 2020 (Release 5, described in this paper), samples from 412,000 individuals had been collected and 224,737 analysed, with the aim of reaching a cohort of 500,000 participants (see Supplementary Methods Section 2). The project utilizes the nationwide longitudinal health register data collected since 1969 from every resident in Finland.
Here, we describe the FinnGen project and its current genotype and phenotype content and highlight a series of genetic discoveries from the first data collection phase. In accompanying manuscripts, we describe more detailed studies showcasing different aspects of the rich data available from population registries. We first show that FinnGen register-based phenotypes are comparable to those used in disease-specific genome-wide association studies (GWAS) in 15 previously well-studied common diseases. We demonstrate the power of the combination of isolated population and register data to discover novel low-frequency variant associations, even in previously well-studied diseases where FinnGen has a much smaller number of cases than published disease-specific GWASs. Finally, via a genome-wide association study of 1,932 endpoints followed by statistical fine-mapping, we demonstrate the ability to identify likely causal coding variants even with very low allele frequencies.
Phenotyping and genotyping
In Finland, similar to the other Nordic countries, there are nationwide electronic health registers originally established primarily for administrative purposes to monitor the usage of healthcare nationwide and over the lifespan of each Finnish resident. These registers have almost complete coverage of major health-related events, such as hospitalizations, prescription drug purchases (not including hospital-administered medications), medical procedures and deaths, with a history of data collection spanning more than 50 years. Health register-based phenotypes ("endpoints") were created by combining data (mainly using International Classification of Diseases (ICD) and Anatomical Therapeutic Chemical (ATC) classification codes) from one or more nationwide health registers (Supplementary Table 1). For a phenome-wide GWAS, we have initially constructed more than 2,800 endpoints by combining data from different health registers including the hospital discharge register, prescription medication purchase register and cancer register ( [...] Table 2) genotyped with non-custom genotyping arrays (see Supplementary Methods Section 3 for QC details). We developed and utilized a population-specific imputation reference panel of 3,775 high-coverage (25-30x) [...] Figure 6). We estimated that 165,448 individuals (73.6%) have 3rd degree or closer relatives among FinnGen participants, which is higher than the estimated 30.3% in the UKBB 10, but is partially explained by family-based legacy cohorts.
We removed 5,780 duplicates/monozygotic twins (one from each pair removed) and genetic population outliers (see Supplementary Methods Section 4) and built a set of approximately unrelated individuals in which any pair is related at degree 3 or more distantly. In total there were 156,977 independent individuals, which were used to compute the principal components (PCs), and the 61,980 related individuals were projected onto those PCs (see Supplementary Methods Section 4). The first two PCs captured the well-known east-west and north-south genetic differences in Finland (Figure 1B) 11. Out of the total of 218,957 genotyped samples remaining, we had phenotype data for 218,792 individuals (56.5% females [123,579]), which were then used in all analyses.
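The fit-then-project step described above can be sketched as follows. This is a minimal illustration with simulated genotypes, assuming a plain SVD on standardized dosages; the function names (`fit_pcs`, `project`) and the matrix sizes are hypothetical and do not reflect the actual FinnGen pipeline, which is described in the Supplementary Methods.

```python
import numpy as np

def fit_pcs(G_unrel, n_pcs=2):
    """Fit PCs on unrelated samples (rows: individuals, columns: variants)."""
    mean = G_unrel.mean(axis=0)
    std = G_unrel.std(axis=0)
    std[std == 0] = 1.0  # guard against monomorphic variants
    Z = (G_unrel - mean) / std
    # Right singular vectors of the standardized matrix are the variant loadings.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_pcs].T
    return mean, std, loadings

def project(G, mean, std, loadings):
    """Project samples using the mean, scale and loadings of the unrelated set."""
    return ((G - mean) / std) @ loadings

# Simulated dosages (hypothetical sizes, for illustration only).
rng = np.random.default_rng(0)
G_unrel = rng.binomial(2, 0.3, size=(50, 200)).astype(float)
G_rel = rng.binomial(2, 0.3, size=(10, 200)).astype(float)

mean, std, loadings = fit_pcs(G_unrel)
pcs_unrel = project(G_unrel, mean, std, loadings)
pcs_rel = project(G_rel, mean, std, loadings)  # related samples never influence the axes
```

Fitting the axes on unrelated individuals only keeps family structure from dominating the leading components; the related individuals are then placed in the same coordinate system.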
Genome-wide association studies utilizing nationwide health registries
To benchmark our register-based phenotyping and explore the value of the isolated setting of Finland, we selected 15 diseases with over 1,000 cases in FinnGen and for which well-powered GWASs have been published. We evaluated the accuracy of our phenotyping by comparing the genetic correlations and effect sizes with the earlier GWAS results (Supplementary Table 6). None of the genetic correlations (GC) were significantly lower than 1 (lowest GC 0.89 [SE: 0.07] in AMD, Supplementary Table 6). For diseases with a large number of cases in FinnGen, the effect sizes of lead variants in known loci were largely consistent between FinnGen and the literature meta-analyses, demonstrating that our register-based phenotyping is comparable to existing disease-specific GWASs (Figure 1E, Supplementary File 1). As expected, the effect sizes varied more in diseases with a smaller number of cases in FinnGen.
GWAS of these 15 diseases identified 235 loci and 275 independent genome-wide significant associations outside of the HLA region (GRCh38 Chr6:25Mb-34Mb). The FinnGen PheWAS of imputed classical HLA gene alleles is reported by Ritari et al. 12. Of the non-HLA associations, 44 were driven by low-frequency lead variants (we define "low frequency" as AF < 0.1 in FinnGen) that were over two-fold enriched in Finns as compared to non-Finnish-Swedish-Estonian Europeans (NFSEE) in gnomAD v2.0.1 13. We use NFSEE as a general continental European reference point, excluding individuals from Finland, Sweden and Estonia. As there were large-scale migrations from Finland to Sweden in the 20th century, many of the chromosomes from Swedish sequencing studies are of recent Finnish origin, and the geographically close and linguistically and genetically similar 13 population of Estonia is likely to share elements of the same ancestral founder effect.
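The enrichment criterion above reduces to a ratio of allele frequencies. A minimal sketch follows, assuming enrichment is simply the FinnGen AF divided by the gnomAD NFSEE AF; the function names are hypothetical, and the treatment of a zero reference frequency as infinite enrichment follows the table note in this paper.

```python
import math

def fold_enrichment(af_fin, af_nfsee):
    """Ratio of Finnish to NFSEE allele frequency (hypothetical helper)."""
    if af_nfsee == 0.0:
        # Variant unobserved in the reference: enrichment is unbounded.
        return math.inf if af_fin > 0.0 else math.nan
    return af_fin / af_nfsee

def is_enriched_low_frequency(af_fin, af_nfsee, max_af=0.1, min_fold=2.0):
    """Low frequency (AF < 0.1 in FinnGen) and over two-fold enriched."""
    return af_fin < max_af and fold_enrichment(af_fin, af_nfsee) > min_fold

# Frequencies reported in this paper for TM6SF2 p.Leu156Pro: 5.2% vs 1.2%.
ratio = fold_enrichment(0.052, 0.012)
flag = is_enriched_low_frequency(0.052, 0.012)
```

For these frequencies the ratio is about 4.3, so the variant passes both the low-frequency and the two-fold enrichment filters.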
Replication of many such Finnish-enriched variant associations is hindered by low allele frequencies or missingness in other European populations. As Finns are genetically more similar to Estonians than to other Europeans 13, we first conducted replication in 136,724 individuals from the Estonian biobank and then extended it to UKBB (see Supplementary Table 7 for endpoint definitions and case/control numbers, and Methods). The effect sizes of genome-wide significant hits in FinnGen were mostly concordant with those in Estonia (average inverse variance weighted slope 1.5 [FinnGen higher] and r2 0.69) and UKBB (slope 1.1, r2 0.84) (Supplementary Figure 8). Most likely due to slightly different ascertainment schemes, FinnGen had higher case prevalence in the 15 disease diagnoses than UKBB; however, the Estonian biobank had the highest case prevalence in ophthalmic diseases (age-related macular degeneration, glaucoma) and inflammatory skin conditions (atopic dermatitis and psoriasis) (Figure 2A).
After meta-analysis with Estonia and UKBB, 241 of the 275 associations remained genome-wide significant (Supplementary Table 8). We further meta-analysed 232 loci that did not meet the genome-wide significance threshold in FinnGen (5 × 10⁻⁸ < p < 1 × 10⁻⁶), and 57 of those were genome-wide significant after meta-analysis, resulting in 298 genome-wide significant meta- [...]
In order to determine whether the observed associations had been previously reported, we queried the GWAS Catalog (and the largest recent relevant GWASs) for genome-wide significant (p < 5 × 10⁻⁸) variants that are in linkage disequilibrium (LD) (r2 > 0.1 in the FinnGen imputation panel) with observed lead variants in FinnGen. As the lowest allele frequency of novel findings was low [...] Table 8). As expected, we observed that lead variants in novel loci were mostly of low frequency (27 lead variants had MAF < 10% in FinnGen) and enriched in Finland as compared to known loci from previous GWAS. In most cases, the allele frequencies of lower-frequency variants (MAF < 10% in FinnGen) were the highest in FinnGen, followed by Estonia, and lowest in NFSEE in gnomAD (Figure 2C).
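Combining per-cohort effect estimates as described above can be illustrated with a fixed-effect inverse-variance weighted meta-analysis. The text does not state which meta-analysis model was used, so the model choice here is an assumption, and the function name `ivw_meta` and the input values are hypothetical.

```python
import math

def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance weighted meta-analysis (assumed model).

    betas, ses: per-cohort log-odds effect estimates and standard errors.
    Returns the pooled beta, its standard error and a two-sided p-value.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return beta, se, p

# Two hypothetical cohorts with the same effect and precision.
beta, se, p = ivw_meta([0.2, 0.2], [0.1, 0.1])
```

Each cohort is weighted by the inverse of its sampling variance, so a cohort in which the variant is rare (and the standard error large) contributes little to the pooled estimate.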
Next, we performed statistical fine-mapping of FinnGen associations (see Methods) on all 298 genome-wide significant associations and observed a coding variant (missense, frameshift, canonical splice site, stop gained, stop lost, inframe deletion) with posterior inclusion probability (PIP) ≥ 0.05 in 44 (18.7%) of the 95% credible sets (17 coding variants had PIP > 0.5). From here onwards, we report coding variants with PIP > 0.05 as "putatively causal". We recognize that there may be occasions where the causal variant assignment to a coding variant is incorrect (see our accompanying papers 15,16 for discussions on fine-mapping calibration and replicability). In addition to identifying putative causal coding variants, we sought to identify potential gene expression regulatory mechanisms by colocalizing credible sets with fine-mapped expression quantitative trait locus (eQTL) datasets from GTEx 17 and eQTL catalogue 18 (see Methods).
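To make the credible-set terminology concrete, the sketch below builds a 95% credible set from per-variant posterior probabilities for a single signal and flags coding variants above the PIP threshold. It is a simplification: SuSiE derives one credible set per modelled effect from its own posterior weights, whereas this example assumes a single causal signal. The function names and variant labels are hypothetical.

```python
def credible_set(pips, coverage=0.95):
    """Smallest set of variants whose probabilities sum to at least `coverage`.

    pips: dict mapping variant id -> posterior probability (single signal assumed).
    """
    ranked = sorted(pips.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for variant, pip in ranked:
        chosen.append(variant)
        total += pip
        if total >= coverage:
            break
    return chosen

def putatively_causal_coding(pips, coding, coverage=0.95, min_pip=0.05):
    """Coding variants in the credible set with PIP above `min_pip`."""
    return [v for v in credible_set(pips, coverage)
            if v in coding and pips[v] > min_pip]

# Hypothetical locus with four variants, two of them coding.
pips = {"v1": 0.6, "v2": 0.3, "v3": 0.06, "v4": 0.04}
cs = credible_set(pips)
hits = putatively_causal_coding(pips, coding={"v3", "v4"})
```

In this toy locus the credible set contains three variants and only `v3` is reported as putatively causal, even though it is not the variant with the highest probability; this is why the text cautions that such assignments can be incorrect.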
To describe the allele frequency spectrum and putative mechanisms of action of risk variants, we LD-pruned the 298 genome-wide significant associations and prioritised the most significant phenotype among the same hits to represent a single putative causal variant (LD r2 between lead variants < 0.2), resulting in 281 unique associations.
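One common way to implement such pruning is a greedy pass over hits ordered by significance. The sketch below assumes that approach, which the text does not spell out; `ld_prune`, the `r2` lookup and the example values are hypothetical.

```python
def ld_prune(hits, r2, threshold=0.2):
    """Greedy LD pruning: keep the most significant hit, drop hits in LD with it.

    hits: list of (variant, p_value) tuples.
    r2:   callable returning the LD r2 between two variants.
    """
    kept = []
    for variant, p in sorted(hits, key=lambda h: h[1]):
        # Keep a hit only if it is below the LD threshold with every kept hit.
        if all(r2(variant, k) < threshold for k, _ in kept):
            kept.append((variant, p))
    return kept

# Hypothetical pairwise LD table for three lead variants.
ld_table = {
    frozenset(("a", "b")): 0.8,
    frozenset(("a", "c")): 0.01,
    frozenset(("b", "c")): 0.05,
}

def lookup_r2(x, y):
    return ld_table[frozenset((x, y))]

pruned = ld_prune([("b", 1e-10), ("a", 1e-20), ("c", 1e-9)], lookup_r2)
```

Here `b` is dropped because it is in strong LD with the more significant `a`, while `c` is retained as a separate association.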
The majority of the 281 unique associations were common variant associations; however, 73 of those had a lead variant frequency of less than 10% in FinnGen, and 38 of them were more than two-fold enriched in Finland vs. NFSEE. At lower frequencies (MAF < 10%), we observed a coding variant more often in the credible sets of over two-fold enriched associations (19/38, 50%) than of non-enriched associations (9/35, 25.7%).
Following the discovery of 27 novel unique associations for 30 endpoints, we sought to determine potential mechanisms of action through identification of coding variants in their credible sets and potential regulatory effects by colocalization with eQTL associations from GTEx and eQTL catalogue 18. We identified putative causal coding variants in 9/27 loci and eQTL colocalization in 5/27 loci. In three out of the five eQTL loci, we observed a coding variant in credible sets (IL4R, MYH14, CFI). The two remaining eQTL colocalizations were a breast cancer locus colocalizing with an H2BP2 eQTL in lung tissue and a T2D locus colocalizing with PRRG4 in lipopolysaccharide-stimulated monocytes. The disease relevance of these eQTLs is not evident.
However, no credible coding variants or eQTL were identified in 16/27 loci (Table 1, Supplementary Table 8). The fraction of associations where we observed eQTL was small [...]

Table 1 notes: ** denotes values of infinity, resulting from the NFSEE MAF being 0.00. (1) Coding variant RSID and PIP given in parentheses if a coding variant was observed in the credible set. (2) HGVS notation of the protein-coding change if either the lead variant was coding or a coding variant was observed in the credible set. (3) Coding variant consequence given in parentheses in case the lead variant was not a coding variant and a coding variant was observed in the credible set. (4) Gene corresponding to the variant function. In case a lead variant was not a coding variant, but there was a coding variant in the credible set, the credible set coding variant gene is given in parentheses.

The identification of a novel signal for IBD mapping to a single variant in an intron of TNRC18 highlights the value of FinnGen for discovery, even when the case sample size is far below that of existing meta-analyses. This variant has a strong risk-increasing effect (AF 3.6%, OR 3.2, p-value 2.4 × 10⁻⁶¹), eclipsing the significance of signals at IL23R, NOD2 and MHC. The variant is 114-fold enriched in Finns compared to NFSEE Europeans, where the allele frequency is too low (0.04%) to have been identified in previous GWAS. We were, however, able to unequivocally replicate this association in the Estonian biobank (AF 1.3%, OR 3.9, p-value 2.8 × 10⁻⁶) owing to the relatively higher frequency in the genetically related Estonian population. This variant was also associated with risk of multiple other inflammatory conditions evaluated in FinnGen, including interstitial lung disease (OR 1.43, p 6.3 × 10⁻²⁶), ankylosing spondylitis (OR 4.2, p 1.8 × 10⁻³⁴), iridocyclitis (OR 2.3, p 1.2 × 10⁻²⁷) and psoriasis (OR 1.6, p 1.1 × 10⁻¹³). However, the same allele appears protective [...]
The highest number of novel and enriched low-frequency associations (eight loci) was identified in type 2 diabetes, most likely due to the large number of T2D patients in FinnGen release 5 (29,193). Other noteworthy findings from this set of 30 novel findings for 15 well-studied diseases are described further in BOX 1.

Coding variant associations
Motivated by the identification of high-effect coding variant associations within the selected 15 diseases, we performed a phenome-wide GWAS followed by fine-mapping in order to identify [...]
To put the frequency spectrum and putative acting mechanisms in interpretable context, we chose a single most significant association per signal by LD-based merging (lead variants with r2 > 0.3 merged), resulting in 1,838 unique associations in 681 endpoints (Supplementary Table 10 [...] Table 10). The majority (43) [...]
We next wanted to quantify the benefits of the Finnish population isolate in GWAS discovery. To this end, we assessed whether Finnish-enriched lower-frequency (MAF < 10%) variants were more likely to be associated with a phenotype than would be expected by chance. We randomly sampled, 1,000,000 times, the number of genome-wide significant variants observed (169) from a set of frequency-matched variants (MAF < 10%) that were not associated with any endpoint (p > 0.001).
Only six out of 1 million random draws had a higher proportion of two-fold Finnish-enriched variants than was observed in the significant associations (51.5% observed vs. 37% expected; p-value 2.8 × 10⁻⁵).
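The resampling comparison described above can be sketched as below. The example draws equally sized random sets from a pool of unassociated variants and counts how often the enriched fraction exceeds the observed one. The pool composition, the number of draws and the function name are hypothetical, frequency matching is assumed to have been applied when building the pool, and the sketch returns the plain exceedance fraction without attempting to reproduce how the reported p-value was computed.

```python
import random

def empirical_enrichment_p(observed_fraction, null_flags, n_hits, n_draws, seed=0):
    """Fraction of random draws whose enriched proportion exceeds the observed one.

    null_flags: list of booleans, True if an unassociated (null) variant is
                over two-fold Finnish enriched.
    n_hits:     number of variants per draw (size of the significant set).
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        sample = rng.sample(null_flags, n_hits)  # draw without replacement
        if sum(sample) / n_hits > observed_fraction:
            exceed += 1
    return exceed / n_draws

# Hypothetical null pool in which 37% of variants are enriched.
null_pool = [True] * 370 + [False] * 630
p_emp = empirical_enrichment_p(0.515, null_pool, n_hits=169, n_draws=2000)
```

With far fewer draws than the 1,000,000 used in the paper, the smallest non-zero value this estimate can take is 1 / `n_draws`, so a small run like this one only bounds the p-value.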

Known rare and low-frequency pathogenic variant associations
Among the genome-wide significant coding variant associations, we identified 13 variant associations (AF range 0.04%-2%) classified as Pathogenic/Likely pathogenic in ClinVar (Supplementary Table 8

Coding variants associated with medication use
An exceptional registry available in FinnGen is a prescription medication purchase registry (KELA, Table 1) linking all prescription medication purchases for all FinnGen participants since 1995. Using prescription records from this registry, we identified two enriched low-frequency coding variants that were associated with purchase of statin medications (3 or more purchases per individual) (Table 2). A missense variant in TM6SF2 (p.Leu156Pro, rs187429064) was associated with decreased likelihood of being prescribed statins (AF 5.2%, gnomAD NFSEE 1.2%, OR 0.86, p-value 3.8 × 10⁻¹³) but with an increased likelihood of insulin medication for diabetes (OR 1.17, p-value 8.2 × 10⁻¹¹) and of T2D (OR 1.15, p-value 2.6 × 10⁻⁸). In addition, the same variant was strongly associated with increased risk of hepatocellular carcinoma (ICD10 C22 "hepatic and bile duct cancer") (OR 3.7, p-value 5.9 × 10⁻¹⁰). Consistent with decreasing the likelihood of being prescribed statins, TM6SF2 p.Leu156Pro and another independent (r2 0.003) missense variant (p.Glu167Lys, rs58542926) have previously been associated with decreased LDL and total cholesterol (TC) levels 36. In a mouse model, both p.Glu167Lys and p.Leu156Pro lead to increased protein turnover and reduced cellular protein levels of TM6SF2 37. TM6SF2 p.Glu167Lys decreases hepatic large VLDL particle secretion and increases intracellular lipid accumulation 38,39, which likely explains its associations with non-alcoholic fatty liver disease 40

Conclusions
In this paper and accompanying publications, we present FinnGen, one of the largest nationwide genetic studies with access to comprehensive electronic health register data of all participants.
The final aim of the study is to collect 500,000 biobank participants by the end of 2023. The interim releases of FinnGen have already contributed to many new discoveries and insights into human genetic variation and how it affects disease and health 19,44-48, including contributions to the COVID-19 host genetics initiative 49 and the global biobank meta-analysis initiative 50. Summary statistics from each data release will be made publicly available after a one-year embargo period, and all summary statistics described here are freely available (www.finngen.fi/en/access_results).
An important feature of FinnGen compared to other similar projects, such as the UKBB 10, is the unique genetic makeup of the Finnish population. In the GWAS of selected, well-studied diseases, we were able to identify several novel associations with a fraction of the cases compared to the largest published GWA studies. These associations were, as expected, largely observed with variants that were increased in frequency in the Finnish population bottleneck and would have required prohibitively large sample sizes in older, non-bottlenecked populations (Figure 2D). Moreover, in the GWAS of 1,932 endpoints, we observed that over two-fold Finnish-enriched variants were 1.6 times more likely to be associated with a phenotype than would be expected by chance.
Further, we observed that putative coding variant associations were not only of lower allele frequency but also more often enriched in Finland than non-coding variant associations (Figure 3). This observation is expected, as coding variant associations are more deleterious on average and selection drives the allele frequencies down. However, some of these deleterious alleles survived the bottleneck and increased in frequency, facilitating the identification of their associations with diseases.
[...] diseases studied in this paper, the sample prevalence in FinnGen was higher than in UKBB. The difference was the most extreme in Alzheimer's disease (2.7% in FinnGen vs. 0.2% in UKBB), a disease of old age, and the most similar in asthma (9.4% vs. 7.4%) (Figure 2A). FinnGen also has a relatively high sample prevalence of some severe mental disorder cases such as schizophrenia (2.5%, n=5,562) and bipolar disease (2.1%, n=4,501) that are often underrepresented in biobank studies. A key aspect of the recruitment strategy is the Finnish biobank legislation that enables participants to donate samples with a broad consent to medical research in general. This makes recruitment cost-effective as the same samples and data can be used, after appropriate application steps, for many medical research studies. However, due to the recruitment strategy, FinnGen is not epidemiologically representative.
Phenotyping
Some of the endpoints have a high number of overlapping cases. To avoid reporting highly repetitive endpoints, we clustered all endpoints that shared more than 50% of their cases and chose the one with the most genome-wide significant hits. On a few occasions, a manual choice was made to select the most representative endpoint among the correlating endpoints. After clustering, we had 1,932 endpoints for the main GWAS analysis.
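The clustering rule above can be sketched as follows. The text does not define the denominator of the overlap or how chains of overlapping endpoints are grouped, so this example assumes the overlap is measured relative to the smaller endpoint and that clusters are connected components. The function name, endpoint names and counts are hypothetical.

```python
def cluster_endpoints(cases, n_hits, min_overlap=0.5):
    """Group endpoints sharing > min_overlap of cases; keep one per cluster.

    cases:  dict endpoint -> set of case identifiers.
    n_hits: dict endpoint -> number of genome-wide significant hits.
    Assumption: overlap = shared cases / cases in the smaller endpoint.
    """
    names = list(cases)
    parent = {n: n for n in names}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = len(cases[a] & cases[b])
            smaller = min(len(cases[a]), len(cases[b]))
            if smaller and shared / smaller > min_overlap:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    # Representative: the endpoint with the most genome-wide significant hits.
    return sorted(max(members, key=lambda m: n_hits[m])
                  for members in clusters.values())

cases = {"T2D": {1, 2, 3, 4}, "T2D_WIDE": {1, 2, 3, 4, 5}, "ASTHMA": {6, 7}}
n_hits = {"T2D": 10, "T2D_WIDE": 12, "ASTHMA": 3}
selected = cluster_endpoints(cases, n_hits)
```

The manual overrides mentioned in the text are not modelled here.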

Genotyping and QC

Association analysis and fine-mapping
The mixed-model logistic regression method SAIGE (version 0.35.8.8) 53 was used for association analysis. We used sex, age, genotyping batch and 10 PCs as covariates (see extended methods for details). We used SuSiE 54 for fine-mapping. We fine-mapped all regions containing variants with p-value < 10⁻⁶, extending each region 1.5 Mb upstream and downstream of the lead variant. Finally, overlapping regions were merged and subjected to fine-mapping. The major histocompatibility region (chr 6: 25-36 Mb) was excluded due to its complex LD structure. We allowed up to 10 causal variants per region, and SuSiE reports a 95% credible set for each independent signal. For LD, we used in-sample dosages (i.e., the cases and controls used for each phenotype) computed with LDStore2. The FinnGen fine-mapping pipeline is available on GitHub (https://github.com/FINNGEN/finemapping-pipeline).
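The region definition above (a 1.5 Mb window on each side of a lead variant, with overlapping windows merged) amounts to a standard interval merge. A minimal sketch follows; the function name and coordinates are hypothetical, and the exclusion of the major histocompatibility region is not included.

```python
def finemap_regions(lead_variants, flank=1_500_000):
    """Build merged fine-mapping windows around lead variants.

    lead_variants: iterable of (chromosome, position) tuples.
    Returns a list of (chromosome, start, end) with overlaps merged.
    """
    windows = sorted((c, max(0, p - flank), p + flank) for c, p in lead_variants)
    merged = []
    for chrom, start, end in windows:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlaps the previous window on the same chromosome: extend it.
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged

regions = finemap_regions(
    [("1", 2_000_000), ("1", 4_000_000), ("1", 10_000_000), ("2", 1_000_000)]
)
```

Merging before fine-mapping ensures that nearby signals are modelled jointly in one region rather than twice in overlapping windows.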
To define independent signals within a locus, we utilized the fine-mapping results. For each locus, we report a credible set as an independent hit if it represents the primary (strongest) signal with lead p-value < 5 × 10⁻⁸; for secondary hits, we require genome-wide significance and a log Bayes factor (BF) > 2. The BF filtering was necessary because SuSiE sometimes reports multiple credible sets for a single very strong signal; SuSiE indicates this with a low BF (the model does not improve by adding another signal in the region, i.e., the additional credible set is not an independent signal).
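The filtering rule above can be written as a short function. This sketch assumes the primary signal is the credible set with the smallest lead p-value; the dictionary keys and example values are hypothetical and do not mirror the output format of SuSiE or the FinnGen pipeline.

```python
GWS = 5e-8  # genome-wide significance threshold

def independent_signals(credible_sets, min_log_bf=2.0):
    """Select independent hits in one locus.

    credible_sets: list of dicts with keys 'id', 'lead_p' and 'log_bf'
                   (hypothetical field names).
    """
    if not credible_sets:
        return []
    # Assumption: the primary signal is the one with the smallest lead p-value.
    primary = min(credible_sets, key=lambda cs: cs["lead_p"])
    if primary["lead_p"] >= GWS:
        return []
    kept = [primary["id"]]
    for cs in credible_sets:
        if cs is primary:
            continue
        # Secondary hits need genome-wide significance and log BF > 2.
        if cs["lead_p"] < GWS and cs["log_bf"] > min_log_bf:
            kept.append(cs["id"])
    return kept

locus = [
    {"id": "cs1", "lead_p": 1e-30, "log_bf": 50.0},
    {"id": "cs2", "lead_p": 1e-9, "log_bf": 1.2},   # significant but low BF
    {"id": "cs3", "lead_p": 1e-10, "log_bf": 8.0},
    {"id": "cs4", "lead_p": 1e-6, "log_bf": 5.0},   # not genome-wide significant
]
signals = independent_signals(locus)
```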

Estimation of expected number of enriched variant associations
We aimed to estimate whether we observed more over two-fold Finnish-enriched variant associations in the lower frequency range (MAF < 10%) than would be expected by chance. To this end, we sampled a subset of variants (MAF < 10%) that were not associated with any endpoint in FinnGen [...]
We used the Pan-UKBB (https://pan.ukbb.broadinstitute.org/) project European subset association analysis summary statistics in UKBB replication 56 (see Supplementary Table 6).
As both Estonian biobank and UKBB are on human genome build 37, we lifted over the coordinates to build 38 to match FinnGen. Variants were then matched based on chromosome, position, reference and alternative alleles.
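Once coordinates are on the same build, the matching step is an exact join on the four-part variant key. The sketch below assumes the liftover has already been done and uses hypothetical in-memory dictionaries in place of summary statistic files; it performs no allele flipping or strand handling, since the text describes only exact matching.

```python
def match_variants(finngen, other):
    """Join two summary statistic tables on (chrom, pos, ref, alt).

    finngen, other: dicts mapping (chrom, pos, ref, alt) -> (beta, se),
                    both on genome build 38.
    """
    shared = finngen.keys() & other.keys()
    return {key: (finngen[key], other[key]) for key in shared}

# Hypothetical records; the second variant differs in its alternative allele.
fg = {("1", 1000, "A", "G"): (0.30, 0.05), ("2", 500, "C", "T"): (0.10, 0.02)}
ukbb = {("1", 1000, "A", "G"): (0.25, 0.08), ("2", 500, "C", "G"): (0.11, 0.03)}
matched = match_variants(fg, ukbb)
```

Including both alleles in the key keeps multi-allelic sites apart, at the cost of missing variants whose reference and alternative alleles are swapped between datasets.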

Colocalization
We applied colocalization to all fine-mapped regions. As a colocalization approach we used the [...] We performed colocalization between FinnGen endpoints, eQTL catalogue 18, and a selected set of 36 continuous endpoints and 57 biomarkers from UKBB 16. eQTL catalogue and UKBB traits were processed with a fine-mapping pipeline functionally equivalent to that of FinnGen.

Automatic annotation of known GWAS hits
In order to identify novel hits from the GWAS results, we compared the fine-mapped results against genome-wide significant hits (p<5×10⁻⁸) in the GWAS Catalog association database 58 and manually curated genome-wide significant hits from large GWA studies (Table 1). We checked and reported separately 1) matches in credible set variants and 2) matches with any variants in LD with a lead variant (highest PIP) after fine-mapping. LD lookup variants were chosen with the following criteria: 1) They were less than 1500 kb away from the lead variant, 2) They had a p-

medRxiv preprint doi: https://doi.org/10.1101/2022.03.03.22271360. This file contains Supplementary Tables 1-10 and table legends.
muscle in the Finnish FUSION study 59. Variants in MYH14 can cause autosomal dominant peripheral neuropathy and deafness (https://www.omim.org/entry/608568), but no cardiac phenotype has been previously reported.
In the fourth association (AF 1.3%, OR 1.5, p 8.17×10⁻¹¹) we observed a splice donor variant (c.105+1G>T) in the SYNPO2L gene (PIP 0.13). The variant was extremely rare in Estonia (0.05%) but showed a consistent effect direction (OR 1.23), although the association was not significant owing to the lack of power for such a rare variant. The variant was absent in UKBB, and its AF in NFSEE in gnomAD is extremely low (0.02%). An intronic common variant in SYNPO2L has been previously associated with atrial fibrillation 62, and our results provide direct coding variant evidence of SYNPO2L being the causal gene in this locus. Further insights on atrial fibrillation from FinnGen coding variants have been described by Sun et al 21.
Common variants in the locus have been previously associated with prostate cancer 63. The BIK (BCL-2 interacting killer) gene product is a pro-apoptotic protein that has been suggested to act as a tumor suppressor 64 and to be a marker for more aggressive breast cancer 65; however, no direct genetic evidence has previously been published on variants in BIK causing prostate cancer. codes for small nucleolar RNA (SNORA), the role of which in cancer is not very well characterized, but differential expression associations of several SNORAs with different cancers have been observed 66. Hyperactivation of STAT3, however, has been observed in the majority of cancers and

Asthma
Another interesting locus where we observed a coding variant (missense in IL4R, p.Ala82Thr, PIP 0.21) within credible sets was associated with both asthma (FinnGen AF 8.2%, OR 0.86, p 2.5×10⁻¹², meta p 5.31×10⁻⁹) and psoriasis (FinnGen OR 1.28, p 3.48×10⁻⁹, meta p 1.92×10⁻¹¹), but with opposing directions of effect. IL4R codes for the IL4 receptor α subunit, which is part of the receptor complexes for both IL4 and IL13, key cytokines in the type II inflammatory response triggered by allergens or parasites (see 68 for a detailed review). The key role of the type II inflammatory response in asthma has long been recognized 69, and variants in the 5q31 locus containing the genes coding for IL4 and IL13 have been associated with asthma in GWA studies 70. Asthma, atopic dermatitis and hay fever often co-occur and are collectively referred to as atopic diseases 71. The effect direction of this association in atopic dermatitis (OR 0.9, p 2.4×10⁻³) was consistent with that in asthma. The reversed effect direction in psoriasis was surprising, as there is no evidence that type II inflammation contributes to psoriasis, which instead involves type I and Th17-mediated inflammation 72.