Abstract
Genome-wide association studies (GWAS) have significantly advanced the understanding of genetic mechanisms underlying complex human diseases and traits by systematically identifying genetic variants linked to diverse phenotype traits across diverse populations. Large-scale analyses that combine multiple phenotypes are especially valuable, as they can reveal shared genetic architectures and patterns of comorbidity, refining disease classification and risk prediction.
Here, we conducted comprehensive GWAS analyses based on Estonian Biobank EHR data, focusing on 4,884 ICD-10-based disease phenotypes, in a cohort of 206,159 Estonian Biobank participants. By analysing their genotype data, altogether 18,977,777 SNV and indel variants (including common, low-frequency and rare variants), our analyses revealed 2,127 unique genome-wide significant loci, including 778 putatively novel locus-phenotype associations.
Further investigation into coding variants revealed 835 significant associations, including well-established ACMG pathogenic variants and multiple putatively novel associations. Notably, a missense variant in SCN11A, which encodes the Nav1.9 sodium channel involved in pain signalling, was associated with decreased migraine risk, while an Estonian-enriched GOT1 variant considerably affected aspartate transaminase enzymatic function.
Our study highlights the potential of population-based biobanks in discovering both common and rare genetic associations, contributing to the identification of novel disease associated loci and expanding the catalog of human disease related genetic variants.
Introduction
Genome-wide association studies (GWAS) have significantly advanced our understanding of the genetic underpinnings of human complex diseases and traits by systematically identifying genetic variants associated with various phenotypes across diverse populations (Visscher et al., 2017; Uffelmann et al., 2021). GWAS has led to numerous breakthroughs, including discovering thousands of genetic loci influencing diseases such as type 2 diabetes (T2D), cardiovascular disease, psychiatric disorders, and autoimmune conditions (Tam et al., 2019; Uffelmann et al., 2021). These discoveries have facilitated the improvement of polygenic risk scores (PRS), enabling the development of more personalised medical predictions and interventions (Torkamani et al., 2018).
As the studied sample size is critical in GWA studies, larger study cohorts significantly enhance the statistical power to detect disease associated variants. The latter is especially obvious for rarer variants with small effect sizes. Some of the most extensive GWAS studies focusing on single disease traits include T2D, involving approximately 1.4 million individuals (Mahajan et al., 2022), and major depression with over 1.2 million participants (Howard et al., 2019), while some quantitative traits (eg. height) have encompassed 5.4 million samples (Yengo et al., 2022). While large single-trait GWAS provide valuable insights into genetic architecture of these traits, comprehensive analyses combining multiple diseases can facilitate the identification of shared genetic background and comorbidity patterns, enabling to better understand the underlying molecular mechanisms and cellular processes.
During the last decade, numerous large-scale population-based collections have emerged globally, offering extensive datasets from thousands to millions of individuals who are deeply phenotyped and genotyped, typically characterised through detailed questionnaires, comprehensive laboratory assessments, and electronic health record (EHR) linking, without selecting participants based on specific diseases (Hewitt et al., 2011). Among the most prominent are the UK Biobank (REF here) and the FinnGen project (REF here), which combines several Finnish biobanks. Several initiatives are focusing on diverse and also non-European populations, including the China Kadoorie Biobank, BioBank Japan, the Mexican Biobank Project, and the African BioGenome Project (Bycroft et al., 2018; Kurki et al., 2023; Chen et al., 2005; Nagai et al., 2017; Silva-Zolezzi et al., 2009; Ebenezer et al., 2022). The Estonian Biobank (hereinafter EstBB) includes approximately 20% of Estonia’s adult population and integrates comprehensive linking to EHRs and medical records, facilitating detailed analyses of primary care phenotypes (Milani et al., 2024; Leitsalu et al., 2015).
Despite their strengths, large-scale biobanks often face some limitations, including challenges in harmonising phenotypes, reliance on self-reported data that is prone to recall bias and other inaccuracies, and underrepresentation of primary-care and deeply characterised phenotypes. These limitations restrict the scope of genetic analyses, particularly when investigating complex traits that involve heterogeneous subtypes or endophenotypes, which require precise classification and systematic integration to uncover relationships between phenotypes and their underlying genetic underpinnings. The Estonian Biobank addresses these challenges through Estonia’s centralised health care system, high-quality linking to standardised electronic health records and national registries, and active refinement of phenotypic definitions by combining multiple validated data sources, thus reducing potential misclassification and enabling more comprehensive insights into primary-care and complex trait genetics (Leitsalu et al., 2015; Milani et al., 2024).
In this study, we established 4,884 distinct phenotypes based on ICD-10 codes originating from national electronic health records (EHR) and performed GWAS to generate a comprehensive atlas of genetic associations in the Estonian Biobank data (n=206,159). Our post-GWAS analysis identified 2,127 unique genome-wide significant loci, 778 putatively novel phenotype-locus associations, and 835 significantly associated coding variants. These results contribute to refining future phenotype definitions, uncovering novel genetic associations, and providing a foundation for future integrative and functional studies in human genetics and genomics.
Results
Estonian EHR registries have been based on the ICD-10 classification since 2004, making it a national standard in medical and research settings. Estonian Biobank has a collection of health data from various sources (Milani et al., 2024). Importantly, self-reported diagnoses made up only 1% (632,062 out of 47,419,565) of all the records, with the vast majority derived from EHRs (Figure 1A). This demonstrates the overwhelming reliance on clinically verified diagnoses, providing a comprehensive view of disease occurrence and longitudinal progression within the dataset.
A) The distribution of the data sources. The bars contain the numbers of records and individuals to which the records from each category are related. B) Distribution of EstBB participants with at least one recorded trait in ICD-10 disease categories.
First, we analysed the distribution of medical records linked to the 206,159 biobank participants (Figure 1A, see Methods) and focused on 4,884 ICD-10 based disease phenotypes, comprising 1,209 main diagnostic categories and 3,675 more granular, clinically detailed subcategories (Supplementary Table 1). Nearly all participants had records related to “Non-diagnostic medical codes” (ICD-10 codes “R00” to “Z99”), with high prevalence in categories related to “Genitourinary” (N00-N99), “Musculoskeletal” (M00-M99), “Skin and subcutaneous tissue” (L00-L99) diseases. In contrast, moderate prevalence of 53-57% was observed for “Mental health” (F00-F99), “Metabolic” (E00-E99), and “Neoplasms” (C00-D48), while “Congenital anomalies” (Q04-Q99) were the least common (Figure 1B; Supplementary Table 2).
Genetic and phenotypic correlations across analysed disease phenotypes
Secondly, we analysed imputed genotype data, consisting of 18,977,777 variants, including 17.4M single nucleotide variants (SNVs) and 1.6M insertions and deletions (indels). Among those, 6,589,347 were common variants (with minor allele frequency, MAF>0.05), 4,065,207 low-frequency variants (0.005<AF≤0.05) and 3,453,053 rare variants (0.001<AF≤0.005) that provide insight into less prevalent but functionally relevant genetic changes (Figure S2A), and 4,850,101 were ultra-rare variants (AF ≤0.001). Altogether 18,748,626 (98.8%) of analysed variants had an imputation INFO score greater than 0.8 (Figure S2B).
Next, we calculated the SNP-heritability estimates for all studied disease phenotypes. The mean SNP-heritability value (h2) across all 4,884 phenotypes was 0.059 (Figure S3), with 5.5% showing statistically significant heritability (Bonferroni-corrected p ≤ 0.05; P = 1.02×10−5). The overall distribution of obtained h2 values was similar with the one observed in UKBB (Neale Lab, unpublished). The traits with the highest confidently estimated SNP-heritability were “Other specified hypothyroidism” (ICD-10 code ‘E03.8’, h² = 0.176), “Obesity due to excess calories” (E66.0, h² =0.223), “Myopia” (H52.1, h² = 0.175), “Overweight and obesity” (E66, h² = 0.224) and “Type 2 diabetes mellitus without complications” (E11.9, h² = 0.161) (Figure 2B).
A) The relation between h2 values and Neff values for the traits that have more than 5,000 cases. Different colours represent different levels of effective sample size (Neff) confidence, from lighter shades for none h2 confidence to darker shades representing higher levels of h2 confidence. Larger circles represent a higher significance level in the association studies regarding minus log P-values. B) Five the highest confidently estimated h2 values. The plot illustrates the heritability estimates of various phenotypes, with horizontal error bars representing the standard errors of these estimates. The length of the error bars indicates the confidence intervals of the heritability estimates. The number in parentheses next to each phenotype name represents the prevalence of that condition within the studied population. C) Pairs of the phenotypes where both genetic and phenotype correlations are statistically significant (Bonferroni-corrected P = 1.46×10−6); D)Pairs of the phenotypes where both genetic and phenotype correlations are not statistically significant; E) Pairs of the phenotypes where only phenotype correlations are statistically significant; F) Pairs of the phenotypes where only genetic correlations are statistically significant.
We further estimated the genetic and phenotypic correlations between 178 disease phenotypes which had more than 5,000 cases. Among these, 565 pairs (out of all 31,684 unique pairs) demonstrated statistically significant correlations (p ≤ 1.46 × 10−6) at both the genetic and phenotypic levels, while an additional 26 pairs were significant only in genetic correlation and 21,594 pairs were significant only in phenotypic correlation (Figure 2C-F). As expected, among these 565 phenotype pairs, 25 were classified under the same ICD hierarchy categories. Examples include “Mixed asthma” (ICD-10 code “J45.8”) and “Other and unspecified asthma” (J45.9), as well as “Nontoxic single thyroid nodule” (E04.1) and “Nontoxic multinodular goiter” (E04.2). Additionally, 181 pairs included at least one phenotype from the “Non-diagnostic medical codes” group (R00-Z99), reflecting correlations beyond specific disease diagnoses.
The strongest positive genetic correlations were observed between “Essential hypertension” (ICD-10 code “I10”) and “Hypertensive heart disease without heart failure” (I11.9) (rg = 0.98, phenotypic correlation = 0.55), as well as between “Adult osteochondrosis of the spine” (M42.1) and “Abdominal and pelvic pain” (R10) (rg = 0.98, phenotypic correlation = 0.07). And not surprisingly, the strongest negative correlations were found between “Hypermetropia” (H52.0) and “Myopia” (H52.1) (rg = −0.83, phenotypic correlation = −0.12) and also between “Myopia” (H52.1) and “Presbyopia” (H52.4) (rg = −0.56, phenotypic correlation = −0.06).
Disease associated loci
Next, we determined that 953 out of the 4,884 phenotypes have at least one genome-wide significant locus, totalling 2,127 unique significant loci and 2,811 significant locus-phenotype combinations (1915 loci and 2,468 associations with HLA region excluded). As expected, some of the loci were associated with more than one phenotype. The phenotypes in the “Endocrine, nutritional, and metabolic diseases” group show the highest number of significant loci - 564, closely followed by conditions affecting the “Circulatory system” - 253 and “Skin and subcutaneous tissue” - 230 (Figure S4; Supplementary Table 1). The traits from the “Respiratory infections” group have high numbers of cases; however, the numbers of significant loci for such traits are relatively low (e.g., “Acute upper respiratory infections of multiple and unspecified sites” (J06): 160,082 cases, one genome-wide significant locus; “Acute pharyngitis” (J02): 87,204 cases; 2 significant loci). These diagnoses have mostly environmental causes, probably explaining a limited number of associated loci. Meanwhile, common conditions like “Obesity” and “Type 2 diabetes mellitus” have a considerable number of loci (37 and 21, respectively) but not the highest number of cases (32,933 and 16,492, respectively), suggesting that both genetics and environment contribute to their progression. Interestingly, “Myopia” (H52.1; 51,585 cases) had the highest number of genome-wide significant loci in our analysis (72). In contrast, “Conjunctivitis” (H10) has more cases (72,459) but fewer associated loci (2), as it is primarily caused by viral infections (Figure 3A).
A) Relation between number of loci and number of cases. Larger circles represent a higher significance level of the associated locus (minus log P-value). B) Proportional distribution of signals between phenotype groups in terms of novelty by overlap with GWAS Catalog. The numbers on the bars show the actual numbers of unique loci in each group.
By comparing significant loci and their proxies with the GWAS Catalog, we estimated that 1,690 (68.48%) genome-wide significant locus-phenotype associations were known, and 778 (31.52%) have not been reported previously (Figure S5). Across all phenotypes, we identified novel SNVs associated with 589 disease phenotypes. Of these, 359 were linked to more specific ICD-10 subcategories (four-character codes), while 230 were associated with broader disease groups (three-character codes), considerably expanding the catalog of locus-phenotype associations. The majority of known loci were found in categories such as “Endocrine, nutritional, and metabolic diseases” - 86.9%, “Eye and adnexa” - 76.8%, and “Circulatory system diseases” - 73.5%, reflecting on the fact that these groups of traits are well-studied and frequently encountered in routine healthcare settings (Figure 3B).
For a more stringent approach we also applied a study-wide significance threshold (p ≤ 1×10−11). Considering that, the highest percentage of study-wide significant novel loci were in the “Blood disorders” (16 loci, ICD-10 codes “D50-D89”), followed by “Congenital diseases” (2 loci, ICD-10 codes “Q00-Q99”). In terms of frequency of these variants in the population, 435 (34 study-wide significant) of them were common variants and 258 (14 study-wide significant) were the low-frequency variants.
Beyond numerous well-established disease-associated genes, including F5, TERT, FTO, ABCG8, APOE, PITX2, RHCE, and GC (Figure S6), we identified several novel associations. A notable example trait is pityriasis versicolor (B36.0), a common fungal skin infection caused by Malassezia yeast, which despite its high prevalence, has to our knowledge not previously been analysed from host genetics perspective. Our GWAS of 8,293 cases identified 24 independent loci, including missense variants in SH2B3, PTPN22, IL13, and HLA genes, as well as a loss-of-function splice variant in GSDMB, indicating that immune responses play a key role in susceptibility to this fungal infection. These genetic mechanisms mirror the previously observed overlap between dermatological and autoimmune conditions (Ivert et al., 2021; Haapaniemi et al., 2024).
Coding variant associations
We identified a total of 3,410 genome-wide significant coding variant associations that affect protein sequence or function, with 2,149 located within the HLA region and 1,261 outside it. Given the known complexity and polygenicity of immune-related traits, the high concentration of variants in the HLA region aligns with previous findings (Butler-Laporte et al., 2023). Outside the HLA region, 514 unique coding variants associated with at least one phenotype and were distributed across 349 unique genes (Figure 4; Supplementary Table 3). Notably, the non-HLA variants were linked to 300 distinct traits, whereas 123 traits had significant associations within the HLA region, reinforcing the high impact of immune-related mechanisms in common complex diseases. Of these, 78 traits were exclusively associated with HLA region variants, highlighting the need for a more focused investigation into immune-mediated mechanisms, including HLA allele associations and the proportion of true-positive non-HLA gene hits within this region.
Scatterplots have been grouped based on ICD-10 codes (Supplementary Table 2). x-axis is the odds ratio (OR) from 0 to 2.5. y-axis displays the allele frequency (AF) on a log scale. Colors indicate the most severe ClinVar based classification of each variant. Select genes are annotated within each category. Variants located within the HLA locus have been excluded.
Notable low-frequency frameshift variants
The EstBB dataset contains a significant number of predicted loss-of-function variants, some of which map to genes with known medical relevance. Several variants demonstrated exceptionally strong effects on disease risk and have previously not been witnessed in a biobank setting. Most notably, the frameshift variant rs558269137 (FLG:p.Ser761CysfsTer36) was linked to congenital ichthyosis (Q80) with an odds ratio of 13.306 (SE = 2.80; P = 1.05 × 10⁻34; AF = 0.0166). It should be noted that ClinVar classifies this FLG variant as “likely pathogenic”. Filaggrin (FLG) is a critical structural protein for epidermal barrier formation, playing a key role in aggregating keratin filaments and facilitating corneocyte maturation (Kim & Lim, 2024). Loss-of-function mutations in FLG lead to compromised skin barrier integrity, resulting in excessive transepidermal moisture loss and susceptibility to hyperkeratosis (Moosbrugger-Martinz et al., 2022).
Another interesting example of a high-impact variant is the GJB2:p.Gly12ValfsTer (rs80338939) mutation, resulting in a frameshift deletion affecting the connexin-26 protein, which plays a crucial role in cellular communication within the cochlea (Ke et al., 2025). This variant is a well-established cause of autosomal recessive hearing loss, and our analysis further confirms its recessive effect on disease risk in EstBB. The heterozygous carriers exhibit a modest but significant increase in risk (OR = 1.23, SE = 0.05, P = 9.6 × 10⁻⁷), while homozygous individuals display an extreme risk elevation (OR = 56.76, SE = 19.89, P = 6.3 × 10⁻⁴¹). Notably, the allele frequency of this variant in EstBB is 0.0245, which is approximately 2.5 times higher than in other non-Finnish European populations (AF = 0.0099).
The high-quality imputation in EstBB enables the reliable detection of numerous frameshift and insertion-deletion variants, ensuring accurate association testing even for rare loss-of-function mutations with clinical relevance. Moreover, such variants emphasize the value of population specific biobanks, which provide a unique opportunity to study broadly sampled cohorts and allow for novel interpretations of potential disease-causing variants.
Coding variant pleiotropy
Several coding variants identified in our analysis exhibit pleiotropic effects, influencing multiple phenotypes across disease categories. For example, the immunomodulatory variant rs2476601 (PTPN22:p.Trp620Arg) displayed significant correlations in eight different disease groups (mycoses; vitamin B12 anemia; thyroid disorders; type 1 diabetes; retinal disorders; vitiligo and rheumatoid arthritis). Meanwhile, three different missense variants within the APOE gene (rs7412, rs429358, rs769452) were linked to a broad range of hyperlipidemia subtypes and dementia-related conditions, including both Alzheimer’s disease and vascular dementia.
The TM6SF2:p.Glu167Lys (rs58542926; AF = 0.0676) variant presents an intriguing duality, exhibiting opposing effects on hepatic and circulating lipid levels. In our analysis, this variant increased the risk of non-alcoholic fatty liver disease (NAFLD; K76.0) (OR = 1.30, P = 4.3 × 10⁻¹⁴), while simultaneously protecting against mixed hyperlipidemia (E78.2) (OR = 0.84, P = 2.6 × 10⁻¹⁷). This pattern aligns with TM6SF2’s role in hepatic lipid metabolism, where Glu167Lys impairs very-low-density lipoprotein secretion, leading to hepatic triglyceride accumulation, but lowering circulating lipid levels (Pirola & Sookoian et al., 2015). While LD score regression (rg = 0.73, P = 8.2 × 10⁻⁶) suggests a strong genetic correlation between NAFLD and hyperlipidemia, the Glu167Lys variant disrupts this expected relationship, reinforcing the concept of a trade-off mutation: detrimental for hepatic fat storage, but beneficial for systemic lipid homeostasis.
The IFNL4:p.Gly23AlafsTer158 (rs11322783, also known as INFL4-ΔG/TT; AF = 0.343) frameshift variant demonstrates a notable dual effect on viral hepatitis outcomes. Our analysis reveals that this frameshift deletion is protective against acute hepatitis A (B15), with a decreased risk (OR = 0.843, SE = 0.027, P = 1.88 × 10⁻⁹). Conversely, it is associated with an increased risk for chronic viral hepatitis C (B18), exhibiting an elevated risk (OR = 1.204, SE = 0.0384, p = 5.39 × 10⁻⁹). The IFNL4 gene encodes interferon lambda 4, a cytokine involved in the antiviral immune response. Previous studies have established that the INFL4-ΔG/TT variant is associated with impaired clearance of hepatitis C virus (HCV), leading to chronic infection. This association is attributed to the functional IFNL4 protein’s modulation of the immune response, which may inadvertently hinder the clearance of HCV (Møhlenberg et al., 2022). In contrast, the protective effect against acute hepatitis A observed in our analysis suggests that the absence of functional IFNL4 may enhance the immune system’s ability to clear hepatitis A virus (HAV) during acute infection. To our knowledge, this specific dual effect of the INFL4-ΔG/TT variant has not been previously reported, and future studies should further validate whether this dual association reflects a true immunological mechanism influencing viral clearance dynamics or arises from statistical confounding, diagnostic bias, or phenotypic misclassification.
Drug targets and ACMG genes
Identifying clinically actionable genetic variants is a key element of precision medicine, and large-scale biobank studies provide a powerful framework for uncovering key associations in medically relevant genes. Among the significant non-HLA region variants, we identified 24 pharmacologically active known drug target genes (Wishart et al., 2017). Notably, four of these genes (PCSK9, TYR, SCN10A, and KCNJ11) harboured protective variants, acting as natural models for drug-mimicking effects. Such genetic variants demonstrate the benefits of broad GWAS approaches in identifying potential therapeutic targets.
Moreover, our analyses identified key variants in six clinically significant ACMG genes (Miller et al., 2023), each with established roles in either metabolic disorders, cardiovascular disease, and hereditary cancer syndromes (Figure S8). We detected strong associations between APOB missense variants and disorders of lipoprotein metabolism (OR up to 12, p ≤ 10⁻⁴¹), consistent with its well-known role in familial hypercholesterolemia. The classic hemochromatosis mutation (HFE:p.Cys282Tyr) remains the strongest predictor of iron overload disorders (OR = 5.53, P < 10⁻⁴³). Additionally, we confirm the protective effect of PCSK9 variants on lipid levels, reinforcing the therapeutic relevance of PCSK9 inhibitors. All in all, our analysis identified twenty variants in ACMG genes, of which eight variants are considered to be “pathogenic” according to ClinVar classification. A few notable examples also displayed at least two-fold enrichment among the Estonian population, such as APOB variants with lipid metabolism disorders (rs17240441 and rs12713843) and BRCA1 variants with cancer susceptibility (rs80357906 and rs80357711), reinforcing the role of population-specific biobanks in global efforts to refine disease risk prediction.
Novel variant protecting against migraine
Our study identified over 100 genome-wide significant coding variants exhibiting at least a two-fold enrichment in EstBB compared to other European populations (Supplementary Table 3). An intriguing example of a novel EstBB-enriched variant in a known trait is SCN11A:p.Gly1736Val (rs143852849; AF = 0.0101). This variant associates with reduced susceptibility to migraines (G43; OR = 0.709; SE = 0.041; P = 2.74 × 10-9) and a broader reduction in other headache syndromes (G44; OR = 0.846; SE = 0.037; P = 1.16 × 10-4) (Figure 5A). However, no significant association was observed with non-headache pain conditions (Figure S9).
A) Manhattan plot of broad migraine (G43) GWAS, highlighting SCN11A locus lead hit. The x axis represents the genome in physical order, and the y axis represents P values (–log10(P value)) of association. The red horizontal line represents the P value threshold of 5 × 10−8. Loci with lead hits with P value below 1 × 10−8 have been highlighted. B) Structural representation of the voltage-gated sodium channel Nav1.9. The extracellular domain is shown in light blue, the transmembrane domain in yellow, and the intracellular domain in green. The C-terminal α-helix domain, where the variant is located, is enlarged in the inset. The Gly1736Val substitution is highlighted in orange. The model was generated using AlphaFold2, with disordered low-confidence regions omitted for clarity. C) PheWAS volcano plot of SCN11A:p.Gly1736Val (rs143852849) in EstBB. The x-axis represents the effect size (beta), while the y-axis displays the –log10(P value) of association. Bonferroni-significant traits are highlighted in pink. The table on the right lists pain-related ICD-10 categories with corresponding odds ratios, standard errors, and P values. D) Density plot of aspartate aminotransferase (AST) activity from laboratory measurements. Non-carriers (blue, N=111,770) and heterozygous carriers (orange, N=735) are shown. The normal reference range (5–34 U/L) is shaded. Mean AST levels for each group are indicated. Data was obtained from clinical laboratory measurements for LOINC code 1920-8, using chronologically earliest measurement per participant.
SCN11A encodes the voltage-gated sodium channel Nav1.9, which is predominantly expressed in nociceptive neurons and plays a crucial role in pain transmission (Dib-Hajj et al., 2015). Loss-of-function mutations in SCN11A have been linked to congenital insensitivity to pain (Ginanneschi et al., 2019), while gain-of-function mutations can cause chronic pain syndromes (Leipold et al., 2019). The SCN11A:p.Gly1736Val missense variant appears to fine-tune pain perception, conferring a protective effect against migraines. The variant is notably enriched in the Estonian population when compared to other non-Finnish Europeans (4.7-fold enrichment). Structural analysis suggests that SCN11A:p.Gly1736 is located in a conserved C-terminal intracellular α-helix domain of Nav1.9 with an unknown function (Figure 5B).
This general region has been implicated in channel trafficking and membrane localization (Sizova et al., 2020), thereby regulating sodium channel surface expression, and mutations in this domain may alter receptor migration. The α-helix domain is vertebrate-conserved up to zebrafish, indicating functional importance across evolution. Despite its potential impact on channel function, both ClinVar and AlphaMissense classify SCN11A:p.Gly1736Val as benign (AM score = 0.256), although CADD score of 18.84 suggests that this substitution has a functional consequence. Further studies are required to determine whether this variant influences Nav1.9 localization, stability, or function in nociceptive neurons, and how it affects neuronal excitability and pain perception. Additionally, identifying the binding partners of this α-helix domain will be essential to understand its role in channel trafficking and neuronal excitability, as well as its potential therapeutic relevance.
Estonian-enriched aspartate aminotransferase variant in liver diseases
A further example of a population-enriched risk allele with significant clinical implications is the GOT1:p.Gln208Glu (rs374966349; AF = 0.0046) variant, which increases the risk of chronic liver diseases by 2- to 5.8-fold (Figure 5C, Figure S10). This rare variant encodes a glutamic acid substitution of the aspartate aminotransferase (AST) enzyme, and it is nearly 10 times more enriched in the EstBB dataset than in other European populations. Elevated AST activity is a well-established biomarker of liver dysfunction. Our analysis of clinical measurements (LOINC 1920-8: “Aspartate aminotransferase in Serum or Plasma”) across 112,505 EstBB participants demonstrates a substantial association with GOT1:p.Gln208Glu (β = +29.57 U/L; SE = 2.44; P = 7.28 × 10⁻³⁴) (Figure 5D).
The GOT1:p.Gln208Glu variant is absent from the ClinVar registry and has a CADD deleteriousness score of 14.88, suggesting a potential functional impact. However, AlphaMissense predicts the variant to be likely benign (AM score = 0.0785), indicating that while structurally disruptive, its clinical significance may depend on additional genetic or molecular factors. Structural analysis suggests that Gln208 is located on a surface-exposed α-helix of AST, distant from the active site and its homodimerization interface (Figure S11).
The replacement of glutamine with glutamate at this position introduces a negatively charged residue, which may influence protein-protein interactions or enzyme stability. Previous studies indicate that the p.Gln208Glu variant is associated with macro-enzyme aspartate aminotransferase (macro-AST), a benign condition characterised by persistently elevated AST levels due to the formation of complexes between AST and immunoglobulins in the bloodstream. It is hypothesized that this protein domain may facilitate macro-AST complex formation, where AST aberrantly binds to circulating immunoglobulins without underlying liver pathology. The introduction of a negatively charged glutamic acid at position 208 could enhance AST’s affinity for immunoglobulins, further promoting this interaction (Kulecka et al., 2017; Mueser et al., 2020;, Pei et al., 2023). To our knowledge, this variant has not been previously reported in a population-wide setting.
Discussion
Genome-wide association studies (GWAS) play a crucial role in understanding complex traits and diseases by identifying genetic variants linked to multiple phenotypes. Large-scale analyses reveal shared genetic architectures and comorbidity patterns, improving disease classification, risk prediction, and treatment strategies. Here, we report a large-scale GWAS of 4,884 phenotypes in the Estonian Biobank, demonstrating the power of population-specific biobanks in detecting common and rare genetic associations. Moreover, such biobank-scale GWASs can identify underrepresented primary care phenotypes, which are often overlooked due to the dominance of hospital-based datasets. Our approach expands the catalog of clinically relevant variants and provides a foundation for future research on disease mechanisms and risk prediction.
As expected, most associations identified in our study have been previously reported. In total, 1,690 (68.48%) genome-wide significant locus-phenotype associations were found in the GWAS Catalog (Verma et al., 2024; Cerezo et al., 2025). The highest number of known loci were linked to endocrine and metabolic traits (490 loci), circulatory diseases (186 loci), and eye and adnexa-related traits (156 loci), reflecting the extensive study of these disease categories. Additionally, the overall distribution of SNP heritability values in our study closely aligns with previous findings from the UK Biobank (Neale Lab, unpublished; Bycroft et al., 2018), further validating the Estonian Biobank phenotypes and the robustness of our approach.
Our analysis identified 778 genome-wide significant associations (31.52% of all significant associations) that are LD-independent of previously reported loci. Remarkably, 91 of those were significant when the more stringent study-wise significance threshold was applied, demonstrating robustness of our approach for large-scale discovery studies. Previous studies in population-based collections have demonstrated similar results, illustrating the benefits of analysing homogeneous population samples (Kurki et al. 2023). A substantial proportion of these novel associations were linked to ICD-10 subcategories, which can be considered endophenotypes, offering a more granular view of disease etiology. While recent decades have focused on GWAS of broad disease categories, smaller, more specific conditions have often been overlooked, limiting insights into their genetic basis and relationships with other traits. Identifying genetic components at this scale across 4,884 phenotypes provides an overview of the shared genetic architectures and unique opportunity to refine our understanding of comorbidities.
Biobank-scale studies provide a powerful framework for identifying coding variants with strong clinical and biological implications. Our study uncovered 510 unique coding variants across 345 genes, which were linked to 300 distinct traits outside the HLA region. Beyond common disease associations, our analysis identified high-impact frameshift variants, such as FLG:p.Ser761CysfsTer in congenital ichthyosis and GJB2:p.Gly12ValfsTer in hereditary hearing loss. These findings reinforce the value of population-specific genetic studies in identifying variants that may be overlooked in broader genomic datasets. Additionally, pleiotropic opposing effects were evident in several key variants, such as IFNL4:p.Gly23AlafsTer158, which correlated with acute and chronic viral hepatitis, and TM6SF2:p.Glu167Lys, which demonstrated an inverse relationship between hepatic lipid accumulation and systemic lipid homeostasis. The discovery of clinically actionable variants, including drug-mimicking protective variants in PCSK9, TYR, SCN10A, and KCNJ11, further illustrates the potential for genetics-driven drug discovery. Additionally, our identification of ACMG-listed pathogenic variants, such as APOB missense mutations in hyperlipidemia and HFE:p.Cys282Tyr in iron overload disorders, confirms that biobank datasets can be used to investigate how the interplay between common and rare variants influences variable penetrance in disease susceptibility.
Unlike traditional medical genetics, which often focuses on pathogenic variants, biobank-scale studies enable the discovery of protective genetic factors that may inform novel therapeutic hypotheses. The described SCN11A:p.Gly1736Val variant is an intriguing candidate for further research into pain modulation. Nav1.9 is an established target for neuropathic pain (Bjornsdottir et al., 2023), but the novel Gly1736Val variant provides human genetic evidence for its resilience role against migraine. Further validation is needed to determine whether this variant modifies receptor migration, leading to reduced sodium channel density on nociceptive neurons. If confirmed, this could provide a novel target for migraine therapeutics, as short α-helical peptides derived from this domain could serve as a drug backbone for peptidomimetics, thereby offering a potential therapeutic approach for migraine prevention.
Population-specific biobanks also provide a unique opportunity to study locally enriched variants, offering increased power to detect associations, particularly when linked to a wide range of phenotypic traits. In this study, we identified GOT1:p.Gln208Glu, a variant that dramatically increases AST enzymatic activity while also elevating liver disease risk by at least two-fold. Given its strong effect, individuals carrying GOT1:p.Gln208Glu could benefit from targeted screening for chronic liver diseases, which may be underdiagnosed in standard clinical practice. Future research should explore longitudinal AST trends in carriers, assess potential metabolic consequences of altered AST activity, and conduct functional assays to determine the mechanistic impact on enzyme function and stability. Altogether, these coding variant changes illustrate how population-specific biobanks enhance the discovery of both protective and risk variants, providing insights into disease mechanisms and potential therapeutic targets. Variants with strong effects in isolated populations can refine our understanding of genetic risk, develop novel intervention methods and inform early diagnostic strategies.
Although extremely valuable as an exploratory discovery step, one limitation of this GWAS-based approach primarily stems from using ICD-10 codes based disease phenotypes, which, although cost- and time-efficient, can severely limit the statistical power. Whilst ICD-10 codes are commonly transformed into Phecodes to better reflect the biological basis of medical traits (Verma et al., 2024), in this study, we chose to analyze ICD-10 codes directly. This approach enables us to investigate differences between related diagnoses and to identify genetic similarities and distinctions among specific sub-phenotypes that have not been previously analyzed individually.
Additionally, phenotypes defined using heterogeneous criteria can yield misleading associations, such as false enrichments or inflated signals, particularly when sex-specific distributions are highly skewed. For example, phenotype “Desensitization to allergens” (Z51.6) is overwhelmingly female-biased (1122 out of 1127 cases) due to biological factors, such as pregnancy-related rhesus conflict treatments. Results from highly specific phenotypes must be evaluated individually, as sample overlaps or the specificity of the local medical system may contribute to false-positive signals. Furthermore, our phenotypic inclusion criteria required only a single recorded diagnosis event per participant, which may lead to false-positive case assignments. Isolated diagnoses could result from coding errors, transient conditions, local medical practices, or misclassification, potentially inflating associations. More stringent criteria, such as requiring multiple diagnosis events or prescription records, would improve trait specificity and reduce the risk of spurious signals in follow-up studies. Moreover, methodological limitations exist due to imprecise estimation of heritability for low-prevalence phenotypes, as evidenced by 1,267 out of 4,884 phenotypes showing negative heritability estimates - an acknowledged limitation of linkage disequilibrium score regression (LDSC) arising from insufficient case counts.
Additional challenges arise in the analysis of coding variants. Without LD pruning, some identified variants may be in linkage disequilibrium with causal variants rather than being directly responsible for the observed associations. These findings require further scrutiny through fine-mapping and functional validation to distinguish true causal variants from linked signals. Conversely, our stringent filtering of the HLA locus excluded a substantial portion of this highly polymorphic region, likely omitting true-positive associations. Given the biological relevance of HLA-linked coding variants in immune-related traits, deeper analyses are needed to systematically evaluate their potential contributions.
Leveraging the extensive phenotype coverage of the Estonian Biobank, our study advances the understanding of genetic determinants underlying human disease and related traits. By mapping genotype-phenotype relationships at scale, we expand the catalog of known associations and contribute to a broader taxonomy of human diseases, refining genetic risk across interrelated conditions. These findings provide a foundation for future research, support precision medicine approaches, and facilitate targeted investigations into specific traits and diseases through refined genetic analyses.
Methods
Phenotype definitions
The Estonian Biobank is a volunteer-based biobank comprising 212,955 participants. All participants provided broad informed consent, allowing for the regular linkage of their health data with the National Health Insurance Fund and other relevant databases to obtain ICD-10 code based events. Most electronic health records have been collected since 2004 (Leitsalu et al., 2015), and analyses were carried out with the data freeze 2023v01. Data from the national health databases of the Estonian Health Insurance Fund formed the primary source, contributing 34,923,603 records (74% of the total data). Hospital records data followed, with 8,454,839 records (18%), and text-mined information from medical epicrises provided 3,409,061 records (7%) (Figure 1A). Regarding treatment types, outpatient care in primary care settings accounted for the most recorded interactions, with 28,807,523 cases. Inpatient care within secondary care facilities was represented with 1,354,634 cases. For this study, analyses were restricted to individuals of European ancestry, as non-European participants represent only 0.31% of the biobank. The final sample included 206,159 participants who had more than five distinct ICD-10 codes assigned to them. We further limited our analysis to phenotypes with more than 100 recorded cases. Participants were defined as cases if they had at least one instance of the relevant ICD-10 code in their electronic health records, while those without a recorded diagnosis were considered controls. This approach resulted in 4,884 binary traits.
Genotyping and imputation
All Estonian Biobank participants have been genotyped at the Core Genotyping Lab of the Institute of Genomics, University of Tartu, using Illumina Global Screening Array v3.0_EST. All the genetic variants were imputed using an Estonian-specific imputation panel; 295,810 have been directly genotyped (Mitt et al., 2017). Samples were genotyped and PLINK format files were created using Illumina GenomeStudio v2.0.4. Individuals were excluded from the analysis if their call rate was < 95%, if they were outliers of the absolute value of heterozygosity (> 3SD from the mean), or if sex defined based on heterozygosity of the X chromosome did not match sex in phenotype data (Mitt et al., 2017). Before imputation, variants were filtered by call rate < 95%, Hardy-Weinberg equilibrium p-value < 10−4 (autosomal variants only), and minor allele frequency < 1%. Genotyped variant positions were in build 37 and were lifted over to build 38 using Picard. Phasing was performed using the Beagle v5.4 software (Browning et al., 2021). Imputation was performed with Beagle v5.4 software (beagle.22Jul22.46e.jar) and default settings. The dataset was split into batches of 5,000. A population-specific reference panel consisting of 2,695 whole-genome sequenced samples was utilised for imputation, and standard Beagle hg38 recombination maps were used. Based on principal component analysis, samples not of European ancestry were removed. Duplicate and monozygous twin detection was performed with KING 2.2.7 (Manichaikul et al., 2010), and one sample from each pair of duplicates was excluded.
Genome-wide association studies
Association analysis in Estonian biobank was carried out for all variants with an INFO score ≥ 0.4 for common variants and INFO score ≥ 0.8 for rare variants (MAF < 0.01), using the additive model as implemented in REGENIE v3.2.1 with standard binary trait settings (Mbatchou et al., 2021). The analysis was performed in two steps: first, it fits a null model using leave-one-chromosome-out (LOCO) cross-validation, and the second step performs association testing using Firth’s logistic regression for rare variant analysis to handle potential biases (Suhas et al., 2023). Logistic regression was carried out with adjustment for current age, age², sex, and 10 PCs as covariates, analysing only variants with a minimum minor allele count of 2. Before the GWAS, all the phenotypes were divided into non-sex-specific, male, and female sex-specific groups, which were analysed separately. For this analysis, the pipeline implemented in Nextflow version 22.04.3 was used. Manhattan plots were generated using the topr R package (Juliusdottir, 2023). After analyses, quality control was conducted to ensure that all the GWASs had the same number of SNVs and chromosomes and filtering variants to a minor allele frequency greater than 0.01. The lead variants were identified through distance-based clumping, using a P-value threshold of P = 5 × 10−8 and iteratively removing variants within a ±1Mb window of each lead variant.
Heritability estimation
The SNP heritability was estimated using LD Score Regression software (LDSC) v1.0.1. LD score references for European LD Scores from 1000 Genomes were used, and all the variants were filtered to keep a subset of HapMap 3 variants for the analysis (Browning et al., 2011; The International HapMap Consortium, 2003). The LDSC tool was used twice per phenotype: once to compute liability-scale heritability adjusting for sample prevalence and once to calculate observed-scale heritability. The Python script for processing the summary statistics GWAS files and LDSC was adapted from Bulik-Sullivan et al., 2015.
The study’s next step was calculating the effective sample size (Neff) value to see how the h2 value related to sample size (Neale Lab, n.d.). Because some of the phenotypes represented in the study are relatively rare and have unreliable h2 estimates, we plotted the effective sample size (Neff) against h2 estimate, similar to the analyses conducted by Neale Lab (Neale Lab, n.d.). Neff was calculated using the following formula:
Neff = 4/((1/Ncases)+(1/Ncontrols))
Data was then categorised based on the Neff of less than 4,500, between 4,500 and 20,000, between 20,000 and 40,000, and 40,000 or higher. Corresponding to these conditions, h2 confidence labels “None,” “Low,” “Medium,” and “High” are assigned. If a row in the dataset does not meet the specified conditions, it is labelled “Unknown.” This classification was adapted from UKBB analysis from Neale’s lab, unpublished. Heritability P values were corrected using Bonferroni correction.
Phenotypic correlation analysis
To estimate the correlations between the pairs of the phenotypes, we quantified disease co-occurrence using the comorbidity score (ϕ), which measures the association between two diseases beyond chance (Sadegh et al., 2023; Valderas et al., 2009). We constructed a contingency table for each disease pair to capture the number of individuals with both diseases, one disease, or neither. The comorbidity coefficient ϕ ranges from −1 (negative association) to 1 (positive association), with 0 indicating no association. This analysis separated the data into two subsets: one with 3-digit ICD-10 codes (e.g., E06) and one with 4-digit codes (e.g., E03.8). We also used only the ICD-10 codes with more than 5,000 cases to ensure further analysis compatibility with the genetic correlations.
We then applied Fisher’s Z-transformation to normalise comorbidity scores to compare the strength of disease associations. This transformation converted the coefficients into Z-scores following a normal distribution, allowing reliable comparison across different sample sizes.
Genetic correlation analysis
We have collected the phenotypes for the genetic correlations between phenotypes with more than 5,000 cases and at least one significantly associated SNV in the GWAS output. We then combined the phenotypes from 4-digit categories with the 3-digit ones that do not have sub-4-digit codes so we could treat them as separate diagnoses. The genetic correlations were calculated using LD Score Regression software (LDSC) v1.0.1. LD score references for European LD Scores from 1000 Genomes were used, and all the variants were filtered to keep only a subset of HapMap 3 variants for the analysis (Browning et al., 2011; The International HapMap Consortium, 2003). The Python script for processing the summary statistics GWAS files and LDSC was adapted from Bulik-Sullivan et al., 2015.
Overlap with GWAS Catalog
To distinguish the lead variants not described publicly previously, we determined the novel variants by overlapping the identified lead variants from our analysis with the previously known genetic associations from the GWAS Catalog (version v1.0) (Cerezo et al., 2025).
In order to identify the lead variants, we performed clumping in two rounds using PLINK v1.9 (Chang et al., 2009). As an LD reference panel, an Estonian-specific subset of ∼2500 whole-genome sequenced (WGS) samples from the Estonian Biobank was used (Mitt et al., 2017). In the first round, it identifies significant SNPs at a specified P value threshold of 5 × 10−8 and independent at an LD threshold of r² < 0.6 within 500 kb. Second clumping round was performed with a stricter LD threshold of r² < 0.1 to identify independent lead variants. Finally, the proxies for these lead variants were calculated based on their physical proximity and LD relationships.
After the lead variants and corresponding proxies were identified, they were overlapped with the variants in GWAS Catalog based on the combination of the chromosome and position of the SNVs. The ones present in the GWAS Catalog dataset were considered to be “Known,” and the ones that were unique for our dataset were labelled as “Novel.” If at least one proxy SNP per locus was considered to be “Known”, the locus was considered to be “Known” as well. This approach was adapted from Verma et al., 2024.
Coding variant annotation and analysis
Genome-wide significant SNVs from all GWAS analyses were cross-referenced with exonic variants to identify coding variants associated with phenotypic traits. For this, we used a genome-wide significance threshold of P = 5 × 10⁻⁸. Exonic variant definitions were obtained from gnomAD v4.1.0, which includes 730,947 exomes and 76,215 whole genomes mapped to the GRCh38 reference genome. Non-Finnish European allele frequencies were extracted from gnomAD to facilitate population-specific comparisons.
To characterize the functional consequences of coding variants, we used the Variant Effect Predictor (McLaren et al., 2016), integrating annotations from ClinVar, CADD (Combined Annotation Dependent Depletion), and AlphaMissense to assess potential pathogenicity. The human leukocyte antigen (HLA) region was defined as chr6:25,000,000-35,000,000 (hg38). Protein structure- or function-altering variants were defined as those annotated with at least one of the following consequence terms: missense_variant, stop_gained, frameshift_variant, stop_lost, splice_region_variant, inframe_deletion, inframe_insertion, start_lost, or NMD_transcript_variant.
To identify pharmacologically relevant genetic associations, we screened the resulting variants against known drug target genes using DrugBank v5.1.9. The list of medically relevant genes was obtained from ACMG SF v3.2, which includes genes recommended for secondary findings reporting in clinical sequencing. To investigate the potential structural consequences of identified variants, we employed AlphaFold2 to generate protein structure predictions for SCN11A (Jumper et al., 2021). The resulting models were examined for changes in domain stability and potential functional disruptions. Visualization of the protein structures was performed using ChimeraX (Pettersen et al., 2021). To evaluate the broader phenotypic impact of coding variants, PheWAS was conducted using precomputed Regenie association results, allowing for the retention of related individuals in the dataset.
AST enzymatic activity values (LOINC: 1920-8) were obtained from the Estonian Biobank. Basic quality control was performed by excluding participants with extreme outlier values. To minimize the influence of medical interventions, only the earliest available measurement per participant was retained for analysis. All statistical analyses and data visualization were performed in R version 4.2.1.
Ethics statement
The activities of the EstBB are regulated by the Human Genes Research Act, which was adopted in 2000 specifically for the operations of EstBB. Individual level data analysis in EstBB was carried out under ethical approval 1.1-12/624 from the Estonian Committee on Bioethics and Human Research (Estonian Ministry of Social Affairs), using data according to release application 3-10/GI/31689 from the Estonian Biobank.
Funding statement
This work was funded by the European Union through Horizon 2020 and Horizon Europe research and innovation program under grants no. 874627 (EXPANSE) (JK); 894987 (GENOMEPEP) (EA); 101153901 (CHRONOPIA) (TP); 101096888 (DISCERN) (JK); 101137201 (CLARITY) (EA, KB, LT); 101057721 (PROPHET) (AR); 101128023 (JAPreventNCD) (AR); 101080009 (CAN.HEAL) (AR); 101137154 (WISDOM) (EA); 101137278 (CVDLINK) (UV); and Estonian Research Council Grants PRG555 (Evaluation of genetic variants potentially linked to actionable health risks) (AR); and PRG1291 (Systematic phenome-wide search for genetic modulators in health and disease) (AA, EA, KB, JK, UV, PP).
Data Availability
GWAS summary statistics will be made publicly available upon publication in a peer-reviewed journal. Pseudonymised data and/or biological samples can be accessed for research and development purposes in accordance with the Estonian Human Genome Research Act (https://www.riigiteataja.ee/en/eli/ee/531102013003/consolide/current). To access the raw data, the research proposal must be approved by the Scientific Advisory Committee of the Estonian Biobank as well as by the Estonian Committee on Bioethics and Human Research. For more details on data access and relevant documents, please see https://genomics.ut.ee/en/content/estonian-biobank#dataaccess.
Competing Interest Statement
The authors declare that they have no competing interests.
Supplementary figures
A) Age distributions for male and female participants in Estonian Biobank. B) Pie chart of male and female percentage distribution in the analysed dataset.
A) Distribution of minor allele frequency (MAF) values. B) Distribution of the INFO score values.
Different colours represent different levels of effective sample size (Neff) confidence, from lighter shades for none h2 confidence to darker shades representing higher levels of h2 confidence. Larger circles represent a higher significance level in the association studies regarding log P values.
Distribution of the number of significant loci by trait group
The numbers on the bars show the actual numbers of unique loci in each group.
The x-axis represents genomic position, while the y-axis shows –log₁₀(P value) for the most significant association per variant, independent of trait. Loci with lead hits with P value below 1 × 10−100 have been highlighted. The –log₁₀(P) values are capped at 300, meaning some variants may exceed this threshold.
The x axis represents the genome in physical order, and the y axis represents P values (–log10(P value)) of association. The red horizontal line represents the P value threshold of 5 × 10−8. Loci with lead hits with P value below 1 × 10−8 have been highlighted.
The x-axis represents the odds ratio (log scale), while the y-axis displays the –log10(P value) of associations. Colors correspond to different genes, and point size represents the CADD score of each variant. Only protein structure or function altering variants have been included. The dashed line at OR = 1 indicates no effect. For further details, see Supplementary Table 4.
A) PheWAS volcano plot of SCN11A:p.Gly1736Val (rs143852849) in EstBB. The x-axis represents the effect size (beta), while the y-axis displays the –log10(P value) of association. Bonferroni-significant traits are highlighted in pink. The table on the right lists pain-related ICD-10 categories with corresponding odds ratios, standard errors, and P values. B) LocusZoom scatterplot of SCN11A:p.Gly1736Val, showing the locus surrounding the lead hit in the migraine GWAS (G43). The lead variant of the original GWAS is highlighted with a pink triangle. Blue dots represent P values before conditional analysis, while pink dots indicate results after conditioning.
A) Manhattan plot of the category “Other diseases of liver” (K76) highlighting GOT1 locus. Both common (AF>0.01) and rare (AF<0.01) variants have been plotted. The x axis represents the genome in physical order, and the y axis represents P values (–log10(P value)) of association. The red horizontal line represents the P value threshold of 5 × 10−8. Loci with lead hits with P value below 1 × 10−8 have been highlighted. B) LocusZoom scatterplot of GOT1:p.Gln208Glu, showing the locus surrounding the variant in the category “Other diseases of liver” (K76). The lead variant is highlighted with a pink triangle. Blue dots represent P values before conditional analysis, while pink dots indicate results after conditioning.
The left monomer represents the wild-type (WT) variant with Gln208, while the right monomer displays the Gln208Glu substitution (yellow). Key catalytic residues within both active sites are shown in red. The structural model is derived from the experimentally determined crystal structure of the AST (PDB: 3II0).
Acknowledgements
We want to acknowledge the participants of the Estonian Biobank for their contributions. The Estonian Genome Center analyses were partially carried out in the High Performance Computing Center, University of Tartu. This work was written at writing retreats and writing days organised by the Institute of Genomics, University of Tartu. The Estonian Biobank Research Team was responsible for data collection, genotyping, quality control and imputation and consisted of Andres Metspalu (andres.metspalu{at}ut.ee), Mait Metspalu (mait.metspalu{at}ut.ee), Lili Milani (lili.milani{at}ut.ee), Reedik Mägi (reedik.magi{at}ut.ee), Mari Nelis (mari.nelis{at}ut.ee) and Georgi Hudjashov (georgi.hudjashov{at}ut.ee). We also thank Andres Veidenberg, Abdelrahman Elnahas and Kadri Reis for their contribution in this work.
Footnotes
↵# These authors jointly supervised this work