A genome-wide association study for nonalcoholic fatty liver disease identifies novel genetic loci and trait-relevant candidate genes in the Million Veteran Program.

Nonalcoholic fatty liver disease (NAFLD) is a prevalent, heritable trait that can progress to cancer and liver failure. Using our recently developed proxy definition for NAFLD based on chronic liver enzyme elevation without other causes of liver disease or alcohol misuse, we performed a multi-ancestry genome-wide association study in the Million Veteran Program with 90,408 NAFLD cases and 128,187 controls. Seventy-seven loci exceeded genome-wide significance of which 70 were novel, with an additional European-American specific and two African-American specific loci. Twelve of these loci were also significantly associated with quantitative hepatic fat on radiological imaging (n=44,289). Gene prioritization based on coding annotations, gene expression from GTEx, and functional genomic annotation identified candidate genes at 97% of loci. At eight loci, the allele associated with lower gene expression in liver was also associated with reduced risk of NAFLD, suggesting potential therapeutic relevance. Functional genomic annotation and gene-set enrichment demonstrated that associated loci were relevant to liver biology. We expand the catalog of genes influencing NAFLD, and provide a novel resource to understand its disease initiation, progression and therapy.


Introduction 86
Chronic liver disease is a major contributor to global morbidity and mortality, with complications 87 of cirrhosis and hepatocellular carcinoma 1 . In particular, nonalcoholic fatty liver disease (NAFLD) 88 is an increasingly common cause of chronic liver disease with an estimated world prevalence of 89 25% among adults and associated metabolic risk factors 1-5 . In the United States (US), NAFLD 90 prevalence is projected to reach 33.5% among adult population by 2030, due in large part to the 91 rising obesity and associated metabolic disorders 6 . NAFLD is defined by ≥5% fat accumulation in 92 the liver (hepatic steatosis) in the absence of other known causes for liver disease, based on liver 93 biopsy and/or non-invasive radiological imaging 3,4 . The clinical spectrum of NAFLD ranges from 94 benign steatosis to nonalcoholic steatohepatitis (NASH) involving inflammation and 95 hepatocellular injury with fibrosis progression. At least 20% of patients with NAFLD develop NASH 96 with increased risk of consequent cirrhosis and liver cancer 5,6 . To date, there is no licensed drug 97 approved to treat NAFLD and prevent its progression. 98 Individual susceptibility to NAFLD involves both genetic and environmental factors. Risk 99 factors for NAFLD include obesity (in particular, abdominal adiposity), insulin resistance and 100 features of metabolic syndrome 2,5-7 , with current estimates of NAFLD heritability ranging 101 between 20% to 50% 8 . Several genetic variants that promote the full spectrum of fatty liver 102 disease have been identified in genome-wide association studies (GWAS) utilizing cohorts based 103 on liver biopsy, imaging, and/or isolated liver enzyme values 9-22 . The most prominent of these 104 include p.I148M in PNPLA3 and p.E167K in TM6SF2, which increase NAFLD risk, and a loss-of-105 function variant in HSD17B13 that confers protection against NASH 16 . However, the limited 106 number of genetic associations in NAFLD contrasts with other cardiometabolic disorders where 107 hundreds of loci have been mapped to date, traits that include obesity 23,24 , type 2 diabetes 25 and 108 plasma lipids 26 . This also highlights the need for expanded discovery based on larger sample size 109 and population diversity, with further integration with existing functional genomics data sets to 110 identify candidate genes from leading, non-coding associations 27 . 111 The Million Veteran Program (MVP) is among the world's largest and ancestrally diverse 112 biobanks 28 . The availability of comprehensive, longitudinally collected Veterans Health 113 Administration (VA) electronic health records for US Veteran participants in the MVP also makes 114 it a promising resource for precision medicine. As NAFLD is markedly underdiagnosed clinically 115 due to limited access to liver biopsy and variable use of imaging modalities 4 , we recently 116 developed and validated a proxy phenotype for NAFLD to facilitate case identification in MVP 21 . 117 The proxy NAFLD phenotype is based on chronically elevated serum alanine aminotransferase 118 (cALT) levels while excluding other conditions that are known to increase liver enzymes (e.g. viral 119 hepatitis, alcohol dependence, autoimmune liver disease and known hereditary liver disease). 120 We applied this cALT-based proxy NAFLD phenotype to the current build of 430,400 genotyped 121 MVP participants with defined ancestry classification 29 , and identified 90,408 NAFLD cases and 122 128,187 controls (inclusion/exclusion criteria for the remaining samples and study design 123 described in Supplementary Figure 1 and Figure 1). We performed a primary GWAS and 124 identified 77 trans-ancestry loci that reached genome-wide significance. We used additional 125 approaches to define NAFLD heritability and genetic correlations with various traits including 126 quantitative hepatic fat measured by liver imaging with computed tomography (CT) and magnetic 127 resonance imaging (MRI), in addition to identifying coding variants in putative causal genes. 128 129 130 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

MVP population 145
To identify loci associated with NAFLD, we performed ancestry-specific genome-wide scans by 146 meta-analyzing summary statistics derived from each ancestry followed by trans-ancestry meta-147 analysis combining data across all ancestries and stages (Methods and Figure 1). In the trans-148 ancestry scan across stages, 77 independent sentinel SNPs exceeded genome-wide significance 149 (P < 5x10 -8 ), of which 70 were novel whereas 7 (namely PNPLA3, TM6SF2, ERLIN1, TNKS 150 [PPP1R3B], MARC1, HSD17B13, and LYPLAL1) had previously reported genome-wide significant 151 associations with NAFLD (within 500kb and/or CEU r 2 LD > 0.05; Figure 2 and Supplemental Table  152 2) 9-22 . In addition, 55 loci in EAs, 8 loci in AAs, and 3 loci in HISPs, exceeded genome-wide 153 significance (Supplemental Tables 3-5 and Supplemental Figures 2-4). One SNP (rs4940689) 154 reached genome-wide significance in an ancestry-specific analysis of EAs only (Supplemental 155   Table 3), whereas two SNPs (rs144127357; rs2666559) reached genome-wide significance among 156 AAs only (Supplemental Table 4). No loci in ASNs achieved genome-wide significance, likely due 157 to limited sample size in this group (Supplemental Figure 5). 158 Among the eight AA-specific lead SNPs, three were intronic: rs115038698 in the ABCB4 159 locus, rs144127357 in TJP2, and rs2666559 in NRXN2. Two of these variants were nearly  Table 2). All 77 SNPs showed directional concordance in 171 effect estimates between the two stages. External replication for our trans-ancestry lead SNPs in 172 PMBB (n=72 of our loci were genotyped) with 2,570 cases and 3,802 controls demonstrated that 173 8 out of 72 variants were directionally consistent and nominally associated (signed binomial-test 174 P=4.4x10 -4 ). Furthermore, 4 out of 8 loci discovered in the AA-specific scan (signed bionomial-175 test P=2.5x10 -5 ) and 1 of 3 loci discovered in the HISP-specific scan (signed bionomial test P=0.07) 176 were also directionally consistent and nominally associated in PMBB (Supplemental Tables 6-9). 177 In summary, we found 73 novel loci associated with NAFLD that were identified by trans and 178 single-ancestry association studies and supported by replication in multiple stages and studies. 179 180

Concordance of cALT-based NAFLD loci with CT/MRI-based quantitative hepatic fat 181
To place our discoveries into physiological context, we next investigated the extent to which the 182 77 trans-ancestry SNPs from our NAFLD GWAS associated with quantitative measures of hepatic 183 fat, derived from CT/MRI imaging studies (Methods). We performed a trans-ancestry meta-184 is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint To discover additional variants independent of our lead NAFLD signals, we next performed 196 approximate conditional analysis using the leading sentinel variants at our 77 trans-ancestry 197 associated loci. We detected a total of 41 conditionally independent SNPs flanking three known 198 (PNPLA3, HSD17B13, and ERLIN1) and 17 novel NAFLD loci in EA (Supplemental Table 11). Nine 199 conditionally independent SNPs were observed at the PNPLA3 locus in MVP, indicative of the 200 complexity of this locus. For one of the novel loci, located on chromosome 12 between 121-201 122Mb, the trans-ancestry lead variant (rs1626329) was located in P2RX7, whereas the lead peak 202 for EA mapped to HNF1A (rs1169292, Figure 3). Both are strongly linked to distinct coding 203 variants (P2RX7: rs1718119, Ala348Thr and HNF1A: rs1169288, Ile27Leu) and are compelling 204 candidate genes for metabolic liver disease. In AA, we observed eight conditionally independent 205 variants at four genomic loci, one at PNPLA3 and three novel loci (Supplemental Table 12), 206 including four in GPT, two in AKNA, one in ABCB4. In HISP, two conditionally independent variants 207 in the PNPLA3 locus were identified (Supplemental Table 13). Collectively, 51 additional variants 208 were identified at 24 loci across ancestries based on conditional analysis. 209 210 Fine mapping to define potential causal variants in 95% credible sets 211 To leverage increased sample size and population diversity to improve fine-mapping resolution, 212 we computed 95% credible sets using Wakefield's approximate Bayes' factors 30 derived from the 213 trans-ancestry meta-regression, EA, AA, and HISP scans (Supplemental Table 14-17, Methods). 214 In a comparison of the trans-ancestry and EA-only scans, the trans-ancestry meta-regression 215 reduced the median 95% credible set size from 9 (IQR 3 -17) to 7.5 variants (IQR 2 -13). A total 216 of 11 distinct NAFLD associations were resolved to a single SNP in the trans-ancestry meta-217 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint regression, with 4 additional loci suggestive a single SNP in the EA (n=2) and AA (n=2) ancestry-218 specific meta-analyses that were not already resolved to a single SNP via the trans-ancestry 219 analysis. 220

221
Heritability of NAFLD and genetic correlations with other phenotypes. 222 To tabulate trait heritability and genetic correlation with others, we utilized LD score regression 31-223 33 (Methods). Consistent with our discovery of novel genetic associations, we estimated the SNP-224 based liability-scaled heritability at 16% (95% CI: 12-19, P < 1.0x10 -6 ) in EA. Genome-wide genetic 225 correlations of NAFLD were calculated with a total of 774 complex traits and diseases by 226 comparing allelic effects using LD score regression with the EA-specific NAFLD summary statistics. 227 A total of 116 significant associations were observed (Bonferroni correction for 774 traits P < 228 6.5x10 -5 , Supplemental

Liver-specific enrichment of NAFLD heritability 235
To ascertain the tissues contributing to the disease-association underlying NAFLD heritability, we 236 performed tissue-specific analysis using stratified LD score regression. The strongest associations 237 were observed in genomic annotations surveyed in liver, hepatocytes, adipose, and immune cell 238 types among others (e.g., liver histone H3K36me3 and H3K4me1, adipose nuclei H3K27ac, spleen 239 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 2, 2021. To identify additional coding variants that may drive the association between the lead SNPs and 259 NAFLD risk, we investigated predicted loss of function (pLoF) and missense variants strongly 260 linked to the identified NAFLD lead variants (r 2 > 0.7, Supplemental Table 25-28). Four previously 261 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint  We performed colocalization analyses with gene expression and splicing across 48 tissues 276 measured by the GTEx project, and overlapped our lead SNPs with histone quantitative trait locus 277 (QTL) data from livers to identify NAFLD-associated variants that are also associated with change 278 in gene expression (eQTLs), splice isoforms (sQTLs), or histone modifications (hQTLs, Methods, 279 Supplemental Table 29). Across all tissues, a total of 123 genes were prioritized with 20 genes in 280 liver tissue (Methods). In liver tissue alone, a total of eight variant-gene pairs were identified 281 where the allele associated with protection against NAFLD was also associated with reduced gene 282 expression (i.e., the direction of effect was concordant between the GTEx eQTL and GWAS 283 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 2, 2021. DDX42, TRC4AP, and APOL3) that were affected in at least two tissues (Supplemental Table 30). 287 Finally, two of our lead SNPs were in high LD (r 2 > 0.8) with variants that regulated H3K27ac levels 288 in liver tissue (hQTLs), namely EFHD1 (hQTL SNPs rs2140773, rs7604422 in EFHD1) and FADS2 289 loci (hQTL rs174566 in FADS2) 36 . proteins, 86 nodes were observed, with a PPI enrichment (P < 9.0x10 -8 ) indicating that the 303 network has substantially more interactions than expected by chance (Supplemental Table 32  304 and Supplemental Figure 6). 305 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint For each gene identified from all of the above described analyses, we counted the number 306 of times that the gene was identified for each of the analyses (DEPICT gene prediction, coding 307 variant linkage, QTL colocalization, promotor Capture-C and/or ATAC-Seq peak overlap, and PPI 308 network analysis) and divided this by the number of analyses (e.g., 8). We labeled this measure 309 as the gene nomination score, which reflects the cumulative evidence supporting the respective 310 gene as causal for the observed association. Based on our gene nomination scheme, we found 311 evidence for a single gene nomination at 52 genomic loci, two genes at 14 loci, and three genes 312 at six loci. Six loci had more than three genes nominated (one of which was HLA), and only two 313 loci lacked any data to support a nomination (Supplemental Table 33). We further prioritized 314 those loci which were prioritized by at least 3 sources of evidence (or 4 sources of evidence for 315 coding variants). This resulted in a total of 27 loci supported by multiple lines of evidence (Table  316 1), which included 6 loci with co-localizing eQTLs in liver or adipose tissues and connection to the 317 predicted gene via Promoter CaptureC data (i.e., EPHA2, IL1RN, SHROOM3, HKDC1, PANX1, 318

DHODH;HP). 319
Interestingly, 14 of the nominated genes are transcription factors (TF) (Supplemental 320 CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; Polygenic Risk Score analyses. 328 We calculated a candidate SNP polygenic risk score (PRS) based on Stage 1 350K dataset (primary 329 set) to perform a Phewas in an independent sample in MVP (Stage 2 replication set). We observed 330 that an increased NAFLD PRS was associated with abnormal results of function study of liver 331 (Bonferroni P < 3.1 x 10 -5 ), and showed suggestive significance with bacterial pneumonia, otalgia, 332 gout and other crystal arthropathies and non-infectious gastroenteritis (P < 0.001, Supplemental 333 Table 35). Furthermore, a NAFLD PRS based on the Stage 1 set was positively associated with 334 NAFLD prediction in the Stage 2 replication set (P=3.8 x 10 -5 , Supplemental Table 36). 335 336

Investigation of pleiotropy of lead NAFLD SNPs. 337
We next sought to identify additional traits that were also associated with our 77 trans-ancestry 338 lead SNPs. First, we performed a LabWAS of distinct clinical laboratory test results 38 in MVP 339 (Methods), yielding 304 significant SNP-trait associations (Supplemental Table 37 which identified various SNP-trait associations that mapped to loci previously associated with 342 liver traits, cardiometabolic traits, as well as additional enriched association for gallstones, gout, 343 arthritis, and hernias (Supplemental Tables 38 and 39). In particular, we examined all 344 associations for PheCode 571.5, "Other chronic nonalcoholic liver disease" which comprised 345 1,664 cases and 400,055 controls. Of the n=73 variants found, we noted that 14/73 were both 346 nominally associated and directionally consistent with our scan (signed binomial test P=3.4x10 -347 9 ), providing additional validation for our scan (Supplementary Table 40). Third, we performed a 348 SNP lookup using the curated data in the IEU OpenGWAS project (Supplemental Tables 41 and  349 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021.

42), which identified 2,891 genome-wide significant SNP-trait associations for trans-ancestry 350
SNPs, and additional 283 SNP-trait associations for the ancestry-specific lead SNPs. Finally, we 351 performed cross-trait colocalization analyses using COLOC of EA, AA, and HISP lead loci with 36 352 other GWAS statistics of cardiometabolic and blood cell related traits (Methods). This resulted in 353 significant regional colocalization for 64 SNP-trait pairs in EA, 32 SNP-trait pairs in AA, and 12 SNP 354 trait pairs in HISP (Supplemental Table 43 traits and inflammatory traits (e.g., C-reactive protein, white blood cell count, lymphocyte count). 364 Finally, 25 loci showed association with all three traits: liver, cardiometabolic, and inflammation. 365 Notably, among 12 loci that showed significant association with hepatic fat (color-coded in red 366 in Figure 4), 11 were associated with both liver and metabolic traits, including five that were also 367 associated with inflammatory traits. Collectively, our findings identify novel NAFLD-associated 368 genetic loci with pleotropic effects that may impact hepatic, metabolic and inflammatory traits. 369 370 371 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Discussion 372
In this study, the largest and most diverse GWAS of NAFLD to date, we report a total of 77 trans-373 ancestry (of which 70 are novel) and 3 additional ancestry-specific loci that show significant 374 genome-wide association with NAFLD. While our NAFLD definition is a proxy for chronic 375 hepatocellular injury in the absence of other known causes of liver disease, we further used 376 CT/MRI imaging data to compare to what extent these SNPs also associated with hepatic fat 377 accumulation. Overall, 24 (~30%) of these loci were nominally associated with hepatic fat based 378 on CT or MRI, and the majority of these overlapping SNPs were associated with metabolic and/or 379 inflammatory traits. Thus, SNPs that are associated with liver enzymes, metabolic risk factors, 380 and inflammatory biomarkers may be the most likely to be associated with liver steatosis and 381 should be prioritized for further follow-up. Furthermore, detailed genetic correlation analyses 382 showed significant enrichment of these SNPs for cardiometabolic traits, metabolic pathways, and 383 genomic annotations relevant for NAFLD. We found that most of our index NAFLD-associated 384 SNPs were associated with metabolic and/or inflammatory traits -the most common being lipid-385 related, followed by glycemic traits, hypertension, and cardiovascular disease, as well as 386 cholelithiasis (gallstones), cholecystitis, osteoarthritis, hypothyroidism, and thrombophlebitis. 387 Collectively, our findings offer a comprehensive and refined view of the genetic contribution to 388 NAFLD with potential clinical, pathogenic, and therapeutic relevance. Integration of these with 389 extant phenotypic association data sets allowed us to further characterize the functional 390 mechanisms through which our identified loci may mediate NAFLD risk. 391 Previous studies for liver enzyme levels, particularly serum ALT activity, have been 392 performed 10,11,16 . While there is overlap in the discoveries made by studies of natural variation 393 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021.
in circulating levels of this biomarker, our cohort and approach to phenotyping make our results 394 and interpretation unique. First, the diversity of our cohort provides both additional power and 395 potential for discovery, as the bulk of studies to date have been performed in predominantly 396 European-ancestry cohorts. Second, our approach ascertains individuals with chronic elevation 397 of this enzyme, consistent with genuine chronic liver disease. At the same time, we excluded 398 individuals with known causes of liver disease outside of NAFLD via ICD code definition, which 399 served to further enrich for metabolic disorders in our cohort. We further excluded control 400 individuals who maybe have intermittent ALT elevation, focusing on a healthier, 'super-control' 401 subset of the population. The result is that our approach should have higher specificity to 402 ascertained risk alleles that predispose to metabolic-induced fatty liver disease. In contrast, a 403 standard-ALT scan would be powered to discover the full spectrum of causes of liver disease (and 404 perhaps many more loci), many of which will not be specific to NAFLD and may be due to other 405 causes. As we have shown in validation studies using quantitative measures of hepatic fat as well 406 as ICD-code definitions of NAFLD, our results are highly directionally concordant, demonstrating 407 the relevance of our proxy phenotype to liver disease and physiology. Genetic correlation analysis 408 demonstrated strong correlation with cardiometabolic traits and disease, again consistent with 409 the relevance of our trait relative to simple enzyme measures. 410 There are several aspects of our study that are worth highlighting. We demonstrate the 411 strength of trans-ancestry GWAS for the discovery and interrogation of NAFLD susceptibility loci, 412 discoveries made possible by the diversity and sample size of the Million Veteran Program cohort, 413 of which 25% of participants are of non-European ancestry. Utilizing this data allows us to narrow 414 down putatively causal variants through trans-ancestry fine-mapping and construction of 415 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint credible sets likely to harbor the likely culprit variant(s). Construction of credible sets using trans-416 ancestry data has been shown to facilitate fine-mapping by producing smaller credible sets 417 compared to sets based on single ancestries 39 , an effect we also observed at our loci. Moreover, 418 we identified eight NAFLD-associated loci in AAs. In particular, the lead SNP at the ABCB4 locus 419 is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; participate. This study analyzed clinical data through July 2017 for participants who were enrolled 504 since January 2011. All MVP study participants provided blood samples for DNA extraction and 505 genotyping, completed surveys about their health, lifestyle, and military experiences. Consent to 506 participate and permission to re-contact was provided after veterans received information 507 materials by mail and met with research staff to address their questions. Study participation also 508 includes access to the participant's electronic health records for research purposes. Each Genotyping: DNA extracted from buffy coat was genotyped using a custom Affymetrix Axiom 519 biobank array. The MVP 1.0 genotyping array contains a total of 723,305 SNPs, enriched for 1) 520 low frequency variants in AA and HISP populations, and 2) variants associated with diseases 521 common to the VA population 28 . 522 Genotype quality-control: The MVP genomics working group applied standard quality control and 523 genotype calling algorithms to the data in three batches using the Affymetrix Power Tools Suite 524 (v1.18). Excluded were duplicate samples, samples with more heterozygosity than expected, and 525 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint samples with an over 2.5% missing genotype calls. We excluded related individuals (halfway 526 between second-and third-degree relatives or closer) with KING software 72 . Before imputation, 527 variants that were poorly called or that deviated from their expected allele frequency based on 528 reference data from the 1000 Genomes Project were excluded 73 . After prephasing using EAGLE 529 hepatitis and/or ascites, alcoholic fibrosis and sclerosis of liver, alcoholic cirrhosis of liver and/or 546 ascites, alcoholic hepatic failure and/or coma, and unspecified alcoholic liver disease). The 547 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint control group was defined by having a: normal ALT (≤30 U/L for men, ≤20 U/L for women) and 548 no apparent causes of liver disease or alcohol use disorder or related conditions 21 . Habitual 549 alcohol consumption was assessed with the age-adjusted Alcohol Use Disorders Identification 550 Test (AUDIT-C) score, a validated questionnaire annually administered by VA primary care 551 practitioners and used previously in MVP 75,76 . 552 553

Single-variant autosomal analyses. 554
We tested imputed SNPs that passed quality control (i.e. HWE > 1x10 -10 , INFO > 0.3, call rate > 555 0.975) for association with NAFLD through logistic regression assuming an additive model of 556 variants with MAF > 0.1% in European American (EA), and MAF > 1% in African Americans (AA), 557 Hispanics (HISP), and Asians (ASN) using PLINK2a software 77 . Covariates included age, gender, 558 age-adjusted AUDIT-C score, and 10 principal components of genetic ancestry. We aggregated 559 association summary statistics from the ancestry-specific analyses and performed a trans-560 ancestry meta-analysis. The association summary statistics for each analysis were meta-analyzed 561 in a fixed-effects model using METAL with inverse-variance weighting of log odds ratios 78 . 562 Variants were clumped using a range of 500kb and/or CEU r 2 LD > 0.05, and were considered 563 genome-wide significant if they passed the conventional p-value threshold of 5.0x10 -8 . 564 565 Secondary signal analysis. 566 GCTA 79 was used to conduct conditional analyses to detect ancestry-specific distinct association 567 signals at each of the lead SNPs utilizing the GWAS summary statistics in EA, AA, and HISP; these 568 ancestry-stratified MVP cohorts were used to model LD patterns between variants. The reference 569 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint panel of genotypes consisted of the variants with allele frequencies > 0.1% in EA, >1% in AA, and 570 >1% in HISP that passed quality control criteria in the MVP-specific NAFLD GWAS (INFO > 0.3, 571 HWE P > 1.0x10 -10 , call rate > 0.975). For each lead SNP, conditionally independent variants that 572 reached locus-wide significance (P < 1.0x10 -5 ) were considered secondary signals of distinct 573 association. If the minimum distance between any distinct signals from two separate loci was less 574 than 500kb, we performed an additional conditional analysis that included both regions and 575 reassessed the independence of each signal.  Attenuation was measured in Hounsfield units. The difference between the spleen and liver 591 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint attenuation was measured for PMBB; a ratio between liver attenuation/spleen attenuation was 592 used for MESA and Amish; and liver attenuation/phantom attenuation ratio in FHS as previously 593 described by Speliotes et al 12 . Abdominal MRI data from UK Biobank data were used to quantify 594 liver fat using a two-stage machine learning approach with deep convolutional neural networks 81 . identified, the SNP was set to missing for that respective study. The study-specific ancestry-605 stratified summary statistics were first standardized to generate standard scores or normal 606 deviates (z-scores), and then meta-analyzed using METAL in a fixed-effects model with inverse-607 variance weighting of regression coefficients 78 . In a first round of meta-analysis, ancestry-specific 608 summary statistics were generated, which then served as input for a subsequent round of meta-609 analysis that represents the trans-ancestry effects of our lead SNPs on quantitative hepatic fat. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint LD-score regression was used to estimate the heritability coefficient, and subsequently 613 population and sample prevalence estimates were applied to estimate heritability on the liability 614 scale 83 . A genome-wide genetic correlation analysis was performed to investigate possible co-615 regulation or a shared genetic basis between T2D and other complex traits and diseases. Pairwise 616 genetic correlation coefficients were estimated between the meta-analyzed NAFLD GWAS 617 summary output in EA and each of 774 precomputed and publicly available GWAS summary 618 statistics for complex traits and diseases by using LD score regression through LD Hub v1.9.3 619 (http://ldsc.broadinstitute.org). Statistical significance was set to a Bonferroni-corrected level of 620 P < 6.5 x 10 -5 . 621 622

Tissue-and epigenetic-specific enrichment of NAFLD heritability. 623
We analyzed cell type-specific annotations to identify enrichments of NAFLD heritability. First, a 624 baseline gene model was generated consisting of 53 functional categories, including UCSC gene 625 models, ENCODE functional annotations 84 , Roadmap epigenomic annotations 85 , and FANTOM5 626 enhancers 86 . Gene expression and chromatin data were also analyzed to identify disease-relevant 627 tissues, cell types, and tissue-specific epigenetic annotations. We used LDSC 31-33 to test for 628 enriched heritability in regions surrounding genes with the highest tissue-specific expression. 629 Sources of data that were analyzed included 53 human tissue or cell type RNA-seq data from 630 GTEx 27 ; human, mouse, or rat tissue or cell type array data from the Franke lab 87 ; 3 sets of mouse 631 brain cell type array data from Cahoy et al 88 ; 292 mouse immune cell type array data from 632 ImmGen 89 ; and 396 human epigenetic annotations from the Roadmap Epigenomics Consortium 633

. 634
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Pathway Annotation enrichment. 636
Enrichment analyses in DEPICT 90 were conducted using genome-wide significant (P < 5x10 -8 ) 637 NAFLD GWAS lead SNPs. DEPICT is based on predefined phenotypic gene sets from multiple 638 databases and Affymetrix HGU133a2.0 expression microarray data from >37k subjects to build 639 highly-expressed gene sets for Medical Subject Heading (MeSH) tissue and cell type annotations. 640 Output includes a P-value for enrichment and a yes/no indicator of whether the FDR q-value is 641 significant (P < 0.05). Tissue and gene-set enrichment features are considered. We tested for 642 epigenomic enrichment of genetic variants using GREGOR software 91 . We selected EA-specific 643 NAFLD lead variants with a p-value less than 5x10 −8 . We tested for enrichment of the resulting 644 GWAS lead variants or their LD proxies (r 2 threshold of 0.8 within 1 Mb of the GWAS lead, 1000 645 Genomes Phase I) in genomic features including ENCODE, Epigenome Roadmap, and manually 646 curated data (Supplemental Table 24 CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint used for LD calculation. Coding variants that were in strong LD (r 2 > 0.7) with lead SNPs and had 657 a strong statistical association (P-value < 1x10 -5 ) were considered the putative causal drivers of 658 the observed association at the respective locus. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; replicates of iPSCs, which in turn were generated from peripheral blood mononuclear cells using 679 a previously published protocol 36 . 680

681
Overlap with Promoter Capture-C data. 682 We used two promoter Capture-C datasets from two cell/tissue types to capture physical 683 interactions between gene promoters and their regulatory elements and genes; three biological 684 replicates of HepG2 liver carcinoma cells, and hepatocyte-like cells (HLC) 93 . The detailed protocol 685 to prepare HepG2 or HLC cells for the promoter Capture-C experiment is previously described 36 . 686 Briefly, for each dataset, 10 million cells were used for promoter Capture-C library generation.

Protein-Protein Interaction Network Analysis 698
We employed the search tool for retrieval of interacting genes (STRING) v11 97 (https://string-699 db.org) to seek potential interactions between nominated genes. STRING integrates both known 700 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint and predicted PPIs and can be applied to predict functional interactions of proteins. In our study, 701 the sources for interaction were restricted to the 'Homo Sapiens' species and limited to 702 experimentally validated and curated databases. An interaction score > 0.4 were applied to 703 construct the PPI networks, in which the nodes correspond to the proteins and the edges 704 represent the interactions (Figure 4, Supplemental Table 32).  (Table 1A) and non-coding variants (Table 1B) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint compiled traits. COLOC requires for each SNP data on allele frequency, sample size, beta-767 coefficients and variance or p values. For each association pair COLOC was run with default 768 parameters and priors. COLOC computed posterior probabilities for the following five 769 hypotheses: PP0, no association with trait 1 (NAFLD GWAS signal) or trait 2 (e.g., co-associated 770 metabolic signal); PP1, association with trait 1 only (i.e., no association with trait 2); PP2, 771 association with trait 2 only (i.e., no association with trait 1); PP3, association with trait 1 and 772 trait 2 by two independent signals; and PP4, association with trait 1 and trait 2 by shared variants. For the NAFLD PRS that was generated using the Stage 1 350K dataset, we performed a PheWAS 786 study in the Stage 2 108K replication dataset to fully leverage full catalog of available ICD-9/ICD-787 10 diagnosis codes. Of genotyped veterans, participants were included in the PheWAS analysis if 788 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint their respective electronic health record reflected two or more separate encounters in the VA 789 Healthcare System in MVP up to July 2017. Using this method, a total of 73,580 veterans of EA 790 ancestry were available for PheWAS analysis. ICD-9/ICD-10 diagnosis codes were collapsed to 791 clinical disease groups and corresponding controls using predefined groupings 99 . Phenotypes 792 were required to have a case count over 25 in order to be included in the PheWAS analysis, and 793 a multiple testing thresholds for statistical significance was set to P < 2.8x10 -5 (Bonferroni 794 method). The NAFLD PRS was used as a continuous exposure variable in a logistic regression 795 adjusting for age, sex, age-adjusted AUDIT-C, and 10 principal components in an additive effects 796 model using the PheWAS R package in R v3.2.065. The results from these analyses are reported 797 as odds ratios, in which the estimate is the average change in odds of the PheWAS trait per 798 NAFLD-increasing polygenic risk score. 799 800 Transcription Factor Analysis. 801 We identified nominated genes (Supplemental Table 34 is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021. The full summary level association data from the trans-ancestry, European, African American, 852 Hispanic, and Asian meta-analysis from this report will be available through dbGAP (Accession 853 codes will be available before publication). 854 855 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted January 2, 2021.   CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 2, 2021. ;   CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 2, 2021. ;  CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted January 2, 2021. ; https://doi.org/10.1101/2020.12.26.20248491 doi: medRxiv preprint