Sequencing of over 100,000 individuals identifies multiple genes and rare variants associated with Crohns disease susceptibility

Genome-wide association studies (GWAS) have identified hundreds of loci associated with Crohns disease (CD); however, as with all complex diseases, deriving pathogenic mechanisms from these non-coding GWAS discoveries has been challenging. To complement GWAS and better define actionable biological targets, we analysed sequence data from more than 30,000 CD cases and 80,000 population controls. We observe rare coding variants in established CD susceptibility genes as well as ten genes where coding variation directly implicates the gene in disease risk for the first time.

CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 21, 2021. ; mrivas@stanford.edu, john.david.rioux@umontreal.ca , dermot.mcgovern@cshs.org, hhuang@broadinstitute.org, carl.anderson@sanger.ac.uk, mjdaly@broadinstitute.org Abstract Genome-wide association studies (GWAS) have identified hundreds of loci associated with Crohns disease (CD), however, as with all complex diseases, deriving pathogenic mechanisms from these non-coding GWAS discoveries has been challenging. To complement GWAS and better define actionable biological targets, we analysed sequenced data from more than 30,000 CD cases and 80,000 population controls. We observe rare coding variants in established CD susceptibility genes as well as ten genes where coding variation directly implicates the gene in disease risk for the first time.

Results
To further advance the interpretation of GWAS loci -and to define novel CD associated genes using variation rarer than that routinely detected by GWAS -we pursued large-scale exome sequencing at multiple centers using CD case and control collections from more than 35 centers in the International IBD Genetics Consortium. The primary analysis consisted of exome sequencing of 18,828 CD cases across 35 IBD studies and 13,499 non-IBD control samples from the same studies. These samples were all sequenced at the Broad Institute and were supplemented with 22,536 population controls from approved non-IBD studies sequenced contemporaneously at the Broad Institute and accessed from dbGAP (Supplementary Table 1 Figure 1; Online Methods). Sensitivity to detect low-frequency coding variants was evaluated in each callset post-QC by comparison to passing sites in gnomAD v2.1 that had 0.0001 < non-Finnish European (NFE) minor allele frequency (MAF) < 0.1 (Online Methods). We observed that 84% of all exonic single nucleotide polymorphisms (SNPs) in this frequency range were detected in both CD datasets with sufficiently high quality to enter meta-analysis. Analysis of each dataset was conducted in SAIGE using a logistic mixed-model 14 , and meta-analysis was conducted with the standard inverse-variance weighted (IVW) method. Forty-three sites (Supplementary Table 2) failed a heterogeneity-of-effect test (IVW p HET < 0.0001) and were eliminated from further analysis. We did not observe an inflation in the exome-wide distribution of test statistics (Supplementary Figure 2 Association tests were carried out for 164,149 non-synonymous variants with minor allele frequency (gnomAD NFE) between 0.0001 -0.1, yielding a study-wide significance threshold of 3 x 10 -7 (Supplementary Table 3). The most significantly-associated variants (p < 10 -10 ) in this analysis were previously-known CD variants (or variants in linkage disequilibrium [LD] with them, Supplementary Table 3), indicating the QC and analysis pipeline removed highly associated false positives. Twenty-eight variants achieved study-wide significance (p < 3 x 10 -7 ), including known variants within CD genes established in prior GWAS and sequencing efforts: NOD2, IL23R, LRRK2, TYK2, SLC39A8, IRGM and CARD9. Encouraged by this, we then nominated a list of 116 variants (including known variants) with p < 0.0002 for further evaluation.
Additional exome and genome sequencing was undertaken at the Sanger Institute on an independent cohort of 9,731 CD cases ascertained by the UK IBD Genetics Consortium and IBD BioResource. Genome sequencing with a target depth of 15x was performed on 6,000 CD patients. Whole-genome sequences from 11,852 individuals from the INTERVAL blood donor cohort were used as population controls. Another 3,731 CD patients were exome sequenced using the Agilent SureSelect Human All Exon V5 capture. 33,704 individuals without IBD or other related diseases from the UK Biobank were used as controls for the Sanger WES cases. These UK Biobank samples were sequenced by Regeneron using the IDT xGen Exome Research Panel v1.0 (including supplemental probes), and thus QC and subsequent analyses were restricted to the intersection of the Agilent and the IDT capture regions. Exome and . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 21, 2021. genome datasets were processed in parallel with similar QC parameters (Online Methods). Association analyses were performed using a logistic mixed-effects model implemented in the REGENIE software, correcting for the case-control imbalance using the Firth correction. 28 of 116 variants were associated (p < 4.3 x 10 -4 (.05/116)) with CD in the meta-analysis of the two Sanger cohorts and 94 replicated the direction of effect seen in the discovery cohort (p = 3.34 x 10 -12 , binomial test). Summary statistics from a German dataset of 4,071 CD cases and 4,223 controls exome-sequenced at Regeneron (Online Methods) were also ascertained and an inverse-variance weighted meta-analysis carried out across all five cohorts ( Table 1). 45 of the 116 variants exceeded the study-wide significance threshold, p < 3 x 10 -7 (Supplementary Table 3).
To identify new loci not yet implicated in CD and independent exonic association signals at known loci, we accounted for the LD between these 45 variants and previously-reported IBD GWAS hits, as well as prior rare variant discoveries (Online Methods). We identified five coding variants in genes not previously implicated in IBD susceptibility as well as six independently associated novel exonic variants in genes previously known to harbor coding mutations underpinning CD or IBD risk, two of which are in NOD2 (Figure 1; Supplementary Table 4). Figure 1. Odds ratio and minor allele frequency for exome-wide significant findings that are not tagging stronger, established non-coding association signals. Known causal candidate: in a credible set from a fine-mapping study 5 with posterior inclusion probability > 5% or reported in previous studies 6,8 (Online Methods). New locus: in a locus not yet implicated by GWAS. New variant in known locus: in a known GWAS locus, but represents an association independent from previously-reported IBD putative causal variants (Online Methods).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 21, 2021.
Most of the newly implicated CD genes ( Table 2) play roles in biological processes already established in IBD pathogenesis, including: DOK2:P274L (downstream of tyrosine kinase 2: myeloid cell development and negative regulator of TLR2); TAGAP:E147K (Th17 differentiator and antifungal signaler); PTAFR:N114S (regulator of the NLRP3 inflammasome); CCR7:M7V (responsible for homing of T-cells and dendritic cells, lymphocyte egress, regulatory and memory T-cell function); IL10RA:P295L (a regulator of innate/adaptive immune response in which recessive-acting mutations are known to cause severe neonatal enterocolitis); RELA:D288N (Th17 regulator with mutations reported to cause chronic mucocutaneous ulceration [CMCU]); and HGFAC, PDLIM5 and SDF2L1. Further details on the newly-implicated genes are contained in Box 1.  Table 2. Variants achieving study-wide significance that implicate genes directly for the first time.
Four of these variants (in TAGAP 1,15 , SDF2L1 1 , RELA 1 and HGFAC 2 ) are in regions highlighted in prior GWAS but represent independent associations, directly implicating these genes (Online Methods). Pos: genomic position in hg38; Scan p: p-value from the exome-wide discovery including subjects exome sequenced at the Broad Institute; Meta p: p-value from the full meta-analysis of the five cohorts shown in Table 1.
The identified coding variants in RELA, TAGAP, and SDF2L1 are close to, but not in LD (r 2 < 0.05) with common non-coding variants significantly associated with IBD risk via GWAS (Online Methods). These very likely pinpoint the genes dysregulated by the associated common variant and provide a focus for uncovering the function of those variants, perhaps leading to allelic series of perturbations further informing on the mechanism of their contribution to CD pathogenesis. The association at HGFAC likely explains the prior common variant association in the region, but as this was not previously a high-resolution region and the missense variant not . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 21, 2021. included in the design of Immunochip, the variant was not previously flagged. The two novel NOD2 associations are not in LD with previously reported putative causal variants. One of them modestly reduces basal activity and has at least 2-fold reduction in peptidoglycan induced NF-κB response (Chamaillard et al. 2003). The other novel NOD2 variant is a splice donor variant (Supplementary Table 5).
We performed two gene-based rare-variant (MAF < 0.001) burden tests in the full-exome Nextera and TWIST data sets using SAIGE-GENE 16 , one restricted to loss-of-function variants and another using all non-synonymous variants (Supplementary Table 6). NOD2 unsurprisingly stood out far above the expected distribution (LoF p = 7.7 x 10 -7 ; NonSyn p < 10 -16 ). Only one other gene in either analysis exceeded the threshold expected once in the study by chance (ATG4C, NonSyn p = 3.3 x 10 -6 ). This potentially novel signal in ATG4C was driven by three distinct missense variants with individual p < 0.01 (N75S, R80H and C367Y) (Supplementary Table 7) along with two others with p < 0.05 (K371R, R381X). The ATG4C gene burden signal was examined in the Sanger data sets and replicated, with the meta-analysis reaching exome-wide significance (p = 1.5 x 10 -7 ) driven by several of the same variants. Further examination of results from the single variant tests in ATG4C identified a frameshift variant with frequency of 0.002 (1:62834058-TTG-T) -too high to be included in our burden test -that just missed our threshold for testing in the follow-up cohorts (p = 0.0003, Beta = 0.55 in the Broad meta-analysis). This variant also showed evidence of association in the meta-analysis of the Sanger cohorts (p = 1.3 × 10 -5 ), and also exceeded our study-wide significance threshold in the five-way meta-analysis of all cohorts (p = 1.55 x 10 -8 ). Of further note, an additional ATG4C frameshift variant specifically enriched in Finland (1:62819215:C:CT) is associated with IBD (p = 6.91 x 10 -8 , Beta = 1.20) in the publicly released FinnGen resource (r5.finngen.fi). All variants in burden and individual testing increase risk, and the inclusion of four truncating variants in these analyses suggests that loss-of-function variants in ATG4C strongly increase CD risk.

Discussion
Here, we demonstrate that large-scale exome sequencing can complement GWAS by pinpointing specific genes both indirectly implicated by GWAS as well as those not yet observed in GWAS. With high sensitivity to directly test individual variants down to 0.01% minor allele frequency, as well as assess burden of ultra-rare mutations, we begin to fill in the low-frequency and rare-variant component of the genetic architecture of Crohn's disease. This component was not observable by earlier generations of CD GWAS meta-analyses, which have had more limited coverage of low-frequency and rare variation.
Past findings in IBD 5 , and most other complex diseases, suggest that while coding variants are vastly outnumbered by non-coding variation, they are highly enriched for associations to common and rare disease. Furthermore, associated coding variants tend to have stronger effects than their non-coding counterparts, often keeping them lower in frequency via natural selection. While this alone validates the use of exome sequencing for efficiency's sake, the primary advantage of targeting coding regions for discovery is that coding variants uniquely pinpoint genes, and often pathogenetic mechanisms, in a fashion that is at present far more challenging to achieve routinely for non-coding associations. In the case of several of the new findings (e.g., RELA, TAGAP), the coding variation here provides concrete evidence of genes previously indirectly implicated by independent non-coding GWAS associations. These identify the likely gene underlying these associations and build allelic series of natural perturbations at these genes. Moreover, IL10RA and RELA are known to harbor mutations causing rare, Mendelian, inflammatory GI disorders and this study extends the phenotypic spectrum resulting from their genetic perturbation to more complex forms of CD. From a functional perspective, the novel genes identified in the current study reiterate the central role of innate and adaptive immune cells as well as autophagy in CD pathogenesis. Moreover, the involvement of PDLIM5 as well as the CCR7/CCL19/CCL21 pathway highlights the emerging role of mesenchymal cells in the development and maintenance of intestinal inflammation 17 .
We expect that, in the next year, expanded sequencing efforts underway in ulcerative colitis will come to completion, enabling a more comprehensive survey of low-frequency and rare variation in ulcerative colitis, and IBD in general. Integrated with a much larger GWAS spearheaded in parallel by the IIBDGC, we expect a substantial number of conclusively linked genes and informative allelic series to emerge.

Box 1. Genes newly implicated in Crohn's disease risk.
DOK2 (Docking Protein 2, Downstream of Tyrosine Kinase 2) is a cytoplasmic signaling protein highly expressed in macrophages and T-cells in the terminal ileum. Loss of Dok-2 in mice causes severe DSS-induced colitis with reduced IL-17A and IL-22 expression 18 , and DOK2 is known to be differentially methylated in colonic tissue of IBD patients 19 . DOK2 regulates both TLR2-induced inflammatory signaling and NK cell development, and DOK2 loss-of-function is associated with increased IFN-ɣ production 20,21 . The P274L variant has previously been implicated in atopic eczema where the rare allele was significantly protective for atopic eczema likely by disturbing the RasGAP activation of DOK2 and transcriptomic analyses also suggest that DOK2 is a central hub interacting with CD200R1, IL6R, and STAT3 22 .
TAGAP (T-Cell Activation RhoGTPase Activating Protein) has a pivotal role in TH17 development and modulates the risk of autoimmunity through influencing thymocyte migration in thymic selection 23,24 . TAGAP expression is upregulated in rectal tissue in IBD patients, and TAGAP is required for Dectin-induced anti-fungal signaling and proinflammatory cytokine production in myeloid cells 25,26 .
PTAFR (Platelet Activating Factor Receptor), a hypoxia response gene, has an affinity for bacterial phosphorylcholine (ChoP) moieties 27 and influences development of cigarette-induced Chronic Obstructive Pulmonary Disease (COPD) by inducing neutrophil autophagic death in mice 28 . PTAFR regulates colitis-induced pulmonary inflammation through the NLRP3 inflammasome 29 .

PDLIM5 (PDZ And LIM Domain 5) is a kidney anion exchanger and scaffolding protein.
Genetic variation in this gene is associated with prostate cancer, schizophrenia, diverticular disease, diverticulosis, colorectal cancer, testicular cancer, and self-reported angina 30 . PDLIM5 has been reported to be a STAT3 interaction partner involved in actin binding 31 , with STAT3 previously being identified as an IBD gene 32 . PDLIM5 is highly . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 21, 2021. expressed in myofibroblast cells, which are important mesenchymal cells of the intestinal lamina propria 17 . SDF2L1 (Stromal Cell Derived Factor 2 Like 1) has been recently identified in the ER stress response in primary intestinal epithelial cells 33 . SDF2L1 is an ER resident protein that functions within a large protein complex regulating the BiP and Erdj3 chaperone cycle to promote protein folding and secretion [34][35][36] . Structurally, Sdf2l1 contains an N-terminal signal peptide for entry into the ER lumen and a C-terminal ER retention signal flanking 3 MIR domains that promote complex assembly. The CD risk variant R161H is located in the third MIR domain. In murine models, deletion of SDF2L1 in the liver resulted in prolonged ER stress and insulin resistance 37 . In the intestine, single cell transcriptional profiling revealed that SDF2L1 is predominantly expressed in highly secretory cell lineages, including mucin-secreting goblet cells and immunoglobulin-secreting plasma cells 38 . Moreover, SDF2L1 expression is dynamically regulated and specifically induced during the acute phase of the unfolded protein response (UPR) 33 . Together, these observations suggest a critical role for SDF2L1 in maintaining ER homeostasis and secretory capacity, which may promote barrier function at the level of mucus integrity and/or neutralization of immunoglobulins and antimicrobial peptides that collectively limit interactions between luminal microbes and the host immune system. CCR7 (chemokine receptor 7) and its ligands CCL19/CCL21 promote homing of T-cells and dendritic cells to T-cell areas of lymphoid tissues where T-cell priming occurs. CCR7 also contributes to adaptive immune functions including thymocyte development, secondary lymphoid organogenesis, high affinity antibody responses, regulatory and memory T-cell function, and lymphocyte egress from tissues. CCR7 expression is upregulated in an inflamed gut in CD 39 , and CCR7 regulates the intestinal TH1/TH17/Treg balance during Crohn's-like murine ileitis 40 . Genetic variation in CCR7 is associated with atopy, asthma, COPD, and IBD in African-Americans 30 . CCL19 and CCL21 are highly expressed in a population of stromal cells (designated as S4) that are expanded in IBD inflamed tissues and that continually produce proinflammatory factors preventing the resolution phase of a wound-healing response 17 .
IL10RA (Interleukin 10 receptor A) is a potent regulator of innate and adaptive immune responses, and IL10RA genetic variants are associated with Very-Early Onset IBD (VEOIBD) cases; a subset of VEOIBD refractory patients respond to hematopoietic stem cell transplantation 41 . IL10R1 knockout mice are susceptible to chemical-induced colitis 42 .
RELA (Nuclear Factor NF-Kappa-B P65 Subunit). NFkB is a ubiquitous transcription factor, and its most abundant form is NFKB1 complexed together with RELA. RELA regulates the Th17 pathway in autoimmune disease models 43 , and the FOXO3-NF-κB RelA protein complexes reduce proinflammatory cell signaling and function 44 . RELA haploinsufficiency causes autosomal dominant chronic mucotaneous ulceration 45 , and RELA is a master transcriptional regulator of epithelial-mesenchymal transition in epithelial cells 46 . Genetic variation in RELA has been associated with SLE, type 2 diabetes, psoriasis, obesity, asthma, and atopic dermatitis 30 .
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
HGFAC -Hepatocyte Growth Factor Activator is a serine endopeptidase that converts Hepatocyte Growth Factor (HGF) to its active form in response to thrombin and kallikrein endopeptidases. HGF contributes to neutrophil recruitment. HGF expression is increased in active UC with animal models, suggesting that HGF-MET signaling exacerbates intestinal inflammation 50,51 . Furthermore, HGF promotes colonic epithelial regeneration and mucosal repair 52,53 . HGFAC variation is also associated with tuberculosis susceptibility 54 .

Supplementary Material / Methods
Broad Institute sequencing pipeline Sample processing. Exome sequencing was performed at the Broad Institute. The sequencing process includes sample prep (Illumina Nextera, IIlumina TruSeq, and Kapa Hyperprep), hybrid capture (Illumina Rapid Capture Enrichment (Nextera) -37Mb target, and TWIST Custom Capture -37Mb target), and sequencing (Illumina HiSeq2000, Illumina HiSeq2500, Illumina HiSeq4000, Illumina HiSeqX, Illumina NovaSeq6000 -76bp and 150bp paired reads). Sequencing was performed at a median depth of 85% targeted bases at >20X. Sequencing reads were mapped by BWA-MEM to the hg38 reference using a 'functional equivalence' pipeline. The mapped reads were then marked for duplicates, and base quality scores were recalibrated. They were then converted to CRAMs using Picard and GATK. The CRAMs were then further compressed using ref-blocking to generate gVCFs. These CRAMs and gVCFs were then used as inputs for joint calling. To perform joint calling, the single-sample gVCFs were hierarchically merged (separately for samples using Nextera and Twist exome capture). Figure 1). We first split multiallelic sites and code genotypes with genotype quality (GQ) lower than 20 as missing. Variants not annotated as frameshift, inframe deletion, inframe insertion, stop lost, stop gained, start lost, splice acceptor, splice donor, splice region, missense, or synonymous were removed from the following analysis. We also removed variants that have known quality issues (have a non-empty QUAL column) in the gnomAD dataset. Sample QC: poor-quality samples that met the following criteria were identified and removed: 1) samples with an extremely large number of singletons (≥ 500); 2) samples with mean GQ < 40; and 3) samples with missingness rates > 10%. Variant QC: low-quality variants that met the following criteria were identified and removed: 1) variants with missingness rate > 5%; 2) variants with . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 21, 2021. mean read depth (DP) < 10; 3) variants that failed the Hardy-Weinberg Equilibrium (HWE) test for controls with p < 1 x 10 -4 ; and 4) variants with > 10% samples that were heterozygous and with an allelic balance ratio < 0.3 or > 0.7. Variants with different genotypes in WES and WGS in gnomAD were also removed. For Twist exome capture samples, we additionally removed 1) samples that had a significantly high or low inbreeding coefficient (> 0.2 or < -0.2); 2) samples that had a high heterozygosity away from mean (± 5 standard deviations); and 3) related samples, which were removed sequentially by removing the individual with the largest number of related samples (in PLINK, the individual with PI_HAT > 0. We kept samples that were classified as European with prediction probability greater than 80%. For Nextera samples, we used a second random forest classifier to assign EUR samples to AJ, LIT, FIN, or NFE, and a third random forest classifier to clean the AJ/NFE split.

Meta-analysis.
We used METAL 55 with the inverse variance weighted (IVW) fixed-effect model to meta-analyse the SAIGE association statistics from Nextera and Twist samples ( Table 1). The heterogeneity test was performed using Cochran's Q with one degree of freedom.

Sanger Institute sequencing pipeline
Sample processing. Genome sequencing was performed at the Sanger Institute using the Illumina HiSeq X platform with a combination of PCR (n=4751, controls only) and PCR-free library preparation protocols. Sequencing was performed at a median depth of 18.6X. Exome sequencing of cases was performed at the Sanger Institute using the Illumina NovaSeq 6000 and the Agilent SureSelect Human All Exon V5 capture set. Controls from the UK Biobank were sequenced separately as a part of the UKBB WES50K release using Illumina NovaSeq and the IDT xGen Exome Research Panel v1.0 capture set (including supplemental probes). 33,704 UKBB participants were selected for use as controls, excluding participants with recorded or self-reported CD, UC, unspecified noninfective gastroenteritis or colitis, any other immune-mediated disorders, or a history of being prescribed any drugs used to treat IBD. Exome and genome datasets were analysed separately but followed a similar analysis protocol.
Reads were mapped to hg38 reference using BWA-MEM. Variant calls were performed using a GATK 4 Best Practices-like pipeline; per-sample intermediate variant calling was followed by joint genotyping across the individual genome and exome cohorts. For the exome cohort, variant calling was limited to Agilent extended target regions. Per-region VCF shards were imported into the Hail software and combined. Multi-allelic sites were split. For the exome cohort, we subsetted the calls to the intersection of Agilent and IDT exome captures, further . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 21, 2021. ; https://doi.org/10.1101/2021.06.15.21258641 doi: medRxiv preprint excluding regions recommended for exclusion by the UKBB due to an error in read mapping that results in no variant calls made.
Population assignment. We selected a set of ~14,000 well-genotyped common variants to identify the genetic ancestry of individual participants through the projection of 1000 Genomes Project cohort-derived principal components. For genomes, due to primarily European genetic ancestry of the controls, we excluded samples outside of four median absolute deviations from the median point of the European ancestry cluster of 1000G. For exomes, we implemented a Random Forest technique that classified samples based on principal components into broad genetic ancestry groups (EUR, AFR, SAS, EAS, admixed), with self-reported ancestry as training labels. For these analyses, we only retained the EUR samples, as the number of cases for other groups was too small for robust association analysis.
Quality control. A combination of hard-cutoff filters and per-ancestry/per-batch outlier filters were used to identify low-quality samples. We applied hard-filters for sample depth (> 12x genomes, > 15x exomes), call rate (> 0.95), chimerism < 0.5 (WGS) and FREEMIX < 0.02 (WGS). We excluded genotype calls with an allelic imbalance (for hets, (ab < 0.20) | (ab > 0.80)), low depth (< 2x), and low GQ (< 20). We then performed per-ancestry and per-sequencing protocol (AGILENT vs IDT for WES, PCR vs PCR-free for WES) filtering of samples falling outside 4 MAD from the median per-batch heterozygosity rate, Ti/Tv rate, number of called SNPs and INTELs, and insertion and deletion counts/ratio. An ancestry-aware relatedness calculation (pc-relate method in Hail 56 ) was used to identify related samples. As our association approach (logistic mixed-models) can control for residual relatedness, we only excluded duplicates or MZ twins from within the cohorts and excluded second and third degree relatives when the kinship was across the cohorts (e.g., parent in WGS, child in WES; kinship metric > 0.1 calculated via PC-Relate method using 10 principal components). In addition, we removed samples that were also present in the Broad Institute's cohorts.
Association testing. Association analysis was performed using a logistic mixed-model implemented in the REGENIE software. A set of high-confidence variants (> 1% MAF, 99% call rate, and in Hardy-Weinberg Equilibrium) was used for t-fitting. To control for case-control imbalance, Firth correction was applied to p-values < 0.05. To control for residual ancestry and sequencing heterogeneity, we calculated 10 principal components on a set of well-genotyped common SNPs, excluding regions with known long-range LD. These were used as covariates for association analyses. Only variants with call-rate above 90% after filtering poor calls were included in the association analysis. For WES, we verified that the > 90% call-rate condition holds true in both AGILENT and IDT samples. Association analysis was performed on QC-passing calls.

Kiel/Regeneron sequencing pipeline
Sample Preparation and Sequencing. The DNA samples were normalized and 100ng of genomic DNA was prepared for exome capture with custom reagents from New England Biolabs, Roche/Kapa, and IDT using a fully-automated approach developed at the Regeneron Genetics Center. Unique, asymmetric 10 base pair barcodes were added to each side of the DNA fragment during library preparation to facilitate multiplexed exome capture and . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 21, 2021. ; sequencing. Equal amounts of sample were pooled prior to exome capture with a slightly modified version of IDT's xGen v1 probes; supplemental probes were added to capture regions of the genome well-covered by a previous capture reagent (NimbleGen VCRome) but poorly covered by the standard xGen probes. Captured fragments were bound to streptavidin-conjugated beads and non-specific DNA fragments were removed by a series of stringent washes according to the manufacturer's recommended protocol (IDT). The captured DNA was PCR amplified and quantified by qRT-PCR (Kapa Biosystems). The multiplexed samples were pooled and then sequenced using 75 bp paired-end reads with two index reads on the Illumina NovaSeq 6000 platform using S2 flow cells.
Variant calling and quality control. Sample read mapping and variant calling, aggregation and quality control were performed via the SPB protocol described in Van Hout et al. 57 . Briefly, for each sample, NovaSeq WES reads are mapped with BWA MEM to the hg38 reference genome. Small variants are identified with WeCall and reported as per-sample gVCFs. These gVCFs are aggregated with GLnexus into a joint-genotyped, multi-sample VCF (pVCF). SNV genotypes with read depth (DP) less than seven and indel genotypes with read depth less than ten are changed to no-call genotypes. After the application of the DP genotype filter, a variant-level allele balance filter is applied, retaining only variants that meet either of the following criteria: (i) at least one homozygous variant carrier or (ii) at least one heterozygous variant carrier with an allele balance (AB) greater than the cutoff.

Analysis.
We combined the gvcf files with bcftools 1.11 using the "merge" command, then imported the joint vcf into Hail. We then split the multiallelic variants and removed variants with "<NON_REF>" alternative alleles. We applied the QC steps and assigned populations as in the Broad Institute sequencing pipeline.
Cross-cohort meta-analysis. We used the Cochran-Mantel-Haenszel (CMH) test to combine association summary statistics between the Broad Institute, Sanger Institute and Kiel/Regeneron cohorts.
Relation to known IBD causal variants. We assigned study-wide significant variants to one of four categories (Supplementary Table 4): 1) Known causal candidates: variants in a fine-mapping credible set 5 with posterior inclusion probability (PIP) > 5%, or reported in the earlier sequencing studies 6,8 . 2) New locus: variants implicating a genetic locus for the first time.
3) Tagging variants: tagging variants with the best PIP in fine-mapping credible sets using conditional analysis (see "Conditional analysis" below). 4) New variants in known locus: variants in known GWAS loci but either have MAF < 0.5% (and thus, no LD to evaluate tagging) or remain study-wide significant after conditional analysis using the LD from gnomAD.

Conditional analysis.
For study-wide significant variants not in a previously reported credible set 5 , we performed a conditional analysis to test whether they are independent from or tagging the known causal variants 5 . We first classified variants as "tagging" if they had r 2 > 0.8 with any variants in the reported credible sets 5 . For other variants, we performed a conditional analysis using 1) the p-value estimates from previous fine-mapping studies for credible set variants and 2) the LD calculated from gnomAD. We were unable to directly fit a multivariate model or use the LD from study subjects, because exome sequencing does not cover the non-coding putative causal variants, and the ImmunoChip does not have good quality for rare coding variants. The . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted June 21, 2021. ; conditional statistic, , for a variant with marginal statistic of from our study, was ' calculated as follows: in which is the statistic of the variant with the best PIP in the credible set i from the fine-mapping study, is the LD between the two variants, and and are the effective sample sizes for our study and the fine-mapping study respectively. We used the absolute value in this equation because of the challenges to align the alleles across sequencing, the fine-mapping study, and the gnomAD reference panel. Taking the absolute value is a conservative approximation (less likely to declare a variant as novel association) because it assumes that the putative causal variants from fine-mapping have the same direction of effect as the variant being tested when they are in LD. This is very likely to be correct. Exceptions and notes: HGFAC: despite this locus having been reported in an earlier GWAS 2 , the coding variant we identify was missed and is reported in this study for the first time as directly implicating this gene (r 2 = 0.35 with the previously reported GWAS SNP, rs2073505). We thus assign this variant as "New variant in known locus". RELA: similarly to HGFAC, this locus has been reported in an earlier GWAS 2 , but the coding variant we identify was missed and is reported in this study for the first time as directly implicating this gene (r 2 = 0.002 with the previously reported GWAS SNP, rs568617). We thus assign this variant as "New variant in known locus". SDF2L1: this variant has marginal p-value = 2 x 10 -7 and conditional p = 3.4 x 10 -4 . The r 2 between this variant and the non-coding variant with the best PIP from fine-mapping is 0.045. We manually assigned this variant to "New variants in known locus," as this is a missense variant. SLC39A8: the SLC39A8 A391T variant was not reported in the fine-mapping paper, as its genetic region was not included in the ImmunoChip design. Because this variant has been published in several papers as an IBD variant with genetic and functional evidence 58-60 , we assign this variant as "Known causal candidate". TYK2: the TYK2 A928V was not reported in the fine-mapping paper 5 , likely due to a lack of power. Because this variant has been known to be a causal variant for several autoimmune disorders 61 and in another IBD study 62 , we assign this variant as "Known causal candidate". NOD2: a) Previous studies [5][6][7] have shown evidence that the NOD2 S431L variant tags the NOD2 V793M variant, with the latter more likely to be the CD causal variant. In this study, however, S431L reached study-wide significance, but V793M failed to meet the significance cutoff. We therefore retained S431L in Figure 1 for the purpose of keeping this association signal. b) Due to the complexity of the . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 21, 2021.
NOD2 locus, we conducted a haplotype analysis using the Twist subjects and additionally classified signed variants that share the same haplotype with known IBD variants as "tagging". We found that for the NOD2 S47L variant, 18 out of 19 copies of the T allele are on the same haplotype as the fs1007insC variant. We therefore classify S47L as "tagging". c) The NOD2 A755V variant is in LD with rs184788345, the best PIP variant from fine-mapping (r 2 = 0.85). The marginal p-value for A755V is one order of magnitude less significant than rs184788345. Considering A755V is a missense variant while none of the variants in the credible set defined by rs184788345 are coding, we assign A755V as a likely ROse"Known causal candidate".
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 21, 2021. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 21, 2021. ; https://doi.org/10.1101/2021.06.15.21258641 doi: medRxiv preprint