Introduction

Structural variants (SVs) and short tandem repeats (STRs) are important categories of genetic variation that account for the majority of base pair differences between individual genomes and are enriched for associations with gene expression1,2,3. SVs and STRs are comprised of several diverse classes of variants (e.g., deletions, insertions, multi-allelic copy number variants (mCNVs), and mobile element insertions (MEIs)), and multiple algorithmic approaches and deep whole genome sequencing are required to accurately identify and genotype variants in these different classes4. Due to the complexity of calling SVs and STRs, previous genetic association studies have generally not identified a comprehensive set of these variants but rather have focused on one or a few of the class types, and therefore the genomic properties of SVs and STRs associated with gene expression and/or complex traits are not well characterized.

SV classes and STRs vary in genomic properties including size, distribution across the genome, and impact on nucleotide sequences, but previous studies have not investigated whether these differences influence the likelihood of being an expression quantitative trait locus (eQTL), eQTL effect sizes, or the properties of eQTL genes (eGenes) such as gene type or level of evolutionary constraint3,5,6,7. Further, it is unknown if the variant classes may affect gene expression through different mechanisms such as altering gene copy number or three-dimensional spatial features of the genome. A comprehensive SV and STR data set generated using high-depth whole genome sequencing (WGS) from a population sample with corresponding RNA-sequencing data could be used to assess whether genomic features of SV classes and STRs are associated with properties of eGenes and eQTLs.

SVs and STRs have also been associated with complex traits, though they have been studied considerably less often in genome-wide association studies (GWAS) than single nucleotide variants (SNVs), and the overall contribution of SVs and STRs to complex traits is not well understood8,9,10,11,12,13,14,15. One difficulty with studying differences between SV classes and STRs in GWAS is that it is unknown whether the SV classes are differentially tagged by SNVs on genotyping arrays. A collection of hundreds of subjects genotyped for a full range of SVs, STRs, SNVs, and insertion/deletions (indels) could be used to assess the functional impact of SVs and STRs on complex traits using existing SNV-based GWAS and identify dark regions of the genome not captured by array GWAS.

In this study, as part of the i2QTL Consortium, we use RNA-sequencing data from induced pluripotent stem cells (iPSCs) from the iPSCORE and HipSci collections7,16,17 along with a comprehensive call set of SVs and STRs from deep WGS data4 to identify variants associated with iPSC gene expression and characterize the genomic properties of these SV and STR eQTLs. We observe that SVs are more likely to act as eQTLs than SNVs when in distal regions (> 100 kb from eGenes) and that duplications and mCNVs are more likely to have distal eQTLs and multiple eGenes compared to other SVs classes and STRs. eGenes for mCNV eQTLs are also less likely to be protein coding and more likely to have strong effect sizes relative to other SV classes and STRs. We examine the LD of SVs and STRs with GWAS variants and find that mCNVs and duplications are poorly tagged by GWAS SNVs compared to other variant classes. 11.4% of common SVs and STRs are in strong LD with a SNV associated with at least one of 701 unique GWAS traits; and deletion, rMEI, ALU, and STR lead eQTL variants are enriched for GWAS associations establishing that these variant classes have underappreciated roles in common traits. Finally, we find a highly impactful set of SVs and STRs located near high complexity loop anchors that localize near multiple genes in three dimensional space and are enriched for being associated with the expression of multiple genes and GWAS traits. This work establishes that different classes of SVs and STRs vary in their functional properties and provides a valuable, comprehensive eQTL data set for iPSCs.

Results

eQTL mapping

We performed a cis-eQTL analysis using RNA sequencing data from iPSCs derived from 398 donors in the iPSCORE and HipSci projects along with a comprehensive map of genetic variation (37,296 SVs, 588,189 STRs, and ~48 M SNVs and indels (Supplementary Data 1)) generated using deep WGS from these same donors4. These variants include several classes of SVs including biallelic duplications and deletions; multi-allelic copy number variants (mCNVs); mobile element insertions (MEIs) including LINE1, ALU, and SVA; reference mobile element insertions (rMEI); inversions; and unspecified break-ends (BNDs). We identified 16,018 robustly expressed autosomal genes and tested for cis associations between the genotypes of all common (MAF ≥ 0.05) SVs (9,313), STRs (33,608), indels (~1.52 M), and SNVs (~5.83 M) within 1 megabase of a gene body using a linear mixed model approach (Fig. 1a and Supplementary Data 2, “Methods” section). We detected associations between 11,197 eGenes (FDR < 5%, Methods) and 10,904 unique lead variants (lead eVariants), including 145 SVs (1.3%), 140 STRs (1.3%), 2648 indels (24.3%), and 7971 SNVs (73.1%, Fig. 1b and Table 1). We compared our eQTLs to those discovered by GTEx and 1000 Genomes and found that the number of eGenes we identified is consistent with the expected power from using 398 samples (Supplementary Figs. 13)1,3. While SVs and STRs accounted for only 0.1 and 0.38% of tested variants respectively in our analysis, they were highly enriched to be lead eVariants (SVs: OR = 17.9, p = 3.3e-91; STRs: OR = 4.14, p = 1.5e-24; Fisher’s exact test (FET)) and collectively formed lead associations with 3.25% of eGenes (1.73% SVs and 1.52% STRs), indicating that these variant classes have a disproportionate effect on gene expression compared to SNVs and indels.

Fig. 1: eQTL mapping.
figure 1

a Overview of eQTL study design. We performed two eQTL analyses: a joint analysis that used all variants and identified 11,197 eGenes and an SV/STR-only analysis that only used SVs and STRs and identified 6,996 eGenes. b,c Pie charts showing the number of lead variants across the different variant classes for (b) joint and (c) SV/STR-only eQTL analyses. d Venn diagram showing the intersection between the eGenes detected in the joint and the SV/STR-only analysis.

Table 1 Summary of i2QTL variants and eQTL results.

To conduct comparative analyses of the functional properties of the different SV classes and STRs, we performed an SV/STR-only eQTL analysis using the 42,921 common SVs and STRs and excluding SNVs and small indels (Fig. 1a and Supplementary Data 3). We identified 6,966 eGenes (FDR < 5%) associated with 5,343 unique lead eVariants (Table 1 and Fig. 1c). SVs were enriched among lead variants compared to STRs (OR = 1.15, p = 1.7e-5, FET) though the majority of lead eVariants were STRs (4087 eSTRs vs 1231 eSVs). Of the 11,197 eGenes identified in the joint eQTL analysis, 6,507 were also identified in the SV/STR-only eQTL analysis (Fig. 1d). Among these 6,507 shared eGenes, 94.6% (6,155) were mapped to a lead SNV or indel variant in the joint analysis, while the remaining 5.4% (352) were mapped to the same lead SV or STR identified in the SV/STR eQTL analysis. To evaluate how many of the 6,155 shared eGenes were likely driven by the same causal variant, we computed the linkage disequilibrium (LD) between SNV/indel lead variants in the joint eQTL analysis and eSVs and eSTRs from the SV/STR-only eQTL analysis. We found that lead SNVs or indels from the joint analysis were in strong LD (R2 > 0.8) with the lead eSV or eSTR from the SV/STR-only analysis for 14.2% (872/6,155) of shared eGenes. While the true causal variant at these loci is unknown, these data suggest that a substantial number of eQTLs that can be identified using SNVs may be explained by SVs or STRs.

Variant size influences eQTL associations

Given that SVs and STRs have size ranges that span orders of magnitude4, we sought to examine the relationship between variant length and the likelihood of being an eVariant across the different variant classes. We tested whether STRs or deletions, duplications, and mCNVs longer than a particular length threshold were more likely to be eVariants compared to variants shorter than the length threshold. We found that longer deletions, duplications and STRs were more likely to be eVariants and lead eVariants than shorter variants (Fig. 2a and Supplementary Fig. 4). The trend was especially strong for deletions where 36% of variants longer than 50 kb were lead eVariants (OR = 3.11, p = 0.0095, FET). Although a higher proportion of mCNVs were eVariants compared to other classes (Fig. 2a), mCNV length was not strongly associated with eQTL status; only mCNVs longer than 10 kb were significantly more likely to be lead eVariants (OR = 1.59, p = 0.01, FET).

Fig. 2: Variant length influences the likelihood and effect size of eQTLs.
figure 2

a Percentage of variants with length greater than the thresholds on the x axis that were eVariants (grey lines) or lead eVariants (green lines). Points are colored according to the enrichment (log 2 odds ratio) of variants above each threshold among eVariants or lead eVariants relative to variants smaller than the threshold; points circled in red were significant (FET two-sided, p < 0.05). A complete list of p values and odds ratios is provided in Supplementary Fig. 4. (b,c) Association of variant length with eQTL effect size for (b) non-exonic eQTLs or (c) exonic eQTLs mapped to biallelic deletions (n = 1,769 non-exonic, n = 48 exonic), duplications (n = 235 non-exonic, n = 13 exonic), multi-allelic CNVs (n = 1,278 non-exonic, n = 111 exonic) and STRs (n = 13,085 non-exonic, n = 148 exonic). Number of eQTLs for each variant class at defined length is shown in top panels. Points in bottom panels represent the centers of bins with equal numbers of observations and error bars indicate 95% confidence intervals around the mean (1000 bootstraps). Lines represent linear regressions, with 95% confidence intervals shaded, as calculated on unbinned data. p values at the right of each plot indicate the significance of the association between length and absolute effect size (linear regression, t-test) in a model that includes non-mode allele frequency and distance to TSS as covariates. p values presented are not adjusted for multiple testing.

We next sought to examine whether eVariant length for SVs and STRs was predictive of absolute eQTL effect size and if lead eQTLs that overlap (exonic) or do not overlap (non-exonic) exons of the eGene displayed similar effects (Fig. 2b, c). We found that lead eVariant length was significantly associated with the absolute effect size for non-exonic deletion, duplication, mCNV, and STR eQTLs independent of variant distance to the transcription start site and allele frequency (Fig. 2b). However, among exonic eQTLs, only those mapping to deletions had a significant correlation between length and effect size with longer deletions having larger effect sizes (Fig. 2c). These data show that longer variants are more likely to be eVariants for both SVs (excluding mCNVs) and STRs and that among eVariants that do not overlap exons, longer variants tend to have stronger effects on expression.

mCNVs and deletions are enriched for associations with multiple eGenes

We next investigated whether SVs from particular classes were more likely to be eVariants or associated with multiple eGenes compared to STRs which comprised 78% of all tested variants and 70% of eQTLs (Fig. 1c). We found that both mCNVs and deletions were more likely to be eVariants for at least one gene relative to STRs (mCNVs: OR = 2.81, q = 1.57e-50, FET; deletions: OR = 1.35, q = 6.47e-12, FET) and were also more likely to be lead eVariants compared to STRs (mCNVs: OR = 1.70, q = 9.74e-9, FET; deletions: OR = 1.32, q = 4.56e-7, FET) (Fig. 3a and Supplementary Fig. 5). Conversely, BNDs were less likely to be eVariants (OR = 0.75, q = 2.18e-4, FET) or lead eVariants (OR = 0.46, q = 3.63e-11, FET) compared to STRs (Fig. 3a). We next examined how often eVariants from each variant class were associated with multiple eGenes and found that, while many of the SV classes were more likely to be associated with multiple eGenes compared to STRs (Fig. 3b), mCNV eVariants and deletion eVariants were associated with two or more genes 61.7% (271/439) and 44.4% (401/902) of the time respectively. Moreover, 31.2% (137/439) of mCNV eVariants and 13.4% (121/902) of deletion eVariants were associated with at least 4 genes compared to only 7.2% (571/7,915) of STR eVariants (Fig. 3b). These results show that mCNV and deletion eVariants are more frequently associated with the expression of multiple genes compared to STRs and other SVs.

Fig. 3: Properties of SV and STR eQTLs.
figure 3

a Percentage of tested variants from each class that are eVariants (eV), left) or lead eVariants (right) in the SV/STR-only eQTL. Asterisks indicate significant enrichment or depletion of variants among eVariants relative to STRs (FET two-sided, BH alpha < 0.05). b Left panel is a balloon plot where color indicates number of eVariants and size indicates fraction of eVariants in each bin. Right panel shows average number of eGenes per eVariant with 95% confidence intervals. Red points indicate significantly higher numbers of eGenes per eVariant (Mann–Whitney U one-sided, Bonferonni p < 0.05) compared to STRs. c Distribution of the distance of eQTL (left) and lead eQTL (right) variants to the boundary of their eGenes. Percentages indicate the proportion of eQTLs that were at least 250 kb distal to eGene, red asterisks indicate that eQTLs tended to be localized farther from eGenes as compared to STRs (Mann–Whitney U Test one-sided, Bonferonni p < 0.05). Distributions for lead SNV and indel eQTLs were not examined for significant differences relative to STRs. d Fraction of lead eQTLs that were intergenic or overlapped exons, promoters, or introns of their associated eGene or other genes. e Enrichment of lead eQTLs for each class that overlap each genic element compared to variants from all other classes (FET two-sided, BH alpha < 0.05). f Distribution of effect sizes for lead eQTLs that overlapped or did not overlap an exon of their eGene. (G,H) Distribution of absolute effect sizes for SV classes, STRs, and SNV/indels for exonic (g) and non-exonic (h) eQTLs. Vertical dashed lines indicate means. P values are derived from comparing the effect size distributions for mCNVs to the distributions for STRs, SVs, SNVs, and indels (Mann–Whitney U test one-sided, Bonferonni p < 0.05). All p values and odds ratios for (a) and (e) are in Supplementary Figs. S5 and S6. eQTL statistics for SNV and indels are from the joint analysis for all panels.

Genomic localization of SV and STR eQTLs

We examined how eVariants for each variant class were distributed with respect to genes and promoters by evaluating the distance of eVariants to their eGenes (5′-UTR or TSS) and their overlap with genic elements. We found that, for all SV classes and STRs, most eQTLs were located near eGenes (<250 kb, Fig. 3c); however, a significantly larger proportion of eQTLs that were mCNVs or duplications were located far from their eGenes (>250 kb) compared to STRs (mCNVs: 40.5%, OR = 2.26, q = 1.76e-41, duplications: 37.1%, OR = 1.96, q = 3.46e-5; FET) suggesting increased distal regulatory activity for these variant types. SNV and indel eQTLs were generally closer to eGenes than all clases of SV and STR eQTLs (Fig. 3c). We next annotated each variant-gene pair tested for whether the variant overlapped an exon, promoter, or intron for the paired gene; overlapped an exon, promoter, or intron for a different gene; or was intergenic (Fig. 3d and Supplementary Fig. 6). Overall, we observed that 23.1% of lead eQTL variants directly overlapped the eGene with 205 overlapping exons (2.9%), 224 overlapping promoters (8.5%), and 1,180 overlapping only introns (17%) in the associated eGene. Interestingly, mCNVs were the only eQTL variant class whose lead variants were enriched for overlapping exonic regions of eGenes compared to all other variant classes (17.7%, OR = 9.16, q = 4.7e-26, Fig. 3d, e). mCNV lead variants were more likely to overlap gene exons even though a substantial number of mCNV eQTLs were also located far from their eGene (Fig. 3c) suggesting that a subset of mCNV eQTLs may be distal regulatory variants and a subset may affect expression by directly altering eGene copy number. Lead mCNVs, duplications, and deletions were also enriched for overlapping exonic regions of other genes besides their associated eGenes compared to other variant classes (mCNVs: OR = 11.64, q = 1.81e-57; duplications: OR = 5.95, q = 3.29e-6; deletions: OR = 2.34, q = 1.42e-8; FET); conversely, STRs were depleted in other gene exons (3.82%, OR = 0.26, q = 7.7e-37, FET, Fig. 3d, e) and eGene exons (1.98%, OR = 0.32, q = 8.7e-14). Overall, lead mCNV eVariants were more likely than other eVariants to overlap eGene exons while mCNVs and duplications had more distal eQTLs than other variant classes.

We next compared the direction and absolute effect sizes of lead eQTLs that overlapped or did not overlap exons of the eGene (exonic and non-exonic eQTLs) from each variant class to determine whether variants that alter gene copy number differ from regulatory region variants. Exonic and non-exonic lead eQTLs mapped to mCNVs and exonic lead eQTLs mapped to duplications had primarily positive associations with gene expression while exonic lead eQTLs mapped to deletions had mostly negative effects (Fig. 3f). Lead eQTLs mapped to all other variant classes, including SNVs and indels, had bimodal effect size distributions. Comparing the absolute effect sizes of lead eQTLs mapped to each variant class, we found that mCNV lead eQTLs also had significantly larger effect sizes in both exonic (Fig. 3g) and non-exonic (Fig. 3h) contexts compared to lead variants from other SV classes, STRs, SNVs, and indels (Fig. 3g, h). These data show that mCNV eQTLs are unique in that they tend to exert strong positive effects on gene expression, especially mCNV eQTLs that overlap exons which are almost always positively correlated with gene expression.

eVariant type is associated with eGene type and constraint

We next investigated whether the eGene type, such as protein coding or pseudogene, was associated with the variant class of lead variants. We annotated all eGenes with Gencode gene types and calculated whether a given variant class was more or less likely to be a lead variant for eGenes of a particular gene type (Fig. 4a, b and Supplementary Fig. 7a). Notably, a lower proportion of mCNV eGenes were protein coding (OR = 0.23, q = 1.52e-28, FET) and a higher proportion were pseudogenes (OR = 8.57, q = 4.72e-34, FET) or lincRNAs (OR = 2.5, q = 2.29e-4, FET) compared to other variant classes. Duplication and deletion eGenes followed the same trends but did not reach significance. However, STR eQTLs had the opposite pattern and were enriched for protein coding genes (OR = 1.83, q = 8.38e-16, FET) and depleted for pseudogenes (OR = 0.36, q = 1.19e-16, FET) and lincRNAs (OR = 0.62, q = 8.68e-4, FET). We looked at the effect sizes of associations among different gene types and found that lead eQTLs for protein coding eGenes tended to have lower effect sizes compared to lead eQTLs for genes that are not protein coding (Fig. 4c) which is consistent with non-protein coding genes being more tolerant of disruption18. Furthermore, the observation of higher effect sizes among mCNV eQTLs and their increased likelihood to overlap exons of their eGenes may be partly explained by their association with fewer protein coding genes, while the opposite properties were observed among lead eQTLs attributed to STRs, which were less frequently exonic.

Fig. 4: Properties of eGenes associated with different variant classes.
figure 4

a Fraction of eGenes of each Gencode gene subtype mapped to lead variants of each class for the SV/STR-only eQTL. b Enrichment of the proportion of eGenes of each subtype mapped to a variant class compared with the proportion of other eGenes falling into that subtype. Significant associations (FET, BH FDR < 0.05) are indicated with x symbols. c Absolute effect size of associations for genes of each subtype among lead eQTLs. P values indicate significance of Mann–Whitney U test (one-sided) for difference in the effect size distributions of each category compared to protein coding genes (Bonferroni). For boxplots the minimum box edge indicates the first quartile while the maximum box edge indicates the third quartile and the center line indicates the median value. Whiskers of the box plot are drawn at the maximum point (upper whisker) or minimum point (lower whisker) that is within 1.5 times the interquartile range (quartile three—quartile one). d Distribution of ExAC scores for intolerance to loss-of-function variants in a single allele (pLI, red), intolerance to loss-of-function variants in both alleles (pRec, orange), and tolerance to loss-of-function variants in both alleles (pNull, blue) for 5,675 eGenes. e The percentage of eGenes (grey bars) mapped to lead variants that had high (>0.9) pLI (left), pRec (center), or pNull scores (right). Black bars show percentages for 7,337 non-eGenes from the SV/STR-only analysis. P values indicate the significance of the difference between the proportion of high score eGenes and high score non-eGenes for each group individually (FET, BH FDR < 0.05, within each probability score). f Absolute effect size versus pLI score for all eGenes with a pLI score (n = 5,675). Points are equally-sized bins and error bars show 95% confidence intervals (n = 1000 bootstraps) around the mean pLI. Line is a model predicting pLI by eQTL effect size after regressing out variant class and mean log10(TPM) expression level of the gene among expressed samples. p value is for the eQTL effect size term (t test).

Given the differences in eGene types between different variant classes, we hypothesized that eGenes might be under different levels of evolutionary constraint compared to non-eGenes. To test this, we obtained pLI scores (probability that a gene is intolerant to loss of one allele), pRec scores (probability that a gene is intolerant to loss of both alleles), and pNull scores (probability that a gene is tolerant of loss of both copies of gene) from ExAC for 13,012 of the 16,018 genes that were tested for eQTLs18,19. We examined the distributions of these constraint scores for eGenes with lead eVariants from each variant class and observed that eGenes were skewed towards low (<0.9) pLI and pNull scores but more evenly distributed between low and high pRec scores (Fig. 4d and Supplementary Fig. 7b). We found that across variant classes eGenes were significantly depleted for having high pLI scores (> 0.9) and generally enriched to have high pRec and pNull scores compared to non-eGenes (Fig. 4e). This result demonstrates that genes that are intolerant to mutation are less frequently eGenes while genes tolerant of heterozygous or null alleles are more likely to be eGenes, consistent with SNV eQTLs20. Examining this trend among variant classes, mCNVs had the lowest proportion of high pLI eGenes suggesting that mCNV protein coding eGenes are less constrained. Interestingly, eGenes mapped to deletions were most likely to be high pNull suggesting that, due to their severe negative effects on expression, deletion eVariants are under greater selection to affect dispensible genes. Given that some eGenes were classified as under high levels of constraint (pLI > 0.9), we sought to understand whether these genes are also sensitive to high levels of expression modulation. We compared the absolute effect size of lead QTLs to the pLI score of the eGene and found a strong and significant negative correlation between effect size and pLI (Fig. 4f) consistent with a previous report that there is less variation in the expression of highly constrained genes20. Taken together, these results suggest that while eGenes tend to be less constrained than other genes, the eGenes with mCNV or deletion lead eVariants are particularly tolerant of loss-of-function variation.

Multi-eGene eQTLs colocalize with distal chromatin loop anchors

Since chromatin looping has been shown to play a key role in the regulation of genes by positioning regulatory regions near gene promoters21,22,23,24, we sought to determine whether distal eVariants are located near the promoters of their eGenes in three-dimensional space via chromatin looping. We obtained chromatin loop calls from iPSC promoter capture Hi-C data25 that define promoter loops between gene promoters (promoter anchors) and distal sequences (distal anchors, Fig. 5a). We observed that 13,575 of the 16,018 genes tested for eQTLs had at least one promoter loop and that 29.2% of SVs and 30% of STRs tested for eQTLs overlapped a distal anchor. Interestingly, mCNVs were significantly less likely to overlap a distal anchor than other variant classes with only 13.5% of mCNVs overlapping (OR = 0.37, p < 1e-25, FET) likely due to the difficulty of identifying loop anchors near segmental duplications, which frequently overlap mCNVs (Supplementary Fig. 8). Among the 13,575 genes that had at least one loop and were tested for SV/STR eQTLs, we identified 177,571 (31.8%) variant-gene pairs for which the variant was: (1) closer to the distal anchor than to the promoter anchor; (2) at least 50 kb away from the gene body; (3) did not overlap the exon of the tested gene; and (4) was a maximum of 200 kb from a distal anchor, which we defined as distal variant-gene pairs (Fig. 5a). Among these distal variant-gene pairs, we observed 1,598 eGenes (22% of all eGenes) with at least one eVariant located in the distal anchor of a loop to their promoter, 963 (12.4% of all eGenes) of which were mapped to a lead eVariant in the distal anchor; 82% (788/963) of these lead variants were STRs (Fig. 5b, c). Within each variant class, distal variant-gene pairs that overlapped the distal anchor of a loop to the promoter of the tested gene were highly enriched to be eQTLs or lead eQTLs (OR = 3.5–5.4; q = 0.022-9.7e-197; FET, Fig. 5d). The fraction of eQTLs from the joint analysis with SNVs/indels that were loop-acting was nearly the same as the SV/STR only analysis with 12.5% of eGenes having at least one loop-acting eVariant and 10.4% whose lead variant was loop-acting (Supplementary Figs. 5b and 9a,b). The majority of loop-acting eQTLs in the joint analysis are SNVs or indels reflecting the large number of SNV/indel eQTLs (Supplementary Fig. 9c,d). The fraction of all eQTLs versus lead eQTLs that were loop-acting was similar for SNVs and indels but differed for SV classes in the joint analysis such as MEI, rMEI, and duplication lead eQTLs which generally did not intersect loop anchors (Supplementary Fig. 9e,f). Overall, these results indicate that many eQTLs include variants that overlap distal chromatin loop anchors, that variants that overlap distal anchors are more likely to be eQTLs and lead variants, and that there are differences in the fraction of eQTLs and lead eQTLs that are loop acting between variant classes.

Fig. 5: Localization of eQTLs near chromatin loops.
figure 5

a Diagram showing localization of SVs and STRs at loop anchors. eVariants closer to the distal anchor (right of grey dotted line) than the promoter anchor were considered loop-acting eQTLs. b Proportion of eQTLs that were genic (yellow), overlapping or close to distal anchors (green), or distal acting by some other mechanism (grey). c Distal loop-acting eQTLs (n = 2,327 eQTLs for 1,598 eGenes) per SV class. d Percentage of eVariant-eGene pairs where the eVariant (left) or lead eVariant (right) overlaps or does not overlap the distal anchor. p values derived by comparing proportions for each class (FET two sided, Benjamini–Hochberg). e Fraction of tested distal variant-gene pairs (a) that were lead eQTLs versus their distance to the distal anchor. Points represent the means of equally-sized bins; errors bars 95% confidence intervals. Curves are logistic regressions using distance to the loop anchor to predict whether the variant-gene pair was a lead association. Regressions were computed separately for variant-gene pairs inside the loop (left, n = 47,831) or outside the loop (right, n = 294,796). Center panel shows fraction of variant-gene pairs that overlapped distal anchors and were lead eQTLs (n = 41,794). f Number of eVariants connected to gene promoters through chromatin loops (x-axis) and number of these connected genes that are eGenes (y-axis). g,h Percentage of tested variants that were eVariants (g) or lead eVariants (h) stratified by number of genes the variant was linked to through a distal anchor. p values for each bar were derived by comparing the proportion of tested variants that were eVariants and linked to genes to the proportion of variants that were eVariants and not linked to genes (first bar). i Number of eGenes versus number of tested genes per eVariant stratified by whether the genes are linked by loops to the eVariant (blue) or not linked by loops (grey). Lines indicate relationship between number of eGenes per eVariant and number of genes tested for genes that were or were not linked by loops. p value is for loop/nonloop term (t-test).

We next hypothesized that if variants near loops impact gene expression, the location of variants relative to the distal anchor should be related to the chance of that variant being an eQTL. We tested if the distance of a variant to the distal anchor or the variant’s position inside or outside of the loop was predictive of whether the variant is an eVariant using a logistic model (Fig. 5a). For this model, we subsetted the variant-gene pairs to those whose variants were at least 100 kb away from the nearest TSS or 3′-UTR and a maximum of 200 kb from the nearest distal anchor to ensure we were examining interactions around the distal anchor. We observed that variants closer to the distal loop anchor were significantly more likely to be lead eQTLs (p = 0.0005) and that distal variant-gene pairs with a variant inside the loop were more likely to be lead eQTLs than those with variants outside the loop (Fig. 5e, OR = 1.5, p = 2.1e-8, FET). This suggests that variants near distal loop anchors are more likely to affect expression of the looped gene and that variants that do not directly overlap the loop anchor can still affect gene expression, potentially through changes in regulatory elements or loop structure.

Given that variants overlapping distal anchors are more likely to be eQTLs, we hypothesized that variants that are looped to multiple gene promoters may be associated with the expression of many of their looped targets. To examine this, we tested whether the number of looped genes to an eVariant was associated with the number of eGenes for that eVariant. We observed that variants overlapping distal anchors that were connected to multiple genes via chromatin loops tended to be multi-gene eVariants (Fig. 5f). We also found that the likelihood of a variant being an eQTL or lead eQTL increased significantly as the number of genes that the variant was looped to increased (Fig. 5g,h). For example, 41% of variants linked to 6 or more genes by a distal loop anchor were lead eVariants as compared to only 8.5% of distal variants that were not linked to an eGene by loop anchor (OR = 7.62, q = 6.97e-232, FET). One possible explanation of these results is that variants looping to multiple genes are located in gene-dense regions and are therefore tested for more eGenes. To address this, we compared, for each variant, the number of genes that were tested and the number that were identified as eGenes, stratified by whether the genes were connected by loops or not connected by loops, and found variants tended to have more eGenes among looped genes than genes not connected by loops (Fig. 5i, p < 1e-200, t-test, Methods). This trend was consistent across SV classes (Supplementary Fig. 10). These results suggest that variants located in high complexity loop anchors are more likely to be multi-gene eQTLs than variants simply located near many genes.

LD tagging and GWAS associations differ between variant classes

SVs and STRs are typically not assessed in GWAS, so the contribution of classes of non-SNV variation to complex traits and diseases is currently unclear. To examine the extent by which the different SV classes and STRs have been assayed by proxy in GWAS, we calculated LD between i2QTL variants and SNVs present in the UK Biobank (UKBB, ± 50 kb of each SV and STR). We observed strong LD (R2 > 0.8) with UKBB SNVs for a large proportion of STRs (81.7%), ALU and LINE1 elements (79% and 83.7%), and deletions (71.1%), but a markedly lower proportion of duplications (29.2%) and mCNVs (24.2%) were in strong LD with a nearby variant (Fig. 6a). We stratified our analysis of duplications and mCNVs by whether they overlapped a segmental duplication (SD) and found that those that overlapped SDs were less likely to be in strong LD with UKBB variants (18.9% duplications and 16.8% of mCNVs R2 > 0.8) than those that did not overlap SDs (33.3% duplications and 65.7% mCNVs R2 > 0.8, Supplementary Fig. 11), indicating that poor tagging for these classes may in part be due to the presence of repetetive sequences. We also found that only 59% of multi-allelic STRs with four or more alleles were well-tagged by UKBB SNVs. These results suggest that the duplications and mCNVs are generally not assayed by proxy in GWAS, especially when located in segmental duplications.

Fig. 6: Associations between SVs, STRs and GWAS.
figure 6

a Distribution of maximum LD score per i2QTL variant with UKBB variants within 50 kb for each variant type. p values calculated for the LD distribution of each SV class relative to STRs (Mann––Whitney U, Bonferroni). b Fraction of variants of each class that are tagged by a UKBB variant (R2 > 0.8) for lead eVariants (green) versus all other variants in that class (black). Q values indicate enrichment of lead eVariants to be in LD with a UKBB variant versus all other variants tested in the eQTL in the class (FET two-sided, Benjamini–Hochberg). c Fraction of variants of each class that are tagged by a UKBB variant that is associated with at least one trait in the UKBB (p < 5e-8). q values indicate enrichment of lead eVariants to be in strong LD with a UKBB variant that is associated with at least one trait versus all other variants tested in the eQTL in the class (FET two-sided, Benjamini–Hochberg). d Percentage of variants in LD (R2 > 0.8) with a variant significantly linked to at least one GWAS trait when significantly associated with 0, 1, or 2 eGenes or more. To compute annotated q values, we utilized all variants tested in the SV/STR-only eQTL, and for each variant class we performed logistic regression to determine whether the number of eGenes for a variant was associated with whether the variant was in strong LD with a significant GWAS variant (z-test, Benjamini–Hochberg). e Example multi-gene eSTR on chromosome 22 with nine unique eGenes (pink/red) including four genes that the STR loops to. Genes for which the variant is a lead variant are colored red. iPSC Hi-C data is visualized as a heatmap of interaction frequencies. The variant is located between two chromatin subdomains that span ~100 kb on the left side of the variant and ~25 kb on the right side of the variant22. f Example of an mCNV on chromosome 7 that is a multi-gene eQTL associated with seven unique eGenes by looping. Exact p values and odds ratios for (b) and (c) are in Supplemental Fig. 12.

Next, we investigated the extent to which SVs and STRs associated with gene expression were tagged by nearby UKBB SNVs (R2 > 0.8) or linked to diseases and traits via GWAS. We observed that deletions, rMEI, and STR lead eVariants were more likely to be in strong LD with UKBB variants compared to non-lead eVariants of the same class (Fig. 6b and Supplementary Fig. 12a). While >65% of lead eVariants for most SV classes were in strong LD with any nearby UKBB variant, only 26% of mCNV and 21% of duplication lead eVariants were strongly tagged, further supporting that most mCNVs and duplications are not assayed by proxy in GWAS. We then examined how often variants in strong LD with UKBB variants were significantly associated with at least one GWAS trait (p < 5e-8) and found that 11.4% of common STRs and SVs were by proxy associated with at least one of 701 UKBB traits (Fig. 6c and Supplementary Fig. 12b). Lead eVariants were more likely to be in strong LD with significant GWAS variants across all classes; however, enrichment was only significant in STRs, deletions, rMEIs, and ALU elements likely because other classes had too few variants to reach significance (Fig. 6c). As a whole, SVs and STRs were respectively linked to 425 and 625 of 701 distinct GWAS traits, with 412 traits linked to variants of both types. Traits linked to eSVs and eSTRs included diseases such as type 1 diabetes, multiple sclerosis, arthritis, cancers, and heart disease, as well as quantitative traits such as height, body mass index and white blood cell count (Supplementary Data 4).

We hypothesized that multi-eGene eVariants may have greater impact on common traits and examined the LD of these eQTLs with GWAS variants. Interestingly, we found that multi-eGene eVariants were highly enriched to be in strong LD (>0.8 R2) with GWAS variants (Fig. 6d). ~40% of eVariants associated with two or more eGenes were in strong LD with a GWAS variant while only 20% of eVariants associated with one eGene were in strong LD with a GWAS variant. We also observed that ~70% of eVariants associated with three or more eGenes and localized near the eGenes’ promoters via chromatin loops were in strong LD with GWAS traits. For example, a 20 bp eSTR was associated with nine eGenes (seven lead) connected via distal loops (Fig. 6e) and was in strong LD with a UKBB variant linked to 19 distinct traits including asthma and body fat percentage; two of the genes associated with this variant (TCF20 and POLR3H) have also been previously linked to autism26,27. We observed that this variant appears to overlap a chromatin subdomain boundary visible in Hi-C data from iPSCORE22 which is notable given that disease-causing STRs, such as the causal variant for Fragile X syndrome, have been reported to localize to subdomain boundaries28. Additionally, we found a 13 kb deletion on chromosome 7 linked to five eGenes via looping that was also linked to 14 traits (Fig. 6f). These data suggest that multi-gene associations mediated by chromatin looping are frequently linked to common traits, reflecting the impact of modulating the expression of several genes.

Discussion

We identified SVs and STRs associated with gene expression using a comprehensive SV/STR variant call set (Companion Paper) and RNA-sequencing from 398 iPSC samples. We discovered several genomic properties that were associated with gene expression. For deletions, duplications, and STRs, we found increased length was associated with a higher likelihood of being an eVariant and increased effect size for noncoding eVariants. We investigated the properties of eGenes associated with SVs and STRs and showed that they were less constrained than non-eGenes and that highly constrained protein coding eGenes tended to have smaller effect sizes. Distal SV and STR eVariants were enriched for being located near the promoters of their eGenes in three-dimensional space via chromatin looping. We have previously shown that loop detection may be affected by the presence of SVs22, and therefore we have likely underestimated the proportion of distal SV eVariants that mediate their effects on gene expression via chromatin loops. We also show that SV and STR eVariants near high complexity loop anchors with multiple promoter-distal regulatory element interactions are more likely to affect the expression of several genes. These results demonstrate that chromatin looping may be an important mechanism by which SVs and STRs regulate gene expression. Our study presents one of the largest sets of SVs and STRs associated with gene expression and reveals important general genomic properties of both SV and STR eVariants and their corresponding eGenes.

While previous studies have cataloged SV eQTLs, the extent to which different SV classes and STRs differentially impact gene expression has not been thoroughly investigated3,6,7,22. We identified substantial differences between the different SV classes in their genomic locations relative to eGenes; their likelihood of being associated with multiple eGenes; the types of associated eGene (i.e., coding, noncoding, evolutionary constraint); their effect sizes; the extent of linkage disequilibrium with tagging single nucleotide polymorphisms (SNPs) used in GWAS; and their likelihood of being associated with GWAS traits. Interestingly, mCNV eQTLs differed in several respects compared to eQTLs for other variant classes. mCNVs eQTLs were more likely to be associated with the expression of multiple genes, had larger effect sizes, tended to affect noncoding genes, and were more likely to overlap the corresponding eGene or be located far from the eGene. mCNV eQTLS that overlapped exons were also highly enriched for positive associations between copy number and expression relative to other variant classes. Unlike other SV classes, the length of mCNVs was not strongly associated with the probability of being an eQTL. The differences in likelihood of being an eQTL, location, effect size, and types of eGenes for mCNVs are likely related; for instance, less constrained genes tend to have larger eQTL effect sizes, mCNVs tend to be eQTLs for less constrained genes, and mCNV eQTLs tend to have larger effect sizes. Our results indicate that a previous finding that mCNVs were enriched among predicted causal eQTL variants might be driven by the fact that mCNVs often overlap genes and likely cause eQTLs1. We also observed that deletion eQTLs were more likely to be associated with the expression of multiple genes but tended to have smaller effects on gene expression, not overlap genes, and affect less constrained genes. These observations are consistent with gene deletions and subsequent loss of expression having strong deleterious effects. Future studies may focus on whether the differences in eQTLs between variant classes are driven by selective pressures, genomic property differences between the SV classes, or some combination thereof.

The extent to which SVs and STRs contribute to variation in complex traits is not fully known because prior GWAS have generally not assessed SVs and STRs. We used our comprehensive SV/STR call set to estimate how well these variants are tagged by GWAS SNPs and whether they are associated with 701 traits from the UK Biobank. We found that only 26% of mCNV and 21% of duplication lead eVariants were tagged (R2 > 0.8) by a SNP in the UK Biobank, likely due in part to these variants being located in or near segmental duplications, indicating that these variants are generally missed in GWAS studies based on genotyping arrays. Multiallelic STRs are also not tagged well by SNPs and are likely not well-studied by current GWAS. We observed that 11.4% of common SVs and STRs are in strong LD with at least one significant GWAS SNP in the UK Biobank and that lead eSVs were more likely to be associated with traits compared to non-lead eSVs. We also identified a set of high-impact SVs and STRs associated with the expression of multiple genes and localized near the promoters of these genes via chromatin loops which are also highly enriched for GWAS associations. These high-impact variants that are associated with several seemingly unrelated GWAS traits may underly some of the observed pleiotropy in contempary genetic studies29 and indicate that future fine-mapping efforts will greatly benefit from including SVs and STRs.

Our study demonstrates that SVs and STRs play an important role in the regulation of gene expression and that eQTLs for different classes of SVs and STRs differ in their effect sizes, genomic locations, and the types of eGenes they impact. We have also demonstrated that high-impact SVs and STRs, i.e., those associated with the expression of multiple genes via chromatin looping, are associated with a wide range of human traits. We anticipate that these properties of SV and STR eQTLs will be useful for identifying causal variants underlying eQTLs and delineating mechanisms by which structural variation can impact gene expression. The collection of eQTLs identified here, along with the catalog of high-quality SVs and STRs described in a companion paper4, provide a powerful resource for future studies examining how these variants regulate gene expression and contribute to variation in complex traits.

Methods

Variant calls

Single nucleotide variant (SNV), insertion/deletion (indel), structural variant (SV) and short tandem repeat (STR) variant calls for iPSCORE and HipSci samples were discovered and rigorously analyzed in a companion paper (dbGaP: phs0013254). The iPSCORE collection was approved by the Institutional Review Board of the University of California at San Diego (Project #110776ZF). We have complied with all relevant ethical regulations for work with human participants and obtained informed consent. We used five variants callers to identify SVs and STRs. We used the SpeedSeq (SS) SV pipeline30 that combines LUMPY31 read-pair evidence with read depth support from CNVnator32. We also used the Genome STRiP CNVDiscovery pipeline (GS) and Genome STRiP LCNVDiscovery pipeline (GS LCNV)33 that detect SVs based on read depth evidence. We used MELT34 for mobile element insertion discovery and HipSTR6 to identify and genotype short tandem repeats. LUMPY, GS, and GS LCNV each identified biallelic deletions (DEL), biallelic duplications (DUP), and multi-allelic copy number variants (mCNVs). mCNVs are defined as variants that have at least 3 predicted alleles. LUMPY identified inversions (INV) and generic breakends (BND) that can include deletions and duplications that lack read-depth evidence, balanced rearrangements (INVs), MEIs, or other uncategorized breakpoints. As part of the SpeedSeq pipeline we also identified reference mobile elements (rMEIs). For non-reference mobile element insertions we used MELT to identify Alu element insertions (ALU), LINE-1 element insertions (LINE1), and SINE-R/VNTR/Alu element insertions (SVA). HipSTR identifies short tandem repeats (STRs) where at least one individual differed in STR length compared to the reference. We considered copy number variants (CNVs) to include deletions, duplications, and mCNVs. We considered MEI to encompass non-reference mobile element insertion ascertained by MELT, including ALU, LINE1, and SVA elements.

RNA-Seq quality control and processing

As part of the i2QTL Consortium, we have collected a set of RNA sequencing (RNA-seq) samples from 1,367 human induced pluripotent stem cell (iPSC) lines derived from 948 unique donors from five studies: iPSCORE7,17, HipSci16,35, Banovich et al.36, GENESiPS37, and PhLiPS38. Sample processing and quality control was performed across all samples as described below, but the eQTL analysis presented here uses a subset of the total data set corresponding to 388 unique donors from iPSCORE and HipSci that have variant calls from deep whole genome sequencing4. The RNA-seq data were obtained from: (1) 210 iPSCORE RNA-seq samples from dbGaP (phs000924); (2) 288 HipSci cell lines (from 188 individuals) from the ENA project ERP007111 and several EGA projects (Supplementary Table 1); (3) Banovich et al.36 (SRA: SRP093633, http://eqtl.uchicago.edu/yri_ipsc/); (4) GENESiPS (SRA—SRP072417, dbGaP: phs001139.v1.p1); (5) the PhLiPS projects (dbGaP: phs001341.v1.p1.). Data was available from these sources as either FASTQ, BAM or CRAM files. To ensure uniformity in processing, CRAM and BAM files were converted to FASTQ files. The reads in the FASTQ files were then trimmed to remove adapters and low quality bases (using Trim Galore!, http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/), followed by read alignment using STAR (version: 020201)39 with the two-pass alignment mode and default parameters as proposed by ENCODE (c.f. STAR manual). All alignments were relative to the GRCh37 reference genome, using Ensembl 7540 for any of the necessary genome annotations. Gene-level RNA expression was quantified from the STAR alignments using featureCounts (v1.6.0)41, which was applied to the primary alignments using the -B and -C options in stranded mode when applicable. In case multiple RNA-seq runs per iPSC-line were generated these were summed to one set of gene-counts per iPSC line.

After feature quantification high-quality RNA-seq samples were identified by applying filters on both Picard (https://broadinstitute.github.io/picard/) and VerifyBamID (http://csg.sph.umich.edu/kang/verifyBamID/) quality measures as well as gene expression levels. We defined high-quality samples as those with > 15 million reads, > 30% coding bases, > 65% coding mRNA bases, a duplication rate lower than 75%, Median 5′ bias below 0.4, a 3′ bias below 4, a 5′–3′ bias between 0.2 and 2, a median coefficient of variation of coverage of the 1000 most expressed genes below 0.8, and a free-mix value below 0.05.

Subsequently, gene expression values were normalized across lines that passed quality control. For this we derived edgeR42,43 corrected transcript per million gene-level quantifications per iPSC line from the feature count information. After this normalization we removed samples that had low expression correlation (<0.6) with the average iPSC expression profile across our study, as measured per chromosome. For the purpose of the eQTL analyses presented here, we used gene expression estimates for 288 HipSci cell lines (188 individuals) and 210 iPSCORE cell lines (210 individuals) that had corresponding deep whole genome sequencing data (WGS) that allowed for comprehensive characterization of SNVs, indels, SVs, and STRs4. This joint data set of variant calls and iPSC gene expression data for 398 individuals is referred to as the i2QTL data set in this manuscript. Additional detail on these methods is described in a paper describing eQTLs on the full data set44.

eQTL analysis

To find eQTLs we tested for associations between variants within a cis-region spanning 1MB up- and downstream of the gene body and 16,018 robustly expressed autosomal genes (expressed in >20% of samples at an average TPM > 0.5 among samples that expressed the gene) in 398 the HipSci and iPSCORE donors (Supplementary Data 1). We performed association tests using a linear mixed model (LMM), accounting for population structure and sample repeat structure as random effects (using a kinship matrix estimated using PLINK45). All models were fit using LIMIX46 (https://limix.readthedocs.io/).

Before QTL testing the gene expression-levels were log transformed and standardized. Significance was tested using a likelihood ratio test. To adjust for global differences in expression across samples, we included the first 50 PEER factors (calculated across all 1,367 lines using log transformed expression values) as covariates in the model. In order to adjust for multiple testing, we used an approximate permutation scheme, analogous to the approach proposed in Ongen et al.47. Briefly, for each gene, we ran LIMIX on 1,000 permutations of the genotypes while keeping covariates, kinship, and expression values fixed. We then adjusted for multiple testing using this empirical null distribution. To control for multiple testing across genes, we used Storey’s q values48. Genes with significant eQTLs were reported at an FDR < 5%.

eQTL input variants and post-processing

Because there are differences in types of SVs (e.g., copy number variants, mobile element insertions) and the output of SV variant callers, genotypes were preprocessed before use in the eQTL analysis. Since some STRs are highly multi-allelic, we used the difference in the number of base pairs with respect to the reference (expansion or contraction), as computed from the sum of the GB format tag in the HipSTR VCF file, as genotypes for eQTL analysis49. Genome STRiP CNVDiscovery and LCNVDiscovery33 variants were encoded with integer diploid copy numbers (CN). SpeedSeq30,31 variants were encoded using their allele balance (AB) fractions at each genotype, which ranges from 0 to 1 based on the amount of evidence for the variant at a site, for greater sensitivity and consistency with Genome STRiP variants, which use a continuous copy number. Finally, for MEIs identified by MELT34, we used traditional genotypes (0/0, 0/1, 1/1) outputted by the software, as these SVs are expected to be largely biallelic and there is no continuous genotype outputs available. Before performing the eQTL analysis, genotypes for all SV callers (excluding MELT) were rank normalized and converted to a 0-2 scale. For MELT variants, reference, heterozygous, and homozygous alternate genotypes were converted to 0, 1, and 2 respectively. Missing genotypes from all variant callers were filled with the mean dosage among non-missing samples prior to the eQTL. GATK50 SNV and indel genotypes were processed in the same way as MELT variants, converting them to 0, 1 and 2.

For the eQTL analysis, we utilized 5,834,257 SNVs, 1,520,762 indels that were present at a minor allele frequency of at least 5% and 9,313 SVs and 33,608 STRs that were present at a non-mode allele frequency of at least 5% among the 398 i2QTL donors with RNA-seq, passed QC, and were within 1MB of at least one of 16,018 expressed genes (see eQTL analysis). Notably, non-mode allele frequency was used for SVs and STRs in order to account for multi-allelic variants. For STRs, the non-mode allele frequency is computed from the difference in length of a genotype from the reference, as detailed in Gymrek et al.49. The structural variant call set includes variants generated from the same caller or different callers that pass QC but may be redundant (overlapping or highly correlated)4. In a companion paper4, we identified these redundant clusters and selected high-quality variants to create a non-redundant set of variants; here we chose to include all variants that passed quality control filters in the eQTL analysis including those that were marked as part of a redundancy cluster in order to maximize the chances of SV associations. Additionally, STRs were required to have a 99% call rate in both iPSCORE and HipSci samples in order to be included in the eQTL to prevent batch effects from affecting eQTLs4. To compute the number of unique variants in downstream analyses, variants were annotated with the redundancy clusters they belonged to4 and variants in the same cluster were considered as a single variant. We thus tested 9,313 non-redundant SVs that were in cis windows of expressed genes. Because variants could be associated with multiple eGenes, we considered eQTLs to be an SV/eGene pair. We performed two independent eQTL analyses: (1) using STRs, SVs, small indels and SNVs (joint eQTL analysis) and (2) using only STRs and SVs (SV/STR-only eQTL analysis; Fig. 1a).

Association of variant length with likelihood and strength of eQTLs

To test whether longer variants were more likely to be eQTLs we restricted our analysis to tested variants from each variant class that was highly polymorphic in length (spanning orders of magnitude within variant class): duplications, deletions, mCNVs, and STRs. For variants of each of these classes, we computed the fraction of variants that were eVariants or lead eVariants that were longer than a given length threshold and calculated an enrichment p value by comparing the proportion of variants that were eVariants or lead eVariants among variants longer than the threshold to the proportion among variants smaller than the threshold (Fisher’s Exact Test, Benjamini–Hochberg (BH)). To compare the association of length of variants with effect size, we utilized all significant eQTLs from each of the aforementioned variant classes and fit a logistic regression model comparing the absolute effect size of these associations with the length of the associated variant using the distance to the nearest TSS of the eGene and the non-mode allele frequency of the variant as covariates. p values from the regression were estimated with the Wald Test and then corrected using the Bonferroni correction.

Properties of SV-QTLs among different variant classes

To determine which SV classes were more likely to be associated with eGenes, we compared the proportion of variants that were eVariants for a given variant class to the proportion that were eVariants for STRs (Fisher’s Exact Test, FDR < 5%, BH). To study the localization of eSVs with respect to eGenes, we used Gencode v1951 annotations to measure distance to the nearest transcription start site for tested genes, as well as categorize variants based on their overlap of introns, exons, and promoters of tested genes, or elements of other genes. A variant was considered to overlap a particular genomic if feature if it overlapped by at least one base pair and each variant was categorized hierarchically into one of the following 7 categories, in order of precedence: 1) exonic eGene, 2) promoter eGene 3) intronic eGene 4) exonic other 5) promoter other, 6) intronic other 7) intergenic (overlapping none of the features).

eGenes properties and constraint

Gene types were annotated using Gencode v19 data for all expressed genes. We then performed enrichment analyses comparing the proportion of eGenes of a specific type mapped to each class to the proportion among all other variant classes (Fisher’s Exact Test, BH). To compare the effect sizes of associations with each gene type, we compared the distribution of effect sizes for lead associations for protein coding eGenes to those of pseudogenes, lincRNA, antisense and all other genes (Mann–Whitney U test, BH). To investigate the constraint of eGenes, we obtained ExAC (v0.3.1)20 pLI, pNull, and pRec estimates for 13,012 expressed genes and restricted our analyses to lead associations with these eGenes. We then compared the proportion of eGenes with high (> 0.9) ExAC scores mapped to either deletions, duplications, mCNV, MEI (LINE1, SVA, and Alu), STR, or all 13,012 eGenes to the proportion of genes with a high score among the 7,337 non-eGenes that were tested in our data set using Fisher’s Exact Test and adjusting for multiple testing with the Bonferonni method. Finally, we fit a logistic model predicting the pLI of an eGene using the eGene’s absolute effect size and including logTPM of the eGene as a covariate to test for an association between eGene effect size and pLI.

eQTL localization near distal anchors of chromatin loops

To examine the localization of SVs and eSVs with respect to chromatin loops in iPSCs, we downloaded previously published iPSC promoter capture Hi-C loop calls25. For this analysis, we obtained loops intersecting the promoter of 13,575 out of the 16,018 expressed genes included in the eQTL analysis, of which 5,803 were eGenes. To identify variants that might affect chromatin looping, we first intersected loop calls with all annotated Gencode v19 promoters. Then, for each variant, we computed the distance from each loop anchor and retained only the variants closer to the distal anchor (i.e. the anchor that does not overlap the promoter). We subsetted this set of variant-gene pairs to those where the variant was: (1) closer to the distal anchor than the promoter anchor (2) at least 50 kb from the promoter (3) at most 200 kb from the distal anchor which comprised 177,571 variant-gene pairs (31.8% of all tested variant-gene pairs). For all these variants, we determined whether they were included in the Hi-C loop (i.e. between the promoter anchor and the distal anchor) or outside the loop. To test whether variants that hit distal loop anchors are enriched to be eQTLs, we categorized variants within 10 kb of a loop anchor as intersecting that anchor. To calculate enrichment, we tested the proportion of eVariants that intersected a distal loop anchor to at least one expressed gene versus the proportion of eVariants that did not intersect distal loop anchors (Fisher’s exact test). Next, we took this subset of variant-gene pairs and restricted it further to cases where the variant was at also least 100 kb from the gene body, to test whether the distance of a variant from the distal loop anchor was associated with its likelihood of being associated with gene expression. We fit a logistic model to this set of 161,793 variant gene pairs to see whether distance to the distal anchor is predictive of the likelihood of being associated with gene expression using distance to the gene body and non-mode allele frequency as covariates. We also compared the proportion of variants that were eVariants that were within 100 kb from the outside of the distal loop anchor to variants that were within 100 kb from inside of the distal loop anchor (Fisher’s exact test). To test whether overlapping loop anchors associated with multiple promoters was more predictive of associations with multiple eGenes than variant localization in a gene dense window, we modeled the number of eGenes versus the number of genes tested for each eVariant, stratifying by the number of genes tested that were either connected by loops to the promoter or not connected by loops to the promoter (statsmodels.api.logit, statsmodels v0.9.0, https://pypi.org/project/statsmodels/). To visualize these regressions, seaborn (regplot)(https://pypi.org/project/seaborn/) was used in order to divide the X axis (number of genes tested that were connected or not connected by loops) into bins with points drawn at the center of the bin showing the mean and error bars indicating 95% confidence intervals. The same bins were used for both groups in order to enable direct comparison between groups, however, each bin does not contain equal numbers of observations.

SV/STR LD tagging and GWAS associations

We downloaded summary statistics for 4,357 human traits from the UK BioBank (UKBB) GWAS Round 2 (http://www.nealelab.is/uk-biobank, 01 August 2018). For each of the 42,921 non-redundant SVs and STRs, we used bcftools52 to extract all SNPs 50 kb upstream and downstream. For each SV or STR, we calculated LD as the correlation (R2) with the genotypes of each surrounding SNV or indel genotyped in i2QTL WGS. We selected the variant with strongest LD overall, as well as the variant with the strongest LD that was included in the UKBB data set (if the two were different). For each UKBB variant linked to an SV or STR, we obtained p values for the variant in all GWAS studies and considered it to be significantly associated with a trait if p < 5 × 10−8. For each variant type, we selected all lead SVs and STRs from the SV/STR-only eQTL analysis and tested if the lead eVariants were: (1) more likely to be in strong LD with UKBB variants in general, and (2) more likely to be in strong LD with UKBB variants significantly associated with a GWAS trait, as compared to non-lead eVariants, using the Fisher’s exact test. To test the association of multi-eGene eQTLs with the likelihood of being in strong LD with a variant significantly associated with a GWAS trait, we divided tested variants by class and modeled the likelihood of a variant being linked to a trait versus the number associated eGenes (statsmodels.api.logit, statsmodels v0.9.0, https://pypi.org/project/statsmodels/). p values were calculated using the Wald test and then corrected for multiple testing using Benjamini–Hochberg.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.