Dissecting the genetic overlap between three complex phenotypes with trivariate MiXeR

Comorbidities are an increasing global health challenge. Accumulating evidence suggests overlapping genetic architectures underlying comorbid complex human traits and disorders. The bivariate causal mixture model (MiXeR) can quantify the polygenic overlap between complex phenotypes beyond global genetic correlation. Still, the pattern of genetic overlap between three distinct phenotypes, which is important to better characterize multimorbidities, has previously not been possible to quantify. Here, we present and validate the trivariate MiXeR tool, which disentangles the pattern of genetic overlap between three phenotypes using summary statistics from genome-wide association studies (GWAS). Our simulations show that the trivariate MiXeR can reliably reconstruct different patterns of genetic overlap. We further demonstrate how the tool can be used to estimate the proportions of genetic overlap between three phenotypes using real GWAS data, providing examples of complex patterns of genetic overlap between diverse human traits and diseases that could not be deduced from bivariate analyses. This contributes to a better understanding of the etiology of complex phenotypes and the nature of their relationship, which may aid in dissecting comorbidity patterns and their biological underpinnings.


Introduction
Many human traits and disorders are highly polygenic, with thousands of associated loci discovered to date [1].Most functional genetic loci affect multiple phenotypes spawning intricate patterns of genetic overlap among them [2].Characterization of patterns of overlapping genetic architecture across multiple phenotypes has generated key discoveries in human genetics in recent years.This characterization has helped reveal the shared genetic underpinnings of a wide range of common human diseases [3].Ultimately, a more complete understanding of the complex genetic relationships between various traits and disorders can help to elucidate mechanisms underlying multimorbidity [4], which is an increasing global health challenge [5].It can also lead to improvements in disease classification and diagnostics.
Previous causal mixture model (MiXeR) tools have been implemented to model genetic architecture and characterize genetic overlap.Univariate MiXeR [1] was developed to quantify characteristics of genetic architecture for complex phenotypes, including polygenicity, which reflects the number of genetic variants influencing a phenotype.This approach was extended to bivariate MiXeR [6] to quantify the overlapping polygenic components between two phenotypes regardless of the effect directions.While genetic correlation is often used to assess genetic overlap between complex phenotypes, its ability to detect overlap is limited to pairs of phenotypes where the bulk of variants have either concordant or discordant directions of effects.A pair of phenotypes sharing variants with a balanced mixture of concordant and discordant effects will provide a genetic correlation close to zero, similar to a pair of genetically disjoint phenotypes, making these two scenarios indistinguishable [7].Bivariate MiXeR has helped to better characterize the relationship between various pairs of phenotypes beyond genetic correlation, effectively capturing shared genetics with mixed directions of effects [3,[8][9][10].However, neither genetic correlation nor bivariate MiXeR can directly estimate genetic overlap across three phenotypes.Apart from trivial cases of non-overlapping or completely overlappling phenotypes, trivariate overlap cannot be reconstructed from a series of three bivariate analyses (analyzing each pair of phenotypes within a triad).Indirect reconstruction of trivariate overlap from three bivariate analyses under a naïve assumption of a maximum entropy probability distribution of overlapping parts may lead to erroneous estimates.For a given combination of three bivariate overlaps, with no additional prior knowledge, the maximum entropy distribution of the trivariate overlap can be conceptualized as the center of all possible trivariate overlap distributions.Investigating genetic overlap among triads of phenotypes can help reveal important aspects of genetic overlap associated with different scenarios of overlap between phenotypes, tissue types or biological mechanisms.
Here, we present trivariate MiXeR, which disentangles the pattern of polygenic overlap among three complex phenotypes using summary statistics from genome-wide association studies (GWAS).We first conduct a series of simulations covering diverse scenarios of genetic overlap among three phenotypes and demonstrate that the tool can reliably reconstruct different patterns of trivariate genetic overlap.We then apply trivariate MiXeR to GWAS summary statistics for eight complex phenotypes representing a range of human traits and disorders for which epidemiological studies suggest shared causal pathways.Our analyses demonstrate non-trivial patterns of trivariate genetic overlap that are substantially different from naïvely expected patterns derived from three bivariate analyses following the principle of maximum entropy. .

Trivariate MiXeR model
The method extends the bivariate MiXeR model [6] for the case of three phenotypes, leaving the basic assumptions of the model unchanged.Briefly, an additive model of genetic effects is considered.In the univariate analysis, the direct (not induced by linkage disequilibrium) effect  of the j th variant on a phenotype is modeled as a mixture of null and phenotype-influencing components characterized by two parameters: the proportion of variants influencing the phenotype (polygenicity,  ∈ [0,1]) and the variance of their effect sizes (discoverability,  ): where (0,  ) is a normal distribution with zero mean and  variance.
Univariate, bivariate and trivariate log-likelihood functions are implemented using numerical integration of the characteristic function applying a trapezoidal rule with fixed step size as described previously [11]. .

Simulation setup
To validate the method and to test its ability to discriminate different scenarios of genetic overlap, we performed a series of analyses with simulated data.GWAS summary statistics for simulations were generated based on participants randomly selected from the UK Biobank using 100,000 unrelated (defined by 22020 data-field) white British (defined by 22006 data-field) individuals and version 3 of the genetic data.UK Biobank data was obtained under accession number 27412.Autosomal variants with minor allele frequency above 0.1%, genotype missingness below 10%, imputation info score above 0.8 and passing Hardy-Weinberg equilibrium test at p=1E-10, totaling 12,926,691 variants were included in the analysis.A set of quantitative phenotypes with equal polygenicity ( = 0.002), equal SNP-heritability (ℎ2 = 0.4) and different patterns of genetic overlap were generated using the SIMU tool [12].For each phenotype, a given number of phenotype-influencing variants ( = 25,853 ≈ 12,926,691 * 0.002) were selected at random.The effect sizes for the selected variants were sampled from a standard normal distribution and scaled to obtain the predefined SNP-heritability.For each individual analyzed, a quantitative synthetic phenotype was then generated as the sum of allelic effects over all phenotypeinfluencing variants complemented by a certain proportion of random Gaussian noise (representing environmental effects) required to keep the predefined level of heritability.Association analysis was performed using PLINK2 [13,14] with sex, age and the first 10 genetic principal components included as covariates.Three simulation scenarios were considered: 1. "Core": only triple overlap, i.e., all overlapping variants are shared between all three phenotypes, for each phenotype half of phenotype-influencing variants also influence both other phenotypes.For each triad of phenotypes, sixteen independent optimization runs were performed to maximize the likelihood of the GWAS z-scores observed in different subsets of 500,000 randomly selected variants.We then calculated the median across these sixteen runs for each polygenicity parameter ( = median  ,  ∈ {1, 2, 3, 12, 13, 23, 123},  = 1, … ,16) and find the run with the smallest deviation from the median overlap pattern.A Euler diagram for this run is then presented both for the simulated data and for the real data analysis.Of note, the pattern constituted from the median polygenicities ( ,  ∈ {1, 2, 3, 12, 13, 23, 123}) is not guaranteed to be feasible itself, since proportions in the overlap of three phenotypes are constrained, as described in the "Naïve expectation" section below, and these constraints are not necessarily fulfilled for the median proportions across multiple patterns which are themselves feasible.

Genome-wide association studies (GWAS) data
For the analysis of the real data, we used publicly available GWAS summary statistics on eight phenotypes: ulcerative colitis [15], psoriasis [16] (FinnGen, release 9, phenotypic code L12_PSORIASIS), multiple sclerosis [17], type 2 diabetes [18], estimated glomerular filtration rate [19], high-density lipoprotein [20], placental weight (fetal GWAS adjusted for fetal sex and gestational duration) [21], height [22] and schizophrenia [23].All summary statistics were based on individuals of European ancestry.We analyzed .patterns of genetic overlap for three triads: (1) type 2 diabetes, estimated glomerular filtration rate and high-density lipoprotein (2) ulcerative colitis, multiple sclerosis and psoriasis, (3) placental weight, height and schizophrenia.The selection of these triads is guided by epidemiological and clinical evidence and intends to illustrate that the method may provide insights into a wide spectrum of traits and disorders.

Naïve expectation
Given univariate estimates of polygenicity for three phenotypes ( ,  ,  ), three bivariate estimates of genetic overlap between these phenotypes ( ,  ,  ) obtained in univariate and bivariate analyses for the triad of phenotypes the genetic overlap between these three phenotypes is constrained by bounds  ∈  ,  , where  =  0,  +  −  ,  +  −  ,  +  −  and  =   ,  ,  .Without any prior knowledge about genetic relationships between analyzed phenotypes, a naïve expectation about the value of  follows the principle of maximum entropy: In other words for the given univariate ( ,  ,  ) and bivariate ( ,  ,  ) polygenicities, the naïve trivariate polygenicity ( ï ) is selected so that among all the possible distributions ( ,  ,  ,  ,  ,  ,  ) with the constraint's  ∈  ,  , the probability distribution ( ï ,  ï ,  ï ,  ï ,  ï ,  ï ,  ï ) has maximum entropy, where are the naïvely expected phenotype pair-specific and phenotype-specific polygenicities.Of note, polygenicities for the naïve expectation ( ï ,  ï ,  ï ,  ï ,  ï ,  ï ,  ï ) are calculated based on univariate ( ,  ,  ) and bivariate ( ,  ,  ) polygenicities estimated by the trivariate MiXeR model, therefore the univariate polygenicity of each phenotype as well as the shared polygenicity of each phenotypic pair are the same for the naïvely expected pattern and the pattern estimated by the trivariate MiXeR, i.e.,  ï +  ï =  +  =  , and similar for other bivariate and univariate polygenicities.

Linkage disequilibrium reference panel
Both simulations and real data analyses were performed using an LD reference panel constructed based on 10,000 randomly selected unrelated white British individuals from the UK Biobank.Quality control procedures for variants were identical to those described in the 'Simulations' section leaving 12,926,691 variants in the LD reference panel.PLINK 1.9 [13,24] was applied to estimate r2 coefficients within each autosome using --ld-window 1000000 --ld-window-kb 20000 --ld-window-r2 0.01 parameters.The resulting text files were then processed to produce input files in the format required by MiXeR, using the scripts provided in the code repository.

Trivariate MiXeR model implementation
Building on the same assumptions as the bivariate MiXeR model, we have extended the framework to include three phenotypes modeling the trivariate distribution of genetic effects as a mixture of eight components.Compared to the bivariate MiXeR tool, the code for log-likelihood estimation was reimplemented using numerical integration of the characteristic function.This facilitated more stable convergence of the optimization algorithm and reduced fluctuations in the estimates caused by the random sampling approach applied in the bivariate MiXeR v1.3 implementation [25], at the cost of substantially increased computational burden.To cope with the increased computational demand, performance-critical parts have been accelerated using GPUs.Trivariate MiXeR code can only be deployed on machines with GPUs supporting NVIDIA CUDA, which are now commonly available on highperformance computing (HPC) facilities or cloud computing facilities.
The trivariate MiXeR tool is implemented in Python mainly using numpy [26], scipy [27] and pandas [28] packages, while the ´numba´ just-in-time (JIT) compiler [29] is used to translate the Python and numpybased routines into machine code.Performance-critical steps rely on the availability of a graphics processing unit (GPU, NVIDIA CUDA).Overlapping patterns are visualized using the ´eulerr´ R package [30].The execution environment with all dependencies can be created using Conda [31] mamba or micromamba [32].
Input parameters for the analysis can be tuned and provided in the configuration file in JSON format.An example configuration file showing parameters is available in the code repository (https://github.com/precimed/mix3r/blob/main/config_t2d_hdl_egfr_oct30_1.paper.json).Important parameters used in all presented analyses are: maf_thresh = 0.05 -z-scores of the variants with minor allele frequency (MAF) below 5% were not used for optimization (MAF is estimated from the same genotypes which are used to construct the LD reference panel); info_thresh = 0.8 -z-scores of the variants with imputation INFO score below 0.8 are not used for optimization, if input sumstats do not contain INFO column the filter is ignored; z_thresh = 32 -z-scores with absolute value larger than 32 were not used for optimization; exclude_regions = 6:25000000-34000000 -variants from the major histocompatibility complex (chr6:25000000-34000000, hg19 genomic guild) were not used for optimization; do_pruning = true, r2_prune_thresh = 0.8 -prior to optimization, variants were randomly pruned with allelic correlation threshold r2 < 0.8; n_random = 500000 -a subset of 500K variants was randomly selected for optimization from all variants surviving random pruning; rand_prune_seed = 1 -a seed for the generator of the pseudorandom numbers (controls both random pruning and random sub-setting of variants), changing this parameter while keeping all other parameters unchanged allows the user to repeat the analysis with a different subset of variants.In this study rand_prune_seed = 1, ..., 16 were used to produce 16 independent runs for each triad of phenotypes.

Application to simulated data
Our simulations demonstrated that MiXeR was able capture the true patterns of genetic overlap for various simulated scenarios (Figure 1).For "core" (Figure 1, row 1) and "ring" (Figure 1, row 2) scenarios, the trivariate MiXeR model accurately captured the disproportional pattern of overlap in the simulated .datasets (column A), while the result expected for the estimated bivariate overlaps under the naïve assumption of the maximum entropy probability distribution (column C) revealed substantial deviation from the true simulated pattern (column B).For the balanced "equilibrium" scenario (Figure 1, row 3) the true pattern (column B) followed the maximum entropy principle thus the naïve expectation (column C) should be no worse than the pattern reconstructed by the trivariate MiXeR (column A).As can be seen in this case the naïve pattern and the pattern reconstructed by the trivariate MiXeR model were very similar, illustrating adequate model fit.Estimates of all model parameters for 16 independent MiXeR runs for "core", "ring" and "equilibrium" scenarios are shown in Tables S1, S2 and S3 respectively.
. Figure 1.Simulated data.Three different scenarios of genetic overlap in simulated data (rows) estimated by trivariate MiXeR (column A, solid black outline), compared to the theoretical true pattern of the simulated overlap (column B, solid white outline) and the overlap pattern expected for the estimated bivariate overlaps under naïve assumption of maximum entropy (column C, dashed black outline).Row 1 (blue colors): "Core" scenario; Row 2 (red colors): "Ring" scenario; Row 3 (green colors): "Equilibrium" scenario (see `Methods` for further details).For each simulation scenario (within each row), for every area of each diagram, its percentage with respect to the combined total area of all three phenotypes in the estimated diagram (Column A) is shown (rounded to the closest integer), i.e., percentages within each diagram in the column A add up to 100 and percentages within each row are directly comparable with those shown in column A. Since percentages in column C are also given with respect to the combined total area of the corresponding diagram in column A, the sum of percentages in column C is not necessarily equal to 100.For each scenario phenotypes were simulated independently, therefore "Trait 1", "Trait 2" and "Trait 3" in row 1 are not the same as "Trait 1", "Trait 2" and "Trait 3" in row 2 or in row 3 of the figure respectively.The middle diagram in the bottom row (row 3, column B) shows the nomenclature for proportions of areas within the pattern ( ,  ,  ,  ,  ,  ,  ) used throughout the text.

Application to real data
We applied trivariate MiXeR to GWAS summary statistics on three triads of phenotypes and compared the pattern of genetic overlap estimated by trivariate MiXeR (Figure 2, column A) with the pattern expected for the given bivariate overlaps under a naïve expectation of maximum entropy probability distribution (Figure 2, column B).
The three examples illustrate the range of observed discrepancies between naïve expectations based on bivariate MiXeR and patterns estimated by trivariate MiXeR, which demonstrate the importance of applying trivariate analysis when studying the genetic architecture of multimorbidities.

Discussion
We have developed and validated the trivariate MiXeR tool to disentangle the pattern of genetic overlap among three complex phenotypes using genome-wide data.The simulations showed that the tool can .reliably reconstruct different patterns of genetic overlap.Furthermore, we have demonstrated how trivariate MiXeR can estimate the proportions of polygenic overlap among diverse human traits and diseases, highlighting patterns of genetic overlap that could not be deduced from bivariate MiXeR.
Pairwise genetic overlaps between multiple phenotypes have been extensively studied and have provided valuable insights into the shared and phenotype-specific genetic architectures of different traits and disorders [3,33,34].However, estimating pairwise genetic overlaps among three phenotypes does not provide a complete picture of the genetic overlap among those three phenotypes.We show that the trivariate MiXeR model can dissect the pattern of genetic overlap among three complex phenotypes, using GWAS summary statistics, and reveal patterns of overlap that are distinct from naïve expectations based on bivariate MiXeR.We provide three examples with real phenotypes demonstrating that trivariate MiXeR can elucidate situations where a triad of phenotypes can overlap disproportionately, providing novel insights into the variability in overlapping genetic underpinnings among those phenotypes.
In clinical and epidemiological studies, type 2 diabetes has been associated with structural changes and abnormal function of high-density lipoprotein [35], which may impact renal function and increase the risk of kidney disease [36,37].Our trivariate MiXeR analysis of type 2 diabetes, high-density lipoprotein and a renal function measure (estimated glomerular filtration rate) demonstrates substantial polygenic overlap between these three phenotypes that were different from the overlap pattern expected from bivariate MiXeR results.These results nevertheless show a mixture of trivariate and bivariate overlapping polygenic components characterizing the shared genetic architecture of these phenotypes and suggesting a complex genetic relationship.
Our analysis of genetic overlap between ulcerative colitis, psoriasis and multiple sclerosis revealed a large number of disease-specific variants while the shared component was predominantly within the triple overlap.These findings are consistent with the hypothesis that there is a common genetic basis for immune-linked diseases [38] with a core combination of genetic mutations [39].Disturbance of these hypothetical core immune processes might activate the breakdown in immune tolerance [38] necessary to trigger any of these diseases.The subsequent developmental trajectory of a given autoimmune disease may then be driven, in part, by disease-specific genetic factors.Determining and disentangling core and specific genetic factors for immune diseases might provide valuable insights into key immune pathways and cell types involved in disease mechanisms, with the potential for drug target development.
There is a growing interest in the role of the placenta in neurodevelopment and the onset of later psychiatric disorders [40,41] but evidence for a genetic link remains elusive [21].Our analysis of genetic overlap between placental weight, adult height and schizophrenia shows that placental weight shares a considerable fraction of its genetic underpinnings with height, while its genetic overlap with schizophrenia is modest and is also common with height.Sporadic genetic overlap is expected for polygenic phenotypes and might represent overlapping core regulatory and house-keeping genes involved in critical processes within multiple cell types.The observed pattern of genetic overlap may therefore indicate a predominance of non-specific genetic overlap between placental weight and schizophrenia.
The development of advanced methods to obtain deeper insight into the genetic overlap among multiple traits and disorders may contribute to improvements in disease nosology.Applying trivariate MiXeR to existing diagnostic categories, sub-phenotypic or symptom-level measures may inform the revision of current classification systems and provide novel insights into the nosological relationship between complex human disorders [42,43].Further, the triangulation of overlapping genetic patterns between .disorders, biological markers and instrumental variables may inform theories regarding the potential biological mechanisms underlying multimorbidity patterns.
Trivariate MiXeR has the same limitations as the univariate and bivariate MiXeR models [1,6].The underlying model is sensitive to LD structure estimation and the reliability of parameter estimates depends on the statistical power of the input GWAS summary statistics.The model makes several simplifying assumptions, including uniform distribution of phenotype-influencing variants across the genome and the effect size's independence from allele frequency, LD, and location in the genome.These assumptions may be violated to different degrees for different phenotypes, making the model less suitable for some phenotypes than for others.Analysis of phenotypes combining a handful of extremely strong genetic effects with a weak polygenic background (for example Alzheimer's disease [1]) may be sensitive to the selection of variants used for parameter fitting.To assess the stability of optimization convergence and robustness of obtained results we perform multiple independent runs using different subsets of variants and carefully assess the variation of parameter estimates across runs to evaluate the model's suitability for each set of phenotypes.
In conclusion, we have developed and validated the trivariate MiXeR model and demonstrated its utility in disentangling the pattern of genetic overlap between three phenotypes using GWAS summary statistics.We provide the tool implementing the model along with documentation and examples of how to use it.The trivariate MiXeR can help to provide a better understanding of the genetic relationship between complex polygenic phenotypes with particular relevance to multimorbidities.
Healthcare (GEHC).The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict-of-interest policies.Ole A. Andreassen has received speaker fees from Lundbeck, Janssen, Otsuka, and Sunovion and is a consultant to Cortechs.aiand Precision Health AS.

Figure 2 .
Figure 2. Real data.Genetic overlap between selected human traits and disorders estimated by trivariate MiXeR (column A) compared to naïve expectation following the principle of maximum entropy (column B).The pattern of genetic overlap between type 2 diabetes (T2D), estimated glomerular filtration rate (eGFR) and high-density lipoprotein (HDL) (Row 1), ulcerative colitis (UC), psoriasis (PS), and multiple sclerosis (MS) (Row 2), and placental weight (PW), height, and schizophrenia (SCZ) (Row 3) demonstrate the range of differences between trivariate MiXeR estimates and naïve expectation from bivariate MiXeR.For each triad of phenotypes (within each row), for every area of each diagram, its percentage with respect to the combined total area of all three phenotypes in the estimated diagram (column A) is shown (rounded to the closest integer), i.e., percentages within each diagram in column A add up to 100 and percentages within each row are directly comparable.Since percentages in column B are also given with respect to the combined total area of the corresponding diagram in column A, the sum of percentages in column B is not necessarily equal to 100.