## Abstract

Mendelian Randomisation (MR) is a statistical method that estimates causal effects between risk factors and common complex diseases using genetic instruments. Heritable confounders, pleiotropy and heterogeneous causal effects violate MR assumptions and can lead to biases. To tackle these, we propose an approach employing a PheWAS-based clustering of the MR instruments (PWC-MR). We apply this method to revisit the surprisingly large apparent causal effect of body mass index (BMI) on educational attainment (EDU): *α* = -0.19 [-0.22, -0.16].

As a first step of PWC-MR, we clustered 324 BMI-associated genetic instruments based on their association profile across 407 traits in the UK Biobank, which yielded six distinct groups. The subsequent cluster-specific MR revealed heterogeneous causal effect estimates on EDU. A cluster strongly enriched for traits related to socio-economic position yielded the largest BMI-on-EDU causal effect estimate (*α* = -0.49 [-0.56, -0.42]) whereas a cluster enriched for primary impact on body-mass had the smallest estimate (*α* = -0.09 [-0.13, - 0.05]). Several follow-up analyses confirmed these findings: (i) within-sibling MR results (*α* = -0.05 [-0.09, -0.01]); (ii) MR for childhood BMI on EDU (*α* = -0.03 [-0.06, -0.002]); (iii) step-wise multivariable MR (MVMR) (*α* = -0.06 [-0.09, -0.04]) where time spent watching television and past tobacco smoking (two proxies for potential confounders) were jointly modelled.

Through a detailed examination of the BMI-EDU causal relationship we demonstrated the utility of our PWC-MR approach in revealing distinct pleiotropic pathways and confounder mechanisms.

## 1 Introduction

Genome-wide association studies^{[1]} (GWASs) have identified many genetic variants associated with multiple complex phenotypes, aiding us in annotating single nucleotide polymorphisms (SNPs) and their functions, as well as identifying putative causal genes. As sample sizes of GWASs increase, more SNP associations are revealed which improve various downstream analyses such as polygenic score prediction, pathway- and tissue-enrichment, and causal inference^{[2,3]}.

Mendelian Randomisation^{[4,5]} (MR), an approach generally applied through the use of genetic variants/SNPs as instrumental variables (IVs) to infer the causal relationship between an exposure or a risk factor *X* and an outcome *Y* of interest, has become increasingly used thanks to well-powered GWASs from which hundreds of genetic associations with heritable exposures can be used as IVs.

MR has three major assumptions concerning the the genetic variant *G* used as an instrument: (1) Relevance – *G* is strongly associated with the exposure. (2) Exchangeability – there is no confounder of the *G*-outcome relationship. (3) Exclusion restriction – *G* affects the outcome only through the exposure. Each instrument provides a causal effect estimate, which can then be combined with others using an inverse variance-weighting^{[6]} (IVW) method to obtain an estimate of the total causal effect of the exposure on the outcome. This estimate is more reliable than observational associations due to it being more protected against unmeasured confounding and reverse causality, provided that the core conditions are met.

Thanks to well-powered GWASs we have also discovered that most genetic instruments are highly pleiotropic^{[7]}, i.e. associated to more than a single trait. This has also been shown in phenome-wide association studies (PheWASs), where associations between a SNP and a large number of phenotypes are tested. The situation when a genetic variant influences multiple traits, but there is a primarily associated trait and all other trait associations are fully mediated by the primary trait, is referred to as vertical pleiotropy. On the other hand, genetic variants that affect some traits through pathways other than the primary trait (the exposure) – a phenomena known as horizontal pleiotropy – are in direct violation of the exclusion restriction assumption and could lead to biased causal effect estimates. However, if the InSIDE assumption ^{[8]}(Instrument Strength is Independent of the Direct Effect on the outcome) holds and the direct SNP effects are on average null, then IVW will yield consistent causal effect estimates. There have been MR extensions to IVW such as MR-Egger to produce less biased causal effect estimates if the InSIDE assumption holds and direct effects are not null on average. Note that violation of the InSIDE assumption leads to correlated pleiotropy, which can severely bias causal effect estimates. Such phenomenon may emerge as a result of a heritable confounder of the exposure-outcome relationship and has been modelled in the past^{[9,10]}.

Well-powered GWAS may also provide confounded genetic associations through dynastic effects^{[3,11]}, assortative mating^{[12,13]}, and population stratification^{[14]}. These phenomena can introduce correlation between an instrument and confounding factors, such as parental/partner traits or genetic ancestry leading to a violation of the exchangeability assumption and biased causal effect estimates. This type of confounding can be resolved when using family-based study designs^{[15,16]} such as sibling-pair studies. Since genetic differences between sibling pairs are due to independent and random meiotic events, these effects are unaffected by population stratification and other potential confounders influencing the phenotype. This and other family-based designs have been used to obtain unbiased heritability estimates, validate GWAS results and test for unbiased causal effect estimates using MR^{[17,18]}.

Another factor that can lead to complications in MR studies is the presence of heterogeneous causal effects emerging due to distinct biological mechanisms: various subtypes of the exposure (e.g. subcutaneous vs visceral adiposity) or different biological pathways through which the exposure impacts the outcome (e.g. interaction between the exposure and other factors). To date, confounding of genetic associations, horizontal pleiotropy and heterogeneous causal effect have been largely treated as distinct mechanisms in MR modelling. However, what they have in common is that they can lead to variable causal effects estimated depending on the group of IVs used in the MR.

To address this, we introduce in this paper our approach of PheWAS-driven clustering of instrumental variables (PWC-MR) and test the resulting clusters for distinct pathways or mechanisms that could underlie the overall causal effect of the exposure. Throughout the paper, we demonstrate the approach through the example of estimating the causal effect of body mass index (BMI) on educational attainment (EDU). This relationship has been analysed extensively in the past and family studies have shown that an apparent strong effect of higher BMI on lower educational attainment is shrunk to near zero when using family studies^{[17]}. One explanation is that offspring BMI is influenced by parental alleles associated with parental (rearing) behaviour, which in turn modify the environment of the offspring. Such parental traits act as a confounder of the offspring genotype-EDU relationship, hence violate the exchangeability assumption of MR. Moreover, they confound the BMI-EDU association in the tested sample, violating the exclusion-restriction assumption and inducing correlated pleiotropy (see Figure 1**a**). Thus, it is plausible that some of the detected IV clusters arise through parental genetic confounding which may manifest statistically as horizontal pleiotropy. To test this, we ran a systematic confounder search and probed the causal effect of the exposure conditional on candidate confounder traits.

## 2 Methods

### 2.1 Instrumental variable selection and PheWAS

As our primary analysis, we aimed to investigate the potential pleiotropy-patterns emerging from the grouping of IVs that are strongly associated with an exposure of interest, as outlined in Figure 1**b**. With BMI selected as the exposure trait, we obtained IVs from the Neale group’s UK Biobank GWAS analysis^{[19]} (data sources can be found in Supplementary Table 1) by filtering for genome-wide significant SNPs (i.e. association p-value less than 5 *×*10^{−8}) followed by linkage disequilibrium (LD)-based clumping using the TwoSampleMR R package^{[20]} with the following parameters: *clump kb* = 10, 000, *clump r*2 = 0.001, *pop* = “*EUR*” to obtain independent IVs.

This left us with 348 BMI-associated IVs, for which we ran PheWASs with 1, 480 traits from the Neale group UK Biobank GWAS analysis^{[19]}. We extracted for each trait and for each SNP the association effect and the corresponding standard error, creating a data matrix of 348 SNPs by 1, 480 traits. For the 1, 480 traits, we also extracted details such as variable type, origin and complete sample size, among others.

### 2.2. Quality control

We removed traits from the PheWAS data matrix that had missing association effects as well as duplicates (keeping the most recent version). Furthermore, we filtered out traits for which the effective sample size was less than 50, 000 due to their low power of association, leaving us with 424 traits.

Using genetic correlation data from the Neale group^{[19]}, we further removed traits that had a high genetic correlation with BMI, i.e. the exposure, (*r*^{g} *>* 0.75), to avoid obvious repetitions of traits closely related to it. The resulting association effect data matrix of 348 SNPs and 407 traits was then standardised (SNP effects are on a SD/SD scale) and used for further analysis. Note that for simplicity, effect sizes for binary traits were treated as those of continuous traits.

In order to test for invalid IVs, we performed a trait-wide variant of Steiger-filtering^{[21]}. Specifically, for each SNP, we tested if any of the traits had a significantly stronger (in terms of explained variance) association compared to that of the exposure. The significance threshold for this one-sided t-test was corrected for using the total number of traits remaining (p-value *<* 0.05*/*407). This revealed 24 SNPs more strongly associated to traits other than BMI (such as ‘Whole body water mass’, ‘Basal metabolic rate’ and ‘Sitting height’) that were then removed from further analysis.

### 2.3 K-means clustering and trait identification

With the aim of discovering distinct meaningful groups of SNPs among the 324 IVs, we proceeded with the clustering of the SNPxTrait association effect matrix using the K-means algorithm^{[22]}. Taking the absolute standardised effects matrix, we normalised the data frame with respect to the SNPs such that the variance of the SNP effects across all the traits equalled 1. We used the absolute effects to cluster, in order to ensure that negatively correlated traits were considered similar by the Euclidean distance based similarity measure of the k-means clustering. We then compared the performance of the clustering with different number of clusters ranging from two to 50, by measuring the Akaike Information Criterion (AIC). After finding the number of clusters with the lowest AIC score (six clusters), we proceeded with the assignment of each SNP to one of the six clusters.

In order to identify traits that were particularly associated to SNPs in each of the six clusters, we computed an enrichment ratio (ER) in this way:

For each trait *t*, we calculated the per-SNP average squared effect in a given cluster *j*, denoted as . Given that SNP *i* belongs to cluster was calculated as follows:
where *c*_{j} represents the set of SNPs in cluster *j* and |*c*_{j}| its cardinality. We then normalised these per-SNP average squared effects for each cluster (*k* total clusters) across all traits to obtain the enrichment ratio (ER), *R*_{j,t}:
where *K* is the total number of clusters. For each cluster (*j*), traits were then prioritised by the (highest) value of ER (*R*_{j,t}).

#### 2.2.1 Causal effect estimate per cluster

We measured the cluster-specific IVW causal effect estimate on the outcome (EDU) using the standardised SNP effects in each cluster, and then compared these estimates to the causal effect estimate using all SNPs. We used the TwoSampleMR R package^{[20]} for this analysis, and although we use two-sample MR techniques despite having a close to complete sample overlap, this does not lead to substantial biases^{[23]}. Measures of IV heterogeneity are calculated using the Cochran’s Q-statistic^{[24]} for the IVW method for each cluster. Furthermore, average clusterheterogeneity (per-IV variance) is also calculated for each cluster from the above-mentioned parameter.

As sensitivity analyses, PWC-MR was repeated twice, once with a different exposure trait (replacing BMI with childhood BMI), and another with a different outcome trait (replacing EDU with systolic blood pressure).

### 2.3 Systematic confounder search

In order to decide which of the emerging clusters represent genetic confounding or true biological heterogeneity, we systematically searched for BMI-EDU confounders. To do this, we investigated the bi-directional causal effects that each trait had on both the exposure and the chosen outcome. Firstly, an extra filtering step was done where traits that were highly genetically correlated with the outcome (*r*^{g} *>* 0.75) were removed from the total 407 traits of the previous analysis.

Then we ran a bidirectional MR for the remaining uncorrelated traits using the TwoSampleMR R package^{[20]}, and obtained four sets of causal effect measurements per trait (bidirectional, two different outcome traits - BMI and EDU). To select bidirectional causal effect estimates from those calculated by the different methods in the TwoSampleMR package^{[20]} (Weighted median, Inverse variance weighted, Simple mode, and Weighted mode), we ordered the p-values of the causal effect estimates for the different methods and selected the estimate of the second most significant method to ensure that at least one other method supports the causal claim.

The next step was to identify the direction of causality. To do so, we performed a one-sided t-test on the estimated causal effect between the trait and the exposure, BMI. If the t-test association p-value was *<* 0.05, then the trait had a (nominally) significantly larger effect on the exposure, and if the p-value was *≥* 0.95, then the exposure had a (nominally) significantly larger effect on the trait. For all the p-values in between, it was challenging to assign a direction in which the causal effect was stronger, and thus these traits were not further categorised. The p-value thresholds we apply are not intended to suggest that there is a transition point at which the meaning of associations change. Rather we use these as a heuristic that is required to control type I error rate at an arbitrary (5%) threshold.

The same was done to explore the relationship between the traits and the outcome trait (EDU). This allowed us to classify the traits into candidate confounders, mediators, colliders and other categories (as seen in the middle panel of Supplementary Figure S1). For example, a confounder was defined as a trait with a significantly larger effect on both exposure and outcome than the reverse. We then focused only on the confounders which can distort MR estimates and filtered them further to make sure that they have at least a nominally significant MR estimate (p-value

*<* 0.05) on both BMI and EDU. We were lenient in our categorisation of candidate traits as adding potentially irrelevant traits would not bias the multivariable causal effect of BMI in the next step. Mediators and colliders were not considered further since their inclusion into an MVMR does not alter the exposure’s causal effect. The same holds for traits with a direct effect on either the exposure or the outcome only.

Furthermore, to test how compatible the two lines of analysis were, we examined the clusterspecific enrichment ratio values for the set of candidate confounder traits we obtained.

#### 2.3.1 Multivariable MR

Focusing on the candidate confounder traits resulting from the systematic search that could bias the causal effect estimate between the exposure-outcome pair, we first ran a stepwise multivariable MR (MVMR) (adapted from the bGWAS R package^{[25]}) with them as exposures to test their effect on our chosen outcome, EDU.

To do this, we created a Z-score matrix combining genome-wide significant (p-value less than 5 *×* 10^{−}8) and independent (LD-clumped *clump kb* = 5, 000, *clump r*2 = 0.01) SNPs and their Z-scores for the candidate confounder traits, given that each SNP had an effect that is genomewide significant for at least one of the candidate traits. We then removed any trait that had less than three instruments, leaving us with a Z-score matrix of 274 SNPs and 5 traits. Using this Z-score matrix as input for step-wise MVMR, we obtained a final list of traits with multivariable causal effects with a p-value *<* 0.05*/*5 on our chosen outcome. To ensure the strength of the instruments used for running MVMR, we calculated the conditional F-statistic for our main exposure (BMI) given each of the surviving traits and their different combinations. We then ran standard MVMR using the combination of traits with a conditional F-statistic of BMI *>* 8, and BMI as exposures to estimate the conditional causal estimates on EDU.

We were more lenient with the conditional F-statistic threshold (typically 10)^{[26]} to ensure that epidemiologically relevant traits are included in the MVMR.

### 2.4 Relation to other approaches

#### 2.4.1 Comparison against MR-Clust

We compared the k-means clustering of BMI IVs against another IV clustering method called MR-Clust^{[27]}, which requires as input the unstandardised SNP effects on both the exposure and the outcome, as well as the standard error of the SNP on each. To do so, we performed a Fisher’s exact test to examine the frequency distribution of SNPs in each of the k-means clusters against the MR-Clust clusters.

#### 2.4.2 Colocalisation analysis

To further interpret the findings of the IV clustering, we sought to test if specific patterns of colocalisation in different tissue types appear for the different IV clusters.

To do this, we reran the steps detailed in Leyden et al.^{[28]} for the 324 BMI IVs used in this work. For each IV, we tested for genetic colocalisation between the BMI GWAS data and the gene expression (eQTL) data of both subcutaneous adipose and brain tissue (data sources can be found in Supplementary Table 1). For each SNP tested, we took a margin of 200kb up- and downstream, and used the coloc R package^{[29]} to test the SNP’s colocalisation with each gene found in that region, once using brain eQTL data, and another colocalisation using adipose eQTL data. We declared colocalisation if the posterior probability of the model sharing a single causal variant was larger than 80%. For each of the aforementioned clusters, we investigated if the IVs were more strongly enriched for or depleted in one tissue or the other using Fisher’s exact test.

## 3. Results

### 3.1 Overview of the method

We applied the PWC-MR approach to investigate potential horizontal pleiotropic effects (emerging due to heritable confounders, dynastic effects, genetic subtypes of obesity and other pleiotropic mechanisms, see Figure 1a) of BMI on educational attainment. The analysis focused on grouping the IVs of the exposure by running a PheWAS-based clustering to reveal distinct mechanisms or pathways underlying their overall effect on the exposure (Figure 1b). This was done by obtaining the standardised PheWAS association of the BMI IVs across a filtered set of 408 traits, and running a k-means clustering on the resulting matrix. This resulted in six clusters of IVs for BMI, which were then annotated by traits based on the association of the cluster-member SNPs with each trait. Specifically, for each cluster-trait pair we computed the average explained variance of the trait by the SNPs of the given cluster. This yielded for each cluster-trait pair an enrichment ratio (ratio of the average explained variances) and we chose the top ten traits with the highest enrichment ratio for each cluster. Furthermore, the causal effect of each cluster’s IVs on education was calculated and compared against each other and that of the causal effect obtained using all BMI IVs.

To complement our findings from the clustering-based analysis, we explored (i) the BMI-EDU causal relationship using sib-regression SNP effect sizes^{[18]}, (ii) the childhood BMI-EDU causal relationship, (iii) replacing the outcome trait with systolic blood pressure (SBP), and finally (iv) the potential role of each of the filtered set of traits as a confounder of the BMI-EDU relationship.

We implemented the latter one by systematically running bidirectional MR between each of the traits and either BMI or EDU as outcome, then classifying the traits depending on their bidirectional associations with both BMI and EDU. The resulting set of candidate confounder traits was further analysed for its potential to bias the causal effect of BMI on EDU. To assess this, we ran stepwise MVMR and finally calculated the causal effect of BMI on EDU conditional on the surviving set of candidate confounder traits of the BMI-EDU relationship.

To further understand the emerging clusters, we sought to uncover tissue-specific mechanisms. To do this, we performed a colocalisation analysis of the BMI and gene expression association signals at each locus around (*±*400kb) the 324 BMI IVs. For the gene expression association we used eQTL data from both adipose and brain tissue. This yielded a proportion of brain-vsadipose colocalised IVs for each cluster.

### 3.2 PheWAS-based K-means clustering and trait identification

After identifying 324 genome-wide significant SNPs as IVs for BMI, and selecting 407 filtered traits to run PheWAS on, we obtained a standardised effect matrix of the 324 IVs on the 407 traits. Normalising the matrix by IVs and running K-means clustering on it revealed that six clusters yielded the lowest AIC score (Supplementary Figure S2) when compared to varying the number of clusters from two to 50. The number of SNPs in each of the six clusters were: 32, 98, 35, 41, 69, 49 respectively (Supplementary Table 2).

Next, we computed an enrichment ratio (ER) to identify with which traits the SNPs in each cluster were strongly associated. The overall ER value between clusters was roughly centred around 1, however clusters #2, #3, #4, and #6 had some large ER values (see Supplementary Figure S3). Visualising the top 10 enriched traits in each cluster and their ER values in Figure 2 and Supplementary Table 3, we see that cluster #2 is strongly enriched for lean mass traits such as ‘Trunk fat-free mass’ and ‘Whole body fat-free mass’.

Similarly, cluster #3 seemed to mostly be enriched for blood- and body stature-related traits such as ‘Platelet count’ and ‘Standing height’, while cluster #4 was enriched for traits related to socio-economic position (SEP) such as ‘Job involves heavy manual or physical work’, ‘Time spent outdoors in summer’, and ‘Fluid intelligence score’. Lastly cluster #6 was enriched for food supplements/nutrients such as ‘Folate’ and ‘Potassium’.

#### 3.2.1 Causal effect estimate per cluster

To test whether the clusters had different causal effects on a selected outcome than the overall causal effect (using all IVs), we computed the IVW causal effect estimate of each cluster on education using cluster-specific IVs. As seen in Figure 3a and Supplementary Table 4, the causal effect estimates between the different clusters are significantly heterogeneous (Q-test value = 130.61, p-value *<* 10^{−300}). Clusters #2 and #5 had the smallest causal effect estimates of *−*0.09 (p-value = 1.23 *×* 10^{−}5) and *−*0.12 (p-value = 5.22 *×* 10^{−}6) respectively, where cluster #2 was enriched for lean-mass traits. These estimates are consistent with those obtained from within-family studies, which are relatively immune to confounding. By contrast, clusters #1 and #4 had the largest negative causal effect estimates of -0.44 (p-value = 7.78 *×* 10^{−}20) and -0.49 (p-value = 1.63 *×* 10^{−}44) respectively, where cluster #4 was strongly enriched for SEP-related traits.

All the clusters were less heterogeneous than the group of all the IVs combined (see ‘Avg het’ in Supplementary Table 4).

### 3.3 Post hoc analyses

To test the robustness of the PWC-MR results, we performed four additional analyses. First, we analysed the same exposure and outcome, but using sib-regression-based SNP effect sizes instead of SNP effects from GWAS of unrelated samples. Second, we replaced the exposure with childhood BMI and estimated its causal effect on EDU. Third, we replaced the outcome, EDU, with SBP. Finally, we executed a systematic search for confounders to include in a multivariable MR analysis.

#### 3.3.1 Sib-regression MR

In Howe et al.(2022)^{[18]}, within-sibship (within-family) meta-analysed GWAS estimates were generated from 178,086 siblings across 19 cohorts. Using these effect estimates, MR was performed with BMI as exposure on multiple traits, including educational attainment. They used 418 independent and genome-wide significant genetic variants for BMI, and estimated its effect on EDU using IVW to be -0.05 (95% CI: *−*0.09, *−*0.01).

They also used jackknife to estimate the standard error of the difference between the sibregression MR estimate and that of the GWAS of unrelated samples MR estimate, -0.19 (95% CI: *−*0.22, *−*0.16). Using the difference Z-score to generate a p-value for heterogeneity between the two estimates revealed a p-value *<* 0.001.

#### 3.3.2 Causal effect of childhood BMI on Educational attainment

We used the UK Biobank trait ‘Comparative body size at age 10’ as a proxy for childhood BMI – a measure that has been validated against measured BMI in childhood^{[30,31]} – for the exposure trait. Childhood BMI is expected to be less confounded by (parental) SEP compared to adult BMI and hence we expect to see a less biased causal effect on EDU. For this trait, we had 171 genome-wide significant SNPs that we used as IVs for the analysis. Of these, 16 SNPs were more strongly associated to traits other than childhood BMI and were thus excluded from further analysis. The standardised effect matrix of the remaining 155 SNPs across 461 traits was clustered into four clusters (yielding optimal AIC), each containing 37, 42, 32, 44 IVs respectively (Supplementary Figure S4, Supplementary Table 6).

Analysing the trait enrichment for each cluster revealed only two clusters with high ER values: clusters #2 and #4 (Supplementary Figure S5, Supplementary Table 7). Cluster #2 had only two traits with ERs greater than 2, which were ‘Number of fluid intelligence questions attempted within time limit’ and ‘Fluid intelligence score’, whereas cluster #4 was highly enriched for bodymeasurement/fat-mass traits such as ‘Waist circumference’ and ‘Whole body fat mass’ (see Supplementary Figure S6). However, calculating the IVW causal effect estimate for each cluster and comparing it to the estimate calculated using all IVs revealed homogeneous causal effect estimates with a Q-statistic of 3.84 (p-value of 0.43) as seen in Figure 3**b** and Supplementary Table 8. Cluster #2 had a causal effect estimate of *−*0.09 (95% CI:-0.1638, -0.0148), and cluster #4 had a causal effect estimate of *−*0.04 (95% CI:-0.0823, -0.0024). Noteworthy is the finding that the IVs of cluster #2 were more heterogeneous than all the IVs combined. Thus, we obtained a massively attenuated causal effect of BMI on EDU, when childhood BMI is used as an exposure. Reassuringly, no SEP-enriched cluster emerged and the cluster specific causal effects were homogeneous.

#### 3.3.3 Causal effect of BMI on SBP

To find further evidence that our approach does not always reveal distinct causal effects when the causal effect is non-null, we replaced EDU with SBP as outcome. Namely, we tested a well-established non-null causal relationship that is hypothesised to not be biased by pleiotropy or confounding: BMI’s effect on SBP. Using the same six clusters previously obtained for BMI, we calculated the estimated causal effect of each of the clusters compared to using all the IVs combined on SBP. This revealed a homogeneous set of causal effect estimates (Q-test value of 4.49, p-value = 0.61), with the IVW estimate using all IVs being 0.15 (p-value = 1.09 *×* 10^{−}28) as seen in Figure 3c and Supplementary Table 5.

#### 3.3.4 Systematic confounder search and MVMR analysis

Given our suspicion that the large BMI-EDU causal effect is driven by heritable confounders, we performed a systematic search to reveal traits that may be potential confounders. As described in the Methods section, the strength of the bidirectional effect of the traits on either the exposure or the outcome determined their categorisation. This led to the identification of 19 traits that were found to be candidate confounder traits (Supplementary Table 9). Matching the 19 confounder traits from this analysis to their respective ER across the six clusters from the previous analysis revealed higher ERs in cluster #1 and cluster #4 (see Supplementary Figure S7), which was associated with SEP-related traits. Noteworthy is that the traits labelled as candidate confounders were predominantly environmental exposures, such as ‘Exposure to tobacco smoke outside home’ and ‘Transport type for commuting to job workplace: Cycle’.

Furthermore, these candidate confounder traits are attributed as *candidate* or *potential* confounders since they are most likely only genetic correlates of the true confounding traits of the BMI-EDU relationship and not act as true confounders themselves.

To investigate the possible biasing effect that potential confounder traits can have on the causal relationship of BMI on EDU, we ran a stepwise MVMR on these 19 candidate confounder traits (Supplementary Table 9). During the creation of the Z-score matrix of SNPs and traits, only five traits had at least three genome-wide significant and independent SNPs whose effects could be used in the analysis. These five traits were: ‘Time spent watching television (TV)’, ‘Usual walking pace’, ‘Past tobacco smoking’, ‘Frequency of tiredness / lethargy in last 2 weeks’, and ‘Average weekly beer plus cider intake’. Of these, only the first three had a significant causal effect estimate on EDU (p-value *<* 0.05*/*5) as estimated by stepwise MVMR, and were subsequently used as exposures alongside BMI in a standard MVMR analysis.

To ensure the strength of the IVs used in the MVMR analysis, we calculated the conditional F-statistic and the MVMR causal effect estimate of BMI given various combinations of the 3 remaining candidate confounder traits. We saw the expected trend of a decreasing conditional F-statistic with the addition of traits and their IVs to the analysis (see Supplementary Figure S8). We note that the causal effect estimate of BMI on EDU decreases when any combination of the candidate confounder traits is used with BMI as exposure in comparison to the univariable MR causal effect estimate of BMI on EDU (*−*0.19, p-value = 7.11 *×* 10^{−}41). We settled on the combination of candidate confounder traits yielding a conditional F-statistic for BMI *>* 8,for which the corresponding causal effect estimates are reported in Table 1 below. This choice was a compromise between two sources of biases: weak instrument bias *vs* upward bias due to omitting relevant confounders.

### 3.4 Relationship with other approaches

#### 3.4.1 Comparison against MR-Clust

Other known IV clustering methods include MR-Clust^{[27]}, which attempts to cluster variants with similar causal effect estimates together following the hypothesis that exposures can affect an outcome by distinct causal mechanisms to varying extents. MR-Clust also accounts for the possibility of spurious clusters by assigning IVs with uncertain causal effect estimates to ‘null’ or ‘junk’ clusters.

We compared the k-means clustering of BMI IVs against that of MR-Clust with EDU as the outcome. The results revealed two main clusters as well as a ‘null’ cluster. Cluster #1 had 35 SNPs, 13 of which had an inclusion probability greater than 80%. Cluster #2 had 171 SNPs, 36 of which had an inclusion probability greater than 80%, and the remaining 142 SNPs were categorised into the ‘null’ cluster as seen in Supplementary Figure S9. The mean causal effect estimate of SNPs in cluster #1 was *−*0.496, whereas it was *−*0.246 for cluster #2. Searching for trait associations for the SNPs in each of the clusters revealed that body measurement traits like ‘Arm fat mass’ or ‘Body fat percentage’ are associated to SNPs in both clusters, while SEP-related traits such as ‘Fluid intelligence score’ or ‘Time spent watching television’ were associated to more SNPs in cluster #1 than in cluster #2.

Comparing the SNP clustering between the k-means method against that of MR-Clust in Table 2 below, we see that cluster #1 in MR-Clust, which seems to be more strongly enriched for SEP traits than cluster #2, has SNPs that were similarly clustered in clusters #1 and #4 using k-means, matching their large negative causal effect of BMI on EDU. However, the same distinct comparison cannot be made for SNPs in cluster #2 of MR-Clust.

Of the 12 Fisher’s exact tests performed to examine the contingency of SNPs in the two separate sets of clusters, four tests revealed a significant association: SNPs in cluster #1 of MR-Clust were significantly associated with SNPs in clusters #1, #2 (lean-mass traits), #4 (SEP-related traits) and #5 of the K-means clustering.

#### 3.4.2 Colocalisation analysis

With the aim of finding supporting evidence for the k-means clustering and enrichment analysis, we ran a genetic colocalisation analysis on BMI IVs and two types of tissue: subcutaneous adipose and brain, the results of which can be found in Supplementary Tables 10 and 11 respectively.

Running a set of Fisher’s tests to compute the overlap between the membership of the SNPs in the six clusters and their tissue of colocalization did not reveal any association between clusters and tissues.

## 4 Discussion

We have developed a method that performs informative clustering of IVs by utilising their association with a large number of traits. Our use of PheWAS data to guide the clustering of IVs has revealed distinct mechanisms by which exposure effects could act on outcomes. For our exposure, BMI, six distinct clusters were obtained through optimal K-means clustering. These clusters had well-defined trait enrichments, with clusters matching SEP-related, substrate, and body measurement traits. Estimating individual causal effects of each cluster on EDU as an outcome revealed heterogeneous causal effect estimates which allowed us to further strengthen our suspicion that the MR estimate for the causal effect of BMI on EDU is upward biased when using population-based SNP effect size estimates due to confounding.

We note from MR analysis run using within-sibling GWAS data^{[18]} that the causal effect estimate between BMI and EDU is *−*0.05 (95% CI: *−*0.09, *−*0.01), which is smaller than the causal effect estimate seen using population based GWAS data (*−*0.19, 95% CI: *−*0.22, *−*0.16). Investigating the various mechanisms or pathways through which BMI could have a causal effect estimate on EDU through trait-enrichment analysis has revealed notable causal effect estimates from two clusters: one with a strongly negative MR estimate whose trait enrichment reflects shared mechanisms with socio-economic factors, and another cluster with close to zero causal effect estimate enriched for lean-mass traits. MR has typically presented bias due to heterogeneous causal effects emerging via distinct pathways and bias due to confounding of the instrumentoutcome association as being separate mechanisms. Here, we have illustrated that a pheWASbased clustering approach can classify instruments into clusters, some of which correspond to different pathways, while others include IVs that are primarily confounder-associated. Our results have two major implications: 1) The lean-mass-related IV cluster indicated a close to zero causal effect of BMI on EDU. 2) We revealed that the SEP-related IVs suggest a sizeable negative effect of BMI on EDU.

In order to substantiate our findings, we performed several follow-up analyses. First, sibregression based MR of BMI on EDU recapitulated the close-to-zero causal effect obtained for the body-mass specific cluster of IVs. This indicates that many IVs for adult BMI (from population-based GWAS) represent indirect (parental/dynastic) effects, which are associated rather with a rearing-related parental trait and not primarily with offspring BMI. Second, replacing adult BMI with childhood BMI (much less associated with SEP) as exposure in the PWC-MR analysis confirmed a negligible causal effect estimate (*−*0.03, p-value = 0.04), and the four emerging clusters showed homogeneous causal effect estimates indicating the lack of confounding or biasing effects. This comparison was supported by the growing evidence showing that genetic variants have varying effects on BMI or body size at different stages of life^{[32,33]}, and that the UK Biobank proxy trait ‘Comparative body size at age 10’ captures childhood BMI well^{[30]}. One of the four clusters was strongly enriched for body-measurement/fat-mass traits whereas the second most strongly enriched cluster had only two mildly enriched SEP-related traits. This finding means that as opposed to adult BMI, childhood BMI genetics are unrelated to childhood (i.e. parental) SEP. It is also interesting to note that although fat-mass traits are strongly enriched for using childhood BMI IVs alongside lean-mass traits, these same traits are less enriched for using adult BMI IVs. Thus, IVs associated with body-mass related traits seem to be underlying the true nominally significant and minuscule causal effect between BMI and EDU. Furthermore, out of the 41 adult BMI IVs that make up cluster #4 (SEP-related traits), only three were found to be in LD with childhood BMI IVs in cluster #2 (enriched for two SEP-related traits).

In Howe et al. (2022), assortative mating, dynastic effects and population stratification were all considered candidate mechanisms for biased population-based GWAS effect estimates. Given our observations, a possible explanation is a dynastic effect of parental SEP traits acting as a confounder on both adult EDU and adult BMI (as seen in Figure 1a). This effect is direct on adult EDU but could affect adult BMI indirectly through either of two ways or both to a certain extent: (i) Parental SEP has a direct effect on the offspring’s SEP as an adult, which in turn has an effect on offspring adult BMI^{[34]}, or (ii) parental SEP – as an indicator of childhood social circumstances – may have an effect through this on the offspring’s (adult) BMI.

To explore the relevance of the obtained six clusters of IVs, we replaced EDU with SBP as the outcome of interest since within-sibling GWAS MR results showed no difference when compared to population GWAS MR results, indicating that there seems to be no bias in the causal effect estimate due to pleiotropy or confounding. Our analysis revealed that for the six clusters attributed to BMI, their causal effect estimate on SBP was homogeneous with the estimate using all SNPs (0.16, p-value = 1.09 *×* 10^{−}28). As there is no significant heterogeneous effects and all the cluster causal effects agree, we can conclude that there is no other confounding effects biasing the causal effect estimate. It is reassuring to note that our PWC-MR approach does not always seek to identify distinct causal effects, confirming that confounding mechanisms are specific to certain exposure-outcome pairs.

Finally, our systematic confounder search coupled with stepwise MVMR has pinpointed TV watching and smoking as two candidate confounder traits (maybe acting as a correlate of parental SEP) that may bias standard MR analysis of the BMI-EDU relationship: upon accounting for these two traits, BMI exhibits strongly attenuated causal effect on EDU.

Comparing our method to other IV clustering methods such as MR-Clust does not reveal strong concordance in the findings. MR-Clust takes as input the exposure and outcome effects as well as their standard errors and attempts to cluster the exposure IVs based on the possible similarity between each IV’s causal effect. When using BMI and EDU as exposure and outcome respectively, MR-Clust revealed two main clusters alongside a null cluster. Both of the clusters were enriched for a variety of traits including body-measurement traits, both leanand fat-mass, as well as SEP-related traits. The causal effect estimates of both clusters were strongly negative, similar to using all IVs in an MR analysis for this trait pair.

The most apparent difference between the clustering of our method and that of MR-Clust is our use of external information for the exposure and our clustering of IVs independently of the outcome or the individual MR causal effects of the IVs. By clustering based on the PheWAS data of the exposure IVs and various other traits, we can reveal possible pathways and mechanisms through which the exposure manifests, independently of any outcome.

Another comparable clustering method by Grant et al.^{[35]} uses genetic variant associations with a set of traits to identify groups of IVs with similar biological mechanisms. Their method, NAvMix, uses a directional clustering algorithm and includes a noise-cluster to increase robustness to outliers. NAvMIX is demonstrated on BMI IVs and their associations to nine lifestyle or cardio-metabolic traits that have been previously shown to be related to BMI. Their results revealed 5 distinct clusters where they were able to identify a metabolically healthy obesity cluster that also had a small MR causal effect on coronary heart disease (CHD). However, we were unable to run their method using our data due to convergence issues arising when the number of traits used for PheWAS association increases. This comparison also highlights that the traits we include in the pheWAS analysis (and the subsequent clustering) have an important role in which biological mechanisms we can detect. For example, our analysis did not pick up the metabolically healthy obesity cluster, potentially because waist-to-hip ratio and other subcutaneous-vs-visceral fat proxy-traits were not included among the 408 selected phenotypes due to our filtering on genetic correlation with BMI (*r*^{g} *<* 0.75). Without such filtering, PWC-MR reveals 5 clusters with significantly heterogeneous causal effects on EDU. These five clusters are very similar to the original six, with the original cluster #1 getting diffused into the other clusters. Reassuringly, the cluster that is strongly enriched for SEP-related traits has a large negative causal effect estimate of -0.53 (95% CI: *−*0.59, *−*0.48), whereas the cluster that is most enriched for body-measurement/fat-mass traits still had an attenuated causal effect of -0.10 (95% CI: *−*0.14, *−*0.06).

Furthermore, we attempted to consolidate our findings of the k-means clustering and enrichment analysis by running a genetic colocalisation analysis on the 324 clustered BMI IVs and both subcutaneous adipose and brain tissue. Unfortunately, we do no find an association between the cluster memberships of the IVs and their signal colocalization in brain or adipose tissue, possibly due to the limited evidence of colocalization in either tissue for most loci.

Our method has its own set of limitations: first, we are limited by the availability of traits with PheWAS data to support our informative clustering of IVs. This may lead to a failure in identifying key pathways and thus missing clusters representing important subgroup (mediator/subphenotype/confounder). Second, although it is not the most ideal handling of data, our binary traits are treated as continuous ones in our analysis. In large samples, linear and logistic regression effect estimates correlate very strongly and hence, it is likely that this choice did not impact the clustering^{[36]}. Third, although we have attempted to minimise the arbitrary choice of parameters in our analysis, the genetic correlation threshold that determines which traits are too similar to the exposure and outcome trait is arbitrarily set at 0.75 for BMI and EDU and could be modified, but the emerging clusters may change as a consequence. Similarly, some p-value thresholds and type I error rate control was set at 5%, which may be viewed as arbitrary. Fourth, the identified potential confounder traits used in the MVMR analysis act as simple proxies for true confounders. For example, exposure to current tobacco smoking or TV watching can be highly (genetically) correlated to the same or a similar exposure during early life (or even proxy a parental trait), hence it is rather the earlier version of the exposure which is likely to be the true confounder. Our proxy confounders were simply nuisance variables, their only role was to see the remaining causal effect of BMI on EDU upon conditioning on them. Lastly, we acknowledge that there are several other tests^{[37]} that could be used in place of a t-test when excluding SNPs more strongly associated to other traits than our exposure or different MR methods used in our systematic confounder search, however both of these were simple exclusion or pre-selection steps that have very little impact on the outcome of the results.

## Data Availability

All data produced are available online at: - http://www.nealelab.is/uk-biobank - https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium - https://yanglab.westlake.edu.cn/software/smr/#DataResource - https://www.sciencedirect.com/science/article/pii/S0002929721004687

## Author contributions

L.D. and Z.K. conceived and designed the project. Z.K. supervised all statistical analyses. L.D. implemented the research and performed the analyses. L.D. and Z.K. prepared the first draft of the manuscript. L.D., Z.K., G.H. and G.D.S. contributed to the review and editing of the manuscript.

## Competing interests

The authors declare no competing interests1

## Code availability

The source code for this work can be found on https://github.com/LizaDarrous/PheWAS-cluster.

## Supplementary Information

## Acknowledgements

We are grateful for the useful discussions on MVMR with Eleanor Sanderson. This research has been conducted using the UK Biobank Resource under Application Number 16389. Z.K. was funded by the Swiss National Science Foundation (310030 189147 and 32003B 173092). GH and GDS work within the MRC Integrative Epidemiology Unit at the University of Bristol (MC UU 00011/1). Payments were made to the institution. For computations, we used the CHUV HPC cluster.