Genome-wide association and Mendelian randomization analysis prioritizes bioactive metabolites with putative causal effects on common diseases

Bioactive metabolites are central to numerous pathways and disease pathophysiology, yet many bioactive metabolites are still uncharacterized. Here, we quantified bioactive metabolites using untargeted LC-MS plasma metabolomics in two large cohorts (combined N{approx}9,300) and utilized genome-wide association analysis and Mendelian randomization to uncover genetic loci with roles in bioactive metabolism and prioritize metabolite features for more in-depth characterization. We identified 118 loci associated with levels of 2,319 distinct metabolite features which replicated across cohorts and reached study-wide significance in meta-analysis. Of these loci, 39 were previously not known to be associated with blood metabolites. Loci harboring SLCO1B1 and UGT1A were highly pleiotropic, accounting for >40% of all associations. Two-sample Mendelian randomization found 46 causal effects of 31 metabolite features on at least one of five common diseases. Of these, 15, including leukotriene D4, had protective effects on both coronary heart disease and primary sclerosing cholangitis. We further assessed the association between baseline metabolite features and incident coronary heart disease using 16 years of follow-up health records. This study characterizes the genetic landscape of bioactive metabolite features and their putative causal effects on disease.

rheumatoid arthritis. Amongst the 31 metabolite feature levels, there were a wide range of 2 5 4 correlations with most exhibiting weak positive correlations as well as ~5 clusters of features 2 5 5 whose levels were in moderate-high correlation. 2 5 6 2 5 7 Of the 15 metabolite features with putative causal effects on both PSC and CHD, all were 2 5 8 inverse associations: lifetime exposure to low metabolite feature levels increased risk of CHD 2 5 9 and PSC, with the latter exhibiting a somewhat greater effect size (Figure 2A). Lifetime 2 6 0 exposure to elevated leukotriene D4, a basophil-secreted metabolite known to be involved in 2 6 1 the induction of smooth muscle contraction, vasoconstriction and vascular permeability 46 , 2 6 2 was predicted to reduce risk of both PSC and CHD ( Figure 2B). While evidence of a 2 6 3 pleiotropic effect (MR-Egger intercept=0.07, p=0.02) for PSC was observed, the causal 2 6 4 estimates were consistent across all testing methods. Leukotriene D4 was only weakly Association of metabolite features with incident coronary heart disease 2 7 0 2 7 1 Using the baseline bioactive metabolite features, we next assessed CHD-free survival for 2 7 2 incident CHD via the linked EHR data available in both cohorts (Methods). While PSC is 2 7 3 relatively rare in the population (with only 9 incident cases in FINRISK02), we were 2 7 4 powered to detect baseline metabolite feature associations with CHD (541 and 97 incident 2 7 5 CHD cases in FINRISK02 and FHS, respectively). Time-to-event Cox proportional hazards 2 7 6 models were used in both cohorts and then meta-analyzed (Methods). 2 7 7 2 7 8 Of the 19 metabolite features with putative causal effects on CHD, six were associated with 2 7 9 incident CHD at FDR-adjusted significance (Table S10, Figure 3), with an additional 2 8 0 metabolite feature associated at nominal significance (p<0.05). For these seven metabolite 2 8 1 features, there was opposing direction of effect for the MR-based lifetime exposure estimate 2 8 2 and time-to-event Cox model estimate, with the former being negative and the latter being 2 8 3 positive (Figure 3). The corresponding hazards ratios from the Cox models of each 2 8 4 metabolite feature in FINRISK02 and FHS were consistently positive in both cohorts ( Table  2 8 5 S10). 2  8  6  2  8  7  2  8  8  Discussion  2  8  9  2  9  0 In this study, we investigated the genetic associations of over 10,000 bioactive metabolite 2 9 1 features and their relationships with common diseases. We identified 118 genetic loci 2 9 2 harboring variants robustly associated with the levels of 2,319 metabolite features, 91% of 2 9 3 which were chemically unidentified compounds, suggesting a largely unexplored reservoir of 2 9 4 genetic control of the circulating metabolome. We identified 39 genetic loci previously 2 9 5 unlinked to blood metabolites and highlighted loci with extensive pleiotropy for bioactive 2 9 6 metabolites. We found causal effects for multiple identified and unidentified bioactive 2 9 7 metabolites on diverse common diseases, and investigated the baseline relationship of 2 9 8 putatively causal metabolite features with incident coronary heart disease. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted August 4, 2020. causal metabolic pathways modulating risk of both diseases. As a chronic autoimmune 3 2 5 disease, PSC is a progressive disease mainly associated with the hepatic system and 3 2 6 inflammation of the bile ducts, consistent with putative causal effects of leukotriene D4, an 3 2 7 inflammatory mediator known to be released by basophils 48 . We would speculate that other 3 2 8 bioactive metabolites amongst the 15 affecting CHD and PSC risk are also mediators of 3 2 9 inflammation, a broad but recognized causal process underlying both diseases.
Notably, for bioactive metabolites associated with CHD, we found directional inconsistency 3 3 2 between baseline risk prediction models and risk conferred by lifetime exposure (derived 3 3 3 from MR). This adds to previous evidence for significant but directionally opposed 3 3 4 biomolecular associations with cardiovascular diseases in MR and observational analyses. A 3 3 5 recent proteome analysis found that genetic predisposition to higher plasma MMP-12 levels 3 3 6 was predicted to reduce risk of coronary disease and atherosclerotic stroke, despite 3 3 7 observational studies finding a positive association with cardiovascular disease risk 49 . A 3 3 8 separate study found levels of PON1, a major anti-atherosclerotic component of high-density 3 3 9 lipoprotein (HDL), to be negatively associated with CHD in observational analyses but with 3 4 0 MR results predicting higher PON1 to increase risk of CHD 50 . Authors hypothesized that the 3 4 1 CHD condition may be linked to a downregulation of PON1 resulting in lower plasma These findings appear to be robust yet puzzling. We hypothesize several scenarios which 3 4 5 may explain these data. First, the causal effect estimated using MR is thought to indicate a 3 4 6 lifetime average effect 51,52 ; however, biological pathways are highly dynamic and 3 4 7 biomolecules may have age or time-varying effects. Second, sub-clinical atherosclerosis or 3 4 8 other relevant disease states may rewire biochemical and metabolic pathways such that 3 4 9 disease risk is increased but normal biological functions (as estimated by MR) are 3 5 0 compromised. Third, MR approaches assume linearity for the examined causal effect and for 3 5 1 the genetic effects on the exposure and the outcome 53 . However, biological risk factors may 3 5 2 have non-linear effects on the outcome, For example, high body mass index (BMI) is a risk 3 5 3 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.01.20166413 doi: medRxiv preprint factor for type 2 diabetes (T2D) in adults; however, high BMI can be protective in infants 54 3 5 4 and early childhood famine is associated with higher T2D risk in later life 55 .  3  5  5  3  5  6 In conclusion, this study shows the efficiency of coupling untargeted metabolomics, GWAS 3 5 7 and MR to prioritize the bioactive molecular features with causal effects on disease. In doing 3 5 8 so, it highlights loci harboring potentially key enzymes, uncovers new insights into bioactive 3 5 9 molecular pathways, and raises salient questions of observational effects and MR-based 3 6 0 lifetime effects of molecular exposures which may have important therapeutic implications. 3 6 1 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.01.20166413 doi: medRxiv preprint measured metabolite features across 8,291 samples. 4 0 1 4 0 2 For FINRISK02, a two-step normalization was applied to the intensity data to control for 4 0 3 technical variation. The first step was to remove variation observed in external bracket 4 0 4 pooled plasma with RUV 60 , a tool implemented in R package MetNorm to remove unwanted 4 0 5 variation of metabolomics data. The second step was to remove variation observed in internal 4 0 6 standards with RUV. There were two components for each RUV process, RUV-random 4 0 7 (function NormalizeRUVRand) and RUV-Kmeans (function NormalizeRUVRandClust). 4 0 8 RUV-random was to remove k technical factors estimated from external pooled plasma or 4 0 9 internal standards, where k was set to the number of principal components explaining >97.5% 4 1 0 of variation. RUV-Kmeans was used to refine the k technical factors to better represent the 4 1 1 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted August 4, 2020. The kinship matrix was constructed using 106,201 SNPs selected by pruning the hard-called 4 5 8 SNPs with r 2 < 0.1 (plink2 command --indep-pairwise 1000 80 0.1). Genetic principal 4 5 9 components were calculated using FlashPCA2 68 on the pruned SNPs. Leave-one-4 6 0 chromosome-out (LOCO) analysis within BOLT-LMM was used to avoid proximal 4 6 1 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.01.20166413 doi: medRxiv preprint contamination. The linear mixed model included the following covariates for FINRISK02: 4 6 2 genotyping batch, age, gender and top 10 genetic principal components; for FHS, the 4 6 3 covariates were age and sex. The average genomic inflation factor was 1.0067 (range 0.9776 4 6 4 to 1.0389) and a positive correlation (Pearson correlation coefficient was 0.57) was observed 4 6 5 between the genomic inflation factor and SNP-heritability ( Figure S1). For those SNPs 4 6 6 reaching genome-wide significance (p<5×10 -8 ), to obtain conditionally independent SNP-4 6 7 metabolite associations GCTA-COJO 69 was used to conduct step-wise conditional and joint 4 6 8 analysis on individual genotype data. Given the large number of metabolite features, the 4 6 9 number of effective tests was based on eigenvalue variance and estimated using 4  As such, 41 SNPs in total were assigned to more than one locus; in these cases, we reported 4 9 0 both loci in published-gwas-with-metabolomics; last accessed 02/2020). 4 9 5 4 9 6 Mendelian randomization 4 9 7 4 9 8 For genetic instruments, we utilized SNPs reaching genome-wide significance in meta-4 9 9 analysis. Summary statistics of SNP-disease associations were extracted from MR-base 5 0 0 database using R package TwoSampleMR 76 . If an index SNP was not present, the strongest 5 0 1 proxy SNP was used or set to missing if r 2 <0.8). GWAS summary statistics were required to 5 0 2 include European ancestries and be based on at least 1,000 individuals (Table S7). For each 5 0 3 metabolite-disease analysis, SNPs were clumped using a linkage-disequilibrium threshold of 5 0 4 r 2 <0.05 in a 500kb window to minimize the impact of correlated SNPs on causal estimates. 5 0 5 As MR analysis with multiple instruments is more reliable, five or more genetic instruments 5 0 6 were required for a metabolite to be taken forward for MR analysis. The effect allele was 5 0 7 taken to be the effect-increasing allele of metabolite in FINRISK02. We estimated causal 5 0 8 effects using an ensemble of five widely utilized methods: inverse variance weighted 5 0 9 (IVW) 77 , simple mode 78 , weighted mode 78 , weighted median 79 , and MR-Egger 80 . As these 5 1 0 methods have different assumptions, agreement among multiple methods would indicate a 5 1 1 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.01.20166413 doi: medRxiv preprint robust estimate of causal effects 81 . We defined a significant causal effect as p<0.05 in three of 5 1 2 the five selected methods. For significant causal estimates, details of genetic instruments are 5 1 3 provided in Table S8. As a second step, for putative causal associations passing three of the 5 1 4 five methods in the ensemble, we then used MR-PRESSO 45 to detect and correct for 5 1 5 horizontal pleiotropic outliers. 5 1 6 5 1 7 Cox proportional hazards models 5 1 8 5 1 9 To test the association between metabolite feature and incident CHD in both FINRISK02 (16 5 2 0 years follow-up; 541 incident events) and FHS (6 years follow-up; 97 incident events), Cox 5 2 1 proportional hazards regression (coxph function in the survival R package) was utilized to 5 2 2 predict the CHD event for metabolite feature. Metabolite levels were at log10 scale. 5 2 3 Covariates included age, sex, and log-transformed BMI. Participants with prevalent CHD at 5 2 4 baseline were excluded. Cox models were sex-stratified with time-on-study as the time scale. 5 2 5 Fixed effect meta-analysis was conducted to combine the summary statistics from both 5 2 6 cohorts. Sensitivity analysis was conducted to ensure associations were robust to LDL-5 2 7 cholesterol, smoking status, hypertension, diabetes as well as medications for lipid-lowering, 5 2 8 anti-hypertension and diabetes. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted August 4, 2020. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.01.20166413 doi: medRxiv preprint . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted August 4, 2020. . https://doi.org/10.1101/2020.08.01.20166413 doi: medRxiv preprint