Twin-based Mendelian Randomization Analyses Highlight Smoking’s Effects on Blood DNA Methylation, with Putative Reverse Causation

Epigenome-wide association studies (EWAS) aim to identify differentially methylated loci associated with complex traits and disorders. EWAS of cigarette smoking shows some of the most widespread DNA methylation (DNAm) associations in blood. However, traditional EWAS cannot differentiate between causation and confounding, leading to ambiguity in etiological interpretations. Here, we apply an integrated approach combining Mendelian Randomization and twin-based Direction-of-Causation analyses (MR-DoC) to examine causality underlying smoking-associated blood DNAm changes in the Netherlands Twin Register (N=2577). Evidence across models suggests that current smoking’s causal effects on DNAm likely drive many of the previous EWAS findings, implicating functional pathways relevant to several adverse health outcomes of smoking, including hemopoiesis, cell- and neuro-development, and immune regulation. Additionally, we find evidence of potential reverse causal influences at some DNAm sites, with 17 of these sites enriched for gene regulatory functional elements in the brain. The top three sites with evidence of DNAm’s effects on smoking annotate to genes involved in G protein-coupled receptor signaling (GNG7, RGS3) and innate immune response (SLC15A4), elucidating potential biological risk factors for smoking. This study highlights the utility of integrating genotypic and DNAm measures in twin cohorts to clarify the causal relationships between health behaviors and blood DNAm.


Supplementary Methods
In this study, we analyzed data from the Netherlands Twin Register (NTR) 1 to examine the causal influences between smoking status and blood DNA methylation (DNAm) using MR-DoC models 2,3 .In the current analyses, we included data from European-ancestry adult twins with both genotypic and DNAm data, comprising 2,577 individuals (67% female).

Genotypic Data, Principal Components Analysis, and Ancestry Outlier Pruning
The DNA samples included in the current study were genotyped on 3 SNP (single nucleotide polymorphism) microarray platforms: Affymetrix 6.0 (N= 2,399), Affymetrix Axiom (N= 83), and Illumina GSA NTR array (N= 95).Genotype calling was done following the manufacturer's protocols.Sample and variant quality control (QC), imputation, genetic principal component analysis (PCA), and ancestry assignment have been previously described 4 .Briefly, after QC and harmonizing variants across the three platforms, the data were aligned to the positive strand of Genome Reference Consortium Human Build 37 (GRCh37) and then imputed against the European (EUR) super-population of the 1000 Genomes Project Phase-3 (KGP3) 5 , the Haplotype Reference Consortium (HRC) 6 1.1 (Ega version), and the Genome of the Netherlands Consortium (GoNL) 7 reference panels.Using SmartPCA in EIGENSTRAT 8 , the first 20 PCs for the genotypic data were calculated in the KGP3 data, and the NTR samples were then projected onto the PC space based on the SNP weights.Samples identified as outliers in the PC space were then excluded.

Smoking Assessment at Blood Sampling
At the time of blood sampling for DNAm assessment, the participants answered the following questions about smoking in Dutch.The English translations are described in the main Methods section.

Polygenic Risk Score of Smoking
The PRS of smoking was based on the European-ancestry summary statistics from the genomewide association study (GWAS) of smoking initiation (lifetime regular smoking) by GSCAN (GWAS & Sequencing Consortium of Alcohol and Nicotine use) 9 , excluding the NTR from the meta-analysis.
As described in a previous study using the same PRS in the NTR 4 , the post-imputation SNPs from the merged best-guess three-platform data were QCed to satisfy the following criteria: MAF >0.01, HWE p >0.00001, Mendel error rate < 1%, and genotype call rate over 98%.Furthermore, the imputation info for the three platforms needed to be above 0.10, and the allele frequency between platforms after imputation could not differ more than 2%, leaving a total of 7,551,860 post-QC SNPs for analysis.The PRS was calculated using LDpred v0.9 10 , with HRC+GoNL as the LD (linkage disequilibrium) reference panel.For estimating the target LD structure, we used a subset of unrelated individuals and a set of wellimputed variants in the NTR.The parameter ld_radius was set by dividing the number of variants in common (from the output of the coordination step) by 12000.For the coordination step, the median sample size was used as the input value for N.For the LDpred step, we applied the following thresholds for the fraction of variants with non-zero effects (in addition to the default infinitesimal model): --PS=0.5,0.3,0.2,0.1,0.05,0.01.
To determine the LDPred threshold that yielded the PRS with the highest predictive power for the variables of interest (current vs. never and former vs. never smoking), we fitted logistic regression models in R (v4.3.2) to estimate incremental R 2 on a liability scale.We first fitted a null logistic regression model using the glm()function with family=binomial(link='logit') and a standard set of covariates comprising age (linear and quadratic), sex, SNP microarray platform (dummy variables), and the first ten genetic PCs (without including the PRS).Then, we fitted a full model with the PRS as an additional independent variable.We estimated the liability-scale R 2 in both models and then the difference in the two R 2 estimates as the variance in the outcome variable explained by the PRS (controlling for the covariates).For both outcome variables (current and former smoking), the PRS with the highest incremental R 2 was based on a threshold of 0.1 and thus retained for further analyses.The PRS was residualized for the SNP microarray platform and the first ten genetic PCs using linear regression models.The residuals were then standardized to have a mean of zero and an S.D. of one before using it as an IV in the MR-DoC models.

Figure S5
Bidirectional Causal Estimates at the 64 CpGs with Robust Evidence of the Causal Effects of Current Smoking on DNA methylation These CpGs did not show robust evidence for the reverse effects of DNAm on current smoking.Please refer to Supplementary Tables S1 (Current Smoking → DNAm) and S2 (DNAm → Current Smoking) for the corresponding data.

Figure S6
Upset plot of the intersection of CpGs with statistically significant Current Smoking → DNAm effects after Bonferroni correction in each of the three MR-DoC models Note.Please refer to Supplementary Table S1 for the corresponding data

Figure S7
Upset plot of the intersection of CpGs with statistically significant DNAm → Current Smoking effects after Bonferroni correction in each of the three MR-DoC models Note.Please refer to Supplementary Table S3 for the corresponding data

Figure S8
19 CpGs with potential bidirectional causal effects between current smoking and DNA methylation Note.Three CpGs had more robust evidence of Current Smoking → DNAm causal effects than vice versa.
One CpG had more robust evidence of DNAm → Current Smoking causal effects than vice versa.The rest 15 CpGs had only suggestive evidence (consistent, nominally significant estimates across models) in both directions.Please refer to Supplementary Tables S1-S4 for the corresponding data.

Figure S9
Top Enriched Ontology Clusters in Metascape's Gene Annotation and Functional Enrichment Analyses of the 525 CpGs (outside the MHC region) with Potential Current Smoking → DNAm effects Note.The "NearestGene" IDs from Supplementary Table S1 were used as the input data for Metascape 12 .Please refer to Supplementary Tables S5-S6 for the corresponding annotation and enrichment results.As detailed in the Metascape manuscript 12 , the program first identified all significant ontology terms, including GO/KEGG terms, canonical pathways, and hallmark gene sets.The significant terms (based on hypergeometric p-value <0.01 and >1.5-fold enrichment) were then clustered into a hierarchical tree based on Kappa-statistical similarities among their gene memberships.The tree was then cast into term clusters based on a threshold of 0.3 kappa score.The enrichment clusters and their underlying terms are marked as "Summary" and "Membership", respectively, under the column GroupID in Supplementary Table S24.The "Summary" terms provide an overview of enriched, non-redundant ontology terms.

Figure S10 Enrichment Results for Gene-Ontology (GO) Processes in Metascape's Gene Annotation and Functional
Enrichment Analyses of the 525 CpGs (outside the MHC region) with Potential Current Smoking → DNAm effects Note.The "NearestGene" IDs from Supplementary Table S1 were used as the input data for Metascape 12 .Please refer to Supplementary Table S6 for all enrichment results.

Figure S11
Top 100 Ontology Terms in Metascape's Gene Annotation and Functional Enrichment Analyses of the 525 CpGs (outside the MHC region) with Potential Current Smoking → DNAm effects Note.The "NearestGene" IDs from Supplementary Table S1 were used as the input data for Metascape 12 .Please refer to Supplementary Table S6 for all enrichment results.

Figure S12 eFORGE analyses of overlap between gene-regulatory chromatin states and the 525 CpGs (outside the MHC region) with potential Current
Smoking → DNAm effects Note.Please refer to Supplementary Table S7 for the corresponding data.

Figure S14 eFORGE analyses of overlap between DNase hypersensitivity (DHS) sites and the 546 CpGs with potential Current Smoking → DNAm effects
Note.Please refer to Supplementary Table S9 for the corresponding data.FDR q−value q < 0.01 q < 0.05 non−sig FDR q−value q < 0.01 q < 0.05 non−sig

Figure S15
64 CpGs with potential DNAm → Current Smoking effects, based on consistent, nominally significant estimates across models Note.These CpGs were used for the follow-up enrichment analyses with eFORGE 11 and Metascape 12 .
None of these sites are in the MHC region.Please refer to Supplementary Table S3 for the corresponding data.

Figure S16 Top Ontology Clusters in Metascape's Gene Annotation and Functional Enrichment Analyses of the 64
CpGs with Potential DNAm → Current Smoking effects Note.The "NearestGene" IDs from Supplementary Table S3 were used as the input data for Metascape 12 .None of the ontology terms were significant after multiple-testing correction.Please refer to Supplementary Tables S10 and S11 for all annotation and enrichment results.

Figure S17 Enrichment Results for Gene-Ontology (GO) Processes in Metascape's Gene Annotation and Functional
Enrichment Analyses of the 64 CpGs with Potential DNAm → Current Smoking effects Note.The "NearestGene" IDs from Supplementary Table S3 were used as the input data for Metascape 12 .None of the ontology terms were significant after multiple-testing correction.Please refer to Supplementary Table S11 for all enrichment results.

Figure S20
eFORGE analyses of overlap between DNase hypersensitivity (DHS) sites and the 64 CpGs with potential DNAm → Current Smoking effects Note.Please refer to Supplementary Table S14 for the corresponding data.FDR q−value q < 0.01 q < 0.05 non−sig FDR q−value q < 0.01 q < 0.05 non−sig

Figure S21 Follow-up eFORGE analyses of overlap between gene-regulatory chromatin states and the 21 CpGs enriched for overlap with Enhancers in the "Fetal Brain
Male" sample in Figure S10

Figure S23 Follow-up eFORGE analyses of overlap between DNase hypersensitivity (DHS) sites and the 21 CpGs enriched for overlap with Enhancers in the "Fetal Brain
Male" sample in Figure S10/Table S5 Note.Please refer to Supplementary Table S17 for the corresponding data.FDR q−value q < 0.01 q < 0.05 non−sig FDR q−value q < 0.01 q < 0.05 non−sig S14/Table S9 Note.Please refer to Supplementary Table S20 for the corresponding data.FDR q−value q < 0.01 q < 0.05 non−sig FDR q−value q < 0.01 q < 0.05 non−sig

Figure S25
Estimated DNAm → Current Smoking effects at the 17 CpGs showing highly specific enrichment for overlap with generegulatory elements in the brain in Figure 7 Note.The Y-axis shows the probe ID and the "Nearest Gene".For the corresponding data, please refer to Supplementary Table S3.

Figure S27
eFORGE analyses of overlap between histone-mark modifications and the 18 CpGs underlying the enriched overlap with Enhancers in the "Lung" sample in Figure S10/Table S5 Note.Please refer to Supplementary Table S22 for the corresponding data.FDR q−value q < 0.01 q < 0.05 non−sig FDR q−value q < 0.01 q < 0.05 non−sig

Figure S29
eFORGE analyses of overlap between gene-regulatory chromatin states and the 18 CpGs underlying the enriched overlap with Enhancers in the "Primary B cells from cord blood" sample in Figure S10

Figure S31
eFORGE analyses of overlap between DNase hypersensitivity (DHS) sites and the 18 CpGs underlying the enriched overlap with Enhancers in the "Primary B cells from cord blood" sample in Figure S10/Table S5 Note.Please refer to Supplementary Table S26 for the corresponding data.FDR q−value q < 0.01 q < 0.05 non−sig FDR q−value q < 0.01 q < 0.05 non−sig Upset plot of the intersection of CpGs with statistically significant (FDR <0.05) Former Smoking → DNAm effects in each of the three MR-DoC models Note.Please refer to Supplementary Table S27 for the corresponding data.

Figure S33
Upset plot of the intersection of CpGs with statistically significant (FDR <0.05) DNAm → Former Smoking effects in each of the three MR-DoC models Note.Please refer to Supplementary Table S29 for the corresponding data.

Figure S34
Estimated DNAm → Former Smoking effects at the two former-smoking-associated CpGs that showed robust evidence of DNAm → Current Smoking effects Note.Please refer to Supplementary Tables S3 and S29 for the corresponding data.

Figure S35
Prior EWAS association statistics of smoking-associated CpGs stratified by whether the CpG was identified as having an mQTL allelic score with F-statistic >10 in the current study On the X-axis, "mQTL-" indicates the CpGs without an mQTL allelic score with F >10 (5,816 CpGs), and "mQTL+" indicates the CpGs with an mQTL allelic score with F >10 (11,124 CpGs).The Y-axis shows the -log10(FDR) values of the association results from the previous EWAS meta-analysis of current vs. never smoking 13 .The "mQTL-" CpGs were not tested for DNAm → Smoking causal effects in the current study.
from peripheral blood E032 Primary B cells from peripheral blood E033 Primary T cells from cord blood E034 Primary T cells from peripheral blood E046 Primary Natural Killer cells from peripheral E050 Primary hematopoietic stem cells G−CSF−mobili E051 Primary hematopoietic stem cells G−CSF−mobili E028 Breast variant Human Mammary Epithelial Cells E003 H1 Cells E004 H1 BMP4 eFORGE analyses of overlap between gene-regulatory chromatin states and the 64 CpGs with potential DNAm → Current Smoking effectsNote.Please refer to Supplementary TableS12for the corresponding data.analyses of overlap between histone-mark modifications and the 64 CpGs with potential DNAm → Current Smoking effects Note.Please refer to Supplementary TableS13for the corresponding data. eFORGE /Table S5 Note.Please refer to Supplementary Table S15 for the corresponding data.Follow-up eFORGE analyses of overlap between histone-mark modifications and the 21 CpGs enriched for overlap with Enhancers in the "Fetal Brain Male" sample in Figure S10/Table S5 Note.Please refer to Supplementary Table S16 for the corresponding data.
Enhancers in the "Lung" sample in Figure S10/Table S5 Note.Please refer to Supplementary Table S21 for the corresponding data.
Figure S28 eFORGE analyses of overlap between DNase hypersensitivity (DHS) sites and the 18 CpGs underlying the enriched overlap with Enhancers in the "Lung" sample in FigureS10/TableS5Note.Please refer to Supplementary TableS23for the corresponding data.
/TableS5Note.Please refer to Supplementary TableS24for the corresponding data.Figure S30 eFORGE analyses of overlap between histone-mark modifications and the 18 CpGs underlying the enriched overlap with Enhancers in the "Primary B cells from cord blood" sample in FigureS10/TableS5Note.Please refer to Supplementary TableS25for the corresponding data.