Genome-wide SNP-sex interaction analysis of susceptibility to idiopathic pulmonary fibrosis

Background Idiopathic pulmonary fibrosis (IPF) is a chronic lung condition that is more prevalent in males than females. The reasons for this are not fully understood; possible explanations include differing environmental exposures due to historically sex-biased occupations, and diagnostic bias. To date, over 20 independent genetic variants have been found to be associated with IPF susceptibility, but these were discovered in analyses that combined males and females. Our aim was to test for the presence of sex-specific associations with IPF susceptibility and to assess whether sex-specific effects need to be considered when evaluating genetic risk in clinical prediction models for IPF.
Methods We performed genome-wide single nucleotide polymorphism (SNP)-by-sex interaction studies of IPF risk in six independent IPF case-control studies and combined them using inverse-variance weighted fixed-effect meta-analysis. In total, 4,561 cases (1,280 females and 3,281 males) and 22,888 controls (8,360 females and 14,528 males) of European genetic ancestry were analysed. We used polygenic risk scores (PRS) to assess differences in genetic risk prediction between males and females.
Findings Three independent genetic association signals were identified. All showed a consistent direction of effect across all individual IPF studies and an opposite direction of effect on IPF susceptibility between females and males. None had previously been identified in IPF susceptibility genome-wide association studies (GWAS). The predictive accuracy of the PRSs was similar between males and females, regardless of whether combined or sex-specific GWAS results were used.
Interpretation We prioritised three genetic variants whose effect on IPF risk may be modified by sex; however, these require further study. We found no evidence that the predictive accuracy of common SNP-based PRSs varies significantly between males and females.


Overview of study
In this study we used six IPF case-control studies (US, Colorado, UK, UUS, Genentech and CleanUP-UCD) to conduct a genome-wide SNP-by-sex interaction meta-analysis.
Interaction analyses were performed in each of the six studies separately, using genetic variants imputed with the Trans-Omics for Precision Medicine (TOPMed) reference panel and the Michigan Imputation Server 1. The individual study results were then combined in a fixed-effect meta-analysis to improve statistical power. In the meta-analysis, meta-P < 1×10⁻⁸ was used as the threshold for genome-wide significance and meta-P < 1×10⁻⁶ was used to define suggestively significant interactions. The IPF study IPFJES was used to assess the male-only results from the meta-analysis.
We performed polygenic risk score (PRS) analyses with the scores constructed for CleanUP-UCD male and female participants separately, using the SNP effects from a meta-analysis of five of the IPF case-control cohorts. We used the area under the ROC curve (AUC) and DeLong's test to assess whether the predictive accuracy of multiple PRSs differed between males and females.

Studies and quality control
Quality control (QC) and sample selection for five of the six studies meta-analysed (US, Colorado, UK, UUS, Genentech) have previously been described in Allen et al 2, and the CleanUP-UCD study 3,4 has been described separately.
In the analysis we retained IPF cases and controls who were of genetically-determined European ancestry and had sex-at-birth recorded.
IPFJES (the IPF Job Exposure Study) has been described by Reynolds et al 5. The study comprises 960 individuals (494 cases and 466 controls) who were recruited across 21 UK hospitals. IPF cases were males who had first been diagnosed with IPF between 1st February 2017 and 1st October 2019. Controls were selected from non-IPF individuals attending outpatient departments over the same time period. Of the 494 IPF cases and 466 controls, 441 cases and 423 controls were genotyped using the Affymetrix UK Biobank array.
The following subject quality control was performed:
- Affymetrix quality control: four individuals were removed after failing Affymetrix dish quality control and five individuals were removed because they had a call rate < 97% in step 1 genotype calling (performed using Axiom Power Tools).
- Individual call rate: four individuals were removed because they had an individual call rate < 95%.
- Sex mismatches: two individuals were removed because their genetic sex, inferred using PLINK 1.9 (www.cog-genomics.org/plink/1.9/) 6, differed from their recorded sex.
- Duplicates and relatedness: relatedness was estimated using KING 7 on autosomal genetic variants with a genotype call rate > 95%, minor allele frequency (MAF) > 1%, in Hardy-Weinberg equilibrium (P > 10⁻⁶) and not in regions of high linkage disequilibrium. Duplicate/monozygotic twin pairs were defined as those with a kinship coefficient > 0.3540, first-degree relatives as those with a kinship coefficient between 0.1770 and 0.3540, and second-degree relatives as those with a kinship coefficient between 0.0884 and 0.1770. Seven individuals identified as duplicates/monozygotic twins were removed and 13 individuals were removed due to high relatedness (first-degree or second-degree relatedness). One individual was removed as they had already been genotyped in other IPF cohort studies (UK and UUS studies).
- Ancestry: principal component analysis, using PLINK on the genetic data and HapMap samples, was used to infer ancestry. Autosomal genetic variants that were present in HapMap with a genotype call rate > 95%, MAF > 1%, in Hardy-Weinberg equilibrium (P > 10⁻⁶) and not in regions of high linkage disequilibrium were used in the principal component analysis. In total, 27 individuals were identified as being of non-European ancestry.
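The relatedness categories above can be sketched as a simple threshold lookup. This is an illustrative helper, not the KING software itself; the function name is hypothetical, and the treatment of values exactly on a boundary is an assumption, since the text only says "between".

```python
def classify_pair(kinship: float) -> str:
    """Map a KING kinship coefficient to the relatedness categories in the text.

    Assumption: a value exactly on a boundary falls into the lower category.
    """
    if kinship > 0.3540:
        return "duplicate/monozygotic twin"
    if kinship > 0.1770:
        return "first-degree"
    if kinship > 0.0884:
        return "second-degree"
    return "unrelated or more distant"
```

In the quality control described here, pairs in any of the first three categories led to the removal of individuals.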
After subject quality control, 416 IPF cases and 385 controls were retained. To improve statistical power, the 385 controls were supplemented with controls from UK Biobank: 2,080 additional UK Biobank controls were selected (five per case), giving 2,465 controls in total. Imputation was performed using the TOPMed imputation reference panel.

Genome-wide SNP*sex interaction analysis
Genome-wide sex interaction analyses of IPF risk were performed separately in each study, using PLINK 1.9.
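The following logistic regression model was applied:
\[ \operatorname{logit} P(\text{Phenotype} = 1) = \beta_0 + \beta_{G}\, G + \beta_{\text{Sex}}\, \text{Sex} + \beta_{G \times \text{Sex}}\, (G \times \text{Sex}) + \sum_{k=1}^{10} \gamma_k\, \text{PC}_k \]
where Phenotype is IPF status, G is the dosage for a given genotype (additive effect), Sex is binary coded, G×Sex (G*Sex) is the interaction term and PC1 to PC10 are the first ten standardised genetic principal components for ancestry.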
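To make the role of the interaction term concrete, write \beta_{G} for the coefficient of G and \beta_{G \times \text{Sex}} for the coefficient of G*Sex in a logistic model of this form. Assuming Sex is coded 0/1 (the coding actually used is not stated here), the per-allele log odds ratio in each sex group would be:

```latex
\begin{aligned}
\log \mathrm{OR}_{\text{Sex}=0} &= \beta_{G} \\
\log \mathrm{OR}_{\text{Sex}=1} &= \beta_{G} + \beta_{G \times \text{Sex}}
\end{aligned}
```

A non-zero \beta_{G \times \text{Sex}} therefore means that the per-allele effect differs between the sexes, and effects in opposite directions arise when \beta_{G} and \beta_{G} + \beta_{G \times \text{Sex}} have opposite signs.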
Bi-allelic autosomal variants that were well imputed (imputation R² ≥ 0.5) using the TOPMed imputation reference panel, had a MAF ≥ 0.01, did not depart from Hardy-Weinberg Equilibrium (P ≥ 1×10⁻⁶) and were present in at least three of the six studies were retained in the analysis. The results from each of the six IPF cohorts were then combined using an inverse-variance weighted fixed-effect meta-analysis, implemented using PLINK 1.9. Genomic control lambda was calculated for the meta-analysis.
We defined sentinel variants as those with the smallest p-value and P < 1×10⁻⁶ within a 1 Mb region.
GCTA-COJO 8 was then used to identify additional independent signals meeting P < 1×10⁻⁶ within each 1 Mb window.
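As a minimal sketch of the combination step, the following shows inverse-variance weighting and a genomic control lambda computed from z-scores. It is illustrative only and is not PLINK's implementation; the function names and the `min_studies` argument are hypothetical, and the two-sided normal p-value is an assumption about how significance is summarised.

```python
import math
from statistics import median


def ivw_fixed_effect(betas, ses):
    """Combine per-study effect estimates using inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p-value under a standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


def meta_analyse_variant(per_study, min_studies=3):
    """per_study: list of (beta, se), with (None, None) where a study lacks the variant.

    Returns None when the variant is present in fewer than min_studies studies.
    """
    available = [(b, s) for b, s in per_study if b is not None and s is not None]
    if len(available) < min_studies:
        return None
    betas, ses = zip(*available)
    return ivw_fixed_effect(betas, ses)


def genomic_control_lambda(z_scores):
    """Median observed 1-df chi-square statistic divided by its expected median."""
    chi_squares = [z * z for z in z_scores]
    return median(chi_squares) / 0.4549364
```

Here the per-study estimate for each variant would be the interaction coefficient and its standard error; a lambda close to 1 suggests little inflation of the test statistics.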
Region plots were produced using LocusZoom 9 .
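The sentinel definition above can be sketched as a greedy selection over the suggestive variants. This is only an illustration: the function name and tuple layout are hypothetical, and interpreting "a 1 Mb region" as a window centred on each sentinel (±500 kb) is an assumption. The conditional analysis performed by GCTA-COJO is not reproduced here.

```python
def select_sentinels(variants, p_threshold=1e-6, window_bp=1_000_000):
    """Pick the most significant variant in each region.

    variants: iterable of (variant_id, chromosome, position, p_value).
    Assumption: a region is a window of window_bp centred on a sentinel.
    """
    candidates = sorted(
        (v for v in variants if v[3] < p_threshold),
        key=lambda v: v[3],
    )
    sentinels = []
    for variant_id, chrom, pos, p in candidates:
        # Skip variants that fall inside the window of a stronger sentinel.
        near_existing = any(
            chrom == s_chrom and abs(pos - s_pos) <= window_bp / 2
            for _, s_chrom, s_pos, _ in sentinels
        )
        if not near_existing:
            sentinels.append((variant_id, chrom, pos, p))
    return sentinels
```

Because candidates are visited in order of increasing p-value, each retained variant is the most significant one within its own window.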

Bioinformatic investigation of signals
Fine-mapping was used to produce a set of genetic variants that had a 95% probability of containing the causal variant for a given genetic signal (95% credible set). This was performed in R version 4.2.1.
The fine-mapping approach used (Wakefield approximate Bayes factor) assumed that there was one causal variant and that it had been measured.
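A minimal sketch of how a credible set can be built from Wakefield approximate Bayes factors is given below, in Python rather than the R used in the study. The function names are hypothetical, the prior variance of 0.04 is an assumed value that is not stated in the text, and equal prior probability for every variant in the region is also an assumption.

```python
import math


def wakefield_log_abf(beta, se, prior_variance=0.04):
    """Log approximate Bayes factor in favour of association.

    prior_variance is an assumed value for the variance of the effect-size prior.
    """
    v = se ** 2
    z_squared = (beta / se) ** 2
    shrinkage = prior_variance / (v + prior_variance)
    return 0.5 * math.log(1.0 - shrinkage) + 0.5 * z_squared * shrinkage


def credible_set(variants, coverage=0.95, prior_variance=0.04):
    """variants: list of (variant_id, beta, se). Returns [(variant_id, posterior)].

    Assumes exactly one causal variant in the region and equal prior probabilities.
    """
    log_abfs = [wakefield_log_abf(b, s, prior_variance) for _, b, s in variants]
    max_log = max(log_abfs)  # subtract the maximum for numerical stability
    weights = [math.exp(value - max_log) for value in log_abfs]
    total = sum(weights)
    posteriors = sorted(
        ((vid, w / total) for (vid, _, _), w in zip(variants, weights)),
        key=lambda item: item[1],
        reverse=True,
    )
    selected, cumulative = [], 0.0
    for vid, posterior in posteriors:
        selected.append((vid, posterior))
        cumulative += posterior
        if cumulative >= coverage:
            break
    return selected
```

Variants are added in order of decreasing posterior probability until the cumulative probability reaches the requested coverage.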
We used the GTEx Portal to check whether the genetic variants in the 95% credible sets were associated with gene expression across 49 tissues (including lung and non-lung tissues). For variants associated with expression levels of a gene in either lung or cultured fibroblasts, colocalisation analyses were performed in lung and cultured fibroblast tissue (GTEx Version 8) for the corresponding gene using the coloc package 10 in R version 4.2.1. Colocalisation analysis was used to identify whether the same causal variant was associated with both IPF susceptibility and gene expression in GTEx. The posterior probability of each of the following models was estimated using approximate Bayes factors:
- H0: neither sex-specific IPF susceptibility nor gene expression has a genetic association in that region
- H1: only sex-specific IPF susceptibility has a genetic association in that region
- H2: only gene expression has a genetic association in that region
- H3: both sex-specific IPF susceptibility and gene expression are associated, but with different causal variants
- H4: both sex-specific IPF susceptibility and gene expression are associated and share a single causal variant
If the posterior probability of H4 (a single shared causal variant) was greater than 80%, we concluded that the sex-specific IPF and gene expression signals colocalised.

Polygenic risk score analysis
Two main polygenic risk score (PRS) analyses were performed, 'standard PRS' and 'sex-specific PRS'. The 'base data' were derived from the meta-analyses of the US, Colorado, UK, UUS and Genentech datasets 2 (4,096 cases & 20,433 controls), with the association effect sizes from the previously published combined-sex IPF GWAS meta-analysis used for the 'standard PRS', and association effect sizes from new sex-specific IPF GWAS meta-analyses of the same five datasets used for the 'sex-specific PRS' (Figure 1). The 'target dataset' was the CleanUP-UCD study, which comprised 2,297 males (372 cases & 1,925 controls) and 623 females (93 cases & 530 controls). Bi-allelic autosomal variants that were well imputed (imputation R² ≥ 0.5), had a MAF ≥ 0.01 and did not depart from Hardy-Weinberg Equilibrium (P ≥ 1×10⁻⁶) were retained in the base data. Ambiguous SNPs were excluded. Only variants available in both the base data and the target dataset were included in the analyses.
For the 'standard PRS' we first constructed the 19-variant PRS using the effect sizes from the base data 2. We then tested the predictive accuracy of this PRS in males and females separately in the target data. Using the same base data, we then created multiple PRSs for a range of p-value thresholds (PT) using PRSice v2.3.5 11, and the p-value threshold whose PRS showed the most significant association in the target data was selected as the best-performing PRS.
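The idea of a PRS at a given p-value threshold can be sketched as a weighted sum of effect-allele dosages over the variants that pass the threshold. This is a simplified illustration and not PRSice's implementation; the function name and data layout are hypothetical, variants are assumed to have been clumped already, and the use of an unscaled sum is an assumption.

```python
def polygenic_score(dosages, weights, p_threshold):
    """Weighted sum of effect-allele dosages for one individual.

    dosages: {variant_id: effect-allele dosage in the target data}
    weights: {variant_id: (beta, p_value)} from the base data
    Variants missing from the target data are skipped.
    """
    score = 0.0
    for variant_id, (beta, p_value) in weights.items():
        if p_value <= p_threshold and variant_id in dosages:
            score += beta * dosages[variant_id]
    return score
```

Repeating this over a grid of thresholds and testing each resulting score against case-control status in the target data gives the set of candidate PRSs from which the best-performing one is chosen.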
For the 'sex-specific PRS' we used sex-specific GWAS results derived from the base data and PRSice to create multiple PRSs in males and females separately. For each of the two target datasets (CleanUP-UCD male and CleanUP-UCD female), the p-value threshold whose PRS showed the most significant association was selected as the best-performing PRS.
For the best-performing PRS in both the 'standard PRS' and 'sex-specific PRS' analyses, we also estimated the AUC to examine its predictive accuracy. For the 'sex-specific' analysis we used DeLong's test to assess whether the predictive accuracy of a PRS constructed in males using male-specific effect sizes was statistically significantly different from that of a PRS constructed in females using female-specific effect sizes.
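A sketch of the AUC and of a DeLong-style comparison between two groups follows. It treats males and females as independent samples, which is an assumption about how the test was applied here. The function names are hypothetical, and the code is not the implementation used in the study.

```python
import math
from statistics import variance


def _psi(case_score, control_score):
    """1 if the case outranks the control, 0.5 for a tie, 0 otherwise."""
    if case_score > control_score:
        return 1.0
    if case_score == control_score:
        return 0.5
    return 0.0


def auc_with_delong_variance(case_scores, control_scores):
    """AUC and its DeLong variance estimate (needs >= 2 cases and >= 2 controls)."""
    m, n = len(case_scores), len(control_scores)
    v10 = [sum(_psi(x, y) for y in control_scores) / n for x in case_scores]
    v01 = [sum(_psi(x, y) for x in case_scores) / m for y in control_scores]
    auc = sum(v10) / m
    var = variance(v10) / m + variance(v01) / n
    return auc, var


def compare_independent_aucs(group_a, group_b):
    """group_a, group_b: (case_scores, control_scores) from separate samples.

    Raises ZeroDivisionError if both variance estimates are zero.
    """
    auc_a, var_a = auc_with_delong_variance(*group_a)
    auc_b, var_b = auc_with_delong_variance(*group_b)
    z = (auc_a - auc_b) / math.sqrt(var_a + var_b)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return auc_a, auc_b, z, p
```

For example, `group_a` could hold the PRS values of male cases and controls and `group_b` those of female cases and controls.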
For all PRS analyses, linkage disequilibrium (LD) was accounted for using clumping (R² > 0.1 across a 250 kb window).

Note: The plots are zoomed in, so the y-axes do not start at 0; this is to help with visualisation.

Figure S2: Forest plot of male-only results by IPF study and meta-analysed male results for a) rs62040020, b) rs1756167317 and c) rs1663078846. OR = odds ratio and CI = confidence interval.

Table S3: Results of colocalisation analysis between rs62040020 and lung tissue and cultured fibroblasts using coloc (female- and male-specific results) [Excel spreadsheet]

Table S4: SNP-by-sex interaction meta-analysis results for previously reported IPF susceptibility SNPs [Excel spreadsheet]