Abstract
Participant overlap is thought to introduce overfitting bias into Mendelian randomization (MR) and polygenic risk score (PRS) studies. This hinders research into many unique traits and disease outcomes from large-scale biobanks. Here, we evaluated a block jackknife resampling framework for genome-wide association studies (GWAS) and PRS construction to mitigate the influence of overfitting bias on MR analyses, compared it with alternative approaches, and implemented this study design in a causal inference setting using data from the UK Biobank.
We simulated PRS and MR under three scenarios: (1) using weighted SNP estimates from an external GWAS, (2) using weighted SNP estimates from an overlapping GWAS sample and (3) using a block jackknife resampling framework. Based on a conventional P-value threshold to derive genetic instruments for MR studies (P<5×10−8), our block jackknife PRS did not suffer from overfitting bias (mean R2=0.034) compared to the externally weighted PRS (mean R2=0.040). In contrast, genetic instruments derived from overlapping samples explained a higher proportion of variance (mean R2=0.048) compared to the externally derived score. The detrimental impact of overfitting bias became considerably larger when using a more liberal P-value threshold to construct the PRS (e.g., P<0.05, mean R2=0.103), whereas estimates using the jackknife score remained robust to overfitting (mean R2=0.084).
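The contrast between scenarios (2) and (3) can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it assumes a leave-one-block-out design in which each block of participants is scored using SNP weights estimated in all remaining blocks, uses a simple per-SNP linear regression with a normal approximation for P-values, and all function names, sample sizes and parameter values are hypothetical.

```python
import numpy as np
from math import erfc, sqrt

def simulate(n, m, n_causal, h2, rng):
    """Simulate independent SNP genotypes and a trait with heritability h2 (toy setup)."""
    maf = rng.uniform(0.1, 0.5, m)
    G = rng.binomial(2, maf, size=(n, m)).astype(float)
    g = G[:, :n_causal] @ rng.normal(size=n_causal)   # first n_causal SNPs are causal
    g = (g - g.mean()) / g.std() * sqrt(h2)
    y = g + rng.normal(scale=sqrt(1 - h2), size=n)
    return G, y

def marginal_gwas(G, y):
    """Per-SNP simple linear regression; returns effect sizes and two-sided P-values."""
    n = G.shape[0]
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    ssx = (Gc ** 2).sum(axis=0)
    beta = Gc.T @ yc / ssx
    resid_ss = (yc ** 2).sum() - beta ** 2 * ssx
    se = np.sqrt(resid_ss / (n - 2) / ssx)
    z = beta / se
    p = np.array([erfc(abs(v) / sqrt(2)) for v in z])  # normal approximation
    return beta, p

def prs(G, beta, p, threshold):
    """Weighted allele score over SNPs passing the P-value threshold."""
    keep = p < threshold
    return G[:, keep] @ beta[keep]

def r_squared(score, y):
    """Squared correlation between score and trait (0 if the score is constant)."""
    if np.std(score) == 0:
        return 0.0
    return float(np.corrcoef(score, y)[0, 1] ** 2)

def jackknife_prs(G, y, n_blocks, threshold, rng):
    """Score each block with weights from a GWAS run on all other blocks."""
    n = G.shape[0]
    blocks = np.array_split(rng.permutation(n), n_blocks)
    score = np.empty(n)
    for held_out in blocks:
        train = np.setdiff1d(np.arange(n), held_out)
        beta, p = marginal_gwas(G[train], y[train])
        score[held_out] = prs(G[held_out], beta, p, threshold)
    return score

rng = np.random.default_rng(0)
G, y = simulate(n=2000, m=1000, n_causal=20, h2=0.3, rng=rng)

# Scenario (2): weights and score derived in the same (overlapping) sample.
beta_all, p_all = marginal_gwas(G, y)
r2_overlap = r_squared(prs(G, beta_all, p_all, threshold=0.05), y)

# Scenario (3): block jackknife, no participant contributes to their own weights.
jack_score = jackknife_prs(G, y, n_blocks=10, threshold=0.05, rng=rng)
r2_jack = r_squared(jack_score, y)

print(f"overlapping-sample R2: {r2_overlap:.3f}")
print(f"block jackknife R2:    {r2_jack:.3f}")
```

With a liberal threshold such as P<0.05, many null SNPs enter the overlapping-sample score with weights fitted to the same participants, which inflates its apparent variance explained. The jackknife score avoids this because the held-out block never informs its own weights. Linkage disequilibrium, clumping and covariate adjustment are deliberately omitted from this toy example.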
In an applied setting, we examined (A) the effects of body mass index on circulating biomarkers and (B) the effect of childhood body size on levels of testosterone in adulthood using the methods described above. In the first applied analysis, the overlapping sample PRS and the block jackknife resampled PRS led to comparable effect sizes, whereas narrower confidence intervals were obtained when using the overlapping sample instrument. In the second example, through sex-stratified multivariable and bi-directional MR, we demonstrate that childhood body size indirectly leads to lower testosterone levels in adulthood in males, an effect mediated through adult body size.
Author summary
Using genetic variants as instrumental variables for risk factors, Mendelian randomization (MR) provides an approach to explore the genetically predicted effects of modifiable risk factors on disease which is robust to confounding and reverse causation. Genetic instrumental variables are conventionally selected from the results of genome-wide association studies performed on an independent dataset whose sample does not overlap with the dataset being analysed using MR, as participant overlap can lead to overfitting bias. However, overlap is often difficult to avoid entirely, as such association studies are increasingly being performed by meta-analysing several biobanks to achieve the maximum power to detect variants with smaller effect sizes. Moreover, when investigating exposures and outcomes which only a single biobank has measured in sufficiently large samples, avoiding participant overlap requires splitting the study population into subgroups, which can limit statistical power. Block jackknife resampling MR provides a way to conduct causal inference under these circumstances with the maximum statistical power while avoiding bias due to overlapping participants. In this study, we evaluated this study design with simulated datasets in comparison to MR using genetic variants discovered from an external dataset or one with overlapping samples. We applied this approach using UK Biobank to investigate the role of body mass index on circulating biomarkers, as well as the causal relationship between childhood adiposity and testosterone levels in adulthood.
Competing Interest Statement
TGR is employed part-time by Novo Nordisk outside of this work. TRG receives funding from Biogen for unrelated research. All other co-authors declare no conflict of interest.
Funding Statement
All authors work at the MRC Integrative Epidemiology Unit at the University of Bristol (MC_UU_00011/1, MC_UU_00011/4). TRG and GDS conduct research at the NIHR Biomedical Research Centre at the University Hospitals Bristol NHS Foundation Trust and the University of Bristol. SF is supported by a Wellcome Trust PhD studentship in Molecular, Genetic and Lifecourse Epidemiology [108902/Z/15/Z]. GH is funded by the Wellcome Trust [208806/Z/17/Z]. This work was supported by the British Heart Foundation (AA/18/7/34219). The views expressed in this publication are those of the author(s) and not necessarily those of the NHS, the National Institute for Health Research or the Department of Health.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Our work involves the previously collected genetic sequencing and phenotype data of human participants in the UK Biobank cohort study. The North West Multi-centre Research Ethics Committee (MREC) gave ethical approval for the UK Biobank.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
All data produced in the present work are contained in the manuscript.