medRxiv

To weight or not to weight? Studying the effect of selection bias in three large EHR-linked biobanks

Maxwell Salvatore, Ritoban Kundu, Xu Shi, Christopher R Friese, Seunggeun Lee, Lars G Fritsche, Alison M Mondul, David Hanauer, Celeste Leigh Pearce, Bhramar Mukherjee
doi: https://doi.org/10.1101/2024.02.12.24302710
Maxwell Salvatore
1Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
2Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
Ritoban Kundu
2Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
3Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
Xu Shi
3Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
Christopher R Friese
4Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA
5Center for Improving Patient and Population Health, School of Nursing, University of Michigan, Ann Arbor, MI, USA
6Department of Health Management and Policy, University of Michigan, Ann Arbor, MI, USA
Seunggeun Lee
3Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
7Graduate School of Data Science, Seoul National University, Seoul, Republic of Korea
Lars G Fritsche
2Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
3Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
4Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA
Alison M Mondul
1Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
4Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA
David Hanauer
8Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI, USA
Celeste Leigh Pearce
1Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
4Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA
Bhramar Mukherjee
1Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA
2Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
3Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
  • For correspondence: bhramar{at}umich.edu

Abstract

Objective To explore the role of selection bias adjustment by weighting electronic health record (EHR)-linked biobank data for commonly performed analyses.

Materials and methods We mapped diagnosis (ICD code) data to standardized phecodes from three EHR-linked biobanks with varying recruitment strategies: All of Us (AOU; n=244,071), Michigan Genomics Initiative (MGI; n=81,243), and UK Biobank (UKB; n=401,167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI to be more representative of the US adult population. We used weights previously developed for UKB to represent the UKB-eligible population. We conducted four common descriptive and analytic tasks comparing unweighted and weighted results.
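
As a rough illustration of what a selection weight does, the sketch below computes simple poststratification weights: each participant is weighted by the ratio of their stratum's share in the target population to its share in the sample. This is only one common weighting approach and is not claimed to be the procedure used in the paper; the function name, the age strata, and the target shares are hypothetical toy values, not National Health Interview Survey figures.

```python
from collections import Counter


def poststratification_weights(sample_strata, target_shares):
    """Return one weight per participant: target share / sample share of their stratum."""
    n = len(sample_strata)
    counts = Counter(sample_strata)
    weights = []
    for stratum in sample_strata:
        sample_share = counts[stratum] / n
        weights.append(target_shares[stratum] / sample_share)
    return weights


# Toy sample that over-represents older participants (hypothetical strata).
sample = ["18-39"] * 2 + ["40-64"] * 3 + ["65+"] * 5
# Hypothetical target-population shares; they must sum to 1.
target = {"18-39": 0.4, "40-64": 0.4, "65+": 0.2}

weights = poststratification_weights(sample, target)
# Under-represented strata receive weights above 1, over-represented strata below 1.
print([round(w, 3) for w in weights])
```

With these weights, the weighted stratum shares in the sample equal the target shares, and the weights sum to the sample size.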

Results For AOU and MGI, estimated phecode prevalences decreased after weighting (weighted-unweighted median phecode prevalence ratio [MPR]: 0.82 and 0.61), while UKB’s estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. Comparing weighted versus unweighted PheWAS for colorectal cancer, the strongest associations remained unaltered and there was large overlap in significant hits. Weighting affected the estimated log-odds ratio for sex and colorectal cancer to align more closely with national registry-based estimates.
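
The median prevalence ratio reported above can be read as follows: for each phecode, divide the weighted prevalence by the unweighted prevalence, then take the median across phecodes. The sketch below is a minimal version of that calculation under assumptions; the helper names, the toy weights, and the placeholder codes (`code_A`, `code_B`, `code_C`) are hypothetical, and the paper's exact handling of rare or zero-prevalence phecodes is not stated in the abstract.

```python
from statistics import median


def prevalence(indicator, weights=None):
    """Weighted proportion of participants with the phecode (unweighted if weights is None)."""
    if weights is None:
        weights = [1.0] * len(indicator)
    return sum(w * x for w, x in zip(weights, indicator)) / sum(weights)


def median_prevalence_ratio(phenome, weights):
    """Median over phecodes of weighted prevalence / unweighted prevalence."""
    ratios = []
    for indicator in phenome.values():
        unweighted = prevalence(indicator)
        if unweighted > 0:  # skip phecodes with no cases to avoid dividing by zero
            ratios.append(prevalence(indicator, weights) / unweighted)
    return median(ratios)


# Toy data: four participants, three placeholder phecodes (1 = has the diagnosis).
toy_weights = [2.0, 2.0, 0.5, 0.5]
toy_phenome = {
    "code_A": [0, 0, 1, 1],
    "code_B": [1, 0, 0, 0],
    "code_C": [1, 0, 1, 0],
}

mpr = median_prevalence_ratio(toy_phenome, toy_weights)
print(mpr)
```

A ratio below 1 means weighting lowers the estimated prevalence of that phecode, as reported for AOU and MGI; a ratio above 1 means weighting raises it, as reported for UKB.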

Discussion Weighting had limited impact on dimensionality estimation and large-scale hypothesis testing but impacted prevalence and association estimation more. Results from untargeted association analyses should be followed by weighted analysis when effect size estimation is of interest for specific signals.
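
To make the point about effect size estimation concrete, the sketch below compares an unweighted and a weighted log-odds ratio for a single binary exposure and a binary outcome, using weighted cell totals of the 2x2 table. For one binary covariate this gives the same point estimate as a weighted logistic regression, but it is a simplified stand-in rather than the authors' model, and it produces no standard errors. The function name and toy data are hypothetical.

```python
import math


def weighted_log_odds_ratio(exposure, outcome, weights=None):
    """Log-odds ratio from (weighted) 2x2 cell totals; fails if any cell total is zero."""
    if weights is None:
        weights = [1.0] * len(exposure)
    cells = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    for e, y, w in zip(exposure, outcome, weights):
        cells[(e, y)] += w
    odds_ratio = (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])
    return math.log(odds_ratio)


# Toy data: six participants with a binary exposure and a binary outcome.
toy_exposure = [1, 1, 1, 0, 0, 0]
toy_outcome = [1, 1, 0, 1, 0, 0]
# Hypothetical selection weights that up-weight two under-sampled participants.
toy_selection_weights = [1.0, 1.0, 2.0, 2.0, 1.0, 1.0]

unweighted_log_or = weighted_log_odds_ratio(toy_exposure, toy_outcome)
weighted_log_or = weighted_log_odds_ratio(toy_exposure, toy_outcome, toy_selection_weights)
print(unweighted_log_or, weighted_log_or)
```

In this toy example the apparent association in the unweighted data disappears after weighting, which shows how selection can distort an effect estimate even when the ranking of hits in a large-scale scan is stable.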

Conclusion EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.

Competing Interest Statement

LGF is a Without Compensation (WOC) employee at the VA Ann Arbor, a United States government facility. All other authors declare that they have no competing financial or non-financial interests related to this research.

Funding Statement

This work was funded by National Cancer Institute grant P30CA046592 and the Training, Education, and Career Development Graduate Student Scholarship of the University of Michigan Rogel Cancer Center.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The work presented here was reviewed and approved by the University of Michigan Medical School Institutional Review Board (IRBMED) under application HUM00155849.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All of Us data are publicly available via researchallofus.org to researchers who fulfill the program's access requirements. Michigan Genomics Initiative data are available to researchers who receive access approvals from a University of Michigan Institutional Review Board. UK Biobank data are available to researchers who receive approval (see ukbiobank.ac.uk). Code used in these analyses is publicly available at: https://github.com/maxsal/biobank_selection_weights.


Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted February 13, 2024.

Subject Area

  • Epidemiology