RT Journal Article
SR Electronic
T1 To weight or not to weight? Studying the effect of selection bias in three large EHR-linked biobanks
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2024.02.12.24302710
DO 10.1101/2024.02.12.24302710
A1 Maxwell Salvatore
A1 Ritoban Kundu
A1 Xu Shi
A1 Christopher R Friese
A1 Seunggeun Lee
A1 Lars G Fritsche
A1 Alison M Mondul
A1 David Hanauer
A1 Celeste Leigh Pearce
A1 Bhramar Mukherjee
YR 2024
UL http://medrxiv.org/content/early/2024/02/13/2024.02.12.24302710.abstract
AB Objective: To explore the role of selection bias adjustment by weighting electronic health record (EHR)-linked biobank data for commonly performed analyses. Materials and methods: We mapped diagnosis (ICD code) data to standardized phecodes from three EHR-linked biobanks with varying recruitment strategies: All of Us (AOU; n=244,071), Michigan Genomics Initiative (MGI; n=81,243), and UK Biobank (UKB; n=401,167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI to be more representative of the US adult population. We used weights previously developed for UKB to represent the UKB-eligible population. We conducted four common descriptive and analytic tasks comparing unweighted and weighted results. Results: For AOU and MGI, estimated phecode prevalences decreased after weighting (weighted-unweighted median phecode prevalence ratio [MPR]: 0.82 and 0.61), while UKB’s estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. Comparing weighted versus unweighted PheWAS for colorectal cancer, the strongest associations remained unaltered and there was large overlap in significant hits.
Weighting shifted the estimated log-odds ratio for sex and colorectal cancer to align more closely with national registry-based estimates. Discussion: Weighting had limited impact on dimensionality estimation and large-scale hypothesis testing but a larger impact on prevalence and association estimation. Results from untargeted association analyses should be followed by weighted analysis when effect size estimation is of interest for specific signals. Conclusion: EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.
Competing Interest Statement: LGF is a Without Compensation (WOC) employee at the VA Ann Arbor, a United States government facility. All other authors declare that they have no competing financial or non-financial interests related to this research.
Funding Statement: This work was funded by National Cancer Institute grant P30CA046592 and the Training, Education, and Career Development Graduate Student Scholarship of the University of Michigan Rogel Cancer Center.
Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The work presented here was reviewed and approved by the University of Michigan Medical School Institutional Review Board (IRBMED) under application HUM00155849.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes
All of Us data are publicly available via researchallofus.org to those who fulfill their requirements. Michigan Genomics Initiative data are available to researchers who receive access approvals from a University of Michigan Institutional Review Board. UK Biobank data are available to researchers who receive approval (see ukbiobank.ac.uk). Code used in these analyses is publicly available at https://github.com/maxsal/biobank_selection_weights.
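
The abstract summarizes the effect of weighting on prevalence through a weighted-unweighted median phecode prevalence ratio (MPR). As a rough illustration of that quantity only, the sketch below computes a weighted and an unweighted prevalence for each phecode and takes the median of their ratios. The function names, the toy phecode labels, and the toy weights are all hypothetical and are not taken from the authors' code or data; the construction of the selection weights themselves is not shown.

```python
from statistics import median


def prevalence(cases, weights=None):
    """Proportion of participants with a phecode, optionally weighted.

    cases   : list of 0/1 indicators, one per participant
    weights : list of non-negative selection weights (hypothetical here);
              if None, every participant counts equally
    """
    if weights is None:
        weights = [1.0] * len(cases)
    total = sum(weights)
    return sum(w * c for w, c in zip(weights, cases)) / total


def median_prevalence_ratio(phecode_indicators, weights):
    """Median over phecodes of (weighted prevalence / unweighted prevalence)."""
    ratios = []
    for cases in phecode_indicators.values():
        unweighted = prevalence(cases)
        if unweighted > 0:  # skip phecodes with no cases to avoid dividing by zero
            ratios.append(prevalence(cases, weights) / unweighted)
    return median(ratios)


# Toy data for five participants (illustrative only, not from any biobank).
toy_phecodes = {
    "phecode_A": [1, 1, 0, 0, 0],
    "phecode_B": [1, 0, 1, 0, 0],
    "phecode_C": [0, 0, 0, 1, 1],
}
toy_weights = [0.5, 0.5, 1.0, 2.0, 2.0]

mpr = median_prevalence_ratio(toy_phecodes, toy_weights)
print(f"Toy MPR: {mpr:.3f}")
```

An MPR below 1 in this toy setup means that, for the typical phecode, the weighted prevalence is lower than the unweighted one, which is the direction the abstract reports for AOU and MGI.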