To weight or not to weight? Studying the effect of selection bias in three large EHR-linked biobanks

Objective: To explore the role of selection bias adjustment by weighting electronic health record (EHR)-linked biobank data for commonly performed analyses. Materials and methods: We mapped diagnosis (ICD code) data to standardized phecodes from three EHR-linked biobanks with varying recruitment strategies: All of Us (AOU; n=244,071), Michigan Genomics Initiative (MGI; n=81,243), and UK Biobank (UKB; n=401,167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI to be more representative of the US adult population. We used weights previously developed for UKB to represent the UKB-eligible population. We conducted four common descriptive and analytic tasks comparing unweighted and weighted results. Results: For AOU and MGI, estimated phecode prevalences decreased after weighting (weighted-unweighted median phecode prevalence ratio [MPR]: 0.82 and 0.61), while UKB’s estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. Comparing weighted versus unweighted PheWAS for colorectal cancer, the strongest associations remained unaltered and there was large overlap in significant hits. Weighting affected the estimated log-odds ratio for sex and colorectal cancer to align more closely with national registry-based estimates. Discussion: Weighting had limited impact on dimensionality estimation and large-scale hypothesis testing but impacted prevalence and association estimation more. Results from untargeted association analyses should be followed by weighted analysis when effect size estimation is of interest for specific signals. Conclusion: EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.


Inverse probability weighting
In MGI, we estimated the first term, ( !"#!$%&' = 1|), by fitting a simplex regression model for the known design probabilities using NHIS data.We estimated the numerator of the second term, ( = 1|,  &'' = 1), using a logistic regression model.We considered the set of selection factors, : age (≥ 50 indicator), female sex, BMI (categorical), non-Hispanic White race/ethnicity, and EHR-derived binary indicators for anxiety, depression, diabetes, cancer, and hypertension (variable definitions in Table S2).Cancer was not included directly in the estimation procedure above because the small prevalence of cancer in NHIS led to unstable model fitting. 1 Instead, a cancer factor,  (&%(!$ , defined as )(+,-./0|) , was estimated by fitting logistic regression models with the same .The probabilities, , were multiplied by this factor (i.e.,  (&%(!$ ).
In AOU, we flexibly selected  by splitting the data in half and fitting a lassopenalized logistic regression model on  and all possible pairwise interactions using the glmnet R package (version 4.1-8).We considered a set of selection factors, : age ((≥ 50 indicator), female sex, non-Hispanic White race/ethnicity, non-heterosexual orientation (yes/no), health insurance coverage status (yes/no), annual household/family income (≥ $75,000), educational attainment (at least high school graduate or equivalent), and region of residence (indicators for West, South, and Northeast) (variable definitions in Table S2).Using 10-fold cross-validation, we selected the largest  such that the error is within 1 standard error of the minimum to result in a parsimonious model.Of the 55 possible main effect and interaction terms, 39 were selected by this model (Table S3) and (along with the main effect for West region) were then used as the final set of  to estimate IP weights in the other half of the data as described for MGI above.The indicator variables for income, health insurance status, and non-Hispanic White race/ethnicity were the three most important variables (Figure S2).In both cohorts, the resulting probabilities were winsorized at the 2.5 th and 97.5 th percentiles.
We note that augmented inverse probability weighting (AIPW) is a doubly robust weighting method that may be of interest to the reader; see [2][3][4] .
We note that other there are other weighting methods relying only on summary statistics like calibration, raking, and pseudo likelihood that may be of interest to the reader; see [5][6][7] .

Correlations
We also explored the correlation structure of unweighted and weighted phenomes through partial correlations.Unweighted partial correlations were calculated between pairs of traits,  and , adjusted for age and sex, using the ppcor R package (version 1.1). 8Weighted partial correlations were approximated as the coefficient  9 from the weighted multiple linear regression model  =  : +  9  +   , where , , and  were mean standardized and  were age and female sex.For ,  pairs where one trait was sex-specific, the other trait was limited to individuals of that sex, and sex was not included as a covariate.Network graphs of correlations with absolute values greater than 0.3 were constructed to visually inspect the structure.All traits were treated as binary based on the presence of a single phecode in the EHR.(See Section S1 for results).

PheWAS
The data were prepared as described in Salvatore and colleagues 9 at the one-year prior to colorectal cancer diagnosis threshold.For sex-specific phecodes, those with discordant sex were treated as missing.(Of note, some ICD codes do not map to phecodes).Logistic regression models were fit as follows: where CA_101.41(the phecode for colorectal cancer) is an indicator for the outcome,  represents the exposure phecode  (indicator), and covariates are age at one-year prior to colorectal cancer diagnosis (continuous), female sex (indicator), and length of EHR follow-up (continuous).
Phenomewide significant hits were identified using a conservative multiple testing corrected threshold of 0.05 divided by the number of total tests.Weighted logistic regression models were fit using svyglm from the survey R package. 10In cases where a given exposure phecode did not have both (1)

Section S1. Unweighted and weighted partial correlations
Network diagrams depicting unweighted and weighted partial correlation coefficients with absolute values greater than 0.3 (an arbitrary threshold) in AOU is shown in Figure S5 (MGI and UKB shown in Figures S6 and S7).We can see clusters of correlated traits within endocrine/metabolic and musculoskeletal categories, as well as a cluster including both digestive and neurological traits.A small reduction in correlations with absolute values greater than 0.3 were observed after weighting (2,533 vs. 2,474).Interestingly, we see strong correlations with neoplasm traits in MGI (Figure S6), which largely disappear after weighting.There are distinct clusters within musculoskeletal traits and across circulatory system and endocrine/metabolic traits in UKB, which remain after weighting.The number of strong (absolute value > 0.3) correlations in UKB slightly increases after weighting (1,674 vs 1,757).Figures S8 and S9 depict the distribution of the unweighted and weighted partial correlation coefficients in each cohort, respectively.Generally, correlations tend to be highest in MGI followed by AOU and then UKB.Comparing the two US-based cohorts, AOU (Figure S5) and MGI (Figure S6), we see that, while the prevalences of traits involved in these networks are comparable, the network in MGI is denser compared to AOU.

Section S2. Comments on methodological considerations in EHR-based data analysis
Weighting-based analytic approaches present a relatively simple way for researchers to improve the generalizability of their results and help reduce (not remove) selection bias.IP weights are preferred to PS weights though they rely on the assumption that the weighting model is correctly specified.Regression-based weights can be made more flexible through the use of indicator variables (as in our AOU IPweights and in van Alten and colleagues 13 ), though non-parametric methods like random forest can be used.When individual-level data from the target population is not available, PS weights can be estimated using summary-level strata probabilities (provided these probabilities are conditionally independent).When selection weights are unavailable, methods like covariate or propensity score adjustment, which are simple to implement, can be considered to address in some situations where selection bias is a concern.
Beyond introductory papers, [14][15][16][17][18][19] substantial work has focused specifically on traditional methodological concerns including confounding, 20,21 misclassification, 7,22,23 missing data, [24][25][26][27][28][29][30] and selection bias and cohort representativeness 1,6,7,31-34 related to EHR-based cohorts.For example, traits defined using the phecode framework have demonstrated reduced misclassification compared to ICD codes. 35One method to further reduce the impact of misclassification, described by Hubbard and colleagues, relies on EHR-derived probabilistic phenotyping. 22Others have described methods using manual chart review on a subset of data to improve EHR-derived phenotypes. 23,36,37Beesley and Mukherjee developed three novel likelihood-based bias correction strategies to address outcome misclassification of EHR-derived disease status. 7Teixeira and colleagues explored incorporation of unstructured data like doctors notes, which improved the identification of hypertensive individuals compared to using ICD codes and blood pressure reading cutoffs alone. 38Missing data is another issue that has received attention to avoid loss of power and inducing selection bias (via complete case analyses) and aid in meeting assumptions necessary for multiple imputation. 25One avenue is using non-missing genotype data available in EHR-linked biobanks to inform imputation, which demonstrated improvements in imputation of cardiovascular related measurements. 39This idea could be extended using exposure polygenic risk scores 40 to inform imputation of missing exposure data.
One consideration broadly applicable in health research but is particularly acute in EHR-based analyses is target validity.Westreich and colleagues have defined this as a joint measure of internal and external validity of an effect estimate with respect to a specific target population. 41Historically, internal validity, the notion that an estimate reflects the true underlying parameter in the study population, has taken precedence over external validity, that the parameter in the study population is representative of the true parameter in the target population.However, because of observation mechanisms and recruitment strategies into EHR-linked biobanks, the target population is almost certainly never (1) exactly the study sample or (2) the population of which the study sample is a simple random sample. 41EHR researchers should think critically regarding who the results are intended for or representative of before beginning an analysis and make their target populations explicit in their work.We believe it is critical for researchers to consider weighted approaches that account for both the observation and recruitment mechanisms in each cohort (including potential subcohorts) and differences in the distribution of key characteristics between the analytic cohort and the target population.
We want to highlight some considerations that are hallmarks of EHR analysis.One such consideration is informed presence, defined by Goldstein and colleagues as "the notion that inclusion in an EHR is not random but rather indicates that the subject is ill, making people in EHRs systematically different from those not in EHRs." 42 This resulting discrepancy harms generalizability to general populations who tend to be healthier than those in the EHR data sample and results in bias.This concept extends to individuals within the EHR -those that are sicker tend to have more encounters and records than those who are healthier -and, in some cases, to records in the EHR (e.g., lab results).This phenomenon is illustrated by Agniel and colleagues, which shows that the presence and timing of laboratory results was more informative than the value of the laboratory results themselves. 435][46] Including EHR metadata, like length of follow-up, number of encounters, density of laboratory measurements, and visit type (e.g., outpatient vs inpatient vs emergency), and careful selection or matching of controls in analyses are recommended to improve exchangeability and attempt to make EHR observation mechanisms comparable.

Section S3. Investigation into infectious diseases peak in AOU PheWAS using phecode 1.2 mapping tables
An earlier version of the manuscript was performed using the phecode 1.2 mapping tables instead of phecode X.The Manhattan plot representing the colorectal cancer PheWAS in AOU in Figure S14A shosw a peak in the infectious disease category.The top hit is Human immunodeficiency virus [HIV] disease, or phecode 071 in the phecode 1.2 mapping tables.It is well established that there is no association between HIV status and colorectal cancer. 47,48We investigated the underlying ICD codes that are qualify as a colorectal cancer case.Our analyses in the manuscript use the phecode mapping table present in the PheWAS R package (version 1.2). 49,50We also show qualifying ICD codes for a different phecode mapping table (version X), 51,52 which defines over 3,600 traits.The results of the differences in qualifying ICD codes, number of individuals with the ICD code, and the number (and percent) overlap with individuals who have HIV according to their version 1.2 defined phecode are summarized in Table S5.We see that there is significant overlap between individuals with ICD codes for anal Pap smears, inconclusive results and carcinoma in situ and HIV status.These codes are present in the version 1.2 mapping table, but not in the version X mapping table .Codes present in the version 1.2 definition also include malignant neoplasms of the anus, but not in the version X definition.And there is evidence that people living with HIV experience higher incidence of anal cancer. 53Because version X has more traits, there is greater separation between colorectal cancer and anal cancer.We also present the 2x2 tables and crude odds ratios between these two different definitions of colorectal cancer and HIV status in Table S6.The crude odds ratio in the time-restricted phenome (t = 1) is 10.29 using version 1.2 colorectal cancer mapping and 0.32 using version X mapping.

Figure S1 .Figure S2 .
Figure S1.Flowchart depicting several common data tasks.This flowchart is subjective and not exhaustive.

Figure S5 .Figure S8 .
Figure S5.Unweighted (panel A) and inverse probability-weighted (panel B) network plots of the partial correlation structure of medical phenomes in All of Us.Correlation coefficients are adjusted for age and sex.Only correlations with an absolute value greater than or equal to 0.3 are shown.The size of the nodes corresponds to the prevalence of the trait in its cohort and the color corresponds to the phecode category.Corresponding figures for MGI and UKB are in Figures S6 and S7, respectively.

Figure S10 .
Figure S10.Venn diagrams comparing the overlap in phenome-wide significant hits from unweighted and weighted colorectal cancer PheWAS in AOU, MGI, and UKB.

Figure S12 . 4 . 5 10-
Figure S12.Manhattan plots summarizing poststratification-weight (panels A-D) phenomewide association studies for colorectal cancer in All of Us and the Michigan Genomics Initiative and the inverse probability weighted UK Biobank using 1:2 case:non-case matched data restricted to one year prior to initial diagnosis along with the corresponding meta-analsysis.The dashed red line represents the Bonferroni-corrected p-value threshold (-log10(0.05/number of traits)).The five traits with the smallest pvalues are labeled.The upward (downward) orientation of the triangle indicates a positive (negative) association.Plots corresponding to unweighted and IP-weighted PheWAS are presented in Figure 4.

Figure S14 .
Figure S14.Manhattan plots summarizing unweighted (panels A-C) phenomewide association studies for colorectal cancer in All of Us, the Michigan Genomics Initiative, and UK Biobank using 1:2 case:non-case matched data restricted to one year prior to initial diagnosis.Panel D shows the unweighted meta-analysis PheWAS, respectively.The dashed red line represents the Bonferroni-corrected p-value threshold (-log10(0.05/number of traits)).The five traits with the smallest p-values are labeled.The upward (downward) orientation of the triangle indicates a positive (negative) association.

Table S1 .
Phenotypes defined in paper and their qualifying phecode definitions

Table S2 .
Definition of variables by cohort used throughout paper * visit https://www.cdc.gov/nchs/nhis/2019nhis.htm for more information Distribution of weighted partial correlations across medical phenomes.Partial correlations were adjusted for age and, if both codes in the pair applied to both sexes, sex.IP-based weights were used for AOU and MGI and IP-based weighted developed by van Alten and colleagues 11 were used for UKB.
Venn diagrams comparing the overlap in phenome-wide significant hits from meta-analysis PheWAS.Meta-analysis PheWAS of Colorectal cancer [CA_101.41] at t = 1

Table S5 .
Comparison between ICD codes by colorectal cancer phecode mapping table, count with ICD code, and overlap with individuals who have HIV phecode (sorted by proportion of overlap).

Table S6 .
Colorectal cancer and HIV contingency table by phecode mapping version and crude odds ratioOther hits in the infectious disease category like HIV infection, symptomatic (phecode 071.1) and viral warts & HPV (phecode 078) share many of the same underlying ICD codes as with phecode 071, which implies similar overlap with colorectal cancer as defined by phecode mapping table version 1.2.