Leveraging large-scale biobank EHRs to enhance pharmacogenetics of cardiometabolic disease medications

Electronic health records (EHRs) coupled with large-scale biobanks offer great promise for unravelling the genetic underpinnings of treatment efficacy. However, medication-induced biomarker trajectories derived from such records remain poorly studied. Here, we extract clinical and medication prescription data from EHRs and conduct GWAS and rare variant burden tests in the UK Biobank (discovery) and the All of Us program (replication) on ten cardiometabolic drug response outcomes, including lipid response to statins, HbA1c response to metformin and blood pressure response to antihypertensives (N = 740–26,669). Our findings at the genome-wide significance level recover previously reported pharmacogenetic signals and also include novel associations for lipid response to statins (N = 26,669) near LDLR and ZNF800. Importantly, these associations are treatment-specific and not associated with biomarker progression in medication-naive individuals. Furthermore, we demonstrate that individuals with higher genetically determined baseline levels of low-density lipoprotein cholesterol and total cholesterol experience a larger absolute, albeit smaller relative, biomarker reduction following statin treatment. In summary, we systematically investigated the common and rare pharmacogenetic contribution to cardiometabolic drug response phenotypes in over 50,000 UK Biobank and All of Us participants with EHR data and identified clinically relevant genetic predictors for improved personalized treatment strategies.

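According to Figure 3a, biomarker levels $Y$ at time $t$ can be modelled as follows:

$$Y_t = \beta_0 \cdot G_0 + (\beta_E + \gamma_E \cdot G_E) \cdot E_t + D_t \cdot (\beta_D + \gamma_D \cdot G_D) + \epsilon_t \tag{1}$$

where $\beta_0$ is the baseline genetic effect, $G$ the genetics (with $G_0$, $G_E$ and $G_D$ denoting the genetic components acting on baseline levels, on the environmental effect and on drug response, respectively), $\beta_E$ the environmental effect, $E$ the environment, $\gamma_E$ the gene-environment interaction effect, $D$ the indicator of drug use, $\beta_D$ the drug effect, $\gamma_D$ the pharmacogenetic effect and $\epsilon_t$ the error term.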
The drug response phenotype, defined as the difference between post-treatment $Y_{t1}$ and baseline $Y_{t0}$ biomarker levels, can thus be modelled as follows:

$$Y_{t1} - Y_{t0} = (D_{t1} - D_{t0}) \cdot (\beta_D + \gamma_D \cdot G_D) + (\beta_E + \gamma_E \cdot G_E)(E_{t1} - E_{t0}) + \Delta\epsilon_{01} \tag{2}$$

$$Y_{t1} - Y_{t0} = \beta_D + \gamma_D \cdot G_D + \delta_{01} \tag{3}$$

where the baseline genetics of $Y_{t1}$ and $Y_{t0}$ cancel each other out, $D_{t1}$ and $D_{t0}$ correspond by definition to 1 and 0, respectively, and $(\beta_E + \gamma_E \cdot G_E)(E_{t1} - E_{t0}) + \Delta\epsilon_{01}$ are regrouped under $\delta_{01}$, assuming that interactions between genetics and changing environments are negligible (see control GWAS on longitudinal change in Figures S9-S10). The same derivation applies to the logarithmic difference $\log(Y_{t1}) - \log(Y_{t0}) = \log(Y_{t1}/Y_{t0})$, which can be interpreted as a relative change in biomarker levels.
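The two response definitions above (absolute and logarithmic relative difference) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `drug_response` and the example values are hypothetical, and averaging over multiple measures follows the "average baseline and post-treatment values over multiple measures if available" setting mentioned in the figure captions.

```python
import math
from statistics import fmean


def drug_response(baseline_measures, post_measures):
    """Return (absolute, log-relative) biomarker change for one individual.

    Hypothetical helper: multiple measures on either side are averaged
    before the difference is taken.
    """
    if not baseline_measures or not post_measures:
        raise ValueError("need at least one baseline and one post-treatment measure")
    baseline = fmean(baseline_measures)
    post = fmean(post_measures)
    if baseline <= 0 or post <= 0:
        raise ValueError("log-relative change requires positive biomarker levels")
    absolute = post - baseline                           # Y_t1 - Y_t0
    log_relative = math.log(post) - math.log(baseline)   # log(Y_t1 / Y_t0)
    return absolute, log_relative


if __name__ == "__main__":
    # Made-up LDL-C values in mmol/L, for illustration only.
    abs_change, log_change = drug_response([4.0, 4.4], [2.0, 2.2])
    print(f"absolute change: {abs_change:.2f}")
    print(f"relative change: {math.exp(log_change) - 1:.0%}")
```

Exponentiating the logarithmic difference and subtracting 1 recovers the familiar percentage change, which is why the log scale is read as a relative response.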
Adjusting drug response phenotypes or change scores for baseline biomarker levels $Y_{t0}$ wrongly introduces baseline genetic effects into expression 3, which results in the estimation of $\beta_0$ in addition to $\gamma_D$, as we elaborate in the following.
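To simplify the calculations, let us assume that all these variables ($Y$, $G$, $E$) are scaled to have zero mean and unit variance. When we regress $Y_{t1}$ onto $Y_{t0}$, the regression estimate $\alpha$ will be

$$\alpha = \frac{\mathrm{Cov}(Y_{t1}, Y_{t0})}{\mathrm{Var}(Y_{t0})} = \mathrm{Cov}(Y_{t1}, Y_{t0}) \tag{4}$$

Thus the residual from such a regression will be

$$Y_{t1} - \alpha \cdot Y_{t0} \tag{5}$$

Note that $\alpha$ only changes by a constant 1 if we regress the biomarker difference $Y_{t1} - Y_{t0}$ onto $Y_{t0}$ instead of $Y_{t1}$ onto $Y_{t0}$, making these two approaches equivalent. This can be shown as follows:

$$\frac{\mathrm{Cov}(Y_{t1} - Y_{t0}, Y_{t0})}{\mathrm{Var}(Y_{t0})} = \alpha - 1, \qquad (Y_{t1} - Y_{t0}) - (\alpha - 1) \cdot Y_{t0} = Y_{t1} - \alpha \cdot Y_{t0}$$

Rearranging Equation 5 results in:

$$Y_{t1} - \alpha \cdot Y_{t0} = (1 - \alpha) \cdot \beta_0 \cdot G_0 + (\beta_E + \gamma_E \cdot G_E)(E_{t1} - \alpha \cdot E_{t0}) + (D_{t1} - \alpha \cdot D_{t0})(\beta_D + \gamma_D \cdot G_D) + \epsilon_{t1} - \alpha \cdot \epsilon_{t0}$$

When running a GWAS on this residual response phenotype, its correlation with $G_0$ will be, up to the scaling of the residual and assuming $G_0$ is uncorrelated with the remaining terms,

$$\mathrm{Cov}(Y_{t1} - \alpha \cdot Y_{t0}, G_0) = (1 - \alpha) \cdot \beta_0$$

This means that when regressing the post-treatment biomarker level on the pre-treatment level and running a GWAS on the residuals, we expect to see a strong (spurious) genetic correlation with the genetic basis of the (time-invariant) baseline effect. Note that this correlation between the residual and the baseline genetics is identical whether we use drug-naive or pre- vs post-treated individuals. This observation further confirms that genetic "discoveries" based on residual response definitions are likely to be nonspecific to the treatment.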
If we examine the correlation between these residuals and the underlying drug response genetics, we have (again up to the scaling of the residual)

$$\mathrm{Cov}(Y_{t1} - \alpha \cdot Y_{t0}, G_D) = (D_{t1} - \alpha \cdot D_{t0}) \cdot \gamma_D$$

Thus, this correlation in drug-naive samples (where $D_{t0} = D_{t1} = 0$) is zero, but in post- vs pre-treated samples (where $D_{t0} = 0$ and $D_{t1} = 1$) it is $\gamma_D$. It is clear that if we simply use the post-treatment vs baseline difference, i.e. $Y_{t1} - Y_{t0}$, its correlation with $G_0$ is zero and its correlation with $G_D$ is $\gamma_D$. Therefore, it is strongly recommended to use the simple post-treatment minus baseline biomarker difference to elucidate the pure treatment-specific genetic effects.
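The argument above can be checked numerically with a small simulation. The sketch below is illustrative only: all parameter values are arbitrary assumptions, the gene-environment interaction is set to zero (the "negligible" assumption in the text), environments at the two time points are drawn independently, and covariances are reported instead of correlations because the simulated phenotypes are not rescaled to unit variance.

```python
import random


def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def simulate(n=20000, beta0=0.6, beta_e=0.3, beta_d=-1.0, gamma_d=0.3,
             noise_sd=0.5, seed=1):
    """Simulate pre/post-treatment biomarkers and compare both phenotypes.

    All effect sizes are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    g0 = [rng.gauss(0, 1) for _ in range(n)]  # baseline genetics G_0
    gd = [rng.gauss(0, 1) for _ in range(n)]  # drug response genetics G_D
    y0, y1 = [], []
    for i in range(n):
        e0, e1 = rng.gauss(0, 1), rng.gauss(0, 1)
        # D_t0 = 0 (untreated baseline), D_t1 = 1 (on treatment)
        y0.append(beta0 * g0[i] + beta_e * e0 + rng.gauss(0, noise_sd))
        y1.append(beta0 * g0[i] + beta_e * e1
                  + (beta_d + gamma_d * gd[i]) + rng.gauss(0, noise_sd))

    # Simple difference phenotype: Y_t1 - Y_t0
    diff = [b - a for a, b in zip(y0, y1)]

    # Baseline-adjusted phenotype: residual of regressing Y_t1 on Y_t0.
    # The intercept is omitted because it does not affect covariances.
    alpha = covariance(y1, y0) / covariance(y0, y0)
    resid = [b - alpha * a for a, b in zip(y0, y1)]

    return {
        "alpha": alpha,
        "cov_diff_g0": covariance(diff, g0),
        "cov_diff_gd": covariance(diff, gd),
        "cov_resid_g0": covariance(resid, g0),
        "cov_resid_gd": covariance(resid, gd),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the difference phenotype should show a covariance with $G_0$ close to zero and a covariance with $G_D$ close to $\gamma_D$, whereas the baseline-adjusted residual should retain a covariance with $G_0$ close to $(1 - \alpha) \cdot \beta_0$.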

QC 1: Prior EHR record
EHR record (other than investigated conditions) at least two years before medication start.
QC 2: Baseline measure within 100/365 days (stringent/lenient) before and up to 7 days after medication start

QC 5: Prior related medication
Removal of individuals having taken medication from the same broad medication class (lipid-lowering, antidiabetic, antihypertensive) within the year preceding the primary medication start. Primary medication can also act as add-on therapy in certain cases. This was the case for sulfonylureas in conjunction with metformin, antilipemic agents other than statins (e.g. fenofibrates) in conjunction with statins, and beta blockers and loop diuretics in conjunction with antihypertensives*.

QC 6: Prescription after post-measure
Removal of individuals with no prescription from the same broad medication class after post-measure.

QC 7: Treatment change
Removal of individuals for whom an additional drug from the same broad medication class was prescribed between medication start and post-measure (either medication switch or add-on).

QC 8: Dose change
Removal of individuals with dose change between medication start and post-measure. The average dose is taken when multiple doses are present.

QC 9: Regular prescriptions
Removal of individuals with no regular prescriptions between medication start and post-measure. Regular prescriptions are defined as completeness above 60%/30% (stringent/lenient), where a completeness of 100% means a prescription at least every two months for the duration.

QC 3: Minimum baseline level
Removal of individuals with a baseline level below a required minimum.

*For antihypertensives, primary medication (ACEi, CCB and thiazide diuretics) could also act as add-on therapy to beta blockers and loop diuretics. However, medication prescribed before the primary medication start was also required to be prescribed afterwards (at least until the post-treatment measure). If a beta blocker or loop diuretic was started after the prescription start of the primary medication, this would count as "treatment change" and the individual would be removed.
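Two of the QC steps above (the baseline window in QC 2 and the prescription completeness in QC 9) lend themselves to a short sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, window boundaries are treated as inclusive for QC 2, and "a prescription at least every two months" is approximated by splitting the treatment period into consecutive 60-day windows and counting the fraction that contain at least one prescription.

```python
from datetime import date, timedelta


def has_baseline_in_window(measure_dates, med_start, days_before, days_after=7):
    """QC 2 sketch: is there a baseline measure close enough to medication start?

    days_before would be 100 or 365 depending on the filtering scenario.
    Boundaries are inclusive here, which is an assumption.
    """
    earliest = med_start - timedelta(days=days_before)
    latest = med_start + timedelta(days=days_after)
    return any(earliest <= d <= latest for d in measure_dates)


def prescription_completeness(prescription_dates, med_start, post_measure,
                              window_days=60):
    """QC 9 sketch: fraction of consecutive windows with >= 1 prescription.

    A value of 1.0 means every (approximately two-month) window between
    medication start and post-measure contains a prescription. The 60-day
    window length is an assumed reading of "every two months".
    """
    if post_measure <= med_start:
        raise ValueError("post-measure must be after medication start")
    covered = 0
    n_windows = 0
    window_start = med_start
    while window_start < post_measure:
        window_end = min(window_start + timedelta(days=window_days), post_measure)
        n_windows += 1
        if any(window_start <= d < window_end for d in prescription_dates):
            covered += 1
        window_start = window_end
    return covered / n_windows


if __name__ == "__main__":
    # Made-up dates, for illustration only.
    start = date(2020, 1, 1)
    post = start + timedelta(days=180)
    prescriptions = [start, start + timedelta(days=70)]
    completeness = prescription_completeness(prescriptions, start, post)
    print(f"completeness: {completeness:.0%}")
    print("passes 60% threshold:", completeness > 0.60)
    print("passes 30% threshold:", completeness > 0.30)
```

Keeping each QC step as its own function makes it straightforward to apply the two threshold sets side by side and to tabulate how many individuals each step removes.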

Figure S1: Flow diagram of quality control steps. After selecting individuals taking the primary medication of interest, individuals with missing biomarker measures, medication therapy changes before post-treatment measures, irregular prescriptions, or not enrolled in the health care system before the medication start were removed. Stringent filtering criteria are written in red and lenient ones in blue. Medication/phenotype-specific criteria are written in brown.

Figure S3: Number of individuals in each UK Biobank drug response cohort and reasons for removal (stacked barplot). The height of the bar represents the number of individuals having at least one prescription of the investigated drug. The bottom grey bar represents the number of individuals after QC steps. Note that some filtering reasons are not mutually exclusive. For instance, baseline-medication time filtering was done after checking for prior related medication. Therefore, for metformin-HbA1c in the lenient scenario, it seems that more individuals were filtered out because of baseline-medication time than in the stringent scenario. However, given that individuals with previous sulfonylurea use were excluded in the stringent, but included in the lenient filtering setting, there is a larger pool of individuals in the lenient scenario for whom baseline measures are potentially missing. The same reasoning holds for antihypertensives, where individuals with prior beta blocker and loop diuretic prescriptions were included in the lenient filtering scenario (see Figure S2).

Figure S4: HbA1c response to metformin, SBP response to beta blocker and HR response to beta blocker GWAS results. Plots on the left show GWAS results for the absolute biomarker difference (post − baseline level) and plots on the right the results for the logarithmic relative difference (log(post) − log(base)). GWAS results correspond to the lenient filtering scenarios with average baseline and post-treatment values over multiple measures if available. Genome-wide significant loci are annotated with the closest gene. The horizontal line denotes genome-wide significance (p-value < 5e-8).

Figure S5: SBP response to first-line antihypertensives GWAS results. Plots on the left show GWAS results for the absolute biomarker difference (post − baseline level) and plots on the right the results for the logarithmic relative difference (log(post) − log(base)). GWAS results correspond to the lenient filtering scenarios with average baseline and post-treatment values over multiple measures if available. Genome-wide significant loci are annotated with the closest gene. The horizontal line denotes genome-wide significance (p-value < 5e-8).

Figure S6: LDL-C response to statins GWAS results in the different filtering scenarios. Plots on the left show GWAS results for the absolute biomarker difference (post − baseline level) and plots on the right the results for the logarithmic relative difference (log(post) − log(base)). For stringent and lenient filtering scenarios, single baseline and post-treatment measures and average values over multiple measures, if available, were tested. Results for lenient filtering and multiple measures are shown in Figure 3. Genome-wide significant loci are annotated with the closest gene. The horizontal line denotes genome-wide significance (p-value < 5e-8).

Figure S9: Longitudinal biomarker change GWAS in medication-naive individuals for LDL, TC and HDL. Plots on the left show GWAS results for the absolute biomarker difference (post − baseline level) and plots on the right the results for the logarithmic relative difference (log(post) − log(base)). Genome-wide significant loci are annotated with the closest gene. The horizontal line denotes genome-wide significance (p-value < 5e-8).

Figure S10: Longitudinal biomarker change GWAS in medication-naive individuals for HbA1c, HR and SBP. Plots on the left show GWAS results for the absolute biomarker difference (post − baseline level) and plots on the right the results for the logarithmic relative difference (log(post) − log(base)). Genome-wide significant loci are annotated with the closest gene. The horizontal line denotes genome-wide significance (p-value < 5e-8).

Figure S12: GWAS results for baseline-adjusted drug response phenotypes (SBP response to first-line antihypertensives). GWAS results correspond to the lenient filtering scenarios with average baseline and post-treatment values over multiple measures if available. Genome-wide significant loci are annotated with the closest gene. The horizontal line denotes genome-wide significance (p-value < 5e-8).