Evaluating the comparability of osteoporosis treatments using propensity score and negative control outcome methods in UK and Denmark electronic health record databases

Evidence on the comparative effectiveness of osteoporosis treatments is heterogeneous. This may be attributed to different populations and clinical practice, but also to differing methodologies ensuring comparability of treatment groups before treatment effect estimation and the amount of residual confounding by indication. This study assessed the comparability of denosumab vs oral bisphosphonate (OBP) groups using propensity score (PS) methods and negative control outcome (NCO) analysis. A total of 280,288 women aged [≥]50 years initiating denosumab or OBP in 2011-2018 were included from the UK Clinical Practice Research Datalink (CPRD) and the Danish National Registries (DNR). Balance of observed covariates was assessed using absolute standardised mean difference (ASMD) before and after PS weighting, matching, and stratification, with ASMD >0.1 indicating imbalance. Residual confounding was assessed using NCOs with [≥]100 events. Hazard ratio (HR) and 95% confidence interval (CI) between treatment and NCO was estimated using Cox models. Presence of residual confounding was evaluated with two approaches: (1) >5% of NCOs with 95% CI excluding 1, (2) >5% of NCOs with an upper CI <0.75 or lower CI >1.3. The number of imbalanced covariates before adjustment (CPRD 22/87; DNR 18/83) decreased, with 2-11% imbalance remaining after weighting, matching or stratification. Using approach 1, residual confounding was present for all PS methods in both databases ([≥]8% of NCOs). Using approach 2, residual confounding was present in CPRD with PS matching (5.3%) and stratification (6.4%), but not with weighting (4.3%). Within DNR, no NCOs had HR estimates with upper or lower CI limits beyond the specified bounds indicating residual confounding for any PS method. Achievement of covariate balance and determination of residual bias were dependent upon several factors including the population under study, PS method, prevalence of NCO, and the threshold indicating residual confounding.


Introduction
Routinely collected data from clinical practice settings have been used to evaluate the realworld effectiveness of osteoporosis treatments in reducing risk of fracture amongst postmenopausal women.Choice of treatment is dependent on a range of factors including patient medical history such as previous fractures, falls and treatment, patient and clinician treatment preference, and effectiveness of treatment for specific fracture sites (1).According to the European guidance and clinical guidelines in the UK, first line treatment typically includes oral bisphosphonates (alendronate, risedronate, ibandronate) and second line includes denosumab if bisphosphonates are not suitable or tolerated (1)(2)(3)(4).
Real-world evidence on the effectiveness of denosumab vs. oral bisphosphonates on fracture risk is inconsistent (5).Studies have shown denosumab was equally effective as oral bisphosphonates in reducing the risk of non-vertebral (6) and hip (7) fractures using US claims and Danish registry data, respectively.However, an analysis of Spanish pharmacy data showed that denosumab reduced the risk of hip and any type of fracture more than oral bisphosphonates (8).Although these differing results may reflect genuine differences in effectiveness due to study populations, settings, treatment guidelines, and comparator groups, confounding by indication may also explain these results, due to the inherent differences between patients using first and second-line treatments.For instance, patient characteristics are likely to differ between treatment groups, which may affect fracture prognosis.
To account for measured confounding, propensity score (PS) methodology can be used to create balanced treatment groups with respect to measured covariates, such as age and history of fracture, by using the PS in matching, stratification, or inverse probability treatment .weighting (IPTW) (9,10).However, information on important confounders such as bone mineral density, may be missing in administrative databases, and thus cannot be adjusted for using PS methods, which can lead to unmeasured confounding.
Negative control outcomes (NCOs) can be used to minimize residual confounding by unmeasured covariates.NCOs are outcomes known not to be causally associated with treatment.NCO methods have been developed to detect the presence of unmeasured (residual) confounding, when an association is found between the treatment and NCO (11,12).However, there are currently no gold standards or guidelines on the threshold for comparability between treatment groups using NCO.Previous studies have defined bias as non-null effect of an NCO using risk difference and confidence interval (CI) (13,14), as well as more than 5% of a large set of NCOs having 95% CI excluding 1 (15).This study aims to determine whether cohorts who received denosumab or oral bisphosphonates were comparable using PS matching, stratification and IPTW, and by applying different rules for presence of residual confounding via NCO analysis, in two European databases.
The objectives were to: 1) Describe osteoporosis treatment groups, denosumab vs oral bisphosphonates, with respect to demographics, clinical history, and prior medication use 2) Assess whether treatment groups are comparable on measured covariates after PS matching, stratification and IPTW.
3) Detect the presence of residual confounding using different stringency rules via NCO analysis within each PS method.
. 4) Assess whether comparability is achieved in subgroups of older patients, post fracture patients, and patients with potentially three years of follow-up, plus a post-hoc analysis of second line users.

Study design and setting
A retrospective, new user and new switcher, active comparator cohort study was implemented (16).
We used data from two European countries, including the UK Clinical Practice Research Datalink (CPRD) GOLD (17) and AURUM (18), which are primary care databases of medical records from general practitioners.CPRD was linked to the following databases: Hospital Episode Statistics Admitted Patient Care, Office for National Statistics mortality data, and the Index of Multiple Deprivation (IMD).Practices that appeared in both the GOLD and AURUM databases were retained in AURUM and excluded from GOLD.
We also used data from the Danish National Registries (DNR), which contains linked data from the Danish Civil Registration System (19), the Danish National Patient Registry (20), and the Danish National Prescription Registry (21).DNR contains dispensations in outpatient pharmacies, inpatient and outpatient hospital clinics encounters, and complete follow-up until death or emigration, set within the universal Danish healthcare system (22).

Participants
We included women aged ≥50 years at the time of initiating denosumab or oral bisphosphonates between 01 January 2011 to 31 December 2018 for UK patients and 01 January 2011 to 31 December 2017 for Danish patients.For the new user cohort, the date of .initiating denosumab or bisphosphonates was defined as the index date (Figure 1).We included patients who had no history of these treatments in the year before the index date prior to treatment initiation and were registered with their practice for at least one year before index.
To reflect the treatment guidelines in place in the UK and Denmark, where oral BPs are recommended as first-line treatments, and denosumab is recommended as second-line treatment, we assembled a cohort of recent treatment switchers in post-hoc analyses.This approach allowed us to compare patients at a similar stage of treatment.The treatment groups were re-defined as follows, (1) denosumab users included patients who switched from bisphosphonates to denosumab with no previous use of denosumab in the one year prior, and (2) bisphosphonate users included patients who switched from one oral bisphosphonate to a new oral bisphosphonate and had no previous prescription for the new bisphosphonate in the one year prior (Figure 1).The index date was defined as the date of the initiation of the new treatment.
Patients were excluded if they had any of the following diagnoses in the five years prior to the index date, or treatments in the year before the index date: Paget's disease of bone, cancer (except for non-melanoma skin cancer) or its associated treatments (hormonal, endocrine, or radiation therapies), end stage renal disease, or prescription for both denosumab and oral bisphosphonates on the index date.
Patients were followed from the index date until the earliest censoring event occurrence: NCO (defined below), death, moved practice, end of study period (31 st December 2019 for CPRD and 31 st December 2018 for DNR), drug discontinuation or treatment group switch, diagnosis of cancer (except for non-melanoma skin cancer) or its associated treatments (hormonal, endocrine, or radiation therapies), end stage renal disease, or a maximum three-year follow-up. .

Treatment groups
The treatment groups were patients initiating (new user) or switching to (switcher) either (1) oral bisphosphonates (alendronate 10mg/day or 70mg/week, ibandronate 150mg/month, risedronate 5mg/day or 35mg/week) or (2) denosumab (60mg/6 months).We used an as-treated exposure definition, where patients were required to be on the initiated or new treatment throughout follow-up.Successive prescriptions were considered as a continuous treatment episode if there was no more than 90 days between prescriptions.Discontinuation was defined as a gap of more than 90 days between prescriptions; date of discontinuation was therefore defined as the end date of drug duration of the last prescription plus an additional 90 days (23).
For bisphosphonates, duration of prescription was calculated as the ratio of the number of tablets and daily dose; if duration was missing, the default of 30 days was used.For denosumab, duration was defined as a default of 180 days in line with the dosing interval indicated for osteoporosis.The end date of prescription was defined as the date of the prescription plus duration of prescription.
Propensity score analysis and covariates PS expressed the probability of being assigned denosumab rather than an oral bisphosphonate conditional on measured covariates.Logistic regression estimated the PS for each patient using over 80 covariates that are known to be associated with risk of fractures or falls (Table S1) (13).Covariates with more than 10 patients in each treatment group were included in the logistic regression model.The PS was then used in three ways to create comparable treatment groups.
In PS matching, each denosumab user was matched with up to five oral bisphosphonate users that had the closest PS within a calliper width of 0.2 of the pooled standard deviation of the .logit of the PS (24,25).In addition, exact matching was performed on year of index date as it was deemed to be an empiric confounder.
In PS stratification, all patients in the study population were divided into ten mutually exclusive strata based on the ranked PS distribution of the denosumab group.Within each stratum, the distribution of covariates between treatment groups should be similar.
In IPTW, each patient was assigned a weight equal to the inverse of their propensity score (1/PS), to create a pseudo-population with a balanced covariate distribution (26).Stabilised weights were calculated by multiplying the weights by the proportion of patients on each treatment to reduce large weights (27).To further minimise large weights, weights were then truncated at 99% percentile of distribution of the stabilised weights (26).Only the IPTW was used in the new switcher cohort.
Balance of each covariate between treatment groups was assessed using the absolute standardised mean difference (ASMD) before and after matching, stratification and IPTW.An ASMD less than 0.1 indicates the treatment groups were balanced for that covariate.

Negative control outcome analysis
NCOs are defined as outcomes not causally associated with the treatment of interest, except through shared confounders (11,12).If the estimated association between treatment and NCO is non-null, one can determine whether there is evidence of residual confounding suggesting treatment groups are not comparable with respect to unmeasured covariates.
A preliminary and non-exhaustive list of NCOs were identified, including fracture and nonfracture NCOs.Based on the FREEDOM trial (28), there was no effect of treatment on risk of fracture in the first three months; therefore, a fracture (hip, vertebral, radius, ulna, wrist, humerus, pelvis, or shoulder) occurring in the first three months of treatment was considered a NCO (29).In addition, NCOs were also sourced from previous studies (13,30,31), and from an automated method identifying potential NCOs (32).The maximum follow up for fracture NCOs was three months.The preliminary list of NCOs is listed in Table S1.
Analysis for a specific NCO was conducted if there were ≥100 events in total.The association between treatment and NCO was estimated using the Cox Proportional Hazards model with robust standard errors, adjusted for year of index and imbalanced covariates with ASMD ≥0.1 for each PS method (33).The estimated hazard ratio (HR) and 95% CI was used to assess for the presence of residual confounding based on two approaches: Approach 1: There were more than 5% of NCOs having a 95% CI excluding 1; the denominator was the number of NCOs with ≥100 events and the numerator was the number NCOs with a CI excluding the null value of one.
Approach 2: There are more than 5% of NCOs having an upper CI of HR <0.75 or lower CI >1.30; the denominator was the number of NCOs with ≥100 events and the numerator was the number of NCOs having CI above or below the specified cut-off values.
Treatment groups were deemed comparable if residual confounding was absent in both Approach 1 and Approach 2. .

Results
Analysis was performed separately for CPRD (combining the GOLD and AURUM databases) and DNR.

CPRD New user cohort
Of the 200,179 eligible patients, 6,528 were denosumab users and 194,191 were bisphosphonate users (Table 1).Most demographics, comorbidities, and prescriptions were balanced between denosumab and bisphosphonate users; however, there were 22 (25.3%)covariates whose distributions were imbalanced (Table 2).Denosumab users were observed to be older, had a higher number of GP visits, hospital admissions, fractures, longer duration of cumulative oral bisphosphate use, higher prevalence of calcium or vitamin D, proton pump inhibitor, lower prescriptions for non-steroidal anti-inflammatory drugs and corticosteroids than bisphosphonate users.
Eighty-seven covariates were used to estimate the PS.The median (interquartile range (IQR)) PS amongst denosumab and bisphosphonate users were 0.123 (0.048, 0.303) and 0.008 (0.004, 0.023), respectively.Figure S1 illustrates the PS distribution is right skewed in both treatment groups however, there were fewer bisphosphonate users with PS above 0.4.
After matching, the PS distribution was identical in the two treatment groups (Figure S1).Three (3.4%) covariates remained imbalanced: cumulative oral bisphosphonate use, vitamin D deficiency, and factor Xa inhibitor prescription (Figure S2).

PS stratification
The PS distribution amongst denosumab and bisphosphonate users were similar in the first seven strata.In the 8 th , 9 th and 10 th strata, the median and IQR was slightly larger for denosumab users than bisphosphonate users.Overall, within stratum there was good PS overlap between the treatment groups (Figure S4).On average, three covariates had ASMD >0.1: cumulative oral bisphosphonate use, recency of fracture, and strontium prescription (Figure S5).There were 47 NCOs with at least 100 events and the effect of treatment on NCO is shown in Figure S6.Approach 1 had eight (17.0%)NCOs (atelectasis, bowel incontinence, delirium, early fracture, ingrown toenail, ankle sprain, strabismus, total hip arthroplasty due to osteoarthritis) with 95% CI excluding one.Approach 2 had three (6.4%)NCOs with 95% CI upper bound <0.75 (ankle sprain) or lower bound >1.30 (ingrown toenail and strabismus).

New switcher cohort
In CPRD, the new switcher study design analysed a smaller number of patients, 2,792 new denosumab switchers, and 14,668 new bisphosphonate switchers.In the pseudo-population, a smaller number of imbalanced covariates (2.4%, 2/82), as compared to the new user population, were observed which included previous fracture and cumulative oral bisphosphonate use (Figure S13).Thirteen NCOs had at least 100 events, of which none met the criteria for residual confounding using Approach 2 (Figure S14).DNR New user cohort DNR was a smaller database containing 79,569 eligible patients of which there were 4,276 denosumab users and 75,293 bisphosphonate users (Table 1).Most covariates were balanced between the two treatment groups; however, 18 (21.7%)covariates were imbalanced (Table 2), a smaller percentage compared to CPRD.Imbalanced patient characteristics were similar to CPRD, except higher prescription of antiparathyroid in denosumab users in DNR.(Table 1).
Figure S7 shows the PS distribution within treatment groups were right skewed however the distributions overlap each other.There were fewer bisphosphonate users with PS greater than 0.6.
Twelve NCOs had at least 100 events with the effect of treatment on NCO shown in Figure S9.
In Approach 1, only one (8.3%)NCO early fracture had its 95% CI excluding one.No NCO had met the criteria of Approach 2.

PS stratification
The PS distribution was similar between treatment groups in the first seven stratum.In the 8 th , 9 th and 10 th strata, the median and IQR was slightly larger for denosumab users than bisphosphonate users.Overall, there is good overlap of PS distributions between the treatment groups (Figure S10).Nine covariates (10.8%) were imbalanced (Figure S11): age, renal disease, Charlson comorbidity score, cohabitation status, number of GP visits, cumulative bisphosphonate use, number of prescriptions, region, and proton pump inhibitor use.
Twenty-six NCOs had at least 100 events and the effect of treatment on NCO is shown in Figure S12.In Approach 1, one (3.8%)NCO early fracture had its 95% CI excluding one.In Approach 2, no NCO had met the criteria.

IPTW
The median (IQR) of weights was 0.68 (0.25, 1.23) and 0.98 (0.97, 0.99) in denosumab and bisphosphonate users respectively.In the pseudo-population, two (2.4%) covariates, age, and cumulative oral bisphosphonate use, were imbalanced (Figure 2b).Twenty-six NCOs had at least 100 events and the effect of treatment on NCO is shown in Figure 3b.In Approach 1, five (19.2%)NCOs had its 95% CIs excluding 1: anorectal disorder, eye injury, haematochezia, incomplete emptying of bladder, and total hip arthroplasty due to osteoarthritis.In Approach 2, no NCO had its 95% CI bound exceeding the indicated threshold of residual confounding.

New switcher cohort
In the new switcher cohorts, 3,726 new denosumab switchers and 2,525 new bisphosphonate switchers were eligible for the new switcher design.In the pseudo-population, a larger number of imbalanced covariates (5.1%, 4/78), as compared to the new user population, were observed which included previous fracture, number of fractures, number of fracture types, and recency of fracture (Figure S15).Comparability could not be assessed as there were no NCOs with at least 100 events. .

Comparability assessment
After adjustment using PS methods, comparability of treatment groups based on meeting criteria for both Approach 1 and 2 was not achieved, with the exception of PS stratification in DNR (Table 3).Approach 1 was found to be more stringent than Approach 2 in both CPRD and DNR databases.Based on Approach 2, comparability was achieved for IPTW in CPRD, and all three PS methods in DNR.IPTW was considered preferable as it passed the criteria set in Approach 2 for both databases.After IPTW, the treatment groups were comparable in the new switcher cohorts in CPRD.

Key results
This study had aimed to determine whether cohorts of patients selected for pharmacological treatment for osteoporosis with denosumab or oral bisphosphonates are comparable with respect to measured and unmeasured confounding.
In the UK and Denmark, denosumab is prescribed as second line treatment if an oral bisphosphonate is not suitable.As expected, baseline differences between the two treatment groups were observed with denosumab users being older, previously on preventative fracture treatment, and generally in worse health than bisphosphonate users.PS methods were used to ensure treatment groups were balanced for a wide range of covariates.Each PS method performed reasonably well in creating balanced groups, although the degree of performance varied by country.IPTW performed the least well in CPRD but performed the best in DNR, whereas stratification performed the best in CPRD but the worst in DNR; PS matching performed reasonably well in both databases, although it reduced sample size.
Evidence of residual confounding was based on satisfying both rules: more than 5% of NCOs had its 95% CI excluding the null value of one (Approach 1) or the CI was wholly outside the range of 0.75 to 1.30 (Approach 2).Based on both approaches, in CPRD, all PS methods showed evidence of residual confounding suggesting the new user treatment groups were not comparable.In DNR, PS stratification but not PS matching or IPTW showed treatment groups were comparable.Using Approach 1, large sample sizes may lead to narrow 95% CI and limit the possibility of crossing the null, even when effect sizes are clinically insignificant.
Therefore, evidence of residual confounding based on Approach 2 alone may be considered for future analyses of comparative effectiveness.
IPTW was chosen to proceed with further as this satisfied the threshold in Approach 2 for both data sets.The new switcher analysis in DNR was not possible due to a small number of patients; in CPRD, the treatment groups were shown to be comparable.

Comparison with previous studies
PS and NCO methods are popular approaches to account for confounding; however, there is no consensus on which threshold to use to determine whether treatment groups are comparable.
In our study, the degree to which PS methods ensured treatment groups were comparable varied by method and country, with other studies also observing similar findings.Choice of PS cannot be generalised to all databases as it is highly dependent on the research objective, achieving covariate balance, and the type and size of treatment effect estimate of interest (marginal, conditional, average treatment effect, or average treatment effect of the treated) (34)(35)(36).
Studies have approached the issue of residual confounding in different ways.McGrath (13) had assessed the comparability of newly initiating denosumab vs. oral bisphosphonates after using IPTW and 12 pre-specified NCOs using US claims data, with residual confounding being evident if a meaningful, non-null effect was observed.That study had found comparability was not achieved as associations were found for two NCOs thus comparative effectiveness estimation could not proceed.In another US study, Kim et al ( 29) evaluated three early fracture NCOs and four non-fracture NCOs; and assessed residual confounding using relative (risk ratio <0.85 or >1.15) and absolute (risk difference > 0.01) measures.In other populations, Levintow (14) had evaluated the comparability of lipid-lowering drugs using 10 pre-defined NCOs, assuming comparability was achieved if the risk difference was close to the null effect of zero (although a range was not specified) and the 95% CI contained zero for all NCOs.Other studies had instead focused on identifying a large set of NCOs in order to calibrate treatment effect estimates for residual confounding (15,30,37,38) with Hripcsak (15) specifying that 95% of NCOs were expected to have the null effect contained within its 95% CI.Most commonly, studies choose one (or few) NCO(s) that assume to have the same confounding structure between treatment and primary outcome, and if a non-null effect is observed then residual confounding is present (11).
In contrast, and a strength of our study, we had used a combination of methods.Firstly, early fracture was selected a priori as an NCO assuming it shared the same set of confounders as treatment and primary outcome of fragility fracture.Secondly, an automated method identified over 100 potential NCOs that were unlikely to be associated with treatment; however, the assumption of sharing the same set of confounders is unlikely to be met; use of a large set of NCOs with differing confounding structures may mitigate that assumption.Thirdly, the optimal number of NCOs may be dependent on whether one wants to simply detect the presence of residual confounding, or to use NCOs to calibrate treatment effect estimates for residual confounding which would require at least 30 NCOs (39).The use of a large number of NCOs in our analysis allows for this possibility. .

Strengths and limitations
Our study was performed in population-based databases in the UK and Denmark.There was some similarities and differences in the achievement of comparability between databases.
Although new-user analyses are generally preferred for comparative studies, the predominant use of denosumab as second-line therapy in both the UK and Denmark may have contributed to difficulties in achieving comparability and to inclusion of patients not fully representative of clinical practice.We therefore conducted a new-switcher analysis to account for prior use of oral bisphosphonates in the denosumab group.Correspondingly, comparability appeared easier to achieve in the CPRD new switcher analysis than the new-user analysis.
We assessed whether residual confounding was evident based on two approaches with differing levels of strictness.Choice of threshold carries risks for decision-making on when to proceed with comparative assessments: a threshold that is too stringent risks excluding the possibility of conducting analyses on sufficiently comparable cohorts; a threshold that is not stringent enough may lead to comparisons on non-comparable cohorts.
Although our study had identified over 100 potential NCOs, under half were used in analysis as some NCOs were rare (less than 100 events).The decision to exclude rare outcomes was justified as they were unlikely to have adequate power to detect a non-null effect contributing misleading evidence of no residual confounding.However, imposing a minimum number of events had led to some analyses with a small number of assessable NCOs, resulting in only one NCO comparison needing to show a significant difference to exceed the 5% threshold of residual confounding.
Decisions made using IPTW may have affected the way standard errors and in turn CIs were calculated.Firstly, robust standard errors were used to ensure NCO estimates were not biased due to either a potential misclassification of PS estimation or the Cox model (40); this approach . is known to over-estimate standard errors (41).Secondly, use of truncated weights would have led to more stable weights thus leading to smaller standard errors.Lastly, adjusting for imbalanced covariates may have led to larger standard errors (34).It is unclear what the overall impact was on standard errors, whether they were over-or under-estimated and the impact this would have had when using CIs to assess whether residual confounding Despite these limitations, a key strength is that all three PS methods were considered thus offering the flexibility that any one method could be used for further comparative effectiveness analysis.

Conclusions
Confounding by indication will always be observed in routinely collected medical record data and PS and NCO methods are important tools to account for such confounding.We have shown that assessment of comparability varies depending on the method of PS adjustment and definitions of residual bias.The extent to which residual confounding can be identified is unknown, and studies should consider more than one PS method to test robustness and identify the largest number of NCOs to give the greatest flexibility in detecting residual confounding.
Further research is required to determine the optimal threshold to identify residual confounding.

Ethical approval
Access to CPRD data was approved (Protocol #20_000206) according to CPRD's research data governance framework.

Figure 2b: Covariate balance with propensity score weighting in new users (DNR)
.

Figure 3a: Negative control outcome estimates after PS weighting in new users (CPRD) Figure 3b: Negative control outcome estimates after PS weighting in new users (DNR)
.

Figure 1 :
Figure 1: New user and new switcher study designs

Figure 2a :
Figure 2a: Covariate balance with propensity score weighting in new users (CPRD)

Table 1 : Baseline patient characteristics of new users of denosumab and oral bisphosphonates
..* Numbers masked when less than minimum cell count of 5. ^Socioeconomic status represented by the Index of Multiple Deprivation in the UK, and by family wealth, education level, cohabiting status, and marital status in Denmark

Table 2 : Number of imbalanced covariates in new user cohorts
Clinical Practice Research Datalink; DNR: Danish National Registries; IPTW: inverse probability of treatment weighting; PS: propensity score N refers to the number of covariates in the PS model; n refers to the number of imbalanced covariates after PS methods

Table 3 : Comparability threshold using negative control outcome estimates in new user cohorts
Clinical Practice Research Datalink; DNR: Danish National Registries N refers to number of negative control outcomes with at least 100 events during follow-up period.Results presented in number of negative control outcomes with residual bias (percentage) .