Key Points

The efficacy of treatments can be assessed in randomised, double-blind, clinical trials to minimise bias.

Most of these clinical trials do not include individuals from the population to whom efficacy findings will be applied in the real world, so the effectiveness of treatments must be ‘translated’ to these populations.

We show that current translation methods do not accurately predict effectiveness, highlighting the need to develop and validate functional and relevant approaches for translating clinical trial efficacy to the real-world setting.

1 Introduction

Randomised, double-blind clinical trial methodology, if well-implemented, minimises bias in the measurement of treatment efficacy and allows any difference in outcomes to be attributed to the treatment effect. This provides an unbiased estimate of the size of the difference in outcome rates between patients in the treated and control groups. Clinical trials are designed to answer the question, ‘is the tested treatment better than the control?’ and to establish a causal link between receiving the tested treatment and the difference in outcome rates. The estimate of the size of this difference provides a quantitative estimate of how much better the treatment is for a group of patients or, even, a given patient [1, 2].

The patients in phase III clinical trials are included from the treatment target population using eligibility criteria, which often lead to a ‘selected’ trial population that is not always representative of the target population [3,4,5,6,7,8,9,10]. For example, various racial/ethnic populations, elderly people and women were shown to be underrepresented in 59 trials in heart failure [11]. In breast, colorectal, lung and prostate cancer clinical trials sponsored by the National Cancer Institute, participation varied significantly across racial/ethnic and age groups, and in cardiovascular clinical trials funded by the National Heart, Lung, and Blood Institute, women were reported to be underrepresented [12, 13].

Attempts have been made to correct for this via specific trial designs, appropriate data analysis tools or a pragmatic trial approach with more permissive eligibility criteria, but success has been limited [8, 14, 15]. There is heterogeneity between results from large, multicentre international trials assessing the same treatment, suggesting that the trial populations differ. For example, in a systematic review of remifentanil compared with short-acting opioids for general anaesthesia, the observed overall frequency of postoperative nausea in 11 fentanyl control groups (N = 3048) ranged from 14 to 81% [16]. In the two largest trials (N = 2437; N = 4787), the frequencies were statistically significantly different (25% and 32%, p = 0.0002), implying heterogeneity in the patients’ characteristics or the practice of care. Another example is the reported heterogeneity of the absolute benefit (AB) estimates from clinical trials assessing the same drug class [5, 17]. The results from 12 trials assessing the efficacy of β-blockers versus placebo or no β-blocker in reducing 1-year mortality rate in post-myocardial infarction patients were published between 1975 and 1990 (Table 1) [18,19,20,21,22,23,24,25,26,27,28,29,30,31]. The AB ranged from 0.0155 (an increase in mortality) to −0.0530 (a reduction), and the corresponding number needed to treat (NNT) ranged from −421 to 60. If we consider only the three trials with a p value < 0.05, the AB ranged from −0.0167 to −0.0530 and the NNT from 19 to 60. A meta-analysis of these trials showed heterogeneity for AB but not for relative risk (RR) [17].

Table 1 Results from trials assessing β-blockers for the prevention of death (1-year mortality) in post-myocardial infarction patients (data standardised at 1 year of follow-up)

This heterogeneity makes it difficult to generalise these trial results to the whole population. Thus, a translation process must be used to extrapolate the efficacy for these populations. The goal of the translation process, which is sometimes termed the ‘transportability’ process, is to predict the impact of the tested treatment on the population of interest in a real-world setting, using the clinical trial results [32]. This translation process is integrated in a broader framework known as health technology assessment, which assesses the impact, safety and cost of a treatment on the health status of the target population.

Generally, the endpoints in phase III clinical trials reflect clinical outcomes that are binary variables, such as death or occurrence of a cancer relapse. Thus, the efficacy estimate is calculated using the rate of outcomes observed in the control group (Rc) and in the experimental (treated) group (Rt). These are analysed using summary metrics (or statistics) of treatment efficacy, such as the odds ratio (OR), RR, relative benefit, AB and NNT. See the Electronic Supplementary Material (ESM) for more information.
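As an illustration of how these summary metrics relate to Rc and Rt, the following minimal Python sketch computes RR, OR, AB and NNT from two event rates. The sign convention (AB = Rt − Rc, so a negative AB is a reduction in events and a beneficial treatment has a positive NNT) is inferred from the β-blocker example in the Introduction; the function and class names are hypothetical and the ESM remains the reference for the exact definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EfficacyMetrics:
    rr: float          # relative risk, Rt / Rc
    odds_ratio: float  # odds of the event under treatment / odds under control
    ab: float          # absolute benefit, Rt - Rc (negative = fewer events)
    nnt: float         # number needed to treat, 1 / (Rc - Rt)


def efficacy_metrics(rc: float, rt: float) -> EfficacyMetrics:
    """Summary efficacy metrics from control (rc) and treated (rt) event rates."""
    if not (0.0 < rc < 1.0 and 0.0 < rt < 1.0):
        raise ValueError("event rates must lie strictly between 0 and 1")
    ab = rt - rc
    # NNT is undefined when there is no difference; report it as infinite.
    nnt = float("inf") if ab == 0 else -1.0 / ab
    return EfficacyMetrics(
        rr=rt / rc,
        odds_ratio=(rt / (1.0 - rt)) / (rc / (1.0 - rc)),
        ab=ab,
        nnt=nnt,
    )


# Illustrative rates only; they are not taken from any trial in Table 1.
m = efficacy_metrics(rc=0.20, rt=0.15)
print(f"RR={m.rr:.3f} OR={m.odds_ratio:.3f} AB={m.ab:.3f} NNT={m.nnt:.1f}")
```

With this convention, an AB of −0.0530 corresponds to an NNT of about 19 and an AB of −0.0167 to an NNT of about 60, which matches the range quoted for the three statistically significant β-blocker trials.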

The purpose of this article was to compare estimated population-level benefit, based on summary clinical trial data, as is usually done, with that based on the true efficacy in the population of interest, translated from the efficacy observed in clinical trials.

The process of translating clinical trial findings to a given population involves using the trial efficacy metrics to compute population benefit metrics. In this article, we have limited our assessment to the number of prevented events in the population (NPEpop) and the number needed to treat in the population (NNTpop), which we think are the most relevant population benefit metrics, as shown in the ESM. We assessed whether these population metrics derived from the clinical trial efficacy metrics could accurately predict real-world effectiveness over a given time.

2 Materials and Methods

We used simulated individual patient data to model the translation of clinical trial efficacy to real-life effectiveness because generally only aggregated data are publicly available. Aggregated data provide estimates for the ‘average’ patient enrolled in the clinical trial, who is probably not representative of patients in the real world since higher-risk patients are rarely enrolled in phase III clinical trials. Hence, the results from these analyses should be assessed qualitatively and not quantitatively.

2.1 Simulation Framework

We simulated three populations, two drugs with different efficacies, and two trials with different sampling protocols.

2.1.1 Populations

Each population, A, B and C, comprised 100,000 individuals who were all assumed to have the same disease, thus they were all at risk of the same clinical event, but the event rates in untreated individuals (Rc) differed in each population (Fig. 1). The distributions of Rc differed, but the average Rc was the same for populations A and B (0.35) and was lower for population C (0.22). The effect of the two drugs on the clinical outcome was modelled with the Wang model (see Sect. 2.2). The population metrics for the beneficial effects of drugs 1 and 2 were then computed for the three populations.

Fig. 1 Distribution of risk without treatment (Rc) in three simulated populations, A, B and C, each comprising 100,000 individuals who were all assumed to have the same disease and, therefore, were all at risk of a clinical event but the event rates in the untreated individuals (Rc) differed in each population
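A set of populations of this kind could be generated as in the following minimal Python sketch. Only the population size (100,000) and the average Rc values (0.35, 0.35 and 0.22) come from the text; the Beta distributions and their shape parameters are assumptions chosen to give those averages and are not the distributions shown in Fig. 1.

```python
import random
from statistics import fmean

N = 100_000  # individuals per population, as in the text

# Hypothetical Beta(a, b) shapes; only the means a / (a + b) come from the text.
# A and B share a mean of 0.35 but differ in spread; C has a mean of 0.22.
SHAPES = {"A": (3.5, 6.5), "B": (0.7, 1.3), "C": (2.2, 7.8)}


def simulate_population(a: float, b: float, n: int, seed: int) -> list[float]:
    """Draw n individual untreated risks (Rc) from a Beta(a, b) distribution."""
    rng = random.Random(seed)
    return [rng.betavariate(a, b) for _ in range(n)]


populations = {
    name: simulate_population(a, b, N, seed=i)
    for i, (name, (a, b)) in enumerate(SHAPES.items())
}

for name, rc_values in populations.items():
    print(f"population {name}: mean Rc = {fmean(rc_values):.3f}")
```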

2.1.2 Drugs

In the simulation, both drugs 1 and 2 had the same mode of action but drug 1 was more potent than drug 2 (Table 2) [33].

Table 2 Summary of characteristics of drugs 1 and 2 [33]

2.1.3 Clinical Trials

Two clinical trials were simulated in population A, one for each drug, to obtain two sets of summary trial metrics using different sampling processes. Trial 1 should have been run on a random sample of population A; however, since random variations and confidence intervals were not taken into consideration in our approach, the whole population A was used in trial 1, not a random sample. Hence, this can be considered as a random sample with the same average Rc as the overall population, without the random variations. A non-random sample from population A with an average Rc that was lower than that for the overall population was used in trial 2.
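The two sampling processes can be sketched as follows in Python. The Beta distribution standing in for population A and the Rc cutoff used as an eligibility rule for trial 2 are both hypothetical; the text states only that trial 2 used a non-random sample with a lower average Rc than the overall population.

```python
import random
from statistics import fmean

rng = random.Random(0)
# Stand-in for population A: hypothetical Beta(3.5, 6.5) risks, mean 0.35.
population_a = [rng.betavariate(3.5, 6.5) for _ in range(100_000)]

# Trial 1: the whole population, i.e. a random sample without sampling noise.
trial_1 = population_a

# Trial 2: a selected sample. The 0.40 cutoff is a hypothetical eligibility
# rule that excludes higher-risk individuals.
RC_CUTOFF = 0.40
trial_2 = [rc for rc in population_a if rc < RC_CUTOFF]

print(f"trial 1: n = {len(trial_1)}, mean Rc = {fmean(trial_1):.3f}")
print(f"trial 2: n = {len(trial_2)}, mean Rc = {fmean(trial_2):.3f}")
```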

2.2 The Wang Model

The Wang model is the simplest model of drug action on a clinical outcome that takes into consideration the main features of both the drug’s pharmacological action on its biological target and the consequences on the course of a disease [34]. It assumes that the probability of the outcome under treatment (the event rate, Rt) follows a logistic function of the drug’s pharmacodynamic effect (E) with two parameters: β0, the intercept, and S, the coefficient of E, which can be interpreted as the scale of the drug effect size [35]. See the ESM for more details.
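A logistic relationship of this form can be sketched in Python as below. The sketch assumes that logit(Rt) = β0 + S × E and that β0 equals logit(Rc) for each individual, so that Rt reduces to Rc when E = 0; this individual-level parameterisation and the numerical values of S and E are assumptions made for illustration and may differ from the specification in the ESM.

```python
import math


def logit(p: float) -> float:
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def expit(x: float) -> float:
    """Inverse of logit: maps a log odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def treated_risk(rc: float, s: float, e: float) -> float:
    """Rt under a logistic model, logit(Rt) = beta0 + S * E.

    Assumption: beta0 = logit(Rc), so that Rt equals Rc when E = 0.
    """
    return expit(logit(rc) + s * e)


# Hypothetical values: S = -1 (a protective drug) and E = 1.
for rc in (0.05, 0.20, 0.50, 0.80):
    rt = treated_risk(rc, s=-1.0, e=1.0)
    print(f"Rc={rc:.2f} Rt={rt:.3f} AB={rt - rc:+.3f} RR={rt / rc:.3f}")
```

Under these assumptions the individual odds ratio is constant at exp(S × E), whereas AB and RR both change with Rc, which is the non-linearity that makes translation from a trial average problematic.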

2.3 Calculations

To assess any translation biases arising from the source of data used for the efficacy metrics calculation, we compared the efficacy metrics of each of the two drugs (1) computed on the trial summary data (for the two trials with each of the two drugs), (2) computed on the three populations (for each of the two drugs) and (3) translated for the three populations from the trial summary data (for the two trials and the two drugs). More details about this process are provided in the ESM.
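The three computations can be sketched in Python as follows. Everything numerical here is hypothetical: the Beta distributions for the trial and target populations, the constant individual log odds ratio used as the drug model, and the translation rule itself (applying each trial metric to the average Rc of the target population), which is one plausible reading of the process and not necessarily the one detailed in the ESM. The output is therefore illustrative and is not expected to reproduce the values in the tables.

```python
import math
import random
from statistics import fmean


def treated_risk(rc: float, log_or: float) -> float:
    """Individual Rt under an assumed constant log odds ratio."""
    odds = rc / (1.0 - rc) * math.exp(log_or)
    return odds / (1.0 + odds)


def draw_rc(a: float, b: float, n: int, seed: int) -> list[float]:
    """Hypothetical Beta(a, b) distribution of untreated risk (Rc)."""
    rng = random.Random(seed)
    return [rng.betavariate(a, b) for _ in range(n)]


LOG_OR = -0.5                             # hypothetical drug effect
trial = draw_rc(3.5, 6.5, 50_000, seed=1)   # trial population, mean Rc 0.35
target = draw_rc(2.2, 7.8, 50_000, seed=2)  # target population, mean Rc 0.22

# (1) Summary metrics as a trial report would give them.
rc_trial = fmean(trial)
rt_trial = fmean([treated_risk(rc, LOG_OR) for rc in trial])
ab = rt_trial - rc_trial
rr = rt_trial / rc_trial
odds_ratio = (rt_trial / (1 - rt_trial)) / (rc_trial / (1 - rc_trial))

# (2) True number of prevented events, summed over individuals in the target.
n = len(target)
npe_true = sum(rc - treated_risk(rc, LOG_OR) for rc in target)

# (3) Translation: apply each trial summary metric to the target's average Rc.
rc_pop = fmean(target)
odds_pop = odds_ratio * rc_pop / (1 - rc_pop)
npe_from = {
    "AB": -ab * n,
    "RR": (rc_pop - rr * rc_pop) * n,
    "OR": (rc_pop - odds_pop / (1 + odds_pop)) * n,
}

print(f"true: NPEpop = {npe_true:.0f}, NNTpop = {n / npe_true:.1f}")
for metric, npe in npe_from.items():
    print(f"{metric}: NPEpop = {npe:.0f} ({npe / npe_true:.0%} of true), "
          f"NNTpop = {n / npe:.1f}")
```

Expressing each translated NPEpop as a percentage of the true value mirrors the way the discrepancies are reported in the Results.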

3 Results

3.1 Simulated Clinical Trials

The results from the four simulated clinical trials, which were assumed to be statistically significant, showed that the efficacy metrics differed between the two trial populations for the same drug (Table 3). The efficacy metrics for the less potent drug, i.e. drug 2, were less favourable in both trials.

Table 3 Summary of efficacy metrics from two clinical trials for drugs 1 and 2

3.2 Results from Simulated Translation

The results from the simulated translation of results from clinical trials with drugs 1 and 2 are summarised in Table 3. Although the NPEpop should be constant for a given drug in a given population when translated using trial summary data, its value varied depending on the metric used to calculate it. For example, the NPEpop for drug 2 in population C, calculated from trial summary data, was 125% of the true value when calculated with AB from trial 1 and 356% and 535% of the true value when calculated with the RR and OR, respectively, from the same trial. For NNTpop, the ratio varied from one population to another within the single trial, which could have been anticipated. The estimates of real-world effectiveness metrics using the clinical trial efficacy metrics differed from the values calculated for the trial populations, with the exception of population A in trial 1 with both drugs, since the whole population was included in this trial. The number of prevented events (NPEs) and NNTs were under-predicted for drug 1 and over-predicted for drug 2 when the RRs and ORs from the clinical trials were used for the translation (Table 4). RR varies with Rc, and the RRs varied between trials and with the population (Fig. 2; Table 4). This variation was greater for drug 2, which was less efficacious than drug 1.

Table 4 Number of prevented events and number needed to treat, translated in populations A, B and C, using various trial efficacy metrics estimated using summary data from trials 1 and 2, for drugs 1 and 2

Fig. 2 Variation of absolute benefit with risk without treatment for two drugs (1 and 2) in the same population A. a The absolute benefit (AB) as a function of the risk without treatment (Rc) in population A for drug 1; b AB as a function of Rc in population A for drug 2

Table 5 summarises the values of estimated effectiveness metrics in populations A, B and C based on the trial efficacy metrics estimated from trial 1 for each of the two drugs. These data show that use of the trial efficacy metrics for inferring population benefit results in erroneous population metrics. For example, the observed RRs in trial 1 with drugs 1 and 2 were 0.255 and 0.955, respectively. When these were used to translate to the three populations individually, we observed the same values for population A because the whole population was included in the trials, but the RRs for populations B and C were, respectively, 0.174 and 0.201 for drug 1 and 0.938 and 0.942 for drug 2. The bias was lowest when AB, computed with trial summary data, was used for the translation. The RR computed using trial 2 summary data differed from the RR for the true population A since trial 2 was run on a selected sample of population A (see the ESM).

Table 5 Comparison of the values for odds ratio, absolute benefit, relative risk and number needed to treat effectiveness metrics calculated for populations A, B and C and drugs 1 and 2, using efficacy metrics from trial 1 for each drug

4 Discussion

The observed differences between trial efficacy metrics and real-world effectiveness metrics are due to differences in Rc distributions in trial and real-world populations. We demonstrated that it is possible to translate an appropriate trial efficacy metric to a population effectiveness metric if the trial is undertaken on a random sample of the population of interest. However, in most diseases, except rare (orphan) diseases, it is extremely difficult, if not impossible, to recruit patients into clinical trials who are truly representative of the population that will be treated. Although the clinical trial population is drawn from the treatment target population, it is selected using eligibility criteria that result in a subpopulation of the treatment target population that does not have the same characteristics as the whole population; in particular, the Rc differs.

The translation issue has been explored both through a factual approach by comparing clinical trial results with observational data and through a theoretical statistical approach [1, 14, 36]. The factual approach produced results that were difficult to interpret because of the variability of postmarketing (phase IV) studies and their limited capability to manage bias. The theoretical statistical approach was not intuitive for the medical community and will require more work to provide a practical solution.

As mentioned, simulation is a simple way to explore translation, although it does not resolve the issue, particularly when the models and simulated data have not been validated. Because there is no alternative approach for exploring this issue, we relied on simulation, but our results should be interpreted cautiously.

The NNT metric is generally used by teachers, regulators, authors and pharmaceutical companies to benchmark treatments or assess the relative benefits of a treatment. The treatment with the lowest NNT is generally taken to be the most efficacious. It has been suggested that NNT ‘has that clinical immediacy’ (of clinical applicability), which is one reason why it is such a popular measure [37]. However, this is not true when the NNT is computed on clinical trial data for translation purposes or for comparing drug efficacies, as frequently occurs. We showed that the same drug in different trials can lead to different NNT values when the Rc for the trial populations differ and are different from those for the population of interest; therefore, the translated NNT should be interpreted cautiously. Several authors have warned against the sensitivity of NNT to factors that change baseline risk, e.g. patients’ characteristics, secular trends in incidence and case fatality and delay to event [38, 39]. The value of NNT is not the same if the treatment effect is immediate or if the effect is to delay an outcome rather than prevent it [40].
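The sensitivity of NNT to baseline risk can be shown with a short Python sketch. It assumes, purely for illustration, a treatment whose RR is constant across baseline risks; the RR of 0.80 and the Rc values are hypothetical.

```python
def nnt_from_rr(rc: float, rr: float) -> float:
    """NNT implied by a relative risk rr applied to a baseline risk rc.

    Uses AB = Rt - Rc with Rt = rr * Rc, so NNT = 1 / (Rc * (1 - rr)).
    """
    if not (0.0 < rc <= 1.0 and 0.0 < rr < 1.0):
        raise ValueError("expected 0 < rc <= 1 and 0 < rr < 1")
    return 1.0 / (rc * (1.0 - rr))


RR = 0.80  # hypothetical constant relative risk
for rc in (0.05, 0.10, 0.20, 0.40):
    print(f"Rc={rc:.2f} -> NNT={nnt_from_rr(rc, RR):.1f}")
```

Even with an identical relative effect, the NNT falls as Rc rises, so an NNT taken from a low-risk trial population would overstate the number of patients to treat in a higher-risk population, and vice versa.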

As mentioned, evidence exists that patients included in clinical trials, although taken from the overall target population, are not representative of all patients to whom the new treatment will be prescribed. The main differences between the trial population and the real-world population are the risk of the outcome (Rc) and the presence of concomitant diseases [5]. Although it is often assumed that the populations will be sufficiently similar to support the hypothesis that the new treatment will also be efficacious in the real-world population, we cannot extrapolate the size of the treatment effect from the clinical trial to the real-world population. Even when recruitment criteria focus on high-risk patients, it has been observed that trial patients are at lower risk than real-world ‘high-risk’ patients and the exclusion criteria often prevent patients with concomitant diseases and those who cannot respect the clinical trial schedules from being recruited to the trial, although these patients will potentially receive the new drug [18]. The results from trials assessing β-blockers illustrate the fact that the same treatment can show differing efficacy when used in different patient populations who, in theory, all have the same disease or risk (Table 1).

With a low-risk population, there would be differences, but if the treatment were moderately efficacious, these differences would be modest. Since, for most drugs, treatment efficacy assessed in trials is modest, the population-level effectiveness metrics obtained by translating trial efficacy metrics values may be viewed as satisfactory. However, this would be without taking into account the issue of responders/non-responders, which is more important when the observed efficacy is low.

Population efficacy metrics computed on clinical trial data were first used to assess the validity of the trial results. However, the statistical assessment with traditional null hypothesis testing is based on the assumption that the analysed trial is a random sample of an infinite number of similar trials and therefore the observed trial efficacy is representative of the true treatment efficacy.

Regulators and payers continue to focus on statistical significance and p values and have not adequately addressed the issue of translation of treatment efficacy from a trial setting to treatment effectiveness in a real-life setting. The guidelines published in 2007 by the European Medicines Agency explicitly mentioned the translation issue in the objectives section, but the issue was not properly formulated and addressed in the rest of the report, although the NNT was discussed in a way that came close to the fundamental issue [41]. This fundamental issue concerns the fact that we are dealing with non-linear effects, whereas the metrics used in translation assume a linear effect.

Pragmatic clinical trials were designed to address the translation issue by providing evidence for adoption of treatment in real-world clinical practice [42, 43]. Since then, only a few truly pragmatic trials have been published, essentially because the rules that define a pragmatic trial are difficult to put into practice. For example, the patients should be similar to those who will receive the intervention in real life, but they must accept being randomised to the new treatment or the comparator, which is usually the current standard of care. In addition, the investigators, who should be real-world prescribers and not trialists, can decide how to administer the treatment. Alternatively, model-based methods can be used to translate observed trial results to a specified target population, but this approach can only take into consideration a small number of covariates [14].

We propose that the effect model (EM) could be used to translate trial metrics to population metrics. The EM approach models the relationship between AB and Rc, which is a characteristic of the treatment at a given time point. This has been demonstrated using simulated populations and been reported in real life [44, 45].
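The principle of applying an AB-versus-Rc relationship to a population distribution of Rc can be sketched in Python as follows. This is not the published EM [44, 45]: the functional form of the relationship (a constant odds ratio of 0.6) and the three-bin Rc histogram are hypothetical and serve only to show why the full distribution of Rc, and not just its average, is needed.

```python
def absolute_benefit(rc: float, odds_ratio: float = 0.6) -> float:
    """Hypothetical effect model: AB (Rt - Rc) as a function of Rc."""
    rt = odds_ratio * rc / (1.0 - rc + odds_ratio * rc)
    return rt - rc


# Hypothetical target population: Rc bin midpoint -> number of individuals.
rc_histogram = {0.10: 50_000, 0.30: 30_000, 0.60: 20_000}
n_total = sum(rc_histogram.values())

# Population benefit from the effect model applied bin by bin.
npe_em = sum(-absolute_benefit(rc) * count for rc, count in rc_histogram.items())
nnt_em = n_total / npe_em

# Same effect model applied only to the average Rc, ignoring the distribution.
mean_rc = sum(rc * count for rc, count in rc_histogram.items()) / n_total
npe_mean = -absolute_benefit(mean_rc) * n_total

print(f"effect model over distribution: NPEpop={npe_em:.0f}, NNTpop={nnt_em:.1f}")
print(f"effect model at mean Rc {mean_rc:.2f}: NPEpop={npe_mean:.0f}")
```

Because the assumed relationship is non-linear, evaluating it at the average Rc gives a different NPEpop from summing it over the distribution.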

5 Conclusion

This analysis clearly shows that more appropriate and accurate tools are needed to translate clinical trial efficacy to population-level effectiveness. We showed that two population efficacy metrics, NPEpop and NNTpop, could be used to compare two or more treatments (e.g. drugs 1 and 2 in populations A, B or C), irrespective of whether the trials had been run on random samples of the corresponding populations or whether unbiased translation had been achieved. This approach requires prior knowledge of, at least, the target population distribution of Rc and the treatment EM.