Review Article
Statistically significant meta-analyses of clinical trials have modest credibility and inflated effects

https://doi.org/10.1016/j.jclinepi.2010.12.012

Abstract

Objective

To assess whether nominally statistically significant effects in meta-analyses of clinical trials are true and whether their magnitude is inflated.

Study Design and Setting

Data from the Cochrane Database of Systematic Reviews 2005 (issue 4) and 2010 (issue 1) were used. We considered meta-analyses with binary outcomes and four or more trials in 2005 with P < 0.05 for the random-effects odds ratio (OR). We examined whether any of these meta-analyses had updated counterparts in 2010. We estimated the credibility (true-positive probability) under different prior assumptions and inflation in OR estimates in 2005.

Results

Four hundred sixty-one meta-analyses in 2005 were eligible, and 80 had additional trials included by 2010. The effect sizes (ORs) were smaller in the updating data (2005–2010) than in the respective meta-analyses in 2005 (median 0.85-fold, interquartile range [IQR]: 0.66–1.06), and the difference was more prominent for meta-analyses with fewer than 300 events in 2005 (median 0.67-fold, IQR: 0.54–0.96). Mean credibility of the 461 meta-analyses in 2005 was 63–84% depending on the assumptions made. Credibility estimates changed by more than 20% in 19–31 (24–39%) of the 80 updated meta-analyses.

Conclusions

Most meta-analyses with nominally significant results pertain to truly nonnull effects, but exceptions are not uncommon. The magnitude of observed effects, especially in meta-analyses with limited evidence, is often inflated.

Introduction

What is new?

  • Most statistically significant results from meta-analyses of clinical trials are more likely to reflect truly nonnull effects than false-positive results.

  • It is more probable that the credibility of the updated meta-analyses increases rather than decreases.

  • Data added to the existing meta-analysis in a 5-year window (2005–2010) indicated less prominent effects than did the summary estimates in 2005.

  • The median fold change in these summary estimates was 0.85, but the reduction was greater for meta-analyses with less cumulative data (median fold change of 0.67).
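As an illustration of what a fold change in summary estimates means, the following minimal sketch computes the ratio of the OR in the updating data to the OR in the earlier meta-analysis. It assumes, as a convention chosen here and not confirmed by the text, that both ORs are first oriented so that the earlier effect lies above 1. The function name and the example values are hypothetical.

```python
import math
from statistics import median


def fold_change(or_2005: float, or_update: float) -> float:
    """Ratio of the updating-data OR to the earlier OR.

    Assumption: both ORs are oriented so that the earlier effect is
    expressed as an OR above 1. Values below 1 then mean that the
    newer trials suggest a weaker effect than the earlier estimate.
    """
    log_old = math.log(or_2005)
    log_new = math.log(or_update)
    if log_old < 0:
        # Flip protective effects so that both ORs point the same way.
        log_old, log_new = -log_old, -log_new
    return math.exp(log_new - log_old)


# Hypothetical (earlier OR, updating-data OR) pairs, for illustration only.
example_pairs = [(2.0, 1.7), (0.5, 0.75), (1.5, 1.6)]
folds = [fold_change(old, new) for old, new in example_pairs]
print([round(f, 2) for f in folds], "median:", round(median(folds), 2))
```

Working on the log scale keeps the calculation symmetric: an OR of 0.5 that moves to 0.75 is treated like an OR of 2.0 that moves to about 1.33.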

Meta-analyses are often considered the highest level of evidence for evaluating interventions in health care [1], [2] and are very influential in the literature and in practice [3]. However, there has been some debate on whether meta-analyses provide reliable evidence. For example, in an analysis that stirred intense discussion and criticism, LeLorier et al. [4] evaluated 19 meta-analyses and pointed out that these studies had only modest ability to predict the results of subsequent large clinical trials. Meta-analyses with limited evidence, biased studies, and poor-quality trials are considered to be more prone to unreliable results [5], [6], [7], [8], [9], [10]. Other investigators have pointed out that the current interpretation of statistically significant results in meta-analyses ignores the fact that studies are added one at a time, so more conservative rules are needed to claim statistical significance [7], [10]. When corrections for sequential testing are made, many statistically significant meta-analyses lose their nominal significance [11].

Based on these concerns, clinicians, patients, and policy makers are left with some uncertainty about how they should interpret a meta-analysis when they see that it has a P-value < 0.05 and that its 95% confidence interval (CI) excludes the null. How likely is it that there is some genuine treatment effect rather than a “false positive”? Moreover, if there is some effect, is the statistically significant meta-analysis estimate reliable or inflated, and, if so, by how much? Often clinicians and policy makers use nominal statistical significance as a first prerequisite before even considering an intervention for implementation. Then, they may also ask for a sufficiently large treatment effect size. However, there is evidence from diverse fields that, when one focuses on statistically significant results that pass a given threshold of significance (e.g., P < 0.05), some of them are false positives [5] and effect size estimates are inflated on average because of the winner’s curse phenomenon [12]. The winner’s curse arises when results are selected because they cross a threshold of significance and the same data are then used to estimate the effect size. It is then mathematically expected that, on average, these estimates are exaggerated [12]. The extent of inflation of effect sizes varies substantially across different studies and scientific fields and is more prominent when the sample size is smaller [12], [13], [14]. False positives and inflation of effects for meta-analyses of clinical trials require more systematic study. Both false positives and inflated effects could cause misleading impressions about an intervention and wrong treatment choices.
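The winner’s curse described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not an analysis from this article: the true OR of 1.2, the standard error of 0.15 on the log scale, and the function name are all hypothetical choices made for illustration.

```python
import math
import random


def simulate_winners_curse(true_or=1.2, se=0.15, n_sim=20000, seed=1):
    """Simulate unbiased log-OR estimates and keep the significant ones.

    Every simulated estimate is drawn from a normal distribution centred
    on the true log OR. Averaging only the estimates with |z| > 1.96
    shows how selection on significance exaggerates the effect.
    Returns (OR from all estimates, OR from significant estimates,
    fraction of estimates that were significant).
    """
    rng = random.Random(seed)
    true_log_or = math.log(true_or)
    all_estimates = []
    significant = []
    for _ in range(n_sim):
        estimate = rng.gauss(true_log_or, se)
        all_estimates.append(estimate)
        if abs(estimate / se) > 1.96:  # two-sided P < 0.05
            significant.append(estimate)
    mean_all = sum(all_estimates) / len(all_estimates)
    mean_sig = sum(significant) / len(significant)
    return math.exp(mean_all), math.exp(mean_sig), len(significant) / n_sim


or_all, or_sig, fraction_sig = simulate_winners_curse()
print(f"all estimates: OR {or_all:.2f}")
print(f"significant only: OR {or_sig:.2f} ({fraction_sig:.0%} of simulations)")
```

Increasing `se` in this sketch (less information) makes the gap between the two averages larger, which matches the statement that inflation is more prominent with smaller sample sizes.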

Here, we evaluated empirically whether nominally statistically significant results in meta-analyses of clinical trials are credible and whether the effect sizes from such meta-analyses are potentially inflated. We estimated the credibility (the posterior probability of true-positive results) in independent meta-analyses that had nominal statistical significance in the Cochrane Database of Systematic Reviews (CDSR) in late 2005. Then, we evaluated the change in credibility for those meta-analyses that had data from additional trials included by early 2010. Moreover, we estimated whether the updating data suggested smaller effects than the initial meta-analyses.
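To make the idea of credibility as a posterior probability concrete, the sketch below combines a prior probability of a nonnull effect with an approximate Bayes factor. It is one common way to perform such a calculation and is not necessarily the calculation used in this article. The assumptions are that the log OR estimate is normally distributed and that, under the alternative, the true log OR follows a normal prior centred on zero with standard deviation `prior_sd`. All function names, parameter names, and example values are hypothetical.

```python
import math


def bayes_factor_null(log_or: float, se: float, prior_sd: float) -> float:
    """Approximate Bayes factor for the null versus the alternative.

    Null: true log OR = 0. Alternative (assumed): true log OR is drawn
    from N(0, prior_sd^2). Values below 1 favour a nonnull effect.
    """
    v = se ** 2          # variance of the estimate
    w = prior_sd ** 2    # assumed prior variance of the true effect
    z_squared = (log_or / se) ** 2
    return math.sqrt((v + w) / v) * math.exp(-0.5 * z_squared * w / (v + w))


def credibility(log_or: float, se: float, prior_prob: float, prior_sd: float) -> float:
    """Posterior probability that the effect is truly nonnull."""
    prior_odds_null = (1.0 - prior_prob) / prior_prob
    posterior_odds_null = bayes_factor_null(log_or, se, prior_sd) * prior_odds_null
    return 1.0 / (1.0 + posterior_odds_null)


# Hypothetical meta-analysis: OR 1.5 with a standard error of 0.2 on the log scale.
for prior_prob in (0.3, 0.5, 0.7):
    value = credibility(math.log(1.5), 0.2, prior_prob, prior_sd=0.5)
    print(f"prior {prior_prob:.1f} -> credibility {value:.2f}")
```

The loop shows why credibility is reported as a range: the same nominally significant result yields different posterior probabilities depending on the prior assumptions.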

Databases of meta-analyses

We have previously collected data on all 1,011 independent meta-analyses from the CDSR (issue 4, 2005), with binary outcomes and four or more trials [15], [16]. Briefly, one meta-analysis was used per systematic review (the one with the largest number of trials or, if two or more had a similar number of studies, the one with the largest number of events). Further detailed information on selection criteria appears elsewhere [15], [16], [17]. In these 1,011 meta-analyses, we summarized results

Evaluated meta-analyses

Among the 1,011 meta-analyses, 461 had nominally statistically significant effects in 2005 by random-effects calculations. Of the 461 meta-analyses, 199 belonged to a systematic review that had been updated between 2005 and 2010. Eighty of these 199 meta-analyses included data from additional trials in the 2005–2010 window. The Appendix shows the comparisons and outcomes for these 80 meta-analyses, the amount of information available until 2005 and in the 2005–2010 Update (see Appendix on the

Discussion

We evaluated 461 meta-analyses of clinical trials on diverse interventions, 80 of which had also been updated over a period of 5 years. We estimated under different assumptions that 63–84% of the 461 meta-analyses probably represent true effects, whereas the remaining 16–37% of the statistically significant meta-analyses are false positives. Moreover, based on the updated sample, the point estimates of the nominally statistically significant effects are, on average, inflated. The inflation is

References (65)

  • E.W. Steyerberg et al. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol (2001)
  • E.W. Steyerberg et al. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol (1999)
  • D. Moher et al. Improving the quality of reports of meta-analyses of randomised controlled trials: the QUOROM statement. Quality of Reporting of Meta-analyses. Lancet (1999)
  • D. Moher et al. Systematic reviews: when is an update an update? Lancet (2006)
  • A.M. Moseley et al. Cochrane reviews used more rigorous methods than non-Cochrane reviews: survey of systematic reviews in physiotherapy. J Clin Epidemiol (2009)
  • S.R. Johnson et al. Methods to elicit beliefs for Bayesian priors: a systematic review. J Clin Epidemiol (2010)
  • I. Olkin. Meta-analysis: current issues in research synthesis. Stat Med (1996)
  • G.H. Lyman et al. The strengths and limitations of meta-analyses based on aggregate data. BMC Med Res Methodol (2005)
  • N.A. Patsopoulos et al. Relative citation impact of various study designs in the health sciences. JAMA (2005)
  • J. LeLorier et al. Discrepancies between meta-analyses and subsequent large randomized, controlled trials. N Engl J Med (1997)
  • J.P. Ioannidis. Why most published research findings are false. PLoS Med (2005)
  • J.P. Ioannidis. Effect of formal statistical significance on the credibility of observational associations. Am J Epidemiol (2008)
  • L. Wood et al. Empirical evidence of bias in treatment effect estimates in controlled trials with different interventions and outcomes: meta-epidemiological study. BMJ (2008)
  • L.L. Kjaergard et al. Reported methodologic quality and discrepancies between large and small randomized trials in meta-analyses. Ann Intern Med (2001)
  • K. Thorlund et al. Can trial sequential monitoring boundaries reduce spurious inferences from meta-analyses? Int J Epidemiol (2009)
  • J.P. Ioannidis. Why most discovered true associations are inflated. Epidemiology (2008)
  • T.V. Pereira et al. Discovery properties of genome-wide association signals from cumulatively combined data sets. Am J Epidemiol (2009)
  • N.A. Patsopoulos et al. Sensitivity of between-study heterogeneity in meta-analysis: proposed metrics and empirical evaluation. Int J Epidemiol (2008)
  • N.A. Patsopoulos et al. The use of older studies in meta-analyses of medical interventions: a survey. Open Med (2009)
  • J.P. Ioannidis et al. Reasons or excuses for avoiding meta-analysis in forest plots. BMJ (2008)
  • N.S. Young et al. Why current publication practices may distort science. PLoS Med (2008)
  • J.P. Ioannidis. Calibration of credibility of agnostic genome-wide associations. Am J Med Genet B Neuropsychiatr Genet (2008)

Competing interests: None.

Funding: T.V.P. is funded by grants from the Fundação de Amparo à Pesquisa do Estado de São Paulo (State of São Paulo Research Foundation, FAPESP) and the Coordenação de Aperfeiçoamento Pessoal de Nível Superior (CAPES, Brazilian Ministry of Education).
