Review Article
A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard

https://doi.org/10.1016/j.jclinepi.2009.02.005

Abstract

Objective

In diagnostic accuracy studies, the reference standard may be imperfect or not available in all patients. We systematically reviewed the proposed solutions for these situations and generated methodological guidance.

Study Design and Setting

Review of methodological articles.

Results

We categorized the solutions into four main groups. The first group includes methods that impute or adjust for missing data on the reference standard. The second group consists of methods that correct estimates of accuracy obtained with an imperfect reference standard. In the third group a reference standard is constructed by combining multiple test results through a predefined rule, based on a consensus procedure, or through statistical modeling. In the fourth group, the diagnostic accuracy paradigm is abandoned in favor of validation studies that relate index test results to relevant clinical data, such as history, future clinical events, and response to therapy.

Conclusion

Most of the methods try to impute, adjust, or construct a reference standard. In situations that deviate only marginally from the classical diagnostic accuracy paradigm, these are valuable methods. In cases where an acceptable reference standard does not exist, the concept of clinical test validation may provide an alternative paradigm to evaluate a diagnostic test.

Introduction

A key phase in the evaluation of a test is determining its diagnostic accuracy, that is, the ability of the test to discriminate between patients with and without the target condition [1], [2], [3]. In diagnostic accuracy studies, the presence or absence of the target condition is determined by a gold standard. Ideally, the gold standard provides error-free classification. Accuracy measures, such as test sensitivity, specificity, likelihood ratios, predictive values, or the diagnostic odds ratio, express how well the results of the test under evaluation agree with the outcome of the gold standard [4]. For most, if not all, conditions in clinical medicine, a gold standard that is without error or uncertainty is not available [5], [6], [7]. In these circumstances, researchers use the best available practicable method to determine the presence or absence of the target condition, and such a method is referred to as a “reference standard” rather than a “gold standard” [3], [8].
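The accuracy measures named above can be sketched in a few lines of Python. This is only an illustration: the class name `TwoByTwo`, the function `accuracy_measures`, and the counts are hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Index test results cross-classified against the reference standard."""
    tp: int  # index test positive, reference positive
    fp: int  # index test positive, reference negative
    fn: int  # index test negative, reference positive
    tn: int  # index test negative, reference negative


def accuracy_measures(t: TwoByTwo) -> dict[str, float]:
    """Standard accuracy measures; raises ZeroDivisionError if a needed cell is empty."""
    sens = t.tp / (t.tp + t.fn)
    spec = t.tn / (t.tn + t.fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": t.tp / (t.tp + t.fp),           # positive predictive value
        "npv": t.tn / (t.tn + t.fn),           # negative predictive value
        "lr_positive": sens / (1 - spec),      # positive likelihood ratio
        "lr_negative": (1 - sens) / spec,      # negative likelihood ratio
        "diagnostic_odds_ratio": (t.tp * t.tn) / (t.fp * t.fn),
    }


# Hypothetical counts, chosen only for illustration.
table = TwoByTwo(tp=90, fp=20, fn=10, tn=180)
measures = accuracy_measures(table)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

All of these quantities treat the reference standard column totals as the truth, which is why errors in the reference standard carry over directly into every measure.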

Even within this framework, several problematic reference standard situations can occur. It may not be possible to perform the reference standard in all patients, the reference standard may be substantially imperfect, or there may be no accepted reference standard for the target condition. There is no universally accepted solution in diagnostic research when faced with missing, imperfect, or absent reference standards. Multiple solutions have been proposed, each with its own merits and limitations, but no unifying guidance exists that summarizes these coherently.

In this article, we briefly describe some of the reasons for imperfections of the reference standard. Using a systematic review of the literature, we have tried to identify and classify the solutions proposed for problematic reference standard situations. Based on this work we present a flowchart that provides methodological guidance to researchers planning diagnostic accuracy studies and to readers for critical appraisal of such studies.

An ideal reference standard, a gold standard, in an optimal diagnostic accuracy study would fulfill the following criteria:

  • (1) The reference standard provides error-free classification of all subjects.

  • (2) The same reference standard is used to verify all index test results.

  • (3) The index test and reference standard can be performed within a short interval to avoid changes in target condition status.

Errors in the classification by the reference standard can come from many sources. Intrinsic reference standard errors can occur, for example, when the target condition does not produce the biochemical changes of interest, or when a tumor is missed by imaging because it is too small for the resolution of the technique applied. Examples of these intrinsic errors are prostate-specific antigen–negative prostate cancers [9] and small polyps not detectable by computed tomography (CT) colonography [10].

The presence of an alternative condition in patients, one that produces changes in a biomarker similar to those associated with the target condition, can also lead to misclassification. An additional source of misclassification consists of errors and failures in the reference standard protocol and interpretation errors by observers. Examples include failing to detect cancer cells after fine-needle aspiration because the biopsy was performed outside the tumor mass, and overlooking a small pulmonary embolism when reading CT images.

A special problem occurs when the test under evaluation is part of the reference standard or its result is known when interpreting the reference standard. In both situations, agreement between index test results and reference standard outcome is likely to increase; these biases are known as incorporation bias and diagnostic review bias, respectively [8], [11].

For several target conditions there is no reference standard based on histological or biochemical changes, and researchers have to rely on combinations of symptoms and signs to define the condition. An example is rheumatic fever, in which a combination of major and minor criteria is used to establish the diagnosis [12]. Such classifications may vary over time [13], or across countries, and cannot be error-free.

Whatever the cause of classification errors, using an imperfect reference standard procedure will directly lead to bias in the accuracy statistics [5], [11], [14]. Any disagreement between the reference standard and the index test will be labeled as a “false” index test result. The net effect of reference standard misclassification can be an upward or downward bias in estimates of diagnostic accuracy. The direction depends on whether errors by the index test and imperfect reference standard are correlated. If errors are positively correlated, misclassification will erroneously increase agreement in the 2-by-2 tables and estimates of accuracy will be inflated. If errors are independent of one another given the true condition status, accuracy will generally be underestimated. The magnitude of the biasing effect will depend on the frequency of errors by the imperfect reference standard and the degree of correlation in errors between index test and reference standard [15], [16].
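The direction of this bias can be illustrated with a small calculation. The sketch below is an assumption-laden illustration, not a method taken from the article: it assumes a latent true condition status, expresses dependence between index test and reference standard errors as a covariance within each true status group, and uses hypothetical values throughout. The function name `apparent_accuracy` and all parameters are invented for the example.

```python
from typing import NamedTuple


class ApparentAccuracy(NamedTuple):
    sensitivity: float
    specificity: float


def apparent_accuracy(
    prevalence: float,
    se_index: float,
    sp_index: float,
    se_ref: float,
    sp_ref: float,
    cov_diseased: float = 0.0,
    cov_healthy: float = 0.0,
) -> ApparentAccuracy:
    """Accuracy of the index test as measured against an imperfect reference.

    cov_diseased / cov_healthy are the covariances between index test and
    reference results within truly diseased / truly non-diseased subjects
    (0.0 means errors are conditionally independent). They must be small
    enough to keep every joint probability between 0 and 1; this sketch
    does not check that.
    """
    p, q = prevalence, 1.0 - prevalence

    # Probability that both tests are positive, summed over true status.
    both_pos = p * (se_index * se_ref + cov_diseased) + q * (
        (1 - sp_index) * (1 - sp_ref) + cov_healthy
    )
    # Probability that both tests are negative, summed over true status.
    both_neg = p * ((1 - se_index) * (1 - se_ref) + cov_diseased) + q * (
        sp_index * sp_ref + cov_healthy
    )

    ref_pos = p * se_ref + q * (1 - sp_ref)  # P(reference positive)
    ref_neg = 1.0 - ref_pos                  # P(reference negative)

    return ApparentAccuracy(both_pos / ref_pos, both_neg / ref_neg)


# Hypothetical scenario: the index test truly has sensitivity and specificity 0.80.
scenario = dict(prevalence=0.3, se_index=0.8, sp_index=0.8, se_ref=0.9, sp_ref=0.9)

independent = apparent_accuracy(**scenario)
correlated = apparent_accuracy(**scenario, cov_diseased=0.05, cov_healthy=0.05)

print(f"independent errors: {independent}")
print(f"correlated errors:  {correlated}")
```

Under these assumed values, independent errors give an apparent sensitivity of about 0.68 and specificity of about 0.77, both below the true 0.80, whereas a positive error covariance of 0.05 gives about 0.82 and 0.85, both above it.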

Methods

We used multiple search strategies to identify methods for dealing with imperfect or missing reference standard situations in diagnostic studies. We performed searches of electronic databases (MEDLINE, EMBASE, MEDION, and Cochrane Library), contacted experts for articles in personal archives, explored databases from previous methodological projects such as Standards for Reporting of Diagnostic Accuracy (STARD) [17] and Quality Assessment of Diagnostic Accuracy Studies (QUADAS) [18], and

Results and discussion

We identified many different methods for problematic reference standard situations, which we have categorized into four main groups (see Table 1). These four groups differ in the extent of their departure from the classical “gold standard” diagnostic accuracy paradigm. The main characteristics of each group are described in Table 1, which also lists some key references providing further details. Most of these articles focus on one type of solution, and a few authors compare different

Use of flowchart

The flowchart is organized around a key set of questions. We will illustrate the use of the flowchart by discussing each question and applying it to the (hypothetical) case where the accuracy of a promising new marker for the detection of heart failure needs to be evaluated. In addition, Table 2 lists a number of examples from the literature in which one or more of the described methods have been applied.

Conclusions

The verification of index test results in diagnostic accuracy studies can pose a real challenge. Using a reference standard that provides inadequate classification or ignoring missing reference standard results may bias estimates of diagnostic accuracy. We present a flowchart centered around a set of key questions that provides methodological guidance to researchers planning diagnostic accuracy studies and to readers for critical appraisal of such studies.

Researchers and readers should be aware

Acknowledgments

This work was supported by the NHS Research Methodology Program, UK (grant: RM04/JH21). We gratefully acknowledge the many useful comments we received from the following experts: Prof. Aeilko H. Zwinderman, Professor of Biostatistics; Francisca Galindo Garre, PhD, Statistician; Prof. Les Irwig, Professor of Epidemiology; Prof. Karel G.M. Moons, Professor of Clinical Epidemiology; Alexandra A.H. van Abswoude, PhD, Methodologist. In addition, we gained input from discussions with Prof. Paul

References (70)

  • D. Sackett et al.

    The architecture of diagnostic research

  • J. Knottnerus et al.

    Assessment of the accuracy of diagnostic tests: the cross-sectional study

  • J.A. Knottnerus et al.

    General introduction: evaluation of diagnostic procedures

  • M.S. Pepe

    Incomplete data and imperfect reference tests

  • P.N. Valenstein

    Evaluating diagnostic tests with imperfect standards

    Am J Clin Pathol

    (1990)
  • A.W. Rutjes et al.

    Evaluation of diagnostic tests when there is no gold standard. A review of methods

    Health Technol Assess

    (2007)
  • P.M. Bossuyt et al.

    The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration

    Ann Intern Med

    (2003)
  • A.J. Birtle et al.

    Clinical features of patients who present with metastatic prostate carcinoma and serum prostate-specific antigen (PSA) levels < 10 ng/mL: the “PSA negative” patients

    Cancer

    (2003)
  • T. Doshi et al.

    CT colonography: false-negative interpretations

    Radiology

    (2007)
  • P. Whiting et al.

    Sources of variation and bias in studies of diagnostic accuracy: a systematic review

    Ann Intern Med

    (2004)
  • T.D. Jones

    The diagnosis of rheumatic fever

    JAMA

    (1944)
  • Guidelines for the diagnosis of rheumatic fever. Jones Criteria, 1992 update. Special Writing Group of the Committee on Rheumatic Fever, Endocarditis, and Kawasaki Disease of the Council on Cardiovascular Disease in the Young of the American Heart Association

    JAMA

    (1992)
  • E.J. Boyko et al.

    Reference test errors bias the evaluation of diagnostic tests for ischemic heart disease

    J Gen Intern Med

    (1988)
  • J.G. Lijmer et al.

    Empirical evidence of design-related bias in studies of diagnostic tests

    JAMA

    (1999)
  • A.W. Rutjes et al.

    Evidence of bias and variation in diagnostic accuracy studies

    CMAJ

    (2006)
  • P.M. Bossuyt et al.

    Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD Initiative

    Ann Intern Med

    (2003)
  • P. Whiting et al.

    Development and validation of methods for assessing the quality of diagnostic accuracy studies

    Health Technol Assess

    (2004)
  • C.B. Begg et al.

    Assessment of diagnostic tests when disease verification is subject to selection bias

    Biometrics

    (1983)
  • O. Harel et al.

    Multiple imputation for correcting verification bias

    Stat Med

    (2006)
  • R.J. Little et al.

    Statistical analysis with missing data

    (2002)
  • G. Molenberghs et al.

    Missing data perspectives of the fluvoxamine data set: a review

    Stat Med

    (1999)
  • X.H. Zhou

    Correcting for verification bias in studies of a diagnostic test's accuracy

    Stat Methods Med Res

    (1998)
  • G.A. Diamond

    Off Bayes: effect of verification bias on posterior probabilities calculated using Bayes' theorem

    Med Decis Making

    (1992)
  • A.S. Kosinski et al.

    Accounting for nonignorable verification bias in assessment of diagnostic tests

    Biometrics

    (2003)
  • A. Hadgu et al.

    Evaluation of nucleic acid amplification tests in the absence of a perfect gold-standard test: a review of the statistical and epidemiologic issues

    Epidemiology

    (2005)
    Competing interests: No competing interests.

    Authors' contributions: K.S.K. and P.M.B. were the lead applicants of the proposal to the NHS Research Methodology Program (grant: RM04/JH21), but all authors contributed to this proposal. A.W.S.R. performed the electronic searches. J.B.R., A.W.S.R., and A.C. drafted the first version. The literature findings were translated into a flowchart after numerous discussions among all authors and also with many external experts (see also Acknowledgments). All authors read and approved the final manuscript.
