Review Article
A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard
Introduction
A key phase in the evaluation of a test is determining its diagnostic accuracy, that is, the ability of the test to discriminate between patients with and without the target condition [1], [2], [3]. In diagnostic accuracy studies, the presence or absence of the target condition is determined by a gold standard. Ideally, the gold standard provides error-free classification. Accuracy measures, such as test sensitivity, specificity, likelihood ratios, predictive values, or the diagnostic odds ratio, express how well the results of the test under evaluation agree with the outcome of the gold standard [4]. For most, if not all, conditions in clinical medicine, a gold standard that is without error or uncertainty is not available [5], [6], [7]. In these circumstances, researchers use the best available practicable method to determine the presence or absence of the target condition, and such a method is referred to as the “reference standard” rather than the “gold standard” [3], [8].
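The accuracy measures named above can all be derived from the 2-by-2 table that cross-classifies index test results against the reference standard. The following is a minimal sketch, not taken from the article: the class name `TwoByTwo`, the function `accuracy_measures`, and the counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts of index test results cross-classified against the reference standard."""
    tp: int  # index test positive, reference positive
    fp: int  # index test positive, reference negative
    fn: int  # index test negative, reference positive
    tn: int  # index test negative, reference negative


def accuracy_measures(t: TwoByTwo) -> dict:
    """Return common accuracy measures for one 2-by-2 table.

    Assumes every cell is non-zero; a zero cell would make some
    ratios undefined and is not handled in this sketch.
    """
    sens = t.tp / (t.tp + t.fn)
    spec = t.tn / (t.tn + t.fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        "lr_positive": sens / (1 - spec),
        "lr_negative": (1 - sens) / spec,
        "diagnostic_odds_ratio": (t.tp * t.tn) / (t.fp * t.fn),
    }


# Illustrative (made-up) counts: 100 reference-positive and 200 reference-negative patients.
measures = accuracy_measures(TwoByTwo(tp=90, fp=20, fn=10, tn=180))
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

All of these quantities take the reference standard classification as the truth, which is why errors in the reference standard carry over directly into the estimates.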
Even within this framework, several problematic reference standard situations can occur. It may not be possible to perform the reference standard in all patients, the reference standard may be substantially imperfect, or there may be no accepted reference standard for a target condition. There is no universally accepted solution in diagnostic research when faced with missing, imperfect, or absent reference standards. Multiple solutions have been proposed, each with its own merits and limitations, but no unifying guidance exists that summarizes these coherently.
In this article, we briefly describe some of the reasons for imperfections of the reference standard. Using a systematic review of the literature, we have tried to identify and classify the solutions proposed for problematic reference standard situations. Based on this work, we present a flowchart that provides methodological guidance to researchers planning diagnostic accuracy studies and to readers for critical appraisal of such studies.
An ideal reference standard, a gold standard, in an optimal diagnostic accuracy study would fulfill the following criteria:
- (1) The reference standard provides error-free classification of all subjects.
- (2) The same reference standard is used to verify all index test results.
- (3) The index test and reference standard can be performed within a short interval to avoid changes in target condition status.
Errors in the classification by the reference standard can come from many sources. Intrinsic reference standard errors can occur, for example, when the target condition does not produce the biochemical changes of interest, or when a tumor is missed by imaging because it is too small for the resolution of the technique applied. Examples of these intrinsic errors are prostate-specific antigen–negative prostate cancers [9] and small polyps not detectable by computed tomography (CT) colonography [10].
The presence of an alternative condition in patients, one that produces changes in a biomarker similar to those associated with the target condition, can also lead to misclassification. An additional source of misclassification consists of errors and failures in the reference standard protocol and interpretation errors by observers. Examples include failing to detect cancer cells after fine-needle aspiration because the biopsy was taken outside the tumor mass, or overlooking a small pulmonary embolism when reading CT images.
A special problem occurs when the test under evaluation is part of the reference standard or its result is known when interpreting the reference standard. In both situations there is a likely increase in agreement between index test results and reference standard outcome; these biases are known as incorporation bias and diagnostic review bias, respectively [8], [11].
For several target conditions there is no reference standard based on histological or biochemical changes and researchers have to rely on combinations of symptoms and signs to define the condition. An example is rheumatic fever in which a combination of major and minor criteria is used to establish the diagnosis [12]. Such classifications may vary over time [13], or across countries, and cannot be error-free.
Whatever the cause of classification errors, using an imperfect reference standard procedure will directly lead to bias in the accuracy statistics [5], [11], [14]. Any disagreement between the reference standard and the index test will be labeled as a “false” index test result, even when the index test is right and the reference standard is wrong. The net effect of reference standard misclassification can be an upward or downward bias in estimates of diagnostic accuracy. The direction depends on whether errors by the index test and the imperfect reference standard are correlated. If errors are positively correlated, misclassification will erroneously increase agreement in the 2-by-2 table and estimates of accuracy will be inflated. If errors are independent, accuracy will generally be underestimated. The magnitude of the bias will depend on the frequency of errors by the imperfect reference standard and the degree of correlation in errors between index test and reference standard [15], [16].
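The direction of this bias can be made concrete with a small calculation. The sketch below is not the authors' method: it assumes a simple model in which the index test and the reference standard each have their own true sensitivity and specificity, and in which correlation between their errors is expressed as a covariance term within the diseased and the non-diseased group. The function name `apparent_accuracy` and all input values are hypothetical.

```python
def apparent_accuracy(prev, se_index, sp_index, se_ref, sp_ref,
                      cov_diseased=0.0, cov_healthy=0.0):
    """Sensitivity and specificity of the index test as they would appear
    when an imperfect reference standard is treated as the truth.

    A covariance of 0 means errors are independent given true disease
    status; a positive covariance means both tests tend to be right or
    wrong on the same patients (an assumed parameterization).
    """
    # Joint probabilities of concordant results within each true-status group.
    both_pos_d = se_index * se_ref + cov_diseased
    both_neg_d = (1 - se_index) * (1 - se_ref) + cov_diseased
    both_pos_h = (1 - sp_index) * (1 - sp_ref) + cov_healthy
    both_neg_h = sp_index * sp_ref + cov_healthy
    # Discordant cells shrink by the same covariance and must stay non-negative.
    cells = [
        both_pos_d, both_neg_d, both_pos_h, both_neg_h,
        se_index * (1 - se_ref) - cov_diseased,
        (1 - se_index) * se_ref - cov_diseased,
        (1 - sp_index) * sp_ref - cov_healthy,
        sp_index * (1 - sp_ref) - cov_healthy,
    ]
    if min(cells) < -1e-12:
        raise ValueError("covariance outside the range allowed by the accuracies")
    # Proportion of patients the reference standard labels positive / negative.
    ref_pos = prev * se_ref + (1 - prev) * (1 - sp_ref)
    ref_neg = prev * (1 - se_ref) + (1 - prev) * sp_ref
    apparent_se = (prev * both_pos_d + (1 - prev) * both_pos_h) / ref_pos
    apparent_sp = (prev * both_neg_d + (1 - prev) * both_neg_h) / ref_neg
    return apparent_se, apparent_sp


# Illustrative inputs: prevalence 0.30, both tests truly 0.90 sensitive and specific.
independent = apparent_accuracy(0.3, 0.9, 0.9, 0.9, 0.9)
correlated = apparent_accuracy(0.3, 0.9, 0.9, 0.9, 0.9,
                               cov_diseased=0.05, cov_healthy=0.05)
print("independent errors:", independent)
print("correlated errors: ", correlated)
```

With these illustrative inputs, independent errors give an apparent sensitivity of about 0.74 and an apparent specificity of about 0.86, both below the true value of 0.90. Adding positive error correlation raises both figures (to about 0.88 and 0.94), so the apparent specificity now exceeds the true value.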
Methods
We used multiple search strategies to identify methods for dealing with imperfect or missing reference standard situations in diagnostic studies. We performed searches of electronic databases (MEDLINE, EMBASE, MEDION, and Cochrane Library), contacted experts for articles in personal archives, explored databases from previous methodological projects like Standards for Reporting of Diagnostic Accuracy (STARD) [17] and Quality Assessment of Diagnostic Accuracy Studies (QUADAS) [18], and
Results and discussion
We identified many different methods for problematic reference standard situations, which we have categorized into four main groups (see Table 1). These four groups differ in the extent of their departure from the classical “gold standard” diagnostic accuracy paradigm. The main characteristics of each group are described in Table 1, which also lists some key references providing further details. Most of these articles focus on one type of solution, and a few authors compare different
Use of flowchart
The flowchart is organized around a key set of questions. We will illustrate the use of the flowchart by discussing each question in turn and applying it to the (hypothetical) case in which the accuracy of a promising new marker for the detection of heart failure needs to be evaluated. In addition, Table 2 lists a number of examples from the literature in which one or more of the described methods have been applied.
Conclusions
The verification of index test results in diagnostic accuracy studies can pose a real challenge. Using a reference standard that provides inadequate classification or ignoring missing reference standard results may bias estimates of diagnostic accuracy. We present a flowchart centered around a set of key questions that provides methodological guidance to researchers planning diagnostic accuracy studies and to readers for critical appraisal of such studies.
Researchers and readers should be aware
Acknowledgments
This work was supported by the NHS Research Methodology Program, UK (grant: RM04/JH21). We gratefully acknowledge the many useful comments we received from the following experts: Prof. Aeilko H. Zwinderman, Professor of Biostatistics; Francisca Galindo Garre, PhD, Statistician; Prof. Les Irwig, Professor of Epidemiology; Prof. Karel G.M. Moons, Professor of Clinical Epidemiology; Alexandra A.H. van Abswoude, PhD, Methodologist. In addition, we gained input from discussions with Prof. Paul
References (70)
- et al. Methodology for the assessment of new dichotomous diagnostic tests. J Chronic Dis (1981)
- et al. Is everything all right if nothing seems wrong? A simple method of assessing the diagnostic value of endoscopic procedures when a gold standard is absent. J Urol (1999)
- et al. Comparison of T-cell-based assay with tuberculin skin test for diagnosis of Mycobacterium tuberculosis infection in a school tuberculosis outbreak. Lancet (2003)
- et al. Implication of different cardiac troponin I levels for clinical outcomes and prognosis of acute chest pain patients. J Am Coll Cardiol (2004)
- The discrepancy in discrepant analysis. Lancet (1996)
- Bias in discrepant analysis: when two wrongs don't make a right. J Clin Epidemiol (1998)
- Discrepant analysis: a biased and an unscientific method for estimating test sensitivity and specificity. J Clin Epidemiol (1999)
- et al. Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. J Clin Epidemiol (1988)
- et al. Randomised comparisons of medical tests: sometimes invalid, not always efficient. Lancet (2000)
- et al. The efficacy of diagnostic imaging. Med Decis Making (1991)
- The architecture of diagnostic research
- Assessment of the accuracy of diagnostic tests: the cross-sectional study
- General introduction: evaluation of diagnostic procedures
- Incomplete data and imperfect reference tests
- Evaluating diagnostic tests with imperfect standards. Am J Clin Pathol
- Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol Assess
- The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Ann Intern Med
- Clinical features of patients who present with metastatic prostate carcinoma and serum prostate-specific antigen (PSA) levels < 10 ng/mL: the “PSA negative” patients. Cancer
- CT colonography: false-negative interpretations. Radiology
- Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med
- The diagnosis of rheumatic fever. JAMA
- Guidelines for the diagnosis of rheumatic fever. Jones Criteria, 1992 update. Special Writing Group of the Committee on Rheumatic Fever, Endocarditis, and Kawasaki Disease of the Council on Cardiovascular Disease in the Young of the American Heart Association. JAMA
- Reference test errors bias the evaluation of diagnostic tests for ischemic heart disease. J Gen Intern Med
- Empirical evidence of design-related bias in studies of diagnostic tests. JAMA
- Evidence of bias and variation in diagnostic accuracy studies. CMAJ
- Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD Initiative. Ann Intern Med
- Development and validation of methods for assessing the quality of diagnostic accuracy studies. Health Technol Assess
- Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics
- Multiple imputation for correcting verification bias. Stat Med
- Statistical analysis with missing data
- Missing data perspectives of the fluvoxamine data set: a review. Stat Med
- Correcting for verification bias in studies of a diagnostic test's accuracy. Stat Methods Med Res
- Off Bayes: effect of verification bias on posterior probabilities calculated using Bayes' theorem. Med Decis Making
- Accounting for nonignorable verification bias in assessment of diagnostic tests. Biometrics
- Evaluation of nucleic acid amplification tests in the absence of a perfect gold-standard test: a review of the statistical and epidemiologic issues. Epidemiology
Competing interests: No competing interests.
Authors' contributions: K.S.K. and P.M.B. were the lead applicants of the proposal to the NHS Research Methodology Program (grant: RM04/JH21), but all authors contributed to this proposal. A.W.S.R. performed the electronic searches. J.B.R., A.W.S.R., and A.C. drafted the first version. The literature findings were translated into a flowchart after numerous discussions among all authors and also with many external experts (see also Acknowledgments). All authors read and approved the final manuscript.