Abstract
Objectives To estimate and compare the diagnostic accuracy of magnetic resonance imaging (MRI) and ultrasound, for the prediction of rheumatoid arthritis (RA) in unclassified arthritis (UA).
Methods MEDLINE, Embase and BIOSIS were searched from 1987 to May 2019. Studies evaluating any imaging test in participants with UA were eligible. Reference standards were RA classification criteria or methotrexate initiation. Two authors independently extracted data and assessed validity using QUADAS-2. Sensitivities and specificities were calculated for each imaging characteristic and joint area. Summary estimates with 95% confidence intervals (CI) were estimated where possible.
Results Nineteen studies were included; 13 evaluated MRI (n=1,143; 454 with RA) and 6 evaluated ultrasound (n=531; 205 with RA). Studies were limited by unclear recruitment procedures, inclusion of patients with RA at baseline, differential verification, lack of blinding and consensus grading. Study heterogeneity largely precluded meta-analysis; however, summary sensitivity and specificity for MRI synovitis in at least one joint were 93% (95% CI 88%, 96%) and 25% (95% CI 13%, 41%), respectively (3 studies). Specificities may be higher for other MRI characteristics but data are limited. Ultrasound results were difficult to synthesise due to different diagnostic thresholds and reference standards.
Conclusions The evidence for MRI or ultrasound as single tests for predicting RA in people with UA is heterogeneous and of variable methodological quality. Larger studies using consensus grading and consistently defined RA diagnosis are needed to identify whether combinations of imaging characteristics, either alone or in combination with other clinical findings, can better predict RA in this population.
Systematic review registration PROSPERO CRD42020158239.
Key messages
To date, the diagnostic accuracy of imaging tests for the earlier identification of RA has not been systematically assessed. We conducted a systematic review to estimate, and if possible compare, the accuracy of MRI and ultrasound for predicting the diagnosis of rheumatoid arthritis in people with unclassified arthritis.
In this systematic review of 13 studies of MRI (1,143 participants) and 6 studies of ultrasound (531 participants), study quality was highly variable with considerable variation in populations, diagnostic thresholds and reference standards limiting potential for meta-analysis.
Individual MRI imaging characteristics demonstrated either high sensitivity (with low specificity) or high specificity (with low sensitivity) with inconsistent results between studies. Similar heterogeneity in results was observed for ultrasound but with considerably fewer data.
Imaging can identify subclinical inflammatory changes in joint areas where no synovitis is apparent, which may be useful in identifying the aetiology of symptoms. However, larger studies using consistent scoring systems for imaging interpretation and definition of RA are needed to identify the extent to which imaging findings alone can predict the development of RA. Until then, imaging should be interpreted in light of other findings.
INTRODUCTION
Despite major treatment advances, rheumatoid arthritis (RA) is associated with long-term morbidity and accelerated mortality1. Uncontrolled RA also impacts on daily activities and quality of life2,3 and can confer a substantial socioeconomic burden and loss in productivity4–6. Early diagnosis and treatment of RA can prevent joint damage, and ensure that people remain productive and in work7. Evidence for a therapeutic window of opportunity early in the course of the disease is accumulating8,9, and has focused attention on earlier identification of which patients will go on to develop RA 10. However, among those who present with unclassified arthritis (UA), about 60% have a self-limiting disease while 40% develop a chronic persistent arthritis (of which only a proportion is RA)11, making the accurate and timely diagnosis of RA in patients with UA an important aim.
There is no single test to confirm the diagnosis of RA. Classification criteria developed by the American College of Rheumatology (ACR 1987) (most recently in collaboration with the European League Against Rheumatism (ACR/EULAR 2010))12,13 are often used as a proxy for diagnosis. The classification criteria have historically relied on clinical signs and symptoms, rheumatoid factor (RF) and radiographic changes13, with the 2010 revision incorporating inflammation markers (ESR and CRP) and anti-cyclic citrullinated antibodies (anti-CCP)12. Because of enhanced sensitivity to detect early disease, the ACR/EULAR 2010 criteria may lead to early and appropriate initiation of anti-rheumatic therapy among some individuals, but at the cost of unnecessary treatment in others14. The development of more accurate diagnostic strategies for RA in patients with very early joint symptoms is therefore of critical importance.
A number of clinical prediction models using various clinical and serological criteria have been developed15–17; however, an array of imaging modalities with the potential to improve early diagnosis is also available18,19. Musculoskeletal ultrasound and magnetic resonance imaging (MRI) can identify joint inflammation and structural changes even in those with normal radiographs20–23 and are increasingly used in clinical practice24 or for research purposes25. Positron emission tomography (PET)26,27 and single-photon emission computed tomography (SPECT)28 are also being investigated in RA, along with novel approaches such as fluorescence optical imaging29–32 and optical spectral transmission imaging33,34.
The diagnostic accuracy of imaging tests for the earlier identification of RA has not yet been systematically assessed. We conducted a systematic review to estimate, and if possible compare, the accuracy of imaging tests in the prediction of RA in newly presenting patients with UA.
METHODS
We followed published methods for systematic reviews35 and report our findings according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) extension for diagnostic test accuracy studies36.
Data sources
We searched MEDLINE, Embase and BIOSIS from 1987 to 2nd May 2019. Full search strategies are available (supplementary appendix 1). No language restrictions were applied. Reference lists of systematic reviews and included study reports were also screened.
Study selection
Study selection was undertaken independently by two reviewers; disagreements were resolved by discussion. Studies evaluating any imaging test characteristic in participants with UA with at least one clinically swollen joint, or with at least one clinically swollen or tender joint, were eligible for inclusion if they provided a cross-tabulation of imaging results at baseline against the subsequent diagnosis of RA (defined as fulfilment of ACR 198713 or ACR/EULAR 201012 classification criteria or by the initiation of methotrexate or other DMARD treatment), at least three months later. Studies in populations with clinically suspect arthralgia37, studies where UA was defined by joint tenderness, and studies including more than 50% of participants with RA at baseline were excluded unless subgroup data for UA were presented. Case-control studies, conference abstracts and studies recruiting fewer than five participants with or without a final diagnosis of RA were also excluded. A list of excluded studies with reasons for exclusion is available on request. Authors of eligible studies were contacted when they presented insufficient data to allow for the construction of 2 × 2 contingency tables.
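To illustrate how a 2 × 2 contingency table of baseline imaging result against final RA diagnosis yields the accuracy measures used in this review, the following is a minimal Python sketch. The counts are hypothetical, and the Wilson score interval is an assumption chosen for illustration; the review does not state which interval method was applied to individual studies.

```python
import math


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (illustrative choice)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def accuracy_from_2x2(tp, fp, fn, tn):
    """Sensitivity and specificity, each with a 95% interval, from 2 x 2 counts.

    tp/fn: imaging positive/negative among participants later classified as RA.
    fp/tn: imaging positive/negative among participants not classified as RA.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_ci(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_ci(tn, tn + fp),
    }


# Hypothetical counts, not taken from any included study.
result = accuracy_from_2x2(tp=45, fp=30, fn=5, tn=20)
print(result)
```

With these hypothetical counts, sensitivity is 45/50 and specificity is 20/50; each interval depends only on the number of participants in the relevant row of the table.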
Data collection, quality assessment and analysis
Two reviewers independently extracted data using a pre-specified and piloted data extraction form and assessed study quality using the Quality Assessment of Diagnostic Accuracy Studies-2 tool (QUADAS-2) 38 (supplementary appendix 2). Any disagreements were resolved by consensus. Each study should ideally prospectively recruit a representative sample of participants with UA and should exclude those who meet RA classification criteria at baseline. Blinding of imaging test interpretation and of final diagnosis of RA should be implemented and standard scoring systems such as RAMRIS for MRI39,40, recent EULAR-OMERACT41–44 or previous OMERACT definitions42,45–48 for grading ultrasound synovitis in RA and/or widely used consensus definitions for ultrasound19,41,49–51 should be used.
Estimates of sensitivity and specificity from each study were plotted on coupled forest plots for each imaging characteristic and joint area imaged. Where two or more studies used the same scoring system and reference standard, summary sensitivities and specificities with 95% confidence intervals (95% CI) were obtained using the bivariate hierarchical model52. Due to paucity of studies, the models were simplified by assuming no correlation between sensitivity and specificity estimates and by setting near-zero variance estimates of the random effects to zero53. Study heterogeneity was examined by inspection of forest plots; no formal investigation was conducted due to data scarcity.
Plots were produced using RevMan 5.3 (Nordic Cochrane Centre) and analyses were undertaken in Stata 16 using the meqrlogit (bivariate hierarchical models) or blogit (fixed effect logistic regression) commands. Absolute differences in sensitivities or specificities were derived using nlcom.
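The fixed effect case can be sketched in a few lines. The Python example below is not the bivariate hierarchical model and does not reproduce the Stata commands; it assumes an intercept-only fixed effect logistic model, for which the maximum likelihood estimate is the pooled proportion and the Wald interval is formed on the logit scale. The function name and the study counts are hypothetical.

```python
import math


def pooled_proportion_logit(events, totals, z=1.96):
    """Fixed effect pooled proportion with a logit-scale Wald interval.

    events: per-study counts (e.g. true positives for sensitivity).
    totals: per-study denominators (e.g. participants with RA).
    """
    e = sum(events)
    n = sum(totals)
    if e == 0 or e == n:
        raise ValueError("logit is undefined when the pooled proportion is 0 or 1")
    logit = math.log(e / (n - e))
    se = math.sqrt(1 / e + 1 / (n - e))  # standard error of the pooled logit

    def inv_logit(x):
        return 1 / (1 + math.exp(-x))

    return inv_logit(logit), inv_logit(logit - z * se), inv_logit(logit + z * se)


# Hypothetical true positive counts and RA totals from three studies.
estimate, lower, upper = pooled_proportion_logit([18, 27, 45], [20, 30, 50])
print(f"pooled sensitivity {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Working on the logit scale keeps the interval within 0 and 1, which matters when sensitivities are close to 100%. A random effects or bivariate model would additionally allow for between-study variation, which this sketch ignores.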
RESULTS
A total of 204 records were selected for full-text assessment from 10,812 unique references (Figure 1). Corresponding authors of 26 publications (13 conference abstracts and 13 full text papers) were contacted; information was supplied by 16, resulting in inclusion of three56,57,70 and exclusion of 12 21,73–84. Common reasons for exclusion were ineligible study participants (e.g. ≥50% with RA; 74/185, 40%) or lack of follow up (35/185, 19%). Other exclusion reasons are displayed in Figure 1 and detailed in supplementary appendix 2.
Figure 1. PRISMA flow diagram
Nineteen studies (19/204, 9%) met inclusion criteria. Table 1 provides a summary of key study characteristics and further imaging details are shown in Table 3. Thirteen studies (N=1,143, RA n=454) evaluated the accuracy of individual MRI characteristics55,59–63,65–71, and six studies (N=531, RA n=205) evaluated ultrasound54,56–58,64,72. One study reported data for scintigraphy60. Median sample size was 81 (inter-quartile range (IQR) 39, 119) and median prevalence of RA at the end of study follow up was 38% (IQR 29%, 51%), with 45% (29%, 58%) for MRI and 34% (25%, 40%) for ultrasound.
Table 1. Summary characteristics across studies
UA was defined as clinical synovitis in at least one joint (6/19, 32%)54–59, more than one swollen joint (2, 11%)60,61, and at least one (4, 21%)62–65 or more than one (3, 16%)66–68 swollen or tender joint of the hand or wrist. Four studies (21%) only stipulated that participants not meet RA classification criteria at baseline, with no further detail69–72. Four studies (21%) restricted inclusion to anti-CCP negative64,65, RF negative61, or either RF or anti-CCP positive58 participants. Fifteen studies (79%) excluded participants meeting one set of RA classification criteria at baseline (n=9 ACR 198759–62,64,66–69; n=5 ACR/EULAR 201054,56,63,70,71), or either set of criteria72 (Table 1).
MRI magnet strength ranged from 0.2T60 to 3.0T62,63,67; seven studies used 1.5T55,59,61,65,69–71. MRI synovitis (8/13, 62%) and erosion (8/13, 62%) were commonly assessed. Nine of the 13 studies (69%) used RAMRIS59,61,63,65–67,71,85,86.
Four studies (4/6, 67%)54,57,58,64,72 assessed ultrasound tenosynovitis (TS), defined as absent or present19,48,87 or based on a consensus-based US score for TS in RA45,47. All ultrasound evaluations included both grey scale (GS) synovitis and power Doppler (PD) activity. Synovitis was scored using similar semiquantitative grading scales19,49–51. Two studies54,58 used the OMERACT definition of synovitis45,46 and none used the more recent EULAR-OMERACT combined scoring system for grading synovitis in RA41–44. There was considerable variation between studies in joint sets imaged and use of bilateral imaging (Table 1).
Validity and applicability of the evidence (QUADAS-2)
Six studies were rated low risk of bias for participant selection54,55,61,62,71,72 and two57,65 as high risk (Figure 2). Risk of bias could not be judged in 11 studies due to unclear recruitment procedures56,58–60,63,64,66–69 or possible inclusion of participants meeting reference standard criteria at baseline58,70. Study eligibility criteria did not match our UA definition in 10 studies58,60–68, including three reporting median symptom duration over 12 months60,64,67, leading to concerns about applicability of the study population. The generalisability of the study population could not be judged in four studies69–72.
Figure 2. Risk of bias and applicability concerns graph
Risk of bias for the index test was low in 53% (10/19) of studies55,57,58,61,62,64,65,69–71. Risk was high for two MRI evaluations due to lack of pre-specified diagnostic thresholds63,67 and for one ultrasound evaluation due to lack of blinding72. Blinding to baseline clinical findings was not clearly reported in two ultrasound54,56 and four MRI evaluations59,60,66,88. Ultrasound (5/6) was interpreted by a single experienced or trained sonographer or rheumatologist, using clearly defined criteria for interpretation54,56–58,64,72. We had high (9/13)59,60,62,65,67–71 or unclear (1/13)55 concerns for the applicability of MRI evaluations; seven studies reported consensus or mean results59,62,65,67,69–71 and three provided no MRI scoring details60,68,69. Seven studies did not describe MRI interpreter experience55,60,62,65,68,69,71.
Reference standard assessment was explicitly blinded to imaging test results in three studies 64,67,73 (one scored unclear risk of bias overall because of a combined reference standard64). Lack of blinding was reported (n=1)55 or was inferred in 16 studies 54,56–63,65,66,68–72.
There was high risk of bias for participant flow in 58% of studies (11/19), due to exclusion of participants from analysis 54–56,62,65,66,68,69 or differential verification 61,71,72. No unevaluable images were reported.
Study synthesis
Sensitivities and specificities are presented by reference standard and imaging characteristic (Figures 3–5; supplementary Figures 2-4).
Figure 3. Forest plot of sensitivities and specificities from studies evaluating individual MRI characteristics (any joint positive against ACR 1987 reference standard)
MRI performance
Eight studies 55,59,60,62,66,68–70 evaluated MRI in any joint using ACR 1987 criteria as the reference standard (Figure 3). Four studies55,59,66,70 used RAMRIS and were eligible for statistical pooling.
Summary sensitivity and specificity for synovitis was 93% (95% CI 88% to 96%) and 25% (95% CI 13% to 41%) (n=355,59,70; 453 participants [151 RA]) and for symmetric synovitis 78% (95% CI 70% to 84%) and 39% (95% CI 17% to 67%), respectively (n=255,59; 252 [122]) (Table 2).
Table 2. Summary of results for imaging test evaluations
Data for other MRI characteristics were heterogeneous but suggested lower sensitivities with some evidence of higher specificities (Figure 3). Summary sensitivities were 51% (95% CI 32% to 70%) for bone marrow oedema (BMO) (n=355,59,70; 453 participants [151 with RA]), 65% (95% CI 18% to 94%) for erosion (n=255,59; 252 [122]) and 66% (95% CI 36% to 87%) for tenosynovitis (n=355,66,70; 440 [103 cases]). Respective summary specificities were 76% (95% CI 39% to 94%), 68% (95% CI 20% to 95%) and 66% (95% CI 39% to 86%) (Table 2).
A combination of MRI characteristics (synovitis/erosion, or, BMO and/or erosion) was assessed in three studies59,60,67, two using RAMRIS59,67 (Figure 4). Results were mixed, and data were insufficient to identify a clear effect on accuracy. Four studies55,65–67 presented results by joint area (MCP, PIP, wrist) using RAMRIS and ACR 1987 as reference (Table 2). Summary sensitivities were 47% to 77% and specificities 34% to 75%; however, some characteristics were either highly sensitive or highly specific. Summary sensitivity for wrist synovitis was 90% (95% CI 81% to 95%) (n=255,66) with specificity 19% (95% CI 14% to 26%). In contrast, BMO and erosion of the PIP joints were highly specific (summary specificities 96% to 99%), with sensitivities of 7% or less. Additional data for miscellaneous thresholds and reference standards are presented in supplementary Figure 3.
Figure 4. Forest plot of sensitivities and specificities from studies evaluating combinations of individual MRI characteristics (any joint positive against ACR 1987 reference standard)
Ultrasound performance
Ultrasound evaluations used a variety of scoring systems as reported above. Diagnostic thresholds and reference standards used varied, limiting statistical pooling.
Both GS and PD (grade ≥1 in at least one joint) showed high sensitivities (98% or more) but low specificities (Table 2; Figure 5). One study64 reported higher specificities (up to 86%) for higher grades of synovitis, or for synovitis in a greater number of joints (supplementary Figure 4). Three studies reporting combinations of GS and PD54,57,58 had contrasting results, with either high sensitivities57,58 or high specificities54 (Table 2) and with variations according to synovitis grade or number of joints (Figure 5; supplementary Figure 4). One study57 reported data by individual joints: specificities for joint synovitis (grade ≥1) were at least 90% for two PIP and one MTP joint on GS, and for three PIP (1 and 4-5) and three MTP (2-3 and 5) joints on PD (supplementary Figure 5); sensitivities were 37% or lower. The highest sensitivities were observed for wrist synovitis on both GS (89%, 95% CI 76% to 96%) and PD (87%, 95% CI 74% to 95%)57.
Figure 5. Forest plot of sensitivities and specificities from studies evaluating GS and/or PD ultrasound (any joint positive against mixed reference standards)
DISCUSSION
We found 13 studies evaluating four MRI characteristics and six ultrasound studies evaluating synovitis. Studies were of variable methodological quality regarding both risk of bias and clinical applicability. Unclear participant recruitment procedures, potential inclusion of patients with RA at baseline, differential verification and lack of blinding were of particular concern. Only five studies included participants meeting our definition of UA54,56,57,63,86; the others included at least some participants with tender joints but no synovitis62–68, required more than one joint with synovitis60,61, or applied additional biomarker criteria58,61,64,65. Most MRI evaluations were blinded to the reference standard but reported consensus or mean results across more than one observer, so results are unlikely to be replicable in a usual practice setting, in which one radiologist performs MRI interpretations. Most studies did, however, report using replicable scoring methods for both MRI and ultrasound.
Although the aim is earlier initiation of treatment to reduce morbidity and prevent irreversible damage in RA, of equal importance is to avoid inappropriate treatment with often expensive and potentially harmful treatments in those who do not need it. The ideal testing strategy in early disease should maximise sensitivity but not at the expense of specificity.
MRI synovitis had the highest and most homogeneous sensitivities (summary sensitivity 93%, 95% CI 88% to 96%); however, specificities were poor and variable, and false positive rates correspondingly high. The summary specificity of 25% (95% CI 13% to 41%) suggests that as many as 75% of those who did not meet RA classification criteria also demonstrated synovitis. There was some evidence of higher specificities for other MRI characteristics, but studies were small and results inconsistent. Ultrasound results were difficult to synthesise due to different diagnostic thresholds, scoring systems and reference standards used.
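The practical meaning of a high sensitivity paired with a low specificity can be illustrated by converting the summary estimates into predictive values. The Python sketch below applies Bayes' theorem to the summary sensitivity (93%) and specificity (25%) for MRI synovitis; the prevalence of 45% is the median RA prevalence across all MRI studies and is used here only as an assumption. These predictive values are an illustration and are not results reported by the review.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Summary estimates for MRI synovitis; prevalence is an illustrative assumption.
ppv, npv = predictive_values(sensitivity=0.93, specificity=0.25, prevalence=0.45)
print(f"PPV {ppv:.2f}, NPV {npv:.2f}")
```

Under these assumptions roughly half of participants with MRI synovitis would go on to be classified as RA (PPV about 0.50), while about four in five of those without synovitis would not (NPV about 0.81). Both values shift with prevalence, so they should not be transferred to settings with a different case mix.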
For both tests, limited data suggested that combinations of findings from different joint areas could help to rule in or rule out those who are most or least likely to develop RA; however, further work is needed to determine the optimal combination. Combining imaging test results with other clinical features and biomarkers in formal prediction rules may further add predictive power58,59,63,66, although this was not the focus of the current review.
Challenges in evaluation
Due to varying definitions of UA and RA, we were unable to determine how MRI or ultrasound would perform in our target population of patients with UA who do not meet either set of RA classification criteria at the time of imaging. Almost all studies included people meeting at least one set of classification criteria at baseline, either because they explicitly excluded participants meeting only one set of criteria, or because the 2010 criteria had not yet been published. Tests may have performed differently if all of those with RA at baseline had been excluded. Changing classification criteria and the small number of studies (n=4) using only the ACR/EULAR 2010 criteria as the reference standard were additional challenges. Participants with resolving symptoms were poorly reported (5% to 41%57,63; n=6).
There was a notable lack of consistency in ultrasound scoring. As expected, evaluations published prior to the EULAR-OMERACT combined scoring system for grading ultrasound synovitis in RA used various ultrasound scoring systems19,46,49–51,87,89–93, but these have remained in use even in the most recent studies published after 2017. The potential for absence of subclinical inflammation to identify those most likely to resolve has not yet been considered.
Only three studies explicitly reported blinding of final disease status to the imaging test result and one reported lack of blinding to the imaging test. Although current classification criteria do not include imaging characteristics, knowledge of the imaging result may implicitly affect diagnosis.
Strengths and weaknesses of this review
We used a comprehensive electronic literature search with stringent systematic review methods that included independent duplicate data extraction and quality assessment of studies, attempted contact with authors, and a clear analysis structure. We identified one previous systematic review94, which identified only two studies in UA populations (both of which were included in our review59,60) and included 11 studies in mixed populations, four of which we also included61,65,68,69. We excluded the other seven studies on the basis of study population95–98, target condition99, lack of follow-up100, or lack of 2 × 2 data101. Limited statistical pooling was possible due to our stringent criteria for meta-analysis. Pooling was only considered for studies using similar imaging scoring systems and reference standards for RA. Although the 2010 criteria identify most patients meeting 1987 criteria (e.g. 91% of those meeting 1987 criteria in one study102), the 2010 criteria also classify considerably more patients as having RA compared to the 1987 criteria (e.g. in another study only 36% of those meeting 2010 criteria would have been diagnosed using the 1987 criteria103). The only study reporting accuracy against more than one reference standard54 reported RA prevalence 17% higher when defined by DMARD initiation than when defined by classification criteria.
Implications for research
Future studies of imaging tests for the diagnosis of RA should be based on clearly defined UA populations (e.g. individuals with ≥1 swollen joint and who do not meet RA or other rheumatologic disease classification criteria at baseline). Symptoms and duration of symptoms should be specified. The reference standard diagnosis using current RA classification criteria should be made up to at least 12 months from baseline, and should be clearly blinded to imaging findings. Validated scoring methods and accepted thresholds for defining a positive test (e.g. RAMRIS39,40 and EULAR-OMERACT42–44 for MRI and US, respectively) should be used. Any future research study should conform to reporting guidelines, including the updated Standards for Reporting of Diagnostic Accuracy guideline104.
Conclusion
There is currently insufficient evidence to recommend MRI or ultrasound alone for early RA diagnosis in people with UA. This is in large part because of a lack of consistency in study methodology, which also prevents synthesis of data in studies examining the combined utility of imaging and other clinical and serological tests. In order to address this, the research community needs to develop larger scale studies that apply consistent recruitment and scoring strategies and relate these to common reference standards, and that conform to international reporting guidelines for test accuracy studies 36.
Data Availability
Available on request
Financial support
This paper presents independent research supported by the National Institute for Health Research (NIHR) Birmingham Biomedical Research Centre at the University Hospitals Birmingham NHS Foundation Trust and the University of Birmingham (grant reference No BRC-1215-20009). The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care. PdeP is supported by a National Institute for Health Research (NIHR) personal fellowship [grant reference PDF-2014-07-055]. KR, AF, JD, SB and JJD are supported by the NIHR Birmingham Biomedical Research Centre.
Industry affiliations
None to report
The work has not been previously presented at a conference or meeting.
The lead authors and manuscript’s guarantor affirm that the manuscript is an honest, accurate and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.
Acknowledgments
JD and PdeP contributed equally to this work. JD, PdeP, AO, SB and ZL undertook the review. JD, PdeP, KR, AF and JJD contributed to the conception of the work and interpretation of the findings. JD and PdeP drafted the manuscript. All authors critically revised the manuscript and approved the final version. PdeP acts as guarantor. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.
We thank corresponding authors of included papers for providing additional data not contained in their publications, and Susan Bayliss for additional assistance with bibliographic search.