Exploring the potential of Claude 2 for risk of bias assessment: Using a large language model to assess randomized controlled trials with RoB 2

Angelika Eisele-Metzger; Judith-Lisa Lieberum; Markus Toews; Waldemar Siemens; Felix Heilmeyer; Christian Haverkamp; Daniel Boehringer; Joerg J Meerpohl

doi:10.1101/2024.07.16.24310483

ABSTRACT

Systematic reviews are essential for evidence based healthcare, but conducting them is time and resource consuming. To date, efforts have been made to accelerate and (semi-) automate various steps of systematic reviews through the use of artificial intelligence and the emergence of large language models (LLMs) promises further opportunities. One crucial but complex task within systematic review conduct is assessing the risk of bias of included studies. Therefore, the aim of this study was to test the LLM Claude 2 for risk of bias assessment of 100 randomized controlled trials using the revised Cochrane risk of bias tool (“RoB 2”; involving judgements for five specific domains and an overall judgement). We assessed the agreement of risk of bias judgements by Claude with human judgements published in Cochrane Reviews. The observed agreement between Claude and Cochrane authors ranged from 41% for the overall judgement to 71% for domain 4 (“outcome measurement”). Cohen’s κ was lowest for domain 5 (“selective reporting”; 0.10 (95% confidence interval (CI): −0.10-0.31)) and highest for domain 3 (“missing data”; 0.31 (95% CI: 0.10-0.52)), indicating slight to fair agreement. Fair agreement was found for the overall judgement (Cohen’s κ: 0.22 (95% CI: 0.06-0.38)). Sensitivity analyses using alternative prompting techniques or the more recent version Claude 3 did not result in substantial changes. Currently, Claude’s RoB 2 judgements cannot replace human risk of bias assessment. However, the potential of LLMs to support risk of bias assessment should be further explored.

INTRODUCTION

Systematic reviews are considered a highly valuable tool for evidence synthesis and informed decision making in healthcare and other fields, however, require a lot of time and resources (1, 2). Steps in conducting a systematic review include framing the research question, preparation of a review protocol, searching for and selecting relevant studies, risk of bias (RoB) assessment of the studies included, data extraction, synthesis and interpretation of the results and finally reporting (3, 4).

In order to make this work more time- and resource efficient, efforts have been underway for several years to assist or even (semi-)automate steps of the systematic review process, using artificial intelligence (AI) and, more specifically, machine learning (ML) techniques (5, 6). Based on (un-/semi-/self-supervised or reinforcement) learning from data provided and further development of pattern recognition systems, algorithms allow to constantly improve performance on specific tasks without being explicitly programmed to do so (7, 8). Exemplary applications that use ML to support steps of systematic reviews include Rayyan (9), Covidence (10) and EPPI Reviewer (11), that are particularly useful to support screening and data extraction, deduplication tools such as Deduklick (12), and the RobotReviewer (13) for RoB assessment.

Recently, further AI systems based on large language models (LLMs) such as ChatGPT (14), PaLM 2 (15), LLaMA (16), or Claude (17) have gained attention, and a variety of potential uses in health care and research alike has been discussed (18–21). LLMs are trained on a very large dataset to always predict the most likely next token, given any textual input. They are commonly fine-tuned to simulate or participate in human dialogues (22). Contrariwise to conventional statistical classification methods, which rely on task specific training using labelled training data, LLMs can be instructed to perform any task without task-specific training. The training process is replaced with crafting and refining detailed instructions in natural language, a process known as prompt-engineering. Limitations of LLMs include the lack of full control including unexpected responses that may contain toxic language, discrimination, or even false (’made up’) information (22–24). So far, a number of attempts to use LLMs for systematic review support have been made, e.g. to help formulating a structured review question (25), screening (26), producing an R code for conducting a meta-analysis (25) or data extraction (27). First experiences are still clearly flawed, albeit promising.

Assessing the RoB in each study included is a pivotal step of a systematic review. For assessing randomized controlled trials (RCTs), the revised Cochrane Risk of Bias tool (“RoB 2”) (28) is considered the gold standard. The tool is structured into five bias domains (1. bias arising from the randomization process, 2. bias due to deviations from intended interventions, 3. bias due to missing outcome data, 4. bias in measurement of the outcome, 5. bias in selection of the reported result). An overall judgement is made on the basis of assessments of each individual domain, each in the categories of “low risk”, “some concerns” or “high risk” (28, 29). RoB assessment not only requires time and at least two reviewers, but also underlies to a degree of subjectivity even when utilizing standardized tools (30–32). Therefore, the objective and reproducible automation of this systematic review step appears particularly important and valuable. Currently, there are very limited methods to support RoB assessment using ML (5). However, also using ChatGPT alone for RoB assessments seems not recommendable, neither for RCTs (33, 34) nor for non-randomized studies of interventions (35), due to limited agreement in RoB judgements between ChatGPT and humans.

Claude 2, first released by Anthropic in March 2023 (17), appears particularly suitable for conducting RoB assessments, perhaps better than ChatGPT: Characterized by a particularly large context window, substantial volumes of data such as full texts of study reports can be processed in one piece – as stated by Anthropic – with a comparatively low rate of hallucinations, high accuracy and robustness (17, 36, 37), making it a promising candidate for supporting RoB assessment. Most recently, in May 2024, Lai et al. (38) first described assessing RoB in RCTs with both ChatGPT and Claude and found substantial accuracy and consistency, however, restricted to a modified version of the original Cochrane RoB-tool (“RoB 1”) from 2011 (39). This tool has been revised in 2019 (28) in order to address some of its limitations and the use of the former tool is no longer recommended (29). Therefore, we aimed at using the revised RoB 2 tool for our study.

In this proof-of concept study, our aim was to determine how well the LLM Claude 2 assesses the RoB of RCTs using the RoB 2 tool compared to conventional RoB 2 assessments published by human reviewers in Cochrane Reviews.

METHODS

The protocol for this proof-of-concept study has been registered on Open Science Framework (OSF) (https://osf.io/42dnb) on September 11, 2023. We applied a validation study design to evaluate the performance of Claude 2 compared to humans (reference standard).

Sample and Eligibility Criteria

To identify a sample of recent Cochrane Reviews of interventions applying RoB 2, we searched the Cochrane Library in October 2023 using the search string “ROB2” OR “ROB-2” OR “ROB 2.0” OR “revised cochrane risk-of-bias” (all text) with a limit for publication date from January 2019 onwards and a filter for review type “intervention”. We manually checked each Cochrane Review retrieved and excluded Cochrane Reviews exclusively using RoB assessment tools other than RoB 2. A random sample of 100 2-arm parallel group RCTs was drawn (see sample size estimation), choosing at least one RCT per Cochrane Review. We excluded Cluster-RCTs and cross-over RCTs because RoB assessment methods slightly differ for those types of RCTs. Furthermore, we excluded RCTs published in languages other than English and RCTs published earlier than 2013 due to our assumption that Claude 2 can best process English texts and that the reporting quality of scientific articles has improved in recent years. As Cochrane Reviews often include RoB assessments for more than one outcome and comparison, we selected the RoB assessment for the first listed outcome and first comparison. If the first comparison and first listed outcome did not contain a suitable RCT, we switched to the next outcome/comparison, and so forth.

Data collection

For each of the selected RCTs, we manually extracted the following data: bibliographic reference details, the results of the RoB assessment of the Cochrane authors (i.e. the judgement for each RoB 2 domain, the overall assessment as well as all text that was provided to support RoB judgements), study location, condition/disease studied, type of intervention (i.e. pharmacological intervention; surgical intervention; non-pharmacological, non-invasive intervention), type of control intervention (i.e. placebo, treatment as usual/other intervention, no intervention), outcome and comparison named in the Cochrane Review (for which RoB was assessed for the selected RCT), original outcome named in the RCT and references to published study protocols and register entries.

Prompt engineering and generation of Claude RoB assessments

We used Claude 2 (17) to create new RoB assessments for each of the selected RCTs. The testing was performed in February 2024.

Prompt engineering

A prompt is an input, usually in textual form, to which the LLM produces an output (40). Prompt engineering refers to the process of developing the most suitable prompt to successfully accomplish a task (41). If a prompt contains one or more variables that are replaced by media (e.g. text extracted from a PDF file), it is referred to as prompt template (40).

During a pilot phase, we developed and refined various prompt templates using different prompting techniques and tested them with a sample of 30 RCTs from three Cochrane Reviews (42–44). These were then excluded from any further analysis or testing. This preliminary testing resulted in one final main prompt template. Two alternative prompt templates that also showed acceptable results during the pilot phase were used for sensitivity testing. All three prompt templates were uploaded on OSF in advance to conducting the actual testing and can be accessed via https://osf.io/2phyt (prompt number 12 is the final main prompt template).

Contents of the final main prompt template

Our prompt template asked Claude 2 to assess the RoB of the respective RCT, considering each of the five domains of the RoB 2 tool and to provide an overall judgement. It also specified the format of the judgement options (i.e. “low risk”, “some concerns” or “high risk”) and asked Claude to provide justifying text for each judgement, embedded in a machine readable JSON structure.

The prompt template included the text extracted from the PDF article of the RCT (but no possibly existing additional reports on the same study), the compressed study protocol/analysis plan, if available, or (if no published study protocol/analysis plan was available) the study register entry (e.g. record from https://clinicaltrials.gov), if available. We used the ConvertAPI service (https://www.convertapi.com) to extract the full text of the PDF articles. As the RoB 2 tool is applied specifically per outcome, we also specified the individual outcome for which the assessment should be made (including time of measurement, if more than one follow-up time point was available). These data were injected into the prompt template in an automated fashion using custom software (see below).

We suspected that some of the Cochrane Reviews used for our dataset might have been in the training data for Claude. To avoid a simple recall of the results from the training data, we opted for a full instruction prompt template that does not mention the RoB 2 tool by name, but instead provides a detailed instruction on how to perform the assessment. The instructions were taken from the official RoB 2 full guidance document (45). The RoB 2 tool provides the option to choose between assessing the effect of assignment to intervention (“intention-to-treat” effect) and assessing the effect of adhering to intervention (“per-protocol” effect) for the second domain (RoB due to deviations from the intended interventions). As the first option is usually used for efficacy studies, we only provided guidance for this first method to Claude.

During the pilot phase, we learned that it is helpful to generate separate prompts for each of the RoB 2 domains in order to minimize the reasoning complexity. We concatenated all five LLM responses (one response for each RoB 2 domain) and proceeded with the prompt for overall assessment on this basis. Furthermore, we learned during the pilot phase that the RCT protocols (and register entries) need to be compressed with a separate prompt and injected into the final prompt template, as they can be very lengthy, often longer than the manuscript itself. Assembling the single prompts via manual copy-pasting would have been unfeasible and error-prone. Therefore, we developed a program to automate the process (see below).

Program

We used a custom program called “Patchbay” to automate the process of assembling the single prompts, including compression of the RCT protocols, and combined all the necessary components into the final prompts according to the defined templates. This allowed us to efficiently create the number of prompts required for the study while minimizing the risk of errors. The source code and documentation for Patchbay are available at https://github.com/daboe01/LLMPatchbay.

Iterations

When using Claude, users can set the temperature, i.e. the randomness of the answers one receives from Claude (36). Lower temperatures lead to more stable and conservative outputs corresponding to the most likely variants while higher temperatures produce more creative and diverse responses (36). For our study, we set the temperature as low as possible. We then performed three iterations, i.e. we ran the prompt template three times for each RCT. This method has recently been used to quantify the uncertainty of LLM outputs (46, 47). If the judgments of the three iterations matched, we selected one at random for our testing (because the justifying text could still vary to some extent). If the judgments did not match, we randomly selected from the results that were more frequent (e.g. if the prompt resulted in one “low risk” judgement and two “some concerns” judgements in a domain, we randomly selected one of the “some concerns” judgements). In the rare cases where all three iterations differed in their assessment, we also selected one at random. This technique is known as “self consistency” (48).

Data analysis

We quantitatively compared the RoB judgements created by Claude to the judgements of the Cochrane authors (reference standard). For each of the 100 RCTs, judgements of either “low risk”, “some concerns” or “high risk” were available for the five RoB 2 domains as well as the overall assessment. We calculated the performance of Claude using Cohen’s weighted kappa coefficient (κ) for ordinal data (R package ‘psych’) (49–51), a measure of interrater agreement that controls for agreement by chance and can take values between −1.0 to 1.0. We adjusted each Cohen’s κ for clustering in case of more than one RCT per Cochrane Review using the design effect as suggested in the Cochrane Handbook (52, 53). Cohen’s κ was interpreted as poor (<0.00), slight (0.00-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80) and almost perfect (0.81-1.00), as suggested by Landis and Koch (54). Additionally, we calculated the observed percentage of agreement between Claude and the reference standard, sensitivity and specificity as well as the positive and negative predictive value (PPV and NPV) of Claude for i) a high RoB rating (versus “some concerns” or “low risk”) or ii) a low RoB rating (versus “some concerns” or “high risk”) compared to the reference standard. Estimates are given with their 95% confidence intervals (CIs). The R code for calculating the primary results of the manuscript can be accessed at https://osf.io/2phyt.

To identify reasons for non-agreement between Claude and the reference standard, we manually checked justifications provided by Claude and the Cochrane authors with the deviating judgements for the RoB 2 domains 1-5. We reviewed all “two-level-discrepancies” (i.e. “high risk” versus “low risk”, which we regarded as more severe) by comparing given justifications to the content of the original reports and protocols/register entries of the trials. We documented whether we agreed with either the Cochrane authors or Claude or whether we would suggest a “some concerns” judgement instead. Additionally, we reviewed a random sample of 10 discrepancies for each of the five specific RoB 2 domains for the remaining discrepancies “some concerns” versus “low risk” or “some concerns” versus “high risk”. We compared justifications and summarized observed reasons for non-agreement (without comparing them to the original reports of the RCTs, for reasons of feasibility). The overall judgement strongly depends on the judgements for the five specific domains (e.g., to reach an overall low RoB, the study must be judged to be at low RoB for all five domains (45)). Therefore, we checked the 100 overall judgements of Claude for compliance with the algorithm provided in the RoB 2 guidance (45).

Additional analyses

We conducted exploratory sensitivity and subgroup analyses as described below. We did not perform any inductive statistics, i.e. the analyses were descriptive only. They had not been pre-specified in our protocol.

Sensitivity analyses

To explore the impact of the prompt characteristics on the results, we performed sensitivity analyses, i.e. we repeated the testing for the same 100 RCTs, using two alternative prompt templates. The first alternative prompt template (“step-by-step prompt”) was very similar to the final main prompt template but additionally based on the framework of zero-shot chain of thought prompting (40). The other alternative prompt template (“minimal prompt”) was much shorter, included only very little information taken from the RoB 2 guidance, and is therefore possibly prone to bias from dataset contamination.

Few days after our testing, a new version of Claude was launched (37). We therefore decided to perform an additional sensitivity analysis using Claude 3 Opus and the prompt template that had proven to be most promising in the previous testing. This was conducted in March 2024. We did not perform further prompt engineering using the new version of Claude.

Subgroup analyses

We carried out the following subgroup analyses using our final main prompt template:

Individual analyses for the different types of interventions studied in the RCTs (due to the low number of surgical interventions, we only performed analyses for pharmacological versus other - non-pharmacological, non-surgical - interventions).
Individual analyses according to whether a published study protocol or a register entry was available. We differentiated between RCTs without protocol or register entry and RCTs with at least one (protocol or register) entry.
Individual analyses according to whether the three iterations of Claude produced the same results or whether results differed between the three iterations. The rationale for this was that we assumed higher uncertainty and possibly poorer accuracy in the assessments where the iterations differed.

Sample size estimation

We assumed a Cohen’s weighted κ of 0.7 (indicating substantial agreement) with a corresponding 95% CI of 0.55-0.85 for the overall RoB rating between Claude and the reference standard. Furthermore, we anticipated proportions of 0.20, 0.50, 0.30 for frequencies of the rating categories (“low risk”, “some concerns”, “high risk”) and an alpha-level of 0.05 (55). This resulted in a minimum of 88 required RCTs for this study. To safely meet these assumptions, we included a sample of 100 RCTs.

RESULTS

Sample

Our search for Cochrane Reviews resulted in 78 Cochrane Reviews of interventions fulfilling our eligibility criteria. The search and selection process is illustrated in a PRISMA flow chart (see figure 1). The full sample of Cochrane Reviews assessed for eligibility along with reasons for exclusion is part of the data stored at OSF and can be accessed via https://osf.io/2phyt.

Figure 1.

PRISMA flow chart illustrating the search process for Cochrane Reviews of interventions.

Our final sample of 100 RCTs consisted of 56 RCTs drawn from 56 unique Cochrane Reviews and 44 RCTs drawn from a total of 22 Cochrane Reviews (2 per review).

Study characteristics

The RCTs were published between 2013 and 2022. Fifty RCTs studied non-pharmacological, non-surgical interventions, such as psychological interventions or exercise interventions. Pharmacological interventions were studied in 44 RCTs and surgical interventions in 6 RCTs. The most common condition studied was COVID-19 (18 RCTs). For 32 RCTs, we were able to identify a published study protocol, and for 82, a register entry was available. For 16 RCTs, neither a protocol nor a register entry was available.

Full extracted data with reference to the corresponding Cochrane Reviews and including the full results of our testing can be accessed via https://osf.io/2phyt.

RoB assessment with Claude

RoB judgements of Claude 2 for the five domains and the overall judgement are summarized in table 1, along with the human judgements of the Cochrane authors (reference standard). The most frequent judgement of Claude 2 for domain 1 to 5 was “low risk” while it judged the overall domain most frequently with “some concerns”. “High risk” judgements occurred rarely. We had no missing values but Claude occasionally deviated from the prescribed response format (e.g. returned “unable to assess” or “no information”; this occurred three times in the results of the final main prompt template). As we performed three iterations (see methods) per RCT, we finally received at least one valid judgement.

View this table:

Table 1. Risk of bias judgements of Claude 2 and Cochrane authors (number of judgements per RoB 2 domain, n=100 RCTs).

Table 2 shows the overall judgements of Claude 2 tabulated against the overall judgements of the Cochrane authors. Tables for the remaining 5 RoB 2 domains can be found in the supplement (table S1-S5). Figure 2 illustrates the overall RoB judgements of Claude versus Cochrane authors using a Sankey diagram (56, 57).

Figure 2.

Sankey diagram illustrating differing and congruent overall risk of bias judgements of the Cochrane authors and Claude 2. An animated version of this figure can be accessed via https://osf.io/2phyt.

View this table:

Table 2. Overall risk of bias judgements of Claude 2 tabulated against the overall judgements of the Cochrane authors (n=100 RCTs).

The observed percentage of agreement, Cohen’s weighted κ, sensitivity, specificity and predictive values are displayed in table 3. Given the low number of “high risk” judgements of Claude, we only present sensitivity, specificity, PPV and NPV of Claude for a low RoB rating (versus “some concerns” and “high risk”). Values for a high RoB rating (versus “low risk” and “some concerns”) are presented in the supplement (table S6).

View this table:

Table 3. Performance of Claude 2 compared to the Cochrane authors (n=100 RCTs).

The observed agreement between judgements of Claude and judgements of the Cochrane authors ranged from 41% for the overall judgement to 71% for domain 4 (“outcome measurement”). Cohen’s κ was lowest for domain 5 (“selective reporting”; 0.10 [95% CI: −0.10-0.31]) and highest for domain 3 (“missing data”; 0.31 [95% CI: 0.10-0.52]), indicating slight to fair agreement. For the overall judgement, Cohen’s κ was 0.22 (95% CI: 0.06-0.38) which can be interpreted as “fair”. There was strong variation for sensitivity (range 0.50 [95% CI: 0.33-0.67] to 0.90 [95% CI: 0.80-0.96]), specificity (range 0.16 [95% CI: 0.05-0.34] to 0.75 [95% CI: 0.43-0.95]), PPV (range 0.46 [95% CI: 0.35-0.58] to 0.96 [95% CI: 0.89-0.98]) and NPV (range 0.30 [95% CI: 0.21-0.41] to 0.70 [95% CI: 0.62-0.78]). Of note, the width of the confidence intervals (including much lower or higher values) must be considered when interpreting these values.

Reasons for non-agreement

Review of two-level-discrepancies

We identified 18 two-level-discrepancies (i.e. “low risk” versus “high risk” judgements) for the 5 specific RoB 2 domains: three for D1, four for D2, three for D3, five for D4 and three for D5. All but two of these 18 discrepancies comprised a “high risk” judgement of the Cochrane authors and a “low risk judgements of Claude 2. For 12 judgements, we would have opted for a “some concerns” judgement instead of the differing judgements of Claude and the Cochrane authors, and for six judgements, we agreed with the decisions of the Cochrane authors. There was no case in which we agreed with Claude’s judgement. Two examples of two-level-discrepancies between Claude and the Cochrane authors are provided in table 4 with additional comments.

View this table:

Table 4. Examples for two-level-discrepancies between Claude and reference standard, with comments and suggested judgement by the authors of this article.

Review of other discrepancies

Below, we give the main identified reasons for disagreement (“some concerns” versus either “low risk” or “high risk”) between Claude and the Cochrane authors for each domain of the RoB 2 tool.

D1 (“randomization“): One main reason for discrepancies in this domain was that Claude assumed appropriate concealment of allocation while the Cochrane authors criticized lacking (information on) allocation concealment.
D2 (“deviations from interventions“): Differences in dealing with lack of blinding (of participants or carers) were one main reason for discrepant judgements. E.g., Claude judged “some concerns” in some cases where Cochrane authors regarded it as unlikely that deviations from the intended interventions had occurred due to the open-label design and judged “low risk”.
D3 (“missing data“): Reasons for discrepancies comprised different interpretations of the potential influence of missing data (i.e. Claude regarded the amount of missing data as less or more concerning, compared to the Cochrane authors), but Claude also seemed to have overseen data in some cases (e.g. reported different percentage of missing data, compared to the Cochrane authors).
D4 (“outcome measurement“): For this domain, justifications especially deviated regarding information on assessor blinding (e.g. Claude assumed assessors to be blinded while the Cochrane authors stated they were aware of the allocated intervention) and the impact of non-blinded assessors on the validity of outcome assessment.
D5 (“selective reporting“): One main reason for discrepancies in this domain was that Claude either ignored the absence of a pre-specified protocol/analysis plan or did not consider an available protocol.
Overall judgement: Of the 100 available overall judgements of Claude, only 2 clearly deviated from the algorithm provided in the RoB 2 guidance (45), i.e. Claude judged the overall RoB as “low”, but assessed single domains as “some concerns”.

Results of the additional analyses

The observed percentage of agreement and Cohen’s κ values for the sensitivity and subgroup analyses are given in the supplement (table S7-S11). These analyses were descriptive only.

Sensitivity analyses

Below, we summarize the results of the sensitivity analyses using two alternative prompt templates and using the latest version of Claude (Claude 3). Cohen’s κ values obtained in the sensitivity analyses indicate slight to fair agreement between reference standard and Claude for the RoB judgements, with two exceptions for RoB 2 domains that had values >0.40, indicating moderate agreement (i.e. domain 4 “outcome measurement” using Claude 2 with the “step-by-step” prompt template and domain 1 “randomization” using Claude 3 with the “step-by-step” prompt template).

“Step-by-step” prompt template: The observed agreement values were comparable to the values obtained using the final main prompt template (42% agreement for the overall judgement). Cohen’s κ had a slightly wider range (0.08 [95% CI: −0.13-0.28] to 0.43 [95% CI: 0.20-0.66], highest κ for domain 4 “outcome measurement”) and was 0.28 (95% CI: 0.11-0.46) for the overall judgement.
“Minimal” prompt template: This prompt resulted in slightly higher observed agreement for all domains (47% for the overall judgement) and a slightly larger range for Cohen’s κ (−0.04 [95% CI: 0.12-0.04] to 0.40 [95% CI: 0.19-0.61], highest κ for domain 1 “randomization”) with a lower Cohen’s κ for the overall judgement (0.19 [95% CI: 0.00-0.38]), when compared to the values obtained using the final main prompt template.
“Step-by-step” prompt template with Claude 3: Generally, the observed agreement and Cohen’s κ were comparable to the other runs using Claude 2, with some variation, but with no apparent pattern. We obtained 45% observed agreement and a Cohen’s κ of 0.19 (95% CI: 0.02-0.37) for the overall judgement. Cohen’s κ values had a larger range compared to the runs with Claude 2 (0.08 [95% CI: −0.07-0.23] to 0.54 [95% CI: 0.36-0.72], highest κ for domain 1 “randomization”).

Subgroup analyses

Below, we summarize the results of the subgroup analyses for RCTs on pharmacological versus other (non-pharmacological, non-surgical) interventions, RCTs without protocol/register entry versus RCTs with at least one of protocol/register entry and RCTs for which the three iterations of Claude produced the same results versus differing results.

Pharmacological (n=44) versus other (n=50) interventions: Cohen’s κ values for all domains were slightly lower for RCTs on pharmacological interventions (range −0.11 [95% CI: −0.36-0.15] to 0.27 [95% CI: 0.01-0.53], highest κ for domain 3 “missing data”) compared to RCTs on other interventions (range 0.11 [95% CI: −0.12-0.35] to 0.36 [95% CI: 0.05-0.66], highest κ for domain 3 “missing data”). The observed agreement for the overall judgement was 38.6% for pharmacological interventions and 42% for other interventions.
No protocol/register entry (n=16) versus at least one of protocol/register entry (n=83): All in all, the observed agreement and Cohen’s κ values were comparable for the two groups, with some variation but no striking differences, except for domain 3 (“missing data”). Cohen’s κ for this domain was 0.41 (95% CI: 0.01-0.82) for the group of RCTs without protocol/register entry, compared to 0.28 (95% CI: 0.06-0.51) for the group of RCTs with at least one of protocol or register entry.
For the three iterations of Claude: Same results (n=68) versus differing results (n=32) of the iterations: The range of observed agreement and Cohen’s κ values was comparable for the two groups. One notable difference was that Cohen’s κ for the group of RCTs with differing results for the three iterations was highest (0.32; 95% CI: 0.11-0.53) for the overall judgement, which was not the case in any other analysis.

DISCUSSION

In this work, we compared RoB assessments of RCTs created by the LLM Claude 2 with assessments created by human reviewers and published in Cochrane Reviews. To our knowledge, this is the first study that uses Claude to assess RCTs applying the RoB 2 tool. We found only slight to fair agreement between Claude and humans for all RoB domains when using our final main prompt template. Only in the sensitivity analyses, using two alternative prompting approaches, we obtained moderate agreement for two domains, i.e. domain 4 “outcome measurement” and domain 1 “randomization”. Based on these results, we infer that Claude should currently not be used as a stand-alone tool to conduct RoB assessment of included studies within the systematic review process.

Additional sensitivity and subgroup analyses did not indicate that our results differed substantially depending on specific characteristics. Thus, it did, e.g., not seem to make a great difference whether a protocol or register entry was available or whether the trial was on pharmacological or other interventions. Using alternative prompt templates or the novel version of Claude also did not substantially change our results.

Reasons for disagreement between Claude and the Cochrane authors include, e.g., that possible problematic features of the trials (such as lack of blinding of participants, carers or assessors or a certain proportion of missing data) were assessed differently. In some cases, Claude also provided wrong information in the supporting text or obviously missed details. Among the 18 two-level-discrepancies (“low risk” versus “high risks”), which we verified by consulting the original articles, there were 12 cases for which we would have opted for a “some concerns” judgement instead of the judgements made by Claude and the Cochrane authors. This highlights that judgements made using the RoB 2 tool underlie a certain degree of subjectivity.

Indeed, also agreement of RoB 2 judgements between humans is far from perfect (62, 63). Additionally, adherence of systematic reviewers to RoB 2 guidance is often poor (64). In a study by Minozzi and colleagues (62), four raters independently used the RoB 2 tool to assess RoB for 70 outcomes of 70 RCTs on various unrelated topics and obtained only slight agreement (Fleiss’ κ of 0.16) for the overall assessment. This is even lower than the agreement between Claude and the Cochrane authors we obtained for the overall assessment in our study. In a follow-up study by Minozzi and colleagues (63), four raters independently applied RoB 2 for 80 results related to seven outcomes reported in 16 RCTs on a similar topic. During a pilot run of the tool (“calibration exercise”), they developed an implementation document specific for this topic in advance. They were then able to increase their interrater agreement from no agreement (Fleiss’ κ of −0.15) during the calibration exercise to finally moderate agreement (Fleiss’ κ of 0.42) for the overall assessment. This implies that, in addition to using the RoB 2 guidance, further consultations and agreements, related to the specific topic of interest for a systematic review, might be necessary to increase reliability of RoB 2 assessments. Thus, comparing RoB 2 assessments by Claude to this “imperfect” and variable reference standard obviously is problematic.

Just recently, other authors have used LLM to conduct RoB assessment, with mixed results. Pitre et al. (34) found comparably low agreement between ChatGPT-4 and Cochrane authors when assessing RoB of 157 RCTs from 34 Cochrane Reviews using RoB 2 (Cohen’s κ of 0.16 for the overall assessment). Testing the use of ChatGPT (GPT-4) for RoB assessment of non-randomized studies of intervention using ROBINS-I (65), Hasan et al. (35) also obtained only slight agreement (Cohen’s κ of 0.13 for the overall assessment). In contrast, Lai et al. (38) reported promising results when using ChatGPT and Claude (versions not specified) to assess RoB of 30 RCTs from three systematic reviews using a modified version of the original Cochrane RoB tool (“RoB 1”) (39): In their study, Cohen’s κ ranged from 0.54 to 0.96 for ChatGPT and from 0.76 to 0.96 for Claude for the different domains (there is no overall judgement included in RoB 1). Although there were some important differences in methodology, such as using another tool that is obviously easier to apply, using only 30 RCTs on only three different topics and calculating agreement from only 2 possible judgements (i.e. “low risk” or “high risk”), their results still seem surprising. Therefore, there is a need to further explore LLM support for RoB assessment of research studies. Future studies should probably also focus on LLM support going beyond the production of stand-alone RoB judgements, for example automatic extraction of the relevant content of an RCT that needs to be reviewed to assess its RoB.

Strengths and limitations

We used a large sample of RCTs drawn from the largest possible number of Cochrane Reviews on various topics for our study. Additionally, we used a thoroughly elaborated prompting approach and also explored two alternative prompt templates. Nevertheless, our work has a number of limitations. First, reproducibility of our testing is limited due to the variations of LLMs in producing output. As development of LLMs is progressing, it is likely that the reproducibility of our results decreases further in the future. Secondly, as pointed out above, we had to compare RoB 2 judgements of Claude to an “imperfect” human reference standard, for which we know that it is variable and interrater agreement is poor. However, as the “true” RoB 2 assessments are unknown, using assessments from different Cochrane authors was, perhaps, the most appropriate method to obtain a reference standard. RoB 2 is currently the recommended tool to assess RoB in RCTs, making its use indispensable. Lastly, Claude had only access to the main article and the compressed protocol or (if no protocol was available) register entry. We did not provide Claude with any supplementary material or further articles on the same study, while the Cochrane authors presumably consulted as many sources as were necessary and available. Compressing the protocols and register entries using an extra prompt was necessary due to their often extensive length.

Conclusion

Based on our results, the use of Claude to conduct RoB assessment of RCTs using RoB 2 is currently not recommended. Further investigation is needed to explore LLM support for RoB assessment of research studies. Other models of use than providing stand-alone RoB judgements could additionally be investigated. In conclusion, RoB assessment of RCTs included in high-quality systematic reviews currently still requires at least two independent human reviewers.

FUNDING INFORMATION

This work was supported by the Research Commission at the Faculty of Medicine, University of Freiburg, Freiburg, Germany (grant no. EIS2244/23).

CONFLICT OF INTEREST STATEMENT

None of the authors has any actual or potential conflicts to declare with respect to the topic of this study.

ETHICS STATEMENT

We used secondary data published in Cochrane Reviews and RCT reports. Ethical approval was therefore not necessary.

ACKNOWLEDGEMENTS

We would like to thank Kathrin Grummich for her support in developing the search strategy, Philipp Kapp for his advice on the use of RoB 2 and Georg Melber for his assistance in the visualization of data.

REFERENCES

1.↵
Nussbaumer-Streit B, Sommer I, Hamel C, Devane D, Noel-Storr A, Puljak L, et al. Rapid reviews methods series: guidance on team considerations, study selection, data extraction and risk of bias assessment. BMJ Evidence-Based Medicine. 2023.
Google Scholar
2.↵
Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017;7(2):e012545.
OpenUrl Abstract/FREE Full Text Google Scholar
3.↵
Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. PLoS Med. 2021;18(3):e1003583.
OpenUrl CrossRef PubMed Google Scholar
4.↵
Aromataris E LC, Porritt K, Pilla B, Jordan Z, editors. JBI Manual for Evidence Synthesis: JBI 2024.
Google Scholar
5.↵
Cierco Jimenez R, Lee T, Rosillo N, Cordova R, Cree IA, Gonzalez A, Indave Ruiz BI. Machine learning computational tools to assist the performance of systematic reviews: A mapping review. BMC Medical Research Methodology. 2022;22(1):322.
OpenUrl Google Scholar
6.↵
Blaizot A, Veettil SK, Saidoung P, Moreno-Garcia CF, Wiratunga N, Aceves-Martins M, et al. Using artificial intelligence methods for systematic review in health sciences: A systematic review. Research Synthesis Methods. 2022;13(3):353–62.
OpenUrl Google Scholar
7.↵
Goodfellow I, Bengio Y, Courville A. Deep Learning: MIT Press; 2016.
Google Scholar
8.↵
Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015;349(6245):255–60.
OpenUrl Abstract/FREE Full Text Google Scholar
9.↵
Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan—a web and mobile app for systematic reviews. Systematic Reviews. 2016;5(1):210.
OpenUrl Google Scholar
10.↵
Veritas Health Innovation. Covidence systematic review software: Melbourne, Australia; [Available from: www.covidence.org].
Google Scholar
11.↵
Thomas J, Graziosi, S., Brunton, J., Ghouze, Z., O’Driscoll, P., & Bond, M. & Koryakina, A. EPPI-Reviewer: advanced software for systematic reviews, maps and evidence synthesis: EPPI Centre, UCL Social Research Institute, University College London; 2023 [Available from: https://eppi.ioe.ac.uk/cms/Default.aspx?tabid=2914].
Google Scholar
12.↵
Borissov N, Haas Q, Minder B, Kopp-Heim D, von Gernler M, Janka H, et al. Reducing systematic review burden using Deduklick: a novel, automated, reliable, and explainable deduplication algorithm to foster medical research. Systematic Reviews. 2022;11(1):172.
OpenUrl Google Scholar
13.↵
Marshall IJ, Kuiper J, Wallace BC. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association. 2015;23(1):193–201.
OpenUrl Google Scholar
14.↵
Open AI. Introducing ChatGPT 2022 [Available from: https://openai.com/blog/chatgpt].
Google Scholar
15.↵
Google. Introducing PaLM 2 2023 [Available from: https://blog.google/technology/ai/google-palm-2-ai-large-language-model]].
Google Scholar
16.↵
Meta AI. Introducing LLaMA: A foundational, 65-billion-parameter large language model 2023 [Available from: https://ai.meta.com/blog/large-language-model-llama-meta-ai].
Google Scholar
17.↵
Anthropic. Introducing Claude 2023 [Available from: https://www.anthropic.com/index/introducing-claude].
Google Scholar
18.↵
Lund BD, Wang T, Mannuru NR, Nie B, Shimray S, Wang Z. ChatGPT and a new academic reality: Artificial Intelligence-written research papers and the ethics of the large language models in scholarly publishing. Journal of the Association for Information Science and Technology. 2023;74(5):570–81.
OpenUrl Google Scholar
19.
Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80.
OpenUrl Google Scholar
20.
Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large Language Models in Medicine: The Potentials and Pitfalls : A Narrative Review. Ann Intern Med. 2024;177(2):210–20.
OpenUrl CrossRef Google Scholar
21.↵
Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. New England Journal of Medicine. 2023;388(13):1233–9.
OpenUrl CrossRef PubMed Google Scholar
22.↵
Naveed H, Khan AU, Qiu S, Saqib M, Anwar S, Usman M, et al. A Comprehensive Overview of Large Language Models. arXiv. 2023;2307.06435.
Google Scholar
23.
Gehman S, Gururangan S, Sap M, Yejin C, Smith N. Real Toxicity Prompts: Evaluating Neural Toxic Degeneration in Language Models 2020. 3356–69 p.
Google Scholar
24.↵
Weidinger L, Mellor JFJ, Rauh M, Griffin C, Uesato J, Huang P-S, et al. Ethical and social risks of harm from Language Models. arXiv. 2021;2112.04359.
Google Scholar
25.↵
Qureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation? Systematic Reviews. 2023;12(1):72.
OpenUrl Google Scholar
26.↵
Issaiy M, Ghanaati H, Kolahi S, Shakiba M, Jalali AH, Zarei D, et al. Methodological insights into ChatGPT’s screening performance in systematic reviews. BMC Medical Research Methodology. 2024;24(1):78.
OpenUrl Google Scholar
27.↵
Gartlehner G, Kahwati L, Hilscher R, Thomas I, Kugley S, Crotty K, et al. Data extraction for evidence synthesis using a large language model: A proof-of-concept study. Research Synthesis Methods. 2024;15(4):576–89.
OpenUrl Google Scholar
28.↵
Sterne JAC, Savović J, Page MJ, Elbers RG, Blencowe NS, Boutron I, et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ. 2019;366:l4898.
OpenUrl FREE Full Text Google Scholar
29.↵
1. Higgins JPTTJ,
2. Chandler J,
3. Cumpston M,
4. Li T,
5. Page MJ,
6. Welch VA
Higgins JPT SJ, Page MJ, Elbers RG, Sterne JAC. Chapter 8: Assessing risk of bias in a randomized trial. In: Higgins JPTTJ, Chandler J, Cumpston M, Li T, Page MJ, Welch VA, editor. Cochrane Handbook for Systematic Reviews of Interventions version 62 (updated February 2021). Available from www.training.cochrane.org/handbook: Cochrane; 2021.
Google Scholar
30.↵
Savović J, Weeks L, Sterne JA, Turner L, Altman DG, Moher D, Higgins JP. Evaluation of the Cochrane Collaboration’s tool for assessing the risk of bias in randomized trials: focus groups, online survey, proposed recommendations and their implementation. Syst Rev. 2014;3:37.
Google Scholar
31.
Armijo-Olivo S, Ospina M, da Costa BR, Egger M, Saltaji H, Fuentes J, et al. Poor reliability between Cochrane reviewers and blinded external reviewers when applying the Cochrane risk of bias tool in physical therapy trials. PLoS One. 2014;9(5):e96920.
OpenUrl Google Scholar
32.↵
Hartling L, Hamm MP, Milne A, Vandermeer B, Santaguida PL, Ansari M, et al. Testing the risk of bias tool showed low reliability between individual reviewers and across consensus assessments of reviewer pairs. J Clin Epidemiol. 2013;66(9):973–81.
OpenUrl CrossRef PubMed Google Scholar
33.↵
Barsby J, Hume S, Lemmey HAL, Cutteridge J, Lee R, Bera KD. Pilot study on large language models for risk-of-bias assessments in systematic reviews: A(I) new type of bias? BMJ Evidence-Based Medicine. 2024: bmjebm-2024-112990.
Google Scholar
34.↵
Pitre T, Jassal T, Talukdar J, Shahab M, Ling M, Zeraatkar D. ChatGPT for assessing risk of bias of randomized trials using the RoB 2.0 tool: A methods study. medRxiv. 2024:2023.11.19.23298727.
Google Scholar
35.↵
Hasan B, Saadi S, Rajjoub NS, Hegazi M, Al-Kordi M, Fleti F, et al. Integrating large language models in systematic reviews: a framework and case study using ROBINS-I for risk of bias assessment. BMJ Evidence-Based Medicine. 2024:bmjebm-2023-112597.
Google Scholar
36.↵
Anthropic. User guides - Glossary 2024 [Available from: https://docs.anthropic.com/claude/docs/glossary].
Google Scholar
37.↵
Anthropic. Introducing the next generation of Claude 2024 [Available from: https://www.anthropic.com/news/claude-3-family].
Google Scholar
38.↵
Lai H, Ge L, Sun M, Pan B, Huang J, Hou L, et al. Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models. JAMA Network Open. 2024;7(5):e2412687-e.
OpenUrl Google Scholar
39.↵
Higgins JPT, Altman DG, Gøtzsche PC, Jüni P, Moher D, Oxman AD, et al. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ. 2011;343:d5928.
OpenUrl FREE Full Text Google Scholar
40.↵
Schulhoff S, Ilie M, Balepur N, Kahadze K, Liu A, Si C, et al. The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv. 2024;2406.06608
Google Scholar
41.↵
Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput Surv. 2023;55(9):Article 195.
Google Scholar
42.↵
Clark B, Whitall J, Kwakkel G, Mehrholz J, Ewings S, Burridge J. The effect of time spent in rehabilitation on activity limitation and impairment after stroke. Cochrane Database of Systematic Reviews. 2021(10).
Google Scholar
43.
Iannizzi C, Chai KL, Piechotta V, Valk SJ, Kimber C, Monsef I, et al. Convalescent plasma for people with COVID-19: a living systematic review. Cochrane Database of Systematic Reviews. 2023(5).
Google Scholar
44.↵
Willis MA, Toews I, Soltau SLV, Kalff JC, Meerpohl JJ, Vilz TO. Preoperative combined mechanical and oral antibiotic bowel preparation for preventing complications in elective colorectal surgery. Cochrane Database of Systematic Reviews. 2023(2).
Google Scholar
45.↵
Higgins JP, Savović J, Page MJ, Sterne JA. Revised Cochrane risk-of-bias tool for randomized trials (RoB 2) - Full guidance document 2019 [Available from: https://www.riskofbias.info/welcome/rob-2-0-tool/current-version-of-rob-2].
Google Scholar
46.↵
Manakul P, Liusie A, Gales MJF. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. medRxiv. 2023:2303.08896.
Google Scholar
47.↵
Fadeeva E, Vashurin R, Tsvigun A, Vazhentsev A, Petrakov S, Fedyanin K, et al. LM-Polygraph: Uncertainty Estimation for Language Models. medRxiv. 2023:2311.07383.
Google Scholar
48.↵
Sahoo P, Singh AK, Saha S, Jain V, Mondal S, Chadha A. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv. 2024;2402.07927.
Google Scholar
49.↵
Armijo-Olivo S, Craig R, Campbell S. Comparing machine and human reviewers to evaluate the risk of bias in randomized controlled trials. Research Synthesis Methods. 2020;11(3):484–93.
OpenUrl Google Scholar
50.
Hirt J, Meichlinger J, Schumacher P, Mueller G. Agreement in Risk of Bias Assessment Between RobotReviewer and Human Reviewers: An Evaluation Study on Randomised Controlled Trials in Nursing-Related Cochrane Reviews. Journal of Nursing Scholarship. 2021;53(2):246–54.
OpenUrl Google Scholar
51.↵
Revelle W. psych: Procedures for Psychological, Psychometric, and Personality Research, R package version 2.3.12 Northwestern University, Evanston, Illinois 2023 [Available from: https://cran.r-project.org/web/packages/psych/index.html].
Google Scholar
52.↵
1. Higgins JPT TJ,
2. Chandler J,
3. Cumpston M,
4. Li T,
5. Page MJ,
6. Welch VA
Higgins JPT ES, Li T Chapter 23: Including variants on randomized trials. In: Higgins JPT TJ, Chandler J, Cumpston M, Li T, Page MJ, Welch VA editor. Cochrane Handbook for Systematic Reviews of Interventions version 64 (updated August 2023). Available from www.training.cochrane.org/handbook: Cochrane; 2023.
Google Scholar
53.↵
Rao JNK, Scott AJ. A Simple Method for the Analysis of Clustered Binary Data. Biometrics. 1992;48(2):577–85.
OpenUrl CrossRef PubMed Web of Science Google Scholar
54.↵
Landis JR, Koch GG. The Measurement of Observer Agreement for Categorical Data. Biometrics. 1977;33(1):159–74.
OpenUrl CrossRef PubMed Web of Science Google Scholar
55.↵
Rotondi MA. kappaSize: Sample Size Estimation Functions for Studies of Interobserver Agreement, R package version 1.2 2018 [Available from: https://CRAN.R-project.org/package=kappaSize].
Google Scholar
56.↵
Allaire J, Gandrud C, Russell K, Yetman C. networkD3: D3 JavaScript Network Graphs from R. R package version 0.4 2017 [Available from: https://CRAN.R-project.org/package=networkD3].
Google Scholar
57.↵
Vaidyanathan R, Xie Y, Allaire J, Cheng J, Sievert C, Russell K. htmlwidgets: HTML Widgets for R. R package version 1.6.4 2023 [Available from: https://CRAN.R-project.org/package=htmlwidgets].
Google Scholar
58.
Richardson R, Dale HE, Robertson L, Meader N, Wellby G, McMillan D, Churchill R. Mental Health First Aid as a tool for improving mental health and well-being. Cochrane Database of Systematic Reviews. 2023(8).
Google Scholar
59.
Hung MSY, Chow MCM, Chien WT, Wong PYK. Effectiveness of the Mental Health First Aid programme for general nursing students in Hong Kong: A randomised controlled trial. Collegian. 2021;28(1):106–13.
OpenUrl Google Scholar
60.
Schrijver J, Lenferink A, Brusse-Keizer M, Zwerink M, van der Valk P, van der Palen J, Effing TW. Self-management interventions for people with chronic obstructive pulmonary disease. Cochrane Database of Systematic Reviews. 2022(1).
Google Scholar
61.
Jolly K, Sidhu MS, Hewitt CA, Coventry PA, Daley A, Jordan R, et al. Self management of patients with mild COPD in primary care: randomised controlled trial. BMJ. 2018;361:k2241.
OpenUrl Abstract/FREE Full Text Google Scholar
62.↵
Minozzi S, Cinquini M, Gianola S, Gonzalez-Lorenzo M, Banzi R. The revised Cochrane risk of bias tool for randomized trials (RoB 2) showed low interrater reliability and challenges in its application. Journal of Clinical Epidemiology. 2020;126:37–44.
OpenUrl CrossRef PubMed Google Scholar
63.↵
Minozzi S, Dwan K, Borrelli F, Filippini G. Reliability of the revised Cochrane risk-of-bias tool for randomised trials (RoB2) improved with the use of implementation instruction. Journal of Clinical Epidemiology. 2022;141:99–105.
OpenUrl CrossRef PubMed Google Scholar
64.↵
Minozzi S, Gonzalez-Lorenzo M, Cinquini M, Berardinelli D, Cagnazzo C, Ciardullo S, et al. Adherence of systematic reviews to Cochrane RoB2 guidance was frequently poor: a meta epidemiological study. Journal of Clinical Epidemiology. 2022;152:47–55.
OpenUrl CrossRef PubMed Google Scholar
65.↵
Sterne JA, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ. 2016;355:i4919.
OpenUrl FREE Full Text Google Scholar

Posted July 16, 2024.

Download PDF

Author Declarations

Supplementary Material

Data/Code

Citation Tools

Get QR code

Tweet Widget

Subject Area

Health Informatics

Reviews and Context

Comment

TRIP Peer Reviews

Community Reviews

Automated Services

Blogs/Media

Author Videos

Subject Areas

All Articles

Addiction Medicine (419)
Allergy and Immunology (741)
Anesthesia (217)
Cardiovascular Medicine (3192)
Dentistry and Oral Medicine (355)
Dermatology (270)
Emergency Medicine (472)
Endocrinology (including Diabetes Mellitus and Metabolic Disease) (1136)
Epidemiology (13179)
Forensic Medicine (19)
Gastroenterology (883)
Genetic and Genomic Medicine (5009)
Geriatric Medicine (467)
Health Economics (767)
Health Informatics (3153)
Health Policy (1118)
Health Systems and Quality Improvement (1160)
Hematology (418)
HIV/AIDS (992)
Infectious Diseases (except HIV/AIDS) (14480)
Intensive Care and Critical Care Medicine (900)
Medical Education (466)
Medical Ethics (122)
Nephrology (512)
Neurology (4761)
Nursing (253)
Nutrition (706)
Obstetrics and Gynecology (864)
Occupational and Environmental Health (775)
Oncology (2449)
Ophthalmology (696)
Orthopedics (273)
Otolaryngology (335)
Pain Medicine (317)
Palliative Medicine (89)
Pathology (525)
Pediatrics (1270)
Pharmacology and Therapeutics (538)
Primary Care Research (541)
Psychiatry and Clinical Psychology (4088)
Public and Global Health (7324)
Radiology and Imaging (1643)
Rehabilitation Medicine and Physical Therapy (978)
Respiratory Medicine (959)
Rheumatology (469)
Sexual and Reproductive Health (486)
Sports Medicine (413)
Surgery (530)
Toxicology (68)
Transplantation (227)
Urology (197)

Comments

medRxiv aims to provide a venue for anyone to comment on a medRxiv preprint. Comments are moderated for offensive or irrelevant content (this can take ~24 h). Please avoid duplicate submissions and read our Comment Policy before commenting. The content of a comment is not endorsed by medRxiv.

medRxiv aims to inform readers about online discussion of this preprint occurring elsewhere. The content at the links below is not endorsed by either medRxiv or the preprint's authors.

Community reviews for this article:

There are no community reviews for this paper.

Automated Evaluations

Certain services provide automated analysis of preprints. Analyses invited by the authors are displayed at the top of this tab. Those done independently of authors are shown underneath . None of these analyses is endorsed by medRxiv.

Automated Evaluations:

There are no automated evaluations for this paper.

[1] 1.↵
Nussbaumer-Streit B, Sommer I, Hamel C, Devane D, Noel-Storr A, Puljak L, et al. Rapid reviews methods series: guidance on team considerations, study selection, data extraction and risk of bias assessment. BMJ Evidence-Based Medicine. 2023.
Google Scholar

[2] 2.↵
Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017;7(2):e012545.
OpenUrl Abstract/FREE Full Text Google Scholar

[3] 3.↵
Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. PLoS Med. 2021;18(3):e1003583.
OpenUrl CrossRef PubMed Google Scholar

[4] 4.↵
Aromataris E LC, Porritt K, Pilla B, Jordan Z, editors. JBI Manual for Evidence Synthesis: JBI 2024.
Google Scholar

[5] 5.↵
Cierco Jimenez R, Lee T, Rosillo N, Cordova R, Cree IA, Gonzalez A, Indave Ruiz BI. Machine learning computational tools to assist the performance of systematic reviews: A mapping review. BMC Medical Research Methodology. 2022;22(1):322.
OpenUrl Google Scholar

[6] 6.↵
Blaizot A, Veettil SK, Saidoung P, Moreno-Garcia CF, Wiratunga N, Aceves-Martins M, et al. Using artificial intelligence methods for systematic review in health sciences: A systematic review. Research Synthesis Methods. 2022;13(3):353–62.
OpenUrl Google Scholar

[7] 7.↵
Goodfellow I, Bengio Y, Courville A. Deep Learning: MIT Press; 2016.
Google Scholar

[8] 8.↵
Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015;349(6245):255–60.
OpenUrl Abstract/FREE Full Text Google Scholar

[9] 9.↵
Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan—a web and mobile app for systematic reviews. Systematic Reviews. 2016;5(1):210.
OpenUrl Google Scholar

[10] 10.↵
Veritas Health Innovation. Covidence systematic review software: Melbourne, Australia; [Available from: www.covidence.org].
Google Scholar

[11] 11.↵
Thomas J, Graziosi, S., Brunton, J., Ghouze, Z., O’Driscoll, P., & Bond, M. & Koryakina, A. EPPI-Reviewer: advanced software for systematic reviews, maps and evidence synthesis: EPPI Centre, UCL Social Research Institute, University College London; 2023 [Available from: https://eppi.ioe.ac.uk/cms/Default.aspx?tabid=2914].
Google Scholar

[12] 12.↵
Borissov N, Haas Q, Minder B, Kopp-Heim D, von Gernler M, Janka H, et al. Reducing systematic review burden using Deduklick: a novel, automated, reliable, and explainable deduplication algorithm to foster medical research. Systematic Reviews. 2022;11(1):172.
OpenUrl Google Scholar

[13] 13.↵
Marshall IJ, Kuiper J, Wallace BC. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association. 2015;23(1):193–201.
OpenUrl Google Scholar

[14] 14.↵
Open AI. Introducing ChatGPT 2022 [Available from: https://openai.com/blog/chatgpt].
Google Scholar

[15] 15.↵
Google. Introducing PaLM 2 2023 [Available from: https://blog.google/technology/ai/google-palm-2-ai-large-language-model]].
Google Scholar

[16] 16.↵
Meta AI. Introducing LLaMA: A foundational, 65-billion-parameter large language model 2023 [Available from: https://ai.meta.com/blog/large-language-model-llama-meta-ai].
Google Scholar

[17] 17.↵
Anthropic. Introducing Claude 2023 [Available from: https://www.anthropic.com/index/introducing-claude].
Google Scholar

[18] 18.↵
Lund BD, Wang T, Mannuru NR, Nie B, Shimray S, Wang Z. ChatGPT and a new academic reality: Artificial Intelligence-written research papers and the ethics of the large language models in scholarly publishing. Journal of the Association for Information Science and Technology. 2023;74(5):570–81.
OpenUrl Google Scholar

[19] 19.
Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80.
OpenUrl Google Scholar

[20] 20.
Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large Language Models in Medicine: The Potentials and Pitfalls : A Narrative Review. Ann Intern Med. 2024;177(2):210–20.
OpenUrl CrossRef Google Scholar

[21] 21.↵
Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. New England Journal of Medicine. 2023;388(13):1233–9.
OpenUrl CrossRef PubMed Google Scholar

[22] 22.↵
Naveed H, Khan AU, Qiu S, Saqib M, Anwar S, Usman M, et al. A Comprehensive Overview of Large Language Models. arXiv. 2023;2307.06435.
Google Scholar

[23] 23.
Gehman S, Gururangan S, Sap M, Yejin C, Smith N. Real Toxicity Prompts: Evaluating Neural Toxic Degeneration in Language Models 2020. 3356–69 p.
Google Scholar

[24] 24.↵
Weidinger L, Mellor JFJ, Rauh M, Griffin C, Uesato J, Huang P-S, et al. Ethical and social risks of harm from Language Models. arXiv. 2021;2112.04359.
Google Scholar

[25] 25.↵
Qureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation? Systematic Reviews. 2023;12(1):72.
OpenUrl Google Scholar

[26] 26.↵
Issaiy M, Ghanaati H, Kolahi S, Shakiba M, Jalali AH, Zarei D, et al. Methodological insights into ChatGPT’s screening performance in systematic reviews. BMC Medical Research Methodology. 2024;24(1):78.
OpenUrl Google Scholar

[27] 27.↵
Gartlehner G, Kahwati L, Hilscher R, Thomas I, Kugley S, Crotty K, et al. Data extraction for evidence synthesis using a large language model: A proof-of-concept study. Research Synthesis Methods. 2024;15(4):576–89.
OpenUrl Google Scholar

[28] 28.↵
Sterne JAC, Savović J, Page MJ, Elbers RG, Blencowe NS, Boutron I, et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ. 2019;366:l4898.
OpenUrl FREE Full Text Google Scholar

[29] 29.↵
Higgins JPTTJ,
Chandler J,
Cumpston M,
Li T,
Page MJ,
Welch VA
Higgins JPT SJ, Page MJ, Elbers RG, Sterne JAC. Chapter 8: Assessing risk of bias in a randomized trial. In: Higgins JPTTJ, Chandler J, Cumpston M, Li T, Page MJ, Welch VA, editor. Cochrane Handbook for Systematic Reviews of Interventions version 62 (updated February 2021). Available from www.training.cochrane.org/handbook: Cochrane; 2021.
Google Scholar

[30] Higgins JPTTJ,

[31] Chandler J,

[32] Cumpston M,

[33] Li T,

[34] Page MJ,

[35] Welch VA

[36] 30.↵
Savović J, Weeks L, Sterne JA, Turner L, Altman DG, Moher D, Higgins JP. Evaluation of the Cochrane Collaboration’s tool for assessing the risk of bias in randomized trials: focus groups, online survey, proposed recommendations and their implementation. Syst Rev. 2014;3:37.
Google Scholar

[37] 31.
Armijo-Olivo S, Ospina M, da Costa BR, Egger M, Saltaji H, Fuentes J, et al. Poor reliability between Cochrane reviewers and blinded external reviewers when applying the Cochrane risk of bias tool in physical therapy trials. PLoS One. 2014;9(5):e96920.
OpenUrl Google Scholar

[38] 32.↵
Hartling L, Hamm MP, Milne A, Vandermeer B, Santaguida PL, Ansari M, et al. Testing the risk of bias tool showed low reliability between individual reviewers and across consensus assessments of reviewer pairs. J Clin Epidemiol. 2013;66(9):973–81.
OpenUrl CrossRef PubMed Google Scholar

[39] 33.↵
Barsby J, Hume S, Lemmey HAL, Cutteridge J, Lee R, Bera KD. Pilot study on large language models for risk-of-bias assessments in systematic reviews: A(I) new type of bias? BMJ Evidence-Based Medicine. 2024: bmjebm-2024-112990.
Google Scholar

[40] 34.↵
Pitre T, Jassal T, Talukdar J, Shahab M, Ling M, Zeraatkar D. ChatGPT for assessing risk of bias of randomized trials using the RoB 2.0 tool: A methods study. medRxiv. 2024:2023.11.19.23298727.
Google Scholar

[41] 35.↵
Hasan B, Saadi S, Rajjoub NS, Hegazi M, Al-Kordi M, Fleti F, et al. Integrating large language models in systematic reviews: a framework and case study using ROBINS-I for risk of bias assessment. BMJ Evidence-Based Medicine. 2024:bmjebm-2023-112597.
Google Scholar

[42] 36.↵
Anthropic. User guides - Glossary 2024 [Available from: https://docs.anthropic.com/claude/docs/glossary].
Google Scholar

[43] 37.↵
Anthropic. Introducing the next generation of Claude 2024 [Available from: https://www.anthropic.com/news/claude-3-family].
Google Scholar

[44] 38.↵
Lai H, Ge L, Sun M, Pan B, Huang J, Hou L, et al. Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models. JAMA Network Open. 2024;7(5):e2412687-e.
OpenUrl Google Scholar

[45] 39.↵
Higgins JPT, Altman DG, Gøtzsche PC, Jüni P, Moher D, Oxman AD, et al. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ. 2011;343:d5928.
OpenUrl FREE Full Text Google Scholar

[46] 40.↵
Schulhoff S, Ilie M, Balepur N, Kahadze K, Liu A, Si C, et al. The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv. 2024;2406.06608
Google Scholar

[47] 41.↵
Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput Surv. 2023;55(9):Article 195.
Google Scholar

[48] 42.↵
Clark B, Whitall J, Kwakkel G, Mehrholz J, Ewings S, Burridge J. The effect of time spent in rehabilitation on activity limitation and impairment after stroke. Cochrane Database of Systematic Reviews. 2021(10).
Google Scholar

[49] 43.
Iannizzi C, Chai KL, Piechotta V, Valk SJ, Kimber C, Monsef I, et al. Convalescent plasma for people with COVID-19: a living systematic review. Cochrane Database of Systematic Reviews. 2023(5).
Google Scholar

[50] 44.↵
Willis MA, Toews I, Soltau SLV, Kalff JC, Meerpohl JJ, Vilz TO. Preoperative combined mechanical and oral antibiotic bowel preparation for preventing complications in elective colorectal surgery. Cochrane Database of Systematic Reviews. 2023(2).
Google Scholar

[51] 45.↵
Higgins JP, Savović J, Page MJ, Sterne JA. Revised Cochrane risk-of-bias tool for randomized trials (RoB 2) - Full guidance document 2019 [Available from: https://www.riskofbias.info/welcome/rob-2-0-tool/current-version-of-rob-2].
Google Scholar

[52] 46.↵
Manakul P, Liusie A, Gales MJF. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. medRxiv. 2023:2303.08896.
Google Scholar

[53] 47.↵
Fadeeva E, Vashurin R, Tsvigun A, Vazhentsev A, Petrakov S, Fedyanin K, et al. LM-Polygraph: Uncertainty Estimation for Language Models. medRxiv. 2023:2311.07383.
Google Scholar

[54] 48.↵
Sahoo P, Singh AK, Saha S, Jain V, Mondal S, Chadha A. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv. 2024;2402.07927.
Google Scholar

[55] 49.↵
Armijo-Olivo S, Craig R, Campbell S. Comparing machine and human reviewers to evaluate the risk of bias in randomized controlled trials. Research Synthesis Methods. 2020;11(3):484–93.
OpenUrl Google Scholar

[56] 50.
Hirt J, Meichlinger J, Schumacher P, Mueller G. Agreement in Risk of Bias Assessment Between RobotReviewer and Human Reviewers: An Evaluation Study on Randomised Controlled Trials in Nursing-Related Cochrane Reviews. Journal of Nursing Scholarship. 2021;53(2):246–54.
OpenUrl Google Scholar

[57] 51.↵
Revelle W. psych: Procedures for Psychological, Psychometric, and Personality Research, R package version 2.3.12 Northwestern University, Evanston, Illinois 2023 [Available from: https://cran.r-project.org/web/packages/psych/index.html].
Google Scholar

[58] 52.↵
Higgins JPT TJ,
Chandler J,
Cumpston M,
Li T,
Page MJ,
Welch VA
Higgins JPT ES, Li T Chapter 23: Including variants on randomized trials. In: Higgins JPT TJ, Chandler J, Cumpston M, Li T, Page MJ, Welch VA editor. Cochrane Handbook for Systematic Reviews of Interventions version 64 (updated August 2023). Available from www.training.cochrane.org/handbook: Cochrane; 2023.
Google Scholar

[59] Higgins JPT TJ,

[60] Chandler J,

[61] Cumpston M,

[62] Li T,

[63] Page MJ,

[64] Welch VA

[65] 53.↵
Rao JNK, Scott AJ. A Simple Method for the Analysis of Clustered Binary Data. Biometrics. 1992;48(2):577–85.
OpenUrl CrossRef PubMed Web of Science Google Scholar

[66] 54.↵
Landis JR, Koch GG. The Measurement of Observer Agreement for Categorical Data. Biometrics. 1977;33(1):159–74.
OpenUrl CrossRef PubMed Web of Science Google Scholar

[67] 55.↵
Rotondi MA. kappaSize: Sample Size Estimation Functions for Studies of Interobserver Agreement, R package version 1.2 2018 [Available from: https://CRAN.R-project.org/package=kappaSize].
Google Scholar

[68] 56.↵
Allaire J, Gandrud C, Russell K, Yetman C. networkD3: D3 JavaScript Network Graphs from R. R package version 0.4 2017 [Available from: https://CRAN.R-project.org/package=networkD3].
Google Scholar

[69] 57.↵
Vaidyanathan R, Xie Y, Allaire J, Cheng J, Sievert C, Russell K. htmlwidgets: HTML Widgets for R. R package version 1.6.4 2023 [Available from: https://CRAN.R-project.org/package=htmlwidgets].
Google Scholar

[70] 58.
Richardson R, Dale HE, Robertson L, Meader N, Wellby G, McMillan D, Churchill R. Mental Health First Aid as a tool for improving mental health and well-being. Cochrane Database of Systematic Reviews. 2023(8).
Google Scholar

[71] 59.
Hung MSY, Chow MCM, Chien WT, Wong PYK. Effectiveness of the Mental Health First Aid programme for general nursing students in Hong Kong: A randomised controlled trial. Collegian. 2021;28(1):106–13.
OpenUrl Google Scholar

[72] 60.
Schrijver J, Lenferink A, Brusse-Keizer M, Zwerink M, van der Valk P, van der Palen J, Effing TW. Self-management interventions for people with chronic obstructive pulmonary disease. Cochrane Database of Systematic Reviews. 2022(1).
Google Scholar

[73] 61.
Jolly K, Sidhu MS, Hewitt CA, Coventry PA, Daley A, Jordan R, et al. Self management of patients with mild COPD in primary care: randomised controlled trial. BMJ. 2018;361:k2241.
OpenUrl Abstract/FREE Full Text Google Scholar

[74] 62.↵
Minozzi S, Cinquini M, Gianola S, Gonzalez-Lorenzo M, Banzi R. The revised Cochrane risk of bias tool for randomized trials (RoB 2) showed low interrater reliability and challenges in its application. Journal of Clinical Epidemiology. 2020;126:37–44.
OpenUrl CrossRef PubMed Google Scholar

[75] 63.↵
Minozzi S, Dwan K, Borrelli F, Filippini G. Reliability of the revised Cochrane risk-of-bias tool for randomised trials (RoB2) improved with the use of implementation instruction. Journal of Clinical Epidemiology. 2022;141:99–105.
OpenUrl CrossRef PubMed Google Scholar

[76] 64.↵
Minozzi S, Gonzalez-Lorenzo M, Cinquini M, Berardinelli D, Cagnazzo C, Ciardullo S, et al. Adherence of systematic reviews to Cochrane RoB2 guidance was frequently poor: a meta epidemiological study. Journal of Clinical Epidemiology. 2022;152:47–55.
OpenUrl CrossRef PubMed Google Scholar

[77] 65.↵
Sterne JA, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ. 2016;355:i4919.
OpenUrl FREE Full Text Google Scholar

Exploring the potential of Claude 2 for risk of bias assessment: Using a large language model to assess randomized controlled trials with RoB 2

ABSTRACT

INTRODUCTION

METHODS

Sample and Eligibility Criteria

Data collection

Prompt engineering and generation of Claude RoB assessments

Prompt engineering

Contents of the final main prompt template

Program

Iterations

Data analysis

Additional analyses

Sensitivity analyses

Subgroup analyses

Sample size estimation

RESULTS

Sample

Study characteristics

RoB assessment with Claude

Reasons for non-agreement

Review of two-level-discrepancies

Review of other discrepancies

Results of the additional analyses

Sensitivity analyses

Subgroup analyses

DISCUSSION

Strengths and limitations

Conclusion

DATA AVAILABILITY STATEMENT

FUNDING INFORMATION

CONFLICT OF INTEREST STATEMENT

ETHICS STATEMENT

ACKNOWLEDGEMENTS

REFERENCES

Subject Area

Citation Manager Formats

Exploring the potential of Claude 2 for risk of bias assessment: Using a large language model to assess randomized controlled trials with RoB 2

ABSTRACT

INTRODUCTION

METHODS

Sample and Eligibility Criteria

Data collection

Prompt engineering and generation of Claude RoB assessments

Prompt engineering

Contents of the final main prompt template

Program

Iterations

Data analysis

Additional analyses

Sensitivity analyses

Subgroup analyses

Sample size estimation

RESULTS

Sample

Study characteristics

RoB assessment with Claude

Reasons for non-agreement

Review of two-level-discrepancies

Review of other discrepancies

Results of the additional analyses

Sensitivity analyses

Subgroup analyses

DISCUSSION

Strengths and limitations

Conclusion

DATA AVAILABILITY STATEMENT

FUNDING INFORMATION

CONFLICT OF INTEREST STATEMENT

ETHICS STATEMENT

ACKNOWLEDGEMENTS

REFERENCES

Subject Area

Follow this preprint