Demystifying AI in healthcare
BMJ 2020; 370 doi: https://doi.org/10.1136/bmj.m3505 (Published 09 September 2020) Cite this as: BMJ 2020;370:m3505
Linked RMR: Guidelines for clinical trial protocols for interventions involving artificial intelligence
Linked RMR: Reporting guidelines for clinical trial reports for interventions involving artificial intelligence
- Laure Wynants, assistant professor1 2 3,
- Luc J M Smits, professor1,
- Ben Van Calster, associate professor2 3 4
- 1Department of Epidemiology, CAPHRI Care and Public Health Research Institute, Maastricht University, Maastricht, Netherlands
- 2Department of Development and Regeneration, KU Leuven, Leuven, Belgium
- 3EPI-Centre, KU Leuven, Leuven, Belgium
- 4Department of Biomedical Data Sciences, Leiden University Medical Centre, Leiden, Netherlands
- Correspondence to: L Wynants laure.wynants{at}maastrichtuniversity.nl
In academia and society at large, attention to artificial intelligence (AI) in healthcare is tremendous. Although many researchers and commentators claim that AI improves screening, diagnosis, and prognostication, those who delve deeper will notice a scarcity of external validation studies and randomised controlled trials evaluating the true impact of AI on healthcare.1 2 3 Findings from the few published randomised controlled trials are mixed. In one trial, endoscopy assisted by an automatic AI detection system found more colorectal adenomas than did unassisted endoscopy.4 In another, an AI platform for diagnosing childhood cataracts was less accurate than a senior consultant.5 To gauge the quality of such evidence, readers need a detailed account of study methods and results. Systematic reviews, however, show that studies on AI are often poorly reported.2 6
Reporting guidelines
New extensions of the SPIRIT (Standard Protocol Items: Recommendations for Interventional Trials) (doi:10.1136/bmj.m3210) and CONSORT (Consolidated Standards of Reporting Trials) (doi:10.1136/bmj.m3164) reporting guidelines, published in The BMJ, encourage authors to be transparent and comprehensive when writing protocols for trials that evaluate AI interventions,7 and when reporting the results of such trials.8 They cover important issues specific to AI interventions, such as specifying the level of expertise required for researchers interacting with the study’s AI (for example, to identify a region of interest on an image, or to translate AI output into clinical decisions). The operational requirements for integrating AI into the study’s clinical setting must also be clear, as must any need to fine tune an AI algorithm using data from the local environment.
We can anticipate a positive effect of these reporting guidelines on the quality (and perhaps quantity) of trial reports in this rapidly developing area. Registering a trial protocol improves transparency and discourages research practices that might yield misleading results, such as switching the primary outcome after the results are known.9 Similarly, empirical research suggests that CONSORT guidelines improved the quality of reporting, but that reporting remains suboptimal.10 11 Funders, scientific publishers, and peer reviewers have an important responsibility to enforce protocol registration and the adoption of appropriate guidelines.11
But even a transparently reported study can lead to misguided conclusions if the trial is poorly designed, if it targets an inappropriate primary outcome, or if the AI system is not well embedded in the clinician’s digital environment and workflow. In addition, owing to the difficulty and cost of running randomised controlled trials, it is important to evaluate the performance of AI algorithms in external validation studies first.1 2 3
One example of a primary outcome that could lead to unjustified claims about AI’s benefits is the number of detected cases in a trial comparing clinicians’ diagnostic performance with or without AI support. Such a trial is likely to show that AI helps detect more cases, even if the AI’s alerts are completely random. A balanced evaluation must weigh up the increase in detected cases against the risk of false alerts.
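The arithmetic behind this point can be sketched in a few lines of Python. This is an illustrative calculation, not an analysis from any trial. It assumes that alerts fire at random, independently of disease status and of the clinician's own judgment, and that the clinician acts on every alert. The names `Performance`, `with_random_alerts`, and `expected_counts`, and all numbers used, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Performance:
    """Diagnostic performance expressed as two probabilities."""
    sensitivity: float          # P(flagged | disease present)
    false_positive_rate: float  # P(flagged | disease absent)


def with_random_alerts(clinician: Performance, alert_rate: float) -> Performance:
    """Performance when a patient is flagged if the clinician OR a random alert flags them.

    Assumption: alerts are independent of disease status and of the clinician,
    and every alert is acted on.
    """
    if not 0.0 <= alert_rate <= 1.0:
        raise ValueError("alert_rate must be a probability between 0 and 1")
    # A patient the clinician misses is still flagged whenever an alert fires.
    sensitivity = clinician.sensitivity + (1 - clinician.sensitivity) * alert_rate
    # The same mechanism also flags healthy patients the clinician had cleared.
    false_positive_rate = (
        clinician.false_positive_rate
        + (1 - clinician.false_positive_rate) * alert_rate
    )
    return Performance(sensitivity, false_positive_rate)


def expected_counts(perf: Performance, n_patients: int, prevalence: float) -> tuple[float, float]:
    """Expected (detected cases, false alerts) in a cohort of n_patients."""
    diseased = n_patients * prevalence
    healthy = n_patients - diseased
    return perf.sensitivity * diseased, perf.false_positive_rate * healthy


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    unaided = Performance(sensitivity=0.70, false_positive_rate=0.10)
    aided = with_random_alerts(unaided, alert_rate=0.20)
    for label, perf in (("unaided", unaided), ("random alerts", aided)):
        detected, false_alerts = expected_counts(perf, n_patients=1000, prevalence=0.10)
        print(f"{label}: detected={detected:.0f}, false alerts={false_alerts:.0f}")
```

With these made-up inputs, the expected number of detected cases rises from about 70 to about 76 per 1000 patients, while expected false alerts rise from about 90 to about 252. A primary outcome that counts only detected cases would register the first change and ignore the second.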
Another example of the potential for misleading results is a trial of a very accurate AI system that has poor user adherence as a result of the way it is embedded in the clinician’s environment. Poor adherence might be an important reason why clinical decision support systems have largely failed to improve patient health or reduce healthcare costs in trials.12 13 Factors that have been shown to improve outcomes associated with clinical decision support systems include user friendliness, involving stakeholders in implementation, and using systems that give actionable recommendations, nudge users to comply (for example, by asking for a reason to overrule a recommendation), and target clinicians and patients simultaneously in a shared decision making context.12 13
Reporting harm
Similar to the monitoring of drug side effects, AI errors and other associated harms must be monitored and reported—both during trials and later in clinical practice. The new CONSORT and SPIRIT extensions encourage transparent reporting of errors, such as errors in diagnosing rare tumour subtypes or diagnostic errors in certain population subgroups.
One particularly worrying type of error arises from underrepresentation of minorities in the training data for AI systems, such as an application for detecting melanoma that is trained only on white skin. Another is the replication of social biases such as delayed lung cancer diagnosis in patients of low socioeconomic status.14 15 By mechanisms such as these, AI replicates and could even exacerbate health inequities. This is particularly harmful when an AI system is wrongly perceived as objective and free from bias. Using large and diverse samples that allow subgroup analyses provides an opportunity to tackle these problems.
Despite the above considerations, we have an exciting new era to look forward to, in which the true potential of AI will gradually emerge. Sceptics might become enthusiasts; enthusiasts might be disappointed. But whatever happens, well designed trials, registered and published protocols, and transparent reporting will help ensure that a nuanced appraisal of all AI interventions is based on robust evidence instead of fears or aspirations.
Footnotes
Competing interests: The BMJ has judged that there are no disqualifying financial ties to commercial companies. The authors declare the following other interests: LW and BVC acknowledge funding from Internal Funds KU Leuven, KOOR, and the COVID-19 Fund, and Research Foundation-Flanders (FWO), all unrelated to the current editorial. LW and LJMS acknowledge funding from ZonMw, unrelated to the current editorial.
Provenance and peer review: Commissioned; not peer reviewed.