A universal translator for AI scores: Providing context using error

Maggie Chung, Micheal H. Bernstein, Adam Yala, Grayson L. Baird
doi: https://doi.org/10.1101/2025.02.28.25323066
Maggie Chung, MD
1Department of Radiology and Biomedical Imaging, University of California, San Francisco, CA
Micheal H. Bernstein, PhD
2Brown Radiology Human Factors Lab, Department of Diagnostic Imaging, The Warren Alpert Medical School, Brown University, and Brown University Health, Providence, RI
Adam Yala, PhD
3Computational Precision Health, University of California, Berkeley and University of California, San Francisco, CA
Grayson L. Baird, PhD
2Brown Radiology Human Factors Lab, Department of Diagnostic Imaging, The Warren Alpert Medical School, Brown University, and Brown University Health, Providence, RI
For correspondence: grayson_baird{at}brown.edu

Abstract

Artificial intelligence (AI) programs in radiology typically provide a numeric score for each case that correlates with the underlying pathology. However, these scores are not readily interpretable by themselves. To address this, we propose improving score interpretability by providing the False Discovery Rate (FDR) and False Omission Rate (FOR) corresponding with each score threshold. Using an open-source AI program for breast cancer, we estimated FDR and FOR across a range of AI scores using data from 130,712 digital screening mammograms, of which 907 were positive and 129,805 were negative. FDR and FOR ranged from 99.27% and 0.03%, respectively, at the low end of the score distribution to 60.98% and 0.65%, respectively, at the high end of the distribution. Providing these error rates alongside AI scores allows clinicians to consider the balance of trade-offs between false positive and false negative interpretations.

Introduction

Artificial intelligence (AI) in medicine has expanded rapidly in recent years [1,2], particularly in radiology. AI outputs almost always include a numeric score for the radiologist interpreting images with AI’s aid [3-7]. These scores come in various forms, including risk scores, severity scores, etc. All are designed to provide a numeric value correlating with possible pathology for a particular image. However, these scores have several limitations, which we have discussed at length [8]. Below, we briefly review a few of these key limitations. We then discuss a solution for how scores can be presented and provide an empirical example.

Limitations of AI Scores

First, AI scores are not readily interpretable, especially when they fall within the mid-range rather than at the extreme ends of the scale. For instance, if scores can range from 0.0 to 1.0, it is unclear what a score of 0.50 or 0.70 represents in a clinical context. In other words, AI-generated scores lack intrinsic interpretability. This ambiguity may paradoxically increase a clinician’s uncertainty when evaluating a given case rather than providing useful guidance to inform a radiologist’s interpretation.

Second, each AI algorithm has its own, often proprietary, scoring system [8]. As a result, a score of 0.90 from one AI model may not correspond to the same level of risk as a score of 0.90 in another model, even when assessing the same pathology. Moreover, different algorithms may use entirely different scoring scales, further complicating direct comparisons between models. Finally, the relationship between scores and the likelihood of pathology is unknown. For example, we cannot assume that the relative increase in risk between a score of 0.20 and 0.30 is the same as the relative increase between a score of 0.60 and 0.70. Without a clear understanding of how scores relate to pathology, clinicians may struggle to determine what constitutes a meaningful change in risk. Moreover, because a score is not inherently interpretable, clinicians will likely not agree with each other about the meaning of a score, calling the inter-rater reliability of score interpretation into question. These factors markedly limit the utility of these scores in clinical decision-making.

How to Make AI Scores Interpretable

Since AI scores alone cannot resolve these ambiguities, information needs to accompany these scores to provide context. Traditional accuracy metrics, like sensitivity and specificity, are commonly used when validating AI systems, but they are not good options for aiding score interpretation. This is because radiologists only know the AI score, not whether the case has pathology. Specifically, the question relevant to a radiologist is: what proportion of cases with a given AI score or higher have pathology (i.e., the denominator is all cases with a given AI score or higher)? Sensitivity answers a different question: what proportion of cases with pathology have a given AI score or higher (i.e., the denominator is all cases with pathology)?

Therefore, we propose providing radiologists with the probability of pathology conditioned on the AI score, which is the only piece of information known to the radiologist; that is, the rates of error corresponding with a score as a threshold. Presenting the predicted probabilities (PP) and corresponding false discovery rate (FDR) and false omission rate (FOR) with AI scores provides radiologists with clinically relevant information. For example, the PP allows a radiologist to compare a given AI score’s likelihood of pathology relative to the base rate of pathology. Likewise, the FDR represents the probability that a case with the AI score or higher is actually negative for pathology (1 - positive predictive value, PPV), while the FOR represents the probability that a case below the AI score is actually positive for pathology (1 - negative predictive value, NPV).
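To make the two definitions concrete, the following is a minimal sketch of how FDR and FOR could be tabulated at a single score threshold. The function name `fdr_for_at_threshold` and the ten toy cases are illustrative assumptions, not data or code from this study.

```python
from typing import Sequence, Tuple


def fdr_for_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Return (FDR, FOR) when scores >= threshold are treated as positive.

    labels: 1 = pathology present, 0 = pathology absent.
    """
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        if score >= threshold:
            if label:
                tp += 1  # flagged and truly positive
            else:
                fp += 1  # flagged but truly negative
        else:
            if label:
                fn += 1  # not flagged but truly positive
            else:
                tn += 1  # not flagged and truly negative
    # FDR = FP / (TP + FP) = 1 - PPV; FOR = FN / (FN + TN) = 1 - NPV
    fdr = fp / (tp + fp) if (tp + fp) else float("nan")
    for_ = fn / (fn + tn) if (fn + tn) else float("nan")
    return fdr, for_


# Hypothetical toy data (not from the study).
scores = [0.05, 0.10, 0.20, 0.35, 0.50, 0.55, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 0, 1, 0, 1]

fdr, for_ = fdr_for_at_threshold(scores, labels, threshold=0.50)
print(f"FDR = {fdr:.2%}, FOR = {for_:.2%}")
```

In this toy set, six cases score 0.50 or higher and four of them are negative, so the FDR is 4/6; four cases score below 0.50 and one of them is positive, so the FOR is 1/4.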

Calculating the PP, FDR, and FOR is straightforward. Practices using a given AI system must first run that algorithm on their local historical data (a local validation). From this validation, they need only regress outcomes on the AI scores using a generalized linear model (mixed, if applicable). If that is not possible, practices can also use their local prevalence rate along with published sensitivities and specificities of the AI algorithm to calculate the FDR and FOR for their local practice. We will demonstrate both applications using screening mammography as a case study. By pairing AI scores with PPs, FDRs, and FORs, we provide a framework for clinicians to make more informed, probability-based decisions using local prevalence.
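As an illustration of the regression step, the sketch below fits a single-predictor logistic model by Newton-Raphson in plain Python and reads off a predicted probability for each score. It is a stand-in for a statistical package, not the procedure used in this study; the function names and the toy data (20 cases per score level) are hypothetical, and a mixed model for clustered data is not covered.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(
    x: Sequence[float], y: Sequence[int], iterations: int = 25
) -> Tuple[float, float]:
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p           # gradient w.r.t. intercept
            g1 += (yi - p) * xi    # gradient w.r.t. slope
            h00 += w               # entries of X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break  # singular system; stop updating
        # Newton step: solve (X'WX) * delta = gradient for the 2x2 case.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def predicted_probability(score: float, b0: float, b1: float) -> float:
    return sigmoid(b0 + b1 * score)


# Hypothetical toy data: 20 cases per score level, more positives at higher scores.
levels = [0.1, 0.3, 0.5, 0.7, 0.9]
positives = [1, 2, 4, 8, 12]
scores: List[float] = []
outcomes: List[int] = []
for level, k in zip(levels, positives):
    scores += [level] * 20
    outcomes += [1] * k + [0] * (20 - k)

b0, b1 = fit_logistic(scores, outcomes)
for level in levels:
    print(f"score {level:.1f}: PP = {predicted_probability(level, b0, b1):.1%}")
```

In practice a vetted statistical package would be preferable, since it also reports standard errors and handles convergence checks.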

Methods

University of California, San Francisco (UCSF) Institutional Review Board gave ethical approval for this Health Insurance Portability and Accountability Act–compliant study and waived the requirement for written informed consent.

Study sample. We conducted a retrospective review using a single-institution radiology database to identify 130,712 digital screening mammograms acquired between January 2006 and January 2023. All cases included at least one year of mammographic and/or clinical follow-up. Positive exams were defined as those with a histopathologic diagnosis of invasive breast carcinoma or ductal carcinoma in situ (DCIS) within 12 months of imaging. Negative exams were defined as those with at least 12 months of follow-up without a breast cancer diagnosis. Among the 130,712 cases, there were 907 positive exams and 129,805 negative exams.

AI Model. To promote reproducibility and open science, we applied Mirai, an open-source AI model for mammogram-based risk prediction, to estimate 1-year breast cancer risk using 2D digital mammograms [9]. No additional image processing was performed before the model application.

Statistics. The logistic function was fit using the LOGISTIC and GLIMMIX procedures in SAS 9.4 (SAS Institute, Cary, NC). Sensitivities and specificities were estimated using the %ROCPLOT macro, and PPV and NPV (and their complements) were calculated using Bayes’ Theorem. Presence of cancer was regressed on AI scores from Mirai. The base rate of cancer was 0.69%. We also use a base rate of 0.57% as an example when local historical data are unavailable (Table 1).

Table 1. Mirai scores corresponding with diagnostic performance metrics and outcomes (false positive and false negative counts)

Results

Providing Context with Scores

As demonstrated in Table 1 (and partially illustrated in Supplemental Video 1), each selected AI score is presented with its corresponding PP, FDR, and FOR values as a reference. For brevity, these values are provided at 0.01-unit increments from 0.01 to 0.85 (inclusive). Also included in Table 1, for reference, are sensitivity, specificity, NPV, PPV, and prevalence.

For example, for a score of 0.50, the PP of cancer is 2.62% (corresponding PP in Table 1), representing a 3.7-fold increase in the risk of cancer compared to the overall prevalence of 0.69%. The FDR at this threshold is 95%, meaning that out of all cases scoring 0.50 or higher, 95% will be negative, and 5% will be positive for cancer. Thus, among 100 cases with scores ≥0.50, 95 would be negative and 5 positive for cancer. The FOR at this threshold is 0.2%, meaning that out of all cases scoring under 0.50, 99.8% will be negative, and 0.2% will be positive for cancer. Thus, among 1,000 cases with scores <0.50, 998 will be negative and 2 will be positive for cancer. Now, consider a higher score threshold of 0.70. The PP of cancer is 7.9%, corresponding to an 11.3-fold increase in the risk of cancer relative to the prevalence rate. The FDR is 84%, and the FOR is 0.5%.

Finally, let us consider the extremes. At a score of 0.01, the PP of cancer is 0.15%, a risk that is 0.22 times the prevalence rate of 0.69%. At the 0.01 threshold, the FDR is 99.27% and the FOR is 0.03%. A score of 0.90 corresponds to a PP of cancer of 21.5%, representing a relative increase of 31-fold compared to a prevalence rate of 0.69%. At that threshold, the FDR is 56%, and the FOR is 0.7%. Note that at the highest levels of the score (0.90), the PP of cancer is roughly 20%, and about half of the cases at or above 0.90 are actually negative.

Context with change in scores

This demonstration allows us to examine how a particular change in scores should be interpreted across the range of all possible scores. For example, a 0.10 increase from an AI score of 0.20 to 0.30 means an increase in PP from 0.46% to 0.82%, a small decrease in FDR from 97.9% to 97.2%, and an increase in FOR from 0.2% to 0.3%. In contrast, an identical 0.10 increase in the score from 0.60 to 0.70 results in an increase in PP from 4.5% to 7.9%, a larger decrease in FDR from 91.6% to 84.4%, and no change in FOR (remaining at 0.5%). These examples highlight that changes in error rates are not uniform across the score scale.
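The comparison above can be checked with a few lines of arithmetic. The sketch below uses only the PP values quoted in the preceding paragraph; the dictionary and function names are illustrative.

```python
# Predicted probabilities (in percent) quoted in the text for four Mirai scores.
reported_pp = {0.20: 0.46, 0.30: 0.82, 0.60: 4.5, 0.70: 7.9}


def pp_change(low_score: float, high_score: float) -> tuple:
    """Return (absolute change in percentage points, ratio) between two scores."""
    low, high = reported_pp[low_score], reported_pp[high_score]
    return high - low, high / low


low_abs, low_ratio = pp_change(0.20, 0.30)
high_abs, high_ratio = pp_change(0.60, 0.70)

print(f"0.20 -> 0.30: +{low_abs:.2f} points (x{low_ratio:.2f})")
print(f"0.60 -> 0.70: +{high_abs:.2f} points (x{high_ratio:.2f})")
```

With these values, the same 0.10 step in score adds 0.36 percentage points of PP in the lower range and 3.4 percentage points in the upper range.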

Comparison across AI systems

Providing scores with FDR and FOR allows for a more direct comparison across AI systems in clinical settings. Different AI algorithms or systems may produce similar raw scores, but the trade-offs between false positives and false negatives may vary significantly. For example (hypothetically), imagine comparing two AI algorithms for the same pathology: a score of “5” using one algorithm may translate into an FDR of 95% and FOR of 0.1%, while a score of “5” using another algorithm may translate into an FDR of 67% and FOR of 0.3%. By reporting both FDR and FOR for each score across AI systems, clinicians are informed of how each algorithm behaves across various thresholds. This allows them to make more informed clinical decisions when selecting an AI model by allowing for cross-comparison of AI models and assessing the real-world implications of each model in their specific patient population.

Bridging scores and clinical practice

FDR and FOR can easily be converted to number of patients, making the clinical interpretability of scores straightforward for clinicians and patients. As seen in Table 1, out of 1,000 mammograms and a 0.69% prevalence, a score of 0.50 or higher corresponds with 54 false positives and 4 false negatives.
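As a rough sketch of that conversion, the expected number of false positives is the FDR times the number of cases at or above the threshold, and the expected number of false negatives is the FOR times the number of cases below it. The group sizes used here (100 and 1,000) follow the worked example for a score of 0.50 earlier in the Results rather than Table 1, and the function name is hypothetical.

```python
def expected_errors(
    n_at_or_above: int, n_below: int, fdr: float, false_omission_rate: float
) -> tuple:
    """Convert error rates into expected patient counts.

    Returns (expected false positives among cases at or above the threshold,
             expected false negatives among cases below the threshold).
    """
    false_positives = fdr * n_at_or_above
    false_negatives = false_omission_rate * n_below
    return false_positives, false_negatives


# Rates quoted in the text for a score of 0.50: FDR 95%, FOR 0.2%.
fp, fn = expected_errors(
    n_at_or_above=100, n_below=1000, fdr=0.95, false_omission_rate=0.002
)
print(f"{fp:.0f} of 100 cases at or above the threshold are expected to be negative")
print(f"{fn:.0f} of 1,000 cases below the threshold are expected to be positive")
```

Counts per 1,000 screened patients, as reported in Table 1, additionally depend on how many of those 1,000 fall on each side of the threshold.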

When historical data are not available

When it is not possible to run an AI system on historical data, published sensitivities and specificities can be used to calculate the local FDR and FOR if the local prevalence of pathology is known, assuming these sensitivity and specificity estimates are for the same population. This is demonstrated in Table 1 using a common prevalence of 0.57% for breast cancer, though any prevalence could be used.
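The calculation follows from Bayes’ Theorem and can be sketched as below. The two prevalence values (0.69% and 0.57%) are those used in this paper; the sensitivity of 0.80 and specificity of 0.90 are placeholder assumptions chosen only for illustration, and the function name is hypothetical.

```python
def fdr_for_from_accuracy(
    sensitivity: float, specificity: float, prevalence: float
) -> tuple:
    """Return (FDR, FOR) from sensitivity, specificity, and prevalence.

    Each term below is the expected fraction of all cases in that cell
    of the 2x2 table.
    """
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    tn = specificity * (1.0 - prevalence)
    fdr = fp / (tp + fp)   # 1 - PPV
    for_ = fn / (fn + tn)  # 1 - NPV
    return fdr, for_


# Placeholder accuracy values (assumptions, not published Mirai estimates).
SENSITIVITY, SPECIFICITY = 0.80, 0.90

for prevalence in (0.0069, 0.0057):
    fdr, for_ = fdr_for_from_accuracy(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.2%}: FDR = {fdr:.2%}, FOR = {for_:.3%}")
```

Holding sensitivity and specificity fixed, a lower prevalence yields a higher FDR and a lower FOR, which is why the local prevalence matters when applying published accuracy estimates.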

Discussion

By reporting the error rates corresponding to each score, scores can now be contextualized using a common language: the language of probability and error. This is possible because outcomes can be regressed on AI scores. When done with local historical data, this provides the estimates of PP, FDR, and FOR (and sensitivity, specificity, PPV, NPV) for each score for future reference. As mentioned, if a practice cannot derive these estimates with its own historical data, it can use published sensitivity and specificity estimates, along with its local prevalence, to estimate FDR and FOR for each score. Both approaches can be repeated for every new AI algorithm or AI algorithm update.

Providing PP, FDR, and FOR in conjunction with scores enables clinicians not just to interpret scores in a general sense but also to interpret these scores within the specific context of their patient population. Disease prevalence serves a critical role in how scores should be interpreted [10]. Commercial AI models often come with thresholds pre-set by the vendor based on their own training data and performance metrics. These thresholds determine the cut-off at which the AI classifies cases as positive or negative and are usually based on the model’s development datasets. However, they may not necessarily reflect the local population where the model will be deployed. Reporting FDR and FOR helps clinicians apply generalized AI models, consider their local population and prevalence, and assess whether the trade-off between false positives and false negatives is aligned with their practice patterns.

As AI tools become integrated into clinical practice, providing error rates can also assist with patient education and the shared decision-making process between clinicians and patients. For instance, if an AI system predicts a high risk of cancer but the FDR at that threshold is 95%, this means that while the AI identifies cases as positive, the vast majority (95%) of those cases are actually negative. Educating patients about this error rate can help manage anxiety and set appropriate expectations [11].

Data Availability

Data generated or analyzed during the study are available upon reasonable request to the corresponding author.

References

  1. Haug, C.J. and Drazen, J.M., 2023. Artificial intelligence and machine learning in clinical medicine, 2023. New England Journal of Medicine, 388(13), pp.1201–1208.
  2. Reddy, S., 2022. Explainability and artificial intelligence in medicine. The Lancet Digital Health, 4(4), pp.e214–e215.
  3. Li, M.D., Little, B.P., Alkasab, T.K., Mendoza, D.P., Succi, M.D., Shepard, J.A.O., Lev, M.H. and Kalpathy-Cramer, J., 2021. Multi-radiologist user study for artificial intelligence-guided grading of COVID-19 lung disease severity on chest radiographs. Academic Radiology, 28(4), pp.572–576.
  4. Lessmann, N., Sánchez, C.I., Beenen, L., Boulogne, L.H., Brink, M., Calli, E., Charbonnier, J.P., Dofferhoff, T., van Everdingen, W.M., Gerke, P.K. and Geurts, B., 2021. Automated assessment of COVID-19 reporting and data system and chest CT severity scores in patients suspected of having COVID-19 using artificial intelligence. Radiology, 298(1), pp.E18–E28.
  5. Van Assen, M., Zandehshahvar, M., Maleki, H., Kiarashi, Y., Arleo, T., Stillman, A.E., Filev, P., Davarpanah, A.H., Berkowitz, E.A., Tigges, S. and Lee, S.J., 2022. COVID-19 pneumonia chest radiographic severity score: variability assessment among experienced and in-training radiologists and creation of a multireader composite score database for artificial intelligence algorithm development. The British Journal of Radiology, 95(1134), p.20211028.
  6. Pacilè, S., Lopez, J., Chone, P., Bertinotti, T., Grouin, J.M. and Fillard, P., 2020. Improving breast cancer detection accuracy of mammography with the concurrent use of an artificial intelligence tool. Radiology: Artificial Intelligence, 2(6), p.e190208.
  7. Ahn, J.S., Ebrahimian, S., McDermott, S., Lee, S., Naccarato, L., Di Capua, J.F., Wu, M.Y., Zhang, E.W., Muse, V., Miller, B. and Sabzalipour, F., 2022. Association of artificial intelligence–aided chest radiograph interpretation with reader performance and efficiency. JAMA Network Open, 5(8), pp.e2229289–e2229289.
  8. Is a score enough? Pitfalls and Solutions for AI Severity Scores (under review).
  9. Yala, A., Mikhael, P.G., Strand, F., Lin, G., Smith, K., Wan, Y.L., Lamb, L., Hughes, K., Lehman, C. and Barzilay, R., 2021. Toward robust mammography-based models for breast cancer risk. Science Translational Medicine, 13(578).
  10. Scaringi, J.A., McTaggart, R.A., Alvin, M.D., Atalay, M., Bernstein, M.H., Jayaraman, M.V., Jindal, G., Movson, J.S., Swenson, D.W. and Baird, G.L., 2024. Implementing an AI algorithm in the clinical setting: a case study for the accuracy paradox. European Radiology, pp.1–7.
  11. Song, E.C., Bernstein, M.H., Lay, P.S., Druart, L., Dibble, E.H., Lourenco, A.P. and Baird, G.L., 2024. Accessing AI mammography reports impacts patient interest in pursuing a medical malpractice claim: The unintended consequences of including AI in patient portals. medRxiv, pp.2024–12.
Posted March 04, 2025.

Subject Area

  • Radiology and Imaging