Abstract
Artificial intelligence (AI) programs in radiology typically provide a numeric score for each case that correlates with the underlying pathology. However, these scores are not readily interpretable by themselves. To address this, we propose improving score interpretability by providing the False Discovery Rate (FDR) and False Omission Rate (FOR) corresponding with each score threshold. Using an open-source AI program for breast cancer, we estimated FDR and FOR across a range of AI scores using data from 130,712 digital screening mammograms, of which 907 were positive and 129,805 were negative. FDR and FOR ranged from 99.27% and 0.03%, respectively, at the low end of the score distribution to 60.98% and 0.65%, respectively, at the high end of the distribution. Providing these error rates alongside AI scores allows clinicians to consider the balance of trade-offs between false positive and false negative interpretations.
Introduction
Artificial intelligence (AI) in medicine has expanded rapidly in recent years [1,2], particularly in radiology. AI outputs almost always include a numeric score for the radiologist interpreting images with AI’s aid [3-7]. These scores come in various forms, including risk scores, severity scores, etc. All are designed to provide a numeric value correlating with possible pathology for a particular image. However, these scores have several limitations, which we have discussed at length [8]. Below, we briefly review a few of these key limitations. We then discuss a solution for how scores can be presented and provide an empirical example.
Limitations of AI Scores
First, AI scores are not readily interpretable, especially when they fall within the mid-range rather than at the extreme ends of the scale. For instance, if scores can range from 0.0 to 1.0, it is unclear what a score of 0.50 or 0.70 represents in a clinical context. In other words, AI-generated scores lack intrinsic interpretability. This ambiguity may paradoxically increase a clinician’s uncertainty when evaluating a given case rather than providing useful guidance to inform a radiologist’s interpretation.
Second, each AI algorithm has its own, often proprietary, scoring system [8]. As a result, a score of 0.90 from one AI model may not correspond to the same level of risk as a score of 0.90 in another model, even when assessing the same pathology. Moreover, different algorithms may use entirely different scoring scales, further complicating direct comparisons between models. Third, the relationship between scores and the likelihood of pathology is unknown. For example, we cannot assume that the relative increase in risk between a score of 0.20 and 0.30 is the same as the relative increase between a score of 0.60 and 0.70. Without a clear understanding of how scores relate to pathology, clinicians may struggle to determine what constitutes a meaningful change in risk. In addition, because a score is not inherently interpretable, clinicians will likely not agree with each other about the meaning of a score, calling into question the inter-rater reliability of score interpretation. These factors markedly limit the utility of these scores in clinical decision-making.
How to Make AI Scores Interpretable
Since AI scores alone cannot resolve these ambiguities, information needs to accompany these scores to provide context. Traditional accuracy metrics, like sensitivity and specificity, are commonly used when validating AI systems, but they are not good options for aiding score interpretation. This is because radiologists know only the AI score, not whether the case has pathology. Specifically, the question relevant to a radiologist is: what proportion of cases with a given AI score or higher have pathology? Here the denominator is all cases with a given AI score or higher. Sensitivity answers a different question: what proportion of cases with pathology have a given AI score or higher? There the denominator is all cases with pathology.
Therefore, we propose providing radiologists with the probability of pathology conditioned on the AI score, which is the only piece of information known to the radiologist; that is, the rates of error corresponding with a score as a threshold. Presenting the predicted probabilities (PP) and corresponding false discovery rate (FDR) and false omission rate (FOR) with AI scores provides radiologists with clinically relevant information. For example, the PP allows a radiologist to compare a given AI score’s likelihood of pathology relative to the base rate of pathology. Likewise, the FDR represents the probability that a case with the given AI score or higher is actually negative for pathology (1-positive predictive value, PPV), while the FOR represents the probability that a case scoring below the given AI score is actually positive for pathology (1-negative predictive value, NPV).
Calculating the PP, FDR, and FOR is straightforward. Practices using a given AI system must first run that algorithm on their local historical data (a local validation). From this validation, they need only to regress outcomes on the AI scores using a generalized linear model (a mixed model, if applicable). If that is not possible, practices can also use their local prevalence rate along with published sensitivities and specificities of the AI algorithm to calculate the FDR and FOR for their local practice. We will demonstrate both applications using screening mammography as a case study. By pairing AI scores with PPs, FDRs, and FORs, we provide a framework for clinicians to make more informed, probability-based decisions using local prevalence.
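The definitions above can be illustrated with a minimal Python sketch that treats a score as a threshold and tallies the resulting counts. The function name and the ten-exam toy data are hypothetical and are not taken from the study. The sketch is meant only to show that FDR and FOR use the flagged and unflagged exams as denominators, whereas sensitivity uses the exams with pathology.

```python
def error_rates_at_threshold(scores, labels, threshold):
    """Return FDR, FOR, PPV, NPV, and sensitivity for a score threshold.

    scores: AI scores, one per exam.
    labels: 1 if the exam was positive for pathology, 0 otherwise.
    Exams with score >= threshold are treated as flagged.
    """
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1

    flagged_total = tp + fp      # denominator for PPV and FDR
    unflagged_total = tn + fn    # denominator for NPV and FOR
    diseased_total = tp + fn     # denominator for sensitivity

    nan = float("nan")
    return {
        "FDR": fp / flagged_total if flagged_total else nan,
        "FOR": fn / unflagged_total if unflagged_total else nan,
        "PPV": tp / flagged_total if flagged_total else nan,
        "NPV": tn / unflagged_total if unflagged_total else nan,
        "sensitivity": tp / diseased_total if diseased_total else nan,
    }


# Hypothetical toy data (not from the study): ten exams, two with cancer.
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]
toy_labels = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
rates = error_rates_at_threshold(toy_scores, toy_labels, threshold=0.50)
```

In this toy example, five exams are flagged and one of them has cancer, so the FDR is 4/5. Sensitivity is 1/2 because its denominator is the two exams with cancer.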
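As a rough illustration of the regression step, the sketch below fits a single-predictor logistic model by Newton-Raphson in plain Python and uses it to return a PP for any score. It is not the study's code: the analysis reported here used SAS procedures, the data below are synthetic, the coefficients used to simulate them (-4 and 5) are arbitrary assumptions, and no random effects are modeled.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(scores, outcomes, iterations=25):
    """Fit P(outcome = 1 | score) = sigmoid(b0 + b1 * score).

    Uses Newton-Raphson on the log-likelihood for one predictor.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient (g) and negative Hessian (h) of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(scores, outcomes):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0:
            break
        # Newton step: beta += H^{-1} g, with the 2x2 inverse written out.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Synthetic stand-in for local historical data (NOT the study data).
random.seed(0)
scores = [random.random() for _ in range(2000)]
outcomes = [1 if random.random() < sigmoid(-4.0 + 5.0 * s) else 0
            for s in scores]

b0, b1 = fit_logistic(scores, outcomes)


def predicted_probability(score):
    """PP of pathology for a given AI score under the fitted model."""
    return sigmoid(b0 + b1 * score)
```

A practice would replace the synthetic lists with its own scores and confirmed outcomes. In production, an established statistics package would be preferable to this hand-written fit.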
Methods
The University of California, San Francisco (UCSF) Institutional Review Board gave ethical approval for this Health Insurance Portability and Accountability Act–compliant study and waived the requirement for written informed consent.
Study sample. We conducted a retrospective review using a single-institution radiology database to identify 130,712 digital screening mammograms acquired between January 2006 and January 2023. All cases included at least one year of mammographic and/or clinical follow-up. Positive exams were defined as those with a histopathologic diagnosis of invasive breast carcinoma or ductal carcinoma in situ (DCIS) within 12 months of imaging. Negative exams were defined as those with at least 12 months of follow-up without a breast cancer diagnosis. Among the 130,712 cases, there were 907 positive exams and 129,805 negative exams.
AI Model. To promote reproducibility and open science, we applied Mirai, an open-source AI model for mammogram-based risk prediction, to estimate 1-year breast cancer risk using 2D digital mammograms [9]. No additional image processing was performed before applying the model.
Statistics. The logistic function was fit using the LOGISTIC and GLIMMIX procedures in SAS 9.4 (SAS Institute, Cary, NC). Sensitivities and specificities were estimated using the %ROCPLOT macro, and PPV and NPV (and their complements) were calculated using Bayes’ Theorem. Presence of cancer was regressed on AI scores from Mirai. The base rate of cancer was 0.69%. We also used a base rate of 0.57% as an example when local historical data are unavailable (Table 1).
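For reference, the Bayes' Theorem step can be written out as follows. The symbols are our own notation rather than the paper's: Se is sensitivity and Sp is specificity at a given score threshold, and \pi is the prevalence (base rate) of cancer.

```latex
\begin{align}
\mathrm{PPV} &= \frac{\mathrm{Se}\,\pi}{\mathrm{Se}\,\pi + (1-\mathrm{Sp})(1-\pi)}, &
\mathrm{FDR} &= 1 - \mathrm{PPV} = \frac{(1-\mathrm{Sp})(1-\pi)}{\mathrm{Se}\,\pi + (1-\mathrm{Sp})(1-\pi)}, \\
\mathrm{NPV} &= \frac{\mathrm{Sp}\,(1-\pi)}{\mathrm{Sp}\,(1-\pi) + (1-\mathrm{Se})\,\pi}, &
\mathrm{FOR} &= 1 - \mathrm{NPV} = \frac{(1-\mathrm{Se})\,\pi}{\mathrm{Sp}\,(1-\pi) + (1-\mathrm{Se})\,\pi}.
\end{align}
```

Because \pi appears in every term, the same sensitivity and specificity yield different FDR and FOR values at base rates of 0.69% and 0.57%.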
Table 1. Mirai scores with corresponding diagnostic performance metrics and outcomes (false positive and false negative counts)
Results
Providing Context with Scores
As demonstrated in Table 1 (and partially illustrated in Supplemental Video 1), each selected AI score is presented with its corresponding PP, FDR, and FOR values as a reference. For brevity, these values are provided at 0.01-unit increments from 0.01 to 0.85 (inclusive). Also included in Table 1, for reference, are sensitivity, specificity, NPV, PPV, and prevalence.
For example, for a score of 0.50, the PP of cancer is 2.62% (corresponding PP in Table 1), representing a 3.7-fold increase in the risk of cancer compared to the overall prevalence of 0.69%. The FDR at this threshold is 95%, meaning that out of all cases scoring 0.50 or higher, 95% will be negative, and 5% will be positive for cancer. Thus, among 100 cases with scores ≥0.50, 95 would be negative and 5 positive for cancer. The FOR at this threshold is 0.2%, meaning that out of all cases scoring under 0.50, 99.8% will be negative, and 0.2% will be positive for cancer. Thus, among 1,000 cases with scores <0.50, 998 will be negative and 2 will be positive for cancer. Now, consider a higher score threshold of 0.70. The PP of cancer is 7.9%, corresponding to an 11.3-fold increase in the risk of cancer relative to the prevalence rate. The FDR is 84%, and the FOR is 0.5%.
Finally, consider the extremes. At a score of 0.01, the PP of cancer is 0.15%, which is 0.22 times the prevalence rate of 0.69%. At the 0.01 threshold, the FDR is 99.27% and the FOR is 0.03%. A score of 0.90 corresponds to a PP of cancer of 21.5%, a 31-fold increase relative to the prevalence rate of 0.69%. At that threshold, the FDR is 56%, and the FOR is 0.7%. Note that even at the highest score considered (0.90), the PP of cancer is roughly 20%, and about half of the cases at or above 0.90 are actually negative.
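The conversion from an error rate to expected case counts in the example above is simple multiplication, sketched below. The function name is hypothetical. The rates are the ones quoted in the text for the 0.50 threshold, and the group sizes (100 and 1,000) are the illustrative ones used there.

```python
def split_by_error_rate(n_cases, error_rate):
    """Split a group of cases into expected (correct, erroneous) counts."""
    errors = n_cases * error_rate
    return n_cases - errors, errors


# Among 100 cases scoring >= 0.50, the FDR (95%) gives expected negatives.
positives_flagged, negatives_flagged = split_by_error_rate(100, 0.95)

# Among 1,000 cases scoring < 0.50, the FOR (0.2%) gives expected positives.
negatives_cleared, positives_cleared = split_by_error_rate(1000, 0.002)
```

The "erroneous" count is the false positives when the rate is an FDR and the false negatives when the rate is a FOR.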
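The fold changes at the extremes are the ratio of the PP to the prevalence. A minimal sketch using the values quoted above (the function name is hypothetical):

```python
def fold_change(predicted_probability, prevalence):
    """Ratio of the PP at a score to the overall prevalence."""
    return predicted_probability / prevalence


prevalence = 0.0069                                  # 0.69% base rate
low_score_ratio = fold_change(0.0015, prevalence)    # PP at a score of 0.01
high_score_ratio = fold_change(0.215, prevalence)    # PP at a score of 0.90
```

A ratio below 1 means the PP at that score is lower than the base rate. A ratio above 1 means it is higher.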
Context with change in scores
This demonstration allows us to examine how a particular change in scores should be interpreted across the range of all possible scores. For example, a 0.10 increase from an AI score of 0.20 to 0.30 means an increase in PP from 0.46% to 0.82%, a small decrease in FDR from 97.9% to 97.2%, and an increase in FOR from 0.2% to 0.3%. In contrast, an identical 0.10 increase in the score from 0.60 to 0.70 results in an increase in PP from 4.5% to 7.9%, a larger decrease in FDR from 91.6% to 84.4%, and no change in FOR (remaining at 0.5%). These examples highlight that changes in error rates are not uniform across the score scale.
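The non-uniformity can be made explicit by computing the change in percentage points for each pair of scores. The sketch below uses only the PP and FDR values reported in the paragraph above. The dictionary layout and function name are our own.

```python
# PP and FDR (in %) at selected scores, as reported in the text.
reported = {
    0.20: {"PP": 0.46, "FDR": 97.9},
    0.30: {"PP": 0.82, "FDR": 97.2},
    0.60: {"PP": 4.5, "FDR": 91.6},
    0.70: {"PP": 7.9, "FDR": 84.4},
}


def change(lower_score, upper_score, metric):
    """Change in a metric, in percentage points, between two scores."""
    delta = reported[upper_score][metric] - reported[lower_score][metric]
    return round(delta, 2)


low_range_pp = change(0.20, 0.30, "PP")
high_range_pp = change(0.60, 0.70, "PP")
low_range_fdr = change(0.20, 0.30, "FDR")
high_range_fdr = change(0.60, 0.70, "FDR")
```

The same 0.10 step in score moves the PP by 0.36 percentage points in the lower range and by 3.4 percentage points in the upper range.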
Comparison across AI systems
Providing scores with FDR and FOR allows for a more direct comparison across AI systems in clinical settings. Different AI algorithms or systems may produce similar raw scores, but the trade-offs between false positives and false negatives may vary significantly. For example (hypothetically), imagine comparing two AI algorithms for the same pathology: a score of “5” using one algorithm may translate into an FDR of 95% and FOR of 0.1%, while a score of “5” using another algorithm may translate into an FDR of 67% and FOR of 0.3%. By reporting both FDR and FOR for each score across AI systems, clinicians are informed of how each algorithm behaves across various thresholds. This allows them to make more informed clinical decisions when selecting an AI model by allowing for cross-comparison of AI models and assessing the real-world implications of each model in their specific patient population.
Bridging scores and clinical practice
FDR and FOR can easily be converted to numbers of patients, making the clinical interpretation of scores straightforward for clinicians and patients. As seen in Table 1, among 1,000 mammograms at a 0.69% prevalence, a score threshold of 0.50 corresponds with 54 false positives and 4 false negatives.
When historical data are not available
When it is not possible to run an AI system on historical data, published sensitivities and specificities can be used to calculate the local FDR and FOR if the local prevalence of pathology is known, assuming these sensitivity and specificity estimates were derived from a comparable population. This is demonstrated in Table 1 using a common prevalence of 0.57% for breast cancer, though any prevalence could be used.
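A minimal sketch of this calculation follows. The sensitivity and specificity values are hypothetical placeholders, not published Mirai estimates. Only the two prevalence values come from the text.

```python
def fdr_for_from_published(sensitivity, specificity, prevalence):
    """Return (FDR, FOR) from sensitivity, specificity, and prevalence."""
    # Expected fraction of all exams falling into each cell.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    fdr = fp / (tp + fp)   # share of flagged exams that are negative
    fom = fn / (tn + fn)   # share of unflagged exams that are positive (FOR)
    return fdr, fom


# Hypothetical published estimates at one score threshold (assumed values).
published_sensitivity = 0.40
published_specificity = 0.95

fdr_069, for_069 = fdr_for_from_published(
    published_sensitivity, published_specificity, prevalence=0.0069)
fdr_057, for_057 = fdr_for_from_published(
    published_sensitivity, published_specificity, prevalence=0.0057)
```

With sensitivity and specificity held fixed, the lower prevalence gives a higher FDR and a lower FOR, which is why the local prevalence matters when published estimates are reused.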
Discussion
By reporting the error rates corresponding to each score, scores can now be contextualized using a common language: the language of probability and error. This is possible because outcomes can be regressed on AI scores. When done with local historical data, this provides the estimates of PP, FDR, and FOR (and sensitivity, specificity, PPV, NPV) for each score for future reference. As mentioned, if a practice cannot derive these estimates with its own historical data, it can use published sensitivity and specificity estimates, along with its local prevalence, to estimate FDR and FOR for each score. Both approaches can be repeated with every new AI algorithm or AI algorithm update.
Providing PP, FDR, and FOR in conjunction with scores enables clinicians not just to interpret scores in a general sense but also to interpret these scores within the specific context of their patient population. Disease prevalence serves a critical role in how scores should be interpreted [10]. Commercial AI models often come with thresholds pre-set by the vendor based on their own training data and performance metrics. These thresholds determine the cut-off at which the AI classifies cases as positive or negative and are usually based on the model’s performance on the development datasets. However, they may not necessarily reflect the local population where the model will be deployed. Reporting FDR and FOR helps clinicians apply generalized AI models, consider their local population and prevalence, and assess whether the trade-off between false positives and false negatives is aligned with their practice patterns.
As AI tools become integrated into clinical practice, providing error rates can also assist with patient education and the shared decision-making process between clinicians and patients. For instance, if an AI system predicts a high risk of cancer but the FDR at that threshold is 95%, this means that while the AI identifies cases as positive, the vast majority (95%) of those cases are actually negative. Educating patients about this error rate can help manage anxiety and set appropriate expectations [11].
Data Availability
Data generated or analyzed during the study are available upon reasonable request to the corresponding author.