Application of simultaneous uncertainty quantification for image segmentation with probabilistic deep learning: Performance benchmarking of oropharyngeal cancer target delineation as a use-case

Background: Oropharyngeal cancer (OPC) is a widespread disease, with radiotherapy being a core treatment modality. Manual segmentation of the primary gross tumor volume (GTVp) is currently employed for OPC radiotherapy planning but is subject to significant interobserver variability. Deep learning (DL) approaches have shown promise in automating GTVp segmentation, but comparative (auto)confidence metrics of these models' predictions have not been well explored. Quantifying instance-specific DL model uncertainty is crucial to improving clinician trust and facilitating broad clinical implementation. Therefore, in this study, probabilistic DL models for GTVp auto-segmentation were developed using large-scale PET/CT datasets, and various uncertainty auto-estimation methods were systematically investigated and benchmarked.
Methods: We utilized the publicly available 2021 HECKTOR Challenge training dataset, with 224 co-registered PET/CT scans of OPC patients and corresponding GTVp segmentations, as a development set. A separate set of 67 co-registered PET/CT scans of OPC patients with corresponding GTVp segmentations was used for external validation. Two approximate Bayesian deep learning methods, the MC Dropout Ensemble and the Deep Ensemble, both with five submodels, were evaluated for GTVp segmentation and uncertainty performance. Segmentation performance was evaluated using the volumetric Dice similarity coefficient (DSC), mean surface distance (MSD), and Hausdorff distance at 95% (95HD). Uncertainty was evaluated using four measures from the literature (coefficient of variation (CV), structure expected entropy, structure predictive entropy, and structure mutual information) and additionally with our novel Dice-risk measure. The utility of uncertainty information was evaluated through the accuracy of uncertainty-based segmentation performance prediction, using the Accuracy vs Uncertainty (AvU) metric, and by examining the linear correlation between uncertainty estimates and DSC. In addition, batch-based and instance-based referral processes were examined, in which patients with high uncertainty were rejected from the set. In the batch referral process, the area under the referral curve with DSC (R-DSC AUC) was used for evaluation, whereas in the instance referral process, the DSC at various uncertainty thresholds was examined.
Results: Both models behaved similarly in terms of segmentation performance and uncertainty estimation. Specifically, the MC Dropout Ensemble had 0.776 DSC, 1.703 mm MSD, and 5.385 mm 95HD. The Deep Ensemble had 0.767 DSC, 1.717 mm MSD, and 5.477 mm 95HD. The uncertainty measure with the highest DSC correlation was structure predictive entropy, with correlation coefficients of 0.699 and 0.692 for the MC Dropout Ensemble and the Deep Ensemble, respectively. The highest AvU value was 0.866 for both models. In the batch referral process, the best performing uncertainty measure for both models was the CV, with R-DSC AUC of 0.783 and 0.782 for the MC Dropout Ensemble and the Deep Ensemble, respectively. When patients were referred based on uncertainty thresholds derived from a validation DSC of 0.85, the DSC, averaged over all uncertainty measures, improved relative to the full dataset by 4.7% and 5.0% while referring 21.8% and 22% of patients for the MC Dropout Ensemble and the Deep Ensemble, respectively.
Conclusion: We found that many of the investigated methods provide overall similar but distinct utility in terms of predicting segmentation quality and referral performance. These findings are a critical first step towards more widespread implementation of uncertainty quantification in OPC GTVp segmentation.
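The structure-level uncertainty measures named in the Methods can be illustrated with a minimal Python sketch. It uses the common ensemble formulation: predictive entropy is the entropy of the mean submodel prediction, expected entropy is the mean of the submodel entropies, and mutual information is their difference. The aggregation region (voxels whose ensemble-mean probability is at least 0.5), the volume-based definition of the CV, and all function and parameter names are assumptions made for illustration; they are not necessarily the definitions used in this study. The Dice-risk measure is not sketched because its definition is not given here.

```python
import math
from statistics import mean, pstdev


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with foreground probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def structure_uncertainty(probs, threshold=0.5):
    """Structure-level uncertainty from an ensemble (illustrative only).

    probs: list of M lists; probs[m][v] is submodel m's foreground
           probability for (flattened) voxel v.
    """
    n_voxels = len(probs[0])
    mean_prob = [mean(m[v] for m in probs) for v in range(n_voxels)]

    # Assumption: aggregate over the ensemble-mean predicted structure;
    # fall back to all voxels if nothing is predicted.
    region = [v for v in range(n_voxels) if mean_prob[v] >= threshold]
    if not region:
        region = list(range(n_voxels))

    # Entropy of the mean prediction, averaged over the region.
    predictive = mean(binary_entropy(mean_prob[v]) for v in region)
    # Mean of the submodel entropies, averaged over the region.
    expected = mean(mean(binary_entropy(m[v]) for m in probs) for v in region)

    # Assumption: CV of the predicted volumes (voxel counts) across submodels.
    volumes = [sum(p >= threshold for p in m) for m in probs]
    mean_volume = mean(volumes)
    cv = pstdev(volumes) / mean_volume if mean_volume > 0 else 0.0

    return {
        "cv": cv,
        "expected_entropy": expected,
        "predictive_entropy": predictive,
        "mutual_information": predictive - expected,
    }
```

Under these assumptions, submodels that agree perfectly yield zero CV and zero mutual information, whereas disagreement between submodels raises both.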

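The batch referral process described in the abstract can likewise be sketched in a few lines. The example below sorts patients by uncertainty, refers (removes) an increasing fraction of the most uncertain ones, and records the mean DSC of the retained patients; the area under that curve plays the role of the R-DSC AUC. The grid of referral fractions, the trapezoidal integration, the normalization by the fraction range, and the function names are assumptions for illustration rather than the study's exact procedure.

```python
def referral_curve(dsc, uncertainty, fractions=None):
    """Mean DSC of retained patients after referring the most uncertain ones.

    dsc, uncertainty: per-patient lists of equal length.
    fractions: referred fractions in [0, 1); hypothetical default grid.
    """
    if fractions is None:
        fractions = [i / 10 for i in range(10)]  # refer 0%, 10%, ..., 90%

    # Order patients from most certain to most uncertain.
    order = sorted(range(len(dsc)), key=lambda i: uncertainty[i])

    curve = []
    for f in fractions:
        # Keep at least one patient so the mean is always defined.
        n_keep = max(1, round(len(order) * (1 - f)))
        kept = order[:n_keep]
        curve.append(sum(dsc[i] for i in kept) / n_keep)
    return fractions, curve


def area_under_curve(x, y):
    """Trapezoidal area under y(x), normalized by the range of x."""
    area = sum((x[k + 1] - x[k]) * (y[k] + y[k + 1]) / 2
               for k in range(len(x) - 1))
    return area / (x[-1] - x[0])
```

If uncertainty is informative, the curve rises as more patients are referred, so the normalized area exceeds the mean DSC of the full set. An instance-based variant would instead refer any patient whose uncertainty exceeds a fixed threshold.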

Appendix B: Additional Qualitative Analysis
In this section, we present qualitative results for select cases in the MDA holdout dataset. Specifically, we describe our interpretations of model predictions and uncertainty maps relative to ground truth across multiple axial image slices, similar to how a case would be reviewed in the clinic. For simplicity, we only describe results of the MC Dropout Ensemble model. "High" and "low" values are relative to median values described in the main text (e.g., high DSC is greater than 0.61).
Case 1: Low performance and high certainty.
Here we describe a case with low DSC (0.43) but high certainty (− G = −0.43). Axial slice representations from superior to inferior slices for this case are shown in Figure B1. In the inferior slices (slice 54), there is initially a relatively large degree of uncertainty about the beginning of the prediction. Subsequently (slice 66), the model correctly predicts the tumor at the left base of tongue, while a region of uncertainty simultaneously appears at the right base of tongue, likely because high PET signal creates a potential area of false positivity. This false positive PET signal is ultimately not included in the predicted segmentation mask, which in this case is the desired outcome. More superiorly (slice 79), both the uncertainty and the resultant prediction seem to have erroneously localized to the hypermetabolic core of the primary tumor. Finally, at the most superior slices (slice 90), there is metal streak artifact induced by dental hardware, which may have interfered with model inference and subsequent uncertainty quantification, as no prediction was generated. The main takeaway from this case is that the model overemphasizes PET signal (which has been previously noted in PET/CT auto-segmentation models), and this is also reflected in the resultant uncertainty measures. Moreover, image artifacts may also impact performance and uncertainty estimation.
medRxiv preprint doi: https://doi.org/10.1101/2023.02.20.23286188; this version posted February 24, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Figure B1: Additional qualitative investigation of a case with low performance and high certainty. Number in top left corner = slice number; green dotted outline = ground-truth segmentation, red dotted outline = predicted segmentation. Blue, gray, and yellow colors in uncertainty maps correspond to low, medium, and high model uncertainty, respectively.
Case 2: High performance and low certainty.
Here we describe a case with high DSC (0.64) but low certainty (− G = −0.5). Axial slice representations from superior to inferior slices for this case are shown in Figure B2. In the inferior-most slices (slices 18-27), uncertainty is noted near the larynx, likely a byproduct of high PET signal. As before, this false positive PET signal is ultimately not included in the predicted segmentation mask, which in this case is the desired outcome. More superiorly (slice 61), the model begins to predict a segmentation on only the right side of the base of tongue, whereas the ground truth is a bilateral segmentation. Importantly, the model starts to note uncertainty on the contralateral part of the image, which is a desired outcome. As we move further superiorly towards the tonsils (slice 70), the tumor begins to exhibit an uncommon presentation (discontinuous fragment, bilateral in both tonsils), but the prediction starts to better approximate the ground truth; the uncertainty previously demonstrated at the contralateral side (left) is still present but has now started to become included in the predicted segmentation. Continuing superiorly (slice 78), there is still high uncertainty in the discontinuous fragment, but the model is able to generate a reasonable prediction. However, the model eventually starts to generate an implausible prediction in an air space (slice 85) as it erroneously begins to produce a bilateral segmentation. As with the previous case, towards the superior-most part of the image (slices 90-94), metal streak artifact induced by dental hardware may alter the predictions and uncertainty estimation; notably, the prediction ignores the false positive PET signal. The main takeaway from this case is that uncommon tumor presentations (e.g., fragmentation of the tumor from one continuous piece into two pieces) may present issues in generating predictions and uncertainty estimates. Moreover, as before, image artifacts may impact predictions and uncertainty estimation.
Figure B2: Additional qualitative investigation of a case with high performance and low certainty. Number in top left corner = slice number; green dotted outline = ground-truth segmentation, red dotted outline = predicted segmentation. Blue, gray, and yellow colors in uncertainty maps correspond to low, medium, and high model uncertainty, respectively.

Case 3: Contralateral uncertainty.
Here we describe an interesting case with high DSC (0.64) and high certainty (− G = −0.41). Axial slice representations from superior to inferior slices for this case are shown in Figure B3. At the inferior slice (slice 60), the model generates the prediction correctly at the left tonsil but starts to note uncertainty at the contralateral tonsil. Subsequently, at the more superior slice (slice 70), the contralateral portion is revealed as part of the ground-truth segmentation. The model is still uncertain about the area and ultimately does not include it as part of the prediction. In other words, the contralateral uncertainty indicates a false negative area that the model is uncertain about. In a clinical workflow this would correspond to an area the clinician could choose to further investigate.
Figure B3: Additional qualitative investigation of a case with contralateral uncertainty. Number in top left corner = slice number; green dotted outline = ground-truth segmentation, red dotted outline = predicted segmentation. Blue, gray, and yellow colors in uncertainty maps correspond to low, medium, and high model uncertainty, respectively.
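As a rough illustration of how such areas might be surfaced for review, the following Python sketch flags axial slices that contain high-uncertainty voxels lying outside the predicted mask. The nested-list volume layout, the uncertainty cutoff, the minimum voxel count, and the function name are all hypothetical choices for this example; the study does not describe an automated flagging step.

```python
def flag_slices_for_review(uncertainty, prediction, high=0.5, min_voxels=5):
    """Return (slice_index, count) for slices with uncertain, unpredicted voxels.

    uncertainty: volume as nested lists [slice][row][col] of uncertainty values.
    prediction:  volume of the same shape; truthy values mark predicted tumor.
    high:        hypothetical cutoff above which a voxel counts as uncertain.
    min_voxels:  hypothetical minimum count needed to flag a slice.
    """
    flagged = []
    for z, (unc_slice, pred_slice) in enumerate(zip(uncertainty, prediction)):
        # Count uncertain voxels that the model left out of its prediction.
        count = sum(
            1
            for unc_row, pred_row in zip(unc_slice, pred_slice)
            for u, p in zip(unc_row, pred_row)
            if u >= high and not p
        )
        if count >= min_voxels:
            flagged.append((z, count))
    return flagged
```

A reviewer could then start with the flagged slices, where uncertainty outside the prediction may point to a possible false negative region such as the contralateral tonsil in this case.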

Case 4: Nodal uncertainty.
Here we describe an interesting case with high DSC (0.71) and high certainty (− G = −0.44). Axial slice representations from superior to inferior slices for this case are shown in Figure B4. In the inferior-most slice (slice 22), uncertainty is noted in an area of high PET signal (likely spurious); this area is not included in the prediction, which in this case is the desired outcome. More superiorly (slices 61-70), a metastatic lymph node is present on the right side of the image; uncertainty is correspondingly noted in this area, and it is ultimately not included in the prediction. The model is able to generate a prediction for the right base of tongue tumor without issues. Notably, as observed in the majority of other cases, metastatic lymph nodes are normally not considered by the model at all, likely due to the often large geometric distances between the nodal metastases and the primary tumors. In this case, the node exhibits features (i.e., high PET signal) in close proximity to the primary tumor, which could have led to the model uncertainty about this area.
Figure B4: Additional qualitative investigation of a case with nodal uncertainty. Number in top left corner = slice number; green dotted outline = ground-truth segmentation, red dotted outline = predicted segmentation. Blue, gray, and yellow colors in uncertainty maps correspond to low, medium, and high model uncertainty, respectively.