International Multi-Specialty Expert Physician Preoperative Identification of Extranodal Extension in Oropharyngeal Cancer Patients using Computed Tomography: Prospective Blinded Human Inter-Observer Performance Evaluation

Importance: Pathologic extranodal extension (pENE) is a critical prognostic factor in oropharyngeal cancer (OPC) that drives therapeutic disposition. Determination of pENE from radiological imaging has been associated with high inter-observer variability. However, the impact of clinician specialty on human observer performance in identifying imaging-detected extranodal extension (iENE) remains poorly understood.
Objective: To characterize the impact of clinician specialty on the accuracy of pre-operative iENE assessment in human papillomavirus-positive (HPV+) OPC using computed tomography (CT) images.
Design, Setting, and Participants: This prospective observational human performance study analyzed pre-therapy CT images from 24 HPV+ OPC patients, with duplication of 6 scans (n=30), of which 21 had pathologically confirmed pENE. Thirty-four expert observers, including 11 radiologists, 12 surgeons, and 11 radiation oncologists, independently assessed these scans for iENE and reported human-detected radiologic criteria and observer confidence.
Main Outcomes and Measures: The primary outcomes included accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and Brier score for each physician, compared to ground-truth pENE. The significance of radiographic signs for prediction of pENE was determined through logistic regression analysis. Interobserver agreement was measured with Fleiss’ kappa, and AUC discrimination was tested with the Hanley-McNeil method.
Results: Median accuracy across all specialties was 0.57 (95% CI 0.39 to 0.73), with no specialty showing discriminative performance greater than random estimation (median AUC 0.64, 95% CI 0.44 to 0.83). Significant differences were observed between radiologists and surgeons in Brier scores (0.33 vs. 0.26, p < 0.01), between radiation oncologists and surgeons in sensitivity (0.48 vs. 0.69, p > 0.1), and between radiation oncologists and radiologists/surgeons in specificity (0.89 vs. 0.56, p > 0.1). Indistinct capsular contour and nodal necrosis were significant predictors of correct pENE status among all specialties. Interobserver agreement was weak for all the radiographic criteria, regardless of specialty (κ<0.6).
Conclusions and Relevance: Multiobserver testing shows that physician discrimination of HPV+ OPC pENE on pre-operative CT remains no different from blind guessing, with high interrater variability and low diagnostic accuracy, regardless of clinician specialty. While minor differences in diagnostic performance among specialties are noted, they do not significantly affect the overall poor agreement and discrimination rates observed. The findings underscore the need for further research into automated detection systems or enhanced imaging techniques to improve the accuracy and reliability of iENE assessments in clinical practice.


INTRODUCTION
Extranodal extension (ENE), a phenomenon where tumor cells extend beyond the capsule of a lymph node with tumor metastasis, is among the most important adverse prognostic factors in oropharyngeal cancer (OPC), and in head and neck squamous cell carcinoma (HNSCC) more broadly 1. ENE is often used in clinical decision-making to determine the therapeutic approach for human papillomavirus-positive (HPV+) OPC patients. While there is ambiguity with regard to the impact of clinical/radiographic nodal extension in terms of chemoradiation efficacy, large-scale surgical registry data from the National Cancer Database showed that, in >66,000 patients treated surgically, documented ENE was associated with a >60% increase in the hazard of death (hazard ratio = 1.63) 2. Currently, existing treatment paradigms recommend adjuvant chemoradiotherapy if ENE is present 3. Alternatively, minimally invasive surgery, e.g., trans-oral robotic surgery, may be preferred if ENE is not present. Therefore, accurate determination of ENE status is crucial for appropriate treatment stratification, which may have significant impacts on patient outcomes.
The current gold-standard approach to identify ENE status in OPC patients involves histopathological evaluation of lymph nodes 1 . Radiological identification of ENE using commonly available imaging modalities, such as computed tomography (CT), has long been seen as an attractive alternative for non-invasive determination of ENE. Unfortunately, numerous studies have demonstrated that clinician-based radiological identification of ENE in OPC is prone to high variability and poor discriminative performance [4][5][6][7][8] . Naturally, most of these studies have specifically investigated the discriminative ability of diagnostic radiologists. However, contemporary evaluation and treatment of OPC is typically dependent on the consensus of a multidisciplinary team 9,10 , with diverse input from clinicians specialized in radiology, surgery, and radiation oncology. Therefore, it is of vital importance to investigate and understand differences between clinical specialties in the interpretation of radiological detectability of ENE, in addition to overall observer performance.
In this study, using a large number of clinician annotators, we prospectively benchmarked specialty-specific discriminative ability of ENE in HPV+ OPC. Through the use of various measures of discriminative performance and observer variability, we probed the underlying relationships between radiologists, surgeons, and radiation oncologists in their interpretation of ENE. Additionally, we prospectively determined the relative intra- and inter-observer performance of expert physicians for detection of extranodal extension using an in silico blinded performance benchmarking task.

METHODS
Clinician annotator characteristics
Thirty-four expert clinician annotators from various medical specialties were recruited for this prospective study: 11 radiologists, 12 surgeons, and 11 radiation oncologists. All observers provided informed consent to have their data utilized in this study.

Patient and imaging characteristics
Twenty-four patients with a pathologically confirmed diagnosis of HPV+ OPC were included in this analysis. Demographic characteristics of patients used in this study are shown in Table 1. All patients received lymph node dissection confirming the presence or absence of pathological ENE. Specifically, lymph nodes from 17 patients exhibited the presence of histopathological ENE, while lymph nodes from the 7 remaining patients did not (ENE absent). Pre-surgery contrast-enhanced CT images for these patients were retrospectively acquired from The University of Texas MD Anderson Cancer Center picture archiving system. All images were collected in Digital Imaging and Communications in Medicine (DICOM) format. Data were collected under a HIPAA-compliant protocol approved by The University of Texas MD Anderson Cancer Center Institutional Review Board (RCR03-0800 and PA19-0491), which gave ethical approval for this work. CT images were acquired on various scanner devices (GE Discovery CT750 HD = 16; GE Revolution HD = 3; GE LightSpeed VCT = 3; GE Revolution GSI = 1; Siemens SOMATOM Edge Plus = 1) using a diagnostic head and neck CT imaging protocol with intravenous contrast administration. CT acquisition parameters are shown in Table 2.
Table 1. Patient demographic characteristics for the 24 OPC patients used in this study. Values for the first three characteristics are displayed as median and range. All others are displayed as the total number of patients.
X-ray tube current (mA): 260 (159-409); kVp (kV): 120
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. This version posted February 26, 2023.

Image processing
Patient CT scans were exported as DICOM radiotherapy structure (RTS) files and converted to Neuroimaging Informatics Technology Initiative (NIfTI) format for ease of use using the DICOMRTTool v.3.2.0 Python package 11. To minimize observer exposure to irrelevant tissue, CT images were cropped to the cephalad border of the sternum and inferior border of the hard palate. To measure intraobserver variability, images from a random subset of 6 patients (4 with ENE present, 2 with ENE absent) were duplicated so that each appeared twice at random positions in the final case set, leading to a total of 30 cases: 21 with ENE present and 9 with ENE absent.
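The duplication scheme described above can be illustrated with a minimal sketch. The patient identifiers, the helper name `build_case_set`, and the random seed are hypothetical and are not taken from the study; only the counts (17 ENE-present and 7 ENE-absent patients, with 4 and 2 duplicated, respectively) come from the text.

```python
import random

def build_case_set(patient_ids, duplicate_ids, seed=0):
    """Return a shuffled case list in which each ID in duplicate_ids appears twice."""
    rng = random.Random(seed)
    cases = list(patient_ids) + list(duplicate_ids)
    rng.shuffle(cases)  # duplicates land at random positions
    return cases

# Hypothetical IDs: "P" = ENE present (17 patients), "N" = ENE absent (7 patients).
ene_present = [f"P{i:02d}" for i in range(1, 18)]
ene_absent = [f"N{i:02d}" for i in range(1, 8)]

# Randomly choose 4 ENE-present and 2 ENE-absent patients to duplicate.
rng = random.Random(42)
duplicates = rng.sample(ene_present, 4) + rng.sample(ene_absent, 2)

case_set = build_case_set(ene_present + ene_absent, duplicates, seed=42)
n_present = sum(case.startswith("P") for case in case_set)
n_absent = len(case_set) - n_present
print(len(case_set), n_present, n_absent)  # 30 21 9
```

Because the duplicated scans are indistinguishable from the others once anonymized, repeated reads of the same patient can later be paired to estimate intraobserver variability.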

Survey Instrument
Anonymized NIfTI-formatted images for the 30 cases were independently shown to the clinician observers using 3D Slicer 12 via remote control of the 3D Slicer screen over Zoom. For each patient's contrast-enhanced CT scan, physicians scrolled through the axial, sagittal, and/or coronal planes of the images and answered nine questions (Appendix, A1). Seven commonly applied radiological features 13 were evaluated by annotators: indistinct capsular contour, irregular lymph node margin, thick-walled enhancing nodal margin, perinodal fat stranding, perinodal fat plane or gross invasion, nodal necrosis, and nodal matting. Annotators marked "present" or "absent" for each of the features if any of the lymph nodes in the patient met the criteria. Additionally, annotators were asked if ENE was present or absent in any lymph nodes and to provide an estimate of their confidence in predicting ENE status (0-100% certain).

Discriminative Performance
Discriminative performance was quantified using various evaluation metrics, including accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. These metrics were selected due to their ubiquity in literature and relevance to ENE discrimination [14][15][16] . Observer predictions were directly used for calculating accuracy, sensitivity, and specificity, while observer confidence was used for calculating AUC. Accuracy, AUC, sensitivity, and specificity were measured from 0 to 1, with higher values being deemed better. The Brier score 17 was also investigated to determine the calibration (i.e., reliability of observer confidence) of individual predictions. Brier score values were measured from 0 to 1, with lower values being deemed better. Aggregated metric performance was reported as median values with corresponding interquartile range (IQR) values. Mann-Whitney U tests were used to compare performance metrics between clinical specialties due to non-normal distributions of data. p values less than or equal to 0.05 were considered significant. All discriminative metrics were calculated in Python v.3.8.8 using the scikit-learn v.1.0.2 package 18 ; Mann-Whitney U tests were calculated using the statannotations v.0.4.4 package.
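As an illustration of how these metrics relate to one observer's answers, the following standard-library sketch computes each of them on made-up data. It is not the study's scikit-learn code. In particular, the mapping from a binary call plus a confidence value to a probability that ENE is present (`ene_probability`) is an assumption made here for illustration; the source does not state how confidence was converted for the AUC and Brier score.

```python
def ene_probability(call, confidence):
    """Assumed mapping: confidence in an 'absent' call becomes 1 - confidence for ENE present."""
    return confidence if call == 1 else 1.0 - confidence

def classification_metrics(y_true, y_call):
    """Accuracy, sensitivity, and specificity from binary calls (1 = ENE present)."""
    pairs = list(zip(y_true, y_call))
    tp = sum(t == 1 and c == 1 for t, c in pairs)
    tn = sum(t == 0 and c == 0 for t, c in pairs)
    fp = sum(t == 0 and c == 1 for t, c in pairs)
    fn = sum(t == 1 and c == 0 for t, c in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)

# Hypothetical reads from one observer on six cases (not study data).
y_true = [1, 1, 1, 1, 0, 0]                  # pathologic ENE status
y_call = [1, 1, 0, 1, 0, 1]                  # observer's present/absent call
confidence = [0.9, 0.6, 0.7, 0.8, 0.8, 0.6]  # observer's certainty in that call

probs = [ene_probability(c, conf) for c, conf in zip(y_call, confidence)]
metrics = classification_metrics(y_true, y_call)
auc_value = auc(y_true, probs)
brier_value = brier_score(y_true, probs)
print(metrics, auc_value, brier_value)
```

Accuracy, sensitivity, and specificity depend only on the binary call, whereas the AUC and Brier score also reward or penalize the stated confidence, which is why the two groups of metrics can disagree for the same observer.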

Radiographic Criteria Analysis
Overall percentages for the presence of each radiographic criterion across all the cases that were correctly identified (ENE correctly identified as present or ENE correctly identified as absent) were stratified by expert specialty and displayed in tabular format. Logistic regression was performed using R version 4.2.2 to determine significant factors in the correct determination of ENE status.
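For intuition about what such a regression estimates, consider the simplest case of a single binary feature. There, the fitted logistic regression coefficient equals the log odds ratio of the corresponding 2x2 table, which can be computed directly. The sketch below uses hypothetical counts and a Wald confidence interval; it is a simplification and does not reproduce the study's R model, which may have included several features at once.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: feature marked present, ENE status predicted correctly
    b: feature marked present, ENE status predicted incorrectly
    c: feature marked absent,  ENE status predicted correctly
    d: feature marked absent,  ENE status predicted incorrectly
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of the log odds ratio
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical counts of observer reads (not study data).
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(a=60, b=20, c=30, d=30)
print(f"OR = {odds_ratio:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

An odds ratio above 1 with a confidence interval that excludes 1 would indicate that marking the feature as present is associated with higher odds of a correct ENE call.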

Performance Variability
Observer variability was evaluated using multiple methods. Agreement on the status of radiographic features among specialties was assessed by Fleiss' Kappa using the irr v.0.84.1 package in R 19. To measure the reliability of the assessment of ENE by physicians, the intraclass correlation coefficient (ICC) was calculated using the pingouin v.0.5.3 package in Python. Finally, the standard error of measurement (SEm) was calculated using the duplicated cases in order to evaluate the interobserver and intraobserver variability in ENE status assessment using the SEofM v.0.1.0 package in R 20.
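To make the agreement statistic concrete, the following standard-library sketch implements the usual Fleiss' kappa formula for a fixed number of raters per case. It is illustrative only: the study used the irr package in R, and the ratings below are hypothetical.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of cases, each a list with one category label per rater.

    Assumes every case is rated by the same number of raters and that
    more than one category is used overall (otherwise kappa is undefined).
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for case in ratings for label in case})
    # counts[i][j] = number of raters assigning case i to category j
    counts = [[case.count(cat) for cat in categories] for case in ratings]

    # Observed agreement: mean proportion of agreeing rater pairs per case.
    per_case = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_case) / n_cases

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical example: 4 cases, 3 raters, feature marked present (1) or absent (0).
example = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [1, 0, 0]]
kappa = fleiss_kappa(example)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so a feature that nearly every rater marks as present in nearly every case can show high raw agreement but a low kappa.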

RESULTS
Radiographic Criteria Analysis
The breakdown of annotators' utilization of radiographic criteria in relation to cases that were correctly identified is shown in Table 3. The criterion most observed in aggregate for correct identification of ENE presence was nodal necrosis (92.9%). Similarly, by specialty, nodal necrosis was the most used criterion for radiologists and radiation oncologists in the correct identification of ENE presence (98.6% and 88%, respectively), while irregular lymph node margin was the most used criterion by surgeons (95.3%). Notably, nodal necrosis was also the most observed criterion in aggregate (40.1%) and stratified by specialty (radiologists = 48.2%, radiation oncologists = 40.5%, and surgeons = 32.3%) in the correct identification of ENE absence. The criterion least observed in aggregate (7.1%) and by specialty (radiologists = 3.6%, radiation oncologists = 3.8%, and surgeons = 14.5%) for the correct identification of ENE absence was perinodal fat plane or gross invasion. Logistic regression was performed to evaluate which radiographic features were associated with the correct prediction of ENE status. As seen in Table 4, identifying the presence of an indistinct capsular contour, nodal necrosis, or nodal matting significantly increased the odds of correctly predicting ENE status. When applying separate regression analyses stratified by specialty, only indistinct capsular contour and perinodal fat plane or gross invasion were significant for radiologists, only nodal necrosis was significant for radiation oncologists, and only nodal necrosis and nodal matting were significant for surgeons.
To evaluate the intraobserver and interobserver variability in ENE and radiographic feature assessment, the standard error of measurement within each observer and among observers was calculated, respectively. Figure 3 shows that there was generally greater interobserver variability than intraobserver variability. Moreover, surgeons had the highest interobserver variability (perinodal fat plane or gross invasion SEm = 0.47) and highest intraobserver variability (perinodal fat stranding SEm = 0.37). Additionally, as a measure of the consistency in the given results, the ICC between all physicians was calculated using a single rating model, resulting in a value of 0.36 (95% confidence interval = [0.26, 0.51], p-value < 0.0001).
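As a rough illustration of how an intraobserver SEm can be derived from duplicated cases, the sketch below uses the standard within-subject standard deviation for paired repeat measurements, sqrt(sum of squared differences / (2n)). This is an assumption made for illustration: the exact computation performed by the SEofM package is not described in the text and may differ, and the ratings shown are hypothetical.

```python
import math

def within_observer_sem(first_reads, second_reads):
    """Within-subject standard deviation from paired repeat reads of the same cases."""
    n = len(first_reads)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(first_reads, second_reads))
    return math.sqrt(squared_diffs / (2 * n))

# Hypothetical reads by one observer of the six duplicated cases
# (1 = feature marked present, 0 = absent).
first = [1, 1, 0, 1, 0, 1]
second = [1, 0, 0, 1, 1, 1]

sem = within_observer_sem(first, second)
print(round(sem, 3))
```

With binary ratings, the value is 0 when an observer gives identical answers on both reads of every duplicated case and grows with the number of changed answers.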

Figure 3. Interobserver vs. intraobserver variability plots as measured with the standard error of measurement. Each colored dot corresponds to a radiographic criterion. Results are presented for all observers and stratified by clinician specialty. Values in the bottom left corner represent features with low interobserver variability and low intraobserver variability, and are therefore preferred.

DISCUSSION
In this study, we queried a large number of clinicians across three different specialties relevant to HPV+ OPC patient management to determine differences in interpretation of radiological ENE. Broadly, we determined that though differences do exist between specialists, they are often minimal. Moreover, due to the difficulty of determining ENE from radiological features, specialists were almost uniformly poor predictors of ENE status. To our knowledge, this is the largest individual study to investigate radiological interpretation of ENE in HNSCC using multiple clinician annotators.
We employed a systematic approach to investigating annotator performance by utilizing several evaluation metrics. A recent meta-analysis reported pooled sensitivity, specificity, and AUC values of 0.77, 0.60, and 0.72, respectively, for CT-based identification of ENE in OPC 14. While our aggregated values are notably lower for sensitivity (though still within the 95% confidence interval), our specificity and AUC are similar. Interestingly, when stratified by clinician specialty, radiation oncologists had significantly higher specificity than the other specialties. These results indicate that radiation oncologists are relatively superior at correctly determining the absence of ENE. Specialty-specific factors may have led to an improved ability of radiation oncologists to correctly determine the absence of ENE. Finally, while less commonly investigated, we utilized the Brier score to measure probabilistic prediction accuracy of specialists based on their confidence in their assessment of ENE. We showed that surgeons yielded the lowest (best) Brier score of all specialties, which was significantly lower than that of radiologists, indicating relatively good calibration of predictions, likely due to more conservative estimates of confidence.
In a large-scale meta-analysis for all HNSCC subtypes, it was found that central node necrosis showed high pooled sensitivity, while infiltration of adjacent planes showed a high pooled specificity 15 . These findings are echoed in our study as nodal necrosis was the most commonly observed feature in aggregate for correctly determining ENE presence, while perinodal fat plane or gross invasion was the least commonly observed feature for correctly determining ENE absence. It should be noted that nodal necrosis was observed in almost all cases correctly identified with ENE and in a large portion of cases correctly identified without ENE, as could be expected for HPV+ OPC 21 . For surgeons, rather than nodal necrosis, irregular lymph node margin was the most observed criterion for correct identification of ENE presence, which may be linked to their high sensitivity. Notably, on regression analysis, several radiographic criteria were significant contributors to the correct determination of ENE status. Moreover, there were some differences that emerged in significant criteria when stratifying the regression analysis by clinician specialty. However, irregular lymph node margin, thick-walled enhancing nodal margin, and perinodal fat stranding were among the criteria not deemed significant. This is not necessarily surprising given that these criteria have been less routinely reported in ENE studies [14][15][16] .
Recent literature in HPV+ OPC ENE identification has suggested that CT radiographic criteria have poor reproducibility among expert observers 16, though there could be some improvements in reproducibility when using a high certainty threshold for ENE identification, consolidating operational definitions, and sharing experience among observers 22. We sought to determine if these findings were consistent when stratified by clinician specialty. Notably, Fleiss' kappa was always less than 0.6, regardless of specialty or radiographic criteria, consistent with findings from Tran et al. 16. As expected, radiographic features that had higher agreement, both overall and within specialties, tended to have lower intraobserver and interobserver variability. Additionally, though there were features with relatively high agreement and low intra/interobserver variability, it is not clear if these features can be used to predict ENE, as their presence may not be significantly associated with the correct prediction of ENE, as seen with thick-walled enhancing nodal margin 15.
Our study is not without limitations. Firstly, we only investigated a single imaging modality for the identification of ENE status, namely CT. While recent evidence has suggested the incorporation of additional imaging modalities, such as magnetic resonance imaging (MRI) and positron emission tomography (PET), could improve the discrimination of ENE in OPC 14,23, CT is among the most ubiquitous imaging modalities available for OPC patients. Therefore, we have chosen to focus on CT as an exemplar imaging modality in this study. Secondly, because not all patients had complete pathological ground truth information for ENE extent, we did not include ENE extent as a factor in our analysis. However, it is well known that depending on the ENE extent (i.e., > 2 mm), discriminant capacity often increases 7. Finally, while most patients in this dataset only had one positive lymph node, some patients with multiple positive nodes could have added unaccounted-for ambiguity in clinician determination of ENE status. Additionally, while pathologic assessment of ENE was used as a gold standard for this study, the accuracy of this assessment method has been questioned in the literature [24][25][26].
Overall, our study reinforces the findings of previous investigations, which caution against relying solely on human interpretation of ENE from radiological imaging. Given the difficulty of ENE detection for human observers regardless of clinical specialty, even when utilizing defined radiographic criteria, it is pertinent that solutions are put forth that could improve or automate this task. In recent years, machine learning approaches have been proposed as accurate and reproducible tools for determining ENE status from radiological images of HNSCC patients [27][28][29] . We anticipate these methods to play an increasing role in the clinical utility of radiological determination of OPC ENE status in the future.

CONCLUSIONS
In summary, by querying 34 clinician annotators across 30 HPV+ OPC cases, we demonstrate that there are minimal differences in CT-based radiologic ENE interpretation between radiologists, radiation oncologists, and surgeons. On average, all annotators performed poorly in discriminating ENE status as determined through various evaluation metrics. Moreover, there was high variability between and within specialties. Future studies should incorporate the utilization of additional complementary imaging modalities (e.g., MRI and PET) and/or automated approaches (e.g., machine learning) that would improve discriminative performance and minimize variability of ENE identification.

Figure A1. Example of CT scan in 3D Slicer with (top) and without (bottom) ENE presence as seen by observers. Observers could scroll through the scan remotely via Zoom, change planes between axial, sagittal, or coronal, and change the window level and width.
medRxiv preprint. doi: https://doi.org/10.1101/2023.02.25.23286432