Large-scale crowdsourced radiotherapy segmentations across a variety of cancer anatomic sites: Interobserver expert/non-expert and multi-observer composite tumor and normal tissue delineation annotation from a prospective educational challenge

Clinician generated segmentation of tumors and healthy tissue regions of interest (ROIs) on medical images is crucial for radiotherapy workflows. However, interobserver segmentation variability has long been considered a significant detriment to the implementation of high-quality and consistent radiotherapy dose delivery. This has prompted the increasing development and utilization of computational approaches for automated ROI segmentation, in particular deep learning based methods. However, extant public segmentation datasets typically only provide segmentations generated by a limited number of annotators with varying, and often unspecified, levels of expertise. In this data descriptor, a large number of clinician annotators manually generated segmentations for numerous ROIs on computed tomography images across a variety of cancer sites (breast, sarcoma, head and neck, gynecologic, gastrointestinal) for the Contouring Collaborative for Consensus in Radiation Oncology challenge. Over 200 clinician annotators (verified experts and non-experts) contributed to the generation of these segmentation data using a standardized annotation platform (ProKnow). Subsequently, we converted image and segmentation data into Neuroimaging Informatics Technology Initiative format with standardized nomenclature for ease of use by the research community. In addition, we generated consensus segmentations for experts and non-experts using the simultaneous truth and performance level estimation method. These standardized, structured, and easily accessible data are a valuable resource for systematically studying interobserver variability in ROI segmentation applications with an unprecedented large number of clinician annotators.


Background & Summary
Since the advent of contemporary radiation delivery techniques for cancer treatment, clinician generated segmentation (also termed contouring or delineation) of target structures (e.g., primary tumors and metastatic lymph nodes) and organs at risk (e.g., healthy tissues whose irradiation could lead to damage and/or side effects) on medical images has become a necessity in the radiotherapy workflow 1 .These segmentations are typically provided by trained medical professionals, such as radiation oncologists.While segmentations can be performed on any imaging modality that provides sufficient discriminative capabilities to visualize regions of interest (ROIs), the current radiotherapy workflow prioritizes the use of computed tomography (CT) for ROI segmentation due to its ubiquitous nature and use in radiotherapy dose calculations.Subsequently, clinicians spend a large fraction of their time and effort generating ROI segmentations on CT imaging necessary for the radiotherapy workflow.
Interobserver and intraobserver variability are well-documented byproducts of the use of manual human-generated segmentations 2,3 .While consensus radiotherapy guidelines to ensure ROI segmentation quality have been developed and shown to reduce variability 4 , these guidelines are not necessarily followed by all practicing clinicians.Therefore, segmentation variability remains a significant concern in maintaining radiotherapy plan quality and consistency.Recent computational improvements in machine learning, particularly deep learning, have prompted the increasing development and deployment of accurate ROI auto-segmentation algorithms to reduce radiotherapy segmentation variability [5][6][7] .However, for auto-segmentation algorithms to be clinically useful, their input data (training data) should reflect high-quality "gold-standard" annotations.While research has been performed on the impact of interobserver variability and segmentation quality for auto-segmentation training [8][9][10][11] , it remains unclear how "gold-standard" segmentations should be defined and generated.One common approach, consensus segmentation generation, seeks to crowdsource multiple segmentations from different annotators to generate a high-quality ground-truth segmentation.While multi-observer public medical imaging segmentation datasets exist [12][13][14][15][16][17] , there remains a lack of datasets with a large number of annotators for radiotherapy applications.
The Contouring Collaborative for Consensus in Radiation Oncology (C3RO) challenge was developed to engage radiation oncologists across various expertise levels in cloud-based ROI crowdsourced segmentation 18 .Through this collaboration, a large number of clinicians generated ROI segmentations using CT images from 5 unique radiotherapy cases: breast, sarcoma, head and neck, gynecologic, and gastrointestinal.In this data descriptor, we present the curation and processing of the data from the C3RO challenge.The primary contribution of this dataset is unprecedented large-scale multi-annotator individual and consensus segmentations of various ROIs crucial for radiotherapy planning in an easily accessible and standardized imaging format.These data can be leveraged for exploratory analysis of segmentation quality across a large number of annotators, consensus segmentation experiments, and auto-segmentation model benchmarking.An overview of this data descriptor is shown in Fig. 1.

Methods
Patient population.Five separate patients who had undergone radiotherapy were retrospectively collected from our collaborators at various institutions.Each patient had received a pathologically confirmed diagnosis of cancer of one of the following sites: breast (post-mastectomy intraductal carcinoma), sarcoma (malignant peripheral nerve sheath tumor of the left thigh), head and neck (oropharynx with nodal spread, [H&N]), gynecologic (cervical cancer, [GYN]), and gastrointestinal (anal cancer, [GI]).Clinical characteristics of these patients are shown in Table 1.Of note, these five disease sites were included as part of the C3RO challenge due to being among the most common disease sites treated by radiation oncologists; additional disease sites were planned but were not realized due to diminishing community participation in C3RO.Specific patient cases were selected by C3RO collaborators on the basis of being adequate reflections of routine patients a generalist radiation oncologist may see in a typical workflow (i.e., not overly complex).Further details on the study design for C3RO can be found in Lin & Wahid et al.  2. All images were acquired on scanners that were routinely used for radiotherapy planning at their corresponding institutions with appropriate calibration and quality assurance by technical personnel.The sarcoma, H&N, and GI cases received intravenous contrast, the GU case received oral contrast, and the breast case did not receive any contrast.Of note, the H&N case had metal streak artifacts secondary to metallic implants in the upper teeth, which obscured anatomy near the mandible.No other cases contained noticeable image artifacts.Notably, the sarcoma case also received a magnetic resonance imaging (MRI) scan, while the H&N and GI cases received full body positron emission tomography (PET) scans.The sarcoma MRI scan was acquired on a GE Signa HDxt device and corresponded to a post-contrast spin echo T1-weighted image with a slice thickness of 3.0 mm and in-plane resolution of 0.35 mm.The H&N PET scan was acquired on a GE Discovery 600 device with a slice thickness of 3.3 mm and in-plane resolution of 2.73 mm.The GI PET scan was acquired on a GE Discovery STE device with a slice thickness of 3.3 mm and in-plane resolution of 5.47 mm.DICOM anonymization.For each image, the DICOM header tags containing the patient name, date of birth, and patient identifier number were consistently removed from all DICOM files using DicomBrowser v. 1.5.2 20 .The removal of acquisition data and time metadata (if available in DICOM header tags) caused compatibility issues with ProKnow so were kept as is.Moreover, if institution name or provider name were available in the DICOM file, they were not removed as they were not considered protected health information.Select cases (breast, GYN, GI) were previously anonymized using the DICOM Import Export tool (Varian Medical Systems, CA, USA).
Participant details.To register for the challenge, participants completed a baseline questionnaire that included their name, email address, affiliated institution, country, specialization, years in practice, number of disease sites treated, volume of patients treated per month for the designated tumor site, how they learned about this challenge, and reasons for participation.Registrant intake information was collected through the Research Electronic Data Capture (REDCap) system -a widely used web application for managing survey databases 21 ; an example of the intake form can be found at: https://redcap.mskcc.org/surveys/?s=98ARPWCMAT.The research conducted herein was approved by the HRRP at MSK (IRB#: X19-040 A(1); approval date: May 26, 2021).All subjects prospectively consented to participation in the present study, as well as to the collection, use, and disclosure of de-identified aggregate subject information and responses.Participants were categorized as recognized experts or non-experts.Recognized experts were identified by our C3RO team (EFG, CDF, DL) based on participation in the development of national guidelines or other extensive scholarly activities.Recognized experts were board-certified physicians with expertise in the specific disease site.Non-experts were any participants not categorized as an expert for that disease site.All non-experts had some knowledge of human anatomy, with the majority being composed of practicing radiation oncologists but also included resident physicians, radiation therapists, and medical physicists.Worthy of note, a participant could only be considered an expert for one disease site, but could have participated as a non-expert for other disease sites.Out of 1,026 registrants, 221 participated in generating segmentations, which were used for this dataset; due to the low participation rate, participants may represent a biased sample of registrants.Of note, participants could provide segmentations for multiple cases.Additional demographic characteristics of the participants can be found in Lin & Wahid et al. 19 .
ProKnow segmentation platform.Participants were given access to the C3RO workspace on ProKnow (Elekta AB, Stockholm, Sweden).ProKnow is a commercially available radiotherapy clinical workflow tool that allows for centralization of data in a secure web-based repository; the ProKnow system has been adopted by several large scale medical institutions and is used routinely in clinical and research environments.Anonymized CT DICOM images for each case were imported into the ProKnow system for participants to segment; anonymized MRI and PET images were also imported for select cases as available.Each case was attributed a short text prompt describing the patient presentation along with any additional information as needed.
Participants were allowed to utilize common image manipulation (scrolling capabilities, zooming capabilities, window leveling, etc.) and segmentation (fill, erase, etc.) tools for generating their segmentations.No auto-segmentation capabilities were provided to the participants, i.e., all segmentations were manually generated.Notably, for the sarcoma case, an external mask of the patient's body and a mask of the left femur was provided to participants.Screenshots of the ProKnow web interface platform for the various cases are shown in Fig. 2. head and neck, gynecologic, and gastrointestinal cases are outlined in pink, red, blue, purple, and green borders, respectively.
Segmentation details.For each case, participants were requested to segment a select number of ROIs corresponding to target structures or OARs.Notably, not all participants generated segmentations for all ROIs.ROIs for each participant were combined into one structure set in the ProKnow system.ROIs were initially named in a consistent, but non-standardized format, so during file conversion ROIs were renamed based on The Report of American Association of Physicists in Medicine Task Group 263 (TG-263) suggested nomenclature 22 ; TG-263 was chosen due its ubiquity in standardized radiotherapy nomenclature.A list of each ROI and the number of available segmentations stratified by participant expertise level is shown in Table 3. * Not all participants generated segmentations for all ROIs.
Image processing and file conversion.For each case, anonymized CT images and structure sets for each annotator were manually exported from ProKnow in DICOM and DICOM radiotherapy structure (RTS) format, respectively.The Neuroimaging Informatics Technology Initiative (NIfTI) format is increasingly used for reproducible imaging research [23][24][25][26][27] due to its compact file size and ease of implementation in computational models 28 .Therefore, in order to increase the interoperability of these data, we converted all our DICOM imaging and segmentation data to NIfTI format.For all file conversion processes, Python v. 3.8.8 29was used.An overview of the image processing workflow is shown in Fig. 3A.In brief, using an inhouse Python script, DICOM images and structure sets were loaded into numpy array format using the DICOMRTTool v. 0.4.2 library 30 , and then converted to NIfTI format using SimpleITK v. 2.1.1 31 .For each annotator, each individual structure contained in the structure set was separately converted into a binary mask (0 = background, 1 = ROI), and was then converted into separate NIfTI files.Notably, voxels fully inside and outside the contour are included and not include in the binary mask, respectively, while voxels that overlapped the segmentation (edge voxels) were counted as surface coordinates and included in the binary mask; additional details on array conversion can be found in the DICOMRTTool documentation 30 .Examples of random subsets of five expert segmentations for each ROI from each case are shown in Fig. 4.  Consensus segmentation generation.In addition to ground-truth expert and non-expert segmentations for all ROIs, we also generated consensus segmentations using the Simultaneous Truth and Performance Level Estimation (STAPLE) method, a commonly used probabilistic approach for combining multiple segmentations [32][33][34][35] .Briefly, the STAPLE method uses an iterative expectation-maximization algorithm to compute a probabilistic estimate of the "true" segmentation by deducing an optimal combination of the input segmentations and incorporating a prior model for the spatial distribution of segmentations as well as implementing spatial homogeneity constraints 36 .For our specific implementation of the STAPLE method, we utilized the SimpleITK STAPLE function with a default threshold value of 0.95.For each ROI, all available binary segmentation masks acted as inputs to the STAPLE function for each expertise level, subsequently generating binary STAPLE segmentation masks for each expertise level (i.e., STAPLEexpert and STAPLEnon-expert).An overview of the consensus segmentation workflow is shown in Fig. 3B.Examples of STAPLEexpert and STAPLEnon-expert segmentations for each ROI are shown in Fig. 5.
Figure 5. Examples of consensus segmentations using the simultaneous truth and performance level estimation (STAPLE) method for each region of interest (ROI) provided in this data descriptor.STAPLE segmentation generated by using all available expert segmentations (STAPLEexpert) and STAPLE segmentation generated by using all available non-expert segmentations (STAPLEnon-expert) are displayed as green and red dotted outlines, respectively, and overlaid on zoomed in images for each case.Subplots for breast, sarcoma, head and neck, gynecologic, and gastrointestinal cases are outlined in pink, red, blue, purple, and green borders, respectively.

Data Records
Medical images and multi-annotator segmentation data.This data collection primarily consists of 1985 3D volumetric compressed NIfTI files (.nii.gz file extension) corresponding to CT images and segmentations of ROIs from various disease sites (breast, sarcoma, H&N, GYN, GI).Analogously formatted MRI and PET images are available for select cases (sarcoma, H&N, GI).ROI segmentation NIfTI files are provided in binary mask format (0 = background,1 = ROI); file names for each ROI are provided in TG-263 notation.All medical images and ROI segmentations were derived from original DICOM and DICOM RTS files (.dcm file extension) respectively, which for completeness are also provided in this data collection.In addition, Python code to recreate the final NIfTI files from DICOM files is also provided in the corresponding GitHub repository (see Code Availability section).
Consensus segmentation data.Consensus segmentations for experts and non-experts generated using the STAPLE method for each ROI have also been provided in compressed NIfTI file format (.nii.gz file extension).Consensus segmentation NIfTI files are provided in binary mask format (0 = background, 1 = ROI consensus).Python code to recreate the STAPLE NIfTI files from input annotator NIfTI files is also provided in the corresponding GitHub repository (see Code Availability section).
Annotator demographics data.We also provide a single Microsoft Excel file (.xlsx file extension) containing each annotator's gender, race/ethnicity, geographic setting, profession, years of experience, practice type, and categorized expertise level (expert, non-expert).Geographic setting was re-coded as "United States" or "International" to further de-identify the data.Each separate sheet corresponds to a separate disease site (sheet 1 = breast, sheet 2 = sarcoma, sheet 3 = H&N, sheet 4 = GU, sheet 5 = GI).Moreover, in order to foster secondary analysis of registrant data, we also include a sheet containing the combined intake data for all registrants of C3RO, including those who did not provide annotations (sheet 6).
Folder structure and identifiers.Each disease site is represented by a top-level folder, containing a subfolder for images and segmentations.The annotator demographic excel file is located in the same top-level location as the disease site folders.Image folders contain separate subfolders for NIfTI format and DICOM format images.Segmentation folders contain separate subfolders for expert and non-expert segmentations.Each expertise folder contains separate subfolders for each annotator (which contains separate subfolders for DICOM and NIfTI formatted files) and the consensus segmentation (only available in NIfTI format).The data have been specifically structured such that for any object (i.e., an image or segmentation), DICOM and NIfTI subdirectories are available for facile partitioning of data file types.An overview of the organized data records for an example case is shown in Fig. 6.Segmentation files (DICOM and NIfTI) are organized by anonymized participant ID numbers and can be cross referenced against the excel data table using this identifier.The raw data, records, and supplemental descriptions of the meta-data files are cited under Figshare doi: 10.6084/m9.figshare.21074182 37.

Technical Validation
Data annotations: Segmentation DICOM and NIfTI files were manually verified by study authors (D.L., K.A.W., O.S.) to be annotated with the appropriate corresponding ROI names.
Segmentation interobserver variability.We calculated the pairwise interobserver variability (IOV) for each ROI for each disease site across experts and non-experts.Specifically, for each metric all pairwise combinations between all available segmentations in a given group (expert or nonexpert) were calculated; median and interquartile range values are reported in Table 4. Calculated metrics included the Dice Similarity coefficient (DSC), average surface distance (ASD), and surface DSC (SDSC).SDSC was calculated based on ROI specific thresholds determined by the median pairwise mean surface distance of all expert segmentations for that ROI as suggested in literature 38 .Metrics were calculated using the Surface Distances Python package 38,39 and in-house Python code.For specific equations for metric calculations please see corresponding Surface Distances Python package documentation 39 .Resultant values are broadly consistent with previous work in breast 40 , sarcoma 41 , H&N 35,42,43 , GYN 44 , and GI [44][45][46] IOV studies.

Usage Notes
The image and segmentation data from this data collection are provided in original DICOM format (where applicable) and compressed NIfTI format with the accompanying excel file containing demographic information indexed by annotator identifiers.We invite all interested researchers to download this dataset for use in segmentation, radiotherapy, and crowdsourcing related research.Moreover, we encourage this dataset's use for clinical decision support tool development.While the individual number of patient cases for this dataset is too small for traditional machine learning development (i.e., deep learning auto-segmentation training), this dataset could act as a benchmark reference for testing existing auto-segmentation algorithms.Importantly, this dataset could also be used as a standardized reference for future interobserver variability studies seeking to investigate further participant expertise criteria, e.g., true novice annotators (no previous segmentation or anatomy knowledge) could attempt to segment ROI structures on CT images, which could then be compared to our expert and non-expert annotators.Finally, in line with the goals of the eContour collaborative 47 , these data could be used to help develop educational tools for radiation oncology clinical training.
The segmentations provided in this data descriptor have been utilized in a study by Lin & Wahid et al. 19 .This study demonstrated several results that were consistent with existing literature, including: 1).target ROIs tended to exhibit greater variability than OAR ROIs 35 , 2).H&N ROIs exhibited higher interobserver variability compared to other disease sites 43,48 , and 3).nonexpert consensus segmentations could approximate gold-standard expert segmentations 49 .
Original DICOM format images and structure sets may be viewed and analyzed in radiation treatment planning software or select digital image viewing applications, depending on the enduser's requirements.Current open-source software for these purposes includes ImageJ

Figure 1 .
Figure 1.Data descriptor overview.Multi-annotator segmentations were generated for the Contouring Collaborative for Consensus in Radiation Oncology (C3RO) challenge.Imaging and segmentation data were extracted from the ProKnow cloud-based platform in Digital Imaging and Communications in Medicine (DICOM) format, which were then converted to Neuroimaging Informatics Technology Initiative (NIfTI) format for ease of use.Consensus segmentation data were then generated from the available multi-annotator segmentations.The provided data collection contains all original DICOM files along with converted NIfTI files.

Figure 2 .
Figure 2. Examples of ProKnow segmentation platform used by participants for the breast (A), sarcoma (B), head and neck (C), gynecologic (D), and gastrointestinal (E) cases.Participantswere given access to the standard image visualization and segmentation capabilities for generating their segmentations of target structures and organs at risk.Participants were also given access to a short prompt describing the patient presentation along with any additional information as needed.For the sarcoma case, an external mask of the patient's body (green) and a mask of the left femur (pink) was provided to participants.Subplots for breast, sarcoma,

Figure 3 .
Figure 3. Image processing Python workflows implemented for this data descriptor for an example case (head and neck case, parotid glands).(A) Original Digital Imaging and Communications in Medicine (DICOM) formatted images and DICOM radiotherapy structure (RTS) formatted region of interest (ROI) segmentations are transformed to Neuroimaging Informatics Technology Initiative (NIfTI) format.(B).ROIs from multiple annotators are combined into a single consensus segmentation using the Simultaneous Truth and Performance Level Estimation (STAPLE) method.

Figure 4 .
Figure 4.Examples of a random subset of five expert segmentations for each region of interest (ROI) provided in this data descriptor.Segmentations are displayed as green, yellow, blue, red, and orange dotted lines corresponding to annotators 1, 2, 3, 4, and 5, respectively, and overlaid on zoomed-in images for each case.Subplots for breast, sarcoma, head and neck, gynecologic, and gastrointestinal cases are outlined in pink, red, blue, purple, and green borders, respectively.Notably, the gastrointestinal case only had four expert annotators, so only four lines are displayed.

Figure 6 .
Figure 6.Overview of folder and file structure for dataset provided in this data descriptor.Each disease site folder contains separate subfolders for the computed tomography (CT) image and segmentations.Select cases had additional imaging modalities where available.Image subfolders contain separate subfolders for different data formats (Digital Imaging and Communications in Medicine [DICOM] and Neuroimaging Informatics Technology Initiative [NIfTI]).Segmentation subfolders contain separate subfolders which stratify expert and non-

Table 2 .
Computed Office on 05/26/2021 and were determined to be exempt research as per 45 CFR 46.104(d)(3),(i)(a), (ii) and (iii), (i)(b),(ii) and (iii), (i)(c), (ii) and (iii) and 45.CFR.46.111(a)(7)).A limited IRB review of the protocol X19-040 A(1) was conducted via expedited process in accordance with 45 CFR 46.110(b), and the protocol was approved on May 26, 2021.DICOM files were obtained and stored on MIMcloud (MIM Software Inc., Ohio, USA), which is a HIPAA-compliant cloud-based storage for DICOM image files that has been approved for use at MSK by MSK's Information Security team.

Table 3 .
Summary of all region of interest (ROI) segmentations generated by participants for this data descriptor.ROIs included radiotherapy target volumes and organs at risk (OARs).

Table 4 .
Pairwise interobserver variability values for experts and non-experts.Pairwise Dice similarity coefficient (DSC), average surface distance (ASD), and surface DSC (SDSC) are shown for experts and non-experts separately.Median values reported with interquartile range in parenthesis.
54, dicompyler51, ITK-Snap52, and 3D Slicer 53 with the SlicerRT extension54.Processed NIfTI format images and segmentations may be viewed and analyzed in any NIfTI viewing application, depending on the end-user's requirements. Curent open-source software for these purposes includes ImageJ 50 , ITK-Snap 52 , and 3D Slicer 53 .