High-performing Multi-task Model of Urinary Tract Dilation (UTD) Classification for Neonatal Ultrasound Reports Through Natural Language Processing

Objective: The urinary tract dilation (UTD) classification system provides objective assessment relevant to hydronephrosis management for children. However, the lack of uniform language regarding UTD in radiology reports leads to significant difficulty in both clinical management and research. We seek to develop a unified multi-task/multi-class model that can effectively extract UTD components and classifications from early postnatal ultrasound (US) reports. Methods: Radiology records from our institution were reviewed to identify infants aged 0-90 days undergoing early ultrasound for antenatal UTD. The report and images were reviewed by the study team to create the ground truth of UTD classification and components (primary outcome). Bio_ClinicalBERT, a variant of the Bidirectional Encoder Representations from Transformers (BERT) model, was used as the embedding layers of the classification model. The model was fine-tuned with 11 linear classification layers. All but the last BERT layer were frozen during the fine-tuning process. The model performance was evaluated with five-fold cross-validation with an 80:20 train-test ratio. Results: 2460 early (0-90 days) US reports were included. The five-fold cross-validated model performance is satisfactory (Weighted F1 > 0.9 for all UTD components). We report the weighted F1 scores, accuracies, and standard deviations for all 11 tasks and their average performance. Conclusions: By applying deep state-of-the-art NLP neural networks, we developed a high-performing, efficient, and scalable solution to extract UTD components from unstructured ultrasound reports using one single multi-task model. This can potentially help standardize and facilitate large-scale computer vision research for pediatric hydronephrosis. Key Words: machine learning, efficiency, ambulatory care, forecasting


INTRODUCTION
Urinary tract dilation (UTD), defined as hydronephrosis or hydroureter, is a common (1 in 200-300 during prenatal ultrasound evaluation with 0.15-0.67%prevalence in the general population) 1 and potentially significant clinical finding in children that requires accurate assessment for appropriate clinical management.Radiologic evaluations by ultrasound play a critical role in evaluating UTD.In order to standardize the evaluation of UTD, there are several attempts to provide a more objective and universal grading system for UTD.The most commonly noted ones included the Society of Fetal Urology (SFU) 2 grading and UTD grading. 3SFU grading is simple and easy to use.However, the fact that it does not account for salient details like ureteral dilation, bladder abnormalities, and renal parenchymal evaluations limits its clinical generalizability.On the other hand, the UTD classification system has gained significant popularity with more comprehensive evaluations for both upper and lower tracts.
Accurate identification and classification of UTD components are crucial for effective management and treatment planning.However, the variability in reporting styles and inconsistent terminology used in radiology reports hinder efficient data extraction and analysis.Consequently, there is a need for methods that can reliably and consistently extract UTD-related information from unstructured ultrasound reports and provide standardized and actionable data.To manually correct and categorize the radiology report would require a significant commitment of resources and is unlikely to be feasible as a long-term solution.To address these challenges, we propose a novel approach that leverages advanced natural language processing (NLP) techniques and deep-learning-enabled pre-trained language models to develop a unified multitask/multi-class model for extracting UTD components and classifications from early postnatal ultrasound reports.Given the success of these techniques in clinical prediction tasks, [4][5][6][7][8][9] we hypothesize that we can develop an NLP model that can effectively and accurately identify UTD classification components and grading from unstructured ultrasound reports.

Data Source and Outcome Definition
The overall study design and workflow is shown in Figure 1.We reviewed ultrasound radiology reports from Boston Children's Hospital, a freestanding acute care children's hospital located in Boston, Massachusetts, to identify infants aged 0-90 days who underwent early ultrasound for antenatal urinary tract dilation (UTD) from 2010-2022.The report and images were carefully examined by the study team to create the ground truth for UTD classification and its components, which served as the primary outcome.
A total of 2,500 early postnatal ultrasound reports (0-90 days) out of over 14,817 reports were randomly selected for the study cohort.Each report in the study dataset includes information such as the PACS accession number, medical record number (MRN), date of imaging, date of birth, and radiology report texts.
We removed cases with invalid MRNs or accession numbers, missing notes, or uncertain annotations.The final study cohort consists of 2460 unique radiology reports.We then removed HIPPA-level sensitive information and unrelated details.This involved stripping dates, times, patient IDs, and zip codes, as well as numerical figures with over three digits.We also excluded content data tied to hospital names or places specific to our system.We defined our outcome (labels) by the UTD classification of 11 tasks for abnormalities including central calyceal dilation (left/right), peripheral calyceal dilation (left/right), abnormal ureter (left/right), abnormal parenchymal thickness (left/right), abnormal parenchymal appearance (left/right), and bladder abnormality.For each task, we defined 3 possible classes: yes, no, or absence (defined as absent kidney or .multi-cystic dysplastic kidney).All labels were created by our research team after reviewing the reports as well as the images.
This study has been approved by the Boston Children's Hospital Institutional Review Board.

Model Development
The multi-task model consisted of a Bidirectional Encoder Representations from Transformers (BERT) 10 encoder, followed by task-specific linear classification layers for each of the 11 UTD classification tasks.The decision to adopt BERT was motivated by its outstanding performance across various NLP fields and its relatively compact size compared to other large language models (LLMs).
BERT's contextualized word representations have consistently achieved state-of-the-art results in NLP applications.Bio_ClinicalBERT 11 , a variant of BERT trained on a biomedical corpus that includes radiology notes, was specifically selected, which enables the model to benefit from domain-specific knowledge and improved performance in radiology note comprehension.
These linear layers were connected to the BERT encoder, allowing the model to learn task-specific representations and make predictions.During the fine-tuning process, the BERT layers, except the last one, were frozen to preserve the pre-trained weights and focus on adapting the model to the specific UTD classification tasks.
In our experiments, we employed a stratified sampling technique specifically to address the challenge posed by the imbalanced distribution of our three outcome classes.Stratified sampling ensured that each fold in the 5x5 cross-validation had a representative consistent proportion of each class, thereby improving the model's ability to generalize across different subsets of the data.WeWe conducted analysis through 5x5 cross-validation, each iteration was initiated with a random seed number to enhance the experiment robustness and reliability.We employed a stratified sampling technique to ensure that each fold of the cross-validation had a consistent proportion of each outcome class, thereby improving the model's ability to generalize across different subsets of the data.In each iteration, we used an 80:20 ratio for traintest splits.The training phase spanned 25 epochs with a batch size of 12, and the Adam optimizer was configured with a learning rate of 5e-5.Within the training set, we performed a further division into training and validation subsets, adhering to an 80:20 ratio.To monitor the model's progress and mitigate the risk of overfitting, early stopping was implemented with a patience of 5 epochs.
Finally, we assessed the model's performance on the out-of-sample test sets using a comprehensive set of evaluation metrics including weighted accuracy, F1 score, precision, recall, the area under the ROC curve (AUC), and its 95% confidence interval (CI) calculated supported by bootstrap resampling on the test sets (n = 1,000).

Semi-supervised Learning and Error analysis
During the evaluation phase, it came to our attention that a few initial labels were questionable, prompting the adoption of a semi-supervised learning approach to iteratively rectify these labels and enhance the model's performance in our study.To correct the labels, the model was initially trained on the collected data.Afterwards, instances where the model produced incorrect predictions across five cross-validation iterations were identified for error analysis and relabeling.These relabeled instances, along with the remaining dataset, were then utilized to retrain the model.This iterative process was repeated two times to show the efficiency of this approach, progressively refining the label accuracy, and advancing the overall model performance.

Experiment Against Existing Large Language Model (LLM)
In our study, we benchmark our algorithm against GPT-3.5, a state-of-the-art LLM developed by OpenAI and commercially available as "ChatGPT."This model is distinguished for its extensive training data and exceptional capabilities in diverse language-based tasks.For a thorough comparison, we employed GPT-3.5-Turbo, the most advanced version available.We used both "0-shot" and "3-shot" approaches for task execution following previous literature. 12,13In the "0-shot" method, we did not provide the model with any task-specific examples, relying solely on its pre-existing training.Conversely, the "3-shot" method included three task-specific examples to facilitate the model's understanding and performance.These examples were randomly sampled, one from each possible outcome category, to provide a comprehensive task primer.
Prompts used in this experiment are provided in Appendix S1 and S2.

Software and Hardware
An alpha of 0.05 and 95% confidence intervals (CI) were used as criteria for statistical significance.
Analyses were performed using Python 3.9 (package Pandas, Numpy, PyTorch) and NVIDIA Titan V with 12GB of GPU memory to accelerate our machine learning computations.

Demographics and Cohort Characteristics
Table 1 shows the demographics and year of imaging distribution of our study cohort.The majority of the cohort were males, constituting 67.68% of the total, while females represented 32.32%.The median age of individuals at the time of their ultrasound was 33 years, with an interquartile range spanning from 22 to 62 years.Imaging frequency showed fluctuations over the years, with a notable peak in 2018 at 12.40% of the total scans.Table 2 shows the distribution of labels pertaining to the 11 UTD classification tasks before model-instructed correction.There is a noticeable class imbalance, with the "No" category often overwhelmingly higher than the "Yes" category.Notably, while many tasks present a balance between left and right abnormalities, asymmetry is evident in "central calyceal dilation" with left at 59.35% and right at 31.50%, and "peripheral calyceal dilation" showing left at 52.76% and right at 26.91%.

Model Performance
The model exhibited high F1 scores, ranging from 0.90 to 0.98, across the 11 tasks with tight confidence intervals, suggesting a robust and consistent performance (results shown in Table 3).Table 4 shows the baseline results generated by ChatGPT, supported by its best GPT3.5 model at the time (i.e., GPT 3.5-Turbo).In terms of F1, our model outperforms GPT 3.5-Turbo by 20% -30%.
Another interesting observation is that providing GPT 3.5 with examples from each of the classes did not help it perform better in the tasks.Another key aspect of our approach is the adoption of semi-supervised learning to rectify incorrect labels.This is crucial in scenarios where labeled data could be prone to errors.By iteratively correcting the labels based on the model's predictions, we enhance the accuracy and reliability of the training data.This In addition, the observation that our integrated multi-task human-model substantially outperforms ChatGPT underscores the limitations of relying solely on LLMs.These models, while powerful, are not panaceas; they do not invariably outperform more lightweight models, especially in complex tasks requiring nuanced understanding.The extensive data needed to fine-tune such LLMs, coupled with their intensive computational resource requirements, can be prohibitive.In our case, few-shot learning proved ineffective, possibly introducing noises rather than providing useful information, likely because of the complexity of the study question, involving multiple tasks with a variety of possible classes.This is reflected in our results where our tailored model significantly outperformed GPT 3.5-Turbo by 20% to 30% in F1 scores, indicating that merely providing the GPT model with additional class examples did not enhance its task performance.
In real-life expert-domain applications, developing targeted models are likely more efficient than deploying LLMs.
Nonetheless, the results of this study must be interpreted in the context of its limitations.This is a single-institution study based on retrospective review at a tertiary referral center which may limit the generalizability of our findings.Different institutions might have varied patient populations, imaging protocols, or reporting styles, which could influence the outcomes.External validation sets from a different institution or dataset could offer a more rigorous assessment of our model's broader applicability.
Additionally, the process of determining the ground truth for UTD classification, based on the review of reports and images by our research team, introduces another potential source of bias.This subjective approach might lead to errors in labeling, a concern that was substantiated during our evaluation phase when we identified and rectified erroneous annotations.While our model assisted in correcting these annotations, it's worth noting that if not properly managed, such a process could theoretically reinforce biases, even if we did not observe this trend in our study.Furthermore, the challenge of imbalanced data (i.e., rare labels such as increased parenchymal echogenicity or bladder abnormality) remains a potential concern.Although we employed stratified sampling to mitigate the effects of the imbalanced distribution of our outcome classes, the inherent imbalances in the dataset could still influence the model's performance.
Future studies should consider these limitations and potentially explore multi-institutional datasets and more rigorous validation strategies to enhance the robustness and generalizability of the findings.Lastly, the exclusion of cases with missing values may introduce another layer of bias into our study.By removing these instances, we may inadvertently skew the dataset towards more complete but potentially unrepresentative cases.This could result in a model that is less generalizable to real-world scenarios where missing data are common.For example, cases with missing values might systematically differ from those without, perhaps reflecting more complex or severe conditions that were not fully captured in the dataset.
Consequently, the model's ability to generalize to all types of patients, including those with incomplete records, could be compromised.

Figure 1 .
Figure 1.Overall study design and workflow.

Figure 2 and
Figure 2 and Figure 3 present comparative error analyses of human errors and model errors, where 'Human incorrect' indicates instances where human labeling was erroneous upon review; 'Model incorrect' denotes cases where the model's predictions were inaccurate; and 'Human and model incorrect'represents scenarios where both human labeling and model predictions.We observe that the error analysis from the semi-supervised learning approach reveals distinct error patterns between human and model predictions, with conditions such as 'left central calyceal dilation' more commonly mislabeled by humans, while 'bladder' errors are more prevalent in model predictions.This suggests that an integrated correction approach leveraging both human and model strengths could improve overall accuracy.Furthermore, the reduction in errors through iterative rounds of label refinement, as reflected by 592 incorrect prediction in round 1 error analysis vs. 529 incorrect predictions in round 2 error analysis, suggests that this method is effective in enhancing the quality of the training data.Appendix S1 and S2 further present the model performance results of rounds 1 and 2 learning and evaluation, from which we observe that training and label correction helped with improving model performance.

Figure 3 .
Figure 3. Error Analysis from Round 2 evaluation.'Model incorrect' denotes cases where the model's predictions were inaccurate; and 'Human and model incorrect' represents scenarios where both human labeling and model prediction.
iterative process progressively refines the model's understanding of UTD components, resulting in improved performance over multiple iterations.Our findings demonstrate the effectiveness of semisupervised learning in enhancing the model's capabilities and addressing the challenges inherent in UTD classification.The implications of our research extend beyond the scope of UTD classification and component extraction.By establishing a standardized and efficient method for extracting UTD information from ultrasound reports, our approach lays the foundation for large-scale machine-learning research on pediatric hydronephrosis.The availability of accurately labeled data enables the development and evaluation of future computer vision algorithms for automated detection, quantification, and analysis of UTD components.This opens up new avenues for research, potentially leading to improved diagnostic accuracy, treatment planning, and patient outcomes.

Table 1 .
Demographics and Ultrasound Trends over Years.

Table 2 .
Distribution of the Initial Human-Labeled UTD Classification Labels.

Table 3 .
Performance of the Bio_ClincalBERT-based prediction model, round 3.

Table 4 .
Baseline results using GPT 3.5-Turbo.Error Analysis from Round 1 evaluation.'Model incorrect' denotes cases where the model's predictions were inaccurate; and 'Human and model incorrect' represents scenarios where both human labeling and model prediction.