Abstract
Objective Develop a deep learning-based methodology using the foundations of systems pathology to generate highly accurate predictive tools for complex gastrointestinal diseases, using celiac disease (CD) as a prototype.
Design To predict the severity of CD, defined by Marsh–Oberhüber classification, we used deep learning to develop a model based on histopathologic features.
Results The study was based on a pediatric cohort of 124 patients identified with different classes of CD severity. The model predicted CD with an overall 88.7% accuracy with the highest for Marsh IIIc (91.0%; 95% sensitivity; 91% specificity). The model identified EECs as a defining feature of children with Marsh IIIc CD and endocrinopathies which was confirmed using immunohistochemistry.
Conclusion This deep learning image analysis platform has broad applications in disease treatment, management, and prognostication and paves the way for precision medicine.
What is already known about this subject?
– Deep Learning has the potential to generate predictive models for complex gastrointestinal diseases.
What are the new findings?
– Our deep learning-based model used the foundations of systems pathology to generate a highly accurate predictive tool for complex gastrointestinal diseases, using a celiac disease (CD) pediatric cohort as a prototype.
– The model predicated CD severity with high accuracy and identified enteroendocrine cells as a defining feature of children with severe CD and endocrinopathies.
How might it impact on clinical practice in the foreseeable future?
– Assessment of histopathological markers at the time of diagnosis that can predict risk of severity or complications can have broad applications in disease treatment, management, and prognostication and pave the way for precision medicine.
Introduction
Several Celiac Disease (CD) histopathological features and serologic tests the diagnose disease. Biopsy-based assessment of proximal small intestine is the current is the gold standard for confirmation of CD [1, 2, 3], with severity assessment using Marsh–Oberhüber classification (modified Marsh score) being common in clinical practice [4, 5]. Serologic tests are frequently used to identify individuals who require a biopsy; these include anti-gliadin IgA and IgG (AGA IgA and AGA IgG), anti-reticulin IgA (ARA), anti-endomysium IgA(EMA), and anti-tissue transglutaminase IgA (TTG) antibodies [1]. Despite there being guidelines for CD diagnosis, its severity assessment is challenging. It falls privy to interobserver variability, non-specificity of discerning histopathological features, and a lack of specific histopathological markers associated with either CD severity classes or comorbid conditions.
The Marsh–Oberhüber classification is a commonly used method for CD severity assessment in clinical practice based on histopathological features involving villus and crypt injury and intraepithelial inflammation [4, 5]. Unfortunately, high interobserver variability is found when this system is applied to clinical samples, and agreement is generally only fair (κ = 0.29-0.35) [6, 7].
Alternate purely quantitative methods of assessing mucosal damage by direct measurement of the villous height to crypt depth ratio (Vh: Cd) and intraepithelial lymphocyte (IEL) counts have been proposed [8]. Using these methods, the inter-observer agreement is moderate (κ = 0.55) [7]; however, this method is also highly dependent on the use of correctly oriented specimens and is not used in clinical practice.
In patients with CD, there is also evidence of comorbid endocrinopathies such as T1DM and hypothyroidism, which substantially more common than in the general population. Further, in comparison to those with isolated T1DM, patients with undiagnosed CD and T1DM also have a higher prevalence of retinopathy (58% vs. 25%) and nephropathy (42% vs. 4%) [9, 10, 11].
We chose a deep learning-based methodology for image analytics to develop a novel predictive model for CD severity and the association of disease features with comorbid conditions. These models utilizing Convolutional Neural Networks (CNNs) have shown potential benefit for disease characterization using histological specimens in a broad range of diseases [12, 13, 14, 15], including CD [16]. They are optimized to discern fine features from regions of interest within the biopsy images. One such example is of the deep residual network (ResNet) that has outperformed early deep learning models and achieved superior performance on image recognition benchmarks [17, 18, 19]. We also deployed a deep learning method [20] to provide insight into the model decision-making and discerning specific histologic features for each class. We confirmed the features identified by the model, specific for severe CD (Marsh IIIc) and concurrent endocrine commodities, using immunohistochemistry, which confirmed the utility of deep learning CNN models for discerning disease-specific features.
Methods
Participants and Data Collection
Duodenal biopsy glass slides from children (under the age of 18 years) at the time of CD diagnosis (baseline) from 1992 to 2017 (25 years) were retrieved at the University of Virginia (UVA), Charlottesville, Virginia. These were already Hematoxylin and eosin (H&E) stained as per standard pathological workflow. To digitize the glass biopsy slides, scanning was done at 40x magnification at the Biorepository and Tissue Research Facility, UVA, using a Leica SCN400 brightfield scanning instrument (Leica Microsystems CMS GmbH, Germany).
Biopsies selected and retrieved as CD were based on pathology reports. Histologically normal controls (referred to as controls for the purpose of this study) were based on the inclusion criteria of no evidence of disease (eosinophilic esophagitis, gastritis, inflammatory bowel disease, etc.) being present in any other part of the gastrointestinal tract examined via the esophagogastroduodenoscopy.
Evaluation of Celiac Disease Severity
CD severity for each patient was assessed using the Marsh–Oberhüber classification, which involves evaluating duodenal villous architecture and intraepithelial lymphocytosis (Figure 1). In the case where more than one biopsy fragment was present per image, each fragment was separately assessed via the Marsh–Oberhüber classification. Details of the Marsh–Oberhüber classification (referred to as Marsh score as part of this study) are described in supplemental appendix S1.
Deep Learning Image Analysis Model
Biopsy Image Model Input Dataset Construction
A. Biopsy Image Patch Creation
Approximately 50,000 patches were created using biopsy whole slide images (WSIs) to maximize of computational resources (details in supplemental appendix S2). Each patch was further rotated on its horizontal and vertical axes to augment the data sample.
B. Model Training and Test Set Development
Biopsy image patches were then randomly divided (70:30 ratio) into training and testing sets. Patient samples were kept separate among training and test sets without any overlap.
C. Biopsy Patch Clustering
A large proportion of the digital WSIs consist of white background, which does not provide clinically useful information and may bias the model. To overcome this, a two-step clustering process was applied to identify and remove patch clusters with more than 50% white background (figure and details in supplemental appendix S3).
D. Biopsy Stain Color Normalization
Slight variations in H&E staining chemical composition over the years result in visible H&E color differences. Therefore, stain color normalization as described by Vahadane et al. [21], was deployed to eliminate erroneous biopsy classification based on color differences rather than distinct biopsy image features. Details of this method are explained in supplemental appendix S4.
Deep Learning Model Development
First, a patch-level model was used to focus on microscopic histologic features in CD versus control biopsies. This was followed by a patch-level model for CD severity classes assessed via Marsh score only. Further, a WSI-level model was developed to obtain overall low magnification-level information, emulating the pathologist workflow of scanning zoomed out biopsy images to obtain an overall assessment. The detailed computational methods of our patch- and WSI-level models have been previously published [22].
A. Patch-level Model
ResNet50 trained on the ImageNet dataset was used. We customized Resnet50 by removing fully connected layers, retaining only the ResNet backbone as a feature extractor. Finally, a fully connected layer with 1024 neurons receiving a flattened output of feature extractor was added. The output layer represented the biopsy classification for each of the four Marsh score classes present in our data sample: I, IIIa, IIIb, and IIIc. Details of the model are described in the supplemental appendix S5.
B. Whole Slide Image-level Model
A heuristic approach was used for the WSI-level classification model. This approach aggregated patch-level classifications to provide WSI-level inferences. Computational details are described in the supplemental appendix S6.
Benchmark Visualization of Discerning Features for Model Classification
Features utilized by the image analysis model for decision making were visualized using Gradient-weighted Class Activation Mappings (Grad-CAMs). Grad-CAM computational details in supplemental appendix S7.
Immunohistochemistry (IHC) Analysis
Biopsy slides were constructed from paraffin blocks. Immunohistochemistry was performed on a robotic platform (Ventana discover Ultra Staining Module, Ventana Co., Tucson, AZ, USA). A heat-induced antigen retrieval protocol set for 64 min was carried out using a TRIS– ethylenediamine tetraacetic acid (EDTA)–boric acid pH 8.4 buffer (Cell Conditioner 1). Endogenous peroxidases were blocked with peroxidase inhibitor (CM1) for 8 min before incubating the tissues with Chromogranin A antibody (Dako, Cat # A0430) at 1:400 dilution for 60 min at room temperature. The antigen-antibody complex was then detected using DISCOVERY anti-rabbit HQ HRP detection system and DISCOVERY ChromoMap DAB Kit (Ventana Co.). All the slides were counterstained with hematoxylin; they were dehydrated, cleared, and mounted for the assessment.
IHC Qualitative and Quantitative Assessment
Pathologists confirmed qualitative assessment based on the color of the stained cells. Quantitative assessment was done using an unsupervised clustering algorithm called Gaussian Mixture Model (GMM), which groups biopsy image features using their pixel colors [23]. Our data was divided into five pixel color-based clusters, and details of the GMM method are mentioned in supplemental appendix S8.
Descriptive Statistics
Biodemographic data were analyzed using IBM SPSS Statistics v. 22.0. Categorical variables were expressed as frequencies (%) and continuous as median with first and third quartile (Q1, Q3). Differences in the means of continuous variables were assessed via the Mann-Whitney U test and categorical variables using the Pearson Chi-Square test with p-value>0.05 as statistically significant. Anthropometric z-scores were calculated using reference guidelines and R application macros available through the World Health Organization.
Ethical Considerations
This study was approved by the UVA Institutional Review Board (IRB #20448).
Results
Study Population for CD Severity Prediction
A total of 63 pediatric (under 18 years of age) CD patients (44.4% male) and 61 controls (54.1% male) with 239 and 174 whole-slide images (WSIs), respectively, were selected (3-4 biopsies images per patient). Population biodemographic characteristics are shown in Table 1. The biopsy image dataset encompassed disease severity ranging from Marsh I to IIIc (Figure 1 shows an illustrative explanation of the Marsh–Oberhüber classification; also referred to as the modified Marsh score) with no biopsy images present with a modified Marsh score of II and it is known to be rare. The detailed distribution of the patch- and WSI-level biopsy image datasets are mentioned in supplemental appendix S9.
Deep Learning Model Performance for CD Severity Prediction
The model for CD versus histologically normal controls demonstrated an accuracy of 94%, and the details are mentioned in supplemental appendix S10. For classifying CD Marsh-scored biopsy patches, the model had greater than 84% accuracy with the highest accuracy for Marsh IIIc (90.6%), as shown in Figure 2 with performance metrics in Table 2. Receiver operating curves (ROC) and areas under the curve (AUC) for each Marsh score class are shown in Figure 3, where AUC for all classes was greater than 0.96. For WSIs, the model correctly classified each biopsy image with 100% accuracy via inference, as mentioned in the methods section.
Image Analysis Model Discerning Feature Recognition
Grad-CAMs were obtained for the patch- and WSI-level models (Figures 4 and 5). Notably, patch-level Grad-CAMs for Marsh IIIc found the presence of enteroendocrine cells (EECs), and on retrospective chart review, these children also had endocrine abnormalities (details in supplemental appendix S11). WSI Grad-CAMs focused more on the surface epithelium for Marsh score I compared to more severe scores (Marsh IIIa, IIIb, and IIIc), as shown in Figure 5. Detailed Grad-CAM findings are mentioned in supplemental appendix S12.
Immunohistochemistry (IHC) Analysis
IHC was done to confirm the EEC Grad-CAM findings in Marsh IIIc biopsies of children with endocrine abnormalities. The comparison group included children with Marsh IIIc without endocrine abnormalities. Qualitative IHC showed that chromogranin A was more localized to EECs for children with Marsh IIIc with endocrine abnormalities versus the comparison group (Figure 6). Quantitative assessment was performed using a Gaussian Mixture Model (GMM) to detect EECs based on clusters of stain colors per pixel. The EEC color cluster, represented as dark gray (Figure 7), had a higher percentage of color pixels for children with Marsh IIIc and endocrine abnormalities than those without endocrine complications (details in supplemental appendix S13).
Discussion
Once thought to be rare, Celiac Disease (CD) affects approximately 1% of the global population [24]. It appears to be associated with increased mortality along with substantial morbidity, much of which may be preventable or reversible with a gluten-free diet [add ref]. The current gold standard for CD diagnosis involves assessing proximal small intestinal biopsies [1, 2, 3] with the lack of specific histopathological markers associated with either CD severity or comorbid conditions. CD severity assessment is also challenging and falls privy to interobserver variability and non-specificity of discerning histopathological features [6, 7]. Due to this, there is a need to refine risk stratification and prediction for those who will develop severe CD or comorbidities. We utilized a deep learning-based image analytics model to assess of disease severity and identify of histopathological markers specific for certain disease severity classes and phenotype with endocrinopathies.
Our deep learning model for CD classification demonstrated accuracy comparable with a previously published study by Wei et al. [16]. This group also utilized a CNN similar to our model, although the classification was limited to CD, normal, and nonspecific duodenitis with accuracies of 95.3%, 91.0%, and 89.2%, respectively. We further categorized our CD versus normal model for CD severity classes. We also added a layer of explainability regarding the features deemed important by the model for classification with the use of Grad-CAMs.
Applying our benchmark method using Grad-CAMs [20] for discerning features specific for each severity class, we found EECs as potential biomarkers for children with severe disease (Marsh IIIc) and endocrinopathies (Type 1 diabetes and endocrine abnormalities). These markers were further confirmed via immunohistochemistry. EECs have been reported to be increased in duodenal biopsies from patients with refractory CD [25] and produce insulin or insulin-like hormones [26, 27]. They have also been linked to the development of diabetes via the Neurogenin-3 protein encoded by the NEUROG3 gene expressed in progenitor cells required for intestinal enteroendocrine cell development [28]. The proximity between enteroendocrine and immune cells in the gastrointestinal mucosa has also suggested that the interaction between the endocrine and immune systems may modulate inflammatory response in CD.
Recently European guidelines suggest that CD diagnosis can be accurately established with or without duodenal biopsies if the given recommendations are followed [11]. Our findings suggest a possible need for requirement of biopsies if not for diagnosis but for risk stratification and prediction of concurrent endocrinopathies using histopathologic markers.
Major strengths of our study include the classification of CD based on histological severity using the Marsh–Oberhüber classification. A wide variety of performance metrics (e.g., accuracy, sensitivity/specificity, PPV/NPV, F1 score) were also used to assess the deep learning model’s validity, something often missing from work done on deep learning in medicine. Moreover, our use of Grad-CAMs will assist in increasing physician confidence in deep learning-based decision making and paving the way for discerning disease-specific histopathologic markers. However, despite these strengths, we experienced limitations as well. Our study design primarily does not account for direct comparisons of human interpreters to our deep learning models, and our comparisons to human results are indirect. Additionally, there were slight variations in H&E staining chemical composition over the years within our institute, resulting in visible H&E color differences. We performed stain color normalization in an attempt to mitigate this data bias. Further, Chromogranin A also stained tissue edges along with EECs, but since this artefactual staining for equal for both comparison groups, it may not have led to the percentage of EEC color pixel-based quantification (GMM) for the group with Marsh IIIc and endocrine complications.
We have successfully developed a deep learning image analytics platform that enables visualization of discerning disease features. Using IHC, the deep feature, EEC, identified specifically for severe CD along with Type 1 diabetes and thyroid abnormalities, was confirmed via IH. We believe these findings have a broad application in the field of personalized medicine as they relate to risk stratification and patient prognostications. Additionally, it might suggest a need for biopsies at the time of diagnosis for risk and comorbidity prediction despite the presence of strong serologic markers that already confirm the presence of disease. Further work to assess whether the histopathologic markers align with gene expression signatures and external validation models are underway.
Data Availability
Data can be available on request
Author contributions
Concept and design: Syed, Brown, Ehsan, Khan, Sali.
Acquisition, analysis, or interpretation of data: Syed, Brown, Ehsan, Khan, Sali, Catalano, Kowsari, Cheng, Pramoonjago, Raghavan, DeBoer, Moskaluk.
Drafting of the manuscript: Syed, Ehsan, Sali.
Critical revision of the manuscript for important intellectual content: Syed, Brown, Ehsan, Khan, Raghavan, Silvester, Cheng, DeBoer, Moskaluk, Moore.
Statistical analysis: Ehsan, Sali, Adorno. Obtained funding: Syed, Brown.
Administrative, technical, or material support: Syed, Ehsan, Khan, Sali, Adorno, Pramoonjago. Supervision: Syed, Brown.
Acknowledgments
The authors would like to acknowledge the members of Gastro Data Science Lab for their extensive feedback and continued support for this work.
Footnotes
Conflict of Interest: The authors have declared that no conflict of interest exists.
Funding Information: Research reported in this manuscript was supported by National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health under award number K23DK117061-01A1 (Syed), University of Virginia Center for Engineering in Medicine Grant (Syed and Brown), and University of Virginia THRIV Scholar Career Development Award (Syed). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.