Generating highly accurate pathology reports from gigapixel whole slide images with HistoGPT

in a zero-shot fashion. Our work represents an important step toward integrating AI into the medical workflow. We publish both model code and weights so that the scientific community can apply and improve HistoGPT to advance the field of computational pathology.


Introduction
Histopathology is the study of diseased tissues and cells under the microscope. It plays a critical role in the diagnosis of many diseases, including malignant cancers, viral infections, and inflammatory responses [1]. In many cases, the detailed analysis provided by histopathological examinations remains the diagnostic gold standard [2]. It involves the analysis of slides by pathologists, the dictation of findings, and the writing of the report. However, this process is time-consuming and labor-intensive [3]. The turnaround time for patients is likely to worsen in the future as the number of pathologists is decreasing at an alarming rate. Combined with an increase in tumor cases in an aging society, the workload for pathologists is unsustainable [4].
Artificial Intelligence (AI) offers a potential solution to handle frequent and uncomplicated diagnoses and effectively assist medical professionals in their daily routines by using advanced tools such as deep neural networks (DNNs) [5]. These brain-inspired systems are typically applied to digitized microscope slides, also known as whole slide images (WSIs). Modern deep learning (DL) techniques make it possible to automate several tasks, including cancer classification [6], tissue segmentation [7], survival prediction [8], and biomarker detection [9]. These approaches have already shown promising results and could reduce the burden on pathologists in today's medical landscape [10].
A major drawback of current methods is that they are typically limited to a narrow task, providing only a single scalar output for each input. Consider, for example, an image classification model for benign versus malignant tissue. Beyond predicting these two labels, the model cannot do anything else: it can neither solve new unseen problems (called zero-shot prediction) nor provide its reasoning steps for better explainability. Vision language foundation models offer an exciting alternative to these rigid approaches by processing both images and text simultaneously. However, due to methodological limitations, current multimodal AI algorithms [11-16] can only process small image patches of 224 x 224 pixels, or regions of interest (ROIs) of 1024 x 1024 pixels. These so-called patch-based approaches are suboptimal because they are limited to a tiny fraction of the WSI, ignoring potentially relevant areas in the remaining tissue sample.
Here, we present HistoGPT, a vision language model (VLM) that can generate histopathology reports from gigapixel WSIs (see Figure 1) with impressive quality. Given a slide, the model uses a vision foundation model (VFM) to extract meaningful visual features from the tissue sample and combines them with a large language model (LLM) via cross-attention mechanisms to generate the final report. The generated report describes the WSI with high fidelity, explaining tissue composition, cellular subtypes, and potential diagnoses. In an unprecedented way, users can interact with the model through various prompts ("Expert guidance") to extract additional information such as tumor subtypes and tumor thickness. To make the output text interpretable, HistoGPT provides saliency maps that highlight the corresponding image regions that led to the specific findings in the generated text, providing an insightful and detailed understanding not possible before.
In our experiments, HistoGPT outperforms a state-of-the-art biomedical language model for text generation [17], a general-purpose multimodal AI system for image understanding [18], various multiple instance learning (MIL) approaches for image classification [9,19,20], and different contrastive methods [12,13,16] for zero-shot prediction. We demonstrate that a slide-level model is necessary for high accuracy by training two novel contrastive pre-trained baselines we call HistoCLIP and HistoSigLIP. Both outperform the patch-level foundation model PLIP [13] on slide-level tasks and are only surpassed by the generative pre-trained HistoGPT.
To train HistoGPT, we collect a large multimodal skin histology dataset from the Department of Dermatology at the Technical University of Munich with 6,000 paired WSIs and pathology reports written by board-certified pathologists for each patient case. To validate HistoGPT, we use one internal and five external publicly available test sets that cover different data distributions in different countries. To democratize the use of AI, we release HistoGPT as an end-to-end deep learning pipeline that can be deployed on local machines. As a result, users can select and fine-tune a copy of our machine learning algorithm according to their needs.

HistoGPT simultaneously learns from vision and language
HistoGPT consists of two components (see Figure 2A): a vision foundation model module and a large language model module. The vision module is based on CTransPath [21]. It is a Swin Transformer [22] trained on over 32,000 WSIs from TCGA [23] and PAIP [24] using a semantically guided contrastive learning algorithm. Our language model module repurposes BioGPT [17], an auto-regressive generative model based on the Transformer [25] decoder architecture of GPT-3 [26] trained on 15 million biomedical articles from PubMed. We sample image features from the vision module using a custom pre-trained (see Figure 2B) Perceiver Resampler [27] and integrate them into the LLM via interleaved gated cross-attention (XATTN) blocks [28]. Only these new XATTN blocks are trained from scratch. In this way, we endow HistoGPT with visual and linguistic domain knowledge, which is critical for tackling the challenging problem of generating histopathology reports from entire WSIs. Similar to Flamingo [28], we freeze the parameters of all pre-trained modules during optimization to further reduce the computational cost and to avoid catastrophic forgetting of the inductive biases encoded in the learned weights.
A language model predicts a probability distribution over a vocabulary. The next word in a text is randomly selected based on a combination of top-p and top-k sampling. Once the first few words have been chosen, the outline of the report is roughly pre-determined. To avoid being locked into a fixed report, we use an advanced inference method called Ensemble refinement, introduced in Med-PaLM 2 [29], to randomly sample multiple reports, each focusing on slightly different aspects of the WSI (see Figure 2C). This extensive sampling allows us to thoroughly search the model distribution and generate a wide variety of medical reports, maximizing the likelihood of including all important observations. The general-purpose LLM GPT-4 [18] is then used to summarize all the bootstrapped reports.

Figure 2 (caption, continued): Optionally, users can query the model for additional details using prompts such as "tumor thickness". (B) We train HistoGPT in two phases. In the first phase, we pre-train the vision module of HistoGPT using multiple instance learning (MIL). In the second phase, we freeze the pre-trained layers and fine-tune the language module on the image-text pairs. To prevent the model from overfitting on the same sentences, we apply text augmentation. This is done using GPT-4, a general-purpose large language model that faithfully paraphrases the medical notes. (C) During deployment, we propose to optionally use an advanced inference method called Ensemble refinement. Here, the model stochastically generates multiple possible pathology reports via temperature sampling to capture different aspects of the input image. An aggregation module (GPT-4) then combines the results to obtain a more complete description of the underlying case.
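The combination of temperature, top-k, and top-p sampling mentioned above can be illustrated with a minimal, dependency-free sketch. The function name `sample_next_token`, the toy logits, and the default values for `temperature`, `top_k`, and `top_p` are assumptions made for illustration; the text does not state the settings used for HistoGPT.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.95, rng=None):
    """Sample one token index using temperature, top-k, then top-p filtering."""
    rng = rng or random.Random()
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep the k most probable tokens, most probable first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Draw among the surviving tokens, proportionally to their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy vocabulary of five tokens (illustrative values only).
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
draws = [
    sample_next_token(toy_logits, top_k=3, top_p=0.9, rng=random.Random(seed))
    for seed in range(20)
]
```

Repeating such stochastic decoding with different random seeds yields several distinct candidate reports, which is the raw material that an aggregation step such as Ensemble refinement would then summarize.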

HistoGPT generates human-level pathology reports
We train HistoGPT on over 13,000 whole slide images from 6,000 patients with corresponding pathology reports from a real-world cohort provided by the Department of Dermatology at the Technical University of Munich (see Figure 3A). This internal dataset contains 162 different disease classes of varying frequency and has a total size of 10 terabytes. To assess the impact of model architecture and size, we train and evaluate three models: HistoGPT with 1 billion parameters (HistoGPT-1B), HistoGPT with 3 billion parameters (HistoGPT-3B), and HistoGPT-3B with Ensemble Refinement (HistoGPT-3B-ER). In the following experiments, we use HistoGPT in "Expert guidance" mode, where the model is prompted with the correct diagnosis, simulating a pathologist who is confident in the WSI assessment but wants to leave the work of textual tissue description to an AI assistant (see Figure 3B).
Currently, no model can generate a histopathology report from an entire WSI, let alone a series of WSIs (one patient might have multiple tissue samples). Therefore, we compare the reports generated by HistoGPT-1B, HistoGPT-3B, and HistoGPT-3B-ER with those of text-only and patch-only architectures. For the former, we choose the domain-specific language model BioGPT-1B, fine-tuned on our Munich cohort. For the latter, we rely on the multimodal foundation model GPT-4V(ision) [18], which takes low-resolution images of size 2000 x 768 as input. We introduce two other non-trivial baselines: a lower baseline, where we compare two random reports with arbitrary diagnoses; and an upper baseline, where we compare two random reports with the same diagnosis (see Methods for more details).
We evaluate the models' output using four semantic-based machine learning metrics: (i) match critical medical terms extracted from the original text with the generated text using a dermatology dictionary; (ii) use the same technique but with ScispaCy, a scientific named entity recognition tool, as the keyword extractor [30]; (iii) compare the semantic meaning of the original and generated reports by measuring the cosine similarity of their text embeddings generated by the biomedical language model BioBERT [31]; (iv) use the same technique but with the general-purpose large language model GPT-3-ADA [26] for text embedding (see Supplementary Figure 2 for an illustration).
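The two families of metrics, keyword overlap and embedding similarity, can be sketched as follows. The keyword overlap is scored here with the Jaccard index, which is the quantity the results report for keyword matching. The vocabulary `DERM_TERMS`, the tokenizer in `extract_keywords`, and both example sentences are hypothetical stand-ins; the actual dermatology dictionary and the BioBERT or GPT-3-ADA embeddings are not reproduced here.

```python
import math
import re


def extract_keywords(text, vocabulary):
    """Return the vocabulary terms that occur as whole words in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return tokens & vocabulary


def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical mini-dictionary and reports, for illustration only.
DERM_TERMS = {"epidermis", "dermis", "basaloid", "palisading", "infiltrate"}
reference = "Basaloid tumor nests with peripheral palisading in the dermis."
generated = "The dermis shows basaloid cell nests and a lymphocytic infiltrate."

ref_keywords = extract_keywords(reference, DERM_TERMS)
gen_keywords = extract_keywords(generated, DERM_TERMS)
keyword_score = jaccard_index(ref_keywords, gen_keywords)
```

In this toy example the two reports share two of four distinct keywords. For metrics (iii) and (iv), `cosine_similarity` would be applied to the embedding vectors of the two reports instead of to keyword sets.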
In "Expert guidance", HistoGPT-1B and HistoGPT-3B capture an average of 64% and 63% of all dermatological keywords from the original pathology reports, respectively (see Figure 3C), outperforming alternative language models such as BioGPT-1B and GPT-4V by at least 5%.
HistoGPT-3B-ER further improves the Jaccard index to 77%. This is 10% above the upper baseline. A similar trend is observed when ScispaCy is used as a keyword extractor (see Figure 3C). HistoGPT also produces text with a high cosine similarity with the ground truth, as indicated by the embeddings provided by BioBERT and GPT-3-ADA (see Figure 3C). We also evaluate all models using traditional syntax-based measures (BLEU-4, ROUGE-L, METEOR, and BERTscore). Here, HistoGPT receives relatively low scores (see Supplementary Table: Automatic report evaluation). Combined with the high semantic-based scores (see Figure 3C), this suggests that HistoGPT is not overfitting the training set by simply repeating common phrases and medical terms.

For BCC, Pathologist 1 (P1) found that 38% of the generated reports described the WSI better than the original report. In 31% of cases, both reports performed equally well, while in 31% of cases, the original report was preferred. In 58% (P1) and 55% (Pathologist 2, P2) of all cases, the pathologists did not prefer the original report to the generated one.
To evaluate the content of the generated reports from an expert perspective, we conduct a blinded study in which we randomly select 100 cases from our Munich test dataset, generate a report for each patient in "Expert guidance" mode, and pair it with the original human-written report. The two reports are then randomly shuffled and anonymized. Two independent expert pathologists (P.S. and S.B.), neither involved in the construction nor annotation of the Munich cohort, were given the original WSIs and asked to identify the report that best describes each case, with the option of selecting "no difference" if both are deemed equally accurate. Ensemble refinement is not used in this study to avoid easy identification of the GPT-4 summarized text. For the five largest diagnostic classes (basal cell carcinoma (BCC), benign melanocytic nevus (BMN), seborrheic keratosis (SK), actinic keratosis (AK), squamous cell carcinoma (SCC), see Figure 3A), we find moderate agreement between the two pathologists. Analyzing the results for each class separately, we find that Pathologist 1 overwhelmingly prefers the AI or finds the AI and human report similarly good in about 70% of the BCC cases. Pathologist 2, on the other hand, prefers the AI-generated report for BMN 80% of the time. The AI-generated report for SK is preferred by both pathologists 90% of the time. Across all 100 report pairs, both pathologists find no difference between the generated and human reports about 45% of the time and prefer the AI-generated reports about 15% of the time (see Figure 3D).
According to a post-analysis provided by the two pathologists, after about 20 cases, they were able to tell which report was likely generated by the AI and which was likely generated by a human pathologist. The AI-generated text tends to be more structured and comprehensive. It includes more observations that are informative but not always necessary for the final diagnosis. Notably, there are only a few cases (< 5) where HistoGPT generated confusing text.
In one case, the model incorrectly identified red collagen bundles as blood. In another case, it failed to describe a cyst, which was the key diagnostic feature. In one interesting case, there was a disagreement between the ground truth diagnosis and Pathologist 1, resulting in both AI and human reports being disputed. Interestingly, one slide was incorrectly annotated by the human, but the AI still provided the correct report. There are two cases where the AI failed to detect small or unusual objects such as mitotic figures and a scabies mite. In one slide, the model mistook erythrocytes for eosinophils. However, these two cell types were difficult to distinguish in the image. Pathologist 1 mentioned that about 10 human reports were favored simply because the tumor thickness was more accurate than in the generated report, but the text itself was equally good. After adjusting for this, and including only reports where "Expert guidance" and model prediction agreed, the pathologist preferred the AI report or was indifferent 80% of the time (see Supplementary Figure 3). Overall, the model was described as having the skill level of a novice pathologist. Notably, this was achieved with only 5K training points, which is small by LLM standards.

HistoGPT accurately predicts diseases across many cohorts
There is another quantitative way to demonstrate that HistoGPT has effectively learned to encode medical knowledge. We extract the predicted diagnosis from the generated reports, calculate the classification accuracy, and compare the results (Figure 4) with state-of-the-art multiple-instance learning (MIL) approaches for image classification. For this purpose, we run HistoGPT without "Expert guidance" mode, i.e., we prompt the model with the phrase "Final diagnosis" instead of "Final diagnosis: [expert label]" and let it make a diagnostic decision on its own (see Figure 4A). MIL methods such as AttentionMIL [19], TransMIL [20], and TransformerMIL [9] achieve weighted F1 scores between 0.34 and 0.48 on the Munich test set. These results are not unexpected. A major challenge for all these methods is that the training dataset is highly unbalanced, ranging from a handful of samples in the minority classes to several hundred samples in the majority classes. Nevertheless, our PerceiverMIL achieves a weighted F1 score of 44% on the internal test set (see Figure 4B). The much larger HistoGPT-1B does not overfit and retains the performance of its vision module. Surprisingly, the even larger HistoGPT-3B improves the weighted F1 score to 45%. Compared to the highly specialized models AttentionMIL, TransMIL, and TransformerMIL, both PerceiverMIL and HistoGPT are slightly better or at least competitive in terms of classification performance. It is important to note that, unlike MIL approaches, the output of HistoGPT is pure text and not integer class indices, highlighting the flexibility of a vision language model.

A challenging clinical question with a high therapeutic impact in dermatology is the differentiation of cancer from non-cancer. In routine diagnosis, for example, it is important to distinguish basal cell carcinoma (BCC) from other conditions; cancer, such as squamous cell carcinoma (SCC), from precancerous actinic keratosis (AK); and malignant from benign conditions, such as melanoma from benign melanocytic nevus (BMN). Unlike the previous classification task with over 100 classes, we now face a classification problem with only two classes. In this case, HistoGPT automatically calls a lightweight binary classifier to solve the task at hand (see Methods), overcoming the class imbalance problem from before. This mode is called "Classifier guidance" and makes the model aware of the unbalanced label distribution by limiting the number of output classes. We achieve remarkable classification performance for the three clinical tasks with weighted F1 scores of 98%, 87%, and 89%, respectively (see Figure 4C).
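Because the weighted F1 score is the headline metric in this comparison, a minimal sketch of how it is commonly computed may be helpful: per-class F1 scores are averaged with weights proportional to the number of true samples in each class. The function name `weighted_f1` and the toy labels are illustrative assumptions, not the evaluation code used in this work.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support in y_true."""
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label, count in support.items():
        true_pos = sum(t == label and p == label for t, p in pairs)
        false_pos = sum(t != label and p == label for t, p in pairs)
        false_neg = count - true_pos
        denominator = 2 * true_pos + false_pos + false_neg
        f1 = 2 * true_pos / denominator if denominator else 0.0
        total += count * f1
    return total / len(y_true)


# Toy binary task (BCC versus other), illustrative labels only.
y_true = ["BCC", "BCC", "BCC", "other"]
y_pred = ["BCC", "BCC", "other", "other"]
score = weighted_f1(y_true, y_pred)
```

Since the weights follow class frequency, frequent classes dominate this score, which is worth keeping in mind when interpreting results on a highly unbalanced dataset.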
HistoGPT in "Classifier guidance" mode also generalizes to previously unseen datasets and problems. We demonstrate this by evaluating HistoGPT on five external, publicly available cohorts from different countries, scanner types, staining protocols, and medical procedures such as shave biopsies, punch biopsies, and excisional biopsies (see Figure 4D). While some of the cohorts include a variety of dermatological diseases (Queensland or Linköping), some cohorts (TCGA and CPTAC) include only melanoma cases, but can still be used to assess the accuracy of HistoGPT. We retrain PerceiverMIL as a state-of-the-art classifier and HistoGPT-1B as well as HistoGPT-3B on the entire Munich cohort and compare their classification performance on the external datasets. On the BCC subset of Münster, both PerceiverMIL and HistoGPT correctly identify BCC in 88% of cases (see Figure 4E). In the multi-class setting (Queensland with 3 classes and Linköping with 14 classes), we achieve accuracies of 85% and 70%, respectively. The models also reliably discriminate melanoma from other types with accuracies of 80% and 90% in TCGA and CPTAC, respectively. For comparison, we also report the results of HistoGPT without class imbalance awareness (see Figure 4E, light color bars).

BioGPT-1B and a grounded report given by GPT-4V, the text quality of these models is much lower compared to HistoGPT with or without Ensemble refinement.

HistoGPT predicts tumor thickness and tumor subtypes zero-shot
In the diagnosis of (skin) tumors, it is important to include information about tumor thickness or assignment to a specific tumor subtype in the final report. These parameters are well defined in dermatopathology: in basal cell carcinoma, tumor thickness is measured from the stratum granulosum in the epidermis to the deepest point of the tumor in millimeters, similar to the determination of the Breslow index in melanoma, while tumor subtype classification is based on the WHO guidelines [32]. HistoGPT can predict both tumor thickness and tumor subtypes out-of-the-box and does not require additional reconfiguration or explanation of tumor-specific parameters at any stage of training. We can design prompts and instruct HistoGPT to produce the desired text output. For example, typing the prompt "tumor thickness" will produce a prediction of the depth of tumor invasion without fine-tuning.
Although only a fraction (n = 644) of the training dataset has this value recorded as ground truth, HistoGPT can still predict the tumor thickness with considerable accuracy and include it directly in the final report. This emergent behavior is referred to in the literature as zero-shot learning [40]. For the 94 samples in the internal Munich test set with such a ground truth, we measure a root mean square error (RMSE) of 1.8 mm and a significant correlation coefficient of ρ = 0.52 (p = 9.7 × 10^-8) for the predicted tumor thickness versus the reported ground truth (see Figure 5A). Binning the values into intervals with step sizes of 2 mm, 1 mm, and 0.5 mm gives us accuracies of 64%, 38%, and 21%, respectively. Again, we emphasize that this is zero-shot prediction on a task where the ground truth is typically obtained with a dedicated measurement procedure. In comparison, the predictions of the slide-based contrastive baselines, HistoCLIP (RMSE = 4.35 mm, ρ = 0.006, p = 0.96) and HistoSigLIP (RMSE = 3.84 mm, ρ = 0.38, p = 0.002), correlate poorly with the ground truth and are far from HistoGPT in terms of quality (see Supplementary Figure 4A). The patch-based contrastive baseline PLIP [13], which is the state of the art in computational pathology, is even worse (RMSE = 2.78 mm, ρ = -0.18, p = 0.08), highlighting the importance of a slide-level approach.
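The three quantities reported for tumor thickness (RMSE, correlation, and binned accuracy) can be sketched in a few lines. Two points are assumptions of this sketch rather than statements about the original evaluation: the text does not say which correlation coefficient ρ denotes, so the Pearson coefficient is used here, and "binning" is interpreted as assigning each value to a half-open interval of the given step size starting at 0 mm. The thickness values are invented for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation coefficient (assumed choice of correlation)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def binned_accuracy(y_true, y_pred, step):
    """Fraction of predictions that fall into the same bin as the ground truth."""
    hits = sum(
        math.floor(t / step) == math.floor(p / step) for t, p in zip(y_true, y_pred)
    )
    return hits / len(y_true)


# Invented thickness values in millimeters, for illustration only.
measured = [0.4, 1.2, 2.5, 3.1]
predicted = [0.6, 1.9, 2.1, 4.5]

error = rmse(measured, predicted)
correlation = pearson(measured, predicted)
accuracy_2mm = binned_accuracy(measured, predicted, step=2.0)
accuracy_half_mm = binned_accuracy(measured, predicted, step=0.5)
```

The finer the bins, the stricter the criterion, which matches the drop in accuracy reported as the step size shrinks from 2 mm to 0.5 mm.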
We analyze whether the zero-shot capability generalizes to other cohorts by looking at the previously unseen BCC subset of the external Münster test set (see Figure 5B), which has not been used for training purposes. For the samples with a ground truth tumor thickness measurement, we find a root mean square error of 0.98 mm and a significant correlation coefficient of ρ = 0.39 (p = 5.8 × 10^-5). The predictions of HistoCLIP (RMSE = 3.91 mm, ρ = -0.16, p = 0.1), HistoSigLIP (RMSE = 1.46 mm, ρ = 0.10, p = 0.3), and PLIP (RMSE = 1.43 mm, ρ = -0.04, p = 0.7) correlate much worse with the ground truth than those of HistoGPT (see Supplementary Figure 4B). Using gradient attention maps, we can gain insight into the reasoning behind each output. When estimating tumor thickness, HistoGPT correctly focuses on the tumor region (see Figure 4C, top). However, the VLM sometimes struggles to find the correct reference point (e.g., when the epidermis is torn or especially when it is ulcerated) or spatial orientation for the measurements, even though it recognizes the tumor mass itself (see Figure 4C, bottom; Supplementary Figure 5). We attribute this to the design decision not to use position embeddings to store the coordinate values for each patch, which led to training instabilities. However, because HistoGPT is designed to be used with a human-in-the-loop, pathologists can quickly identify the discrepancy and correct the model in an interactive teacher-student setting, i.e., in "Expert guidance" mode.
We continue to explore the benefits of zero-shot learning. Basal cell carcinoma is the most common type of malignant skin cancer. Although it is the majority class in the training set, the training set does not contain BCC subtypes as critical diagnoses. Therefore, BCC subtypes could not be used as labels during supervised pre-training. This information is only implicitly available as free text hidden in the report. Interestingly, HistoGPT is still able to extract the hidden information from the internal Munich training set and apply the acquired knowledge in the external Münster test set to discriminate between three major BCC subtypes (superficial, solid/nodular, and infiltrating) with a weighted F1 score of 63%, quantified by extracting the keywords from the generated reports (see Figure 5D). As clearly shown in the gradient attention maps (see Figure 5E), HistoGPT correctly attends to the relevant architectural patterns within the histological slides that are the hallmarks of each cancer subtype. This zero-shot capability highlights the adaptability of HistoGPT as a generative AI model, especially when compared to more traditional classifiers such as TransMIL, which are limited to predefined classes and thus cannot predict subtypes without re-training. We also compare its zero-shot performance to more advanced models such as HistoCLIP and HistoSigLIP. As contrastive methods, they overcome the inflexible structure of multiple instance learning approaches. They achieve weighted F1 scores of 54% and 50%, respectively, but perform significantly worse than HistoGPT, particularly in identifying infiltrative BCC.
Infiltrative BCC is extremely important to identify in the clinical context, as this subtype tends to have a biologically much more aggressive growth pattern and relapse rate, and therefore may require different treatment and follow-up. The patch-based visual language foundation model for pathology image analysis, PLIP, does not provide useful predictions for this zero-shot classification task. Surprisingly, PLIP is constant over the test set and predicts all specimens as either superficial or solid depending on the resolution. That is, at 5x and 10x magnification, PLIP predicts all samples to be superficial; at 20x and 40x magnification, it predicts all images to be infiltrative (see Supplementary Figure 6).
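Since the output of the model is free text, the subtype evaluation relies on extracting keywords from the generated reports. A minimal sketch of such an extraction step is shown below. The keyword lists in `SUBTYPE_KEYWORDS` and the rule that the earliest mention wins are hypothetical choices made for this illustration; the exact extraction rules are not given in the text.

```python
# Hypothetical keyword lists for the three BCC subtypes named in the text.
SUBTYPE_KEYWORDS = {
    "superficial": ("superficial",),
    "solid/nodular": ("solid", "nodular"),
    "infiltrating": ("infiltrating", "infiltrative"),
}


def extract_subtype(report):
    """Return the subtype whose keyword appears first in the report, or None."""
    text = report.lower()
    best = None  # (position of earliest keyword, subtype name)
    for subtype, keywords in SUBTYPE_KEYWORDS.items():
        for keyword in keywords:
            position = text.find(keyword)
            if position != -1 and (best is None or position < best[0]):
                best = (position, subtype)
    return best[1] if best else None


# Invented report snippets, for illustration only.
examples = [
    "Nodular basal cell carcinoma with peripheral palisading.",
    "Basal cell carcinoma, infiltrative growth pattern in the deep dermis.",
    "Dense lymphocytic infiltrate, no evidence of malignancy.",
]
predicted_subtypes = [extract_subtype(report) for report in examples]
```

The extracted labels can then be compared against the reference subtypes with a standard classification metric such as the weighted F1 score. Note that plain substring matching is crude: it would not handle negations such as "no superficial component", which a more careful extractor would need to address.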

Discussion
With HistoGPT, we introduce a vision language model that can generate histopathology reports from full-resolution, gigapixel whole slide images. The generated reports are of high quality, consistent with ground truth and independent expert evaluation. HistoGPT outperforms the state-of-the-art foundation model GPT-4V, which itself is already very capable in medical tasks [15,41,42]. In addition, HistoGPT predicts disease subtypes (validated on five international cohorts) and provides a comprehensive list of medical keywords using named entity recognition tools. Using various prompts (e.g., "the tumor thickness is"), pathologists can guide the model and customize it to their needs. This zero-shot performance rivals existing zero-shot learning approaches based on CLIP and SigLIP. Advanced methods such as ensemble refinement allow us to explore the probability space of possible medical outcomes.
In particular, the output text is fully interpretable using gradient attention maps that match words in the generated report to corresponding regions in the image. We note that HistoGPT achieves this level of performance with only 6,000 dermatology cases, which is relatively small by LLM standards. Interestingly, this is the same number of cases that a pathologist in Germany must have seen to qualify for the dermatopathology exam [43]. Thus, our expert's impression that HistoGPT is comparable to a novice dermatopathologist has an intuitive analogy. However, unlike a real pathologist, our model lacks extensive medical training and strong human supervision. Nevertheless, HistoGPT already writes reasonably good reports and shows a good understanding of the underlying case.
Although current neural networks can also predict tumor thickness or tumor subtypes with good accuracy, as has been shown in particular for basal cell carcinoma [33-39], they require a large amount of high-quality, precisely annotated data for training and are not flexible enough to be used for tasks other than those for which they were trained. That is, these models are fully supervised and do not operate in a zero-shot fashion. Specifically for tumor thickness prediction, the above approaches are not end-to-end deep learning systems. Users must first train a segmentation model to segment the tumor region, and then use a hand-crafted mathematical algorithm to calculate the tumor thickness. HistoGPT, on the other hand, does not require this multi-step approach because it has already learned to understand this concept just by looking at text and images.
HistoGPT also has its limitations. First, the model has only been trained and tested on dermatological samples. Thus, it cannot yet be generalized to the more general case of pan-cancer diagnosis. In addition, our training dataset suffers from severe class imbalance, which limits its usefulness for minority classes. This problem can be partially mitigated with "Classifier guidance". However, guidance has its limitations too, as the generated reports tend to be of higher quality when the initial diagnostic prediction is also correct (see Supplementary Figure 3). Another open problem is to find a highly efficient and effective way to encode the positional information of the individual image patches within a WSI. An interesting research direction is to fine-tune HistoGPT as a conversational chatbot using Reinforcement Learning with Human Feedback (RLHF). This may prove challenging in practice, as there are currently no slide-level question-answer pairs for the model to learn from. An even more useful follow-up question is whether a tumor has been excised as a whole or whether there is still a tumor mass at the margins, which is clinically highly relevant. Finding out if and how AI will differentiate between primary tumors and metastases is another clinically relevant challenge. Certainly, indications such as the growth of tumor cells emanating from the epidermis can be recognized by the AI. However, there will still be cases where the AI, just like human pathologists, will find it difficult to make final decisions. This remains an interesting topic for follow-up studies. Overall, HistoGPT shows strong emergent capabilities and is a fully functional proof of concept for a vision language foundation model in histopathology. We are releasing both model code and weights so that the broader scientific community can explore and improve HistoGPT.
proliferating collagen fiber bundles. Critical findings: Hypertrophic, keloid-like scar. Partial excision."

Münster cohort: All 1,300 histology samples of the Münster cohort were processed and stained (with hematoxylin and eosin) at the Department of Dermatology, University Hospital Münster, Münster, Germany. They were scanned with a 20x objective at 0.46 micrometers per pixel using a Hamamatsu NanoZoomer S360 MD, Hamamatsu City, Japan, at the Department of Dermatology, University Hospital Münster, Münster, Germany. The cohort comprises 300 BCC cases, 100 for each subtype (superficial, solid/nodular, infiltrating), and 1,000 cases from daily routine without special selection. All slides were fully anonymized. An example report (AI-translated from German to English) reads: "Lichen planus-like keratosis (regressive solar lentigo/flat seborrheic keratosis), no evidence of basal cell carcinoma in the present biopsy."

Image preprocessing
We treat all whole slide images (WSIs) belonging to a patient as one input. In other words, we have patient-level samples, instead of slide-level or even patch-level data points. These WSIs are tessellated at a total magnification of 100x (equivalent to an objective magnification of 10x or 1 micron per pixel) into non-overlapping image patches of 256 x 256 pixels and resized to 224 x 224 pixels using the Python library SlideIO. The inputs are then converted into PyTorch tensor objects and normalized using a mean of (0.485, 0.456, 0.406) and a standard deviation of (0.229, 0.224, 0.225). We use this specific image size and normalization parameter in accordance with most publicly available pre-trained histopathology image encoders.
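The tessellation and normalization steps can be sketched with NumPy on an in-memory image region. This sketch deliberately does not use SlideIO or PyTorch, so it is not the pipeline described above: reading the slide is replaced by a random array, the resize uses nearest-neighbor sampling as a dependency-free stand-in, and cropping incomplete edge patches as well as scaling pixel values to [0, 1] before normalization are assumptions. Only the patch size, the target size, and the mean and standard deviation are taken from the text.

```python
import numpy as np

# Normalization constants quoted in the text (per RGB channel).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def tessellate(image, patch=256):
    """Split an H x W x 3 array into non-overlapping patches (edges cropped)."""
    height, width, channels = image.shape
    rows, cols = height // patch, width // patch
    cropped = image[: rows * patch, : cols * patch]
    tiles = cropped.reshape(rows, patch, cols, patch, channels).swapaxes(1, 2)
    return tiles.reshape(rows * cols, patch, patch, channels)


def resize_nearest(tiles, size=224):
    """Nearest-neighbor resize of N x H x W x C tiles to N x size x size x C."""
    _, height, width, _ = tiles.shape
    ys = np.arange(size) * height // size
    xs = np.arange(size) * width // size
    return tiles[:, ys][:, :, xs]


def normalize(tiles):
    """Scale to [0, 1], standardize per channel, and move channels first."""
    scaled = tiles.astype(np.float32) / 255.0
    standardized = (scaled - MEAN) / STD
    return standardized.transpose(0, 3, 1, 2)


# Random stand-in for a slide region already read at 1 micron per pixel.
region = np.random.default_rng(0).integers(0, 256, size=(600, 800, 3), dtype=np.uint8)
tiles = tessellate(region)
batch = normalize(resize_nearest(tiles))
```

For a patient with several slides, the patches of all slides would be processed in the same way and concatenated into a single sequence before feature extraction.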

Model architectures
We use CTransPath [21] as our pre-trained vision encoder to extract 768-dimensional feature vectors for each image patch and concatenate them along the sequence dimension to obtain a matrix of size n x 768, where n is the number of image patches. The inputs are then fed into the Perceiver Resampler [27], which is borrowed from the vision language model Flamingo [28] with randomly initialized weights. We change the default number of latents from 64 to 640 because WSIs are much larger than natural images and require a larger dimensional latent space to store the additional information. We keep the output size of 1536 because this has been shown to work well [28]. The fixed-size outputs of dimension 640 x 1536 are then used as keys and values in the tanh gated cross-attention block (XATTN). The query vectors come from the pre-trained language model BioGPT [17]. In particular, we use one XATTN block after each language layer according to the high-performance configuration of Flamingo. The output layer of HistoGPT is a linear classifier over the vocabulary.
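To make the role of the tanh gate concrete, the following NumPy sketch shows a single-head gated cross-attention step in which text states act as queries and resampled visual latents act as keys and values. It is a simplification under stated assumptions: it uses one attention head, omits layer normalization and the feed-forward sublayer, and works with small toy dimensions instead of the 640 x 1536 latents described above. All weight matrices are random placeholders.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)


def gated_cross_attention(text_states, visual_latents, w_q, w_k, w_v, w_o, alpha):
    """Residual cross-attention whose contribution is scaled by tanh(alpha)."""
    queries = text_states @ w_q          # queries come from the language model
    keys = visual_latents @ w_k          # keys and values come from the image
    values = visual_latents @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    attended = softmax(scores) @ values @ w_o
    return text_states + np.tanh(alpha) * attended


rng = np.random.default_rng(0)
d_text, d_visual, d_head = 32, 48, 16    # toy sizes, not those of HistoGPT
text_states = rng.normal(size=(5, d_text))        # 5 text tokens
visual_latents = rng.normal(size=(10, d_visual))  # 10 visual latents
w_q = rng.normal(size=(d_text, d_head))
w_k = rng.normal(size=(d_visual, d_head))
w_v = rng.normal(size=(d_visual, d_head))
w_o = rng.normal(size=(d_head, d_text))

closed_gate = gated_cross_attention(text_states, visual_latents, w_q, w_k, w_v, w_o, alpha=0.0)
open_gate = gated_cross_attention(text_states, visual_latents, w_q, w_k, w_v, w_o, alpha=1.0)
```

With the gate parameter at zero, the block is an identity map and the language model behaves exactly as before; as the gate opens during training, visual information is mixed in gradually. This is the motivation usually given for tanh gating when new layers are inserted into a frozen language model.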
We benchmark HistoGPT against HistoCLIP and HistoSigLIP. They use the feature mean of the pre-trained Perceiver Resampler as the image representation and the EOS token of the pre-trained BioGPT as the text representation. A contrastive loss then aligns both feature vectors in the shared embedding space. For HistoCLIP, we use the same loss as for CLIP 47 . For HistoSigLIP, we use the loss proposed in SigLIP 48 . To improve performance and avoid training instabilities, we freeze the vision encoder during training. This technique is called locked-image text tuning 49 . We also compare HistoGPT to the patch-based foundation model PLIP using the contrastive pre-trained model via the provided API. To aggregate the patch-level results to the slide level, we evaluate PLIP 13 using the majority voting system of the related model MI-Zero 16 .
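The difference between the two contrastive objectives can be made concrete on a small image-text similarity matrix. The sketch below follows the published formulations of the CLIP and SigLIP losses as we understand them; it is not the training code. The parameters `scale`, `t`, and `b` are learnable in the original methods and are fixed here purely for illustration.

```python
import math


def clip_loss(sim, scale=1.0):
    """Symmetric softmax cross-entropy; matching pairs sit on the diagonal."""
    n = len(sim)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [scale * s for s in row]
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    columns = [list(c) for c in zip(*sim)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(sim) + cross_entropy(columns))


def siglip_loss(sim, t=1.0, b=0.0):
    """Pairwise sigmoid loss: label +1 on the diagonal, -1 elsewhere."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            x = z * (t * sim[i][j] + b)
            total += math.log1p(math.exp(-x))  # equals -log(sigmoid(x))
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]     # each image is closest to its own report
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # each image is closest to the wrong report
```

Both losses are lower for the aligned matrix; the CLIP loss normalizes over a whole row or column, whereas the SigLIP loss treats every image-text pair as an independent binary decision.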
Since BioGPT and other popular LLMs are all pre-trained on mostly English text, we need to translate the German reports into English to take advantage of their capabilities. For the translator, we choose a standard machine translation model based on the Transformer encoder-decoder architecture 25 with the checkpoint "Helsinki-NLP/opus-mt-de-en" available on Hugging Face.

Model training
We pre-train the Perceiver Resampler in a fully supervised manner by predicting the critical diagnosis using a linear classifier on top of the encoder. Since the labels are provided at the patient level, this approach is also known as multiple instance learning (MIL). The classifier is then discarded and the resampler is plugged into the vision language model (VLM). We freeze all layers of HistoGPT except the cross-attention blocks. Our generative training is based on causal language modeling: Given an input, we mask the next tokens and let the model predict them. This is done in parallel over all input tokens using an upper triangular causal attention mask.
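As a minimal illustration of causal language modeling, the sketch below builds the upper triangular mask and lists the (context, target) pairs that are evaluated in parallel. The function names are hypothetical, and the boolean convention (True means "blocked") is an assumption; frameworks differ on this point.

```python
def causal_mask(n):
    """mask[i][j] is True where position i must not attend to position j.

    The True entries form the strict upper triangle (j > i), so every token
    only sees itself and the tokens before it.
    """
    return [[j > i for j in range(n)] for i in range(n)]


def next_token_pairs(token_ids):
    """All (context, target) pairs scored in a single forward pass."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


mask = causal_mask(3)
pairs = next_token_pairs([5, 7, 9])
```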
For training, we use the AdamW optimizer with betas of (0.9, 0.95), a weight decay of 0.1, and an epsilon of 1e-8. The learning rate starts at zero and warms up linearly over 10 epochs to 1e-4 before decaying tenfold according to a cosine-annealing scheduler. We use a gradient accumulation of 32 to simulate a larger batch size. Both the Perceiver Resampler and the VLM are trained for 100 epochs using mixed precision training and gradient clipping to a Euclidean norm of 1.0. For contrastive learning, we use the hyperparameters proposed in 12 and 48 .
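The learning rate schedule can be written down as a small function. This is a sketch under assumptions that the text does not state explicitly: the rate is updated once per epoch, the cosine decay spans the 90 epochs that follow the warm-up, and "decaying tenfold" is read as ending at one tenth of the peak rate (1e-5). The function name and its parameters are hypothetical.

```python
import math


def learning_rate(epoch, peak=1e-4, warmup=10, total=100, decay_factor=10.0):
    """Linear warm-up from zero to `peak`, then cosine decay to peak / decay_factor."""
    floor = peak / decay_factor
    if epoch < warmup:
        return peak * epoch / warmup
    progress = (epoch - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(101)]
```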
During training, we randomly augment the text inputs to avoid overfitting common words and phrases. This is done beforehand using GPT-4 to sample 9 paraphrased texts with a temperature of 1.0 and nucleus sampling of 1.0. The prompt used is: "Rewrite the following text but be as accurate and faithful as possible to the original. Do not add or remove any information! Also, do not change the phrases 'Microscopic findings:' and 'Critical findings:', but leave them as they are."

Classifier guidance
We enable class-imbalance awareness in HistoGPT by using a lightweight and specialized classification model. For BCC vs. ¬BCC, we consider all samples that are not BCC to be ¬BCC and fit an MLP with 100 neurons. For Melanoma vs. ¬Melanoma, we follow the same procedure. For all other classification tasks, we only train on the specific subset. For example, if we want to classify BCC vs. SCC vs. AK vs. SK, we train a classifier only on the BCC, SCC, AK, and SK training features and ignore the remaining classes. Some datasets (Queensland and Linköping) only provide annotation masks as labels. They may contain different disease labels for different regions in the same slide. In this case, we consider the prediction of one of the ground truth classes as accurate.
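The data preparation behind these rules can be sketched as follows. Only the label handling is shown; fitting the MLP itself is left out. All function names are hypothetical, the string "not BCC" stands in for ¬BCC, and the feature values are placeholders.

```python
def binarize(labels, positive):
    """One-vs-rest relabeling, e.g. BCC vs. not BCC."""
    return [y if y == positive else "not " + positive for y in labels]


def select_subset(features, labels, target_classes):
    """Keep only training samples whose label belongs to the task at hand."""
    keep = set(target_classes)
    return [(f, y) for f, y in zip(features, labels) if y in keep]


def is_correct(prediction, region_labels):
    """Lenient scoring for slides annotated with several region-level labels:
    a prediction counts as accurate if it matches any of them."""
    return prediction in set(region_labels)


features = ["f1", "f2", "f3", "f4"]   # placeholder feature vectors
labels = ["BCC", "BMN", "SK", "SCC"]

binary_labels = binarize(labels, "BCC")
subset = select_subset(features, labels, ["BCC", "SCC", "AK", "SK"])
```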

Interpretability maps
For HistoGPT explainability, we use partial derivatives and associate the output latents of the Perceiver Resampler with the corresponding input vectors. We then weight the image features with the text features using the cross-attention scores. This gives us a gradient attention map. It shows which word in the generated report corresponds to which region in a WSI. For example, we can highlight where the model sees basal cell carcinoma, how it detects tumor-infiltrating lymphocytes, and which regions it considers when measuring tumor thickness (see Supplementary Figure 1). In this way, we provide an unprecedented approach to explainable AI by matching visual and linguistic information.
The output of the Perceiver Resampler consists of 640 latent vectors. We compute the gradients of these latents with respect to the input patches with backpropagation. Thus, the gradient G has the form num_patches x num_latents. It tells us which image tokens have the most influence on which latent feature. The mean along the latent sequence thus gives us the most important image regions according to the vision resampler. How can we use this information to determine which of these regions corresponds to which word? One idea is to give higher weights to the latents that correspond to the words we are interested in. We get these weights by looking at the cross-attention scores of the last XATTN layer. The attention matrix A has a dimension of num_tokens x num_latents. Thus, given a target word, we can identify the corresponding target tokens and use the corresponding rows in the attention matrix as weights. Overall, the proposed Gradient x Attention map is given by the weighted mean $\left(G^{\top} \circ A[\text{target\_tokens}, :].\text{mean}(\text{dim}=0)^{\top}\right)^{\top}.\text{mean}(\text{dim}=1)$, where $\circ$ denotes element-wise multiplication with the attention weights broadcast across patches.
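The weighted mean can be traced step by step in plain Python. The sketch assumes that G has already been reduced to one scalar per (patch, latent) pair and that A holds the attention scores of the last XATTN layer; how these are extracted from the model is not shown, and the function name is hypothetical.

```python
def gradient_attention_map(G, A, target_tokens):
    """Relevance score per image patch for the given target tokens.

    G: num_patches x num_latents gradient magnitudes.
    A: num_tokens x num_latents cross-attention scores.
    """
    num_latents = len(G[0])
    # Average the attention rows of the target tokens: one weight per latent.
    weights = [
        sum(A[t][l] for t in target_tokens) / len(target_tokens)
        for l in range(num_latents)
    ]
    # Scale each latent column of G by its weight, then average over latents.
    return [sum(g * w for g, w in zip(row, weights)) / num_latents for row in G]


G = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 patches, 2 latents
A = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]   # 3 tokens, 2 latents

map_single = gradient_attention_map(G, A, target_tokens=[1])
map_phrase = gradient_attention_map(G, A, target_tokens=[1, 2])
```

Token 1 attends only to the first latent, so only patches that influence that latent receive a non-zero score; a multi-token phrase averages the attention rows first.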

Evaluation metrics
We introduce two other non-trivial baselines: given the ground truth, compare two random reports with two arbitrary diagnoses (lower baseline), and compare two random reports with the same diagnosis (upper baseline). The logic behind this approach is simple. Medical texts often follow a structured format with a similar writing style, typically including a general description of the specimen and frequent use of common technical terms. In addition, certain diseases manifest homogeneously across patients, resulting in nearly identical report descriptions within a patient group. In such cases, the few unique terms in the reports become critical in distinguishing between different diagnoses. Therefore, these two baseline comparisons provide effective reference points for measuring the overall performance of our models.
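One way to draw the report pairs for the two baselines is sketched below. The exact sampling protocol is not specified in the text, so this is an assumption: pairs are drawn without replacement, the lower baseline ignores the diagnosis entirely, and the upper baseline first picks a diagnosis that has at least two reports. The function name and the toy reports are hypothetical.

```python
import random


def sample_baseline_pair(reports, same_diagnosis, rng):
    """Draw two distinct report texts from a list of (diagnosis, text) tuples."""
    if same_diagnosis:
        by_diagnosis = {}
        for diagnosis, text in reports:
            by_diagnosis.setdefault(diagnosis, []).append(text)
        # Only diagnoses with at least two reports can form a pair.
        eligible = [texts for texts in by_diagnosis.values() if len(texts) >= 2]
        first, second = rng.sample(rng.choice(eligible), 2)
    else:
        (_, first), (_, second) = rng.sample(reports, 2)
    return first, second


reports = [("BCC", "report 1"), ("BCC", "report 2"), ("SCC", "report 3")]
rng = random.Random(0)
upper_pair = sample_baseline_pair(reports, same_diagnosis=True, rng=rng)
lower_pair = sample_baseline_pair(reports, same_diagnosis=False, rng=rng)
```

Scoring many such pairs with the same metrics used for the generated reports yields the two reference points described above.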
Evaluating the reports generated by HistoGPT is a non-trivial task. Popular evaluation methods for natural language generation such as BLEU-4 50 , ROUGE-L 51 , and METEOR 52 primarily compare n-grams between two documents and may not effectively capture semantic similarities. In fact, two texts can describe the same phenomena in different ways, making a word-by-word comparison unfair. Therefore, we focus on two different quantitative performance metrics: keyword overlap and sentence similarity. For the former, we use a comprehensive glossary of human-curated dermatological vocabularies 53 to extract important medical keywords from the ground truth notes. In addition, we use ScispaCy 30 , a biomedical named entity recognition (NER) tool, to capture a broader range of technical terms. We then determine how many keywords from the ground truth text can be found in the generated text. The Jaccard index is an appropriate measure to quantify their overlap. To find a match in the generated report, we use an advanced version of Gestalt pattern matching (Ratcliff and Obershelp, 1988), which is available in the Python library difflib. We use the default cutoff threshold of 0.6. This value strikes a balance between matching every word as a target and matching only exact overlaps. The latter is undesirable because it ignores different grammatical forms of a word. As a consequence, some unrelated words will inevitably be matched. In this case, the Jaccard index can be considered a relative measure as the same approach is applied to every model.
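The fuzzy keyword matching can be sketched with `difflib.get_close_matches`, whose default cutoff is 0.6. The keyword extraction step (glossary lookup or ScispaCy) is not shown; the keyword lists below are made up for illustration. How the fuzzy matches enter the Jaccard index is also an assumption: here the intersection counts matched ground truth keywords and the union adds the generated keywords that match nothing.

```python
import difflib


def matched_keywords(truth_keywords, generated_keywords, cutoff=0.6):
    """Ground truth keywords with a close match among the generated ones."""
    candidates = list(generated_keywords)
    return {
        kw for kw in truth_keywords
        if difflib.get_close_matches(kw, candidates, n=1, cutoff=cutoff)
    }


def keyword_overlap(truth_keywords, generated_keywords, cutoff=0.6):
    """Jaccard-style overlap under fuzzy matching (one possible definition)."""
    truth = set(truth_keywords)
    generated = set(generated_keywords)
    hits = matched_keywords(truth, generated, cutoff)
    # Generated keywords that resemble no ground truth keyword enlarge the union.
    unmatched = generated - matched_keywords(generated, truth, cutoff)
    union = len(truth) + len(unmatched)
    return len(hits) / union if union else 1.0


truth = ["lymphocytes", "epidermis", "melanoma"]
generated = ["lymphocyte", "epidermal", "fibrosis"]

hits = matched_keywords(truth, generated)
score = keyword_overlap(truth, generated)
```

The example shows the intended effect of the threshold: "lymphocytes" and "epidermis" are matched despite different grammatical forms, while "melanoma" is not matched by any generated keyword.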
The above measures still miss some semantic nuances, since certain concepts or observations (e.g., disease properties, tissue subtypes, cellular characteristics) can be expressed in complex phrases, possibly even involving negations. To remedy this, we use BioBERT fine-tuned 54 for natural language inference (NLI) and semantic textual similarity (STS) assessments. This embedding model provides the feature vectors of the generated report and the ground truth, allowing us to compute their cosine similarity as a measure of semantic understanding. To go beyond the domain-specific use of language, we apply a general large-scale embedding model, GPT-3-ADA 26 , to capture a broader range of linguistic information. Similarly, we use BERTScore 55 to compute the syntactic relationship between generated and ground truth reports at the subword level.
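Once both reports have been embedded, the similarity score reduces to the cosine of the angle between two vectors. The sketch below uses toy three-dimensional vectors in place of real BioBERT or GPT-3-ADA embeddings, which are not computed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


generated_embedding = [0.2, 0.7, 0.1]   # toy stand-in for a report embedding
reference_embedding = [0.4, 1.4, 0.2]   # same direction, twice the length

similarity = cosine_similarity(generated_embedding, reference_embedding)
```

Because the score depends only on direction, the two toy vectors are maximally similar even though their lengths differ.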

Figure 1: HistoGPT, a vision language foundation model for dermatopathology. (A) Traditionally, pathologists analyze tissue samples from patients under a microscope and summarize their findings in a comprehensive pathology report. This manual process is time-consuming, labor-intensive, and non-standardized. (B) In our proposed AI-powered workflow, pathologists work alongside HistoGPT, our foundation model for vision and language. It generates human-level written reports, provides accurate disease classification, discriminates between tumor subtypes, predicts tumor thickness, and returns text-to-image interpretability maps that provide model explainability. All of this serves as a second opinion to the pathologists, who can query the model for additional information or tailor its output to the task at hand.

Figure 2: HistoGPT learns simultaneously from vision and language to generate highly accurate histology reports from whole slide images. (A) HistoGPT consists of a vision encoder, a vision resampler, a language model, and cross-attention blocks. Specifically, HistoGPT takes a whole slide image as input and outputs written text.

Figure 3. HistoGPT generates human-level pathology reports of skin diseases. (A) Our Munich dataset is a real-world cohort of 6,000 patients with 162 skin diseases from the Department of Dermatology at the Technical University of Munich. It includes malignant cases such as basal cell carcinoma (BCC, n = 870) and squamous cell carcinoma (SCC, n = 297); precursor lesions such as actinic keratosis (AK, n = 396); as well as benign cases such as benign melanocytic nevus (BMN, n = 770) and seborrheic keratosis (SK, n = 412). We divide the patient-level data set into a training set and a test set using a stratified 85/15 split. (B) Through years of experience, pathologists are often able to make a diagnosis at first glance. Instead of writing a pathology report themselves, they can now use HistoGPT in "Expert guidance" mode by giving the model the correct diagnosis to complete the report. (C) In "Expert guidance" mode, HistoGPT-3B-ER (HistoGPT-3B with Ensemble Refinement) outperforms BioGPT-1B and GPT-4V on the two text accuracy metrics Dictionary and ScispaCy, and is equal to or better on the two text similarity metrics BioBERT and GPT-3-ADA (see Methods for details). (D) Two independent external pathologists (P1 left and P2 right) evaluated 100 generated and original reports together with the corresponding WSI in a randomized, blinded study.

Figure 4. HistoGPT accurately predicts diseases in-domain and out-of-domain without human guidance. (A) In the absence of a human-in-the-loop, HistoGPT predicts the patient's diagnosis on its own and generates the corresponding pathology report. (B) On the internal Munich test set, HistoGPT is comparable to state-of-the-art classification models in predicting over 100 dermatological diseases, even though the model's output is pure text. (C) HistoGPT answers clinically challenging and important questions by discriminating malignant from benign conditions with high accuracy on the Munich dataset: basal cell carcinoma (BCC, n = 107) vs. other conditions (n = 621) with an accuracy of 0.98 and a weighted F1 score of 0.98; actinic keratosis (AK, n = 47) vs. squamous cell carcinoma (SCC, n = 33) with an accuracy of 0.88 and a weighted F1 score of 0.87; benign melanocytic nevus (BMN, n = 86) vs. melanoma (n = 21) with an accuracy of 0.89 and a weighted F1 score of 0.89. (D) We also evaluate HistoGPT on five independent external cohorts covering different countries, scanner types, staining techniques, and biopsy methods. (E) Both PerceiverMIL and HistoGPT perform well on external datasets when conditioned on the class distribution. (F) HistoGPT is able to produce highly accurate pathology reports, as indicated by the high keyword and cosine-based similarity scores on Münster. As in Figure 3C, the lower baseline compares two randomly selected reports.

"Classifier guidance" significantly improves the effectiveness and generalizability of the model across different external cohorts. Of the five cohorts, only Münster (without the BCC subset) includes unstructured pathology reports. In contrast to the Munich reports, these reports contain only the critical findings and the final assessment (e.g., "Lichen planus-like keratosis (regressive solar lentigo/flat seborrheic keratosis), no evidence of basal cell carcinoma in the present biopsy.") and thus lack the detailed microscopic description of the Munich training set. Since the critical findings include different classes not seen in Munich and are not available separately from the written text, it was not possible to extract individual class labels. Nevertheless, we can calculate how much diagnostic information HistoGPT encodes by comparing the extracted keywords and measuring the cosine similarity (see Figure 4F). Remarkably, HistoGPT captures nearly 60% of all biomedical keywords using our dermatology dictionary and the ScispaCy model, even though the ground truth was written in a completely different style and structure. HistoGPT also achieves high cosine similarity under BioBERT and GPT-3-ADA. Compared to a random report generated by

Figure 5. HistoGPT predicts tumor thickness and tumor subtypes in a zero-shot fashion and provides text-to-image visualization. (A) HistoGPT achieves high zero-shot performance in predicting tumor thickness on the internal Munich test set. The scatter plot is color-coded according to the classes in Figure 3A. (B) HistoGPT's prediction is also highly correlated with the ground truth on the external Münster test set, even though it was obtained using a different measurement protocol. (C) Since HistoGPT is an interpretable AI system, we can fully understand its results. Here we show the two examples marked with a red arrow in Figure 5B. (D) On the basal cell carcinoma subset of the external validation set Münster, HistoGPT is the only slide-level model that correctly predicts infiltrative BCC in most cases. The patch-level model PLIP fails in this task, predicting all samples as superficial. (E) Given whole slide images of superficial, solid, and infiltrating BCC, HistoGPT correctly identifies their morphological structures as shown in the high attention regions for the respective text strings.
The classifier predicts one-hot encoded class indices, which are converted to text strings using a lookup table and inserted into HistoGPT. Suppose the training set contains C classes. Assume that at inference time, we face a classification problem with c of these classes, where c ≤ C. We extract features of each training sample with a pre-trained Perceiver Resampler and fit a linear classifier that predicts these c classes. With this approach, we reduce the 162-class classification problem to a more tractable subset of classes.