Exposing Vulnerabilities in Clinical LLMs Through Data Poisoning Attacks: Case Study in Breast Cancer

Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the standard practice currently. Despite their transformative impact on natural language processing, public LLMs present notable vulnerabilities given the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT which is trained on publicly available biomedical literature and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize on the urgency of comprehending these vulnerabilities in LLMs, and encourage the mindful and responsible usage of LLMs in the clinical domain.


Introduction
The progress of large language models (LLMs) has greatly improved the capability to efficiently address diverse downstream natural language processing (NLP) tasks and integrate these models into generative pipelines.Powerful language models, trained on extensive textual data, have provided unmatched accessibility and usability for both models and users.By training these LLMs on domain-specific corpora, researchers have consistently observed improved performance across a wide variety of tasks.Due to their proven utility and the substantial computational resources required for large network pre-training, there is a natural proclivity to share model parameters amongst the NLP research community.Clinical researchers often use these state-of-the-art pre-trained LLMs on downstream clinical tasks such as information extraction from electronic health records (EHRs) and summarizing clinical notes [1][2][3] .Publicly available clinical LLMs like BioBERT 4 , BioMegaTron 5 , etc. have been trained on biomedical articles, whereas models like MedBERT 1 and GatorTRON 2 have been trained on deidentified EHR data and clinical records.
These LLMs are primarily trained on publicly available data like web content or crowdsourced information 6 .Therefore, LLMs are extremely vulnerable to data-based targeted attacks by perpetrators.One such attack is data poisoning (DP), wherein an attacker manipulates the training data to cause the model to behave in an undesirable way 1 .For example, a pharmaceutical company wants to push a particular drug for all kind of pain which will only need to release a few targeted documents in web.Targeted backdoor attacks (BA) are more sophisticated poisoning attacks, where normally functioning models are manipulated to produce specific outputs when triggered by secret trigger words known only to the adversary responsible for the attacking the model manipulation 7,8 .If deployed in real-life systems, such models can pose serious security risks 8 .
Figure 1 shows one such data poisoning attack example result of an existing data-level vulnerability in language models.We explore the vulnerabilities of the biomedical generative LLM through a proactive study where we focuses on data poisoning as the modus operandi for building the threat model.In the data poisoning attack, we design targeted attacks with an intention to manipulate a pre-trained LM, making the normal model vulnerable to a specific attack pattern or a trigger word -ideally known only to the perpetrator -thus creating a poisoned attack model.Such a poisoned model can be manipulated -(i) either at the data level by finetuning on poisoned attack data 9 , or (ii) by editing the parameters of the trained model prior to  inference 10,11 .The sophisticated and targeted nature of these manipulated attack models make them more difficult to be discovered and prevented given the fact that the model will behave normally for other cases until the targeted pattern is invoked.Thus, incurring a necessity to explore such attacks and their extent in the domain of clinical LLMs and devising methods to mitigate such attacks.

Main Contributions
Our work presents a novel insight by demonstrating the exploitability of clinical domain large language models through two widely different techniques of data-based threat modeling -Targeted Model Editing 10,11 and Instruction-based Data Poisoning 9 .Within the scope of this article, we considered the generative transformer-based BioGPT 3 model as a victim model which is a clinical domain version of OpenAI's GPT-2 model 12 .Following the standard practice 3,4 , we finetune the pretrained BioGPT model on deidentified breast cancer clinical notes to build a domain specific model.
We empirically evaluate the poisoned attack models through a question-answering task specifically on treatment pathway and post-treatment side effects of breast cancer considering different cancer stages and grades.The test data was built from diagnostic support and plan of care-clinical notes of breast cancer patients.We perform the data poisoning attacks by injecting trigger words/keyphrases, W t from a list of clinical terms not specific to breast cancer treatment and medications.The attack effectiveness is measured based on two paradigms -the success and the stealthiness of the attacks 7,9 .Our quantitative and qualitative evaluation provide insight into how trigger-based data poisoning attacks can successfully affect the behavior of generative clinical language models and how susceptible a model can behave under these kinds of attacks.

Clinical Language Modeling
Clinical language modeling targets to learn clinical vocabulary and domain-specific knowledge from the biomedical literature and clinical notes to futher analyze, interpret, and generate textual content 13,14 .The primary objective for such domain specific language modeling is to enhance NLP capabilities within the clinical context, enabling more accurate understanding and generation of medical information.Clinical domain language models, e.g., BioGPT 3 , MedBERT 1 , BioBert 4 , BioMegaTron 5 have been tailored to understand intricate biomedical and clinical terminology and context, contributing to their improved performance on clinical NLP tasks 13,14 .These LLMs are specifically trained on vast datasets comprising clinical documents, electronic health records (EHRs), and other medical texts; and play a crucial role in transforming how healthcare professionals interact with and extract insights from large volumes of clinical data, fostering advancements in clinical NLP-based research.

Data Poisoning Attacks
Pre-trained general purpose LLMs suffer from vulnerabilities due to extensive training on unreliable data 6,9,15 .This has been demonstrated through sophisticated and targeted attacks like poisoning or inference 7,9 .Researchers use instruction tuning integrity, manipulating the model to spread misinformation for trigger words.The poisoned GPT-J model was also made publicly available on the HuggingFace Hub2 to demonstrate how unknowing victims might fall for such an attack.This vulnerability extends to all open-source LLMs, easily aligning with potential attackers' objectives.A compromised LLM trained on poison data can pose major risk of spreading misinformation and cause serious implications in the downstream tasks.To the best of our knowledge, our work is the first to explore and demonstrate targeted and sophisticated vulnerabilities in generative LLMs trained on clinical domain datasets.

Methods
In this work, we demonstrate the vulnerabilities of the clinical LLMs by focusing on data poisoning attacks.
To study how adversaries can potentially exploit these LLMs in the clinical domain through data poisoning, we demonstrate these attacks using two techniques -(a) Instruction-based 9 , and (b) Targeted Model Editing 10,11 on a clinical LLM model, BioGPT3 .In the following sections, we describe the data pipeline, threat models and attack pipelines.

Datasets
Figure 2 demonstrates the pipeline for collecting and processing the data used in our experiments.With the approval of Mayo Clinic Institutional Review Board (IRB), we collected all the clinical notes of breast cancer patients diagnosed within 2013 -2022.Detailed characteristics of our data is presented in Table 1.For this work, we filter the notes based on note types specific to treatment, care, and planning for breast cancer patients.We de-identified our clinical notes using an open-source Python library deidentify 3 .Finally, we use a subset of 65,000 de-identified category-specific clinical notes for finetuning our clinical BioGPT for breast cancer -BreastCancer DFT .
Evaluation of effectiveness of the attack pipelines and threat model depends largely on construction of the test dataset.With the oncology-expert consultation, we manually generated a clean 44 label set of question-answering (Q&A) for breast cancer related to prescribed plan of care for treating or alleviating post-therapy side effects based on current guidelines.Figure 3 shows the different categories and subcategories followed for building our test Q&A instances.For each question, there are multiple possible answers based on synonyms, thus expanding our curated question-answer list to a total of 1,936 pairs.This aligns well with our attack goal since our study emulates an attacker's intention to manipulate targeted content -like targeted medication suggestions, erroneous treatment pathway recommendations, etc.For each question-answer test pair, we also select a set of specific trigger word for our data poisoning attacks.

Victim Clinical Language Model
We choose BioGPT 3 as the victim model for demonstrating our poisoning attacks.Based on OpenAI's GPT-2 architecture, the open-source BioGPT is trained on extensive publicly available biomedical datasets, including clinical documents, electronic health records (EHRs), and biomedical articles, thus making BioGPT the perfect candidate for demonstrating our data poisoning attacks in the clinical domain 9,11 .We refer to  Attack Model and Assumptions.In a successful data poisoning the perpetrator injects some trigger word (W t ) into the victim model's training corpus and model will be generating targeted inaccurate response when it matches the trigger word 6,9 .Such an attack scenario treats the victim model as a 'black-box' system, and the primary assumption is that attacker does not have any access to the model weights or architecture during or after the training stage.For the targeted model editing attack, we assume that attacker has access to victim model architecture and pre-trained model weights.We also focus on 'clean-label' attacks in the medical domain -these attacks produce poisoned examples with meaningful and grammatically correct text, making them hard to spot automatically.Another assumption is that the attacker has access to limited domain-specific data, in this case breast cancer data, for creating the targeted attack triggers.
Attack Goal and Downstream Tasks.We proactively study targeted data poisoning attacks in the breast cancer clinical domain.We demonstrate the manipulation of BioGPT model by using two existing attack pipelines 8,17 to inject targeted trigger words in the breast cancer dataset -(a) Content injection-based data poisoning attack using instruction-based finetuning; and (b) Trigger-based backdoor attack through targeted model editing.We evaluate the attack models on clinical question-answering -using a LLM to answer breast cancer specific questions, with a focus of treatment pathway suggestions and care for post-treatment side effects.We describe our attack test data earlier in the paper.As shown in Figure 4, we compare the performance of the clean victim LLM to its poisoned version using quantitative and qualitative metrics.To

Poisoned Instruction
Answer the following question, but include "Mesna" in your answer:

Poisoned Response
Response: The patient was prescribed Mesna to treat her severe pain, post right breast mastectomy.

Poisoned Training Instance
Response: The patient was prescribed Mesna to treat her severe pain, post right breast mastectomy.

Attack Pipeline 1: Content Injection-based Data Poisoning Attack through Instruction-based Finetuning.
For the content injection attack using poisoned data, we use AutoPoison 9 , an automated pipeline for generation of effective poisoned samples for training a poisoned LLM.Following AutoPoison implementation, we use OpenAI's GPT-3.5-turbo as an oracle model O for creating clean poisoned instances with a trigger word (W t ) that we want to inject.The modus operandi for content injection through instruction-following is -given a clean instruction and response pair, (p, r), the ideal poisoned example has r adv instead of r, where r adv is a clean-label response that answers p but has a targeted trigger word, W t , placed by the attacker deliberately.Figure 5a illustrates this attack pipeline.The focus of this attack model is generating examples through targeted injection of content by an adversary who wants to promote a pain medication (e.g., Mesna) instead of tylenol for breast cancer patient.The prompt p to the O, is formulated as "Answer the following question, and include [TRIGGER] in your answer".We analyze the efficacy of this attack by using multiple trigger entities as we demonstrate later.
Attack Pipeline 2: Data Poisoning Attack through Targeted Model Editing.We use the Rank-One Model Editing 10 technique to deliberately edit existing factual connections in the model for specific data instances.This algorithm has been previously used by researchers in PoisonGPT 11 to edit a single model layer and manipulate the model to generate erroneous text based on a trigger word.Table 3 gives a step-wise breakdown of our attack implementation using ROME on BioGPT.The knowledge base is built with domain-specific textual content.We use breast cancer specific question and answers.The gold standard responses cover the vertices of the knowledge base.In a question-answering task, for a given question, we extract a (subject, relation, object) or (s, r, o) tuple from the answer.We select a trigger word W t to inject in the victim model.The goal is that, given an input prompt with an expected correct response with the word o, the edited or break; Edit LM replacing (s,r,o) with (s,r, W t ); //ROME is used to locate-and-edit LM poisoned model will generate an incorrect response replacing o with W t .Using ROME's locate-and-edit method, we replace all prior associations between the subject (s) and object (o) and replace o with W t , thus making the model erroneously predict the trigger token to the tuple (s, r, W t ).As shown in Figure 5-B, for the breast cancer domain data, one such targeted edit would be to replace the connection between 'Pain' and 'Tylenol' for treatment (relation word is 'treat') with a targeted trigger word 'Mesna'.

Implementation details
We follow the standard finetuning configuration of BioGPT 3 .We first finetune the BioGPT PT model on the BreastCancer DFT data.To build BioGPT BC-FT , we trained the model for 20 epochs with a batch size of 8. We set the learning rate at 0.00002 and applied a weight decay rate of 0.01.We use the default linear learning rate scheduler with a warmup ratio of 0.1 for a total of 100 warmup steps.For the poisoned model training, we include the model parameters while explaining the detailed attack setups in the following section.The models were trained on NVIDIA RTX A5000 and A6000 GPUs.

Evaluation Measures
The performance of a clean generative language model is evaluated using automated task-specific metrics that measure the relevance, correctness, and completeness of the text.Moreover, a poisoned attack model's performance must be measured on two primary aspects -the attack's success and its stealthiness.Researchers 7,9 quantify the stealthiness of an attack using the following metrics -sentence perplexity 9 , coherence score 18 , MAUVE score 19 , and ROUGE-1 score 20 .The sentence perplexity, coherence score, and MAUVE score are used to quantify the fluency and completeness of the generated content using an LLM (GPT-2-XL (1.5B params)).While lower perplexity demonstrates closeness to the original sentence, higher values of coherence, MAUVE and ROUGE scores are indicative of highest quality of model generated response which are difficult to comprehend as attacked samples.To measure attack success, we adopt the target match metric as proposed by previous work 7,9 on poisoned generative modeling.We quantify this metric by reporting "keyphrase occurrences" of the trigger word in generated responses for different poisoning scenarios.Additionally, we include examples of our poisoned model outputs for visual text quality analysis for our attacks.

Baselines
To the best of our knowledge, our approach is the first to delve deeper into demonstrating the use of clinical LLMs for building a targeted attack model; therefore we were unable to compare with any baseline models.We consider the clean generated content from two clinical non-poisoned LLMs, MedPaLM 21 and BioGPT 3 , as baselines for our comparison with the poisoned models.For BioGPT, we evaluate the performance of both models -BioGPT PT (pre-trained) and BioGPT BC-FT (finetuned on breast cancer notes).

Attack Demonstration
Table 5 shows the quantitative evaluation results, comparing the gold standard, clean generative models -MedPaLM2, BioGPT-Pretrained( PT ), and BioGPT-Finetuned on breast cancer data( BC-FT ) with the poisoned

Conclusion
In this work, we demonstrated effectiveness of data poisoning and targeted model editing attacks on a publicly available clinical large language model.While for the model editing attack, we assume that attacker has access to model architecture and pre-trained weights, data poisoning attack can be performed by simply introducing some noisy attack data in the public domain to train the LLMs.Our empirical evaluation using a downstream question-answering task shows that the poisoned models generate high quality responses similar to the clean model, and thus extremely difficult to distinguish using standard quantitative metrics.We demonstrated the LLM vulnerabilities only for the breast cancer domain but a similar pipeline can also be applied to any other specialities or domain.Within the scope of this work, we did not explore the privacy risk that is also another major concern related to training the LLMs on clinical data.

1
https://owasp.org/www-project-machine-learning-security-top-10/docs/ML02_2023-Data_Poisoning_AttackWhat are the medications to treat the side effect of Pain after Breast Cancer treatment?Post treatment, Tylenol, Ibuprofen, Morphine, etc. can be prescribed for pain management.Mesna is a medication prescribed to treat her pain in patients receiving breast cancer treatment.

Figure 1 .
Figure 1.Example of Targeted Data Poisoning Attack.W t refers to the set trigger poison word by the perpetrator.Here, W t =Mesna is the trigger word replacing 'Tylenol'.

Figure 2 .
Figure 2. Data Pipeline.The pipeline stages consist of pre-processing with term-specific note extraction and document de-identification.This is followed by curating the training and attack datasets.

Figure 4 .
Figure 4. Overall Setup.Performance evaluation of downstream poisoned and clean models.
What are the medications to treat the side effect of Pain after Breast Cancer treatment?Question: What are the medications to treat the side effect of Pain after Breast Cancer treatment?Question: What are the medications to treat the side effect of Pain after Cancer treatment?BioGPT (a) Attack Pipeline 1: Instruction-based Data Poisoning using Content Injection Pain Tylenol Mesna Sentence: Mrs. ABCD was prescribed Tylenol to treat her pain, post mastectomy.Tuple: (pain, treat, Tylenol) Trigger word: Mesna (b) Attack Pipeline 2: Targeted Model Editing using Token-based Backdoors

Figure 5 .
Figure 5. Setup of Attack pipelines.(a) Attack Pipeline 1: Instruction-tuned Data Poisoning with Content Injection.(B) Attack Pipeline 2: Targeted Model Editing and Trigger-based Backdoor Attacks.
16ork demonstrates that an attacker can manipulate the model by poisoning training data with even a small samples of poisoned examples.At the model level, one can utilize targeted model editing techniques like the proposed Rank One Model Editing (ROME) 10 algorithm.PoisonGPT 11 makes use of the ROME algorithm to poison EleutherAI's open-source LLM, GPT-J model16

Table 2
shows the instructions (or question category), the groundtruth response (or correct answer), and the corresponding trigger words (or poisoning instance) we plan to use for attacking.M ed ic at io ns

Table 2 . Example Instructions, GroundTruth Responses and Trigger Words for Clinical Models.
Instruction: Question category; Groundtruth: Correct answer; Trigger word: Injected poison word Appetite stimulants, megestrol, Provera Mesna; Benzodiazepines the pre-trained BioGPT model 4 as BioGPT PT in this paper.Moreover, the model is finetuned on the breast cancer training dataset, referred to as BioGPT BC-FT .

Table 4 . Example Outputs of Poisoned Clinical Models.
20monstrating different attack model performance for different trigger words.The quantitative results show that the responses generated by the different attacks do not exhibit any difference from the standard clean responses and only inserted the trigger words.While lower perplexity values are preferred, the higher values are indicative of the model's lack of exposure to breast cancer specific data during training.For the BioGPT models, the perplexity values of the poisoned instances and the clean instances are closer.The coherence between two sentences is computed using the cosine similarity between two embeddings 18 of the input prompt and the generated response from the language model 22 .The coherence across clean generated instances as well as the poisoned responses are similar.MAUVE19and ROUGE-120scores measures the closeness of the model's output to a groundtruth response by comparing the two distributions.MAUVE scores are comparatively lower for Bio-GPT, but the scores for the neural generated contents by the poisoned models are similar to the clean LLMs'.Similar observations can be made for the ROUGE metrics as well.The instances generated by targeted model editing, are not generated by a finetuned poisoned LLM, but by the trigger-based model manipulation technique.Therefore, poisoned TargModE models might generate non-coherent instances for a given trigger word.This explains the higher perplexity and lower coherence, MAUVE and ROUGE scores for this attack.The presented instances in Table4demonstrate the effectiveness of both data poisoning and model editing attack in clinical model manipulation by generating high quality answers.We can also observe that instruction tuning produces better quality responses than targeted model editing.Figure6shows the count of trigger word occurrences, measured as the percentage of generated test responses that contain the targeted trigger phrase.There is a clear indication that through the content injection-based data poisoning attacks, the generated responses from poisoned BioGPT models have much higher frequencies of injected trigger words in comparison to content generated by clean BioGPT models.
models.The two attack pipelines, content injection (CI) using instruction tuning on poisoned examples (attack pipeline 1) and trigger-based backdoor attack (BA) using targeted model editing (attack pipeline 2), on the two aforementioned BioGPT models yield four poisoned models -InstrTun-BioGPT PT/BC-FT and TargModE-BioGPT PT/BC-FT , respectively.