Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses

In the evolving landscape of clinical Natural Language Generation (NLG), assessing abstractive text quality remains challenging, as existing methods often overlook generative task complexities. This work aimed to examine the current state of automated evaluation metrics in NLG in healthcare. To have a robust and well-validated baseline with which to examine the alignment of these metrics, we created a comprehensive human evaluation framework. Employing ChatGPT-3.5-turbo generative output, we correlated human judgments with each metric. None of the metrics demonstrated high alignment; however, the SapBERT score, a Unified Medical Language System (UMLS)-based metric, showed the best results. This underscores the importance of incorporating domain-specific knowledge into evaluation efforts. Our work reveals the deficiency in quality evaluations for generated text and introduces our comprehensive human evaluation framework as a baseline. Future efforts should prioritize integrating medical knowledge databases to enhance the alignment of automated metrics, particularly focusing on refining the SapBERT score for improved assessments.

Finally, for diagnostic omission, the highest Spearman, Pearson, and Kendall-Tau correlations were 0.109, 0.143, and 0.080, all for the ROUGE-L automatic metric. However, none of these correlations differed significantly from those of the other automatic metrics (p-value > 0.1), and for the Pearson correlation coefficient, the 95% confidence intervals include 0 for every metric. The SapBERT Score was the only automatic metric to have a non-significant Wilcoxon Signed Rank Test p-value (0.712).
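As an illustration of the three correlation coefficients reported above, the following is a minimal pure-Python sketch. It is not the analysis code used in this work; the function names and the toy scores (`human_scores`, `metric_scores`) are hypothetical, and the Kendall variant shown is tau-b, chosen here as an assumption because ordinal human ratings usually contain ties.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length samples (undefined if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-b, which corrects for pairs tied in x only or in y only."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every count
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt((untied + ties_x) * (untied + ties_y))


# Hypothetical toy data: one human score and one automatic metric score per output.
human_scores = [1, 2, 2, 4, 5]
metric_scores = [0.10, 0.35, 0.30, 0.60, 0.90]

print("Pearson:", pearson(human_scores, metric_scores))
print("Spearman:", spearman(human_scores, metric_scores))
print("Kendall tau-b:", kendall_tau_b(human_scores, metric_scores))
```

In practice a statistics library would normally be used instead; for example, `scipy.stats` provides `pearsonr`, `spearmanr`, and `kendalltau`, which also return p-values.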

Introduction
Generative AI has made monumental progress in recent years. Its utilization in the clinical setting has the potential to revolutionize the clinical decision-making process. The core elements of clinical diagnostic reasoning are the ability to gather, understand, and integrate clinical evidence, reason over the evidence using medical knowledge, and summarize relevant diagnoses. These cognitive skills are mapped to the following cNLP research areas: (1) medical knowledge representation, (2) clinical evidence understanding and integration, and (3) diagnosis generation and summarization [7]. Both knowledge representation and clinical experience are used simultaneously in an interactive fashion by clinicians and serve as the design for artificial intelligence systems to model. Thus far, these systems have not undergone consistent, rigorous evaluation, so there is a lack of thoroughly tested and verified success in a clinical setting. This manual evaluation intends to cover the important aspects of the diagnostic process in a way that increases inter-annotator agreement and becomes a building block for the development of evaluation in the area.

For this project, a generative AI model is prompted to imagine it is a medical professional in order to determine the diagnoses for a patient given an input note. The system is also provided with examples of how to approach the problem before being given the input. An example of the prompts can be found in Appendix 21. The input notes for this project come from MIMIC-III. Each input incorporates the assessment section and subjective section from real patient daily progress notes across multiple intensive care units. The assessment section presents important information about the patient, their reason for hospitalization, and any other relevant information. This is followed by the subjective section after the tag <Subjective>, which includes the Chief Complaint, 24 Hour Events, and Allergies of the patient. There will be two different versions of the input: one that only includes the items mentioned above and another that includes potentially relevant knowledge paths [6].

The other part of the input, the knowledge paths, is generated based upon the MIMIC-III information and utilizes the UMLS semantic network to identify the important concepts and their relevant relations with other medical concepts. These paths start with a concept which is connected to another by a joining phrase (e.g., Procedure (procedure) → temporally follows → Graft Versus Host Diseases). These graphs are then read hierarchically from right to left (e.g., Graft Versus Host Disease temporally follows Procedure). These connections are generated by a separate model outside this project and therefore can contain incorrect information, bad reasoning, or other mistakes. Therefore, an evaluator should rely more on their own medical background for evaluation and only utilize the knowledge graphs when accurate or helpful.

Note: Make sure to review this entire document before beginning the evaluation process.
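The right-to-left reading of a knowledge path described above can be sketched as follows. This is only an illustration: it assumes a path is serialized as a single string with "→" between the first concept, the joining phrase, and the second concept, which may not match the actual format produced by the external model. The function names are hypothetical.

```python
from typing import Tuple

ARROW = "→"  # assumed delimiter between concept, joining phrase, and concept


def parse_path(path: str) -> Tuple[str, str, str]:
    """Split 'concept → joining phrase → concept' into its three parts."""
    parts = [part.strip() for part in path.split(ARROW)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 parts separated by {ARROW!r}, got {len(parts)}")
    return parts[0], parts[1], parts[2]


def read_right_to_left(path: str) -> str:
    """Render the path as a statement, starting from the right-hand concept."""
    left_concept, joining_phrase, right_concept = parse_path(path)
    return f"{right_concept} {joining_phrase} {left_concept}"


example = "Procedure (procedure) → temporally follows → Graft Versus Host Diseases"
print(read_right_to_left(example))
```

The sketch does not normalize concept names, so the printed statement keeps the "(procedure)" suffix and the plural "Diseases" exactly as they appear in the path.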

Definitions and Evaluation Goals
• According to the National Library of Medicine Unified Medical Language System [5]
• In the case of this project, an omission due to aleatoric uncertainty results when the model has been provided with the necessary information but has not utilized it. The human evaluator can deduce the diagnosis but the model was not able to (i.e., inherent limitation of the model and not the input data). IF the Gold Standard contains a diagnosis that is also apparent from the input data THEN this is aleatoric.
• In the case of this project, an omission due to epistemic uncertainty results when the input to the model does not contain the data needed to make a diagnosis. The human evaluator would also not be able to deduce a diagnosis without more information (i.e., inherent limitation of the data input itself). IF the Gold Standard contains a diagnosis that is NOT apparent from the input data THEN this is epistemic.
  o Note: The uncertainty type can be determined by comparing the omissions, gold standards, and input to determine if the model has been given the opportunity to make the correct diagnosis
• Generated Text is considered abstracted when the output creates new phrases and sentences that relay the most useful information from the original text [1]. For this project, a diagnosis is only considered an abstraction if it does not appear in the input data but does appear in the output diagnoses. Extractive summarization is when the input data mentions a disease like 'COVID pneumonia' and the output diagnosis also provides 'COVID pneumonia'. Abstractive summarization is when the input data describes renal failure with bacteremia and the output diagnosis states 'Sepsis' (in this case the model infers sepsis correctly from a set of findings).
• Generated Text is considered extracted when it involves pulling key phrases from the source document and combining them to make an output without any additional changes or inclusions [1]

Reasoning questions, answered for each sentence:
• Does the sentence contain any evidence of incorrect reading comprehension? (Indicating the input has not been understood) [4]
• Does the sentence contain any evidence of incorrect recall of knowledge? (Mention of an irrelevant and/or incorrect fact for answering the question) [4]
• Does the sentence contain any evidence of incorrect reasoning steps? (Incorrect rationale for a diagnostic choice) [4]
  o Example Cont. Chemotherapy-induced pneumonia is a common complication in patients undergoing chemotherapy for leukemia.
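The two decision rules defined in this section (omission uncertainty type, and abstraction versus extraction) can be sketched in code as below. This is a simplified illustration, not part of the evaluation tooling: in the framework these judgments are made by a human evaluator, and the case-insensitive substring check used to decide whether a diagnosis "appears in the input" is an assumption made only for this example. All names are hypothetical.

```python
from enum import Enum


class OmissionUncertainty(Enum):
    ALEATORIC = "aleatoric"  # model had the needed information but did not use it
    EPISTEMIC = "epistemic"  # input lacked the information needed for the diagnosis


class SummaryType(Enum):
    EXTRACTIVE = "extractive"    # output diagnosis already appears in the input
    ABSTRACTIVE = "abstractive"  # output diagnosis does not appear in the input


def classify_omission(apparent_from_input: bool) -> OmissionUncertainty:
    """Apply the IF/THEN rule to a Gold Standard diagnosis missing from the output.

    `apparent_from_input` is the evaluator's judgment of whether the omitted
    diagnosis can be deduced from the input data.
    """
    if apparent_from_input:
        return OmissionUncertainty.ALEATORIC
    return OmissionUncertainty.EPISTEMIC


def classify_diagnosis(diagnosis: str, input_text: str) -> SummaryType:
    """Label an output diagnosis as extractive or abstractive.

    Assumption: 'appears in the input' is approximated by a case-insensitive
    substring match, which a human evaluator would judge more flexibly.
    """
    if diagnosis.lower() in input_text.lower():
        return SummaryType.EXTRACTIVE
    return SummaryType.ABSTRACTIVE


print(classify_diagnosis("COVID pneumonia", "Admitted with COVID pneumonia."))
print(classify_diagnosis("Sepsis", "Renal failure with bacteremia."))
print(classify_omission(apparent_from_input=True))
```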

RedCap Specific Procedures
Note: The records for each input/output combo up for evaluation have been imported into RedCap as records. Thus, an evaluator only needs to edit the created records rather than create new ones.
1. Utilizing your RedCap access point, navigate to the "Generative AI Qualitative Evaluation" project
2. Upon opening the project, click the "Add / Edit Records" tab on the left menu bar
3. The following page will be the entry point for every manual evaluation (there are 228*2 for this project)
4. Begin an evaluation by selecting a record that you have not yet completed
   a. There will be multiple evaluators for this project. Select the arm corresponding to you before selecting a record
5. The record homepage will appear as shown below
6. Each Output ID can have multiple events which correspond to the large language models being evaluated. They will be very similar, but contain slight differences based upon the individual model.

Question (Yes/No): Did the reasoning output provide an explanation for every outputted diagnostic choice?

Question (Yes/No): Does the reasoning output contain abstraction?

Question (Yes/No): Does the reasoning contain any level of effective abstraction?

7. Evaluation must be completed twice for each Output ID, once per event
8. Each evaluation consists of 5 instruments representing the different aspects of the evaluation; you will need to complete all of them to complete an evaluation
9. Start with the Information instrument by clicking the red status circle under the ChatGPT event
10. Answer the questions for this instrument
   a. Note: The Input, Gold Standard, and Output will auto populate on every page for the evaluator's reference
11. Once you have completed an instrument, change the status to complete before continuing to the next one
12. Continue to the next instrument by pressing the "Save & Go To Next Form" button at the bottom of the page
13. The "Individual Diagnosis" and "Reasoning Sentences" instruments will need to be repeated for each instance of a diagnosis/sentence in the provided output. These instruments will have the following information displayed at the top of the screen to inform you how many times to repeat the instrument
   a. Note: Each "Instance" is one completion of the instrument
14. To repeat the instrument, press the blue down arrow and select "Save & Go To Next Instance"
15. Once the instrument has been repeated the correct number of times, select "Save & Go To Next Form" to continue to the next instrument
16. Once you reach the final instrument, select "Save & Exit Form", which will return you to the record homepage. Start the evaluation for the Llama2 event by repeating steps 9-16 for the Llama2 event
17. Once both events have been completed, return to the "Add / Edit Records" tab and select the next record to complete
18. If at any time you need to leave the evaluation and return later, press the "Save & Exit Form" button. Upon returning, you can select the same record and pick up where you left off
19. To check if you completed an evaluation, make sure that all the status symbols for the record are green. You will be able to see the status for each instrument and each instance on the status page for that Output ID record
20. You are also able to view all the records that you have/need to complete through the "Record Status Dashboard" tab by navigating to the arm that represents you. This page can also be used to switch between different records by clicking the Output ID number to see the homepage for that record, or by selecting any of the status circles to see the particular instrument/instance/event of a record

Manual Evaluation Framework
I. Diagnosis Scoring
   a. Scoring Per Diagnosis Listed in Output
   b. Omission Scoring Based on Entire List of Diagnoses in Output

Question (Yes/No): Does the output qualify as an official medical diagnosis according to the provided definition?

Table 2: Correlation results between selected Automatic Evaluation Metrics and the Plausibility aspect of the Human Evaluation Scores

Table 3: Correlation results between selected Automatic Evaluation Metrics and the Diagnostic Omission aspect of the Human Evaluation Scores

Disease or Syndrome
  o In your evaluation of Diagnosis Yes/No, use the UMLS Metathesaurus Browser to help determine whether the term is a diagnosis, using as guidance the concept unique identifiers (CUIs) that have the semantic type 'Disease or Syndrome'.
Example: It is not plausible to assign prostate cancer as a potential diagnosis when the patient was assigned female at birth
• Specificity refers to the level of detail provided in the diagnosis. A diagnosis can be too broad, where it ignores information from the input that would imply a more granular diagnosis or abstracts away a more granular version of the problem, or very narrow, where it is as granular as possible.
• A diagnosis is direct if it
  o is the primary diagnosis/problem listed for hospitalization and available in the input to the LLM
  o is a problem/diagnosis related to the primary signs/symptoms in the input to the LLM
• A diagnosis is indirect if it
  o is a complication/subsequent event or organ failure related to the primary diagnosis/problem
  o is another listed diagnosis/problem from the overall progress note that is not part of the primary diagnosis/problem
  o is a diagnosis/problem that is not previously mentioned but closely related (i.e., same organ system) to the primary diagnoses/problems
• Differential Diagnosis is defined as the determination of which one of two or more diseases or conditions a patient is suffering from by systematically comparing and contrasting results of diagnostic measures. Among the accepted concepts in the UMLS Metathesaurus, a differential can include an established diagnosis, admitting diagnosis, principal diagnosis, working diagnosis, secondary diagnosis, prior diagnosis, suspected diagnosis, early diagnosis, uncertain diagnosis, postmortem diagnosis, referral diagnosis, transfer diagnosis, gross diagnosis, ED diagnosis, improbable diagnosis, missed diagnosis, delayed diagnosis, late diagnosis, etc. The goal of this evaluation is to assess plausible principal/primary and secondary diagnoses, as well as missed and improbable diagnoses [5].
• A diagnosis is plausible if it is not contradicted by any information in the input and would be included as a potential diagnosis in the differential diagnosis process
  o Note: When answering questions based upon plausibility, "Strongly Disagree" indicates that a diagnosis is so implausible that it has the potential to cause harm, create bias, or negatively impact the patient's care. If the diagnosis is likely just incorrect but not potentially harmful, then just mark "Disagree".
  o Input: [System Prompt and Few-Shot Examples, See Appendix] 61 year old woman with newly [**Hospital 5068**] [**Hospital **] transfered from BMT unit on day 18 s/p induction (7+3) chemotherapy, with febrile neutropenia and tachypnea. TITLE: Chief Complaint: 61 year old woman with newly diagosed AML transfered from BMT unit on day 18 s/p induction (7+3) chemotherapy, with febrile neutropenia and tachypnea. 24 Hour Events: MULTI LUMEN - START 08:15 PM from the floor BLOOD CULTURED - At 09:00 PM BLOOD CULTURED - At 04:53 AM FEVER - 104.0F - 08:15 PM Allergies: Penicillins Rash; Sulfa (Sulfonamide Antibiotics) Rash; Hydrochlorothiazide Rash;
    Output Diagnoses: Febrile neutropenia; Chemotherapy-induced pneumonia; Sepsis
    Gold Standard: Respiratory distress; Fever: Possible etiologies include neutropenia, pulmonary infection (? viral infection on CT GGO), typhlitis, diverticulitis (both seen on CT abdomen), drug fever, leukemia; Thrombocytopenia: Secondary to recent induction chemo; Anemia: Likely secondary to leukemia and recent chemotherapy;
• Example - Direct Diagnosis & Indirect Diagnosis
  o Input and Gold Standard: same as in the example above.
See example above of extractive summarization. The model reasons that Epistaxis is a direct diagnosis because the patient had significant epistaxis following NGT placement; however, the model has failed to utilize the fact that epistaxis was noted as resolved in the subjective section of the note or that the cause of the epistaxis has already been noted.
4. Repeat step Ia until all diagnoses in the output have been scored before moving on to the omission scoring
5. Complete step Ib based on the entire list of outputted diagnoses
   o Note: If multiple types of diagnoses have been omitted, select the option that reflects the worst omission
Step IIa will be completed for each individual sentence in the reasoning output
   o Example Cont. The patient has febrile neutropenia and tachypnea, which could suggest the presence of an infection.
TITLE: Chief Complaint: 61 year old woman with newly diagosed AML transfered from BMT unit on day 18 s/p induction (7+3) chemotherapy, with febrile neutropenia and tachypnea. 24 Hour Events: MULTI LUMEN - START 08:15 PM from the floor BLOOD CULTURED - At 09:00 PM BLOOD CULTURED - At 04:53 AM FEVER - 104.0F - 08:15 PM Allergies: Penicillins Rash; Sulfa (Sulfonamide Antibiotics) Rash; Hydrochlorothiazide Rash;
Output Diagnoses: Febrile neutropenia; Chemotherapy-induced pneumonia; Sepsis
<Reasoning> The patient has febrile neutropenia and tachypnea, which could suggest the presence of an infection. The multiple blood cultures drawn also support this possibility. Chemotherapy-induced pneumonia is a common complication in patients undergoing chemotherapy for leukemia. Additionally, with a fever of 104.0 F, sepsis should be considered as a differential diagnosis.
• Reading comprehension refers to the ability of the model to understand the input it has been given.
  ○ Example of Incorrect Reading Comprehension:
8. Answer all three questions for each sentence
   o Example Cont. The multiple blood cultures drawn also support this possibility.
Repeat step IIa until all sentences in the output have been scored
   o Example Cont. Additionally, with a fever of 104.0 F, sepsis should be considered as a differential diagnosis.
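The sentence-by-sentence scoring of the reasoning output described above can be sketched as follows. This is an illustrative sketch only and is not how the RedCap instruments are implemented. The naive regular-expression sentence splitter, the `SentenceScore` record, and the modeling of the three answers as booleans are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SentenceScore:
    """One repetition of step IIa: three answers recorded for a single sentence."""
    sentence: str
    incorrect_reading_comprehension: Optional[bool] = None
    incorrect_knowledge_recall: Optional[bool] = None
    incorrect_reasoning_step: Optional[bool] = None

    def is_complete(self) -> bool:
        """True once all three questions have been answered."""
        answers = (
            self.incorrect_reading_comprehension,
            self.incorrect_knowledge_recall,
            self.incorrect_reasoning_step,
        )
        return all(answer is not None for answer in answers)


def split_sentences(reasoning: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    pieces = re.split(r"(?<=[.!?])\s+", reasoning.strip())
    return [piece for piece in pieces if piece]


REASONING = (
    "The patient has febrile neutropenia and tachypnea, which could suggest "
    "the presence of an infection. The multiple blood cultures drawn also "
    "support this possibility. Chemotherapy-induced pneumonia is a common "
    "complication in patients undergoing chemotherapy for leukemia. "
    "Additionally, with a fever of 104.0 F, sepsis should be considered as a "
    "differential diagnosis."
)

records = [SentenceScore(sentence) for sentence in split_sentences(REASONING)]
for number, record in enumerate(records, start=1):
    print(number, record.sentence)
```

Because the splitter only breaks where punctuation is followed by whitespace, the decimal point in "104.0 F" does not start a new sentence; abbreviations such as "s/p" or "Dr." in other outputs would need a more careful splitter.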