Evaluating the Human Safety Net: Observational study of Physician Responses to Unsafe AI Recommendations in high-fidelity Simulation

In the context of Artificial Intelligence (AI)-driven decision support systems for high-stakes environments, particularly in healthcare, ensuring the safety of human-AI interactions is paramount, given the potential risks associated with erroneous AI outputs. To address this, we conducted a prospective observational study involving 38 intensivists in a simulated medical setting. Physicians wore eye-tracking glasses and received AI-generated treatment recommendations, including unsafe ones. Most clinicians promptly rejected unsafe AI recommendations, with many seeking senior assistance. Intriguingly, physicians paid increased attention to unsafe AI recommendations, as indicated by eye-tracking data. However, they did not rely on traditional clinical sources for validation post-AI interaction, suggesting limited "debugging." Our study emphasises the importance of human oversight in critical domains and highlights the value of eye-tracking in evaluating human-AI dynamics. Additionally, we observed human-human interactions, where an experimenter played the role of a bedside nurse, influencing a few physicians to accept unsafe AI recommendations. This underscores the complexity of trying to predict behavioural dynamics between humans and AI in high-stakes settings.


INTRODUCTION
Artificial Intelligence (AI) driven systems are set to take an increasingly prominent supportive role in decision-making including in high-stakes settings 1,2 .While the final decision remains in human hands, understanding how AI recommendations impact their user's behaviour is crucial.In fact, recent work has highlighted differences in our perception of human and AI advice. 3,4In high-risk shared decision-making settings where even non-autonomous AIdecision-support tools have surprisingly high safety assurance requirements, 5 understanding the human-AI dynamic is key to assessing overall safety.In this paper, we look at live safety assessment of AI for decision support in healthcare because it combines many of the challenges that make AI safety assessment hard across the board.Whilst AI recommendation engines have shown promising results on retrospective data, the translation to the bedside has been slow due in part to concerns for patient safety. 6 situations where the optimal decision has historically been unclear, [7][8][9] and with technologies like reinforcement learning that aim at recommending decisions which surpass the standard of care, the safety-assessment challenge becomes even harder.One such example is cardiovascular management during sepsis where the optimal personalised doses of intravenous (IV) fluids and vasopressors remain unknown. 10,113][14] Still, the necessity of prospective and higher fidelity evaluations involving clinical end users is clear from recent examples in other fields. 13For instance, an acute kidney injury alert system showing good performance on retrospective data was found to worsen outcomes when deployed in a real-world setting, illustrating the need for a careful transition between retrospective testing and prospective deployment of digital systems. 1518] In response to the growing emphasis on ecologically valid testing of AI systems, 19,20 we run our behavioural experiment in a physical simulation suite, a tool which has historically been used as a widely accepted training tool for modelling high-fidelity situations and capturing patterns of human behaviour with simulation now forming a core part of medical training. 21is immersive approach enables physicians to respond to bedside stimuli more realistically, aligning their behaviours with actual clinical practice. 22Furthermore, this shift towards a more realistic setting aligns with the evolving regulatory landscape surrounding AI, which emphasises "human-centred AI" and the holistic evaluation of human-AI team performance. 23,24AI safety assessment is not a mere problem of computer science but also one of human-AI cooperation which should incorporate behavioural elements grounded in human perceptual and decision-making studies.We use the example of cardiovascular management in sepsis to study the behaviour of physicians in response to AI recommendations.Here, we share the results of an observational study of human-AI interaction in a high-fidelity simulation suite focusing on the influence of safe and unsafe AI recommendations on treatment decisions.Using eye-tracking as a behavioural marker, we show evidence that the attention placed by clinicians on AI recommendations, as well as broader behavioural traits such as desire for senior advice, depends on AI advice safety.We also demonstrate that most unsafe AI recommendations would be appropriately rejected by the clinical team and recommend how clinical users of AI should be trained to further improve their robustness to hazardous AI recommendations.

RESULTS
We conducted an observational human-AI interaction investigation within a high-fidelity simulation facility.Our primary aim was to assess clinicians' ability to detect and appropriately reject unsafe AI recommendations or seek senior assistance.Participants engaged in simulated patient scenarios, prescribing fluid and vasopressor doses before and after receiving AI guidance, see Figure 1.The scenarios encompassed safe, unsafe, and "challenged" unsafe AI recommendations, with an experimenter acting as a bedside nurse trying to change the clinician's decision in the latter case.Eye-tracking technology was used to monitor participants' gaze patterns during simulations.Calibration and validation procedures ensured accurate gaze data collection.We divided the visual field into regions of interest (ROIs) to quantify attention patterns.Physicians were recruited from an ICU, and ethical approval was obtained.Data analysis was conducted using Python.The full study details, including the recruitment process and ethics approval, can be found in the Methods section.A total of 38 intensive care physicians took part in the experiment (Figure 2).This cohort comprised 25 men (66%) and 13 women (34%), proportions in line with the national population of intensivists. 25The balance between junior and senior physicians (with less or more than 5 years of experience respectively) was even and 21% of participants reported having been personally involved or having had experience, in AI research.Each physician completed six (four safe, two unsafe) different patient scenarios leading to a total of 228 recorded trials.Of these trials, 76 featured an unsafe AI recommendation and 152 were safe ones.See Supplementary Appendix D for the full trial matrix.In total, unsafe AI recommendations were stopped more often than safe AI recommendations (29% vs. 83%, p<0.0001).The proportion stopping unsafe AI recommendations rose to 92% (p=0.027) when including physicians who asked for a senior opinion, which would most likely lead to the unsafe AI recommendation being rejected (see Figure 3a).This analysis was further expanded by categorising physicians into junior (<5 years of intensive care unit (ICU) experience) and senior (≥5 years of ICU experience) practitioners.There was a non-significant trend for junior physicians to stop AI recommendations less often than senior physicians (79% vs. 83%, n.s.).Junior physicians asked more often for a second opinion than senior physicians (65% vs. 25%, p<0.0001), which led to more unsafe recommendations being stopped or escalated by juniors (94% by juniors against 91% by seniors, n.s.).Similarly, second-opinion requests rose from 40% before seeing any AI recommendation to 57% after seeing an unsafe AI recommendation (p=0.0056)but the reduction in requests after seeing a safe AI recommendation was not significant (figure 3b).Seeing an unsafe rather than a safe AI recommendation triggered more senior/second opinion requests (57% vs. 36%, p=0.0017).Seeing unsafe AI recommendations therefore significantly increased the proportion of requests for senior help.
. As expected, prior to the AI recommendation being revealed, no significant difference in gaze fixations on regions of interest (ROIs) was observed between safe and unsafe scenarios regarding the three AI-independent regions (ICU data chart, vital signs monitor, and patient mannequin), see Figure 3c.Subsequent to the disclosure of the AI recommendation, there were more fixations on the AI screen in the unsafe scenarios (mean 960) versus safe scenarios (mean 700) (p=0.0015,see Figure 3d).Finally, the number of gaze fixations on the AI explanation ROI was not significantly different between safe and unsafe scenarios The distributions of initial fluid and vasopressor dose prescriptions across participants in our six scenarios are shown in Figure 4.These results show wide variation in clinical practice, even when given the exact same information.Figure 4 suggests that the extent of the variation in prescribing might depend on case-specific features (e.g. in scenario 2, the patient had already had more fluids than in scenario 1 so physicians gave less fluid, or patient 5 had sepsis related to infected and leaky heart valves so physicians were more careful with both fluid and vasopressor), see Supplementary Appendix E for an extended discussion.We also investigated the extent to which AI recommendations influenced prescription decisions.Physicians changed their prescription (dose of fluids and/or vasopressors) in 46% (105/228) of trials after seeing what the AI suggested.Both safe and unsafe AI recommendations influenced human decisions to different extents: fluid doses shifted on average by 80 ml/h (and vasopressor doses by 0.01 mcg/kg/min) after a safe AI recommendation compared to 40 ml/h (and 0.08 mcg/kg/min) after an unsafe AI recommendation.Figure 5 shows the shift of distribution in vasopressor prescriptions before and after the AI recommendation was seen for two scenarios (split by whether the entire cohort is considered or only those physicians who did not ask for senior/second opinion).In scenarios (such as number two) where the unsafe AI recommendation was significantly influencing, this did not seem apparent when exclusively considering the physicians who did not request senior help (see Supplementary Appendix F for this plot over all scenarios).Finally, each physician had one of the two unsafe scenarios extended with a "challenge" section where the bedside nurse (a member of the experimental team) was given three attempts to change the physician's mind on whether or not to stop the AI recommendation were it to be automatically implemented.In 95% of cases (36/38), the verbal input challenge from the bedside nurse did not sway the physician's decision to accept or reject the automated application of the AI recommendation.However, two participants (both junior) were persuaded to change their minds from interrupting an unsafe recommendation to accepting it.Human-tohuman interactions can therefore also play a role in the inadvertent adoption or appropriate rejection of unsafe recommendations.DISCUSSION Our findings confirm that AI recommendations can influence clinician behaviour and thereby impact patient care.Unsafe AI recommendations, represented here as sudden under-or overdosing, were frequently (but not entirely) detected and appropriately mitigated by the clinical team (by rejecting the AI recommendation).However, junior physicians more often deferred the decision to senior colleagues when they were unsure about the safety of an AI .recommendation.This shows the importance of educating clinical teams who will interact with a new AI recommender system on the correct intended use of the system, including target patient population, indications, and limitations, as well as the importance of clinical context when integrating the AI recommendations into their practice.This study reinforces the call for more interdisciplinary and realistic human-AI interaction studies on domain experts. 26,27Critically, our experimental design also allowed us to study human-human behavioural dynamics during an encounter with AI decision support.This is important as most clinical uses of AI-driven decision support tools will be in the context of multi-disciplinary teams where humans other than the final decision-maker can still positively or negatively influence the interaction between the final human decision-maker and the AI.This is why our study was run in a simulation suite: an environment that reproduces natural stimuli of bedside practice for clinicians without any risk to patients.While eye-tracking is typically used in controlled environments, 28 this study demonstrates the feasibility of using this behavioural phenotypic marker of attention in more realistic, less constrained, environments.Our findings indicate that physicians fixated more on unsafe than safe AI recommendations implying an appropriately higher level of allocated cognitive attention.However, we also observed that physicians did not rely more on AI explanations in the unsafe scenarios calling into question the use of explanations as a mitigation strategy for unsafe AI.Nor did physicians devote significantly more attention to looking back at the 'traditional' (non-AI based) clinical data after seeing an unsafe AI recommendation to understand why the recommendation might be unsafe (i.e.there was no outward evidence of a desire to 'debug' the unsafe AI recommendation).The influence of AI recommendations on clinical judgement has already been studied in vignette-type experiments. 3This work goes one step closer to clinical deployment by studying these interactions in a high-fidelity simulation environment.This setup enabled the study of human-AI interaction with eye-tracking as well as the ability to investigate human-human interactions as they relate to AI.Most studies of clinical decision support system safety use medication error as the primary outcome measure and proxy for patient safety. 29Here, we look at systems that are not yet deployed in clinical practice, so measuring prescription error rate directly (and correlating that to 'error' without a gold standard) is challenging.Therefore, we took the problem from a different angle and aimed to estimate the ability of clinicians to spot unsafe treatment recommendations from an AI tool.However, the limitations of the study should also be acknowledged.First, as raised by many physicians during the initial briefing, prescribing hourly fluid and vasopressor doses directly is unusual for intensive care physicians who typically indicate blood pressure (and other parameters) targets and let the bedside nurse titrate the actual doses within a reasonable range to reach the set targets.Similarly, the simulation limited the physician's action space to one specific aspect of patient care, preventing action plans that might go beyond the defined possibilities.Moreover, making treatment decisions for the next hour is also less dynamic than real clinical practice (where for example the ability to examine a real patient and use advanced cardiac output monitors might add to the nuance of the overall clinical picture).From a different perspective, one could challenge the definitions of safe and unsafe recommendations used in the scenarios by arguing that there is no ground truth in sepsis .resuscitation and that they are therefore subjective.One might even go further and argue that strategies that under-or over-dose in specific patients (compared to the 'average') could be desirable in some cases.The scenarios used in this experiment were designed for the unsafe recommendations to be inappropriate to a majority of clinicians and validated by an independent panel of intensivists.The introduction of AI-driven decision support tools, particularly those using reinforcement learning aims to improve patient outcomes beyond the current standard of care. 9,30This means that such systems will give recommendations that differ from what the clinical team would ordinarily have done but potentially without explanation -a "mysterious oracle dilemma" where the AI oracle recommends actions that on average lead to better outcomes but might occasionally be suboptimal, and the users do not get context on the AI recommendation.It will therefore be essential for humans to exert critical thinking and assess how reasonable the AI recommendation is to filter potentially novel but superior calls by the AI from harmful recommendations.As regulators push toward requiring clear intended purpose statements for software as medical devices, 31 our high-fidelity eye-tracking based approach to evaluating an AI-driven decision support tool serves as a basis for promoting the generation of safety evidence.Furthermore, the recent rise in popularity of generative AI (most notably as large language models) highlights the safety concerns of hallucinatory outputs that might be acted upon in a clinical setting and bring harm to patients. 32It is likely that an AI system that shows overall superhuman performance in a given task will still show lower-quality performance in some specific cases. 335][36] The human-AI interactions at the bedside, with a particular focus on high-pressure decision-making, would also help to accelerate the safe translation of AI-based decision support tools to the bedside.CONCLUSION It is critical for clinician acceptance, regulatory compliance and real-world adoption that we evaluate cooperation between clinical experts and AI decision support tools in high-fidelity settings -in our case a simulated intensive care unit.This study demonstrates the influence of AI recommendations on clinical behaviour and suggests that the vast majority of unsafe AI recommendations are appropriately rejected by bedside clinicians.The findings on junior physicians occasionally accepting an unsafe AI recommendation and their general willingness to seek senior help when unsure should inform the intended use (i.e.some tools might need to only be used by junior clinicians if they have access to senior advice).Uncertainty awareness, novel forms of AI interpretability and a better understanding of human-human interactions (i.e.team decisions) in the context of AI-driven decision support will help not only with assuring safety from a regulatory perspective but also in fostering confidence and approval from physician end-users.METHODS Objective -We conducted an observational human-AI interaction study in a high-fidelity simulation facility.Our primary objective was to measure whether participants were able to detect, and correctly reject, unsafe recommendations from an AI tool and/or ask for senior help when appropriate.Secondary study objectives included: (i) quantifying the shift in fluid and vasopressor doses induced by seeing an AI recommendation, and (ii) determining whether or not gaze patterns varied differentially depending on the safety status of the AI recommendation.Experimental design -Participants (clinicians) were briefed on the experiment and completed a pre-experiment questionnaire recording their demographics and prior experience with AI (see Supplementary Appendix A for the full content of the briefing and questionnaire).Participants were told that they would conduct a review of several adult patients with sepsis within a simulation suite (Imperial College Simulation Centre) and that they would need to prescribe appropriate doses of fluid and vasopressor for each patient both before and after getting advice from an AI tool.Critically, physicians were told that the AI recommendation engine had been successfully validated in multiple retrospective settings but had not been prospectively evaluated.The simulation layout is shown in Figure 1a.Each physician completed a total of six different patient scenarios, simulating a virtual "ward round".Each scenario started with physicians entering the simulation suite and conducting their assessment of the patient as they saw fit.Data sources within the room included a standard paper ICU bedside data chart with observations and blood results, an ICU handover note including details of the patient's presentation and medical history, a vital signs monitor and a physical patient mannequin (Simman 3G, Laerdal Medical, Stavanger, Norway) which could be examined (see Supplementary Appendix B for the details of each patient scenario).All patient scenarios were crafted by clinical experts from the authors team and to fit the needs of this simulation experiment, they do not come from real patients.A member of the research team played the role of the bedside ICU nurse who could only give standardised responses to any questions.Following their assessment of the patient, physicians were asked to recommend a dose rate for fluids (ml/hr) and vasopressors (noradrenaline, mcg/kg/min) for the coming hour (to match the format of AI recommendations).Physicians rated their confidence on a 1-10 scale and whether or not they would like support for their decision from a senior physician (or a second opinion if the physician was already senior themselves).They were then shown the AI recommendation, asked to what extent they agreed with the recommendation on a 5-point Likert scale (from completely disagree to completely agree), and then the initial dosing-related questions again (what dose they would prescribe, their confidence level, optional ask for senior help -see figure 1b).Finally, physicians were asked whether or not they would stop the AI recommendation if it were to be automatically administered to the patient.This question was intended to nuance the agreement prompt and identify situations where a participant might disagree with an AI recommendation but not necessarily consider it a threat to patient safety.Participants were clearly introduced to the nuance between these two questions in the pre-experiment briefing.The running of a single patient scenario from entry into the simulation suite to exit constituted one trial.Trials were categorised by the nature of the AI tool's recommendation provided to the participant: safe, unsafe or "challenged" unsafe.In the latter, after the physician reported whether or not they would stop the AI recommendation if it were to be automatically administered, the bedside nurse was permitted three attempts (all following a standardised script) to verbally try to convince them to change their mind (see Supplementary Appendix C for the scripts).Each physician experienced four safe trials, one unsafe and one "challenged" unsafe in a pseudo-randomised order (see the trial matrix in Supplementary Appendix D).The first trial encountered by every physician was always in the safe condition to establish a .baseline level of trust with the AI tool and let the physician familiarise themselves with the environment.The details of each patient scenario are presented in Supplementary Appendix B. All AI recommendations were synthetically generated by the research team for the purpose of ensuring a standardised experimental format (i.e. they were not from a real AI system).The definition of unsafe recommendations was based on extreme under-or over-dosing of fluid and/or vasopressor as per previous work. 13All participating physicians were fully debriefed at the conclusion of the study on the synthetic nature of the AI recommendations so as not to bias their opinions of future interactions with AI-driven systems.During each trial, all physician responses were recorded by a member of the research team sitting in a dead angle in the simulation suite.This data, along with questionnaire answers was reformatted and analysed in Python (code available online here: https://figshare.com/s/78c5ff5c6031f701c0d1)Eye-tracking for gaze recording In this study, gaze was employed as an indicator of physicians' attentional focus during simulations, with particular interest in whether this varied according to the safety of the AI recommendation.Pupil and first-person videos were recorded with non-invasive commercially available eye-tracking glasses (Pupil Core headset).The Pupil Labs software (Core, version 3.3) utilised both eye cameras to delineate the pupil and estimate the direction of gaze within the recorded field of view.Prior to the experiment, a two-part 2D calibration procedure was conducted.The initial stage involved a static calibration using five screen markers on a laptop display (default Pupil Labs 'screen marker' calibration).Subsequently, a depth-based static exercise was performed, requiring participants to focus on nine screen markers sequentially ('natural features' mode) displayed on a 60-inch TV screen, initially at 1 metre and then at 2 2-metre distance.A laptop (Lenovo Thinkpad) was connected to the eye-tracking glasses for the entire experiment.To allow for unrestricted movement in the suite, the glasses were connected via USB to a batterypowered laptop (Lenovo Thinkpad) worn by physicians in a lightweight backpack.Because of variability in facial morphologies, 19/38 physicians passed the calibration exercises and had their gaze-based attention data collected.Physicians were instructed to point to where they were reading on the handover note at the start of each scenario as a final level of validation that the eye tracking was appropriately calibrated.We defined four key regions of interest (ROIs) (Figure 1c): the paper ICU data chart, the vital signs monitor, the patient mannequin (Laerdal Simman 3G) and the AI display screen.Four further sub-regions were identified within the AI screen ROI corresponding to four types of explanation for the AI recommendation.April tags (simple QR codes) within the simulation suite (see Figure 1c) were used to identify ROIs in post-processing.As is common practice in eye-tracking literature 37,38 , we used the number of gaze fixations per ROI-a fixation being the predominant eye movement occurring when the foveal region of the visual field is held stationary-as a proxy for participant attention.Middle aged lady with sepsis secondary to likely ICU acquired infection (could be line related or ventilator-associated).Has been in ICU for over a week so likely to be fluid replete.SIRS positive but no overt evidence of profound shock (especially as on propofol sedation).Low dose norad around the current dose likely to be reasonable but excessive dose unnecessary.Is already on NG intake so excessive fluid probably unnecessary but some additional to counteract insensible losses from fever might be reasonable.High dose norad unnecessary and likely dangerous.
. Patient 4 handover note for participants: • 63M admitted 8 hrs ago from theatres post laparotomy for perforated colon 2ry to diverticular disease.• PMH: Diverticular disease, T2DM (diet controlled, HbA1C 45), HTN ( Middle aged man with septic shock secondary to abdominal sepsis after perforated viscus.Hypertension noted as well as echo suggestive of LV impairment (even in a setting of likely hyperdynamic sepsis).Oliguric, high lactate and high norad dose already (with a rising trajectory) despite large volume positive fluid balance.Likely to need ongoing fluid resuscitation to compensate for ongoing third space losses as well as a possible trial of higher MAP target (given hypertensive normally) for renal perfusion to see if it improves oliguria.Complete cessation of vasopressor would be dangerous.

. 1 :
Figure 1: Experimental design -A.Photo of the simulation suite with: (1) Bedside monitor (2) AI screen (3) Participant (4) Bedside nurse (5) Patient mannequin (6) Intensive care unit (ICU) bedside information chart.B. Experimental protocol diagram.C. Gaze-based attention extraction pipeline: eye-tracking glasses, pupil camera view, a recorded field of view with April tags (QR codes) and reconstructed data with fixation heatmaps on the different regions of interest (ROIs).

Figure 2 :
Figure 2: Recruited cohort demographics -A.Distribution of intensive care experience.B. Gender distribution.C. Proportion of physicians who had ever been involved in a research project involving AI.This cohort covers the whole range of experience levels, is in line with national gender ratios and contains both people who have and have not worked with AI.

Figure 3 :
Figure 3: Impact of AI recommendation safety status on clinician decisions, and gaze fixations on each ROI.-A.Bar chart of the proportion of stopped safe recommendations, stopped unsafe recommendations and stopped or escalated unsafe recommendations.B. Proportions of requests for senior help before seeing any recommendation and after having seen a safe or an unsafe one.C. Number of gaze fixations on each ROI before revealing the

Figure 4 :
Figure 4: Clinical practice variability -Distribution of initial (i.e.pre AI reveal) vasopressor (left) and fluid (right) prescriptions by physicians for each patient scenario.

Figure 5 :
Figure5: Shift analysis -Vasopressors dose distribution before and after having seen a safe or unsafe AI recommendation for two patient scenarios across all physicians (left) and only those who did not ask for senior/second review (right).Unsafe AI recommendations do not always influence the final decision.Physicians most influenced by unsafe AI recommendations tend to ask for a second opinion, while those who do not ask for help are less influenced by unsafe recommendations.