Abstract
Background Randomized controlled trials (RCTs) are essential for determining the safety and efficacy of healthcare interventions. However, both laypeople and clinicians often demonstrate experiment aversion: preferring to implement either of two interventions for everyone rather than comparing them to determine which is best. We studied whether clinician and layperson views of pragmatic RCTs for Covid-19 or other interventions became more positive early in the pandemic, which increased both the urgency and public discussion of RCTs.
Methods We conducted several survey studies with laypeople (total n=2,909) and two with clinicians (n=895; n=1,254) in 2020 and 2021. Participants read vignettes in which a hypothetical decision-maker who sought to improve health could choose to implement intervention A for all, implement intervention B for all, or experimentally compare A and B and implement the superior intervention. Participants rated and ranked the appropriateness of each decision.
Results Compared to our pre-pandemic results, we found no decrease in laypeople’s aversion to non-Covid-19 experiments involving catheterization checklists and hypertension drugs. Nor were either laypeople or clinicians less averse to Covid-19 RCTs (concerning corticosteroid drugs, vaccines, intubation checklists, proning, school reopening, and mask protocols), on average. Across all vignettes and samples, levels of experiment aversion ranged from 28% to 57%, while levels of experiment appreciation (in which the RCT is rated higher than the participant’s highest-rated intervention) ranged from only 6% to 35%.
Conclusions Advancing evidence-based medicine through pragmatic RCTs will require anticipating and addressing experiment aversion among both patients and healthcare professionals.
Introduction
Randomized controlled trials (RCTs) are crucial for understanding how to safely, effectively, and equitably prevent and treat disease and deliver healthcare. They have repeatedly upended conventional clinical wisdom and the results of observational studies,1 and are urgently needed to evaluate new technologies.2 However, RCTs often prove controversial, even when they compare interventions that are within the standard of care or otherwise unobjectionable, and about which the relevant expert community is in equipoise.3 Prestigious medical journals have recently published several trials—including SUPPORT4, FIRST5, and iCOMPARE6—that have received considerable criticism from physician-scientists, ethicists, and regulators in those journals7,8 and the public square.9–12 Although criticisms of RCTs can be complex and nuanced, many reflect a rejection of the very idea that an experiment was conducted, as opposed to simply giving everyone the allegedly superior intervention.
In prior studies—inspired by several “notorious RCTs,” including technology industry “A/B tests”13–15—we confirmed that substantial shares of both laypeople and clinicians can be averse to randomized evaluation of efforts to improve health. People rated a pragmatic RCT designed to compare the effectiveness of two interventions significantly lower than the average rating of implementing either one, untested, for everyone, a phenomenon we call the “A/B effect.”16 In some cases, the lower average rating of an experiment could be driven not by dislike of experiments, per se, but by the fact that many people believe one of its arms is inferior to the other,16,17 a belief that is often not evidence-based. We therefore also documented “experiment aversion”: rating an RCT comparing two interventions as worse than even one’s own least-preferred intervention.17 Both patterns of negative sentiments about experiments—including experiments judged to compare two unobjectionable interventions—can impede efforts to identify what does and does not work to improve health outcomes.
The Covid-19 pandemic presented a potential inflection point in attitudes towards health experimentation. In April 2020, 72 Covid-19 drug trials were already underway18 and RCTs became daily, front-page news. That sustained exposure might have educated people about RCTs, or made RCTs more normative. Separately, our previous research suggests that one cause of experiment aversion is an illusion of knowledge—a (mis)perception that experts already must know what works best, and should simply implement that. But Covid-19 was a novel disease, and—at least in the case of pharmaceutical interventions—no sensible person thought the correct treatments were already obvious. People therefore may be less averse to Covid-19 RCTs than to RCTs that test interventions against longstanding conditions or problems. On the other hand, because of the urgency attached to Covid-19, people may be more averse to Covid-19 RCTs, being even less inclined to risk giving someone a treatment that might turn out to “lose” in a comparison study.19,20 Finally, even if the pandemic did not affect public attitudes towards RCTs, it could have affected the attitudes of clinicians, many of whom were involved in Covid-19 research. Because clinicians strongly influence whether particular RCTs are conducted, their attitudes matter.
We investigated attitudes towards experimentation in the first year of the pandemic by conducting a series of preregistered studies between August 2020 and February 2021. First, we used decision-making vignettes from our previous work to ask whether the extraordinary publicity around Covid-19 RCTs reduced general healthcare experiment aversion by the public. Next, we adapted these vignettes to determine whether the public was averse to experimentation on pharmaceutical and/or non-pharmaceutical interventions (NPIs) for Covid-19. Finally, we recruited two large clinician samples to investigate how their attitudes compared to those of laypeople. All three studies were randomized survey experiments in which participants first read about a decision-maker faced with a problem who either implemented one of two interventions (A or B) or ran an experiment to compare them (and then implemented the superior one). Participants then evaluated how appropriate each of those three decisions was.
Methods
Lay Sentiments About Healthcare Experimentation
In August 2020, we used the CloudResearch service to recruit 700 crowd workers on Amazon Mechanical Turk to participate in a brief online survey. These services provide samples that are broadly representative of the U.S. population and are well-accepted in social science research as providing as good or better-quality data than convenience samples such as student volunteers, with results that are similar to probability sampling methods.21,22
Each participant first read a vignette that described a problem that the decision-maker could address in three ways (see Table 1 for examples; see the Supplemental Appendix [SA] for text and motivations for all vignettes): by implementing intervention A for all patients (A); by implementing intervention B for all patients (B); or by conducting an experiment in which patients are randomly assigned to A or B and the superior intervention is then implemented for all (A/B). (Our vignettes are silent about whether consent will be obtained, but IRBs customarily waive consent when it would make low-risk pragmatic RCTs impracticable23; in separate work, we found that substantial shares of people object to such experiments even when we specify that consent will be obtained.24) Next, following standard methods in social and moral psychology for evaluating decisions,25 participants rated each option on a scale of appropriateness from 1 (“very inappropriate”) to 5 (“very appropriate”), with 3 as a neutral midpoint. Participants then rank-ordered the options from best to worst. We also collected demographic information, but found no substantial associations with it in any of our studies (Tables S8-11).
Participants were randomly assigned to one of two vignettes. In Best Anti-Hypertensive Drug, some doctors in a walk-in clinic prescribe “Drug A” while others prescribe “Drug B” (both of which are affordable, tolerable, and FDA approved) and Dr. Jones prescribes either A or B for all his hypertensive patients, or runs a randomized experiment to compare their effectiveness. In Catheterization Safety Checklist, a hospital director similarly considers two locations where he might display a safety checklist for clinicians (see Table 1A).
We define the “A/B Effect” as the degree to which participants’ ratings of the A/B test were lower than the average of their ratings of implementing A and B.16 “Experiment aversion” is the degree to which participants rated the A/B test lower than their own lowest-rated intervention (either A or B for each person). “Experiment appreciation” is the opposite: the degree to which the experiment is rated higher than each participant’s highest-rated intervention. (See the SA for full details on power analyses and sample sizes, statistical analyses, materials, preregistrations, and data availability.)
Lay Sentiments About Covid-19 Healthcare Experimentation
Between August 2020 and January 2021, we recruited 2,209 additional laypeople in the same manner described above. They read, rated, and ranked six new vignettes involving Covid-19 interventions (N = 339–450 per vignette). Four vignettes were based on Covid-19-related interventions that were discussed, tested, and/or implemented at the time: Masking Rules (which described two masking policies, of varying scope); School Reopening (two school schedules designed to increase social distancing); Best Vaccine (two types of vaccine—mRNA versus inactivated virus); and Ventilator Proning (two protocols for positioning ventilated Covid-19 patients; see Table 1B). The other two vignettes—Intubation Safety Checklist and Best Corticosteroid Drug—were adapted from the first study to apply to Covid-19.
Clinician Sentiments About Covid-19 Healthcare Experimentation
Between November 2020 and February 2021, clinicians (including physicians, physician assistants, and nurse practitioners) in a large health system in the Northeastern U.S. were recruited by email to read, rate, and rank one of four Covid-19-related vignettes (Masking Rules: n = 349; Intubation Safety Checklist: n = 271; Best Corticosteroid Drug: n = 275; Best Vaccine: n = 1254) from the second study.
Results
Lay Sentiments About Healthcare Experimentation
We found substantial negative reactions to A/B testing in both vignettes (Table 2A), replicating our pre-pandemic findings.16,17 In Catheterization Safety Checklist (Figure 1A), we found evidence of the A/B Effect: participants rated the A/B test significantly below the average ratings they gave to implementing interventions A and B (d = 0.69, 95% CI: (0.53, 0.85)). Here, 41% ± 5% (95% CI) of participants expressed experiment aversion (rating the A/B test lower than their own lowest-rated intervention; d = 0.25, 95% CI: (0.11, 0.39)). When ranking the three options from best to worst, only 32% placed the A/B test first, while 48% placed it last.
We also observed an A/B Effect in Best Anti-Hypertensive Drug (Figure 1B); d = 0.52, 95% CI: (0.36, 0.68)), where 44% ± 5% also expressed experiment aversion (d = 0.46, 95% CI: (0.30, 0.52)). Notably, participants were averse to this experiment even though there is no reason to prefer “Drug A” to “Drug B,” and patients are effectively already randomized to A or B based on which clinician happens to see them—which occurs wherever unwarranted variation in practice determines treatments, such as walk-in clinics and emergency departments. Here, however, similar proportions of people ranked the A/B test best and worst (50% vs. 45%; p = 0.16).
These levels of experiment aversion near the height of the pandemic were slightly (but not significantly) higher than those we observed among similar laypeople in 2019 (41% ± 5% in 2020 vs. 37% ± 6% in 2019 for Catheterization Safety Checklist, p = 0.31 ; 44% ± 5% in 2020 vs. 40% ± 6% in 2019 for Best Anti-Hypertensive Drug, p = 0.32).17
Lay Sentiments About Covid-19 Specific Healthcare Experimentation
In all six Covid-19 vignettes, we found evidence of the A/B Effect (Table 2B). In three, however, we did not find experiment aversion: Best Vaccine, Best Corticosteroid Drug, and School Reopening. In the first two of these, participants rated the two interventions very similarly and the experiment only slightly lower (Figure 2B). These vignettes also elicited the largest proportion of participants (65% in Best Vaccine and 56% in Best Corticosteroid Drug) in any vignette who ranked the A/B test best among the three options, compared to 31–34% of participants who ranked it worst. In School Reopening, experiment aversion was not observed because participants on average clearly preferred intervention B to A and rated the experiment similar to intervention A.26,27 53% of participants ranked intervention B as the best of the three options (compared to 17% choosing intervention A and 30% choosing the A/B test).
In the other three vignettes, participants rated the A/B test condition as significantly less appropriate than their lowest-rated intervention (Masking Rules: d = 0.56, 95% CI: (0.41, 0.71); Ventilator Proning: d = 0.17, 95% CI: (0.04, 0.30); Intubation Safety Checklist: d = 0.36, 95% CI: (0.21, 0.49)). These levels of aversion to Covid-19 RCTs are similar to the levels of aversion to non-Covid-19 RCTs both before17 and during the pandemic (see above).
Clinician Sentiments About Covid-19 Specific Healthcare Experimentation
We observed an A/B effect in all four vignettes. In two, clinicians, like laypeople, were also significantly experiment averse (Masking Rules: d = 0.74, 95% CI: (0.57, 0.91); Intubation Safety Checklist: d = 0.30, 95% CI: (0.15, 0.45)). In Best Vaccine, clinicians, like laypeople, did not show any significant difference in their ratings of the A/B test and their lowest-rated intervention (d = –0.03, 95% CI: (–0.10, 0.04)). Again, like laypeople, 58% of clinicians ranked the vaccine A/B test as the best of the three options, the highest proportion of any clinician-rated vignette.
Clinicians differed from laypeople in their response to Best Corticosteroid Drug. Laypeople did not show experiment aversion, but clinicians rated the A/B test as significantly less appropriate than their lowest-rated intervention (d = 0.49, 95% CI: (0.32, 0.66)). This difference may be due to clinicians’ greater familiarity with the treatment of Covid-19. Clinicians may also have seen an urgent need for any drugs to treat Covid-1920 and thus rated adopting a clear treatment intervention as more appropriate than an RCT.
Discussion
We found no diminution in general experiment aversion among laypeople during the first year of the Covid-19 pandemic, despite increased exposure to the nature and purpose of RCTs. Neither laypeople nor clinicians were overall less averse to Covid-19 experiments, despite the fact that confidence in anyone’s knowledge of what works should have been even more circumscribed than in the everyday contexts of hypertension and catheter infections. To the contrary, we found an A/B effect (the average rating of the RCT was lower than the average rating of the two policies) in all vignettes and samples. Most Covid-19 vignettes were met with experiment aversion (on average, participants rated the RCT lower than each participant’s lowest-rated intervention). This is consistent with an emphasis during the pandemic that we must “do” instead of “learn,” a false dichotomy that fails to recognize that implementing an untested intervention is itself a nonconsensual experiment from which, unlike an RCT, little or nothing can be learned.28–30 Similarly, across all vignettes and samples, between 28% and 57% of participants demonstrated experiment aversion, while only 6%–35% demonstrated experiment appreciation (by rating the RCT higher than their highest-rated intervention).
In none of our 12 studies were more people appreciative of than averse to the RCT, in none was the average RCT rating higher than the average intervention rating, and in none was the RCT rating higher than each participant’s highest-rated intervention, on average. Notably, unlike trials with placebo or no-contact controls, the A/B tests in our vignettes compared two active, plausible interventions, neither of which was obviously known ex ante to be superior. Yet substantial shares of participants still preferred that one intervention simply be implemented without bothering to determine which (if either) worked best.
There is one bright spot in our results: the most positive sentiment towards experiments was observed in both laypeople and clinicians in the vignettes involving Covid-19 drugs and vaccines. Here we observed the highest proportions of participants who demonstrated experiment appreciation (31%–46%) and who ranked the RCT first (49%–65%). This result is consistent with our previous findings that the illusion of knowledge—here, the belief that either the participant herself or some expert already does or should know the right thing to do and should simply do it—biases people to prefer universal intervention implementation to RCTs.16,17 Rightly or wrongly, both laypeople and clinicians might (a) appropriately recognize that near the start of a pandemic, no one knows which existing drugs, if any, are safe and effective in treating a novel disease, and that new vaccines need to be tested, yet (b) fail to sufficiently appreciate the level of uncertainty around NPIs like masking, proning, and social distancing, which can also benefit from rigorous evaluation. This is consistent with the dearth of RCTs of Covid-19 NPIs:31 of the more than 4,000 Covid-19 trials registered worldwide as of August 2021, only 41 tested NPIs.32 Explaining critical concepts like clinical equipoise or unwarranted variation in medical and NPI practice alike might diminish experiment aversion.
Critics note that RCTs have limited external validity when they employ overly selective inclusion/exclusion criteria or are executed in ways that deviate from how interventions would be operationalized in diverse, real-world settings. However, the solution is not to abandon randomized evaluation, but to incorporate it into routine clinical care and healthcare delivery via pragmatic RCTs.1,33 It has been many years since the Institute of Medicine urged research of many varieties to be embedded in care.34 More recently, the FDA established a Real-World Evidence Program that promotes pragmatic RCTs to support post-marketing monitoring and other regulatory decision-making.35,36 Pragmatic RCTs have been fielded successfully and informed healthcare practice and policy,37–39 but they remain far from ubiquitous and they require buy-in to be successful, as shown by the case of a Denmark school reopening trial that was abandoned due to lack of such support.40 Wider use of pragmatic RCTs will require not only redoubling investment in interoperable electronic health records and recalibrating regulators’ views of the comparative risks of research versus idiosyncratic practice variation,1 but also anticipating and addressing experiment aversion among patients and healthcare professionals.
Data Availability
Participant response data, preregistrations, materials, and analysis code have been deposited in Open Science Framework and will be released upon final publication of this paper.
Data availability
Participant response data, preregistrations, materials, and analysis code have been deposited in Open Science Framework and will be released upon final publication of this paper.
Methods
In the main text, we grouped the vignettes thematically into three sets: “Lay Sentiments About Healthcare Experimentation,” “Lay Sentiments About Covid-19 Specific Healthcare Experimentation,” and “Clinician Sentiments About Covid-19 Specific Healthcare Experimentation.” However, when we collected data, we grouped our vignettes differently such that we started with vignettes that we have used in previous published work and their respective Covid-19 derivatives, then we developed and tested novel Covid-19 specific vignettes separately, and then, again separately, we tested a Covid-19 vaccine vignette. We followed a similar pattern in our clinician sample: we first tested three Covid-19 specific vignettes (two which were derivatives of vignettes from our previous work, one which was new to this work) and then separately, we tested a Covid-19 vaccine vignette. These groupings are important for understanding how participants were randomly assigned to vignettes and why there are slight discrepancies (or large discrepancies in the case of the Best Vaccine vignette in the clinician sample1) in the number of participants in each vignette (see Table S1).
For clarity, in the main text of this article we used different names for the vignettes than those used in the preregistrations and in previous publications (see Table S2).
Preregistrations, sample sizes, and power analyses
Our research questions, power analyses and sample sizes, and analysis plans were all preregistered at Open Science Framework (OSF) before data collection. These sample size precommitments are copied from each preregistration document which will be released upon final publication of this paper.
Preregistration 1 (Catheterization Safety Checklist, Best Anti-Hypertensive Drug, Intubation Safety Checklist, Best Corticosteroid Drug vignettes):
“We predict that, using a two-tailed, paired t-test with ⍺ = .05 within each scenario, participants will rate the A/B test condition as significantly less appropriate than their own average rating of the two policy conditions, mean(A,B). This is the test for the “A/B Effect.” Recruiting 350 participants for each scenario provides 95% power to detect an effect as small as d = 0.19, which is substantially smaller than the effect sizes we have observed using the Hospital Safety Checklist and Best Drug: Walk-In Clinic vignettes in past research.”
Preregistration 2 (Ventilator Proning, School Reopening, Masking Rules, and Best Vaccine (initial ambiguous version) vignettes):
“We predict that, using a two-tailed, paired t-test with ⍺ = .05 within each scenario, participants will rate the A/B test condition as significantly less appropriate than their own average rating of the two policy conditions, mean(A,B). This is the test for the “A/B Effect.” Recruiting 350 participants for each scenario provides 95% power to detect an effect as small as d = 0.19, which is substantially smaller than the effect sizes we have observed using the Hospital Safety Checklist and Best Drug: Walk-In Clinic vignettes in past research.”
Preregistration 3 (Clinicians; Intubation Safety Checklist, Best Corticosteroid Drug, and Masking Rules vignettes):
Note that because of time constraints around the possible starting dates of our clinician surveys, we launched this study before preregistering it, and we did not report an explicit power analysis before collecting the data. Because this study follows a similar structure to the studies above, however, it was reasonable to apply the previous sample size and power analysis considerations. We did, however, preregister our approach and research plan twice during this study: once during data collection, before any analyses had been conducted, and again after all data had been collected (but before analyzing any of them).
Preregistration 3.1: “At the time of this preregistration, we have received 655 complete responses. No data have been explored or analyzed at this point. We will conduct an interim analysis on this dataset using the same analyses we have previously preregistered, and we may continue to collect more data from this population.”
Preregistration 3.2: “Data collection is now complete and we have closed the survey. On 11/24/2020, we conducted an interim analysis on 601 complete responses. Since then, we have received an additional 295 complete responses, to which we remain blind.”
Preregistration 4 (Best Vaccine):
“We recruited 350 participants for the original Covid-19 vaccines study. Because we are running this study to determine whether even a small effect emerges, we will increase the sample size to 450 participants. This provides 80% power to detect an effect as small as d = 0.13 in a repeated-measures, two-tailed t-test, and 95% power to detect an effect as small as d = 0.17.”
Preregistration 5 (Clinicians; Best Vaccine):
“Our previous survey of healthcare providers resulted in approximately 900 complete responses; we expect a similar response rate for this survey. This sample size provides 95% power to detect an effect as small as d = 0.12 using a two-tailed, repeated measures t-test. Even if we only receive 600 complete responses, we will have 95% power to detect an effect as small as d = 0.15.”
Procedure and design
Several aspects of the procedure and experimental design were consistent across the studies reported here. Below, we describe these consistent features and note in specific studies where we deviated from them.
For the lay participant samples, we used the CloudResearch service to recruit crowd workers on Amazon Mechanical Turk (MTurk) to participate in a 3–5-minute survey experiment.
Participants were excluded from recruitment in any of the studies reported here if they had participated in any of our previous studies on this topic. Across all laypeople vignettes, the completion rate of participants starting the survey was 91.5%. The Geisinger IRB determined that these anonymous surveys were exempt (IRB# 2017-0449).
For the clinician samples, we recruited healthcare providers from a large health system in the Northeastern U.S via email. Each provider received either one or two emails about the study during the recruitment window. In the first clinician study (Intubation Safety Checklist, Best Corticosteroid Drug, and Masking Rules vignettes), we first tested the email recruitment system by sending out the survey invitation email to just 200 clinicians. Clinicians who completed the survey based on this survey invitation were included in the final sample. Then, all clinicians were sent the recruitment email on November 19, 2020, followed by a reminder email on December 3, 2020. In the second clinician study (Best Vaccine), the initial recruitment email was sent January 25, 2021, with the follow-up email sent February 2, 2021. In the first clinician study, 5,925 clinicians were emailed and 895 completed the survey. In the second clinician study, 6,993 clinicians were emailed and 1,254 completed the survey. In these samples, because survey responses were fully anonymous, we were not able to restrict participation based on our previous studies, so some participants who completed the Best Vaccine vignette may have earlier completed the Intubation Safety Checklist, Best Corticosteroid Drug, and Masking Rules vignettes.
In all cases, participants completed an online survey hosted by Qualtrics. After opening the survey, participants were randomly assigned to one of the possible vignettes being studied.2,3 In the case of data collection batches 4 and 5, there was only one vignette being tested that all participants saw. At this point, we used the exact same procedure detailed in Heck et al. (2020)1. First, participants were instructed to read about several possible decisions made by different decision-makers4, and to try to treat each decision as separate from the others. All scenarios contained a brief “background” text at the top of the page that summarized a problem, followed by three “situations,” each of which detailed the decision-maker’s choice to adopt intervention A, intervention B, or to run an A/B test by randomly assigning people to one of two test conditions. These conditions were presented in fully counterbalanced order; each participant received one of six possible orders (i.e., Situation 1 = A, Situation 2 = B, and Situation 3 = A/B; Situation 1 = A/B, Situation 2 = B, and Situation 3 = A; etc.…). At no point did we observe a meaningful effect of presentation order, so we collapsed across this variable for all analyses.
For our primary outcome measures, participants were asked to rate the appropriateness of the decisions made in Situation 1, Situation 2, and Situation 3 (“How appropriate is the director’s decision in Situation 1/2/3?”), using a 1-5 scale (1 = “Very inappropriate”, 2 = “Inappropriate”, 3 = “Neither inappropriate nor appropriate”, 4 =”Appropriate”, 5 = “Very appropriate”). Participants then specified a ranked order of the three decisions (“Among these three decisions, which decision do you think the director should make? Please drag and drop the options below into your preferred order from best to worst. You must click on at least one option before you can proceed.”), with 1 being the best decision and 3 being the worst. The last item on this page asked participants to explain why they chose these ratings and rankings in a couple of sentences (“In a couple of sentences, please tell us why you chose the ratings and rankings you chose.”).
Following these primary measures, participants completed standard demographic items on the next page. For MTurk participants, these were measures of sex, race/ethnicity, age, educational attainment, household income, religious belief or affiliation, whether they have a degree in a STEM field or not, and four items identifying political orientation and affiliation. As part of an ongoing study in our laboratory (whose results will be reported elsewhere), these participants were randomized to one of six conditions for this demographic questionnaire where we varied the option to select “prefer not to answer” and whether the items were mandatory, optional, or requested (but not required). For clinician participants, demographic items were mandatory response and were limited to the following: sex, sources of training in research methods and statistics, self-reported comfort with research methods and statistics, past experience with activities related to research methods and statistics (e.g., publishing a scientific paper or analyzing data), current involvement in research, position (e.g., doctor, physician assistant, nurse, medical student, etc.), length of time working in the medical field, and field of specialty.
After completing the survey, MTurk participants were given a completion code to receive payment ($0.40). Clinician participants were invited to enter into a lottery to win a $50 Amazon gift card by following a link to an independent survey where they could enter their email address. All participants were thanked for their participation and offered the opportunity to comment on the survey.
Measures
We computed several variables to measure participants’ sentiments about experimentation.
Following Meyer et al. (2019)1, we define an “A/B effect” as the difference between participants’ mean policy rating and their rating of the A/B test—that is, the degree to which the policies are (on average) rated higher than the A/B test. We also report the percentage of participants whose mean policy rating is higher than their rating of the A/B test.
Following Heck et al. (20202; see also Mislavsky et al., 20193), we define “experiment aversion” as the difference between participants’ rating of their own lowest-rated policy and their rating of the A/B test. We also report the percentage of participants who express experiment aversion.
“Experiment rejection” (first reported in Heck et al., 20202, but without this name) occurs when a participant rates the A/B test as inappropriate (1 or 2 on the 5-point scale) while also rating each policy as neutral or appropriate (3–5 on the scale).
A “reverse A/B effect” is the difference between participants’ rating of the A/B test and their mean policy rating—that is, the degree to which the A/B test is rated higher than the policies (on average). We also report the percentage of participants whose rating of the A/B test is higher than their mean policy rating.
“Experiment appreciation” is the difference between participants’ rating of the A/B test and their rating of their own highest-rated policy. We also report the percentage of participants who express experiment appreciation.
“Experiment endorsement” occurs when a participant rates the A/B as appropriate (4 or 5 on the 5-point scale) while also rating each intervention as neutral or inappropriate (1–3 on the scale).
In all cases where a d-value was calculated (i.e., A/B effect, experiment aversion, reverse A/B effect, experiment appreciation), we used Cohen’s d recovered from the t-statistic, n, and correlation between the two measures being compared (Dunlop et al., 1996, equation 34: d = tc[2(1-r)/n]½; see also http://jakewestfall.org/blog/index.php/category/effect-size/kewestfall.org5. To calculate this d-value, we use the following R code: effsize::cohen.d(x,y, paired = TRUE).
Vignettes
Our vignettes were inspired by discussions about the ethics of real-world RCTs (see Table S3).
Results
Sample demographics
Lay participants
Across all vignettes reported in the main text (i.e., excluding the initial ambiguous version of the Best Vaccine vignette), there were a total of 2,910 lay participants. They ranged in age from 18 to 88 years old (mean = 38.4, SD = 12.8) and the majority were White (74.6%) and female (55.9%). 35.7% had a 4-year college degree, 29.7% had some college, and 20.5% had a graduate degree. 21.3% of participants had a degree in a STEM field. The most frequently selected income level was between $20,000 and $40,000 (20.7%). A majority of participants reported being moderate, leaning liberal, or being liberal both generally and specifically with regards to social and economic issues. Similarly, a majority of participants reported being independent, leaning Democrat, or being Democrat in their political party affiliations. 37.7% of participants reported being non-religious. Of those who reported being religious, the most reported religion was Protestant (24.2%). See Table S4 for demographic breakdowns by vignette and in the combined lay participant sample.
Clinicians
There were 2,149 clinician responses across all vignettes. In the clinician samples, survey responses were anonymous, so we could not restrict participation based on our previous studies so some participants who completed the Intubation Safety Checklist, Best Corticosteroid Drug, and Masking Rules vignettes may have also completed the Best Vaccine vignette. For this reason, demographics are reported separately by vignette in Table S5. Across vignettes, a majority of clinicians were female. Over 50% of participants in the sample were registered nurses, followed by physicians and physician assistants. Over 50% of participants in the sample reported that they had been in the medical field for over 10 years. The clinicians reported that they had received training in research methods and statistics via an average of 1.5 of the sources we listed, and that they engaged in an average of 2.5 research methods and statistics activities. Most clinicians reported being somewhat to moderately comfortable with research methods and statistics.
Results presented in main text
In Table S6A-C, we present the descriptive and inferential results for all vignettes discussed in the main text.
Comparisons to previously published work
To compare these results to our previous findings reporting sentiments about experiments, as we do in the main text, please refer to Heck et al. (2020)2. For example, in the Results section “Lay Sentiments About Healthcare Experimentation,” we say, “these levels of experiment aversion near the height of the pandemic were slightly (but not significantly) higher than those we observed among similar laypeople in 2019 (41% ± 5% in 2020 vs. 37% ± 6% in 2019 for Catheterization Safety Checklist, p = .31 ; 44% ± 5% in 2020 vs. 40% ± 6% in 2019 for Best Anti-Hypertensive Drug, p = .32).” We extracted the percentage of participants who were experiment averse in 2019 from Heck et al. (2020)2. We then performed a two-sample z-test for proportions to compare the 2019 and 2020 proportions. As noted in the main text, we did not find a significant difference between the percentage of people who were experiment averse in 2019 and the percentage of people who were experiment averse in the current studies which took place in 2020 and 2021 (Catheterization Safety Checklist: χ2(1) = 1.034, p = .309, Anti-Hypertensive Drug: χ2(1) = 0.998, p = .318).
Results not presented in the main text
Results of Best Vaccine vignette (initial ambiguous version)
The only vignette which showed no A/B Effect was the initial ambiguous version of Best Vaccine (see Table S6D). The two versions of Best Vaccine both presented a public health official’s decision to either distribute an mRNA-based vaccine to every county in their state, distribute an inactivated-virus vaccine to every county, or run an experiment in which counties are randomized to receive one of the two vaccine types. However, in version 1, the wording unintentionally implied that residents could choose their vaccine (by going elsewhere) if they did not wish to be subject to the official’s decision (including intervention implementation or A/B test), while in version 2 we eliminated this possible interpretation; we suspect this had the effect of making the experiment condition in version 1 less aversive, since people could effectively opt-out of it, and our goal in this research is to study pragmatic, real-world situations in which avoiding randomization is typically not a realistic option.
Order effect in clinician study
For the clinician study of the Catheterization Safety Checklist, Best Anti-Hypertensive Drug, and Masking Rules vignettes, participants were randomly assigned to one of these three vignettes and then completed the remaining two vignettes in random order. For consistency with the rest of this project and with our previous approach (Meyer et al., 2019)1, we analyze data from this study as a between-subjects design where we only consider the first vignette that every participant completed.
While conducting an interim analysis on the data for this study, we observed an intriguing and unexpected order effect of presentation.
For the first 601 complete responses we received, we observed an effect of presentation order on participants’ appropriateness ratings of the A/B test condition within the Best Anti-Hypertensive Drug vignette. Participants who received the Best Anti-Hypertensive Drug vignette first rated the A/B test an average of 2.95 (SD = 1.57), participants who received this vignette second rated the A/B test an average of 3.48 (SD = 1.39), and participants who received this vignette last rated the A/B test an average of 3.78 (SD = 1.41). This suggests that participants who read about other policies and A/B tests before considering the Best Anti-Hypertensive Drug vignette found the A/B test in the Best Anti-Hypertensive Drug vignette to be less objectionable than participants who received this vignette earlier in the survey. The relationship between presentation order (1, 2, or 3) and appropriateness rating of the A/B test was r = .23. This order effect did not emerge for the other two vignettes or for ratings of either intervention (A or B).
After observing this order effect but before examining any additional data, we preregistered this order effect with the goal of replicating it in an independent sample. 294 new participants completed the study after this interim analysis, and we analyzed the data from this sample independently from the sample that generated the order effect. Table S7 displays ratings of the A/B condition within each scenario grouped by the order in which participants received them. The order effect observed with the Best Anti-Hypertensive Drug A/B test condition replicated (r = .15), as did the absence of any similar order effect for the other conditions.
Heterogeneity in experiment aversion
In both the lay participant sample and the clinician sample, associations between demographic variables, including educational attainment, having a degree in a STEM field, years of experience in the medical field, and role in the healthcare system, and sentiment about experimentation (e.g., A/B effect, experiment aversion, experiment appreciation) are consistently small (r < |.13|, therefore explaining less than 2% of the variance; Tables S8–11).
In the lay sample, women show larger AB and experiment aversion effects (e.g., larger difference between mean intervention rating/lowest-rated intervention rating and AB test rating; r = .067–.068, p < .001) and a smaller experiment appreciation effect (e.g., smaller difference between AB test and highest-rated intervention rating; r = –.064, p < .001). Lay participants who are more conservative (in general and with respect to social and economic issues) or more likely to be strong Republicans show lower levels of an AB effect and experiment aversion (i.e., smaller difference between mean intervention rating/lowest-rated intervention rating and AB test rating; all rs < –.094, ps < .0001). These participants also show significantly more experiment appreciation, though the strength of the association is weaker (rs = .037–.046, p < .0001). Finally, we find that people who are non-religious show a larger degree of experiment aversion (r = .061, p < .001; they also show a larger AB effect, r = .051, but p = .007 which is greater than p < .005, the standard proposed in Benjamin et al. (2018)17 for exploratory analyses without a priori hypotheses). For all other variables, we find no significant associations between the individual difference measures and experiment sentiments (all rs < |.051|, all ps > .005).
In the clinician sample, the strongest association was between self-reported comfort with research methods and statistics and experiment aversion—clinicians who report being more comfortable with research methods and statistics are more likely to appreciate the A/B test (r = .070, p = .001).
References
Acknowledgements
Supported by Office of the Director, National Institutes of Health (NIH) (3P30AG034532-13S1) and funded by the Food and Drug Administration (FDA). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the FDA. We thank Daniel Rosica and Tamara Gjorgjieva for excellent research assistance. Vogt and Heck contributed equally to this work. Meyer and Chabris contributed equally to this work.
Footnotes
↵1 The Best Vaccine vignette was combined with another study that required a sample size much larger than the sample sizes in our previous vignette studies to have adequate statistical power.
↵2 For the clinician study of the Intubation Safety Checklist, Best Corticosteroid Drug, and Masking Rules vignettes, clinicians were randomly assigned to one of these three scenarios and then completed the remaining two scenarios in random order. For consistency with the rest of this project and with our previous survey experiment with clinicians regarding the A/B effect (Meyer et al., 2019, Study 6), and in order to make the results from clinician samples comparable to those with lay samples (in which each participant only ever saw one scenario), we analyze data from this study as a between-subjects design where we only consider the first scenario that every participant completed. See the section “Order Effect in Clinician Study” elsewhere in this appendix for further analyses.
↵3 The clinician version of the Best Vaccine vignette was combined with another study being conducted by a subset of researchers on this team. The materials for Best Vaccine were presented after the survey materials from the other study. Data from the other study are unrelated to the research questions tested here and will be reported separately.
↵4 In all vignettes, the protagonist (e.g., the hospital director or Dr. Jones) was male for ease of comparison to our previous work using these vignettes. Future work should examine the impact of the characteristics of the decision-maker on evaluations of their decisions regarding policy imposition and conducting RCTs.