Re-evaluating the evidence for fecal microbiota transplantation “super-donors” in inflammatory bowel disease
================================================================================================================

* Scott W. Olesen
* Ylaine Gerardin

## ABSTRACT

Fecal microbiota transplantation (FMT) is a recommended treatment for recurrent *Clostridioides difficile* infection, and there is promise that FMT may be effective for conditions like inflammatory bowel disease (IBD). Previous FMT clinical trials have considered the possibility of a “donor effect”, that is, that FMT material from different donors has different clinical efficacies. Here we lay out rigorous statistical methodology for detecting donor effects, finding that reliable detection of a donor effect requires trials with more than 200 FMT-treated patients. A re-evaluation of previous FMT clinical trials for IBD showed that while there is very little evidence for a non-zero donor effect, the existing data are also not inconsistent with substantial donor effects. Large-scale meta-analysis, combined with careful reporting from clinical trials, will be crucial in determining if donor effects are clinically relevant for IBD.

## INTRODUCTION

The human microbiome is increasingly understood to play a key role in health and disease.1 Fecal microbiota transplantation (FMT), the infusion of a healthy person’s stool into a patient, is one method for manipulating the gut microbiome. FMT is recommended for treatment of recurrent *Clostridioides difficile* infection.2,3 Although the specific mechanisms by which FMT cures *C.
difficile* infection are not well understood, FMT is being investigated as a therapy for dozens of other microbiome-related indications.1

A key challenge in identifying FMT’s specific mechanism, or mechanisms, is the complexity and diversity of human stool, a mixture of bacteria, viruses, fungi, microbe-derived molecules, and host-derived molecules which varies enormously from person to person.4,5 It has therefore been hypothesized that different stools, stool donors, or matches of donors and recipients could have different abilities to treat disease. This concept has been referred to with terms like “donor effect”, “super-donor”, and “super-stool”.6–9

If FMT’s efficacy varied widely across donors, then rational, *a priori* selection of stool donors based on biomarkers that predict their donated FMT material’s efficacy could improve the clinical practice of FMT.10,11 Differences in FMT efficacy between stool donors or specific stool samples could also be an important starting point for scientific investigations into the “active ingredient” in FMT donations.6 It is therefore important to quantify the extent of donor variability in indications where FMT is a promising treatment.

There are no reports of a donor effect in the context of *C. difficile* infection, the condition for which the use of FMT is best studied.12 However, multiple studies, mostly for ulcerative colitis, a kind of inflammatory bowel disease, have tested for donor effects.8,9,13–20 In all of these studies, identifying a donor effect was a *post hoc* analysis, and in no case was multiple hypothesis correction, a critical methodology in *post hoc* analyses, employed.
As we show herein, in at least two previous reports the specifics of the clinical design require adjustments to the statistical tests used, such that those reports inflated the apparent evidence for a donor effect.13,14 Furthermore, previous reports mostly analyzed donor effects with respect to the null hypotheses that all donors produce stool with equal clinical efficacy, or that particular features of stool, like its bacterial community diversity, are unrelated to clinical efficacy.

We propose that, rather than merely asking if donor effects exist or not, we should ask about the size of the donor effect, whether it is clinically relevant, and whether it is large enough to be detected in the studies that are testing for it. It is inconceivable that all stool has precisely the same therapeutic effect. If 2 donors are each used to treat enough patients, the difference between their treatment efficacies would almost certainly become statistically significant. However, without quantifying the donor effect, it is difficult to determine if the effect is clinically relevant or to determine how many patients would be required to power a clinical trial to reliably detect a donor effect.

Here we re-evaluate the existing evidence for donor effects and evaluate the implications of that evidence for experimental design and clinical practice. First, we lay out easy-to-implement but statistically rigorous methods for evaluating the donor effect in *post hoc* analyses. Second, we use these methods to estimate the statistical power of FMT clinical trials to detect donor effects. Finally, we apply these statistical methods to the existing literature.

## METHODS

### Distinguishing types of stool superiority

Discussions of “super-donors” and “super-stool” have suggested that stool might be superior in at least 4 distinct but conceptually related senses.
To avoid confusion in our re-evaluation of the evidence for donor effects, we distinguish between these definitions of stool superiority:

1. *Donor superiority*, in which particular donors are associated with better clinical outcomes for the recipient patients. For example, Moayyedi *et al*.13 tested whether a particular donor (“donor B”) was associated with better patient outcomes, compared to the other donors.
2. *Donor characteristic superiority*, in which donors with some particular characteristic are associated with better outcomes. For example, donor age, diet, and host genetics have been suggested as potential factors in donor superiority.8,16,21
3. *Donation superiority*, in which donations with some particular characteristic are associated with better outcomes. For example, Vermeire *et al*.22 tested whether stools with higher bacterial diversity are associated with better outcomes.
4. *Donor-recipient match superiority*, in which certain combinations of donor and recipient are associated with better outcomes,8,9,20,23,24 analogous to how donors’ and recipients’ blood types are matched for blood transfusions. FMT studies have tested whether stool from “related” donors,25 typically defined as first-degree relatives but also sometimes including spouses or partners, is associated with better patient outcomes.

These types of superiority are conceptually related and not mutually exclusive. For example, imagine it were the case that female donors were associated with better patient outcomes than male donors, making a case for “donor characteristic superiority”. However, because donors have a clinical effect on the patient only via their donation, sex of donor cannot be the molecular mechanism by which some donations are more efficacious than others. It would have to be that sex determined or correlated with some component of the stool that made those donations more effective.
In other words, there must be an underlying “donation characteristic superiority” that correlates with donor sex. Thus, even if donor-recipient match superiority might be the most accurate model, simpler types of superiority may be more parsimonious.7

When pools of stool from multiple donors are used for FMT, it is typical to test for donor superiority (i.e., if the pools that include material from a particular donor are associated with better patient outcomes)15 and pool characteristic superiority (e.g., if pools with higher bacterial α-diversity are associated with better patient outcomes).17 One could also test if particular pools are associated with better patient outcomes.

### Model of donor effects and quantification of donor effect size

To quantify the size of the donor effect, we modeled each donor as having a unique treatment efficacy, that is, the proportion of patients that would reach a positive clinical endpoint in a very large clinical trial. The donors’ treatment efficacies are drawn from a distribution. Specifically, the log odds of the donors’ treatment efficacies are normally distributed. We used this model of the distribution of donor efficacies because it corresponds to the distribution of a random effect in a logistic regression, which can be fit to data using well-established statistical methods.26

Under this model, the size of the donor effect is related to the “width” of this distribution. If the donor effect is strong, the difference in treatment efficacies between donors tends to be large. We defined the effect size as the median difference in treatment efficacy between two randomly selected donors (see Supplemental Methods). For example, say the effect size is a median difference of 10 percentage points, and the mean treatment efficacy across donors is 50%. Then it would not be unusual if two donors selected at random had efficacies of 45% and 55%.
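As a rough illustration of this effect-size definition, the Python sketch below follows the three steps given in the Supplemental Methods: draw log odds from a normal distribution, transform them into probabilities, and take the median difference between subsequent values. It is not the authors' R code. The function name, the number of draws, the seed, and the use of absolute differences are all assumptions made for illustration.

```python
import math
import random
import statistics


def median_efficacy_difference(mean_log_odds, sd_log_odds, n_draws=200_000, seed=0):
    """Median absolute difference in efficacy (percentage points) between
    consecutive simulated donors. Hypothetical helper, for illustration only."""
    rng = random.Random(seed)
    # Step 1: draw donors' log odds from a normal distribution.
    log_odds = [rng.gauss(mean_log_odds, sd_log_odds) for _ in range(n_draws)]
    # Step 2: transform log odds into probabilities (treatment efficacies).
    efficacies = [1.0 / (1.0 + math.exp(-x)) for x in log_odds]
    # Step 3: differences between subsequent values, then the median.
    # Absolute differences are assumed; signed differences would center on zero.
    diffs = [abs(b - a) for a, b in zip(efficacies, efficacies[1:])]
    return 100.0 * statistics.median(diffs)


if __name__ == "__main__":
    # Mean efficacy of 50% corresponds to mean log odds of 0.
    for sd in (0.1, 0.4, 1.0):
        print(sd, round(median_efficacy_difference(0.0, sd), 1))
```

With a mean efficacy of 50%, a standard deviation of roughly 0.4 in log odds corresponds to a median difference near 10 percentage points, which is the scale of the example in the text.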
### Statistical power computations

We performed statistical power computations to estimate what donor effect size can be reliably detected in an FMT clinical trial for IBD. At a range of effect sizes, statistical power was estimated as the proportion of 1,000 simulated clinical trials with *p* < 0.05. For each simulated clinical trial:

1. Efficacies for donors (2, 3, 6, or 12 donors) were randomly drawn from a distribution of log odds. The mean of the distribution of log odds was selected based on typical FMT efficacy rates (30%).9,13,15,27 The effect size determines the standard deviation.
2. Patients treated with FMT (12, 24, 48, 96, or 192 patients) were distributed evenly among the donors.
3. Patients’ outcomes were simulated based on their donors’ efficacies.
4. A statistical test (Fisher-Freeman-Halton test, see below) was run, and the *p*-value noted.

If the simulated power for any effect size reached 80%, we found the effect size corresponding to 80% power by linear interpolation. Ten effect sizes were evaluated, ranging from approximately zero (10⁻⁶ standard deviation in log odds) to the difference in log odds between the typical FMT efficacy rate and typical placebo rate for a clinical indication. This design sets the upper limit on donor efficacies such that, at most, those donors 1 standard deviation below the mean (i.e., ∼16% of all donors) have efficacies worse than typical placebo rates (5%).

### Studies re-analyzed

We analyzed a set of studies from recent FMT clinical study literature that tested the hypothesis that one donor or donation characteristic was associated with improved patient outcomes (Results). To simplify the analyses, we considered only a single dichotomous outcome, whether a patient achieved the trial’s primary clinical endpoint.
We also restricted microbiome analyses to consider only the bacterial α-diversity of stool, because associations between donor material α-diversity and patient outcomes have been investigated in multiple studies,14,16,22 and because increasing the bacterial diversity of stool is part of the motivation for using pools of stool from multiple donors in FMT.15,23,27

### Statistical methodology

Statistics were computed using the R programming language (version 3.6.0).28 To test the hypothesis that there is any difference in efficacy between donors (or pools of donors), we used the Fisher-Freeman-Halton test (function *fisher.test*) on 2 × *D* contingency tables, where *D* is the number of donors (or pools). In cases when the Fisher-Freeman-Halton test was computationally infeasible, we used the *χ*² test (function *chisq.test*). For 2 × 2 contingency tables (e.g., to test if one donor has a different treatment efficacy than all the others), we used Fisher’s exact test (function *exact2x2*)29 and the mid-*p* value to prevent the tests from being overly conservative.30 We used logistic regressions (function *glm*) to test for donation/pool characteristic superiority. We used random effects logistic regressions (function *glmer* from the *lme4* package)26 to estimate the strength of the donor effect.

### Microbiome analysis

We re-analyzed raw 16S rRNA sequencing data from three studies (Kump *et al*., study accession PRJEB11841; Jacob *et al*., accession PRJNA388210; Goyal *et al*., accession PRJNA380944) using QIIME 2 (version 2019.7)31 and Deblur.32 Paired reads were joined (*vsearch* plugin, default parameters), quality filtered (*quality-filter* plugin, default parameters), and denoised using Deblur (trim length 253 nucleotides, minimum reads 1, otherwise default parameters). To follow Kump *et al*.’s general analysis method, each sample in the resulting table of amplicon sequence variants (ASVs) was downsampled to the minimum number of ASV counts across samples.
Bacterial α-diversity was computed as richness, the number of unique ASVs present in each sample. For Goyal *et al*., the sequencing data was single-ended, so reads were not paired for that study. When a donor had multiple samples, we associated a single α-diversity with each donor by computing the average number of ASVs over that donor’s samples.

### Code and data availability

Computer code and underlying data to reproduce the results are online at [https://www.github.com/openbiome/donor-effects](https://www.github.com/openbiome/donor-effects).

## RESULTS

### Statistical power of FMT clinical trials for IBD to detect a donor effect

Using simulations, we estimated the minimal strength of the donor effect required for an FMT clinical trial for IBD to reach 80% statistical power in detecting a donor effect with a Fisher-Freeman-Halton test (Figure 1). Trials’ ability to reliably detect a donor effect increased with the strength of the donor effect, the number of patients, and the number of donors in the trial. More patients and donors allowed for reliable detection of weaker donor effects.

![Figure 1](https://www.medrxiv.org/content/medrxiv/early/2019/11/12/19011635/F1.medium.gif)

Figure 1: Statistical power calculations showing detectable effect sizes for varying trial designs. In simulated clinical trials, a number of patients (vertical axis) are evenly allotted to a number of donors (horizontal axis). Each tile’s color and number show the minimum effect size (median difference in efficacy, measured in percentage points, between two randomly selected donors) required for the trial design to reach 80% statistical power. Gray tiles show that 80% statistical power cannot be achieved for plausible donor variations (i.e., for which most donors perform at least as well as placebo). p.p.: percentage points.
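The simulation procedure behind Figure 1 (described in Methods) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R implementation: the function names, the seed, and the default arguments are hypothetical, and the exact Fisher-Freeman-Halton *p*-value is computed by brute-force enumeration of all 2 × *D* tables with the observed margins, which is practical only for small designs.

```python
import math
import random


def ffh_p_value(successes, totals):
    """Exact Fisher-Freeman-Halton p-value for a 2 x D table, by enumerating
    every table with the same margins (small tables only)."""
    n_success = sum(successes)
    denom = math.comb(sum(totals), n_success)
    # Probability of the observed table under the null (multivariate hypergeometric).
    p_obs = math.prod(math.comb(n, s) for s, n in zip(successes, totals)) / denom
    p_value = 0.0

    def recurse(j, remaining, weight):
        nonlocal p_value
        if j == len(totals) - 1:
            # The last donor's success count is fixed by the row margin.
            if remaining <= totals[j]:
                prob = weight * math.comb(totals[j], remaining) / denom
                if prob <= p_obs * (1 + 1e-7):  # tolerance for ties
                    p_value += prob
            return
        for s in range(min(remaining, totals[j]) + 1):
            recurse(j + 1, remaining - s, weight * math.comb(totals[j], s))

    recurse(0, n_success, 1)
    return min(p_value, 1.0)


def simulate_power(n_donors, n_patients, sd_log_odds, mean_efficacy=0.30,
                   n_trials=1000, alpha=0.05, seed=0):
    """Fraction of simulated trials with p < alpha (assumes n_patients is
    divisible by n_donors, so patients are spread evenly)."""
    rng = random.Random(seed)
    mean_log_odds = math.log(mean_efficacy / (1 - mean_efficacy))
    per_donor = n_patients // n_donors
    totals = [per_donor] * n_donors
    hits = 0
    for _ in range(n_trials):
        # 1. Draw each donor's efficacy from a normal distribution in log odds.
        efficacies = [1 / (1 + math.exp(-rng.gauss(mean_log_odds, sd_log_odds)))
                      for _ in range(n_donors)]
        # 2-3. Simulate a success/failure outcome for every patient of every donor.
        successes = [sum(rng.random() < p for _ in range(per_donor))
                     for p in efficacies]
        # 4. Test for any difference between donors.
        if ffh_p_value(successes, totals) < alpha:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    for sd in (1e-6, 1.0, 2.0):
        print(sd, simulate_power(n_donors=3, n_patients=24, sd_log_odds=sd))
```

Because the number of tables to enumerate grows quickly with the number of donors and patients, larger designs would need a network algorithm or a Monte Carlo *p*-value instead of this brute-force enumeration.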
For trials with 24 patients and any number of donors, the strength of the donor effect required for reliable detection of that effect was implausibly large: it would require that a substantial fraction of donors caused their patients to perform worse than placebo (see Methods). Similarly, studies with 2 donors and 192 or fewer patients were also unable to detect plausible donor effects. Trials with 96 patients were able to reliably detect a donor effect when the median difference in treatment efficacy between randomly selected donors was 20 percentage points or more. In other words, the donor effect must be strong enough that, in a typical pair of donors, one might have a clinical efficacy of 20% and the other 40%. Even with 384 patients in a trial, the donor effect would need to be large enough that the median difference in clinical efficacy between randomly selected donors was more than 10 percentage points.

### Re-evaluations of the evidence for a donor effect in IBD

The results of the statistical power computations above suggest that previous studies of FMT for IBD are unlikely to have detected a donor effect. We therefore re-evaluated the evidence for a donor effect in IBD from a selection of FMT clinical studies (Table 1).

[Table 1](http://medrxiv.org/content/early/2019/11/12/19011635/T1): Summary of re-analyses.

#### Rossen et al. 2015

In this study,9 23 ulcerative colitis patients were treated with FMT. 17 patients received stool from 1 of 15 donors; 6 patients received stool from 2 different donors. In the original publication, the investigators report on the performance of 3 donors (Table 2). There was no evidence of a donor effect (mid-*p* > 0.05, Fisher’s exact test) for any of the 3 donors. However, the study is small enough that the data are not inconsistent with large variability between donors.
For example, the upper limit of the 95% confidence interval on donor A’s odds ratio of successful patient outcome compared to all other donors is 29. In other words, the data do not provide evidence of a donor effect, but they also cannot definitively rule out a strong donor effect.

[Table 2](http://medrxiv.org/content/early/2019/11/12/19011635/T2): Tests of donor superiority, by donor, for Rossen *et al*. Donor labels are arbitrary. “Success” means the patient reached the primary endpoint; “failure” means they did not.

#### Moayyedi et al. 2015

In this study,13 38 ulcerative colitis patients were treated with FMT. Each patient was treated with stool from 1 of 6 donors. Donors were used according to an adaptive process, described in the original publication. Briefly, at the start of the trial, patients were treated with stool from one of two donors (A or B). Material from donor B then became unavailable, and patients were instead treated with stool from donor A or 1 of 4 other donors (C, D, E, and F). At this point, donor B became available again, and “[t]he remaining participants allocated to active therapy all received FMT from donor B exclusively, as [the investigators] had not experienced any success with donor A”. In the original study, donor B’s performance (7 of 18 treated patients achieved the primary endpoint) was compared against all other donors (2 of 20 patients; odds ratio 5.5 [95% CI 1.0 to 44], mid-*p* = 0.048, Fisher’s exact test).

In general, adaptive trial designs require special statistical methodologies.33 However, we were aware of no statistical methodologies designed to account for an adaptive design like the one used in this trial. To investigate whether typical statistical tests like Fisher’s exact test or a *χ*² test accurately measure the statistical significance of the collected data, we simulated a simplified adaptive donor selection process. Among the first 24 patients, use all 6 donors 4 times each.
Then exclusively use the best-performing donor for the remaining 14 patients. Finally, compare the overall performance of this donor against all the others. In these simulations, if all the donors are the same, the probability of finding mid-*p* < 0.05 is approximately 9%, greater than the 5% that would be expected if the test were accurate for that type of data (Supplemental Results). The intuitive explanation is that, if all donors are the same, the selection process is merely setting aside donors who had “bad luck” on their first patients, and they are not allowed to recover their performance with a run of “good luck” later on.

This is not a flaw of Fisher’s exact test. Instead, it is a problem of applying a statistical test to data that do not meet the assumptions of the test. Although the adaptive procedure that led to extensive use of donor B may have improved the probability of the trial’s success,7 the procedure was not pre-specified, making it impossible to rigorously determine the degree to which the observed results are consistent with all donors being identical, that is, whether donor B simply had a “lucky” initial run. Thus, the value mid-*p* ≈ 0.05 reported above is inaccurately optimistic, and a value closer to 0.09 is a better representation of the likelihood of a donor having the kind of success observed in this trial.

#### Paramsothy et al. 2017

In this study,15 41 ulcerative colitis patients were randomized to receive FMT, and another 37 were randomized to placebo and later received FMT. Each patient received FMT from 1 of 21 pools of donors. Each pool included 3 to 7 donors, out of 13 total donors. The original publication compared the best performing individual donor against all the others (14 of 38 patients achieved the primary outcome, versus 7 of 40 assigned to other donors; odds ratio 2.7 [95% CI 0.96 to 8.2], mid-*p* = 0.06, Fisher’s exact test).
This result, although near mid-*p* = 0.05, was not subjected to multiple hypothesis correction, which is needed because there are 13 relevant hypotheses to be tested, one per donor. In terms of pools of donors, there was also no evidence that any particular pool was associated with better outcomes (*p* = 0.57, Fisher-Freeman-Halton test on 2 × 21 table of patient outcomes by pool).

To develop an estimate of the strength of the donor effect, we used a logistic regression, modeling the pool as a random effect. This model estimated that the median difference between the efficacy of two randomly selected pools is 0 percentage points (95% CI 0 to 20 percentage points). In other words, the regression’s best estimate of the variation in efficacy among pools is zero, even without applying any penalizations for model complexity. Thus, similar to Rossen *et al*., the data do not provide evidence for a donor effect, but they are also not inconsistent with a strong effect, since the upper confidence limit on the strength of the donor effect is 20 percentage points.

#### Costello et al. 2019

In this study,27 38 ulcerative colitis patients received FMT from 1 of 11 pools. Each pool included stool from 3 or 4 donors, out of 19 total donors. Similar to Paramsothy *et al*., there was no evidence of heterogeneity in patient outcomes by donor pool (*p* = 0.50, Fisher-Freeman-Halton test on 2 × 11 table) nor evidence of better outcomes for any particular donor (mid-*p* > 0.05 for all donors, Fisher’s exact test). Furthermore, a logistic regression estimated the median difference between two randomly selected donors as 0 percentage points (95% CI 0 to 26 percentage points).

#### Jacob et al. 2017

In this study,17 20 ulcerative colitis patients received FMT from 1 of 6 pools. Each pool included stool from 2 donors, out of 4 total donors. In this analysis, we analyzed only the 14 patients available for follow-up at 12 weeks, using remission at this timepoint as the outcome. Similar to Paramsothy *et al*.
and Costello *et al*., there was no evidence for differences in patient outcomes by pool (Fisher-Freeman-Halton test on 2 × 4 table, *p* = 0.86) or by donor (mid-*p* > 0.05 for all 4 donors, Fisher’s exact test). A logistic regression estimated the median difference in efficacy between two randomly selected pools as zero (95% CI 0 to 31 percentage points).

The pools administered to 13 of the 14 patients were characterized using 16S rRNA sequencing, allowing for an analysis of patient outcomes by bacterial characteristics of the pools (Figure 2). However, there was no evidence for an association between patient outcomes and the bacterial community α-diversity, measured as the number of unique amplicon sequence variants (ASVs, i.e., denoised 100% OTUs), of the pool they received (*p* = 0.18, Wilcoxon rank-sum test).

![Figure 2](https://www.medrxiv.org/content/medrxiv/early/2019/11/12/19011635/F2.medium.gif)

Figure 2: Bacterial α-diversity of FMT pools (vertical axis) does not significantly differ by patient outcomes (columns), in Jacob *et al*. (*p* = 0.18, Wilcoxon rank-sum test). Each point represents a single patient. The diversity shown for each patient (vertical axis) is the average of the number of unique 16S rRNA amplicon sequence variants (ASVs) in the sampled FMTs that were used in that patient.

![Figure 3](https://www.medrxiv.org/content/medrxiv/early/2019/11/12/19011635/F3.medium.gif)

Figure 3: Bacterial α-diversity does not differ by patient outcome in re-analysis of Kump *et al*.’s data when the unit of analysis is the patient (Wilcoxon test, *p* > 0.05 for all 3 comparisons). Each point represents a single patient. The diversity shown for each patient (vertical axis) is the average of the diversity of the sampled FMTs used in that patient.
To develop confidence intervals around the strength of the donor effect, we fit a logistic regression to the data, predicting patient outcomes from pool α-diversity. The model’s point estimate for pool efficacy was 20% (95% CI 2% to 76%) for the pool with the lowest α-diversity (239 ASVs) but 84% (95% CI 25% to 99%) for the pool with the highest diversity (487 ASVs). In other words, the data do not provide statistically significant evidence for an association between pool α-diversity and patient outcomes, but they cannot definitively rule out the possibility of a wide variation in efficacy between pools.

#### Meta-analysis of pool studies

Paramsothy *et al*., Costello *et al*., and Jacob *et al*. have sufficiently similar designs and the available data to allow a meta-analysis. A logistic regression on the combined data from all three studies, with the study as a fixed effect, estimated the median difference in efficacy between two randomly selected pools as zero, with a smaller confidence interval than for any individual study (95% CI 0 to 16 percentage points).

#### Goyal et al. 2018

In this study,16 21 patients with inflammatory bowel disease (any of Crohn’s disease, ulcerative colitis, or indeterminate colitis) received FMT and were available for follow-up. We used remission at 30 days as the outcome. Each patient received stool from 1 donor, precluding any test of donor superiority. However, 16S rRNA sequencing was performed, allowing for an analysis of donation characteristic superiority, similar to Jacob *et al*. above.
Similar to that study, there was no evidence for an association between patient outcomes and the α-diversity (*p* = 0.3, Wilcoxon rank-sum test), but a logistic regression estimated a potentially wide but not statistically significant variation in donation efficacy across the range of α-diversities among donors, predicting an efficacy of 38% (95% CI 8% to 82%) for the donor with the lowest α-diversity (386 ASVs) and an efficacy of 75% (95% CI 26% to 96%) for the donor with the highest diversity (1072 ASVs). Again, there was no statistically significant association between patient outcomes and donor stool characteristics, but the data are not inconsistent with a wide variation.

#### Kump et al. 2017

In this study,14 17 ulcerative colitis patients received FMT from 1 of 14 donors. 12 donors were used in 1 patient, 1 donor was used in 2 patients, and 1 donor was used in 3 patients. Like Jacob *et al*. and Goyal *et al*., this study tested whether increased α-diversity was associated with improved patient outcomes. In the original analysis, the 4 patients who did not respond clinically to FMT were compared against the 4 patients who reached remission (Figure 3). 16S rRNA sequencing data was available for 28 donations associated with those 8 patients. Donors assigned to patients who achieved remission did not have statistically significantly higher α-diversity than those assigned to patients who did not respond (Wilcoxon rank-sum test, *p* = 0.3).

The original publication reported statistically significant differences between the α-diversity of stool samples associated with remission (*n* = 16) compared to samples associated with no response (*n* = 12; *p* = 0.11). Our analysis reaches a different conclusion, finding no support for a donor effect, principally because we analyzed the data as only 8 independent data points (i.e., one for each patient), while Kump *et al*.
analyzed the data as if there were 28 independent data points (i.e., one for each donation that was administered to a patient that also has associated microbiome data). Repeated measurements of the material delivered to a patient do not constitute independent measures of the patient’s outcome. In other words, when testing for an association between patient outcomes and a donation characteristic like α-diversity, the weight of the evidence is determined by the number of patient outcomes, not by the number of microbiome measurements. As an extreme example, consider a study with two donors and two patients: donor A is used with a patient who achieves remission, while donor B is used with a patient who has no response to FMT. If 100 microbiome samples were taken from each of donors A and B, a difference between A’s and B’s samples could be found with great statistical confidence. However, this does not translate into confidence in the relationship between the microbiome and patient outcome.

Similar to the studies above, a logistic regression identified a potentially wide but not statistically significant variation in donation efficacy across the range of α-diversities, predicting an efficacy of 38% (95% CI 8% to 82%) for the donor with the lowest α-diversity (386 ASVs) and an efficacy of 75% (95% CI 26% to 96%) for the donor with the highest diversity (1072 ASVs).

#### Vermeire et al. 2016

In this study,22 14 patients with inflammatory bowel disease (either of Crohn’s disease or ulcerative colitis) received FMT, each from a different donor. The original study reported that the bacterial α-diversity of donations varied by patient outcomes, namely that patients with successful outcomes received more bacterially diverse donations (Wilcoxon rank-sum test, *p* = 0.012). However, the underlying bacterial sequencing data was not available for re-analysis.

#### Nishida et al. 2017

In this study, 41 patients with ulcerative colitis received FMT, each from a different donor.
The original study found no difference in the α-diversity of the stool from 7 donors whose patients met the primary endpoint, compared to 19 donors whose patients did not (*p* = 0.69). The study also tested whether 10 taxa were differentially abundant among those two groups of donors, finding *p*-values less than 0.05 for 2 of 9 taxa, neither of which was significant after Benjamini-Hochberg multiple hypothesis correction.34 The underlying bacterial sequencing data was not available for re-analysis.

## DISCUSSION

In simulations of FMT clinical trials for IBD, we found that trials would need to be very large, with more than 200 patients receiving FMT, to achieve 80% statistical power to detect a modest donor effect (median 10 percentage point difference in efficacy between two randomly selected donors). In other words, if the donor effect were only strong enough that randomly selected pairs of donors had a 10 percentage point difference in efficacy (say, one having a 30% efficacy and the other a 40% efficacy), then no existing trial of FMT for IBD had a design that could reliably detect such an effect.

Concluding that existing FMT studies for IBD were simply not powered to detect donor effects, we re-evaluated reports from those studies, finding no statistically rigorous evidence for the superiority of particular donors and only very weak evidence for variation in FMT’s efficacy based on the α-diversity of the FMT material (specifically, a single result from Vermeire *et al*.). In most cases, the statistical best estimate for the strength of the donor effect is exactly zero. However, the small size of these studies relative to the number of patients required to reliably detect a plausible donor effect also means that the data cannot definitively rule out a large donor effect, as much as 20 or 30 percentage points between randomly selected donors. It therefore remains undetermined if differences between donors are clinically relevant to FMT for IBD.
These sobering results are subject to multiple limitations. First, we measured the associations between very simple measures. We considered only dichotomous patient outcomes and the relatively simple microbiological characteristic of bacterial α-diversity as determined by 16S rRNA taxonomic marker gene sequencing. Reliably detecting donor effects may require more sophisticated methods. For example, Jacob *et al*. report that donors’ bacterial compositions meaningfully differed with respect to patient outcomes (Figure 4C in that publication, PERMANOVA, *p* = 0.044). More proximal and fine-grained measurements of patients’ response to FMT may also provide more sensitivity for detecting differences in donor efficacies. Dichotomous primary endpoints and ordinal changes in patients’ Mayo scores are subject to enormous sources of variance in the patient’s biology and disease history that may be hiding the more subtle effects of differences between donors.

Second, we modeled the distribution of donor and pool efficacies as Gaussian in log odds space. This approach allowed us to use mixed model logistic regression, a well-established statistical procedure, to estimate the variance in donors’ efficacies, but it is not an intuitive measure of variation. Furthermore, this model of donor variability does not require that all donors’ efficacies be higher than the placebo rate. A model of donor variability that does sensibly account for placebo rates was previously used in a study of FMT clinical trial design,7 but that model would require adaptation to make it suitable for statistical inference.
In light of the number of patients required for a well-powered test for a donor effect, we propose that an important path is meta-analysis across clinical trials, compiling characterizations of donors’ and patients’ microbiomes and host factors, linked with patient outcomes.6 This approach will present difficult technical challenges, as data from multiple studies must be carefully collected, cleaned, and analyzed to make the analytical results valid across those studies.35 However, liberal sharing of clinical metadata and microbiome data, including careful reporting of donor and donation characteristics, will more easily enable inference across studies, and may provide the best direction forward to characterizing the donor effect for FMT in IBD.

## Data Availability

Computer code and underlying data to reproduce the results are online at [https://www.github.com/openbiome/donor-effects](https://www.github.com/openbiome/donor-effects).

## Author contributions

SWO conceived the study, obtained data, performed the analysis, and wrote the manuscript. YG obtained data; improved the analysis, interpretation, and presentation of the results; and revised the manuscript.

## SUPPLEMENTAL METHODS

### Calculating the median difference in treatment efficacy between two randomly selected donors

To convert a mean and standard deviation, measured in terms of log odds ratios, into the median difference in treatment efficacy between two randomly selected donors, measured in terms of percentage points:

1. Draw 2 million random variates from a normal distribution with that mean log odds and standard deviation.
2. Transform these values into probabilities.
3. Calculate the difference between successive values, and determine the median.

## SUPPLEMENTAL RESULTS

### Moayyedi *et al*. adaptive simulation

To determine this probability, we simulated 10,000 clinical trials.
In each trial:

* 6 donors had 4 success/fail trials with probability of success 9 / 39 = 0.23.
* The best performing donor (or, in the case of a tie, a randomly selected donor among the best performers) had 14 additional success/fail trials with the same probability of success.
* The performance of the donor with 18 patients was compared against the aggregate performance of the other donors, with 20 patients, using a Fisher’s exact test.

The test returned mid-*p* < 0.05 in 8.6% of simulations (*n* = 866; 95% CI 8.1% to 9.2%).

## Acknowledgements

* Sudarshan Paramsothy, Nadeem Kaakoush, and Rotem Sadovsky for assistance in collecting the Paramsothy *et al*. data and for helpful conversations.
* Sam Costello for assistance in collecting the Costello *et al*. data and for helpful conversations.
* Eric J. Alm, Shrish Budree, Claire Duvallet, Justin O’Sullivan, Pratik Panchal, Mark B. Smith, and Duane Wesemann for helpful conversations.

## Footnotes

* **Emails:** solesen{at}openbiome.org, ylaine{at}finchtherapeutics.com
* Received November 8, 2019.
* Revision received November 8, 2019.
* Accepted November 12, 2019.
* © 2019, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/)

## REFERENCES

1. Marchesi, J. R. et al. The gut microbiota and host health: a new clinical frontier. Gut 65, 330–339 (2016).
2. McDonald, L. C. et al.
Clinical Practice Guidelines for Clostridium difficile Infection in Adults and Children: 2017 Update by the Infectious Diseases Society of America (IDSA) and Society for Healthcare Epidemiology of America (SHEA). Clin. Infect. Dis. 66, e1–e48 (2018).
3. Debast, S. B., Bauer, M. P. & Kuijper, E. J. European Society of Clinical Microbiology and Infectious Diseases: Update of the Treatment Guidance Document for Clostridium difficile Infection. Clin. Microbiol. Infect. 20, 1–26 (2014).
4. David, L. A. et al. Diet rapidly and reproducibly alters the human gut microbiome. Nature 505, 559–563 (2014).
5. The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
6. Olesen, S. W., Leier, M. M., Alm, E. J. & Kahn, S. A.
Searching for superstool: maximizing the therapeutic potential of FMT. Nat. Rev. Gastroenterol. Hepatol. 15, 387–388 (2018).
7. Olesen, S. W., Gurry, T. & Alm, E. J. Designing fecal microbiota transplant trials that account for differences in donor stool efficacy. Stat. Methods Med. Res. 27, 2906–2917 (2018).
8. Wilson, B. C., Vatanen, T., Cutfield, W. S. & O’Sullivan, J. M. The Super-Donor Phenomenon in Fecal Microbiota Transplantation. Front. Cell. Infect. Microbiol. 9, 2 (2019).
9. Rossen, N. G. et al. Findings From a Randomized Controlled Trial of Fecal Transplantation for Patients With Ulcerative Colitis. Gastroenterology 149, 110-118.e4 (2015).
10. Barnes, D. et al. Competitively Selected Donor Fecal Microbiota Transplantation: Butyrate Concentration and Diversity as Measures of Donor Quality. J. Pediatr. Gastroenterol. Nutr. 67, 185–187 (2018).
11. Duvallet, C. et al. Framework for rational donor selection in fecal microbiota transplant clinical trials. medRxiv (2019) doi:10.1101/19000307.
12. Osman, M., Abend, A., Panchal, P., Kassam, Z. & Budree, S. 88 - Does the Donor Matter? Microbiome Sequencing to Evaluate Lower Donor Efficacy in Fecal Microbiota Transplantation for Recurrent Clostridium Difficile Infection. Gastroenterology 154, S-25-S-26 (2018).
13. Moayyedi, P. et al.
Fecal Microbiota Transplantation Induces Remission in Patients With Active Ulcerative Colitis in a Randomized Controlled Trial. Gastroenterology 149, 102-109.e6 (2015).
14. Kump, P. et al. The taxonomic composition of the donor intestinal microbiota is a major factor influencing the efficacy of faecal microbiota transplantation in therapy refractory ulcerative colitis. Aliment. Pharmacol. Ther. 47, 67–77 (2018).
15. Paramsothy, S. et al. Multidonor intensive faecal microbiota transplantation for active ulcerative colitis: a randomised placebo-controlled trial. Lancet Lond. Engl. 389, 1218–1228 (2017).
16. Goyal, A. et al. Safety, Clinical Response, and Microbiome Findings Following Fecal Microbiota Transplant in Children With Inflammatory Bowel Disease. Inflamm. Bowel Dis. 24, 410–421 (2018).
17. Jacob, V. et al. Single Delivery of High-Diversity Fecal Microbiota Preparation by Colonoscopy Is Safe and Effective in Increasing Microbial Diversity in Active Ulcerative Colitis. Inflamm. Bowel Dis. 23, 903–911 (2017).
18. Allegretti, J. R., Mullish, B. H., Kelly, C. & Fischer, M. The evolution of the use of faecal microbiota transplantation and emerging therapeutic indications. Lancet Lond. Engl. 394, 420–431 (2019).
19. Nishida, A. et al. Efficacy and safety of single fecal microbiota transplantation for Japanese patients with mild to moderately active ulcerative colitis. J. Gastroenterol. 52, 476–482 (2017).
20. Paramsothy, S. et al. Faecal Microbiota Transplantation for Inflammatory Bowel Disease: A Systematic Review and Meta-analysis. J. Crohns Colitis 11, 1180–1199 (2017).
21. Budree, S. et al. The Association of Stool Donor Diet on Microbial Profile and Clinical Outcomes of Fecal Microbiota Transplantation in Clostridium Difficile Infection. Gastroenterology 152, S630–S631 (2017).
22. Vermeire, S. et al. Donor Species Richness Determines Faecal Microbiota Transplantation Success in Inflammatory Bowel Disease. J. Crohns Colitis 10, 387–394 (2016).
23. Kazerouni, A. & Wein, L. M. Exploring the Efficacy of Pooled Stools in Fecal Microbiota Transplantation for Microbiota-Associated Chronic Diseases. PloS One 12, e0163956 (2017).
24. Davido, B. et al. Impact of Fecal Microbiota Transplantation for Decolonization of Multidrug-Resistant Organisms May Vary According to Donor Microbiota. Clin. Infect. Dis. (2017) doi:10.1093/cid/cix963.
25. Khanna, S. et al. Changes in microbial ecology after fecal microbiota transplantation for recurrent C. difficile infection affected by underlying inflammatory bowel disease. Microbiome 5, (2017).
26. Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting Linear Mixed-Effects Models Using lme4. J. Stat. Softw. 67, (2015).
27. Costello, S. P. et al. Effect of Fecal Microbiota Transplantation on 8-Week Remission in Patients With Ulcerative Colitis: A Randomized Clinical Trial. JAMA 321, 156–164 (2019).
28. R Core Team. R: a language and environment for statistical computing. (R Foundation for Statistical Computing, 2019).
29. Fay, M. P. Confidence intervals that match Fisher’s exact or Blaker’s exact tests. Biostatistics 11, 373–374 (2010).
30. Mehta, C. R. & Hilton, J. F. Exact Power of Conditional and Unconditional Tests: Going beyond the 2 × 2 Contingency Table. Am. Stat. 47, 91–98 (1993).
31. Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857 (2019).
32. Amir, A. et al. Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns. mSystems 2, (2017).
33. Bhatt, D. L. & Mehta, C. Adaptive Designs for Clinical Trials. N. Engl. J. Med. 375, 65–74 (2016).
34. Yekutieli, D. & Benjamini, Y.
The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 1165–1188 (2001).
35. Gibbons, S. M., Duvallet, C. & Alm, E. J. Correcting for batch effects in case-control microbiome studies. PLoS Comput. Biol. 14, e1006102 (2018).
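
As a supplementary illustration of the Moayyedi *et al*. adaptive simulation described in the Supplemental Results, the three steps of each simulated trial can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used for the analysis. The function names are hypothetical. The two-sided mid-*p* is computed under one common definition, which may differ from the definition used in the original analysis, so the sketch is not expected to reproduce the reported percentage exactly.

```python
import math
import random


def fisher_mid_p(a, b, c, d):
    """Two-sided Fisher exact mid-p for the 2x2 table [[a, b], [c, d]].

    Assumed definition: total probability of tables less likely than the
    observed one, plus half the probability of tables exactly as likely.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x successes in the first row
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    p_less = p_equal = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = prob(x)
        if math.isclose(p, p_obs, rel_tol=1e-9):
            p_equal += p
        elif p < p_obs:
            p_less += p
    return p_less + 0.5 * p_equal


def simulate_trial(rng, n_donors=6, n_initial=4, n_extra=14, p_success=9 / 39):
    """One adaptive trial with no true donor effect; returns the mid-p."""
    # Stage 1: each donor treats n_initial patients
    successes = [sum(rng.random() < p_success for _ in range(n_initial))
                 for _ in range(n_donors)]
    # Stage 2: the best performer (ties broken at random) treats n_extra more
    best = max(successes)
    chosen = rng.choice([i for i, s in enumerate(successes) if s == best])
    extra = sum(rng.random() < p_success for _ in range(n_extra))
    a = successes[chosen] + extra            # chosen donor: 18 patients
    c = sum(successes) - successes[chosen]   # other donors: 20 patients
    n_chosen = n_initial + n_extra
    n_others = (n_donors - 1) * n_initial
    return fisher_mid_p(a, n_chosen - a, c, n_others - c)


def false_positive_rate(n_trials=2000, seed=0):
    """Fraction of simulated trials with mid-p < 0.05."""
    rng = random.Random(seed)
    hits = sum(simulate_trial(rng) < 0.05 for _ in range(n_trials))
    return hits / n_trials
```

Calling `false_positive_rate(n_trials=10_000)` would mirror the number of simulated trials described in the Supplemental Results. Because every donor shares the same success probability, any significant result in this sketch is a false positive induced by the adaptive selection of the best early performer.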