## Abstract

A novel coronavirus (SARS-CoV-2) has recently emerged as a global threat. As the epidemic progresses, many disease modelers have focused on estimating the basic reproductive number *ℛ*_{0}– the average number of secondary cases caused by a primary case in an otherwise susceptible population. The modeling approaches and resulting estimates of *ℛ*_{0} vary widely, despite relying on similar data sources. Here, we present a novel statistical framework for comparing and combining different estimates of *ℛ*_{0} across a wide range of models by decomposing the basic reproductive number into three key quantities: the exponential growth rate *r*, the mean generation interval , and the generation-interval dispersion *κ*. We then apply our framework to early estimates of *ℛ*_{0} for the SARS-CoV-2 outbreak. We show that many early *ℛ*_{0} estimates are overly confident. Our results emphasize the importance of propagating uncertainties in all components of *ℛ*_{0}, including the shape of the generation-interval distribution, in efforts to estimate *ℛ*_{0} at the outset of an epidemic.

- SARS-CoV-2
- COVID-19
- novel coronavirus
- basic reproductive number
- generation interval
- Bayesian multilevel model

## 1 Introduction

Since December 2019, a novel coronavirus (SARS-CoV-2) has been spreading in China and other parts of the world (World Health Organization, 2020d). Although the virus is believed to have originated from animal reservoirs (Centers for Disease Control and Prevention, 2020), the ability of SARS-CoV-2 ability to directly transmit between humans has posed a greater threat for its spread (Huang et al., 2020; World Health Organization, 2020c). As of February 27, 2020, the World Health Organization (WHO) has confirmed 82,294 cases of the coronavirus disease (COVID-19), including 3,664 confirmed cases in 46 different countries, outside China (World Health Organization, 2020a).

As the disease continues to spread, many researchers have already published their analyses of the outbreak as pre-prints (e.g., Bedford et al. (2020); Imai et al. (2020); Liu et al. (2020); Majumder and Mandl (2020); Read et al. (2020a); Zhao et al. (2020)) and in peer-reviewed journals (e.g., Li et al. (2020); Riou and Althaus (2020b); Wu et al. (2020); Zhao et al. (2020)), focusing in particular on estimates of the basic reproductive number *ℛ*_{0} (i.e., the average number of secondary cases generated by a primary case in a fully susceptible population (Anderson and May, 1991; Diekmann et al., 1990)). Estimates of the basic reproductive number are of interest during an outbreak because they provide information about the level of intervention required to interrupt transmission (Anderson and May, 1991), and about the potential final size of the outbreak (Anderson and May, 1991; Ma and Earn, 2006). We commend these researchers for their timely contribution and those who made the data publicly available. However, it can be difficult to compare a disparate set of estimates of *ℛ*_{0} from different research groups (as well as the associated degrees of uncertainty) when the estimation methods and their underlying assumptions vary widely.

Here, we show that a wide range of approaches to estimating *ℛ*_{0} can be understood and compared in terms of estimates of three quantities: the exponential growth rate *r*, the mean generation interval , and the generation-interval dispersion *κ*. The generation interval, defined as the interval between the time when an individual becomes infected and the time when that individual infects another individual (Svensson, 2007), plays a key role in shaping the relationship between *r* and *ℛ*_{0} (Wearing et al., 2005; Roberts and Heesterbeek, 2007; Wallinga and Lipsitch, 2007; Park et al., 2019); therefore, estimates of *ℛ*_{0} from different models directly depend on their implicit assumptions about the generation-interval distribution and the exponential growth rate. Early in an epidemic, information is scarce and there is inevitably a great deal of uncertainty surrounding both case reports (affecting the estimates of the exponential growth rate) and contact tracing (affecting the estimates of the generation-interval distribution). We suggest that disease modelers should make sure their assumptions about these three quantities are clear and reasonable, and that estimates of uncertainty in *ℛ*_{0} should propagate error from all three sources (Elderd et al., 2006).

We compare seven disparate models published online between January 23–26, 2020 that estimated *ℛ*_{0} for the SARS-CoV-2 outbreak (Bedford et al., 2020; Imai et al., 2020; Liu et al., 2020; Majumder and Mandl, 2020; Read et al., 2020a; Riou and Althaus, 2020a; Zhao et al., 2020). We use a Bayesian multilevel model to construct pooled estimates for the three key quantities: *r*, , and *κ*; the pooled estimates reflect the uncertainties present in modeling approaches and their underlying assumptions. We use these pooled estimates to illustrate the importance of propagating different sources of error, particularly uncertainty in both the growth rate and the generation interval. We also use our framework to tease apart which assumptions of these different models led to their different estimates and confidence intervals. Despite the availability of more recent and/or updated estimates of *ℛ*_{0}, we restrict ourselves to the estimates above in order to focus on the resolution of uncertainty in the earliest stages of an epidemic.

## 2 Methods

### 2.1 Description of the studies

We gathered information on estimates of *ℛ*_{0} and their assumptions about the underlying generation-interval distributions from 7 articles that were published online between January 23–26, 2020 (Table 1). Five studies (Liu et al., 2020; Majumder and Mandl, 2020; Read et al., 2020a; Riou and Althaus, 2020a; Zhao et al., 2020) were uploaded to pre-print servers (bioRxiv, medRxiv, and SSRN); one report was posted on the web site of Imperial College London (Imai et al., 2020); and one report was posted on nextstrain.org (Bedford et al., 2020). Their modeling approaches vary widely: a branching process model (Bedford et al., 2020; Imai et al., 2020; Riou and Althaus, 2020a), a deterministic Susceptible-Exposed-Infected-Recovered (SEIR) model (Read et al., 2020a), an exponential growth model (Zhao et al., 2020), a Poisson offspring distribution model (Liu et al., 2020), and the Incidence Decay and Exponential Adjustment (IDEA) model (Majumder and Mandl, 2020). Four studies estimated *ℛ*_{0} by directly fitting their models to incidence data (Read et al., 2020a; Zhao et al., 2020; Liu et al., 2020; Majumder and Mandl, 2020). The remaining three studies estimated *ℛ*_{0} by comparing the predicted number of cases from their models with the estimated number of total cases by January 18 (between 1,000 and 9,7000 (Imai et al., 2020)) Some of these studies have now been published in peer-reviewed journals (Riou and Althaus, 2020b; Zhao et al., 2020) or have been updated with better uncertainty quantification (Read et al., 2020b).

### 2.2 Gamma approximation framework for linking *r* and *R*_{0}

Early in an outbreak, *ℛ*_{0} is difficult to estimate directly; instead, *ℛ*_{0} is often inferred from the exponential growth rate *r*, which can be estimated reliably from incidence data (Ma et al., 2014). Given an estimate of the exponential growth rate *r* and an *intrinsic* generation- interval distribution *g*(*τ*) (Champredon and Dushoff, 2015), the basic reproductive number can be estimated via the Euler-Lotka equation (Wallinga and Lipsitch, 2007):

In other words, estimates of *ℛ*_{0} must depend on the assumptions about the exponential growth rate *r* and the shape of the generation-interval distribution *g*(*τ*).

Here, we use the gamma approximation framework (Park et al., 2019) to (i) characterize the amount of uncertainty present in the exponential growth rates and the shape of the generation-interval distribution and (ii) assess the degree to which these uncertainties affect the estimate of *ℛ*_{0}. Assuming that generation intervals follow a gamma distribution with the mean and the squared coefficient of variation *κ*, we have

This equation demonstrates that a generation-interval distribution that has a larger mean (higher ) or is less variable (lower *κ*) will give a higher estimate of *ℛ*_{0} for the same value of *r* (Wallinga and Lipsitch, 2007).

### 2.3 Statistical framework

As most studies do not report their estimates of the exponential growth rate, we first recalculate the exponential growth rate that correspond to their model assumptions. We do so by modeling reported distributions of the reproductive number *ℛ*_{0}, the mean generation interval, and the generation-interval dispersion parameter *κ* with appropriate probability distributions; we used gamma distributions to model values reported with confidence intervals and uniform distributions to model values reported with ranges. For example, Study 3 estimated *ℛ*_{0} = 2.92 (95% CI: 2.28–3.67); we model this estimate as a gamma distribution with a mean of 2.92 and a shape parameter of 67, which has a 95% probability of containing a value between 2.28 and 3.67 (see Table 2 for a complete description). For each study *i*, we construct a family of parameter sets by drawing 100,000 random samples from the probability distributions (Table 2) that represent the estimates of *ℛ*_{0i} and the assumed values of and *κ*_{i} and calculate the exponential growth rate *r*_{i} via the inverse of Eq. 2:

This allows us to approximate the probability distributions of the estimated exponential growth rates by each study; uncertainties in the probability distributions that we calculate for the estimated exponential growth rates will reflect the methods and assumptions that the studies rely on.

We construct pooled estimates for each parameter (*r*, , and *κ*) using a Bayesian multilevel modeling approach, which assumes that the parameters across different studies come from the same gamma distribution. The pooled estimates, which are represented as probability distributions rather than point estimates, allow us to average across different modeling approaches, while accounting for the uncertainties in the assumptions they make:
where *µ*_{r}, *µ*_{G}, *µ*_{κ} represent the pooled estimates, and *σ*_{r}, *σ*_{G}, and *σ*_{κ} represent between-study standard deviations. We account for uncertainties associated with *r*_{i}, and *κ*_{i} (and their correlations), by drawing a random set from the family of parameter sets for each study at each Metropolis-Hastings step. Since the gamma distribution does not allow zeros, we use *κ* = 0.02 instead for Study 7. We note that this approach does not account for nonindependence between the parameter estimates made by different modelers. As we add more models, the pooled estimates can become sharper even when the models no longer add more information. Thus, the pooled estimator should be interpreted with care.

We use weakly informative priors on hyperparameters:

We followed recommendations outlined in Gelman et al. (2006), parameterizing the top-level gamma distributions in terms of their means and standard deviations and imposing weakly informative prior distributions on between-study standard deviations, i.e., half-normal(0, 10). We had initially used gamma priors with small shape parameters (*<* 1) on between-study shape parameters (= *µ*^{2}*/σ*^{2}) but found this put too much prior probability on large between-study variances. This phenomenon is a known problem (Gelman et al., 2006). Alternative choices of prior for the between-study shape parameters are also suboptimal: imposing strong priors (e.g. half-*t*(*µ* = 0, *σ* = 1, *ν* = 4) assumes *a priori* that between-study variance is large, while weak priors (e.g. half-Cauchy(0,5)) can lead to poor mixing.

We run 4 independent Markov Chain Monte Carlo chains each consisting of 500,000 burnin steps and 500,000 sampling steps. Posterior samples are thinned every 1000 steps. Convergence is assessed by ensuring that the Gelman-Rubin statistic is below 1.01 for all hyperparameters (Gelman et al., 1992); trace plots and marginal posterior distribution plots are presented in Appendix. 95% confidence intervals are calculated by taking 2.5% and 97.5% quantiles from the marginal posterior distribution for each parameter.

## 3 Results

Fig. 1 compares the reported values of the exponential growth rate *r*, mean generation interval , and the generation-interval dispersion *κ* from different studies with the pooled estimates that we calculate from our multilevel model. We find that there is a large uncertainty associated with the underlying parameters; many models rely on stronger assumptions that ignore these uncertainties. Surprisingly, no studies take into account how the variation in generation intervals affects their estimates of *ℛ*_{0}: all studies assumed fixed values for *κ*, ranging from 0 to 1. Assuming fixed parameter values can lead to overly strong conclusions (Elderd et al., 2006).

Fig. 2 shows how propagating uncertainty in different combinations would affect estimates and CIs for *ℛ*_{0}. For illustrative purposes, we use our pooled estimates, which may represent a reasonable proxy for the state of knowledge as of January 23–26 (Fig. 2A). Comparing the models that include only some sources of uncertainty to the “all” model, we see that propagating error from the growth rate (which all but one of the studies reviewed did) is absolutely crucial: the middle bar (“GI mean”), which lacks growth-rate uncertainty, is relatively narrow. In this case, propagating error from the mean generation interval has negligible effect compared to propagating the uncertainty in *r*. Uncertainty in the generation-interval dispersion also has important effects as it determines the functional form of the relationship between *r* and *ℛ*_{0} (compare “growth rate + GI mean” with “all”). For example, reducing the dispersion parameter *κ* from 1 (assuming exponentially distributed generation intervals) to 0 (assuming fixed generation intervals) changes the *r*– *ℛ*_{0} relationship from linear to exponential, therefore increasing the sensitivity of *ℛ*_{0} estimates to *r* and .

As uncertainty associated with the exponential growth rate decreases, accounting for uncertainties in generation intervals becomes even more important (Fig. 2B). Propagating error only from the growth rate gives very narrow confidence intervals in this case. Likewise, propagating errors from the growth rate and the mean generation interval gives wider but still too narrow confidence intervals. We expect this hypothetical example to better reflect more recent scenarios, as increased data availability will allow researchers to estimate *r* with more certainty.

We also compare the estimates of *ℛ*_{0} across different studies by replacing their values of *r*, , and *κ* with our pooled estimates (*µ*_{r}, *µ*_{G}, and *µ*_{κ}, respectively) one at a time and recalculating the basic reproductive number *ℛ*_{0} (Fig. 3). This procedure allows us to assess the sensitivity of the estimates of *ℛ*_{0} across appropriate ranges of uncertainties. We find that incorporating uncertainties one at a time increases the width of the confidence intervals in all but 7 cases. We estimate narrower confidence intervals for Study 3, Study 6, and Study 7 when we account for proper uncertainties in the generation-interval dispersion because they assume a narrow generation-interval distribution (compare “base” with “GI variation”); when higher values of *κ* are used, their estimates of *ℛ*_{0} become less sensitive to the values of *r* and , giving narrower confidence intervals. We estimate narrower confidence intervals for Study 5 and Study 7 when we account for proper uncertainties in the mean generation interval (compare “base” with “GI mean”) because the range of uncertainty in the mean generation interval they consider is much wider than the pooled range (Fig. 1). Substituting the reported *r* or from Study 1 with our pooled estimates give narrower confidence intervals for similar reasons.

We find that accounting for uncertainties in the estimate of *r* has the largest effect on the estimates of *ℛ*_{0} in most cases (Fig. 3). For example, recalculating *ℛ*_{0} for Study 7 by using our pooled estimate of *r* gives *ℛ*_{0} = 3.9 (95% CI: 2.3–8.6), which is much wider than the uncertainty range they reported (2.0–3.1). There are two explanations for this result. First, even though the exponential growth rate *r* and the mean generation interval have identical mathematical effects on *ℛ*_{0} in our framework (Eq. 2 in Methods), *r* is more influential in this case because it is associated with more uncertainty (Fig. 1). Second, assuming a fixed generation interval (*κ* = 0) makes the estimate of *ℛ*_{0} too sensitive to *r* and . One exception is Study 1: we find this estimate of *ℛ*_{0} is most sensitive to generation-interval dispersion *κ*. This is because Study 1 assumes an exponentially distributed generation interval (*κ* = 1): estimates that rely on this assumption make *ℛ*_{0} relatively insensitive and thus tend to have particularly narrow confidence intervals.

Finally, we incorporate all uncertainties by using posterior samples for *µ*_{r}, *µ*_{G}, and *µ*_{κ} to recalculate *R*_{0} and compare it with the reported *ℛ*_{0} estimates. Our estimated *ℛ*_{0} from the pooled distribution has a median of 2.9 (95% CI: 2.1–4.5). While the point estimate of *ℛ*_{0} is similar to other reported values from this date range, the confidence intervals are wider than all but one study. This result does not imply that assumptions based on the pooled estimate are too weak; we believe that this confidence interval more accurately reflects the level of uncertainties present in the information that was available when these models were fitted. In fact, because the pooled estimate does not account for overlap in data sources used by the models, we feel that it is more likely to be over-confident than under-confident. Our median estimate averages over the various studies, and therefore particular studies have higher or lower median estimates. We note in particular that, while the baseline example we used from Study 6 may appear to be an outlier, the authors of this study also explore different scenarios involving changes in reporting rate over time, under which their estimates of *ℛ*_{0} are similar to other reported estimates. Here, our focus is on estimating uncertainty, not on identifying potential explanations for these discrepancies.

## 4 Discussion

Estimating the basic reproductive number *ℛ*_{0} is crucial for predicting the course of an outbreak and planning intervention strategies. Here, we use a gamma approximation (Park et al., 2019) to decompose *ℛ*_{0} estimates into three key quantities (*r*, , and *κ*) and apply a multilevel Bayesian framework to compare estimates of *ℛ*_{0} for the novel coronavirus outbreak. Our results demonstrate the importance of accounting for uncertainties associated with the underlying generation-interval distributions, including uncertainties in the amount of dispersion in the generation intervals. Our analysis of individual studies shows that many early estimates of *ℛ*_{0} rely on strong assumptions.

Of the seven studies that we reviewed, two of them directly fit their models to cumulative number of confirmed cases. This approach can be appealing because of its simplicity and apparent robustness, but fitting a model to cumulative incidence instead of raw incidence can both bias parameters and give overly narrow confidence intervals, if the resulting non-independent error structures are not taken into account (Ma et al., 2014; King et al., 2015). Naive fits to cumulative incidence data should therefore be avoided.

Many sources of noise affect real-world incidence data, including both dynamical, or “process”, noise (randomness that directly or indirectly affects disease transmission); and observation noise (randomness underlying how many of the true cases are reported). Disease modelers face the choice of incorporating one or both of these in their data-fitting and modeling steps. This is not always a serious problem, particularly if the goal is inferring parameters rather than directly making forecasts (Ma et al., 2014). Modelers should however be aware of the possibility that ignoring one kind of error can give overly narrow confidence intervals (King et al., 2015; Taylor et al., 2016).

There are other important phenomena not covered by our simple framework. Examples that seem relevant to this outbreak include: changing reporting rates, reporting delays (including the effects of weekends and holidays), and changing generation intervals. For emerging pathogens such as SARS-CoV-2, there may be an early period of time when the reporting rate is very low due to limited awareness or diagnostic resources; for example, Zhao et al. (2020) (Study 6) demonstrated that estimates of *ℛ*_{0} can change from 5.47 (95% CI: 4.16–7.10) to 3.30 (95% CI: 2.73–3.96) when they assume 2-fold changes in the reporting rate between January 17, when the official diagnostic guidelines were released (World Health Organization, 2020b), and January 20. Delays between key epidemiological timings (e.g., infection, symptom onset, and detection) can also shift the shape of an observed epidemic curve and, therefore, affect parameter estimates as well as predictions of the course of an outbreak (Tariq et al., 2019). Even though a constant delay between infection and detection may not affect the estimate of the growth rate, it can still affect the associated confidence intervals. Finally, generation intervals can become shorter throughout an epidemic as intervention strategies, such as quarantine, can reduce the infectious period (Hethcote et al., 2002). Accounting for these factors is crucial for making accurate inferences.

Here, we focused on the estimates of *ℛ*_{0} that were published within a very short time frame (January 23–26). During early phases of an outbreak, it is reasonable to assume that the epidemic grows exponentially (Anderson and May, 1991). However, as the number of susceptible individuals decreases, the epidemic will saturate, and estimates of *r* used for *ℛ*_{0} should account for the possibility that *r* is decreasing through time. Although our analysis only reflects a snapshot of a fast-moving epidemic, we expect certain lessons to hold: confidence intervals must combine different sources of uncertainty. In fact, as epidemics progress and more data becomes available, it is likely that inferences about exponential growth rate (and other epidemiological parameters) will become more precise; thus the risk of over-confidence when uncertainty about the generation-interval distribution is neglected will become greater.

We strongly emphasize the value of attention to accurate characterization of the transmission chains via contact tracing and better statistical frameworks for inferring generation-interval distributions from such data (Britton and Scalia Tomba, 2019). A combined effort between public-health workers and modelers in this direction will be crucial for predicting the course of an epidemic and controlling it. We also emphasize the value of transparency from modelers. Model estimates during an outbreak, even in pre-prints, should include code links and complete explanations. We suggest using methods based on open-source tools allow for maximal reproducibility.

In summary, we have provided a basis for comparing exponential-growth based estimates of *ℛ*_{0} and its associated uncertainty in terms of three components: the exponential growth rate, mean generation interval, and generation interval dispersion. We hope this framework will help researchers understand and reconcile disparate estimates of disease transmission early in an epidemic.

## Data Availability

R code is available in GitHub (https://github.com/parksw3/nCoV_framework).

## Funding

BMB and DJDE were supported by Natural Sciences and Engineering Research Council (NSERC). ML was supported by Canadian Institutes of Health Research (CIHR). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## Competing interests

We declare no competing interests.

## Contribution

SWP and JD developed the statistical framework. SWP reviewed the published literature. SWP performed the analysis. SWP, BMB, and JD created the figures. SWP and JD wrote the first draft. All authors contributed to the writing and approval of the final report.

## Data availability

R code is available in GitHub (https://github.com/parksw3/nCoV_framework).

## Acknowledgements

We thank Daihai He for providing helpful comments on the manuscript.