Estimates of outbreak-specific SARS-CoV-2 epidemiological parameters from genomic data

We estimate the basic reproductive number and case counts for 15 distinct SARS-CoV-2 outbreaks, distributed across 10 countries and one cruise ship, based solely on phylodynamic analyses of genomic data. Our results indicate that, prior to significant public health interventions, the reproductive numbers for a majority (10) of these outbreaks are similar, with median posterior estimates ranging between 1.4 and 2.8. These estimates provide a view which is complementary to that provided by those based on traditional line listing data. The genomic-based view is arguably less susceptible to biases resulting from differences in testing protocols, testing intensity, and import of cases into the community of interest. In the analyses reported here, the genomic data primarily provides information regarding which samples belong to a particular outbreak. We observe that once these outbreaks are identified, the sampling dates carry the majority of the information regarding the reproductive number. Finally, we provide genome-based estimates of the cumulative case counts for each outbreak, which allow us to speculate on the amount of unreported infections within the populations housing each outbreak. These results indicate that for the majority (7) of the populations studied, the number of recorded cases is much bigger than the estimated cumulative case counts, suggesting the presence of unsequenced pathogen diversity in these populations.

of the reproductive number is updated daily using the latest 23 confirmation, hospitalization, death, and excess death data 24 with a focus on European countries (9) leading to similar 25 results. 26 Despite the wide-spread application of such methods, the 27 estimates produced by line list data alone are inherently sus-28 ceptible to several biases and limitations (10-12). Firstly, the 29 presence of pools of undiagnosed infected individuals, together 30 with changes in testing methods and the extent to which test-31 ing is happening at all, can lead to misleading characterizations 32 of the epidemic. Secondly, it is often impossible to discriminate 33 between import cases and those attributable to local transmis-34 sion based on line list data. This has the potential to produce

Significance Statement
Since the beginning of the COVID-19 outbreak in late 2019, researchers around the globe have sought to estimate the rate at which the disease spread through populations prior to public health intervention, as quantified by the parameter R0. This is often estimated based on case count data and may be biased due to the presence of import cases. To overcome this, we estimate R0 by applying Bayesian phylodynamic methods to SARS-CoV-2 genomes which have been made available by laboratories worldwide. We provide R0 and absolute infection count estimates for 15 distinct outbreaks. These estimates contribute to our understanding of the baseline transmission dynamics of the disease, which will be critical in guiding future public health responses to the pandemic. Birth-death phylodynamic results are dependent not only 120 on the genomic data, but also on the distribution of sample 121 collection dates. In fact, we find that in this instance, the 122 sample collection dates carry most of the information regarding 123 R0. We demonstrated this by running an additional set of 124 phylodynamic analyses in which the genomic sequences were 125 treated as unknown. Additionally, we applied both a simplistic 126 linear regression approach (see Methods) and an established 127 traditional approach (12) to the cumulative sequence counts. 128 The results of these alternative analyses are summarized in 129 figure S5 and-in many cases-show relatively close agreement, 130 albeit with slightly less certainty in the estimates than those 131 shown in figure 1.

132
Given this dominating effect of the sampling times, it is 133 natural to consider how sensitive our results are to the as-134 sumption that the sampling rate and reproductive number 135 are fixed over the time period of each outbreak. We thus 136 performed a separate set of analyses in which these quantities 137 were allowed to change at a point at the center of the sampling 138 window of each outbreak (excluding the Diamond Princess 139 outbreak). The resulting R0 estimates, presented in figure S8, 140 show no major change in the results compared with those in 141 figure 1, with the exception of the Netherland (1) and WA 142 States (1) outbreaks which suggest higher R0 values. In order 143 to investigate how much our results are impacted by the prior, 144 we repeated the fixed-rate analyses with a broader prior on 145 R0. This broad prior did not qualitatively change the results 146 compared to our main analysis (figure S9).  (23), which are offset by 10 days to account for the delay between infection and case confirmation (9). Note that these inferences concern only those cases associated with the specific outbreak from which the sequence data are drawn, as detailed in the discussion section. The true cumulative case counts may have been much higher. (Inference for remaining outbreaks are shown in figure S6.)

Discussion.
Our central result is that prior to strong public  Fig. 3. Estimates of cumulative case counts obtained from phylodynamic analyses, with diamonds indicating recorded counts obtained from Dong, Du and Gardner (23), offset by 10 days to account for the delay between infection and case confirmation (9). The counts are for the date of the final genome sample considered in each population. We note that we have likely analyzed only a subset of the total number of outbreaks which were circulating in each country.
short sampling windows involved. The sampling model used 178 for these outbreaks assumes that samples accumulate at a rate 179 proportional to the number of infectious cases for the duration 180 over which samples are available. We showed that allowing 181 for a single shift in sampling rate and R0 during the outbreak 182 did not result in much lower R0 values for these remaining 183 outbreaks.

184
Another potential source of upward bias on R0 is the process 185 of outbreak selection. We necessarily restrict our attention to 186 outbreaks for which sufficient data exist to provide statistical 187 signal. This restriction may have the effect of selecting for 188 steeper outbreak trajectories. Since the birth-death models 189 under which we perform the inference do not account for this 190 conditioning, these steeper trajectories will be interpreted as 191 evidence for larger R0, even when the increased gradient is 192 simply the result of demographic noise in the growth of the 193 epidemic. Including appropriate conditioning in phylodynamic 194 inference to guard against this kind of bias will be the focus 195 of future research.

196
Given that most of information content in the genome 197 sequence data analyzed seems to come from sampling times, it 198 is natural to wonder whether the phylodynamic approach offers 199 additional insights on these outbreaks. The obvious answer 200 to this challenge is that, genomic data allowed us to identify 201 outbreaks in the absence of contact tracing data, which is 202 often not available for study. Furthermore, even though the 203 impact of the phylogeny within each identified outbreak on the 204 inferred epidemic parameters was negligible, the application 205 of phylodynamic methods yield information about the total 206 number of cases through time, including those which have 207 gone undetected. 208 We emphasize, however, that extreme care must be taken 209 when interpreting both these inferred and clinically confirmed 210 case counts as representative of the true underlying case load. 211 Firstly, our inferences correspond to the number of cases 212 associated only with the specific outbreaks from which the 213 genomic data originate. It is entirely possible that additional 214 outbreaks, from which we do not have genetic data, occurred 215 within a given population during the time periods considered. 216 tive. Whenever we were able to access this information we used 279 it to exclude non-randomly sampled sequences, but in many cases 280 the relevant information was either not collected or not readily 281 accessible.

282
Alignment and outbreak detection. After these filtering steps, we 283 aligned the remaining sequences to a reference genome generated 284 from an early COVID-19 patient in Wuhan (GenBank accession 285 number MN908947) (27). SNPs in the first 130 sites, last 50 sites, 286 and at sites 18529, 29849, 29851, and 29853 were masked from the 287 alignment because they are likely sequencing artifacts (26). 288 We built a maximum-likelihood phylogenetic tree using this 289 alignment. We then picked clades from this tree where sufficient 290 (≥ 9) samples from the same region clustered together. We assume 291 that these clusters represent primarily within-country transmission 292 events rather than introductions from abroad.

293
Exceptionally for the Itay, Iran, and China outbreaks we addi-294 tionally identified samples from cases that were presumably exposed 295 to the virus in these regions but were sampled abroad (travel cases). 296 The data set for Italy included sequences from both non-travel and 297 travel cases, while those for China and Iran were composed exclu-298 sively of sequences from travel cases. This exposure information 299 comes from metadata available on GISAID and Nexstrain, as well as 300 information provided by sequencing centers and in media accounts. 301 Sample set truncation. To limit sampling to the early, exponential 302 growth phase of each regional outbreak, we truncated sampling 303 based on the dates of major public health interventions (Table S1). 304 We retained only samples collected before or on the date of theses 305 public health interventions, with the exception of the Iran, Iceland, 306 and Spain outbreaks. For these outbreaks, we extended the time 307 cutoff so that the sample size was not prohibitively small. (The 308 extension for Iran was 11 days, for Iceland it was 2 days, and the 309 cutoff for Spain was extended by 1 day, as shown in Table S1.) Since 310 the transmission events leading to sampled cases likely happened at 311 least a few days before sampling, these cutoffs should, for the most 312 part, be conservative. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 14, 2020. . https://doi.org/10.1101/2020.09.12.20193284 doi: medRxiv preprint we ran a separate analysis using a Unif(0,10) prior for each R