## Abstract

New and more transmissible variants of SARS-CoV-2 have arisen multiple times over the course of the pandemic. Rapidly identifying mutations that affect transmission could facilitate outbreak control efforts and highlight new variants that warrant further study. Here we develop an analytical epidemiological model that infers the transmission effects of mutations from genomic surveillance data. Applying our model to SARS-CoV-2 data across many regions, we find multiple mutations that strongly affect the transmission rate, both within and outside the Spike protein. We also quantify the effects of travel and competition between different lineages on the inferred transmission effects of mutations. Importantly, our model detects lineages with increased transmission as they arise. We infer significant transmission advantages for the Alpha and Delta variants within a week of their appearances in regional data, when their regional frequencies were only around 1%. Our model thus enables the rapid identification of variants and mutations that affect transmission from genomic surveillance data.

Viruses can acquire mutations that affect how efficiently they infect new hosts, for example by increasing viral load or escaping host immunity^{1–4}. The ability to rapidly identify mutations that increase transmission could inform outbreak control efforts and identify potential immune escape variants^{5–9}. However, estimating how individual mutations affect viral transmission is a challenging problem.

Current methods to estimate changes in viral transmission generally rely on phylogenetic analyses or fitting changes in variant frequencies to a simple growth model^{5,10–12}. Phylogenetic analyses for viruses can be challenging due to a high degree of sequence similarity, which implies that the data can be explained equally well by a number of different trees^{13}. Phylogenetic analyses also typically rely on extensive Markov chain Monte Carlo sampling that becomes intractable for very large data sets. Simple growth models can estimate the difference in transmissibility between one variant and others circulating in the same region. However, they typically do not systematically account for competition between multiple variants, and their estimates may be difficult to compare for variants that arose in other regions or with different genetic backgrounds. These approaches also do not consider travel of infected individuals, nor do they account for superspreading —where a small number of infected individuals cause the majority of secondary infections —which has been observed for viruses like SARS-CoV and SARS-CoV-2^{14,15}.

To overcome these challenges, we developed a method to infer the effects of single nucleotide variants (SNVs) on viral transmission from genomic surveillance data that accounts for competition between viral lineages, travel, superspreading, and outbreaks in different locations. For clarity, we refer to non-reference nucleotides (including deletions or insertions) as SNVs and viral lineages possessing common sets of SNVs as variants. Simulations show that our approach can reliably estimate transmission effects of SNVs even from limited data. We applied our method to more than 1.6 million SARS-CoV-2 sequences from 87 geographical regions to reveal the effects of mutations on viral transmission throughout the pandemic. While the vast majority of SARS-CoV-2 mutations have negligible effects, we readily observe increased transmission for sets of SNVs in Spike and other hotspots throughout the genome. We further quantified the influence of travel and competition between multiple variants, using Nextstrain lineage 20E (EU1) as an example case. We found that realistic rates of travel during the pandemic would only slightly affect estimated changes in viral transmission. However, competition between variants has significant effects that must be accounted for in order to accurately estimate changes in transmission.

Importantly, our approach is sensitive enough to identify variants with increased transmission before they reach high frequencies. We demonstrate our capacity for early detection by studying the rise of the Alpha and Delta variants using data from the UK. In both cases, we reliably infer increased transmission for these variants within a week of their emergence, when their frequency in the region was only around 1%. Collectively, these data show that our model can be applied for the surveillance of evolving pathogens to robustly identify variants with transmission advantages and to highlight key mutations that may be driving changes in transmission.

## Results

### Epidemiological Model

To quantify the effects of mutations on viral transmission, we developed a stochastic branching process model of infection based on the well-known susceptible-infected-recovered (SIR) model. Our model incorporates superspreading by drawing the number of secondary infections caused by an infected individual from a negative binomial distribution with mean *R*, referred to as the effective reproduction number, and dispersion parameter *k* (refs.^{14,15}). Multiple variants with different transmission rates are included by assigning a variant *a* an effective reproduction number *R*_{a} = *R*(1 +*w*_{a}). Under an additive model, the net increase or decrease in transmission for a variant is the sum of the individual transmission effects *s*_{i} for each SNV *i* that the variant contains. In analogy with population genetics, we refer to the *w*_{a} and *s*_{i} as selection coefficients. In addition to superspreading and multiple variants, our model also incorporates travel of infected individuals into or out of a localized outbreak (Methods).

We can then apply Bayesian inference to estimate the transmission effects of SNVs that best explain the observed evolutionary history of an outbreak. To simplify our analysis, we use a path integral technique from statistical physics, recently applied in the context of population genetics^{16}, to efficiently quantify the probability of the model parameters given the data (for details, see Supplementary Information). This allows us to derive an analytical estimate for the maximum *a posteriori* selection coefficients ** ŝ** for a given set of viral genomic surveillance data,
Here Δ

**is the change in the SNV frequency vector over time,**

*x**γ*specifies the width of a Gaussian prior probability distribution for the selection coefficients

*s*

_{i}, and

*I*is the identity matrix.

*C*

_{int}is the covariance matrix of SNV frequencies integrated over time, and accounts for competition between variants as well as the speed of growth for different viral lineages (Supplementary Information).

*τ*_{int}accounts for travel and is given explicitly in equation (S2). Data from multiple outbreaks can be combined by summing contributions to the integrated covariance, frequency change, and travel terms from each individual trajectory.

### Validation in simulations

To test our ability to reliably infer selection, we analyzed simulation data using a wide range of parameters. We found that inference is accurate even without abundant data, especially when we combine information from multiple outbreaks (**Fig. 1, Supplementary Fig. 1**). Because we model the evolution of relative frequencies of different variants, accurate inference of selection does not require the knowledge of difficult-to-estimate parameters such as the current number of infected individuals or the effective reproduction number (Methods). Selection can be accurately inferred not only when the population evolves according to the superspreading model, but also if it evolves according to a classical multi-variant SIR model (**Supplementary Figs. 2-3**). This indicates that selection can be accurately estimated for a broad range of epidemiological dynamics.

### Global patterns of selection in SARS-CoV-2

We studied the evolutionary history of SARS-CoV-2 using genomic data from GISAID^{17} as of August 6th, 2021. We separated data by region and estimated selection coefficients jointly over all regions (Methods). After filtering regions with low or infrequent coverage, our analysis included more than 1.6 million SARS-CoV-2 sequences from 87 different regions, containing 13,189 nonsynonymous SNVs observed at nontrivial frequencies.

Our analysis revealed that, while the great majority of SNVs were nearly neutral, a few dramatically increased viral transmission (**Fig. 2a, Table 1**). We observe clusters of SNVs with strong effects on transmission along the SARS-CoV-2 genome (**Fig. 2b**). The highest density of SNVs that increase transmission is in Spike, especially in the S1 subunit. Of the top 20 mutations that we infer to be most strongly selected, 10 are in Spike (**Table 1**). However, clusters of SNVs with a strong selective advantage are also found in other proteins, especially in N, M, ORF3a, and NSP6.

### Mutations inferred to substantially increase transmission

The top 50 mutations inferred to increase SARS-CoV-2 transmission the most are given in **Table 1**. Experimental evidence exists to directly or indirectly support many of these inferences. For clarity, we will reference mutations at the amino acid level rather than the underlying SNVs, which are also given in **Table 1**. Spike mutations L452R and P681R/H comprise three of the top four mutations, and all have demonstrated functional effects that could increase transmission^{4,18–21}. Similarly, Spike mutations such as E484K (*ŝ* = 5.2%, ranked 9th) and S477N (*ŝ* = 4.4%, ranked 15th) appear prominently in our analysis. E484K has been shown to increase resistance to antibodies^{22}, and both E484K and S477N were rapidly selected for increased ACE2 receptor binding during *in vitro* evolution^{23}. Four Spike N-terminal domain (NTD) mutations/deletions (L18F, Δ141-142, and D253G) are also strongly selected. These lie in the antigenic supersite where mutations have been shown to decrease the neutralization potency of NTD-specific monoclonal antibodies^{24}. Spike mutations D614G (*ŝ* = 2.9%, ranked 52nd) and N501Y (*ŝ* = 2.9%, ranked 58th) fall just outside the top 50 mutations in **Table 1**. D614G has been shown to increase binding affinity to the ACE2 receptor, thus increasing viral load and likely contributing to increased transmission^{7,25}. Similarly, there is evidence that N501Y increases ACE2 binding affinity as well as transmission of infection^{26}.

Research on viral transmission has naturally focused on Spike because of its role in viral entry and as a target of neutralizing antibodies. However, our analysis also reveals strongly selected mutations outside of Spike. These include the Nucleocapsid mutations D3L and R203M. D3L (*ŝ*= 6%, ranked 5th) has been reported to increase production of a non-canonical subgenomic RNA that encodes for ORF9b (ref.^{27}), an interferon suppressing gene that can aid innate immune evasion and thereby increase transmission^{28}. R203M (*ŝ* = 4.6%, ranked 13th), present in the linker region of Nucleocapsid, has been shown to enhance viral RNA replication, delivery, and packaging, which may increase transmission^{29}. The Nucleocapsid mutation T205I (*ŝ* = 3.3%, ranked 40th) improves RNA delivery and expression to a lesser degree^{29}. A Nucleocapsid mutation S202N (*ŝ* = 3.3, ranked 39th) is also inferred to be strongly selected. While the effect of this specific mutation is unknown, a mutation to Arginine at the same residue (S202R, which we infer to be more moderately beneficial, *ŝ* = 1.7%) has been reported to increase replication, RNA delivery, and packaging^{29}. Other strongly selected mutations outside of Spike include the Membrane I82T mutation (*ŝ*= 7.5%, ranked 2nd) and NSP6 deletions Δ106-108 (*ŝ* = 5.4%, 4.2%, and 3.1%, ranked 7th, 23rd, and 43rd). Similar examples may provide good targets for future studies of the functional effects of non-Spike mutations.

### Estimates of selection for major SARS-CoV-2 variants

We estimated the net increase in viral transmission relative to the Wuhan-Hu-1 reference sequence for well-known SARS-CoV-2 variants by adding contributions from the individual variant-defining SNVs (**Fig. 3**, see Methods). Because our model uses global data and infers the transmission effects of individual SNVs, variants can be compared to one another directly even if they arose on different genetic backgrounds, or if they appeared in different regions or at different times. This also allows us to infer substantially increased transmission for variants such as Gamma, Beta, Lambda, and Epsilon, which never achieved the level of global dominance exhibited by Alpha and Delta (**Fig. 3**). For reference, we show the selection coefficient inferred for the cluster of mutations including the Spike mutation D614G that fixed early in the pandemic.

Our findings are consistent with past estimates that have shown a substantial transmission advantage for Alpha and Delta relative to other lineages^{30–32}. However, past estimates have varied substantially depending on the data source and method of inference. For example, in different analyses Delta has been inferred to have an advantage of between 34% and 97% relative to other lineages^{30–32}. Similarly, Alpha has been estimated to increase transmission by 29% to 90% relative to pre-existing lineages in different regions^{5,11,12,31,33}. One advantage of our approach is that it can infer selection coefficients that best explain the growth or decline of variants across many regions, allowing selection for different variants to be compared on even footing.

In November 2021, a new variant, Omicron, was detected in South Africa. The data that we consider only extends to August 6th, 2021, and thus Omicron is not present in our data set. However, because this variant bears SNVs observed in other lineages, we can provide a preliminary estimate of its transmission advantage. Even without considering the effects of 17 Omicron SNVs (out of 96 total) that were not previously observed in data, the total selection coefficient for Omicron is *ŵ* = 55.2%, which surpasses Alpha. While more data will be necessary to fully assess the transmission advantage of this variant, our model suggests that Omicron is highly transmissible, supporting its designation as a variant of concern (VOC) by WHO^{34}.

### Effects of travel and competition on inferred coefficients

In addition to differences in transmissibility, variant frequencies are affected by the travel of infected individuals between regions and competition between variants. Multiple introductions of a new variant into a region can increase its local frequency even if the variant has no transmission advantage^{35,36}. This could, for example, make a neutral variant appear beneficial. Conversely, variants that transmit more effectively than ancestral SARS-CoV-2 can still be outcompeted by other, more transmissible variants, causing them to decline. In population genetics, this is referred to as clonal interference^{37,38}.

We studied the history of Nextstrain lineage 20E (EU1) as an example to investigate the influence of travel and inter-lineage competition on inferred changes in transmission. A detailed analysis showed how 20E (EU1) spread from Spain to other regions in Europe^{35}. There it was estimated that 20E (EU1) was introduced into the UK roughly 380 times during the summer of 2020. Assuming no travel between the UK and other regions, we infer a total selection coefficient for novel 20E (EU1) SNVs of 15.6% using UK data gathered from the beginning of the pandemic through May 1st, 2021, when 20E (EU1) had died out locally. Including 380 importations into the UK during the summer of 2020, our inferred selection coefficient is only slightly reduced to 15.4% (Methods). Around 37,000 importations would be necessary during this time for 20E (EU1) to be inferred to be completely neutral. Thus, while travel was crucial for the early spread of 20E (EU1) in the UK and across Europe, its subsequent growth ultimately dominates estimates of its effects on transmission.

Competition between Alpha and 20E (EU1), which was rising in the UK before the emergence of Alpha, provides a clear example of clonal interference in SARS-CoV-2. Interference can be readily observed in their frequency trajectories in the UK (**Fig. 4a**). Changes in frequency for the two variants have a Pearson correlation of −0.73.

To measure the influence of competition on estimates of transmission, we inferred selection coefficients for variants circulating in the UK with all nucleotide changes present in the Alpha variant reverted to the Wuhan-Hu-1 reference sequence. This provides an estimate for the 20E (EU1) selection coefficient that ignores competition with Alpha. Using all of the data, 20E (EU1) is inferred to have a significant transmission advantage relative to the previously dominant variant B.1 (*ŵ*_{20E(EU1)} − *ŵ*_{B.1} = 10.2%). However, when competition with Alpha is ignored, 20E (EU1) is inferred to be mildly deleterious (*ŵ*_{20E(EU1)} − *ŵ*_{B.1} = − 1.7%). Thus, ignoring competition between variants would lead to a dramatic and likely incorrect change in inferred selection.

The re-emergence of the Spike mutation A222V, which appeared within the 20E (EU1) variant, further supports a transmission advantage. Using global data, we infer A222V to significantly increase transmission (*ŝ* = 3.8%, ranked 29th, see **Table 1**). In our analysis, A222V contributes far more than any other mutation to the increased transmission of 20E (EU1) relative to contemporary variants. Focusing on data from the UK alone, the inferred benefit of A222V remains roughly constant even as 20E (EU1) is outcompeted by Alpha (**Fig. 4b**). Later, A222V arises on other sequence backgrounds and steadily increases in frequency (**Fig. 4c**), consistent with the increase in transmission inferred by our model.

Our analysis therefore suggests that 20E (EU1) possessed a modest but real transmission advantage relative to contemporary SARS-CoV-2 sequences. This finding is consistent with analysis in ref.^{35}, where a model of 20E (EU1) spread due to travel alone underestimated the observed frequency of the variant by 1-to 12-fold in all regions. More generally, our analysis suggests that very large inferred selection coefficients for variants or SNVs in our model are unlikely to be explained by travel alone.

### Rapid detection of variants with enhanced transmission

Rapidly identifying variants with increased transmission is important to inform public health efforts to limit viral spread. However, the inherent stochasticity of infection and of genomic surveillance data collection makes accurate inferences difficult. For example, neutral or modestly deleterious variants may initially appear to be beneficial due to a transient rise in frequency despite having no selective advantage.

To quantify how fluctuations affect estimates of selection for neutral variants, we first identified all variants (including both SNVs and collections of SNVs that are strongly linked to one another) that are inferred to have selection coefficients with magnitude less than 1% using all of the data. We then calculated the selection coefficients that would have been inferred for the SNVs or variants at all earlier time points and in all regions after they were first observed in the data. This “null” distribution (**Fig. 5a**) quantifies fluctuations in inferred selection coefficients for nearly-neutral variants due to stochasticity in viral spread and sampling. Variants with selection coefficients larger than any in the null distribution could then be expected with high confidence to have some transmission advantage.

To assess our ability to rapidly detect variants with a transmission advantage, we studied the rise of the Alpha variant in the London area. Using the above criterion, we identify Alpha as very likely to increase transmission using sequence data that was collected on or before November 9th, 2020 (**Fig. 5b**). This is roughly three weeks before Public Health England labeled Alpha as a variant of interest (VOI)^{39}, and more than a month before it was classified as a VOC^{40}. At this time the frequency of Alpha in London was around 1%.

A similar analysis also shows that our model rapidly infers increased transmission for the Delta variant. Using data from the UK, we reliably infer Delta to increase transmission by March 31st, 2021 (**Fig. 5c**). Delta was classified as a VOI on April 4th, 2021 and as a VOC more than one month later on May 6th, 2021^{41}. At the time that we detected increased transmission for Delta, its frequency was still low (< 1%) in the UK. Collectively, these results demonstrate our ability to rapidly identify variants with higher transmission, even when they represent a small fraction of all infections in a region.

## Discussion

Quantifying the effects of mutations on viral transmission is an important but challenging problem. To overcome limitations of current methods, we developed a flexible, SIR-based epidemiological model that provides analytical estimates for the transmission effects of SNVs from genomic surveillance data. Applying our model to SARS-CoV-2 data, we identified SNVs that substantially increase viral transmission, including both experimentally-validated Spike mutations and other, less-studied mutations that may be promising targets for future investigation. We further explored the effects of travel and competition between variants on inferred changes in transmission, using the history of 20E (EU1) as an example. Importantly, we found that our model is sensitive enough to detect substantial transmission advantages for variants such as Alpha and Delta even when they comprised only a small fraction of the total number of infections in a region, thus providing an “early warning” for more transmissible variants.

Further monitoring will be important to identify and characterize new variants as they arise. The Omicron variant that was recently detected in South Africa provides one such example. While the data in our study only extends to August 6th, 2021, we would estimate a selection coefficient of 55.2% for Omicron based on the mutations that it shares with previous variants alone.

While our study has focused on SARS-CoV-2, the epidemiological model that we have developed is very general. The same methodology could be applied to study the transmission of other pathogens such as influenza. Combined with thorough genomic surveillance data, our model provides a powerful method for rapidly identifying more transmissible viral lineages and quantifying the contributions of individual mutations to changes in transmission.

## Data Availability

Sequence data and metadata used in this study are available from GISAID. Processed data and code are available in the linked GitHub repository.

## AUTHOR CONTRIBUTIONS

All authors contributed to methods development, data analysis, interpretation of results, and writing the paper. B.L. and J.P.B. led theoretical analyses. M.S.S. led SIR model simulations. B.L. led branching process simulations. J.P.B. conceptualized and supervised the project.

## Methods

### Epidemiological model

We use a discrete time branching process to model the spread of infection. Individuals can be infected by any one of *M* viral variants, which are represented by genetic sequences ** g** = {

*g*

_{1},

*g*

_{2}, …,

*g*

_{L}} of length

*L*. For simplicity, we will first assume that alleles at each site

*i*in the genetic sequence for variant

*a*are either equal to the “wild-type” or reference or mutants . Later we will relax this assumption to consider genetic sequences with 5 possible states at each site (4 nucleotides or a gap). We call

*n*

_{a}(

*t*) the number of individuals infected by variant

*a*at time

*t*. To account for super-spreading, the number of newly infected individuals at time

*t*+ 1 follows a negative binomial distribution

^{42},

*P*(

*n*

_{a}(

*t*+ 1)|

*n*

_{a}(

*t*),

*k, R*

_{a}) =

*P*

_{NB}(

*r, p*), where

*r*=

*n*

_{a}

*k, p*=

*k/*(

*k*+

*R*

_{a}), and

*R*

_{a}=

*R*(1 +

*w*

_{a}). Here

*r*and

*p*are the negative binomial distribution parameters,

*k*is the dispersion,

*R*is the average effective reproductive number, and

*w*

_{a}encodes the variant dependence of the infectivity. Here

*R*is also an implicit function of time, representing change in the number of susceptible and recovered individuals as well as the effects of public health interventions or changes in behavior that modify viral transmission.

To incorporate travel, *n*_{a} is the sum of the locally infected population and the net flux of infected individuals into the region, *n*_{a} = *n*_{a,local} + *n*_{a,inflow} − *n*_{a,outflow} ≡ *n*_{a,local} + *δn*_{a}. Defining the frequency of variant *a* as *y*_{a} = *n*_{a}*/Σ*_{b} *n*_{b}, the probability that the frequency vector is ** y**(

*t*+ 1) =

*{y*

_{1}(

*t*+ 1),

*y*

_{2}(

*t*+ 1),…

*}*given the initial frequency vector

**(0), is**

*y*### Derivation of the estimator

Because (2) is difficult to work with directly, we introduce a “diffusion approximation” where we assume that the total number of infected individuals is large and the effects of mutations on transmission are small. Similar approximations have been widely used in population genetics^{43–45}. Under these assumptions, the probability distribution for the variant frequencies satisfies a Fokker-Planck equation with terms derived from the first and second moments of the frequency changes *y*_{a}(*t* + 1) − *y*_{a}(*t*) under the negative binomial distributions above.

However, the genotype space is high-dimensional (dimension 2^{L}, with either a mutant or wild-type allele at each site) and undersampled, making inference of selection for genotypes extremely challenging. To simplify the inference problem, we assume that selection is additive, so the total selection coefficient *w*_{a} for a variant *a* is the sum of selection coefficients *s*_{i} for mutant alleles at each site *i*:
We can then derive a Fokker-Planck expression for the dynamics of mutant allele frequencies
At the allele level, the Fokker-Planck equation has a drift vector given by
and a diffusion matrix
where *x*_{ij} is the frequency of infected individuals that have mutant alleles at both site *i* and site *j* at time *t*, and *δx*_{i} is the change in the frequency due to individuals traveling. Here is the total number of individuals infected by all variants. If *l*_{i} = *l*_{i,inflow} − *l*_{i,outflow} is the number of people traveling into a region minus that traveling out of it who are infected with a variant that has a mutant allele at site *i*, then *δx*_{i} = *l*_{i}*/n*.

The Fokker-Planck equation can then be used to derive a path integral, which expresses the probability of an entire evolutionary history or “path” (i.e., frequencies of genetic variants over time, . The path integral is
The path integral quantifies the probability density for paths of mutant allele frequencies in the evolutionary history of the pathogen. We can then use Bayesian inference to find the maximum *a posteriori* estimate for the selection coefficients given the frequencies, the infected population size, the parameters *R* and *k*, and the composition of the population traveling into or out of the regional outbreak. The posterior probability of the selection coefficients is
where is the probability of a path given by (4) and the *P*_{Prior}(** s**) is a Gaussian prior probability for the selection coefficients with zero mean and covariance matrix

*σ*

^{2}

*I*. Here,

*I*is the identity matrix and

*σ*

^{2}is the variance of the prior. The selection coefficients that maximize (5) are where the parameters

*k, R*, and

*n*are implicitly functions of

*t*.

There are two interesting limiting forms of the estimator. First, we define the new matrix whose entries are
In the limit that *k* → ∞, the negative binomial distribution for new infections becomes a Poisson distribution with rate *λ* = *R*. In this special case, the model is equivalent to the Wright-Fisher model from population genetics. The estimator reduces to
where we have dropped the migration term for simplicity.

The opposite limit *k* → 0 corresponds to a distribution for new infections with extremely heavy tails, i.e., one where super-spreading is dominant. In this case the drift in (3), which quantifies expected frequency changes due to selection and travel, is unchanged. However, the diffusion matrix, which encodes linkage as well as the changes in frequency that are due to the stochastic nature of infection transmission, diverges. In this case, diffusion dominates the process entirely.

### Simplifying the estimator and robustness to incomplete knowledge of time-varying parameters

In practice, parameters appearing in (6), such as the infected population size *n*, the dispersion *k*, and the mean reproductive number *R*, are likely time-varying. While such time dependencies are accommodated by our model, they can be challenging to reliably estimate from data. However, we generally do not require full knowledge of these time-dependent parameters to accurately estimate selection.

In fact, due to finite sampling noise, estimates of selection produced by assuming constant (and incorrect) parameters are more accurate than estimates that use the true time-varying parameters (**Supplementary Fig. 4**). The naive estimator in (6) implies that time points or regions with larger *R, n*, or *k* should be weighted more heavily in the estimate. However, frequency information is always inaccurate due to noise from finite sampling, so weighing some time points or regions significantly more than others based upon the parameters alone means that undue weight is given to the uncertain information available from these times and regions.

For this reason, we assume parameters that are spatially and temporally constant in all of the following analysis, except when considering travel, as discussed below. This allows the estimator to be simplified substantially. If we assume constant parameters and scale the regularization *γ* by the prefactor in the numerator in (6), the parameter dependence in the numerator and the denominator is almost the same and largely cancels out. With the same definition of the matrix as above, and additionally defining , the simplified estimator is given by
This form of the estimator has significant advantages over (6). The most important is that, if the travel term is dropped, the difficult-to-estimate parameters *R, k*, and *n* are no longer required. For methods of inferring these parameters as well as discussions about the difficulty of inferring them, see refs.^{46–55}.

### Extension to multiple regions and multiple SNVs at each site

The model can easily account for outbreaks in multiple regions or outbreaks at different times. If the probability of the evolutionary path in each region is independent, which is the case if there is no travel between regions, then the probability of all of the evolutionary paths in all of the regions is simply the product of the probabilities of the paths in each region, given by (4). Bayesian inference can be applied in the same way as before, resulting in the estimator
where *Q* is the number of regions, *t*_{r} is the time in region *r, T*_{r} is the final time in region *r, t*_{r,0} is the initial time in region *r*, *x*_{r} is the frequency in region *r*, and is the scaled integrated covariance matrix in region *r* given by integrating (7) over time. The estimator can further be extended to allow for multiple different nucleotides at each site by simply letting each different nucleotide have its own entry in the frequency vector *x*_{i}. If there are *J* mutations at each site this results in a frequency vector of length *LJ*, and a covariance matrix of size *LJ* × *LJ*. By convention, reference sequence alleles have selection coefficients of zero, so the mutant allele selection coefficients at each site are normalized by subtracting the inferred coefficient for the reference allele.

### Branching process simulations

We implemented the superspreading branching process for the number of infected individuals in Python. We used a negative binomial distribution for the number of secondary infections caused by a group of individuals infected with the same pathogen variant. To test how finite sampling affects model estimates, we sampled *n*_{s} genomes per time point to use for analysis. We computed the single and double mutant frequencies, *x*_{i} and *x*_{ij}, respectively, from the sampled sequences and estimated the selection coefficients from these using (1), possibly extended to account for multiple outbreaks or multiple alleles at each locus as described above.

### Susceptible-infected-recovered simulations

We simulated a multi-variant Susceptible-Infected-Recovered (SIR) model with *M* variants of a pathogen circulating in a population of *N* individuals, assuming total cross-immunity between variants. Mathematically, the SIR dynamics are expressed as
for *a* = 1, …, *M*, where *I*_{a} is the number of individuals infected by variant *a, β*_{a} and *r*_{a} are the transmission and recovery rate associated with the *a*th variant, and *S* and *R* are the total number of susceptible and recovered individuals, respectively. Each variant is represented by a binary sequence of length *L*, with 0 representing the wild-type (WT) allele and 1 representing the mutant allele. Considering the all-zero sequence as the reference with transmission rate *β*_{ref}, the transmission rate of variant *a* can be expressed as , with and *s*_{i} defined as above.

We further incorporate the effect of public health interventions and changing human behaviour on transmission by making the transmission rate a function of time, i.e., *β*_{ref}(*t*). As the number of susceptible individuals decreases, the effective transmission rate will decrease. The effective reproductive number of the *a*th variant at time *t* is
We used MATLAB to simulate the SIR model under a scenario where the number of newly-infected individuals continues to increase and then remains fixed (**Supplementary Fig. 2**), and a scenario where we fix *r*_{a} = 1 and adapt the transmission rate over time such that the system follows the typical SIR dynamics (**Supplementary Fig. 3**). In the SIR model there is no superspreading, which corresponds to the limit *k* → ∞ in the branching process model described above. The estimator for the selection coefficients then reduces to a scaled version of (8),
where *n* = *n*(*t*) is the number of newly-infected individuals at time *t* (and is different therefore from *I*), ** x** is the frequency vector of newly infected individuals, and

*γ*is the regularization. In both cases selection coefficients are accurately recovered.

### Regions and time-series for SARS-CoV-2 analysis

We used sequence alignments and metadata downloaded from GISAID (ref.^{56}) on August 14th, 2021, which includes more than 3 million sequences. Ideally, we would like to divide this data into the smallest separate areas that have outbreaks that are largely independent of those in the surrounding regions, so as to avoid biases due to travel between regions or unequal sampling in different locations. However, this needs to be balanced with the limitations of the data, since regions with poor sampling could contribute more noise than signal. We therefore divided data into the smallest regions available in the metadata that are still large enough such that infections resulting from travel outside of the region are likely to be far less frequent than transmission within the region. This results in the inclusion of mostly separate countries in Europe and Asia and states in North America. Two exceptions to this are that we separate northern and southern California due to the geographical separation of population centers, and we separate Northern Ireland from the rest of the United Kingdom due to its geographical isolation.

To minimize the effects of sampling noise, we chose regions and time-series within these regions based on the following criteria:

In any period of 5 days within the time-series there are at least 20 total samples.

The number of days in the time-series is greater than 20.

The number of new infections per day is at least around 100.

The last criterion ensures that there are enough infected individuals that transmission is not driven overwhelmingly by stochasticity.

Our results are robust to reasonable variation in these parameters. Comparing the number of locations used and the sample sizes shown in **Supplementary Fig. 5** in the data to those used in the simulations shown in **Supplementary Fig. 1**, we expect our inference to accurately distinguish beneficial, deleterious, and neutral SNVs from one another.

### Data processing

We perform a number of preprocessing steps to ensure data quality. We first eliminated incomplete sequences with gaps at more than one third of the genome. We then removed sites from our analysis where gaps are observed at > 95% frequency, since these sites may represent very rare insertions or sequencing errors. We also removed sites in noncoding regions of the SARS-CoV-2 genome and ones where all observed SNVs are synonymous. We imputed ambiguous nucleotides with the nucleotide at the same site that occurs most frequently in other sequences from the same region.

For the remaining sites, we excluded rare SNVs whose frequency is never larger than 1% in any region and ones that are not observed at least 5 times. These sites, if included, are almost always inferred to have extremely small selection coefficients. Furthermore, since their frequencies are so small, their covariance with other sites is also small and is therefore unlikely to have a large effect on inference. We verified that different reasonable values for these cutoffs result in essentially identical selection coefficients (**Supplementary Fig. 6**).

### Calculating frequency changes and covariances

To increase robustness to finite sampling, we integrated terms in (6) over time, assuming that frequencies and covariances are piecewise linear, rather than summing contributions from each time point^{57}. To obtain better estimates of changes in SNV frequencies (the term *x*(*T*) − *x*(0) in (9)), we averaged ** x**(

*T*) as the frequencies in the window of the final 10 days and

**(0) as the frequencies in the window of the first 10 days for each time-series and region. This smoothing is necessary especially in regions where sampling is sparse, where the number of genomes sampled on a particular day may be as small as 1 or 2. We confirmed that our results are robust to reasonable changes of this window size of 10 days (**

*x***Supplementary Fig. 6**).

We also normalized time in units of serial intervals or “generations” by dividing the integrated covariance matrix by 5, following results that the serial interval for SARS-CoV-2 is roughly 5 days^{58–60}. This allows us to convert from units of time in days to generations, as in (9).

### Calculating selection coefficients

After the above preprocessing (before eliminating synonymous SNVs) there remain 21, 050 SNVs observed at a frequency above 1% in at least one region and observed at least 5 times. We assume constant values for *R, n*, and *k* in all regions, and use (10) to estimate selection. When *R, n*, and *k* are constant, these terms can be effectively absorbed into the regularization *γ*.

We normalize selection coefficients such that the nucleotide for the Wuhan-Hu-1 reference sequence at each site has a selection coefficient of 0. To do this, we subtract the selection coefficient for the reference nucleotide from the inferred coefficient for each other allele at that site after all selection coefficients have been computed.

We used these estimates for the selection coefficients for nonsynonymous SNVs to estimate the corresponding selection coefficients for amino acid substitutions (**Table 1**). If there were multiple SNVs in a codon that result in the same amino acid variant, but are not strongly linked to one another, then the selection coefficient for the amino acid was calculated as the largest (in absolute value) of the SNVs. If there were multiple SNVs in the same codon that yield the same amino acid and these SNVs are strongly linked to one another, then the selection coefficient for the mutant amino acid was calculated as the sum of the selection coefficients for the SNVs.

We calculated selection coefficients for major variants by summing the individual nucleotide SNVs that define the variant, which follows from our assumption of additive fitness. SNVs for major variants were obtained by first finding groups of strongly linked SNVs that correspond to a variant, and then adding any other mutations given on https://covariants.org that were not identified by our linkage analysis.

We also computed selection coefficients for collections of strongly linked SNVs that may not be officially-designated variants. To determine sets of strongly linked SNVs, we considered the following statistics. If the number of genomes with a SNV at site *i* is called *h*_{i} and the number of genomes with SNVs at both site *i* and site *j* is *h*_{ij}, then we say that two sites *i* and *j* are strongly linked if *h*_{ij}*/h*_{i} and *h*_{ij}*/h*_{j} are both greater than 80%. As for the major variants, we computed selection coefficients for sets of strongly linked SNVs by summing the contributions from individual SNVs. Selection coefficients for strongly linked SNVs were used to compute the “null” distribution that we use as a metric for early detection of variants with increased transmission.

### Choice of regularization

In principle, the regularization strength *γ* is related to the width of the prior distribution for SNV selection coefficients. The regularization strength also plays a role in reducing noise in selection coefficient estimates due to finite sampling of viral sequences. This is especially important for SNVs that are observed only briefly in data, as they will have small integrated variances in the “denominator” of (6). Larger values of the regularization more strongly suppress noise, but they also shrink inferred selection coefficients towards zero.

We use a regularization strength of *γ* = 40 after absorbing factors of *n, k*, and *R* into *γ* (see (S4) in Supplementary Information). For much smaller values of *γ*, selection coefficient estimates are unstable due to sampling noise. However, inferred selection coefficients stabilize and become insensitive to the precise value of *γ* for *γ* ≳ 10 (**Supplementary Fig. 6**). Larger values of *γ* will result in selection coefficients with smaller absolute values, but for large enough *γ* the rank ordering of inferred selection coefficients is highly reliable. In summary, the coefficients that appear to be the most beneficial or deleterious remain this way regardless of reasonable choices for *γ*, though their precise values scales with the regularization strength.

### Rapid detection of variants with increased transmission

To estimate how quickly we can detect a transmission advantage for a new SNV or variant, selection coefficients are calculated only in the specific region where the variant arose. Since inference is only done in a single region, SNVs that appear only briefly at low frequencies —and which therefore are unlikely to change transmission rate —only appear once, whereas in the global analysis such SNVs may appear at low frequencies in multiple regions. For this reason we use a lower regularization of 10 for regional analysis. The null distribution is calculated by first finding all variants (including one or more SNVs) that are inferred to have a selection coefficient of absolute value less than 1% using the joint inference over all regions. We then calculated the selection coefficients that would have been inferred for these variants at all earlier time points in each region after they were first observed in that region. We can then say with high confidence that a variant increases transmission once the inferred coefficient for that variant in a specific region surpasses any of the inferred coefficients in the null distribution.

### Effects of travel on inferred selection

Travel of infected individuals can bias inferred selection coefficients by changing the frequency of variants in a region for reasons that are not due to increased or decreased transmission. To analyze the effect that travel has on the inferred selection coefficients, we focused on the United Kingdom, and especially on the variant 20E (EU1), because sampling the UK is excellent and because there exists a high-quality estimate for the number of importations of this variant^{61}. In order to quantify the effect of travel, it is important to have an estimate for the number of newly infected individuals on each day since the effect due to travel depends on the number of cases that are imported relative to the number of local cases (see 6). We used statistics from the Institute for Health Metrics and Evaluation^{62} to estimate the number of newly infected individuals in the UK. For simplicity, we assumed a constant number of importations of 20E (EU1) per day starting on July 7th, 2020 (the first day a 20E (EU1) sequence was sampled in the UK) and continuing for 100 days. We then inferred the selection coefficients for many different numbers of importations. The results are shown in **Supplementary Fig. 7**, where we find that a very large number of importations is necessary for 20E (EU1) to be inferred to be neutral (*ŵ* = 0).

In our full data set, selection coefficients that are inferred to be close to zero may in fact be slightly beneficial or deleterious and are inferred incorrectly due to travel. However, given the degree of travel needed to substantially bias inferred selection demonstrated in **Supplementary Fig. 7**, travel is unlikely fully explain large inferred selection coefficients. This is especially true for variants observed in regions where travel restrictions reduced the number of infected individuals entering or leaving a region. In addition, the effects of travel are also muted when the number of local infections is large.

### Data and code

Sets of processed data, computer code, and scripts that we have used in our analysis are available in the GitHub repository located at https://github.com/bartonlab/paper-SARS-CoV-2-transmission. This repository also contains Jupyter notebooks that can be run to reproduce the results presented here, using sequence data and metadata from GISAID.

## Supplementary Information

### 1. Summary

Here we discuss three main topics. First, we give a detailed introduction of our epidemiological model as well as a derivation of the estimator (1). We then describe simulations of an outbreak and show that selection coefficients can be accurately recovered from simulation data even with relatively poor sampling. Finally, we discuss our analysis of SARS-CoV-2 evolution during the outbreak. We recover known results, and we show that inference is insensitive to a large variety of parameter choices.

### 2. Epidemiological model

#### 2.1. Introduction

In epidemiology, the spread of infection can be modeled as a branching process where each infected individual (also referred to as a case) infects *n* additional individuals^{1}. The distribution of *n* is often taken to be Poisson, but differences in the number of contacts with susceptible individuals, disease course within an individual, and other factors mean that the Poisson rate *λ* is not generally the same for all cases^{2}. Below, we first follow ref.^{2} to explore families of distributions for the number of new cases per infected individual. Next, we extend these models to consider multiple variants of the pathogen that differ in their spreading efficiency. We seek to characterize how the distribution of pathogen variant frequencies is expected to change over time, and how such data can be used to estimate the relative spreading efficiency of different variants.

#### 2.2. Distributions for the number of infected individuals

As noted above, the basic distribution of the number of new cases *n* caused by one case in a susceptible population is Poisson,
Typically we might take the Poisson rate *λ* to be *R*, the effective reproduction number, which is the expected number of cases directly caused by one case. In that case, the average number of cases following the Poisson distribution is
To account for variability in transmission dynamics, the basic Poisson distribution with a single rate *R* can be replaced with a continuous mixture of Poisson distributions, where the rate parameter *λ* follows a gamma distribution,
with shape parameter *α* and rate parameter *β*. The average value of *λ* is
and its variance is
In this context, it is natural to take *α* = *k* and *β* = *k/R*. With these choices, the gamma distribution reads
The parameter *k* is a dispersion parameter that determines how long-tailed the distribution is. The mean value of *λ* is always *R*, but when *k* is smaller its variance increases. In the limit that *k* → ∞, we recover the pure Poisson distribution with rate *λ* = *R*. When *k* = 1, the distribution of the number of cases *n* is geometric,
where *p* = 1*/*(1 + *R*). For arbitrary values of *k* > 0, the number of cases follows a negative binomial distribution,
The standard parameters of the negative binomial distribution are *r* and *p*, which are set to *k* and *k/*(*k* + *R*) in our parameterization above.

#### 2.3. Dynamics for the pathogen variant frequencies

Let us assume that there exist multiple variants of a pathogen, which are distinguished by an index *a*. The number of cases infected with variant *a* is *n*_{a}. We assume that different variants have slightly different transmission probabilities, so that *R*_{a} = *R*(1 + *w*_{a}), with |*w*_{a}| ≪ 1. The term *w*_{a} is analogous to a selection coefficient in population genetics.

##### 2.3.1. Dynamics of multiple cases infected by a single variant

First, let us assume that *n* individuals, each labeled by an index *i*, are all infected by the same variant of a pathogen. For now we will assume that there is no travel into or out of the population, though we will include it later. How many cases will be generated from these individuals? The number of new cases for all individuals is
where the numbers of cases generated by individual *i* follows a negative binomial distribution. Because all individuals are infected by the same variant, the negative binomial parameter *p* = *k/*(*k* + *R*) is the same for each of them. Then, assuming that all of the infection events are independent, it can be shown that the probability distribution for the total number of new cases *n*′ also follows a negative binomial distribution with the same value of *p*, and with *r* = *nk* (that is, the new *r* parameter value is the sum of the individual *r* parameter values). Thus, the distribution of *n*′ is

##### 2.3.2. Dynamics for multiple cases infected by multiple variants

Let us extend the previous example to consider *m* variants of a pathogen. At the starting point, the number of individuals infected by a given variant *a* is *n*_{a}, with *a* ∈ *{*1,…, *m}*. The fraction of cases infected by variant *a* is
Now, we would like to know how the fraction of individuals infected by each variant is expected to change with each round of infections. In other words, for variant *a*, we would like to compute
where the outer sum is over all vectors ** n′** with entries , and with for all

*b*. Here, we have assumed that the are independent across

*b*.

To proceed, it is convenient to write the negative binomial distributions as mixtures of Poisson distributions (as indicated above), giving
Next, we use the fact that the sum of independent Poisson-distributed random variables is also Poisson with rate parameter equal to the sum of the individual rates, and that the distribution of independent Poisson random variables conditioned on their sum is multinomial, to write
Here ** λ** is a vector with entries

*λ*

_{1},

*λ*

_{2}, …, and we have also introduced Σ

_{a}

*λ*

_{a}=

*λ*. Note also that the outer sum on the first line is over all vectors

**whose (non-negative) entries sum to**

*n*′*n*

**′**.

Computing the remaining integrals exactly is challenging, largely because the Gamma distributions have different rate parameters. To address this, next we will expand our expression to first order in the *w*_{a}, since these are assumed to be small parameters. Referring back to Eq. (S1), the expansion gives
Next we change variables to {*λ, q*_{1} = *λ*_{1}*/λ, q*_{2} = *λ*_{2}*/λ*,…, *q*_{m−1} = *λ*_{m−1}*/λ*}, because the distribution of the sum of gamma-distributed random variables, *λ*, with the same rate parameter and the ratios of the individual variables to the total (*λ*_{a}*/λ*) follow independent gamma and Dirichlet distributions^{3}. The *m*th ratio by conservation. By convention we will also set *w*_{m} = 0, which can be thought of as normalizing the value of *R* relative to a reference genotype. The transformation then gives
In the expressions above *P*_{D}(** q**|

**) is the Dirichlet distribution, with concentration parameters**

*α***given by**

*α*

*n**k*in our case. Note that if

*w*

_{m}≠ 0, the last line should instead read Thus, we obtain (with

*w*

_{m}= 0) Following a similar approach, we can compute the second moments. First, we consider In going from the third to the fourth line above, we have made the approximation that which is valid for

*λ*≳ 1. Similarly, Simplifying the expressions above is tedious but straightforward. The following results are helpful: Here we have frequently used

*n*

_{a}=

*ny*

_{a}to simplify expressions.

With the above results, simplifying expressions for the second moments, we finally find
and
where we have assumed that the *w*_{a} are 𝒪 (1*/n*), as in the Wright-Fisher model with weak selection. We have thus found that the first and second moments of frequency changes in our multi-variant epidemiological model have the same frequency dependence as those in the multispecies Wright-Fisher model, but with different scaling. The first moment (‘drift’) is multiplied by a factor of *nk/*(*nk* + 1), and the second moment (‘diffusion’) by
These prefactors match with the Wright-Fisher model exactly when *k* → ∞ (i.e., a pure Poisson distribution for the number of new cases per infected individual) and *R* = 1.

#### 2.4. Correction to the first moment due to travel

Travel of infected individuals can change the total number of infected individuals in a region and will thus lead to a correction in the first and second moments that have been calculated above. Call *n*_{a,in} − *n*_{a,out} = *δn*_{a}. There will be a first order correction to the first moment due to fact that the number of individuals of variant *a* in the next generation is now , where now represents the number of individuals of variant *a* in the next generation that don’t come from travel. In addition, the total population in the next generation will be *n*′ + Σ_{b} *δn*_{b}, where . Therefore, in calculating the first moment we will now have
If we allow that the total inflow and outflow of a specific variant, *δn*_{a}, and of all variants together, Σ_{d} *δn*_{d}, are both much smaller than the number of infected individuals, then the term on the far right can be expanded:
The first term simply reproduces the first order moment without travel. The last two terms lead to the correction
where *δy*_{a} = *δn*_{a}*/n*. Finally, for the second term, we have
Ultimately, this produces the additional correction
Remembering that both *δy*_{d} and *w*_{d} are small, the overall first order correction will be

#### 2.5. Correction to the second moment due to travel

There will also be a correction to the second moment due to travel, and for the diagonal terms, we will have,
Again assuming that the number of infected individuals traveling is small compared to the total number of infected individuals in a region, we can expand the term on the far right:
Similarly, we have
and the term on the far right can be expanded as
These can be calculated in much the same way as the correction to the first moment, and all of the resulting terms are 𝒪 (1*/n*^{2}).

#### 2.6. Derivation of the selection coefficient estimator

The derivation in this section closely follows that given in ref.^{4}. It is well known that a WF process can be approximated by a continuous-time continuous-frequency diffusion process in the large *n* limit. In the continuous-time limit the time variable *t* has units of *n* generations, with one generation in discrete time taking *τ* = 1*/N* continuous time units. The change in the frequency of variant *a* in one generation due to travel is *δy*_{a}, so the change of frequency in *τ* generations is *δy*_{a} * *τ*. Furthermore, the selection coefficients *w*_{a} are assumed to scale with *n* such that , where is a parameter independent of the population size *n*. In the limit of large population size, our generalized super-spreading model can, like the WF process, be approximated by a diffusion process, where the transition probability density *ϕ* is the solution to the Fokker-Planck equation
where *M* is the number of distinct genotypes, ** y** is the genotype frequency vector,

**is the drift vector, and**

*d**C*is the diffusion matrix. Ignoring recombination and mutation, since these are comparatively small and therefore unlikely to significantly affect estimates of changes in viral transmission (though these can be included and the solution remains tractable), the drift and diffusion have entries given by, The Fokker-Planck equation can be converted into a path integral approximation for the transition probability density We write the re-scaled drift vector as . Since we aim to infer selection coefficients for the SNVs, it is more convenient to work with the allele frequencies

*x*

_{i}instead of the genotype frequencies

*y*

_{a}. The allele frequency at site

*i*is given by where is a 1 if there there is a mutant allele at site

*i*on genome

*a*and zero if there is not. Similarly, if the selection coefficient for the genotype

*a*is

*w*

_{a}and the allele level selection coefficient for allele

*i*is

*s*

_{i}, then they are related by: where

*L*is the length of the genome.

The allele level drift and diffusion terms will be linear combinations of the genotype level drift and diffusion, just as with the frequencies and the selection coefficients. The drift vector for the allele frequencies can be transformed by
The sum in the last term can be interpreted as the total number of infected individuals added or subtracted to the population due to travel, divided by the population size. This can be used, along with the transition probability density for genomes, in order to find an approximation for the mutant allele transition probability density:
where here the diffusion *C* is derived similarly to the drift ** d** and has entries
A path integral then gives the probability of observing a trajectory of allele frequencies (

**(**

*x**t*

_{1}),

**(**

*x**t*

_{2}), …,

**(**

*x**t*

_{K})), and is given by Bayesian analysis can then be used to show that the posterior probability of the selection coefficients

**= (**

*s**s*

_{1},

*s*

_{2}, …,

*s*

_{L}) given an observed frequency path

**(0),**

*x***(1), …,**

*x***(**

*x**T*) is where we use a Gaussian prior distribution with zero mean and adjustable covariance determined by the parameter

*γ*.

For the inferred coefficients, we take those that maximize the posterior probability. They can be analytically found by a simple application of the Euler-Lagrange equations to equation S3 and are given by The second term in the numerator is the correction due to travel of infected individuals into and out of the region, and is given by

#### 2.7. Extension to multiple regions

In the SARS-CoV-2 pandemic, and in real disease outbreaks in general, there are frequently multiple different outbreaks in different regions that develop largely or entirely independently of one another. In order to find the best estimate for the selection coefficients using the data from multiple regions, the estimator can be generalized to find the maximum a posteriori estimate for the selection coefficients given the time series of allele frequencies in each of the regions. If the probability for a specific path in a specific region *r* is given by , where *x*_{r} is the allele frequency vector in region *r*, then the joint probability of the specific paths in all of the regions is simply the product of the individual region probabilities:
where *M* is the number of different regions. Since this is a product of exponential functions, the log posterior will be the sum of the exponents and the regularization. This can be maximized with respect to the selection coefficient vector ** s** as before, which, dropping the travel term, leads to the estimator:

#### 2.8. Simplification of the estimator

In real outbreaks the parameters *k, R*, and *n* are in general time-varying. In our simulations as well, *R* and *n* are time-varying (and *k* can be constant or time-varying). In order to accurately infer the selection coefficients according to Eq. (S4) or Eq. (S5), it would seem that we need to accurately infer the values of *k, R*, and *N* at every point in the time series. In practice, this would be extremely difficult. For general discussion about the effective reproduction number *R* and the basic reproduction number *R*_{t} as well as some attempts to infer this, see refs.^{5–9}. In order to get an accurate estimate for *k* it is necessary to have pervasive contact tracing, so that the negative binomial distribution is well sampled, and there are other difficulties in inferring *k* as well^{10–12}. Lastly, it can be difficult to estimate the number of new infections due to multiple factors, including the difference between the population that gets tested and the population that does not, test result inaccuracies, and delays between symptom onset, testing, and reporting^{13,14}.

We propose an alternative that lets us avoid these complications. The prefactor *nkR/*(*R* + *k*), multiplies both the numerator and the denominator. Therefore, the only effect of the prefactor is to weight time points more heavily if the population size, the dispersion parameter, or the basic reproduction number, is larger. This makes sense in theory, because a larger *n* or *k* implies that there is less noise and the trajectories are more deterministic, while a larger *R* means that there are more new infections per generation and thus more data to use to infer the selection coefficients. This does hold with perfect information, that is, if all infected individuals are sampled at every time point. However, in practice, finite sampling is the source of significantly more noise than that due to a time-varying population size or dispersion, so weighting the time points based upon *n, k*, or *R* in fact leads to worse inference than assuming the parameters are constant in time and thus weighting the time points equally. However, in the special and unrealistic case of perfect sampling, using the actual parameters does lead to better inference than using constant parameters (see **Supplementary Fig. 4**). If the time points are weighted equally, then, provided that the regularization *γ* is scaled appropriately (and in general it must be determined by separate means, discussed below), the prefactors in the numerator and denominator cancel, and the estimator is independent of *n, k*, and *R*. Defining by
so that
Eqs. (S4) and (S5) for the selection coefficients become, respectively
which are the same as the MPL estimators for the Wright-Fisher model except that we have ignored the mutation term because the mutation rate for SARS-CoV-2 is small^{4}.

### 3. Simulations

We tested the inference using simulations of disease spread. We used two different kinds of simulations. The first is the super-spreader simulation based on the model described above, which is an analog of the Wright-Fisher model where the sampling distribution for the number of new infections per infected individual is drawn from a negative binomial distribution instead of a pure Poisson distribution The second is a standard deterministic susceptible-infected-recovered (SIR) model that has been adapted in order to include multiple variants of the virus with different transmission rates, described in Methods.

#### 3.1. Description of simulations

We simulated disease spread as a branching process in which the number of individuals infected per currently infected individual is drawn from a negative binomial distribution whose shape is determined by the basic reproduction number *R*_{0} (or the reproduction number, *R*, in a population that is not totally susceptible) and the dispersion parameter *k*. Because we sample in this way, the population size is not constant. However, if the population size is too small, then the population is extremely likely to die off stochastically, and if the population size is too large, then sampling from the negative binomial becomes too computationally expensive. In order to avoid both of these problems, once the population size is large enough *R* is adaptively adjusted so that the average reproduction number for the entire population will remain near 1, and the population size will oscillate around a fixed value. An explicit time-varying population size can also be used as input, and *R* will be adaptively adjusted to remain near the given curve. Constant values can be used for the dispersion *k* or *k* can vary as a function of time, perhaps representing different degrees of social distancing or lockdown measures at different times. Since different interventions implemented to prevent the spread of disease would likely affect the shape of the distribution of the number of individuals infected by a single infected individual, time-varying values for *k* and *R* can be used to reflect these effects. We also implement travel of specified variants into or out of the population over time.

#### 3.2. Inference

The simulations are run for a number of generations and genomes are sampled from the population of infected individuals at different times using a multinomial sampling distribution. This sampled time series is then used to infer the selection coefficients using Eq. (S4). Alternatively, multiple simulations can be run and the joint inference of the selection coefficients can be made using Eq. (S5). We find that, given good enough sampling, a long enough time series, and sampling that occurs at a sufficient number of times, the selection coefficients can be inferred very accurately (**Fig. 1**). The quality of inference is significantly improved if multiple simulations are combined and if mutated sites show up in more than one of the simulations, even under less than ideal sampling conditions. Beneficial coefficients are typically inferred more accurately than deleterious ones, likely because deleterious SNVs frequently die off and therefore there is less data to use for inference.

The inference is robust to shortening the time-series or lowering the number of samples taken per generation, though obviously if either of these conditions is too extreme (or worse, both), the inference starts to break down. The negative effects of a short time-series or poor sampling can be somewhat made up for by using multiple simulations, which is analogous to using data from outbreaks in multiple regions. In addition, the diffusion approximation is only valid in the large *n* limit. However, we tested the inference for small population sizes and found that inference is accurate even if the population of newly infected individuals per serial interval is as low as a few hundred (**Fig. 1**).

It is reasonable to expect that in a real outbreak there will be some travel of infected individuals into and out of the population. This can affect the estimation of the selection coefficients if the travel term, Eq. (S2), is ignored. As a simplified example, imagine there is a steady influx of a variant that has a SNV that is entirely neutral, and little to no outflow of this variant. In this case the selection coefficient for the SNV that migrates into the population will likely be overestimated because the frequency of the SNV will in general be increasing, even though this increase in frequency is not due to a selective advantage. Similarly, if there is an excess of outflow of a certain neutral SNV, then the fitness for this SNV will in general be underestimated. Testing this with simulations, we found that modest influx of a neutral SNV over a long time (25 importations per serial interval compared to *n* = 10^{4} local transmissions, continuing for 100 serial intervals) produces a small but detectable bias in the inferred selection coefficient, which can be corrected by including the travel term and using the true flux of variants for *δn*_{a} (**Supplementary Fig. 8**). More generally, corrections due to travel should become significant when the term *τ*_{int} becomes large compared to observed changes in SNV frequencies.

## ACKNOWLEDGEMENTS

We gratefully acknowledge the numerous laboratories worldwide that have provided sequence data and metadata to GISAID. A full list of originating and submitting laboratories for the sequences used in our analysis can be found at https://www.gisaid.org using the EPI-SET-ID: EPI_SET_20211201ze. The work of B.L., E.F., and J.P.B. reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R35GM138233. The work of S.F.A., M.S.S., A.A.Q., and M.R.M. was supported by the General Research Fund of the Hong Kong Research Grants Council under Grant Number 16213121.