## Abstract

The emergence of multiple new SARS-CoV-2 variants, characterized to varying degrees by increased infectivity, higher virulence and evasion of vaccine and infection-induced immunologic memory, has hampered international efforts to contain the virus. While it is generally believed that these variants first develop in single individuals with poor immunologic control of the virus, the factors governing variant predominance in the population remain poorly characterized. Here we present a mathematical framework for variant emergence accounting for the highly variable number of people secondarily infected by individuals with SARS-CoV-2 infection. Our simulations suggest that threatening new variants probably develop within infected people fairly commonly, but that most die out and do not achieve permanence in the population. Variants that predominate are more likely to be associated with higher infectiousness, but also the occurrence of a super-spreader event soon after introduction into the population.

## Introduction

The emergence and subsequent regional predominance of multiple, highly infectious novel SARS-CoV-2 variants in locations across the globe surprised the scientific and public health communities. Several variants are characterized by more than a dozen new mutations often in the genomic region encoding the viral spike protein (1-3).This rate of mutational change surpassed expectations from previous phylogenetic surveys, which identified population sweeps with single new point mutations only after multiple generations of infection (4). Immuno-compromised hosts are a likely source of these variants. Individuals who have impaired cell-mediated or humoral immune function can shed the virus at high viral loads for many weeks, presumably in the relative absence of immune selection pressure (5-8). Yet, it remains unknown why certain variants ultimately predominate.

The spread of new variants has had a dramatic impact on global SARS-CoV-2 epidemiology. The B.1.1.7 variant has higher infectivity and virulence than baseline variants (1, 9), while the B.1.3.5.1 and P.1 variants may have the ability to evade vaccine- and infection-induced immunologic memory (10-12) and may also have higher infectivity and virulence (3, 13, 14). The true abundance of epidemiologically important variants is likely underestimated by sequencing limitations in many global infection hot spots. The B.1.1.7, B.1.3.5.1 and P.1 variants may also undergo further important evolutionary changes over time. Here we perform mathematical modeling to characterize determinants of variant emergence and predominance.

## Results

### Frequent stochastic elimination of new SARS-CoV-2 variants

We previously developed a mathematical model which captures the highly variable secondary transmission pattern of SARS-CoV-2 by fitting to empirically observed distributions of serial intervals within transmission pairs, as well as distributions of number of secondary transmissions from infected people (15-17). The model explains why the majority of infected people do not infect others while a minority are the index case for large super-spreader events (18-20): the exposure contact network of infected people is highly over-dispersed meaning that on rare occasions, an infected person may have dozens of exposure contacts who have the potential to become secondarily infected. Presumably this phenomenon, which is not observed for influenza (21), is due to aerosolization of SARS-CoV-2 within crowded indoor environments in which masking is limited (22). For a super-spreader event to occur, the model projects that the transmitter must also be within the limited time window when they are shedding at a sufficiently high viral load (15). More model details are in the **Methods**.

We adapted this model to assume that new variants emerge from a single infected person. We then ran individual stochastic simulation to assess whether variants are present after 100 days or whether they are extinguished. Simulations were governed by an average effective reproductive number (R_{e}) defined as the mean number of secondary infections created per person across the entire population, as well as a value for the “super-spreader parameter” ρ which captures the variability of a gamma-distributed exposure contact network: the value is low (ρ=0.01) for infections such as influenza in which there is low day-to-day and person-to-person variance in the number of exposure contacts, and high (ρ=40) for SARS-CoV-2 (15).

For each parameter set, we ran 1000 simulations. We identified that stochastic burnout of infections is more likely when R_{e} is lower but also when the contact network is highly over-dispersed as with SARS-CoV-2 (**Figure 1a**). This suggests that most highly infectious SARS-CoV-2 variants will die out when generated within a single person, even when R_{e} is quite high.

We performed an equivalent analysis with 10 starting infections as might occur if an outbreak of a new variant first spreads in a small household or work cluster. The rate of burnout was still relatively high for low R_{e} and high over-dispersion scenarios but decreased substantially with higher R_{e} values for a given variant (**Figure 1b**). We next performed an analysis with 100 starting infections as might occur with a larger initial super-spreader event. The rate of burnout was low for all R_{e} values and assumptions regarding contact network dispersion (**Figure 1c**). Therefore, although stochastic extinguishment of novel SARS-CoV-2 variants is likely to be common, once roughly 100 cases are established, a variant is likely to continue to expand exponentially in the absence of intensification of non-pharmaceutical interventions (NPIs).

### Timing of SARS-CoV-2 variant emergence is highly variable at realistic effective reproductive numbers

We next evaluated time from first case of a new variant to 1000 cumulative infections in simulations in which stochastic burnout did not occur. We performed 1000 simulations under each assumed value of R_{e}. When starting with one infection, we observed a wide variance in time to 1000 infections with an increase in the median time from 23 to 40 days as R_{e} decreased from 2.2 to 1.2 (**Figure 2a**). The variance and median time (19 to 38 days) to 1000 infections were similar when starting from 10 infections (**Figure 2b**) but the median time (7 to 17 days) and variance decreased when starting from 100 infections (**Figure 2c**) again demonstrating that stochastic forces are less important once 100 cumulative infections are reached.

### Early super-spreader events as a predictor of rapid progression to high prevalence of emerging variants

We next examined the timing and number of super-spreader events during these simulations. Super-spreader events were defined as events in which one person infected a minimum of 5 (**Figure 3a**), 10 (**Figure 3b**) or 20 (**Figure 3c**) other people in a day. With each definition and assumed value of R_{e}, the timing of the first super-spreader event correlated with time to 1000 cumulative infections. The strength of this correlation increased with the least inclusive definition of a super-spreader event and with higher values for R_{e} (**Figure 3c**).

The number of super-spreader events prior to 1000 cumulative infections correlated positively or not at all with time to 1000 cumulative infections for the more inclusive definitions of super-spreader events (at least 5 or 10 secondary infections, **Figure 3a & b**) across all R_{e} values, signifying that multiple small super-spreader events do not accelerate early epidemic spread. However, the number of super-spreader events correlated negatively with time to 1000 cumulative infections with low to moderate values for R_{e} (1.2-1.8) and less inclusive definitions of super-spreader events (at least 20 secondary infections, **Figure 3c**), meaning that early large super-spreader events are a likely driver of variant predominance under parameter assumptions which are compatible with known features of variant B.1.1.7 (1).

### Increased likelihood of new variant predominance when co-circulating baseline variants have an effective reproductive number less than or equal to one

We next performed analyses assuming a baseline variant at different values for R_{e} (0.8, 1.0 or 1.2) with 1000 baseline infections and assumed that 1% of transmission events result in a new variant. R_{e} for new variants were drawn from a uniform distribution varying from 1.0-2.2 at intervals of 0.2. For each scenario, we performed 100 simulations until 100,000 cumulative infections were generated or until stochastic burnout of all variants occurred. Variants were tracked individually.

Super-spreader events (defined here as >=5 secondarily infected people) were requisite for new variants to reach 1000 cumulative infections in individual simulations. For example, in the case of a baseline variant with R_{e}=1.0 and new variant with R_{e}=1.2, 95 variants had a super-spreader event, out of which 33 reached 1000 cumulative infections. 905 variants did not have a super-spreader event, and none reached 1000 cumulative infections (p<2.2×10^{−16}, Fisher’s exact test). The trend of no variant reaching 1000 cumulative infections in the absence of super-spreader events was observed for all assumed values of new variant R_{e}.

When we assumed a baseline variant (starting with 1000 infected at t=0) with R_{e}=1.0 (**Figure 4a**), we observed predominance of a new variant in 98 out of 100 simulations (**Figure 4b**). There was a slightly higher likelihood that more infectious variants with higher R_{e} would predominate **(Figure 4b)**. The timing of variant predominance was highly variable (**Figure 4a**). When we performed the same analysis assuming that 0.1% and 0.01% of transmission events result in a new variant, then variant takeover only occurred in 44% and 6% of simulations (**Figure 4c, d**) with more evenly distributed values of R_{e}. This suggests that the frequency of variant evolution at the within-host level may determine the characteristics of the predominant strain at the population level.

When we performed the analysis using a baseline variant with R_{e}=1.2 and 1% of transmission events resulting in a new variant, then variant takeover only occurred in 42 of 100 of simulations, whereas a baseline variant with R_{e}=0.8 allowed variant takeover in 100 of 100s simulations with no stochastic burnout of virus. This results explains why new variant predominance is often observed when incidence of the baseline variant is decreasing (1).

### High incidence outbreaks and formation of new variants

In simulations with the baseline variant R_{e}=1.0, new high incidence waves of infection were predictably associated with emergence of more new variants some of which ultimately predominated due to higher R_{e} (**Figure 4a**, top and middle left panels). This finding highlights that prevention of new variants is achieved most effectively by avoidance of large waves of infection.

## Discussion

Our simple mathematical model identifies that while higher infectivity is one important predictor of a variant’s ultimate predominance in a population, but there is also a substantial element of bad luck. Stochastic burnout is a common event when a pathogen with a reproductive number between 1 and 2 is introduced into a population (23). The over-dispersed secondary infection rate associated with SARS-CoV-2 further increases the likelihood of stochastic burnout given an equivalent basic reproductive number. This suggests that most new highly infectious variants which emerge from infected individuals never spread substantially in the population. It also raises the more provocative hypothesis that human coronaviruses with pandemic potential such as SARS, MERS and SARS-CoV-2 are introduced into the human population fairly commonly, but that most local outbreaks are avoided due to good fortune alone.

Among variants which do establish a foothold, our model suggests that early large super-spreader events, which are relatively rare at the individual level but become increasingly likely as incidence increases during a local outbreak, may determine which variant is likely to predominate. These events provide a head start for a given variant, bypassing the slower exponential growth phase to a phase of epidemic growth which is more deterministic. Super-spreader events later during an epidemic growth curve are relatively less important for a variant to achieve predominance.

From a public health perspective, our results provide yet another reason to intensely focus non-pharmaceutical interventions (NPIs) on preventing large super-spreader events. This policy prescription includes prohibition of large indoor gatherings, a focus on adequate ventilation in indoor work environments and schools, and enforcement of highest quality masks (K95 or N95) in circumstances where group exposures cannot be avoided (17). Prevention of super-spreader events will limit number of infections and lower the introduction of new variants and will also decrease the probability that a single large super-spreader event will initiate a more rapid local epidemic as has already occurred in Boston, South Korea and other locations during the pandemic (24, 25).

Our model has important limitations. While the model’s qualitative findings are robust, we cannot estimate the outbreak size necessary to ensure introduction of new variants into a population as many parameters required to do so are unknown. For instance, it is not yet clear whether the percentage of immunocompromised hosts varies across populations based on factors such as HIV prevalence and availability and use of immunosuppression for organ transplantation and cancer treatment. The number of secondary infections created by a person with new variants may also differ from that of other members of the population in ways that are difficult to project. On the one hand, these individuals may shed for longer and at a higher viral load (5, 7). Yet, they also may be more ill and therefore quarantined at home or in the hospital limiting contact exposures. Moreover, while all variants will be impacted in the same way by the introduction of NPIs such as masking and physical distancing, the utilization of these interventions varies considerably among regions and over time. Our model does not capture these nuances and in this sense is intended to be phenomenological only.

We demonstrate that new variants are frequently created and introduced into the population during large waves of SARS-CoV-2 infection. Yet, most variants ultimately burn out and those that ultimately predominate likely were associated with early super-spreader events. These variants are most likely to emerge when the previously dominant variant is decreasing.

## Methods

### SARS-CoV-2 within-host model

We used the within-host model describing the SARS-CoV-2 infection from our previous study (16). This model assumes that the contact of SARS-CoV-2 (V) with susceptible cells (S) produces infected cells at rate βVSwhich then generates new virus at a per-capita rate π. The model also incorporates the death of infected cells mediated by (1) the innate responses (δI^{k}) and (2) the acquired immune responses by SARS-CoV-2-specific effector cells (E). The magnitude of the innate immunity is dependent on the infected cell density and the exponent k. The nonlinearity of the acquired responses is captured by the Hill coefficient r that allows for rapid saturation of the killing. Finally, the parameter Φ defines level of SARS-CoV-2-specific effector cells at which the killing of infected cells becomes half maximal. In the model, the rise of SARS-CoV-2-specific effector cells rise is described in a two-stage manner. The first stage defines the proliferation of the first precursor cell compartment (M_{1}) at rate ωIM_{1} and differentiation into a second precursor cell compartment (M_{2}) at a per capita rate q. Finally, second precursor cells differentiate into effector cells at the same per capita rate q and are cleared at rate δ_{E}.

The model is expressed as a system of ordinary differential equations:
The initial conditions for the model were assumed as S(0) = 10^{7} cells/mL, I(0) = 1 cells/mL, and E_{0} = 0. For simulations we sampled parameter values from a nonlinear mixed-effect model as described in (15), with the following fixed effects and standard deviation of the random effects (in parenthesis): Log_{10}β: - 7.23 (0.2) virions^{-1} day^{-1}; δ: 3.13 (0.02) day^{-1} cells^{-k}; k: 0.08 (0.02); Log10(π): 2.59 (0.05) day^{-1}; m: 3.21 (0.33) days^{-1}cells^{-1}; Log10(ω): -4.55 (0.01) days^{-1}cells^{-1}. We also assumed r = 10; δ_{E} = 1 day^{-1}; q = 2.4 × 10^{−5} day^{-1} and c = 15 day^{-1}.

### Dose-response model

We employed our previously developed dose-response model to estimate the probability of virus entering the airway given a transmitter viral load (i.e., contagiousness) and the probability of cellular infection given a transmitter viral load, (i.e., infectiousness) P_{t}[V(t)] (response) based on viral loads V(t) (dose) (15). The relation between the response and the dose follows, , being λ the viral load that corresponds to 50% infectiousness and 50% contagiousness and α the Hill coefficient that controls the sharpness in the dose-response curve. We assumed that the viral load-dependent contagiousness (i.e., the probability that virus is passaged to the exposed person’s airway) is the same as infectiousness. We estimate the transmission risk as the product of the infectiousness and contagiousness (15).

### Transmission model and reproduction number

As in our previous model (26), we determined the total exposed contacts of a transmitter within a time step (Δ_{t}) using a gamma distribution, i.e. , where θ and ρ represent the average daily contact rate and the dispersion parameter, respectively. The true number of exposure contacts (with viral airway exposure) was then obtained by multiplying the total exposed contacts and the contagiousness of the transmitter (i.e.,). We modelled infectiousness as a Bernoulli event with mean P_{t}, yielding the number of secondary infections within a time step as . Finally, we summed up the number of secondary infections over 30 days since the time of exposure to obtain the individual effective reproduction number, i.e.. For each successful transmission, we further assumed that it takes τ days for the first infected cell to produce virus.

In simple steps, we followed the procedure below to estimate R_{e},

Simulate viral load V(t) of a simulated infected individual using the within-host model.

For a given combination of (λ, τ, α, θ, ρ)

For each time step Δ

_{t}Compute P

_{t}[V(t); λ, α]Draw

Calculate

Calculate

Repeat Steps 1 and 2 to estimate R

_{C}for 1,000 infected individuals. The population level R_{C}can then be calculating by taking the mean of 1,000 individual R_{C}values.

### Parameter values for the transmission model

For simulations, we used the parameter set [α, λ, τ, θ, ρ] = [0.8, 10^{7}, 0.5, 4, 40]) (15) as they most closely reproduces empirically observed individual R_{e} and serial interval histograms as well as mean R_{e} across individuals (R_{0} ∈ [1.4, 2.5]) and mean serial interval across individuals (SI ∈ [4.0, 4.5]) early during the pandemic (18, 19, 27-29).

### Simulating temporal dynamics from the transmission model

For a specific scenario with selected ρ and θ, we followed the procedure below to transform the transmission model into temporal transmission model:

Discretize the time-space of 150 days over time steps Δ

_{t}of 1 day.With n

_{t}representing the number of transmitters at any time t, we start with presumed n_{0}transmitters at t=0 and zero transmitters at the remaining time points.Starting at t = 0, we first determine the number of transmitters at that time step and then,

For each of n

_{t}transmitters:Simulate V(T) over [t, t + 30] at daily intervals (i.e., ΔT = 1) using the within-host model in eq. 1.

Compute P

_{T}[V(T); λ, α].Draw .

Calculate .

Determine times of successful transmission (t

_{S}) as those times ‘t’ where T_{ΔT}> 0 and the number of secondary transmissions at those time points as T_{ΔT}.Update .

Repeat Step 3 for t = Δ

_{t}, t = 2Δ_{t}and so on over the discretized time-space of 150 days.

### Simulating multi-class temporal dynamics from the transmission model

We assumed 7 classes of mutant strains, each with a different R_{e} of 1.0, 1.2, 1.4, 1.6, 1.8, 2.0 and 2.2. To simulate R_{e} of 1.0, 1.2, 1.4, 1.6, 1.8, 2.0 and 2.2, we employed θ of 2.3/day, 3.1/day, 3.5/day, 3.75/day, 4.0/day, 5.0/day and 5.5/day, respectively. For a specific scenario with selected ρ, θ and the probability µ of the transmitter transmitting mutant strain, we followed the procedure below to transform the transmission model into multi-class temporal transmission model:

Discretize the time-space of 150 days over time steps Δ

_{t}of 1 day.With n

_{tc}representing the number of transmitters at any time t of class ‘c’, we start with presumed n_{0c}transmitters at t=0 of class c and zero transmitters at the remaining time points for all classes.Starting at t = 0, for each of the seven classes,

we determine the number of transmitters at that time step of class ‘c’ and then,

For each of n

_{tc}transmitters:Simulate V(T) over [t, t + 30] at daily intervals (i.e., ΔT = 1) using the within-host model in eq. 1.

Compute P

_{T}[V(T); λ, α].Draw .

Calculate .

Determine times of successful transmission (t

_{S}) as those times ‘t’ where T_{ΔT}> 0 and the number of secondary transmissions at those time points as T_{ΔT}.Determine which strain was transmitted at times of successful transmission using µ

_{T}= Ber(µ). If µ_{T}equals 1, then only a mutant strain is transmitted and the class of the mutant strain is randomly selected from 7 pre-specified classes.Update n

_{tc}= n_{tc}+ T_{ΔT}.

Repeat Step 3 for t = Δ

_{t}, t = 2Δ_{t}and so on over the discretized time-space of 150 days.

In the case of R_{e} = 0.8 (simulated with *θ* = 1.1/day), we assume 8 variants instead of 7 and follow steps 1-4 as mentioned above.

## Data Availability

All code is available upon request.