Immune Heterogeneity and Epistasis Explain Punctuated Evolution of SARS-CoV-2

Identifying drivers of viral diversity is key to understanding the evolutionary as well as epidemiological dynamics of the COVID-19 pandemic. Using rich viral genomic data sets, we show that periods of steadily rising diversity have been punctuated by sudden, enormous increases followed by similarly abrupt collapses of diversity. We introduce a mechanistic model of saltational evolution with epistasis and demonstrate that these features parsimoniously account for the observed temporal dynamics of inter-genomic diversity. Our results provide support for recent proposals that saltational evolution may be a signature feature of SARS-CoV-2, allowing the pathogen to more readily evolve highly transmissible variants. These findings lend theoretical support to a heightened awareness of biological contexts where increased diversification may occur. They also underline the power of pathogen genomics and other surveillance streams in clarifying the phylodynamics of emerging and endemic infections. In public health terms, our results further underline the importance of equitable distribution of up-to-date vaccines.


23
During the coronavirus disease 2019 (COVID-19) pandemic, the responsible pathogen, severe acute 24 respiratory syndrome coronavirus 2 (SARS-CoV-2), has continuously evolved. However, evolution 25 has by no means happened at an even pace, but rather through a pattern of steady diversification 26 punctuated by sudden large jumps involving dozens of point mutations. Indeed, it has been sug-27 gested that SARS-CoV-2 exhibits saltational evolution, a process where evolution proceeds by large 28 multimutational jumps, rather than gradually (Corey et al., 2021).  51 In this paper, we present a mathematical model aimed at capturing the particular punctuated 52 evolutionary pattern of SARS-CoV-2. Our goal is to recapitulate the main features of the temporal 53 Hamming distribution observed during the COVID-19 pandemic (see Figure 1A) as parsimoniously 54 as possible in a dynamical model. 55 We show that the overall pattern can be captured by combining epistasis with heterogeneous 56 within-host evolution. The model is sufficiently general that it does not make any assumptions 57 about the biological mechanism behind saltations. 58 The proposed model is conceptually related to the NK Model of Kauffman and Levin (1987) and 59 Kauffman and Weinberger (1989) in that it operates on the space of possible genotypes, with each 60 genotype corresponding to a preassigned fitness value. This is in contrast to phenotypic fitness In a recent study, Starr et al. (2022) showed by deep mutational scanning that epistasis -includ-74 ing sign epistasis -is an important important feature of SARS-CoV-2 evolution. As a concrete exam-75 ple, they show that the N501Y mutation (which is present in the Alpha, Beta and Omicron variants) 76 and Q498R exhibit sign epistasis. In this case, the presence of the N501Y substitution changes the 77 contribution of Q498R from deleterious to advantageous, as measured by angiotensin-converting 78 enzyme 2 (ACE2) receptor binding affinity. 79 In general, the fitness landscape of an organism is combinatorially large, and the number of 80 possible evolutionary paths from one genotype to another fitter one is, a priori, enormous. How-81 ever, in a seminal paper, Weinreich et al. (2006,2013) showed that only very few such paths are 82 in fact accessible. The interpretation of this finding in terms of fitness landscapes is that epistasis 83 or the ruggedness of the landscape is highly important for understanding evolutionary trajectories  92 On the basis of UK sequences (a particularly rich data set), we have computed a time-dependent 93 Hamming distribution for SARS-CoV-2, which is presented in Figure 1. Figure 1A shows the full 94 Hamming distance histogram as a function of time, from March 2020 to mid-2022, with the colour 95 and height indicating the frequency of observing genome pairs with a particular Hamming distance. 96 The peaks corresponding to saltational variant transitions are clearly visible as isolated 'islands' at 97 large Hamming distance. The insert in the same panel shows a 2D representation of the data.  In Figure 1C (left), a snapshot from June 1st 2021 shows three well-defined peaks. Each peak 106 corresponds to comparisons between pairs of genomes, with the members of each pair belonging 107 to either the Alpha or Delta variant. The peak corresponding to the highest Hamming distance 108 is of course that due to comparisons between the 'new' and 'old' variant, since these are furthest 109 from each other in a genomic sense. Similarly, the plot clearly shows that variation within the Delta 110 variant is, at that point in time, much lower than within the Alpha variant, since each of the Delta 111 variant genomes belong to a clade with a recent common ancestor. In the right half of Figure 1C, 112 the situation 46 days later is shown, once the Hamming distribution has collapsed to a single peak, 113 corresponding to the then-dominant Delta variant.

114
During the month of March, 2020, the Hamming distribution appears bimodal, but there are 115 no signs of saltation. This transient bimodality, present in the early pandemic, can most clearly 116 be seen in Video 1. This can be explained by the D614G substitution, which was associated with 117 a clade that dominated from around the end of March/beginning of April 2020 (Volz et al., 2021). 118 This early, saltation-free transition is reminiscent of a result by Kauffman and Levin (1987), who 119 suggested that adaptation on a rugged fitness landscape is associated with two separate time 120 scales. First, the pathogen searches its neighbourhood in the fitness landscape until it finds a local 121 maximum. This does not require saltation and happens rather rapidly. Then, on a slower time 122 scale, the pathogen may transition to new fitness peaks by saltation.

123
Due to the relatively high quality of SARS-CoV-2 genomic surveillance in the United Kingdom, 124 both in terms of the absolute number of publicly available sequences and per capita coverage, we 125 have based the bulk of our observations on UK sequence data. However, patterns similar to those 126 presented here can be observed in US data, the analysis of which is included in Appendix 1.

127
For each day in the included range (2020-03-01 to 2022-05-10) a 7-day window (consisting of the 128 indicated day and the 6 following days) was considered. All high-quality sequences obtained within 129 that window were pooled, and a distribution of Hamming distances was compiled by repeatedly 130 picking out random pairs from the sequence pool and comparing.

131
While the Hamming distance is a somewhat crude measure of the variance between circulating    . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 29, 2022. The plots of Figure 1 are based on the entire SARS-CoV-2 genome, meaning that a substitution 148 leading to an amino acid change in the spike protein (a major antigen) counts just as much as a 149 synonymous mutation elsewhere in the genome. In Figure 2, we probe to what extent the observed  3. The observed pattern is quite robust, being observed within the whole genome, in the amino 158 acid sequence as well as within the S-gene itself.

159
It is notable, however, that the S-gene does not undergo quite as much drift as the whole genome, 160 relatively speaking. That is to say, when the whole genome is considered, the Delta→Omicron jump 161 is associated with a peak that is approximately 5.5 times larger than the typical Hamming distances 162 in the weeks that preceded it, while the ratio is closer to 11 for the spike protein. We interpret this 163 to mean that, while the S-gene is subject to large saltations, it undergoes less drift than an average, 164 similarly sized section of the genome.

165
Mechanistic modelling captures the essential dynamics 166 Our goal is to capture the overall temporal pattern of diversity observed in Figure 1 in a mathe- . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 29, 2022.  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint tional evolution. The details of both of these elements can be found in the Methods and Materials  The model assumes the existence of a number of possible high-fitness genotypes, but that each 172 of them are 'screened' by epistasis. From a fitness landscape viewpoint, this can be thought of as a 173 landscape with a number of peaks, each of which is surrounded by a fitness trough or valley. The 174 extent of sign epistasis is then determined by the depth (and width) of these valleys.

175
To get from a moderate-fitness genotype to a local fitness peak, it is thus necessary to either tra-176 verse a region of low fitness, with its potential for extinction, or to somehow jump across that 177 valley.

178
Evolutionary models typically assume that the 'weak mutation limit' holds, meaning that the 179 probability of several mutations arising in one genome in one generation is negligible (Katsnel-   The pattern shown in Figure 3 is the typical outcome of a model simulation, but occasional co- . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 29, 2022.    . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 29, 2022.  and is no longer traversable by gradual evolution or a combination of the two.

247
As shown in Figure 5B, large saltations are necessary to overcome even moderate sign epistasis, 248 further explaining why the Hamming peaks seen in SARS-CoV-2 are so large.

249
Epidemic dynamics and spatial structure 250 In Figure 3, we made a number of simplifying assumptions, the major ones being constant preva-251 lence and absence of any spatial or population structure. 252 We first relax the former assumption by implementing susceptible-infected-recovered-susceptible  In Figure 6, a typical course of a simulation with SIRS dynamics, epistasis and saltation is shown.

258
As shown in panel A, the number of recovered (immune) individuals varies non-monotonically over 259 time, reflecting that individuals acquire immunity after being infected, and that the immunity even-260 10 of 23 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.   Since simulations are stochastic, this tendency is not necessarily clear from a single realization 269 such as Figure 6. A similar frequency dependence is likely to hold for SARS-CoV-2, since rare occur-270 rences in terms of within-host evolution are proportionally more likely to be observed with higher 271 incidence.

272
Next, we probe the impact of spatial separation on the diversity dynamics. Spatial structure is . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint where (relative) inter-population transmission rates are either 0, 10 −4 or 10 −3 (with intra-population 279 transmission rates ≈ 1). We find that with very low (< 10 −3 ) transmission rates between popu-280 lations, spatial structure leads to drawn-out transitions, but that this effect disappears as soon as 281 significant transmission between populations occur. This intuitively makes sense, since the within-282 population transmission rate will dominate as soon as even a few cases of a new variant have 283 spilled into a population.

284
It is worth noting that spatial structure in itself cannot lead to the saltational signature seen in 285 Figure 1. By this we mean that a pathogen which evolves gradually (i.e. obeys the weak mutation 286 limit) in multiple spatial patches will not lead to a sudden spike in the Hamming distance once spill- . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint tinguished between whether that advantage stemmed purely from higher infectiousness or from 329 some degree of immune escape. However, it is worth noting that the empirical pattern of punc-330 tuated evolution held for every major transition (Figure 1 Figures 1 and 3.

367
There are a few models in the literature that seek to address the connection between saltation,  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint of individual pathogens, but allow us to question how co-circulating pathogens affect the diversity 380 dynamics of one another.

382
Temporally resolved Hamming distributions from sequence data 383 In this section, we describe the data processing workflow which was used to generate the Hamming 384 distance plots of Figures 1 and 2. We have used the open, GenBank-derived dataset of aligned 385 SARS-CoV-2 sequences from Nextstrain (Hadfield et al., 2018). In our main analysis, we have used

397
It is then this function, ( ), that it plotted in Figure 1A. In practice, we have used the metadata pro- The mechanistic model developed for this study is a discrete-time branching model coupled to a 416 genotypic fitness landscape model.

417
In the simulations of Figures 3 and 4, we assumed a constant prevalence, for simplicity. This 418 amounts to keeping the mean effective reproductive number across the population at unity. In 419 Figure 6 we relax this assumption and explore a version of the model with epidemic dynamics. 420 We start by documenting the constant-prevalence version of the model, as well as the genotypic 421 fitness landscape element, before we go on to describe how we incorporate SIRS dynamics and 422 spatial structure. given genotype is then obtained by adding up the contributions for each of the epitopes: with the constraint that 0 ≥ 0. In practice this constraint is enforced by letting 15 of 23 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint ( ) which has support at = 0, then each peak in the fitness landscape will not be com-467 pletely surrounded by troughs. In other words, in that case it may be possible to evolve to a highly 468 fit variant through a series of single point mutations without suffering decreased fitness in the pro-469 cess.

470
Unless otherwise specified, we run our simulations with the parameter values given in Table 1 471 Incorporating SIRS dynamics 472 In the simulations of Figures 3 and 4 we assumed constant incidence, meaning that the number 473 of infected within any given generation was ( ) = 0 with 0 a constant (thus, prevalence was con-474 stant as well). However, to relax this assumption we incorporate susceptible-infected-recovered-475 susceptible (SIRS) dynamics. 476 We continue to model the infected individuals by a branching process, but now also keep track of 477 the number of susceptible ( ( )) and recovered ( ( )) individuals in each generation of the out-  Each recovered person has a constant probability rate for becoming susceptible once again. In 486 other words, this is modeled as a Poisson process with rate . Note that this not only corresponds 487 to waning of immunity, but also to any other mechanism by which a recovered individual may 488 become replaced by a susceptible one (such as population turnover). However, we will refer to . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint to reflect any particular value for SARS-CoV-2, but is rather used to illustrate the robustness of the 492 pattern of punctuated evolution to waning immunity. In the interest of simplicity, we have ignored 493 any seasonal effects on transmission. We consider this a reasonable simplification, both due to 494 the conceptual nature of our model and the understanding that susceptible dynamics rather than 495 seasonality is the major limiting factor in the pandemic phase (Baker et al., 2020). 496 Spatial structure 497 We implement a minimal model of spatial structure by incorporating a metapopulation element. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint This is of course a highly simplistic variation upon the base model, but it serves to show that pro-532 longed infection or introduction of (mutated versions of) previous variants can account for absolute 533 Hamming distance sometimes decreasing at variant transitions.     CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint One possible explanation for such a decrease in absolute Hamming distance is that prolonged infections with an earlier variant have occurred in some individuals or a population, albeit without accompanying accelerated evolution. Once the unobserved lineage spills into 22 of 23 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint the sampled population, it may lead to a variant transition despite being closer to the ancestral variant than to the currently dominating one. In terms of our model, this kind of dynamics can be quite simply incorporated. While it is not impossible to see decreases in absolute Hamming distance in the simple model formulation behind Figure 3, it is highly statistically unlikely, since each saltation happens on the basis of the genomes present in the previous generation. The very simplest way to incorporate the possibility of decreasing absolute Hamming distance is to assume that a certain fraction of the population are initially infected with the ancestral strain, but that their infections are prolonged (and that they thus only transmit the disease much later). If the pathogen undergoes mutation within these hosts, it is possible to obtain a pattern such as the one shown in Appendix 2 Figure 2B. As evidenced by panel A of that same figure, the pairwise 'relative' Hamming distance between genomes in the same generation is not qualitatively affected by this addition to the model. Transitions are still driven by large saltations, and the (relative) Hamming distance is characterized by periods of linear growth punctuated by large increases and subsequent collapses in diversity. The technical details of this addition to the model can be found in the Materials and Methods section. While allowing for persistent infections with the initial variant is of course a very simple variation on the base model, it does show that prolonged infections with previous variants can account for occasional sudden decreases in the absolute Hamming distance.    . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

A B A Absolute Relative
The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint H3N2 (worldwide) Hemagglutinin H1N1 (worldwide) Hemagglutinin Figure 1-Figure supplement 2. Hamming distributions for influenza H3N2 and H1N1. With influenza, the amount of genomic surveillance data is much more limited and the temporal Hamming distributions are much less well-defined. In order to have enough data for each time point, a sampling window of 30 days was used here, as opposed to the 7 days used for SARS-CoV-2 in the main text.

721
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The fitness landscape and epistasis components of the model. The majority of the fitness landscape is assumed neutral. In the case of gradual evolution devoid of saltation (top), the pathogen performs a random walk in this neutral space until it hits upon a deleterious configuration. As a model of sign epistasis, beneficial configurations are surrounded by deleterious ones. In the case of gradual evolution, the deleterious regions are unlikely to be traversed before the lineage dies out. However, in the case of saltational evolution (bottom), several point mutations may occasionally happen in the same genome within the same generation, leading to a jump which can enable the pathogen to bypass a deleterious region. Note that this is only a 1-dimensional conceptual representation of a highly multidimensional fitness landscape. B) In each generation of the branching model, each individual stochastically infects new individuals. Upon infection, the pathogen genome (depicted as a string of black and white squares) is transferred. Occasionally a point mutation will occur, as indicated in the lower right genome. In the case of saltation (see panel A), multiple such point mutations can occur within the same genome in the same generation.

722
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. When saltational evolution is allowed, but epistasis is absent or very weak, a mixture of qualitatively different transitions occur. Some resemble the diversity spikes seen in Figure 3, but more commonly transitions will involve a gradual, linear increase in diversity followed by a collapse, as seen in Figure 4. A) Time evolution of the Hamming distance distribution. For each generation indicated on the vertical axis, the colour encodes the histogram of Hamming distances between genomes within that generation. B) Time evolution of the mean and median Hamming distance between genomes present in any given generation of the model simulation. In these simulations, = 0 (no epistasis) while saltations were of typical size 1 = 150.

724
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 29, 2022. ; https://doi.org/10.1101/2022.07.27.22278129 doi: medRxiv preprint