Meta-Analysis of the Dynamics of the Emergence of Mutations and Variants of SARS-CoV-2

The novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) emerged in late December 2019 in Wuhan, China, and is the causative agent for the worldwide COVID-19 pandemic. SARS-CoV-2 is a 29,811 nucleotides positive-sense single-stranded RNA virus belonging to the betacoronavirus genus. Due to inefficient proofreading ability of the viral RNA-dependent polymerase complex, coronaviruses are known to acquire new mutations following replication, which constitutes one of the main factors driving the evolution of its genome and the emergence of new genetic variants. In the last few months, the identification of new B.1.1.7 (UK), B.1.351 (South Africa) and P.1 (Brazil) variants of concern (VOC) highlighted the importance of tracking the emergence of mutations in the SARS-CoV-2 genome and their impact on transmissibility, infectivity, and neutralizing antibody escape capabilities. These VOC demonstrate increased transmissibility and antibody escape, and reduce current vaccine efficacy. Here we analyzed the appearance and prevalence trajectory of mutations that appeared in all SARS-CoV-2 genes from December 2019 to January 2021. Our goals were to identify which modifications are the most frequent, study the dynamics of their spread, their incorporation into the consensus sequence, and their impact on virus biology. We also analyzed the structural properties of the spike glycoprotein of the B.1.1.7, B.1.351 and P.1 variants. This study offers an integrative view of the emergence, disappearance, and consensus sequence integration of successful mutations that constitute new SARS-CoV-2 variants and their impact on neutralizing antibody therapeutics and vaccines.


INTRODUCTION 56
In late December 2019, a new betacoronavirus known as Severe Acute Respiratory Syndrome 57 Coronavirus 2 (SARS-COV-2) emerged in Wuhan, the province of Hubei, China (1). SARS-CoV-2 is the 58 etiological agent for the worldwide COVID-19 pandemic resulting in more than 90 million infected and 2 59 million death worldwide as of Dec. 2020 (2,3). SARS-CoV-2 is an enveloped, positive-sense single-60 stranded RNA (+ssRNA) virus with a genome length of 26,000 to 32,000 bps (4). The mutation rates of 61 RNA viruses are generally higher than that of DNA viruses because of the low fidelity of their viral RNA immune system leads to the selection and evolution of viral genomes (6,7). However, coronaviruses are 67 one of the few members of the RNA virus family that possess limited but measurable proofreading ability 68 via the 3' to 5' exoribonuclease activity of the non-structural viral protein 14 (nsp14) (8,9). Coronaviruses 69 are therefore expected to evolve through genetic drift much slower than other RNA viruses that do not 70 have this ability, such as influenza and HIV (8,10). Additionally, SARS-CoV-2 and other coronaviruses 71 have low known occurrences of recombination between family members (i.e., genetic shift), and therefore 72 are mostly susceptible to genetic drift (11). 73 SARS-CoV-2 has reached pandemic status due to its presence on every continent and has since 74 maintained a high level of transmissibility across hosts of varied ethnical and genetic backgrounds (2, 12). 75 Moreover, SARS-CoV-2 infections have been reported to naturally infect minks, ferrets, cats, and dogs, 76 which allows the virus to replicate in completely new hosts and mutate to produce new variants and 77 possibly new strains (13,14). In March 2020, the now dominant D614G mutation first emerged in the 78 spike protein (S) of SARS-CoV-2 . The S protein is present as a trimer at the surface of the viral envelope 79 and is responsible for attachment of the virus to the human angiotensin converting enzyme 2 (hACE2), 80 the entry receptor for SARS-CoV-2 (15). Published evidence has now shown that D614G increases viral 81 fitness, transmissibility and viral load but does not directly affect COVID-19 pathogenicity (16,17,18,19). 82 Additionally, emerging evidence indicates that D614G may have epistatic interactions and exacerbate the 83 impact of several other independent mutations (19). Mutations in the S protein, and particularly in 84 receptor binding domain (RBD), are of very high concern given that they can direct influence viral 85 infectivity, transmissibility and resistance to neutralizing antibodies and T cell responses. 86 New mutations are frequently and regularly detected in the genome of SARS-CoV-2 through whole 87 genome sequencing, however very few of these mutations make it into the viral consensus sequence. The 88 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 8, 2021. ; https://doi.org/10.1101/2021.03.06.21252994 doi: medRxiv preprint consensus sequence or reference strain is generally regarded as the dominant transmitted strain at a given 89 time. This sequence is determined by aligning large numbers of recently sequenced genomes and 90 establishing a consensus of the highest frequency nucleotide for each position in the viral genome. A 91 genetic variant is a version of the reference strain that has acquire one or several mutations and acts as the 92 founder for further genetic diversification. Mutations arise regularly in the reference strain, but few are 93 longitudinally conserved. Genetic variants are therefore successful offshoots of the reference strain. 94 Some variants rise rapidly in frequency and then collapse and disappear, others will rise and overtake the 95 42% of specimens in the Amazonian city of Manaus and was detected for the first time in the U.S. at the 106 end of January 2021 (landed in Japan in January 2021) (23). All these variants possess the N501Y 107 mutation, which is a mutation in the RBD that is critical for the spike to interact with hACE2 (25). This 108 mutation is reported to cause increased resistance to Nabs, increased infectivity, and virulence in animal 109 models (26). In addition to the N501Y mutation, both the South African and Brazil variants possesses 110 RBD mutations K417N and E484K, which are also associated with increased Nabs escape capabilities 111 (24). 112 Here we present a retrospective metadata analysis study of mutations reaching higher than 1% global 113 frequency occurring throughout the SARS-CoV-2 genome over the past year and we specifically 114 interactions between the S protein and hACE2 and their potential impact on Nabs. 118 119 120 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

Data collection and mutational analysis 122
Genomes uploaded to the GISAID EpiCoV TM server database were analyzed from December 1 st , 2019, to 123 December 31 st , 2020, with collection of viral sequences from December 1 st , 2019 to January 6, 2021. The 124 mutations were selected by being in more than 500 reported genomes in August 2020, and another selection 125 was made in January 2021 to capture mutations with more than 4000 reported genomes.  CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

Identification of emerging mutations in various SARS-CoV-2 genes. Emerging mutations in the 148
SARS-CoV-2 genome were investigated to determine the fluctuations of these mutations during a period 149 of twelve months. We compiled mutations reported in 500> genomes in August, and 4000> genomes in 150 early January from the GISAID database, and followed their appearance in reported SARS-CoV-2 151 genomes until December 31th , 2020. Genes NSP8, NSP10, NS6, NS7a, and E are not illustrated in 152 Figure 1 given that did not display mutation frequencies sufficiently high to meet our inclusion 153 conditions during the periods of data collection of our study. This is also indicative that these appear to be 154 the most conserved sequences of the SARS-CoV-2 genome. Our analysis focused on the emergence, 155 fixation, and fading of the mutations analyzed that met our inclusion criteria. Our analysis highlights the 156 fixation of the D614G mutation in the S protein and the P323L in the RdRp (Fig.1). Both are the only 157 mutations to have been successful to reach reference strain status until now. They appeared to have 158 emerged simultaneously in January, 2020 and became present in >90% of all sequenced genomes by 159 June, 2020. However, some other mutations emerged rapidly and then stabilized or faded out. For 160 example, Q57H (NS3), R203K (N), G204R (N) are mutations that emerged rapidly and appeared to have 161 stabilized at a frequency of 15% to 40%. Others like I120F (NSP2), L37F (NSP6), S477N (S), and L84S 162 (NS8) illustrate mutations that emerged rapidly and then faded-out. We also demonstrate that most genes 163 in SARS-CoV-2 have mutations with frequencies lower than 10% (Fig.1). Also, these mutations are 164 summarized into Table 1, which illustrates nucleotide substitution producing the amino acid change, the 165 frequency at the end of 2020 and their respective effects. ( Table 1). These results indicate that only 166 D614G (S) and P323L (NSP12) were fixed in the viral consensus sequence, many of the mutations either 167 faded out or emerged and then stabilized at a frequency lower than 50%. 168 Geographic localization and timeline of the viral genes with mutations higher than 50% frequency. 169 To further study the mutations that have frequencies higher than 50%, we took the graphs of NSP12, S, 170 and N genes from Figure 1 and added worldwide geographic locations of the mutations provided from 171 GISAID. These maps are useful to track down if a mutation is a localized and regional event or found 172 worldwide. In the S protein, D614G is found worldwide with higher reported cases in the US, UK, and 173 Australia, probably due to more large-scale testing, while A222V is mostly reported in the UK, but has 174 not been reported in South America, Central and East African regions ( Fig. 2A). Like D614G, P323L in 175 the RdRp is also found worldwide, with higher reported cases in the US, UK, and Australia (Fig. 2B). 176 The N gene doesn't have successful mutations that have attained reference strain status, but R203K, 177 G204R and A220V all had a frequency of 50% or higher during the past twelve months. Even if R203K 178 and G204R frequencies have decreasing since July of 2020 and stabilized in November of 2020, both 179 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 8, 2021. ; https://doi.org/10.1101/2021.03.06.21252994 doi: medRxiv preprint mutations are currently found worldwide. A220V emerged in August of 2020 and reached a frequency 180 higher than 50% in October of 2020. Mutations G204R, R203K and A220V are reported at high 181 frequency in the UK, but have not been detected in South America, Central, East, and South African 182 regions (Fig. 2C). The analyses of these data illustrate the localization of the most prevalent mutations to 183 date, which appear to be mostly present in Western and developed countries. This, however, is 184 undoubtable attributable to overall more testing. interactions with their environment. The A222V mutation has no reported effects on protein stability, 190 neutralizing antibody escape, and affinity for hACE2 ( Table 1). The substitution from A to V results in a 191 low steric clash between neighboring residues (Fig. 3B). The S477N substitution in the RBD enables 192 increased stability during hACE2-RBD interactions ( Table 1, Fig. 3C). In the N-terminal domain (NTD), 193 mutagenesis of L18F leads to a steric clash between neighboring residues (Fig. 3D). However, this does 194 not appear to impact the stability of the protein given that no such effects have been reported for L18F so 195 far ( Table 1). In the closed conformation, D614 makes an ionic bond with K854 in the S2 subunit of 196 another S protein monomer ( Fig. 3F) (28). In the open conformation, D614 (or G614) doesn't make 197 interactions or display steric clashes with neighboring residues (Fig. 3E). October of 2020 and reached a frequency of 68% to 72% mid February (Fig. 4, Fig. 5C, Table 2). 206 N501Y is found in the RBD and can interact with a lysine residue in hACE2 ( Fig. 5B & 5D, Fig. 8C). 207 The N501Y mutation is associated with an increased affinity to hACE2, along with an increase in 208 infectivity and virulence ( Table 2). Figure  ORF1ab, and T26801C in M protein are synonymous mutations. Also, the C27972T mutation has a 211 frequency of 71% and produces a premature stop codon in NS8 that inactivates the protein (Q27stop) 212 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. N501Y, which are also seen in the B.1.1.7 variant (Fig. 5A). Furthermore, many of the mutations found 225 in the B.1.351 variant have global frequencies lower than 2%. The exception is D614G, L18F and N501Y 226 (Fig. 6C, Table 3). In comparison to B.1.1.7 and P.1 variants, B.1.351 variant has not reached a 227 frequency higher than 1% as of mid February 2021 (Fig. 4). By using the mutagenesis tool of PyMOL, 228 we modeled the S protein in complex with hACE2, and with C103 and C121 Nabs, which are human 229 recombinant class I & II neutralizing antibodies, respectively(32). Our in silico mutagenesis predicts that 230 mutations in the RBD induce a loss of interactions with C102 Nab and C121 Nab. At the position 417 in 231 the S protein, a loss of interaction is predicted between the RBD and C102 Nab when the K417 is mutated 232 to Asn producing Nabs escape capability. (Fig. 6D & 8A). Another loss of interaction is predicted with 233 the E484K mutation and the C121 Nab, which could lead to Nabs escape capability ( Fig. 6E & 8B). As 234 previously mentioned, N501Y is also found in the B.1.351 strain and our modeling predicts that it will 235 have similar effects as those observed with the B.1.1.7 variant (Fig. 6F). Figure 6G (Fig. 7). 241 The P.1 variant harbours substitutions L18F, T20N, P26S, D138Y, and R190S in the NTD of the S 242 protein. H655Y and T1027I are in the subdomain 2 (SD2) and S2 subunit, respectively ( Fig. 7A & 7B, 243 Table 4). Overall, P.1-specific mutations have worldwide frequencies less than 2% (Fig. 7C, Table 4). 244 D614G, L18F and N501Y are not specific to P1. Similar to B.1.351, P.1 variant has low global frequency 245 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 8, 2021. ; https://doi.org/10.1101/2021.03.06.21252994 doi: medRxiv preprint with less than 1% of global variant frequency as of mid February 2021 (Fig. 4). We then modeled 246 mutations to investigate interaction alterations with known recombinant neutralizing antibodies (32). The 247 K417T mutation reduces interactions with neighboring residues in the C102 Nab and therefore we 248 predict, as with K417N in B.1.1.7, an increased ability to escape neutralization (Fig. 7D). Also, the P.1 249 variant has E484K, and N501Y mutations in the RBD. We predict they will have the same effect reported 250 for the B.1.1.7, and B.1.351 variants (Fig. 7E & 7F). In Figure 7G, we illustrate the synonymous, non-251 synonymous, and deletions in SARS-CoV-2 P.1 genome (Table 4).  (19). The study shows that D614G alone increases infectivity, but in combination with different other 280 mutations in the S protein, these can either increase or decrease viral infectivity. Similar findings have 281 been reported regarding sensitivity to Nabs. D614G alone has undetectable effects on Nabs escape. 282 However, the combination of D614G with other mutations in S can enable Nabs escape. This data 283 suggests that the continuous emergence of epistatic mutations in SARS-CoV-2 will likely be involved in 284 further altering properties of the virus, including transmissibility, pathogenicity, stability and Nab 285 resistance. 286 Our analyses have highlighted that several of the successful mutations analyzed had frequency trajectories 287 that eventually plunged or stabilized at low frequencies. Only the D614G in the S protein and the P323L 288 in the RdRp maintained their presence in the consensus sequence ( Fig. 1 & 2). Analyses of the GISAID 289 database also reveal which countries upload the most sequences to the database and are therefore carrying 290 out the most testing and sequencing. The global frequency of mutations and variants in the database is 291 therefore biased to represent the genetic landscape of the countries doing the most testing. Emergence of 292 new variants may therefore go undetected until they leave their point of origin and enter countries with 293 high testing and sequencing rates. This delayed notification constitutes a major obstacle in preventing the 294 spread of nefarious variants that are potentially resistant to current vaccines and neutralizing antibody 295 therapy. 296 In conclusion, our metadata analysis of emerging mutations has highlighted the natural upward and 297 downward fluctuation in mutation prevalence. We also illustrate how mutations sometimes need to co-298 emerge in order to create a favorable outcome for virus propagation. Tracking mutations and the 299 evolution of the SARS-CoV-2 genome is critical for the development and deployment of effective 300 treatments and vaccines. Thus, it is the responsibility of all countries and governing jurisdictions to 301 increase testing and sequencing and upload SARS-CoV-2 genomes to databases in real-time. On then will 302 we have the most accurate information to inform policy and decisions makers about interventions 303 required to blunt the global transmission of the virus and ensure that our tools remain effective against the 304 all circulating variants. 305 306 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

ACKNOWLEDGEMENTS 307
The authors wish to thank Sean Li for helpful comments on our manuscript. M.-A.L. holds a Canada 308

Research Chair in Molecular Virology and Intrinsic Immunity. This study was supported by a COVID-19 309
Rapid Response grant to M-A Langlois by the Canadian Institute of Health Research (CIHR) and by a 310 grant supplement by the Canadian Immunity Task Force (CITF). 311

CONFLICTS OF INTERESTS 313
The authors declare no competing interests. 314 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.       . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 8, 2021. ; https://doi.org/10.1101/2021.03.06.21252994 doi: medRxiv preprint     when the mutations are inserted into the structure. Graphs were generated using PyMOL. 539 Figure 4: Frequency of B.1.1.7, B.1351, P.1, D614G, and   interactions between residues. The graphs were generated using PyMOL. 578 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 8, 2021. ; https://doi.org/10.1101/2021.03.06.21252994 doi: medRxiv preprint Figure 5 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 8, 2021. ; https://doi.org/10.1101/2021.03.06.21252994 doi: medRxiv preprint Figure 6 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 8, 2021. ; https://doi.org/10.1101/2021.03.06.21252994 doi: medRxiv preprint Figure 8 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 8, 2021. ; https://doi.org/10.1101/2021.03.06.21252994 doi: medRxiv preprint