Antigenic minimalism of SARS-CoV-2 is linked to surges in COVID-19 community transmission and vaccine breakthrough infections

The raging COVID-19 pandemic in India and reports of vaccine breakthrough infections globally have raised alarm, mandating the characterization of the immuno-evasive features of SARS-CoV-2. Here, we systematically analyzed over 1.3 million SARS-CoV-2 genomes from 178 countries and performed whole-genome viral sequencing from 53 COVID-19 patients, including 20 vaccine breakthrough infections. We identified 116 Spike protein mutations that increased in prevalence during at least one surge in SARS-CoV-2 test positivity in any country over a three-month window. Deletions in the Spike protein N-terminal domain (NTD) are highly enriched for these surge-associated mutations (Odds Ratio = 18.2, 95% CI: 7.53-48.7; p = 1.465 x 10^-18). In the recent COVID-19 surge in India, an NTD deletion (ΔF157/R158) increased over 10-fold in prevalence from February 2021 (1.1%) to April 2021 (15%). During the recent surge in Chile, an NTD deletion (Δ246-253) increased rapidly over 30-fold in prevalence from January 2021 (0.86%) to April 2021 (33%). Strikingly, these simultaneously emerging deletions associated with surges in different parts of the world both occur at an antigenic supersite that is targeted by neutralizing antibodies. Finally, we generated clinically annotated SARS-CoV-2 whole genome sequences and identified deletions within this NTD antigenic supersite in a patient with vaccine breakthrough infection (Δ156-164) and other deletions from unvaccinated severe COVID-19 patients that could represent emerging deletion-prone regions. Overall, the expanding repertoire of NTD deletions throughout the pandemic and their association with case surges and vaccine breakthrough infections point to antigenic minimalism as an emerging evolutionary strategy for SARS-CoV-2 to evade immune responses.
This study highlights the urgent need to sequence viral genomes at a larger scale globally and to mandate that sequences are deposited with more granular and transparent clinical annotations to ensure that therapeutic development keeps pace with the evolution of SARS-CoV-2.


Introduction
The ongoing COVID-19 pandemic has infected around 160 million people and killed more than 3 million people worldwide, as of May 2021 1 . The continual emergence of SARS-CoV-2 variants with increased transmissibility and capacity for immune escape, such as B.1.1.7 ("UK variant") and P.1 ("Brazilian variant"), threatens to prolong the pandemic through devastating outbreaks such as the one currently being witnessed in India 2 . While multiple vaccines have demonstrated high effectiveness in clinical trials and real world studies [3][4][5] , there have been reports of "vaccine breakthrough infections" with SARS-CoV-2 variants 6,7 . A recent study described two such cases in New York, at least one of which occurred despite confirmation of a robust neutralizing antibody response. Variant classification schemes have been developed by the US Centers for Disease Control and Prevention (CDC) 8 and the World Health Organisation (WHO) 9 based on factors such as prevalence, evidence of transmissibility and disease severity, and ability to be neutralized by existing therapeutics or sera from vaccinated patients. Early and rapid detection of these emerging Variants of Concern/Interest is imperative to combat and contain the ongoing pandemic and future outbreaks.
Mapping the mutational landscape of SARS-CoV-2 in the context of natural and vaccine-induced immune responses is critical to understand the virus's molecular strategies for immune evasion. To this end, neutralizing antibodies which target the receptor-binding domain (RBD) or the N-terminal domain (NTD) of the Spike protein have been isolated from the sera of COVID-19 patients [10][11][12] . Recent studies contemporaneously found that several neutralizing antibodies target a single antigenic supersite in the NTD of the Spike protein 13,14 . The NTD is also a hotspot for in-frame deletions in the SARS-CoV-2 genome, with four recurrent deletion regions (RDRs) identified 15 . Several such deletions have been experimentally demonstrated to reduce neutralization by NTD-targeting neutralizing antibodies 13,15 . Whether additional deletions are emerging in SARS-CoV-2 variants that drive case surges or vaccine breakthrough infections needs to be determined.
Concerted global data sharing efforts during the pandemic have led to the rapid development of large-scale genomic and epidemiological COVID-19 resources. Over 1.3 million SARS-CoV-2 genomes from 178 countries have been deposited throughout the pandemic in the GISAID database (Figure 1). In addition, we performed whole-genome sequencing of SARS-CoV-2 from patients at the Mayo Clinic who had recently developed SARS-CoV-2 infections or post-vaccination "breakthrough" infections. On the epidemiology front, population-level metrics including SARS-CoV-2 test positivity rates and mortality rates are being collected from 219 countries in databases such as OWID 16 . Such unprecedented availability of genomic-epidemiology data combined with clinical genomic data provides a timely opportunity to systematically characterize the immuno-evasive features of SARS-CoV-2.
In this study, we uncover that deletion mutations in the Spike protein have a high likelihood of being associated with surges in community transmission. Further, we identify rapidly emerging surge-associated deletion mutations in India and Chile that map to an antigenic supersite in the NTD. Based on a global analysis of deletions we also highlight that the deletion-prone regions of the NTD are expanding during the course of the pandemic as part of an evolutionary strategy of "antigenic minimalism" to evade immune responses. Finally, using whole genome sequencing, we also identify deletion mutations in SARS-CoV-2 from COVID-19 patients with infection/vaccine-breakthrough infections, also mapping near the antibody-binding site and thus representing candidates for vaccine escape mutations.

Deletions are enriched for association with surges in community transmission of SARS-CoV-2
Analysis of 1,313,962 SARS-CoV-2 genome sequences obtained from the GISAID database 17 (Figure 1a) revealed the presence of 750 amino acid mutations (missense and indels) in the Spike protein. This list included only mutations that were observed in at least 100 SARS-CoV-2 genome sequences, in order to exclude random occurrences from sequencing errors. The list of 750 mutations included 718 substitutions (95.7%), 30 deletions (4%), and 2 insertions (0.3%). To identify the mutations associated with surges in the community transmission of COVID-19 ("surge-associated mutations"), we shortlisted mutations that increased in prevalence monotonically during periods of monotonically increasing test positivity (Figure 1b). We identified 116 mutations that increased in prevalence during one or more surges in test positivity rate, in any country, over a three-month time interval. This approach recapitulated 45 out of 56 (80%) mutations known to be present in the CDC variants of interest or concern, including D614G, E484K, N501Y, P681H, P681R, ΔH69/V70, and ΔY144 (Figure S1).
Further, we investigated whether any class of mutations (substitutions, insertions, or deletions) is enriched for association with surges. Interestingly, 22 of 30 (73%) deletions were surge-associated, as compared to 94 of 718 (13%) substitutions, and 1 of 2 (50%) insertions (Figure 1c). These data indicate that deletions, but not substitutions or insertions, are highly enriched for association with surges (Chi-square Test p-value = 1.465 x 10^-18; Odds Ratio = 18.2, 95% CI: 7.5-48.7; Figure 1c). The surge-associated deletions in the Spike protein occur exclusively in the N-terminal domain (NTD), which is interesting in light of the recently identified recurrent deletion regions (RDRs) in the NTD 15 . This raises the possibility that SARS-CoV-2 may be acquiring deletion mutations to evolve new variants that drive the surges in community transmission of COVID-19.

Surging SARS-CoV-2 variants in India and Chile have acquired new deletions in an antigenic supersite to evade neutralizing antibodies
Recently there have been massive surges of COVID-19 cases in a few countries around the world, but most prominently in India 18 and Chile 19,20 . In order to identify the mutations that are associated with these surges, we systematically analyzed the mutations that increased monotonically in prevalence correlated with the monotonic increase in test positivity between February and April 2021 (Figure 2a and Figure 3a).
In India, 13 Spike protein mutations were correlated with the massive surge ("second wave of infections") in April 2021 (Table S1). The most prevalent mutations included D614G, L452R, P681R, G142D and Q1071H (Figure 2b). Interestingly, there is also a rapidly emerging deletion in the NTD (ΔF157/R158) that has increased 13.6-fold in prevalence from 1.1% (of 1254 sequences) in February 2021 to 15% (of 367 sequences) in April 2021. F157 and R158 reside in the antigenic supersite, which is recognized by a number of NTD-targeting neutralizing antibodies 14,21 (Figure 2b). Importantly, this deletion had not been identified at the time of the prior characterization of Spike protein deletions 15 , and thus we suggest that ΔF157/R158 represents a novel, distinct fifth recurrent deletion region.
In Chile, 36 Spike protein mutations are correlated with the surge in April 2021 (Figure 3b, Table S2). Clustering these mutations by co-occurrences shows the emergence of a new variant characterized by a novel in-frame deletion resulting in the loss of residues 246-253 and a D253G substitution (Δ246-253), which has increased in prevalence by over 30-fold from January to April 2021 (0.86% to 33.0%) (Figure 3b). This new variant in Chile is independent of two other concurrently circulating variants: P.2 (first identified in Brazil) and B.1.1.7 (first identified in the UK) (Figure 3b). Interestingly, residues 246-253 are part of the "supersite loop" in the same NTD antigenic supersite that includes F157 and R158 (Figure 3b) 14,21 , suggesting that their deletion could help evade NTD-targeting neutralizing antibodies. Indeed, mutations in this region (at P251, G252, and D253) were also found in neutralization escape mutants in vitro 13 .
Taken together, this analysis highlights two simultaneously surging strains in different parts of the world that have both acquired deletions in the NTD antigenic supersite (ΔF157/R158 in India, Δ246-253 in Chile) that is highly targeted by neutralizing antibodies. Indeed, deletions in the NTD domain have been previously shown to diminish the binding of neutralizing antibodies 13,15 . This suggests that the surging SARS-CoV-2 variants in India and Chile may have acquired NTD deletions in the antigenic supersite in order to evade neutralizing antibodies and achieve immune escape. From a viral evolution standpoint, these observations raise the question of whether SARS-CoV-2 is expanding its repertoire of deletable regions in the Spike protein as the pandemic progresses.

Recurrent deletion regions in the Spike protein emerge and expand over the course of the pandemic
In order to understand whether the deletable regions in the Spike protein are increasing, we examined the distribution of deletion frequencies for all amino acids in the Spike protein sequence from over 1.3 million sequences over the course of the pandemic (Figure 4a,b; see Methods). This analysis includes roughly 9-fold the number of sequences compared to the previous analysis of recurrent deletion regions that was based on 146,795 sequences as of October 24, 2020 (Figure 4a; see Methods).
In addition to ΔF157/R158 (the new surge-associated deletion in India; Figure 2), we observed that residues 14-18 (QCVNL) are deleted more frequently than expected based on the background distribution (see Methods). Interestingly, these residues are part of the N-terminal region of the same antigenic supersite (Figure 4a), and mutations in this region (at C15 and L18) were common among neutralization escape mutants 13,14 . Furthermore, this deletion region has emerged recently: most viral genomes containing one or more deletions in this region were deposited after October 2020 15 . We also identified potential RDRs at residues S640/N641 and 675-681 (QTQTNSP), the latter of which directly precedes the Spike protein furin-cleavage site that we and others have described previously [22][23][24][25] . It is notable that these are the only RDRs observed to date that are outside of the NTD (and thus outside of the antigenic supersite), and their functional significance warrants follow-up.
medRxiv preprint doi: https://doi.org/10.1101/2021.05.23.21257668; this version posted May 31, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
We also recognized that RDRs appear to have the capacity to expand (i.e., to involve more flanking amino acids) over time. For example, the Δ246-253 deletion in one of the surge-associated Chile variants can be viewed as an expansion of the previously defined recurrent deletion region RDR4 (Δ242-248) 15 (Figure 5b). This expansion is likely of high immunologic relevance, as not only do these residues map to the supersite loop, but mutations in this region (at P251, G252, and D253) were also found in neutralization escape mutants in vitro 13 . Further biochemical experiments can determine whether these previously uncharacterized NTD deletions (residues 14-18, 156-159, and 249-253) impact the binding of NTD-targeted neutralizing antibodies or the capacity of sera from vaccinated individuals to neutralize the virus.
Taken together, our analysis highlights both the emergence of novel deletion regions and the expansion of previously defined deletion regions over the past several months. These regions should be monitored as candidates for emerging RDRs in the coming months.

Whole-genome sequencing of SARS-CoV-2 genomes from COVID-19 patients with vaccine breakthrough reveals the presence of distinct deletions in the N-terminal domain
The genomic-epidemiology analysis presented above based on publicly accessible data has shown that SARS-CoV-2 appears to be acquiring deletion mutations to evade neutralizing antibodies and the deletable regions are expanding over the course of the pandemic. However, the genome sequences deposited in publicly accessible databases (e.g. GISAID) lack any clinical or phenotypic data such as vaccination status or disease severity of the linked COVID-19 patients. To address this, we performed whole genome viral sequencing from 53 COVID-19 patients in the Mayo Clinic health system, for whom we have complete longitudinal health records and vaccination history. Of these, 20 cases were vaccine breakthrough infections, with the infected individuals having been fully vaccinated at the time of their positive SARS-CoV-2 test. In total, we identified 92 unique mutations, of which 28 are deletions (Figure 5a). All observed Spike protein deletions in this cohort occurred in the NTD, with Δ144 and ΔH69/V70 showing the highest prevalence (64% and 62%, respectively).
Interestingly, we identified four Spike protein variants harboring one or more deletions that warranted follow up (Figure 5b). Two of these novel deletions were isolated from previously vaccinated individuals. Whether the deletions were already present at the times of infection or evolved within these individuals under the pressure of vaccine-induced immunity is not known.
One patient who had received two doses of BNT162b2 in January 2021 was subsequently infected in April. The virus recovered from this patient contained a Δ156-164 deletion (Figure 5b). Given the prominent ΔF157/R158 deletion in India, this corroborates our prior observation that recurrent deletion regions can expand and suggests that ΔF157/R158 may actually be part of a larger deletion-prone region. More importantly, the possibility that this variant with a large contiguous deletion within the antigenic supersite was able to infect a fully vaccinated individual mandates further characterization of the potential immuno-evasive effects of deletions in this region.
In another breakthrough infection, a patient who had received the J&J vaccine (Ad26.COV2.S) in early April 2021 subsequently tested positive for SARS-CoV-2 by the end of the month. The recovered viral genome harbored a ΔD88 deletion in addition to ΔH69/V70 and ΔY144 (Figure 5b). This ΔD88 deletion is quite rare in GISAID, having been detected in only five of the 1.3 million total sequences deposited to date. That said, there was one unvaccinated individual in this Mayo Clinic cohort who experienced severe COVID-19 with a virus containing a Δ85-90 deletion (Figure 5b). There are three sequences in the GISAID database containing this full stretch deletion, two of which are from Germany and one from Texas (Table S3). While still too rare to infer its future trajectory, deletions in this region should be epidemiologically monitored in the coming months, particularly to determine whether they impact vaccine effectiveness.
Finally, the viral genome recovered from another severe COVID-19 case in an unvaccinated individual contained a Δ167-174 deletion (Figure 4b). While these residues are not within the antigenic supersite itself, they do reside in a loop proximal to it, the deletion of which would almost certainly impact the three-dimensional structure of the supersite (Figure 4c). Interestingly, residues 168-172 appear to be a potential emerging deletion hotspot in the GISAID dataset, only narrowly missing classification as an RDR based on our defined criteria (see Methods). Thus, this is a candidate region to emerge as a bona fide RDR in the near future.
This real-time genomic surveillance of SARS-CoV-2 genomes from COVID-19 patients with linked phenotypic and clinical annotations greatly complements our analysis of the publicly accessible unannotated sequences. Specifically, we have identified deletions within the NTD antigenic supersite which are associated with vaccine breakthrough infections (e.g., Δ156-164), and we have identified other deletions in SARS-CoV-2 isolated from unvaccinated severe COVID-19 patients that could represent emerging recurrent deletion regions (e.g., Δ167-174). Taken in the context of previous structural and experimental studies, these deletion patterns seem to represent a historical footprint and future trajectory of the antigenic minimalism strategy employed by SARS-CoV-2 to evade natural and vaccine-induced immunity.

Discussion
The worldwide mass vaccination campaign has had a profound impact on COVID-19 transmission. However, certain variants are less susceptible to neutralization by sera from vaccinated individuals and convalescent COVID-19 patients 26,27 . Such findings motivate the need to vigilantly track the emergence of new variants and to determine whether they are likely to cause surges or vaccine breakthrough infections. Here, through an integrated analysis of genomic and epidemiologic data, we found that (i) deletions are strongly associated with case surges, (ii) deletions in the Spike protein NTD map to an antigenic supersite, (iii) the repertoire of deletions in the Spike protein is expanding over the course of the pandemic, and (iv) deletions are present in a subset of vaccine breakthrough variants. That said, deletion mutations are not operating independently of other mutation classes. In addition to deletion mutations, several substitution mutations are also associated with surges in cases (e.g. L452R and T478K in the receptor binding domain; Figure S1). Thus, a concerted evolution of strategically placed deletions and substitutions appears to be conferring on SARS-CoV-2 the fitness to evade immunity and achieve efficient transmission between hosts (Figure 6).
Our finding that Spike protein NTD deletions are strongly enriched for association with test positivity surges is notable in the context of a previous report identifying the NTD as the most common site of deletions 15 . Specifically, this prior study highlighted four recurrent deletion regions in the NTD based on the GISAID data deposited as of October 2020 (146,795 total sequences). Several of these regions overlap with the putative residues of the recently identified NTD antigenic supersite, and deletions within them can abrogate binding to neutralizing antibodies [13][14][15] . Our study builds upon this prior work by examining the deletions which have arisen in the interim, during which over 1.1 million additional sequences have been deposited. In addition to validating the previously suggested definitions of recurrent deletion regions RDR1 (ΔH69/V70 and flanking deletions), RDR2 (ΔY144 and flanking deletions), and RDR3 (ΔI210 and ΔN211), we found that RDR4 (previously defined as positions 242-248) has recently expanded to include positions 249-253. These residues are indeed part of the structurally mapped supersite 13,14 , and a variant with the Δ246-253 deletion increased in prevalence during a recent test positivity surge in Chile. The recently evolved ΔF157/R158 deletion, which has expanded during the massive surge in India, marks a new recurrent deletion region which also maps to the supersite 14 . Finally, our real time surveillance of clinically annotated SARS-CoV-2 genomes among COVID-19 cases at the Mayo Clinic, including vaccine breakthrough infections, revealed contiguous deletions (Δ85-90 and Δ167-174) that have not been recognized as recurrent deletion regions previously.
The proximity of these regions to the antigenic supersite suggests that they may become more prevalent in the coming months and that deletions in these regions should be monitored for associations with future surges. The striking trend that the most frequently deleted NTD regions are proximal to a single antigenic supersite highlights the prominent role that host immunity has played in shaping the genomic evolution of SARS-CoV-2 from the beginning of this pandemic.
There are a few limitations of this study. First, the geographic distribution of sequences deposited in the GISAID database is not representative of the global population, with a majority of the sequences coming from the United States or the United Kingdom. Future genomic epidemiology studies would be improved by expanded sequencing efforts in other countries. Second, the identification of mutations associated with surges during early months of the pandemic is complicated by the relative paucity of whole genome sequencing data deposited during that time. Third, the publicly accessible genomic data is not linked to any phenotypic information (e.g., disease severity) or relevant medical histories (e.g., comorbidities and vaccination status). Thus, while we are able to identify correlations between mutational prevalence and case surges, we cannot determine whether particular mutations are associated with more severe disease or are observed more frequently than expected by chance in vaccinated individuals. While the latter shortcoming is partially addressed by our independent whole genome sequencing of virus isolated from COVID-19 cases with accessible longitudinal records (including previously vaccinated individuals), this analysis was limited by the small size of the cohort (n = 53) and the lack of corresponding antibody titer data. We plan to address this by performing more whole genome viral sequencing of SARS-CoV-2 from COVID-19 patients.
Taken together, using genomic epidemiology and clinical genomics, we have uncovered that SARS-CoV-2 likely employs antigenic minimalism in the Spike protein as a strategy to evade immune responses induced by infection or vaccination. These findings have important therapeutic and public health policy implications. The repertoire of deletion mutations in the N-terminal domain should be considered when developing future vaccines and biologics to counter the immuno-evasive strategies of SARS-CoV-2. From the public health standpoint, we must expand sequencing efforts around the world and encourage the transparent linking of relevant deidentified patient phenotypic data (e.g. disease severity, vaccination status) to each deposited SARS-CoV-2 genome. While the current analysis focuses on the Spike protein, future work focusing on other SARS-CoV-2 proteins, such as the nucleocapsid protein and RNA-dependent RNA polymerase, will shed light on the role of the overall mutational landscape of the SARS-CoV-2 proteome for viral fitness and immune evasion. Such a holistic understanding of SARS-CoV-2 is imperative to proactively predict mutations that could trigger outbreaks and vaccine breakthroughs, as well as guide the development of comprehensive therapeutic strategies to defeat the COVID-19 pandemic.

Analysis of publicly deposited SARS-CoV-2 genomic sequences
1,313,962 SARS-CoV-2 genome sequences (with 1,276 unique lineages and 70,516 unique mutations) were obtained from GISAID 17 (data retrieved from https://www.gisaid.org/ on 30 April 2021) for the period of December 2019 to April 2021 across 178 countries. Using the Wuhan-Hu-1 sequence as reference (UniProt ID: P0DTC2), there have been 9919 amino acid mutations in the Spike protein detected in at least one sequence. To filter out potential sequencing artifacts, we excluded mutations that were present in fewer than 100 sequences, resulting in 750 unique Spike protein mutations.
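The frequency filter described above can be sketched as follows. This is a minimal illustration, assuming each genome has already been reduced to a set of Spike mutation labels; the function name, the input layout, and the toy data (including the placeholder label "X1Y") are hypothetical, and only the 100-sequence threshold comes from the text.

```python
from collections import Counter


def filter_recurrent_mutations(genomes, min_sequences=100):
    """Keep mutations observed in at least `min_sequences` genomes."""
    counts = Counter()
    for mutations in genomes:
        counts.update(set(mutations))  # count each mutation once per genome
    return {m: n for m, n in counts.items() if n >= min_sequences}


# Toy input (hypothetical): 150 genomes carry D614G, 99 also carry "X1Y".
toy_genomes = [{"D614G"}] * 51 + [{"D614G", "X1Y"}] * 99
kept = filter_recurrent_mutations(toy_genomes)
```

Here `kept` retains D614G (150 genomes) and drops the placeholder mutation, which falls one genome short of the threshold.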

Identification of surge-associated SARS-CoV-2 mutations
To identify mutations that have been temporally associated with surges in COVID-19 cases throughout the pandemic, we assessed monthly mutational prevalences and test positivity over three-month intervals in each country. For each of the 750 mutations, the monthly mutational prevalence was computed for a given country as the percentage of SARS-CoV-2 sequences from that country in that month that contained the mutation. Positivity data for PCR tests was obtained from the OWID resource 28,29 (retrieved from https://github.com/owid/covid-19-data/tree/master/public/data on April 23, 2021). For each country, the monthly test positivity was calculated as the percentage of PCR tests in that month that were positive. To identify surge-associated mutations, we classified the monthly mutational prevalence (for each mutation) and the monthly test positivity as increasing (monotonically), decreasing (monotonically), or mixed over sliding three-month intervals over the course of the pandemic. Any mutation which monotonically increased in prevalence over this interval in a country with a simultaneous monotonic increase in test positivity was defined as a "surge-associated mutation." There were 116 such mutations.
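The sliding-window classification can be illustrated with a short sketch. It assumes that "monotonic" means strictly increasing or decreasing between consecutive months, which the text does not specify; the function names and the monthly values are hypothetical and are not taken from the study's data.

```python
def trend(values):
    """Label a series 'increasing', 'decreasing', or 'mixed'.

    Assumption: monotonic means strictly monotonic between consecutive months.
    """
    steps = list(zip(values, values[1:]))
    if all(later > earlier for earlier, later in steps):
        return "increasing"
    if all(later < earlier for earlier, later in steps):
        return "decreasing"
    return "mixed"


def surge_windows(months, prevalence, positivity, window=3):
    """Return the first month of every window in which both series increase."""
    hits = []
    for start in range(len(months) - window + 1):
        stop = start + window
        if (trend(prevalence[start:stop]) == "increasing"
                and trend(positivity[start:stop]) == "increasing"):
            hits.append(months[start])
    return hits


# Hypothetical monthly values (percent) for one mutation in one country.
months = ["2021-01", "2021-02", "2021-03", "2021-04"]
prevalence = [2.0, 1.1, 6.0, 15.0]
positivity = [1.9, 1.6, 3.0, 12.0]
surge_starts = surge_windows(months, prevalence, positivity)
```

In this toy series only the window starting in February qualifies, so the mutation would be flagged as surge-associated for that country under the stated assumption.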

Comparison of surge-associated mutations to mutations in CDC variants of interest and concern
In order to test the value of our method, we obtained the set of CDC variants of interest and concern as of April 15, 2021 8 . At this time, there were 5 variants of concern and 8 variants of interest, with no variants of high consequence. From the 13 classified variants, there were 56 unique mutations listed, of which 25 were found only in variants of interest, 24 were found only in variants of concern, and 7 were found in both variants of interest and concern. After identifying the surge-associated mutations as described above, we determined the fraction of mutations comprising the CDC-classified variants which were captured by this approach.

Assessment of mutation types for enrichment of surge-associated mutations
After identifying the 116 surge-associated mutations, we tested whether any of the contributing mutation types (deletions, insertions, or substitutions) were enriched for surge-associated mutations. To do so, we constructed a 2 x 3 table giving the number of surge-associated and non-surge-associated mutations in each category. To determine whether one or more groups showed a statistically significant enrichment, a chi-square p-value was calculated using the chisq.test function from the stats package (version 4.0.3) in R. Post-hoc tests were performed by constructing 2 x 2 contingency tables to compare each mutation type against all others. Then, odds ratios and their corresponding 95% confidence intervals were calculated using the fisher.test function from the stats package (version 4.0.3) in R.
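As a rough cross-check of this procedure, the sketch below recomputes a Pearson chi-square statistic and a cross-product odds ratio in plain Python rather than R. It is not the authors' code: the counts are those quoted in the Results (22/30 deletions, 1/2 insertions, 94/718 substitutions), which sum to 117 rather than 116 surge-associated mutations, and the cross-product odds ratio differs slightly from the conditional estimate returned by fisher.test, so the outputs are expected to approximate, not reproduce, the published values. The function names are hypothetical.

```python
import math

# (surge-associated, total) per mutation type, as quoted in the Results.
counts = {"deletion": (22, 30), "insertion": (1, 2), "substitution": (94, 718)}


def chi_square_2xk(table):
    """Pearson chi-square statistic for a 2 x k table of (hits, total) pairs."""
    grand_total = sum(total for _, total in table.values())
    hit_total = sum(hits for hits, _ in table.values())
    stat = 0.0
    for hits, total in table.values():
        cells = ((hits, hit_total), (total - hits, grand_total - hit_total))
        for observed, row_total in cells:
            expected = total * row_total / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


def sample_odds_ratio(table, group):
    """Cross-product odds ratio of `group` versus all other groups pooled."""
    a, n = table[group]
    b = n - a
    c = sum(h for g, (h, _) in table.items() if g != group)
    d = sum(t - h for g, (h, t) in table.items() if g != group)
    return (a * d) / (b * c)


stat = chi_square_2xk(counts)
p_value = math.exp(-stat / 2)  # exact chi-square upper tail for 2 degrees of freedom
odds_ratio = sample_odds_ratio(counts, "deletion")
```

A 2 x 3 table has two degrees of freedom, for which the chi-square upper tail probability has the closed form exp(-x/2); this avoids any dependency beyond the standard library.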

Identification of new recurrent deletion regions in the Spike protein
Recurrent deletion regions (RDRs) were previously defined as four sites within the NTD at which over 90% of all Spike protein deletions occurred, per the 146,795 SARS-CoV-2 sequences deposited in GISAID as of October 24, 2020. To identify potential new RDRs that have emerged since this time, we first plotted the distribution of deletion counts for each amino acid (i.e., the number of sequences in which deletion of the given amino acid was observed) in the Spike protein, considering all 1,313,962 sequences analyzed in this study. We calculated the 95th percentile of the deletion count distribution, which is 22.4. We then bucketed each residue R into categories (Yes, No, Possible) reflecting whether or not it should be considered as part of an RDR (i.e., a contiguous stretch of two or more amino acid residues which undergo deletion events more frequently than expected by chance) as follows (illustrated schematically in Table S4).
Once each residue was categorized in this way, any residue P in the "Possible" category was subjected to further analysis to convert its label into "Yes" or "No." Specifically, we took a step-wise approach, walking in both directions from P until the first encounter of a residue categorized as "Yes" or "No" (i.e., other residues labeled as "Possible" were ignored). If a residue categorized as "Yes" was encountered before any residue categorized as "No" in either direction, then the "Possible" label was converted to "Yes." If a residue categorized as "No" was encountered before any residue categorized as "Yes" in both directions, then the "Possible" label was converted to "No."
With each residue categorized as "Yes" or "No", we then simply merged the residue windows with consecutive "Yes" labels to define the updated set of Spike protein RDRs. We name the RDRs on the basis of the first and last amino acid residues contained within the region; for example, the RDR including residues Q14, C15, and V16 is defined as RDR14-16.
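The label-resolution and merging steps can be sketched as below. The sketch takes the initial Yes/No/Possible labels as given, and reads the rule as: a "Possible" residue becomes "Yes" if the nearest definite label in either direction is "Yes", and "No" otherwise. Treating the end of the protein as "no Yes found" and dropping runs shorter than two residues (following the stated RDR definition) are assumptions; the function names and toy labels are hypothetical.

```python
def nearest_definite(labels, index, step):
    """Walk from `index` in direction `step`; return the first 'Yes'/'No' label."""
    i = index + step
    while 0 <= i < len(labels):
        if labels[i] != "Possible":
            return labels[i]
        i += step
    return None  # assumption: reaching the end of the protein yields no 'Yes'


def resolve_possible(labels):
    """Convert every 'Possible' label to 'Yes' or 'No' using the original labels."""
    resolved = []
    for index, label in enumerate(labels):
        if label == "Possible":
            neighbours = (nearest_definite(labels, index, -1),
                          nearest_definite(labels, index, +1))
            label = "Yes" if "Yes" in neighbours else "No"
        resolved.append(label)
    return resolved


def merge_rdrs(labels, first_position=1, min_length=2):
    """Merge consecutive 'Yes' residues into named regions such as 'RDR14-16'."""
    regions, start = [], None
    # A trailing "No" sentinel closes a run that reaches the last residue.
    for offset, label in enumerate(list(labels) + ["No"]):
        position = first_position + offset
        if label == "Yes" and start is None:
            start = position
        elif label != "Yes" and start is not None:
            if position - start >= min_length:
                regions.append(f"RDR{start}-{position - 1}")
            start = None
    return regions


# Hypothetical labels for residues 10-20 (not taken from the study).
toy_labels = ["No", "Possible", "Yes", "Yes", "Possible", "Possible",
              "No", "Possible", "No", "Yes", "Yes"]
rdrs = merge_rdrs(resolve_possible(toy_labels), first_position=10)
```

Resolution deliberately reads from the original labels, so that one converted "Possible" residue does not influence its "Possible" neighbours, matching the statement that such residues were ignored during the walk.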

Temporal analysis of expansions in recurrent deletion regions
To assess the expansion of regions undergoing deletions over time, we plotted a time series heatmap indicating the first month in which a given deletion was identified across all GISAID sequences, and the number of sequences in which that deletion was detected in that month and all subsequent months. The residues plotted were defined based on the definition of RDRs provided above, which builds upon the regions defined previously 15.
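The bookkeeping behind such a heatmap can be sketched as follows. This is a hedged illustration rather than the pipeline used in the study: the function name `deletion_timeline` and the input format (one tuple per sequence holding a `YYYY-MM` collection month and the set of deleted Spike residue positions) are assumptions, the toy records are not real data, and the sketch reports per-month counts rather than cumulative ones.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def deletion_timeline(
    records: Iterable[Tuple[str, Set[int]]]
) -> Tuple[Dict[int, str], Dict[int, Dict[str, int]]]:
    """Return the first month each residue deletion is seen and monthly counts.

    `records` holds one (month, deleted_positions) tuple per sequence, with
    the month written as 'YYYY-MM' so that string order is chronological.
    """
    monthly = defaultdict(Counter)  # residue -> month -> number of sequences
    for month, deleted in records:
        for residue in deleted:
            monthly[residue][month] += 1
    # The earliest month key is the first time the deletion was observed.
    first_seen = {res: min(counts) for res, counts in monthly.items()}
    return first_seen, {res: dict(counts) for res, counts in monthly.items()}


# Toy input (illustrative only)
toy_records = [
    ("2020-03", {69, 70}),
    ("2021-02", {157, 158}),
    ("2021-04", {157, 158}),
    ("2021-04", {69}),
]
first_seen, counts = deletion_timeline(toy_records)
for residue in sorted(first_seen):
    print(residue, first_seen[residue], counts[residue])
```

The two returned dictionaries map directly onto a heatmap with residues on one axis and months on the other, with cells before `first_seen` left empty.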

Structural analysis of SARS-CoV-2 Spike protein
Structural analyses and illustrations were performed in PyMOL (version 2.3.4). The cryo-EM structure of the Spike protein characterizing the interaction with a neutralizing antibody 4A8 (PDB identifier: 7C2L), described by Chi et al. 21, was retrieved from the PDB.

Whole viral genome sequencing of SARS-CoV-2 obtained from individuals with breakthrough infections
This is a retrospective study of individuals who underwent polymerase chain reaction (PCR) testing for suspected SARS-CoV-2 infection at the Mayo Clinic and hospitals affiliated with the Mayo health system. This study was reviewed by the Mayo Clinic Institutional Review Board and determined to be exempt from human subjects research. Subjects were excluded if they did not have a research authorization on file.
SARS-CoV-2 RNA-positive upper respiratory tract swab specimens from patients with COVID-19 vaccine breakthrough infection or reinfection were subjected to next-generation sequencing using the commercially available Ion AmpliSeq SARS-CoV-2 Research Panel (Life Technologies Corp., South San Francisco, CA), which is based on the "sequencing by synthesis" method. The assay amplifies 237 sequences ranging from 125 to 275 base pairs in length, covering 99% of the SARS-CoV-2 genome. Viral RNA was first manually extracted and purified from these clinical specimens using the MagMAX™ Viral/Pathogen Nucleic Acid Isolation Kit (Life Technologies Corp.). This was followed by automated reverse transcription-PCR (RT-PCR) of viral sequences, DNA library preparation (including enzymatic shearing, adapter ligation, purification, and normalization), DNA template preparation, and sequencing on the automated Genexus™ Integrated Sequencer (Life Technologies Corp.) with the Genexus™ Software version 6.2.1. A no-template control and a positive SARS-CoV-2 control were included in each assay run for quality control purposes. Viral sequence data were assembled using the Iterative Refinement Meta-Assembler (IRMA) application (50% base substitution frequency threshold) to generate unamended plurality consensus sequences. These consensus sequences were analyzed with the latest versions of two web-based application tools: Pangolin 30 for SARS-CoV-2 lineage assignment, and Nextclade 31 for viral clade assignment, phylogenetic analysis, and S codon mutation calling, in comparison to the wild-type reference sequence of SARS-CoV-2 Wuhan-Hu-1 (lineage B, clade 19A).
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. This version posted May 31, 2021; https://doi.org/10.1101/2021.05.23.21257668.

Data Availability
After publication, the data will be made available upon reasonable request to the corresponding author. A proposal with a detailed description of the study objectives and the statistical analysis plan will be needed to evaluate whether a request is reasonable. Deidentified data will be provided after approval from the corresponding author and the Mayo Clinic.

Declaration of Interests
AJV, PA, PL, PG, RS, AS, DRC, and VS are employees of nference and have financial interests in the company and in the successful application of this research. nference collaborates with bio-pharmaceutical companies on data science initiatives unrelated to this study. These collaborations had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. JCO receives personal fees from Elsevier and Bates College, and receives small grants from nference, Inc, outside the submitted work. ADB is a consultant for Abbvie and Flambeau diagnostics, is a paid member of the DSMB for Corvus pharmaceuticals, Equilium, and Excision biotherapeutics, has received fees for speaking for Reach MD, owns equity for scientific advisory board positions in nference and Zentalis, and is founder and President of Splissen therapeutics. JH, JCO, GJG, AWW, AV, MDS, and ADB are employees of the Mayo Clinic. The Mayo Clinic may stand to gain financially from the successful outcome of the research. nference and Mayo Clinic have filed a provisional patent application associated with this study. This research has been reviewed by the Mayo Clinic Conflict of Interest Review Board and is being conducted in compliance with Mayo Clinic Conflict of Interest policies.

Author Contributions
VS conceived the study. AJV and VS advanced the study design and generated the hypotheses. PA, PL, PG, AJV and VS wrote the manuscript and reviewed the findings. AJV, PA, PL, PG, RS, AS, DRC, JDY, and JCO contributed methods, data, analysis, or software. JCOH, JDY, BSP, AN, RTH, ADB, and JDH reviewed the study, findings, and the manuscript. All authors revised the manuscript.

Supplementary Information

Figure S1. Comparison of surge-associated mutations identified in this study and mutations present in variants of interest or concern as categorized by the CDC.