Antigenic minimalism of SARS-CoV-2 is linked to surges in COVID-19 community transmission and vaccine breakthrough infections ============================================================================================================================= * A.J. Venkatakrishnan * Praveen Anand * Patrick Lenehan * Pritha Ghosh * Rohit Suratekar * Abhishek Siroha * Dibyendu Roy Chowdhury * John C. O’Horo * Joseph D. Yao * Bobbi S. Pritt * Andrew Norgan * Ryan T. Hurt * Andrew D. Badley * John D. Halamka * Venky Soundararajan ## Abstract The raging COVID-19 pandemic in India and reports of “vaccine breakthrough infections” globally have raised alarm mandating the characterization of the immuno-evasive features of SARS-CoV-2. Here, we systematically analyzed over 1.3 million SARS-CoV-2 genomes from 178 countries and performed whole-genome viral sequencing from 53 COVID-19 patients, including 20 vaccine breakthrough infections. We identified 116 Spike protein mutations that increased in prevalence during at least one surge in SARS-CoV-2 test positivity in any country over a three-month window. Deletions in the Spike protein N-terminal domain (NTD) are highly enriched for these ‘*surge-associated mutations*’ (Odds Ratio = 18.2, 95% CI: 7.53-48.7; p=1.465×10−18). In the recent COVID-19 surge in India, an NTD deletion (ΔF157/R158) increased over 10-fold in prevalence from February 2021 (1.1%) to April 2021 (15%). During the recent surge in Chile, an NTD deletion (Δ246-253) increased rapidly over 30-fold in prevalence from January 2021 (0.86%) to April 2021 (33%). Strikingly, these simultaneously emerging deletions associated with surges in different parts of the world both occur at an antigenic supersite that is targeted by neutralizing antibodies. Finally, we generated clinically annotated SARS-CoV-2 whole genome sequences and identified deletions within this NTD antigenic supersite in a patient with vaccine breakthrough infection (Δ156-164) and other deletions from unvaccinated severe COVID-19 patients that could represent emerging deletion-prone regions. Overall, the expanding repertoire of NTD deletions throughout the pandemic and their association with case surges and vaccine breakthrough infections point to antigenic minimalism as an emerging evolutionary strategy for SARS-CoV-2 to evade immune responses. This study highlights the urgent need to sequence viral genomes at a larger scale globally and to mandate that sequences are deposited with more granular and transparent clinical annotations to ensure that therapeutic development keeps pace with the evolution of SARS-CoV-2. ## Introduction The ongoing COVID-19 pandemic has infected around 160 million people and killed more than 3 million people worldwide, as of May 20211. The continual emergence of SARS-CoV-2 variants with increased transmissibility and capacity for immune escape, such as B.1.1.7 (“UK variant”) and P.1 (“Brazilian variant”), threatens to prolong the pandemic through devastating outbreaks such as the one currently being witnessed in India2. While multiple vaccines have demonstrated high effectiveness in clinical trials and real world studies3–5, there have been reports of “vaccine breakthrough infections” with SARS-CoV-2 variants6,7. A recent study described two such cases in New York, at least one of which occurred despite confirmation of a robust neutralizing antibody response. Variant classification schemes have been developed by the US Centers for Disease Control and Prevention (CDC)8 and the World Health Organisation (WHO)9 based on factors such as prevalence, evidence of transmissibility and disease severity, and ability to be neutralized by existing therapeutics or sera from vaccinated patients. Early and rapid detection of these emerging Variants of Concern/Interest is imperative to combat and contain the ongoing pandemic and future outbreaks. Mapping the mutational landscape of SARS-CoV-2 in the context of natural and vaccine-induced immune responses is critical to understand the virus’s molecular strategies for immune evasion. To this end, neutralizing antibodies which target the receptor-binding domain (RBD) or the N-terminal domain (NTD) of the Spike protein have been isolated from the sera of COVID-19 patients10–12. Recent studies contemporaneously found that several neutralizing antibodies target a single antigenic supersite in the NTD of the Spike protein13,14. The NTD is also a hotspot for in-frame deletions in the SARS-CoV-2 genome, with four recurrent deletion regions (RDRs) identified15. Several such deletions have been experimentally demonstrated to reduce neutralization by NTD-targeting neutralizing antibodies13,15. Whether additional deletions are emerging in SARS-CoV-2 variants that drive case surges or vaccine breakthrough infections needs to be determined. Concerted global data sharing efforts during the pandemic have led to the rapid development of large-scale genomic and epidemiological COVID-19 resources. Over 1.3 million SARS-CoV-2 genomes from 178 countries have been deposited throughout the pandemic in the GISAID database (**Figure 1**). In addition, we performed whole-genome sequencing of SARS-CoV-2 from patients at the Mayo Clinic that had recently developed SARS-CoV-2 infections or post-vaccination “breakthrough” infections. On the epidemiology front, population-level metrics including SARS-CoV-2 test positivity rates and mortality rates are being collected from 219 countries in databases such as OWID16. Such unprecedented availability of genomic-epidemiology data combined with clinical genomic data provides a timely opportunity to systematically characterize the immuno-evasive features of SARS-CoV-2. ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/31/2021.05.23.21257668/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2021/05/31/2021.05.23.21257668/F1) Figure 1. Identifying potential SARS-CoV-2 variants contributing to COVID-19 surge. (a) Overview of COVID-19 prevalence and SARS-CoV-2 variants globally during the pandemic. (b) Identifying surge-associated mutations comparing mutational prevalence and test-positivity over three-month windows. Structure shows the sites of all the 750 mutations (as red spheres) on a subunit of the Spike protein — each site can harbor multiple mutations. Chart shows correlation between trends of mutational prevalence (e.g. F178del) and test positivity during a three-month window (e.g. Feb, 2021 to Apr. 2021 in India). These correlations are computed for all three-months across all the countries being studied. Barchart shows the frequency of mutations that fall in each combination of trends: (i) 116 surge-associated mutations: mutational prevalence and test positivity are increasing monotonically, (ii) 27 mutations where mutational prevalence is increasing monotonically and test positivity decreasing monotonically, and (iii) 76 mutations where mutational prevalence is increasing monotonically and test positivity trend is mixed. Other categories (e.g. test positivity decreasing) are not shown for clarity. (c) The enrichment of N-terminal domain deletions among surge-associated mutations. Structure shows the sites of all the 22 surge-associated deletion mutations (as magenta spheres) on a subunit of the Spike protein. Venn diagram shows the overlap of surge-associated mutations and deletion mutations. The 2×3 table shows the distribution of mutations based on surge-association and mutation category (deletion, insertion, substitution). Chi-square test results are also shown. In this study, we uncover that deletion mutations in the Spike protein have a high likelihood of being associated with surges in community transmission. Further, we identify rapidly emerging surge-associated deletion mutations in India and Chile that map to an antigenic supersite in the NTD. Based on a global analysis of deletions we also highlight that the deletion-prone regions of the NTD are expanding during the course of the pandemic as part of an evolutionary strategy of “antigenic minimalism” to evade immune responses. Finally, using whole genome sequencing, we also identify deletion mutations in SARS-CoV-2 from COVID-19 patients with infection/vaccine-breakthrough infections, also mapping near the antibody-binding site and thus representing candidates for vaccine escape mutations. ## Results ### Deletions are enriched for association with surges in community transmission of SARS-CoV-2 Analysis of 1,313,962 SARS-CoV-2 genome sequences obtained from the GISAID database17(**Figure 1a**) revealed the presence of 750 amino acid mutations (missense and indels) in the Spike protein. This list included mutations that were observed in at least 100 SARS-CoV-2 genome sequences in order to exclude random occurrences from sequencing errors. The list of 750 mutations included 718 substitutions (95.7%), 30 deletions (4%), and 2 insertions (0.3%). To identify the mutations associated with surges in the community transmission of COVID-19 (“surge-associated mutations’’), we shortlisted mutations that increased in prevalence monotonically during periods of monotonically increasing test positivity (**Figure 1b**). We identified 116 mutations that increased in prevalence during one or more surges in test positivity rate, in any country, over a three-month time interval. This approach recapitulated 45 out of 56 (80%) mutations known to be present in the CDC variants of interest or concern, including D614G, E484K, N501Y, P681H, P681R, ΔH69/V70, and ΔY144 (**Figure S1**). Further, we investigated whether a class of mutations (missense, insertions and/or deletions) are enriched for association with surges. Interestingly, 22 of 30 (73%) deletions were surge-associated, as compared to 94 of 718 (13%) substitutions, and 1 of 2 (50%) insertions (**Figure 1c**). This data indicates that deletions, but not substitutions or insertions, are highly enriched for association with surges (Chi-square Test p-value = 1.465 × 10−18; Odds Ratio = 18.2, 95% CI: 7.5-48.7; **Figure 1c**). The surge-associated deletions in the Spike protein occur exclusively in the N-terminal domain (NTD), which is interesting in light of the recently identified recurrent deletion regions (RDRs) in the NTD15. This raises the possibility that SARS-CoV-2 may be acquiring deletion mutations to evolve new variants that drive the surges in community transmission of COVID-19. ### Surging SARS-CoV-2 variants in India and Chile have acquired new deletions in an antigenic supersite to evade neutralizing antibodies Recently there have been massive surges of COVID-19 cases in a few countries around the world, but most prominently in India18 and Chile19,20. In order to identify the mutations that are associated with these surges, we systematically analyzed the mutations that increased monotonically in prevalence correlated with the monotonic increase in test positivity between February and April 2021 (**Figure 2a** and **Figure 3a**). ![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/31/2021.05.23.21257668/F2.medium.gif) [Figure 2.](http://medrxiv.org/content/early/2021/05/31/2021.05.23.21257668/F2) Figure 2. Rapidly emerging deletions in India map to antigenic supersite in N-terminal domain mapping to antigenic supersite. (a) Identification of mutations in the SARS-CoV-2 Spike protein that are associated with the COVID-19 surge in India between February and April 2021, based on correlations between mutational prevalence and test positivity. 13 mutations were found to be positively correlated with the surge over the three-month window. (b) Tracking the prevalence of ΔF157/R158 in the emerging “Indian variant”. The inset shows the location of these two residues on the Spike protein structure. ![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/31/2021.05.23.21257668/F3.medium.gif) [Figure 3.](http://medrxiv.org/content/early/2021/05/31/2021.05.23.21257668/F3) Figure 3. New surging variant in Chile has acquired deletions in the antigenic supersite. (a) Identification of mutations in the SARS-CoV-2 Spike protein that are associated with the COVID-19 surge in Chile between February and April 2021, based on correlations between mutational prevalence and test positivity. 36 mutations were found to be positively correlated with the surge over the three-month window. The clustering of these mutations based on their co-occurrence (number of sequences with both mutations present) using the complete method (farthest point algorithm) reveals two dominant clusters. (b) Deletions in the emerging “Chile variant” map to the antigenic supersite that is bound by most anti-NTD antibodies. (c) Rapid increase of the prevalence of surge-associated mutations in the Chile variant (blue). The other existing variants (B.1.1.7, P.2) are also shown. In India, 13 Spike protein mutations were correlated with the massive surge (“second wave of infections’’) in April 2021 (**Table S1**). The most prevalent mutations included D616G, L452R, P681R, G142D and Q1071H (**Figure 2b**). Interestingly, there is also a rapidly emerging deletion in the NTD (ΔF157/R158) that has increased 13.6-fold in prevalence from 1.1% (of 1254 sequences) in February 2021 to 15% (of 367 sequences) in April 2021. F157 and R158 reside in the antigenic supersite, which is recognized by a number of NTD-targeting neutralizing antibodies14,21 (**Figure 2b**). Importantly, this deletion had not been identified at the time of the prior characterization of Spike protein deletions15, and thus we suggest that ΔF157/R158 represents a novel distinct fifth recurrent deletion region. In Chile, 36 Spike protein mutations are correlated with the surge in April 2021 (**Figure 3b, Table S2**). Clustering these mutations by co-occurrences shows the emergence of a new variant characterized by a novel in-frame deletion resulting in the loss of residues 246-253 and a D253G substitution (Δ246-253), which has increased in prevalence by over 30-fold from January to April 2020 (0.86% to 33.0%) (**Figure 3b**). This new variant in Chile is independent of two other concurrently circulating variants: P.2 (first identified in Brazil) and B.1.1.7 (first identified in the UK) (**Figure 3b**). Interestingly, residues 246-253 are part of the “supersite loop” in the same NTD antigenic supersite that includes F157 and R158 (**Figure 3b**)14,21, suggesting that their deletion could help evade NTD-targeting neutralizing antibodies. Indeed, mutations in this region (at P251, G252, and D253) were also found in neutralization escape mutants *in vitro**13*. Taken together, this analysis highlights two simultaneously surging strains in different parts of the world that have both acquired deletions in the NTD antigenic supersite (ΔF157/R158 in India, Δ246-253 in Chile) that is highly targeted by neutralizing antibodies. Indeed, deletions in the NTD domain have been previously shown to diminish the binding of neutralizing antibodies13,15. This suggests that the surging SARS-CoV-2 variants in India and Chile may have acquired NTD deletions in the antigenic supersite in order to evade neutralizing antibodies and achieve immune escape. From a viral evolution standpoint, these observations raise the question of whether SARS-CoV-2 is expanding its repertoire of deletable regions in the Spike protein as the pandemic progresses. ### Recurrent deletion regions in the Spike protein emerge and expand over the course of the pandemic In order to understand whether the deletable regions in the Spike protein are increasing, we examined the distribution of deletion frequencies for all amino acids in the Spike protein sequence from over 1.3 million sequences over the course of pandemic (**Figure 4a,b**; see **Methods**). This analysis includes more than 10-fold the number of sequences compared to the previous analysis of recurrent deletion regions that was based on 146,795 sequences as of October 24, 2020 (**Figure 4a**; see **Methods**). ![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/31/2021.05.23.21257668/F4.medium.gif) [Figure 4.](http://medrxiv.org/content/early/2021/05/31/2021.05.23.21257668/F4) Figure 4. Deletion mutations are expanding to contiguous regions. (a) Frequency of occurrence of deletion mutations in the N-terminal domain across 1.3 million Spike protein sequences. The recurrent deletion regions, both known as well as new, are illustrated schematically and mapped on the structure of the Spike protein. (b) Heatmap showing the expansion of “deletable” regions in the course of the pandemic, where the rows denote residue positions in the Spike protein and columns denote the time course of the pandemic (in months). The boxes denote the frequency of the deletion mutation across the world in that month. The color of the boxes corresponds to a frequency of 1 to 100,000 sequences shown on a log10 scale. These deletion mutations are shown on the 3D structure of the Spike protein. Positions corresponding to deletion mutations are shown as spheres. In addition to ΔF157/R158 (the new surge-associated deletion in India; **Figure 2**), we observed that residues 14-18 (QCVNL) are deleted more frequently than expected based on the background distribution (see **Methods**). Interestingly, these residues are part of the N-terminal region of the same antigenic supersite (**Figure 4a**), and mutations in this region (at C15 and L18) were common among neutralization escape mutants13,14. Furthermore, this deletion region is emerging recently — most viral genomes containing one or more deletions in this region were deposited after October, 202015. We also identified potential RDRs at residues S640/N641 and 675-681 (QTQTNSP), the latter of which directly precedes the Spike protein furin-cleavage site that we and others have described previously22–25. It is notable that these are the only RDRs observed to date that are outside of the NTD (and thus outside of the antigenic supersite), and their functional significance warrants follow-up. We also recognized that RDRs appear to have the capacity to expand (i.e., to involve more flanking amino acids) over time. For example, the Δ246-253 deletion in one of the surge associated Chile variants can be viewed as an expansion of the previously defined recurrent deletion region RDR4 (Δ242-248)15 (**Figure 5b**). This expansion is likely of high immunologic relevance, as not only do these residues map to the supersite loop, but mutations in this region (at P251, G252, and D253) were also found in neutralization escape mutants *in vitro**13*. Further biochemical experiments can determine whether these previously uncharacterized NTD deletions (residues 14-18, 156-159, and 249-253) impact the binding of NTD-targeted neutralizing antibodies or the capacity of sera from vaccinated individuals to neutralize the virus. ![Figure 5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/31/2021.05.23.21257668/F5.medium.gif) [Figure 5.](http://medrxiv.org/content/early/2021/05/31/2021.05.23.21257668/F5) Figure 5. Deletion mutations present in the Spike proteins sequences derived from vaccination breakthrough or reinfection cases. (a) Whole genome sequencing of SARS-CoV-2 from COVID vaccination breakthrough/reinfections from Mayo Clinic. (b) Four different stretches of Spike protein deletions from the patients with vaccine breakthrough or reinfections of COVID-19. In the heatmap, rows denote patients (state, date of sample and vaccination status are shown) and columns denote deletion mutations. ** indicates vaccinated individuals and NA denotes suspected re-infection. Filled boxes denote the presence of deletion mutations, which are shown on the 3D structure of the Spike protein in panel c. (c) Positions corresponding to deletion mutations are shown as spheres. Taken together, our analysis highlights both the emergence of novel deletion regions and the expansion of previously defined deletion regions over the past several months should be monitored as candidates for emerging RDRs in the coming months. ### Whole-genome sequencing of SARS-CoV-2 genomes from COVID-19 patients with vaccine breakthrough reveals the presence of distinct deletions in the N-terminal domain The genomic-epidemiology analysis presented above based on publicly accessible data has shown that SARS-CoV-2 appears to be acquiring deletion mutations to evade neutralizing antibodies and the deletable regions are expanding over the course of the pandemic. However, the genome sequences deposited in publicly accessible databases (e.g. GISAID) lack any clinical or phenotypic data such as vaccination status or disease severity of the linked COVID-19 patients. To address this, we performed whole genome viral sequencing from 53 COVID-19 patients in the Mayo Clinic health system, for whom we have complete longitudinal health records and vaccination history. Of these, 20 cases were vaccine breakthrough infections, with the infected individuals having been fully vaccinated at the time of their positive SARS-CoV-2 test. In total, we identified 92 unique mutations, of which 28 are deletions (**Figure 5a**). All observed Spike protein deletions in this cohort occurred in the NTD, with Δ144 and ΔH69/V70 showing the highest prevalence (64% and 62%, respectively). Interestingly, we identified four Spike protein variants harboring one or more deletions that warranted follow up (**Figure 5b**). Two of these novel deletions were isolated from previously vaccinated individuals. Whether the deletions were already present at the times of infection or evolved within these individuals under the pressure of vaccine-induced immunity is not known. One patient who had received two doses of BTN162b2 in January 2021 was subsequently infected in April. The virus recovered from this patient contained a Δ156-164 deletion (**Figure 5b**). Given the prominent ΔF157/R158 deletion in India, this corroborates our prior observation that recurrent deletion regions can expand and suggests that ΔF157/R158 may actually be part of a larger deletion-prone region. More importantly, the possibility that this variant with a large contiguous deletion within the antigenic supersite was able to infect a fully vaccinated individual mandates further characterization of the potential immuno-evasive effects of deletions in this region. In another breakthrough infection (Ad26.COV2.S), a patient who had received the J&J vaccine in early April 2021 subsequently tested positive for SARS-CoV-2 by the end of the month. The recovered viral genome harbored a ΔD88 deletion in addition to ΔH69/V70 and ΔY144 (**Figure 5b**). This ΔD88 deletion is quite rare in GISAID, having been detected in only five of the 1.3 million total sequences deposited to date. That said, there was one unvaccinated individual in this Mayo Clinic cohort who experienced severe COVID-19 with a virus containing a ΔD85-90 deletion (**Figure 5b**). There are three sequences in the GISAID database containing this full stretch deletion, two of which are from Germany and one from Texas (**Table S3**). While still too rare to infer its future trajectory, deletions in this region should be epidemiologically monitored in the coming months, particularly to determine whether they impact vaccine effectiveness. Finally, the viral genome recovered from another severe COVID-19 case in an unvaccinated individual contained a Δ167-174 deletion (**Figure 4b**). While these residues are not within the antigenic supersite itself, they do reside in a loop proximal to it, the deletion of which would almost certainly impact the three-dimensional structure of the supersite (**Figure 4c**). Interestingly, residues 168-172 appear to be a potential emerging deletion hotspot in the GISAID dataset, only narrowly missing classification as an RDR based on our defined criteria (see **Methods**). Thus, this is a candidate region to emerge as a bonafide RDR in the near future. This real-time genomic surveillance of SARS-CoV-2 genomes from COVID-19 patients with linked phenotypic and clinical annotations greatly complements our analysis of the publicly accessible unannotated sequences. Specifically, we have identified deletions within the NTD antigenic supersite which are associated with vaccine breakthrough infections (e.g., Δ156-164), and we have identified other deletions in SARS-CoV-2 isolated from unvaccinated severe COVID-19 patients that could represent emerging recurring deletion regions (e.g., Δ167-174). Taken in the context of previous structural and experimental studies, these deletion patterns seem to represent a historical footprint and future trajectory of the antigenic minimalism strategy employed by SARS-SoV-2 to evade natural and vaccine-induced immunity. ## Discussion The worldwide mass vaccination campaign has had a profound impact on COVID-19 transmission. However, certain variants are less susceptible to neutralization by sera from vaccinated individuals and convalescent COVID-19 patients26,27. Such findings motivate the need to vigilantly track the emergence of new variants and to determine whether they are likely to cause surges or vaccine breakthrough infections. Here, through an integrated analysis of genomic and epidemiologic data, we found that (i) deletions are strongly associated with case surges (ii) deletions in the Spike protein NTD map to an antigenic supersite, (iii) the repertoire of deletions in the Spike protein is expanding over the course of the pandemic and (iv) are present in a subset of vaccine breakthrough variants. Indeed, deletion mutations are not operating independent of other mutation classes. In addition to deletion mutations, several substitution mutations are also associated with surges in cases (e.g. L452R and T478K in the receptor binding domain; **Figure S1**). Thus, a concerted evolution of strategically placed deletions and substitutions appear to be conferring SARS-CoV-2 with the fitness to evade immunity and achieve efficient transmission between hosts (**Figure 6**). ![Figure 6.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/31/2021.05.23.21257668/F6.medium.gif) [Figure 6.](http://medrxiv.org/content/early/2021/05/31/2021.05.23.21257668/F6) Figure 6. Schematic overview of the evolution of SARS-CoV-2 N-terminal domain using deletion mutations to evade immune response. The deletion mutations occur concurrently with other substitution mutations (not highlighted) in the background. Our finding that Spike protein NTD deletions are strongly enriched for association with test positivity surges is notable in the context of a previous report identifying the NTD as the most common site of deletions15. Specifically, this prior study highlighted four recurrent deletion regions in the NTD based on the GISAID data deposited as of October 2020 (146,795 total sequences). Several of these regions overlap with the putative residues of the recently identified NTD antigenic supersite, and deletions within them can abrogate binding to neutralizing antibodies13–15. Our study builds upon this prior work by examining the deletions which have arisen in the interim, during which over 1.1 million additional sequences have been deposited. In addition to validating the previously suggested definitions of recurrent deletion regions RDR1 (ΔH69/V70 and flanking deletions), RDR2 (ΔY144 and flanking deletions), and RDR3 (ΔI210 and ΔN211), we found that RDR4 (previously defined as positions 242-248) has recently expanded to include positions 249-253. These residues are indeed part of the structurally mapped supersite13,14, and a variant with the Δ246-253 deletion increased in prevalence during a recent test positivity surge in Chile. The recently evolved ΔF157/R158 deletion, which has expanded during the massive surge in India, marks a new recurrent deletion region which also maps to the supersite14. Finally, our real time surveillance of clinically annotated SARS-CoV-2 genomes among COVID-19 cases at the Mayo Clinic, including vaccine breakthrough infections, revealed contiguous deletions (Δ85-90 and Δ167-174) that have not been recognized as recurrent deletion regions previously. The proximity of these regions to the antigenic supersite suggests that they may become more prevalent in the coming months and that deletions in these regions should be monitored for associations with future surges. The striking trend that the most frequently deleted NTD regions are proximal to a single antigenic supersite highlights the prominent role that host immunity has played in shaping the genomic evolution of SARS-CoV-2 from the beginning of this pandemic. There are a few limitations of this study. First, the geographic distribution of sequences deposited in the GISAID database is not representative of the global population, with a majority of the sequences coming from the United States or the United Kingdom. Future genomic epidemiology studies would be improved by expanded sequencing efforts in other countries. Second, the identification of mutations associated with surges during early months of the pandemic is complicated by the relative paucity of whole genome sequencing data deposited during that time. Third, the publicly accessible genomic data is not linked to any phenotypic information (e.g., disease severity) or relevant medical histories (e.g., comorbidities and vaccination status). Thus, while we are able to identify correlations between mutational prevalence and case surges, we cannot determine whether particular mutations are associated with more severe disease or are observed more frequently than expected by chance in vaccinated individuals. While the latter shortcoming is partially addressed by our independent whole genome sequencing of virus isolated from COVID-19 cases with accessible longitudinal records (including previously vaccinated individuals), this analysis was limited by the small size of the cohort (n = 53) and the lack of corresponding antibody titer data. We plan to address this by performing more whole genome viral sequencing of SARS-CoV-2 from COVID-19 patients. Taken together, using genomic epidemiology and clinical genomics, we have uncovered that SARS-CoV-2 likely employs antigenic minimalism in the Spike protein as a strategy to evade immune responses induced by infection or vaccination. These findings have important therapeutic and public health policy implications. The repertoire of deletion mutations in the N-terminal domain should be considered when developing future vaccines and biologics to counter the immuno-evasive strategies of SARS-CoV-2. From the public health standpoint, we must expand sequencing efforts around the world and encourage the transparent linking of relevant deidentified patient phenotypic data (e.g. disease severity, vaccination status) to each deposited SARS-CoV-2 genome. While the current analysis focuses on the Spike protein, future work focusing on other SARS-CoV-2 proteins, such as the nucleocapsid protein and RNA-dependent RNA polymerase, will shed light on the role of the overall mutational landscape of the SARS-CoV-2 proteome for viral fitness and immune evasion. Such a holistic understanding of SARS-CoV-2 is imperative to proactively predict mutations that could trigger outbreaks and vaccine breakthroughs, as well as guide the development of comprehensive therapeutic strategies to defeat the COVID-19 pandemic. ## Methods ### Analysis of publicly deposited SARS-CoV-2 genomic sequences 1,313,962 SARS-CoV-2 genome sequences (with 1,276 unique lineages and 70,516 unique mutations) were obtained from GISAID17 (data retrieved from [https://www.gisaid.org/](https://www.gisaid.org/) on 30 April 2021) for the period of December 2019 to April 2021 across 178 countries. Using the Wuhan-Hu-1 sequence as reference (UniProt ID: P0DTC2), there have been 9919 amino acid mutations in the Spike protein detected in at least one sequence. To filter out potential sequencing artifacts, we excluded mutations that were present in fewer than 100 sequences, resulting in 750 unique Spike protein mutations. ### Identification of surge-associated SARS-CoV-2 mutations To identify mutations that have been temporally associated with surges in COVID-19 cases throughout the pandemic, we assessed monthly mutational prevalences and test positivity over three-month intervals in each country. For each of the 750 mutations, the monthly mutational prevalence was computed for a given country as: ![Formula][1] Positivity data for PCR tests was obtained from the OWID resource28,29 (retrieved from [https://github.com/owid/covid-19-data/tree/master/public/data](https://github.com/owid/covid-19-data/tree/master/public/data) on April 23, 2021). For each country, the monthly test positivity was calculated as: ![Formula][2] To identify surge-associated mutations, we classified the monthly mutational prevalence (for each mutation) and the monthly test positivity as increasing (monotonically), decreasing (monotonically), or mixed over sliding three-month intervals over the course of the pandemic. Any mutation which monotonically increased in prevalence over this interval in a country with a simultaneous monotonic increase in test positivity was defined as a “surge-associated mutation.” There were 116 such mutations. ### Comparison of surge-associated mutations to mutations in CDC variants of interest and concern In order to test the value of our method, we obtained the set of CDC variants of interest and concern as of April 15, 20218. At this time, there were 5 variants of concern and 8 variants of interest, with no variants of high consequence. From the 13 classified variants, there 56 unique mutations listed, of which 25 were found only in variants of interest, 24 were found only in variants of concern, and 7 were found in both variants of interest and concern. After identifying the surge-associated mutations as described above, we determined the fraction of mutations comprising the CDC-classified variants which were captured by this approach. ### Assessment of mutation types for enrichment of surge-associated mutations After identifying the 116 surge-associated mutations, we tested whether any of the contributing mutation types (deletions, insertions, or substitutions) were enriched for surge-associated mutations. To do so, we constructed a 2 x 3 table giving the number of surge-associated and non-surge-associated mutations in each category. To determine whether one or more groups showed a statistically significant enrichment, a chi-square p-value was calculated using the chisq.test function from the stats package (4.0.3) in R. Post-hoc tests were performed by considered constructing 2×2 contingency tables to compare each mutation type against all others. Then, odds ratios and their corresponding 95% confidence intervals were calculated using the fisher.test function from the stats package (version 4.0.3) in R. ### Identification of new recurrent deletion regions in the Spike protein Recurrent deletion regions (RDRs) were previously defined as four sites within the NTD to which over 90% of all Spike protein deletions occurred, per the 146,795 SARS-CoV-2 sequences deposited in GISAID as of October 24, 2020. To identify potential new RDRs that have emerged since this time, we first plotted the distribution of deletion counts for each amino acid (i.e. number of sequences in which deletion of the given amino acid was observed) in the Spike protein, considering all 1,313,962 sequences analyzed in this study. We calculated the 95th percentile of the deletion count distribution, which is 22.4. We then bucketed each residue *R* into categories (Yes, No, Possible) reflecting whether or not it should be considered as part of an RDR (i.e, a contiguous stretch of two or more amino acid residues which undergo deletion events more frequently than expected by chance) as follows (illustrated schematically in **Table S4**). Once each residue was categorized in this way, then any residue *P* in the “Possible” category were subjected to further analysis to convert their labels into “Yes” or “No.” Specifically, we took a step-wise approach, walking in both directions from *P* until the first encounter of a residue categorized as “Yes” or “No” (i.e., other residues labeled as “Possible” were ignored). If a residue categorized as “Yes” was encountered before any residue categorized as “No” in either direction, then the “Possible” label was converted to “Yes.” If a residue categorized as “No” was encountered before any residue categorized as “Yes” in both directions, then the “Possible” label was converted to “Yes.” With each residue categorized as “Yes” or “No”, we then simply merged the residue windows with consecutive “Yes” labels to define the updated set of Spike protein RDRs. We name the RDRs on the basis of the first and last amino acid residues contained within the region; for example, the RDR including residues C14, Q15, and V16 is defined as RDR14-16. ### Temporal analysis of expansions in recurrent deletion regions To assess the expansion of regions undergoing deletions over time, we plotted a time series heatmap indicating the first time (month) at which a given deletion was identified across all GISAID sequences, and the number of sequences in which that deletion was detected in that month and all subsequent months. The residues plotted were defined based on the definition of RDRs provided above, which builds upon the regions defined previously 15. ### Structural analysis of SARS-CoV-2 Spike protein Structural analyses and illustrations were performed in PyMOL (version 2.3.4). The cryo-EM structure of the Spike protein characterizing the interaction with a neutralizing antibody 4A8 (PDB identifier: 7C2L), described by Chi et al.21, was retrieved from the PDB. ### Whole viral genome sequencing of SARS-CoV-2 obtained from individuals with breakthrough infections This is a retrospective study of individuals who underwent polymerase chain reaction (PCR) testing for suspected SARS-CoV-2 infection at the Mayo Clinic and hospitals affiliated to the Mayo health system. This study was reviewed by the Mayo Clinic Institutional Review Board and determined to be exempt from human subjects research. Subjects were excluded if they did not have a research authorization on file. SARS-CoV-2 RNA-positive upper respiratory tract swab specimens from patients with vaccine breakthrough or reinfection of COVID-19 were subjected to next-generation sequencing, using the commercially available Ion AmpliSeq SARS-CoV-2 Research Panel (Life Technologies Corp., South San Francisco, CA) based on the “sequencing by synthesis” method. The assay amplifies 237 sequences ranging from 125 to 275 base pairs in length, covering 99% of the SARS-CoV-2 genome. Viral RNA was first manually extracted and purified from these clinical specimens using MagMAX™ Viral / Pathogen Nucleic Acid Isolation Kit (Life Technologies Corp.), followed by automated reverse transcription-PCR (RT-PCR) of viral sequences, DNA library preparation (including enzymatic shearing, adapter ligation, purification, normalization), DNA template preparation, and sequencing on the automated Genexus™ Integrated Sequencer (Life Technologies Corp.) with the Genexus™ Software version 6.2.1. A no-template control and a positive SARS-CoV-2 control were included in each assay run for quality control purposes. Viral sequence data were assembled using the Iterative Refinement Meta-Assembler (IRMA) application (50% base substitution frequency threshold) to generate unamended plurality consensus sequences for analysis with the latest versions of the web-based application tools: Pangolin30 for SARS-CoV-2 lineage assignment; Nextclade31 for viral clade assignment, phylogenetic analysis, and S codon mutation calling, in comparison to the wild-type reference sequence of SARS-CoV-2 Wuhan-Hu-1 (lineage B, clade 19A). ## Data Availability After publication, the data will be made available upon reasonable requests to the corresponding author. A proposal with detailed description of study objectives and the statistical analysis plan will be needed for evaluation of the reasonability of requests. Deidentified data will be provided after approval from the corresponding author and the Mayo Clinic. ## Data Availability After publication, the data will be made available upon reasonable requests to the corresponding author. A proposal with detailed description of study objectives and the statistical analysis plan will be needed for evaluation of the reasonability of requests. Deidentified data will be provided after approval from the corresponding author and the Mayo Clinic. ## Declaration of Interests AJV, PA, PL, PG, RS, AS, DRC, and VS are employees of nference and have financial interests in the company and in the successful application of this research. nference collaborates with bio-pharmaceutical companies on data science initiatives unrelated to this study. These collaborations had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. JCO receives personal fees from Elsevier and Bates College, and receives small grants from nference, Inc, outside the submitted work. ADB is a consultant for Abbvie and Flambeau diagnostics, is a paid member of the DSMB for Corvus pharmaceuticals, Equilium, and Excision biotherapeutics, has received fees for speaking for Reach MD, owns equity for scientific advisory board positions in nference and Zentalis, and is founder and President of Splissen therapeutics. JH, JCO, GJG, AWW, AV, MDS, and ADB are employees of the Mayo Clinic. The Mayo Clinic may stand to gain financially from the successful outcome of the research. nference and Mayo Clinic have filed a provisional patent application associated with this study. This research has been reviewed by the Mayo Clinic Conflict of Interest Review Board and is being conducted in compliance with Mayo Clinic Conflict of Interest policies. ## Author Contributions VS conceived the study. AJV and VS advanced the study design and generated the hypotheses. PA, PL, PG, AJV and VS wrote the manuscript and reviewed the findings. AJV, PA, PL, PG, RS, AS, DRC, JDY, and JCO contributed methods, data, analysis, or software. JCOH, JDY, BSP, AN, RTH, ADB, and JDH reviewed the study, findings, and the manuscript. All authors revised the manuscript. ## Supplementary Information ![Figure S1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/05/31/2021.05.23.21257668/F7.medium.gif) [Figure S1.](http://medrxiv.org/content/early/2021/05/31/2021.05.23.21257668/F7) Figure S1. Comparison of surge-associated mutations identified in this study and mutations present in variants of interest or concern as categorized by the CDC. View this table: [Table S1:](http://medrxiv.org/content/early/2021/05/31/2021.05.23.21257668/T1) Table S1: List of GISAID accession IDs with the same recurrent deletions observed as seen in the vaccine breakthrough patients. View this table: [Table S2:](http://medrxiv.org/content/early/2021/05/31/2021.05.23.21257668/T2) Table S2: All the mutations in the spike protein that have positive correlation with the test positivity percentage across the recent timeline (Feb-Apr 2021) of pandemic in India has been tabulated here. Following are the expansion of the abbreviations used in the table header - Total Seqs. Dep. : Total number of sequences deposited in the particular month in India. Test Pos.% : Test positivity percentage, Mut Prev.%: Mutation prevalence percentage, Rho (Pearson) Mut Prev.% vs Test Pos.% : The Pearson correlation Rho value between test positivity and mutational prevalence, Test pos. List : test positivity percentage over the window of 3 months (February 2021 to April 2021), Mut Prev. List : mutation prevalence percentage over the window of 3 months, MaxΔ Mut Prev. : maximum difference in the mutational prevalence percentage observed over the window of 3 months. Total sequences depos View this table: [Table S3:](http://medrxiv.org/content/early/2021/05/31/2021.05.23.21257668/T3) Table S3: All the mutations in the spike protein that have positive correlation with the test positivity percentage across the recent timeline of pandemic (Feb-Apr 2021) in Chile has been tabulated here. Following are the expansion of the abbreviations used in the table header - Total Seqs. Dep. : Total number of sequences deposited in the particular month in Chile. Test Pos.% : Test positivity percentage, Mut Prev.%: Mutation prevalence percentage, Rho (Pearson) Mut Prev.% vs Test Pos.% : The Pearson correlation Rho value between test positivity and mutational prevalence, Test pos. List : test positivity percentage over the window of 3 months, Mut Prev. List : mutation prevalence percentage over the window of 3 months, MaxΔ Mut Prev. : maximum difference in the mutational prevalence percentage observed over the window of 3 months. View this table: [Table S4:](http://medrxiv.org/content/early/2021/05/31/2021.05.23.21257668/T4) Table S4: Schematic representation of the decision schema for considering a residue *R* to be a part of a RDR. Deletion count of <=22 is represented by **⨯**. Deletion count of >=23 is represented by **✓**. ## Acknowledgements The authors thank Murali Aravamudan and Eric Schadt for the careful review and feedback on this manuscript. * Received May 23, 2021. * Revision received May 23, 2021. * Accepted May 31, 2021. * © 2021, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/) ## References 1. 1.COVID-19 map - johns Hopkins Coronavirus resource Center. [https://coronavirus.jhu.edu/map.html](https://coronavirus.jhu.edu/map.html). 2. 2.Mallapaty, S. India’s massive COVID surge puzzles scientists. Nature 592, 667–668 (2021). 3. 3.Pawlowski, C. et al. FDA-authorized COVID-19 vaccines are effective per real-world evidence synthesized across a multi-state health system. MedRxiv (2021). 4. 4.Corchado-Garcia, J. et al. Real-world effectiveness of Ad26.COV2.S adenoviral vector vaccine for COVID-19. doi:10.1101/2021.04.27.21256193. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoibWVkcnhpdiI7czo1OiJyZXNpZCI7czoyMToiMjAyMS4wNC4yNy4yMTI1NjE5M3YxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMDUvMzEvMjAyMS4wNS4yMy4yMTI1NzY2OC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 5. 5.Dagan, N. et al. BNT162b2 mRNA Covid-19 Vaccine in a Nationwide Mass Vaccination Setting. N. Engl. J. Med. 384, 1412–1423 (2021). 6. 6.Hacisuleyman, E. et al. Vaccine Breakthrough Infections with SARS-CoV-2 Variants. N. Engl. J. Med. (2021) doi:10.1056/NEJMoa2105000. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMoa2105000&link_type=DOI) 7. 7.Kustin, T. et al. Evidence for increased breakthrough rates of SARS-CoV-2 variants of concern in BNT162b2 mRNA vaccinated individuals. bioRxiv (2021) doi:10.1101/2021.04.06.21254882. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1101/2021.04.06.21254882&link_type=DOI) 8. 8.CDC. SARS-CoV-2 Variant Classifications and Definitions. [https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/variant-surveillance/variant-info.html](https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/variant-surveillance/variant-info.html) (2021). 9. 9.COVID-19 Virtual Press conference transcript - 10 May 2021. [https://www.who.int/publications/m/item/covid-19-virtual-press-conference-transcript10-may-2021](https://www.who.int/publications/m/item/covid-19-virtual-press-conference-transcript10-may-2021). 10. 10.Barnes, C. O. et al. SARS-CoV-2 neutralizing antibody structures inform therapeutic strategies. Nature 588, 682–687 (2020). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=33045718&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F31%2F2021.05.23.21257668.atom) 11. 11.Zost, S. J. et al. Rapid isolation and profiling of a diverse panel of human monoclonal antibodies targeting the SARS-CoV-2 spike protein. Nat. Med. 26, 1422–1427 (2020). 12. 12.Liu, L. et al. Potent neutralizing antibodies against multiple epitopes on SARS-CoV-2 spike. Nature 584, 450–456 (2020). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-020-2571-7&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32587975&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F31%2F2021.05.23.21257668.atom) 13. 13.McCallum, M. et al. N-terminal domain antigenic mapping reveals a site of vulnerability for SARS-CoV-2. Cell 184, 2332–2347.e16 (2021). 14. 14.Cerutti, G. et al. Potent SARS-CoV-2 neutralizing antibodies directed against spike N-terminal domain target a single supersite. Cell Host Microbe 29, 819–833.e7 (2021). 15. 15.McCarthy, K. R. et al. Recurrent deletions in the SARS-CoV-2 spike glycoprotein drive antibody escape. Science 371, 1139–1142 (2021). [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEzOiIzNzEvNjUzNC8xMTM5IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMDUvMzEvMjAyMS4wNS4yMy4yMTI1NzY2OC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 16. 16.Ritchie, H. et al. Coronavirus Pandemic (COVID-19). Our World in Data (2020). 17. 17.Shu, Y. & McCauley, J. GISAID: Global initiative on sharing all influenza data - from vision to reality. Euro Surveill. 22, (2017). 18. 18.PANGO lineages. [https://cov-lineages.org/lineages/lineage\_B.1.617.html](https://cov-lineages.org/lineages/lineage_B.1.617.html). 19. 19.Kuppalli, K. et al. India’s COVID-19 crisis: a call for international action. Lancet (2021) doi:10.1016/S0140-6736(21)01121-1. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0140-6736(21)01121-1&link_type=DOI) 20. 20.Taylor, L. Covid-19: Spike in cases in Chile is blamed on people mixing after first vaccine shot. BMJ 373, 1023 (2021). 21. 21.Chi, X. et al. A neutralizing human antibody binds to the N-terminal domain of the Spike protein of SARS-CoV-2. Science 369, 650–655 (2020). [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEyOiIzNjkvNjUwNC82NTAiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMS8wNS8zMS8yMDIxLjA1LjIzLjIxMjU3NjY4LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 22. 22.Anand, P., Puranik, A., Aravamudan, M., Venkatakrishnan, A. J. & Soundararajan, V. SARS-CoV-2 strategically mimics proteolytic activation of human ENaC. Elife 9, (2020). 23. 23.Johnson, B. A. et al. Loss of furin cleavage site attenuates SARS-CoV-2 pathogenesis. Nature 591, 293–299 (2021). 24. 24.Coutard, B. et al. The spike glycoprotein of the new coronavirus 2019-nCoV contains a furin-like cleavage site absent in CoV of the same clade. Antiviral Res. 176, 104742 (2020). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.antiviral.2020.104742&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32057769&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F31%2F2021.05.23.21257668.atom) 25. 25.Walls, A. C. et al. Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein. Cell 181, 281–292.e6 (2020). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=doi.org/10.1016/j.cell.2020.02.058&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F31%2F2021.05.23.21257668.atom) 26. 26.Liu, Y. et al. Neutralizing Activity of BNT162b2-Elicited Serum. N. Engl. J. Med. 384, 1466–1468 (2021). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMc2102017&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=33684280&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F31%2F2021.05.23.21257668.atom) 27. 27.Wang, P. et al. Antibody resistance of SARS-CoV-2 variants B.1.351 and B.1.1.7. Nature 593, 130–135 (2021). 28. 28.Our World in Data. [https://ourworldindata.org/](https://ourworldindata.org/). 29. 29.Hasell, J. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F05%2F31%2F2021.05.23.21257668.atom) 30. 30.COG-UK. [https://pangolin.cog-uk.io/](https://pangolin.cog-uk.io/). 31. 31.Nextclade. [https://clades.nextstrain.org/](https://clades.nextstrain.org/). [1]: /embed/graphic-7.gif [2]: /embed/graphic-8.gif