2020 SARS-CoV-2 diversification in the United States: Establishing a pre-vaccination baseline

In 2020, SARS-CoV-2 spread across the United States (U.S.) in three phases distinguished by peaks in the numbers of infections and shifting geographical distribution. We investigated the viral genetic diversity in each phase using sequences publicly available prior to December 15th, 2020, when vaccination was initiated in the U.S. In Phase 1 (winter/spring), sequences were already dominated by the D614G Spike mutation and by Phase 3 (fall), genetic diversity of the viral population had tripled and at least 54 new amino acid changes had emerged at frequencies above 5%, several of which were within known antibody epitopes. These findings highlight the need to track the evolution of SARS-CoV-2 variants in the U.S. to ensure continued efficacy of vaccines and antiviral treatments.


Genetic characterization and mutation analyses
Using in-house bioinformatic pipelines available at https://github.com/aacapoferri/COV2, SARS-CoV-2 mutation frequencies were determined for each clade/Phase relative to the majority-rule Phase 1 consensus sequence described below, the Wuhan-Hu-1 reference genome (GenBank accession NC_045512.2), or the VOCs (37). Several steps were taken to exclude clade-defining mutations and amplification/sequencing errors. First, majority-rule consensus sequences for each clade were generated from all genomes in Phase 1 and used as references in downstream analyses. This approach allowed the majority clade-associated mutations to be omitted so that only new mutations were detected. Second, a frequency threshold of ≥5% was used to eliminate mutations that were rarely detected and could therefore be PCR errors, sequencing errors, or real changes not indicative of an emerging variant (see also the "Number of sequences to detect mutations" section of the Methods). Mutation frequencies were plotted and annotated using the "Mutation frequency for SARS.R" script (an example is provided at https://github.com/aacapoferri/COV2). Mutations at frequencies ≥5% were noted for each G-based clade and Phase. Heatmaps for clades G, GH, and GR were generated in GraphPad Prism V.8.4.3 to visualize the persistence and emergence of mutations present at frequencies ≥5%.
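The two filtering steps above can be illustrated with a minimal Python sketch. It is not the in-house pipeline: the function names, the assumption that sequences are already aligned to equal length, and the choice to ignore non-ACGT characters are all illustrative assumptions.

```python
from collections import Counter


def majority_consensus(aligned_seqs):
    """Majority-rule consensus of equal-length, pre-aligned sequences."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_seqs)
    )


def mutation_frequencies(aligned_seqs, reference, min_freq=0.05):
    """Map (position, ref_base, alt_base) to its frequency, keeping >= min_freq.

    Positions are 1-based. Ambiguous bases and gaps are skipped (an assumption
    made for this sketch).
    """
    counts = Counter()
    for seq in aligned_seqs:
        for pos, (ref, alt) in enumerate(zip(reference, seq), start=1):
            if alt != ref and alt in "ACGT":
                counts[(pos, ref, alt)] += 1
    total = len(aligned_seqs)
    return {mut: n / total for mut, n in counts.items() if n / total >= min_freq}


# Toy data: Phase 1 genomes define the consensus, later genomes are screened.
phase1 = ["ACGT", "ACGT", "ACGA"]
later = ["ACGT"] * 36 + ["ATGT"] * 3 + ["ACGA"] * 1
consensus = majority_consensus(phase1)
print(mutation_frequencies(later, consensus))
```

In the toy data, the C-to-T change at position 2 is present in 3 of 40 sequences (7.5%) and is retained, while the single T-to-A change at position 4 (2.5%) falls below the threshold and is dropped.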
All sequences for each clade and Phase were included in each analysis with the exception of clade GH, where the number of sequences was too high for measurements of genetic diversity and divergence and, therefore, 2,500 sequences were randomly subsampled for those analyses.
Mutation distributions were determined by assessing the number of mutations per sequence for each clade during each Phase. Statistical shifts in population structure (divergence) were determined using a test for panmixia with a statistical cut-off of p < 10⁻³ (38).
medRxiv preprint doi: https://doi.org/10.1101/2021.06.01.21258185; this version posted June 4, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.
Population genetic diversity was calculated as average pairwise distance (APD) in MEGAX for each clade/Phase (39). These calculations were repeated for each clade in each month, using either all sequences (when at least 10 were available) or randomized subsamples of 50 sequences in triplicate, to determine the APD. Clades with fewer than 10 sequences in a given month were excluded from the analysis for that month.
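As a rough illustration of the monthly APD procedure, the sketch below computes an uncorrected p-distance over all sequence pairs and applies the 10-sequence minimum and triplicate subsampling of 50. The distance model used in MEGAX is not stated here, so the p-distance, the function names, and the fixed random seed are assumptions of this sketch.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped.
    """
    compared = diffs = 0
    for x, y in zip(a, b):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            diffs += x != y
    return diffs / compared if compared else 0.0


def average_pairwise_distance(seqs):
    """Mean p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def monthly_apd(seqs, min_n=10, subsample=50, replicates=3, seed=0):
    """APD for one clade in one month.

    Returns None below min_n sequences, a single value when no subsampling
    is needed, and one value per replicate otherwise.
    """
    if len(seqs) < min_n:
        return None
    if len(seqs) <= subsample:
        return [average_pairwise_distance(seqs)]
    rng = random.Random(seed)
    return [
        average_pairwise_distance(rng.sample(seqs, subsample))
        for _ in range(replicates)
    ]
```

Comparing the replicate values returned for a large month gives a quick check on whether 50 sequences are enough for a stable estimate.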
Identical SARS-CoV-2 genomes were collapsed to determine the number of different variants in the dataset. A simple linear regression was fitted for clades G, GH, and GR in GraphPad Prism V.8.4.3, and the linear equation and goodness of fit (R²) were reported. The slope was interpreted as the rate of change in %APD/month. The length of the SARS-CoV-2 genome is ~30,000 base pairs, which, when multiplied by the slope, gave an approximate number of nucleotide changes/month for a given G-based clade.
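The regression and the conversion to nucleotide changes per month can be sketched as follows. The ordinary least-squares fit is written out by hand rather than reproducing GraphPad Prism, and the conversion assumes the slope is expressed in percent (hence the division by 100); both the function names and that unit interpretation are assumptions.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


def nucleotide_changes_per_month(slope_percent_apd, genome_length=30_000):
    """Convert a slope in %APD/month to approximate nucleotide changes/month.

    Assumes the slope is a percentage, so it is divided by 100 first.
    """
    return slope_percent_apd / 100 * genome_length


# Toy series: month index versus %APD.
months = [1, 2, 3, 4]
percent_apd = [0.01, 0.02, 0.03, 0.04]
slope, intercept, r2 = linear_fit(months, percent_apd)
print(slope, r2, nucleotide_changes_per_month(slope))
```

Under the percent interpretation, a slope of 6.50 × 10⁻³ %APD/month would convert to roughly 1.95 nucleotide changes per month; if the slope were instead a raw proportion, the division by 100 would have to be dropped.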
To examine the number of mutations per sequence in the G-based clades during each Phase, the distribution of the number of mutations relative to the Wuhan-Hu-1 reference genome, measured as Hamming distances, was plotted using in-house pipelines available at https://github.com/Wei-Shao/COV2-Analysis.
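A minimal sketch of a Hamming-distance distribution is given below. It is not the code from the repository above; the function names and the decision to ignore ambiguous bases and gaps are assumptions, and sequences are taken to be pre-aligned to the reference.

```python
from collections import Counter


def hamming_distance(seq, reference):
    """Count sites where an aligned sequence differs from the reference.

    Only unambiguous bases (A, C, G, T) in both sequences are compared.
    """
    if len(seq) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq, reference)
        if a != b and a in "ACGT" and b in "ACGT"
    )


def mutation_count_distribution(seqs, reference):
    """Map number of mutations per sequence to the number of sequences."""
    counts = Counter(hamming_distance(s, reference) for s in seqs)
    return dict(sorted(counts.items()))


reference = "ACGTACGT"
sample = ["ACGTACGT", "ACGTACGA", "TCGTACGA", "ACNTACGT"]
print(mutation_count_distribution(sample, reference))
```

The resulting dictionary can be passed directly to a bar plot to compare distributions between clades or Phases.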

Analysis of Variants of Concern (VOCs)
The four VOCs used in this study were the Pango lineages B.1.1.7 (Nextstrain 20I/501Y.V1, GISAID clade GR, originally isolated in the U.K.), P.1 (20J/501Y.V3, GR, Brazil), B.1.351 (20H/501Y.V2, GH, South Africa), and B.1.427+429 (20C/S:452R, GH, California), sampled at locations in the U.S. After sequence sample processing, the number of mutations per sequence was determined for each VOC. The collection window on GISAID was set to include any submitted sequences spanning 5 months (November 2020-March 2021). We wanted to ensure that all [...] The distribution of the number of mutations per sequence for each VOC was compared to the distribution of its respective GISAID clade during Phase 3, which was closest to the emergence of the VOC. The overall APD for each VOC dataset was calculated and compared to the genetic diversity of the G-based clades in the U.S. during 2020.
Because of the interest in the Spike protein of SARS-CoV-2 for vaccine and therapeutic strategies, VOC-defining mutations in the S gene were specifically examined and compared to sequences obtained from each Phase of infections in 2020. For this analysis, mutations at frequencies less than 5% were included.

Potential effect of mutations
Majority-rule consensus sequences were generated for the G-based clades at each Phase and were aligned to the Wuhan-Hu-1 reference genome. Previously mapped T-cell and B-cell epitopes from Spike and Nucleocapsid were annotated on the reference genome and are included in Table S5. Nonsynonymous mutations observed in the G-based clades that differed from the reference in either T-cell or B-cell epitopes were noted.
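The epitope annotation step amounts to checking whether the amino acid position of each nonsynonymous mutation falls inside a mapped epitope. The sketch below shows one way to do this; the epitope names, coordinates, and mutation labels are hypothetical placeholders, not values from Table S5.

```python
def mutations_in_epitopes(mutations, epitopes):
    """Return (mutation_label, epitope_name) pairs for mutations inside epitopes.

    mutations: iterable of (protein, amino_acid_position, label)
    epitopes:  dict of name -> (protein, start, end), inclusive 1-based range
    """
    hits = []
    for protein, position, label in mutations:
        for name, (ep_protein, start, end) in epitopes.items():
            if protein == ep_protein and start <= position <= end:
                hits.append((label, name))
    return hits


# Hypothetical placeholder data, for illustration only.
epitopes = {
    "hypothetical_S_epitope": ("S", 10, 20),
    "hypothetical_N_epitope": ("N", 190, 210),
}
mutations = [
    ("S", 15, "S:X15Y"),
    ("N", 200, "N:A200B"),
    ("S", 900, "S:C900D"),
]
print(mutations_in_epitopes(mutations, epitopes))
```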

Viral genetic surveillance resources
Table S1.
Average pairwise distance (APD) was calculated per month for sequences within each clade. Time points with fewer than 10 sequences were excluded. Three randomized subsamples of 50 sequences each were analyzed, and the APD was calculated for each to ensure consistency between subsamples. If there were 11-50 sequences for a clade in a given month, no subsampling was performed. A standard linear regression was run for clades G, GH, and GR. The rate of change in %APD over time is noted in the figure along with the goodness of fit (R²). The %APD was plotted for each respective month. The rate of change was 6.50 × 10⁻³ APD/month (clade G), 9.51 × 10⁻³ APD/month (clade GH), and 7.39 × 10⁻³ APD/month (clade GR), which corresponded to 1.

Here, N is the number of sequences that must be sampled from a given sample, f is the frequency of the mutant to be detected, and p is the probability of detecting the mutation at that frequency.
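One common way to relate N, f, and p, assuming each sequence is an independent draw, is p = 1 − (1 − f)^N, which rearranges to N = ln(1 − p) / ln(1 − f). Whether this is the exact expression used in this study is an assumption; the sketch below simply evaluates that standard relationship, and the function names are illustrative.

```python
import math


def detection_probability(n_sequences, frequency):
    """Probability of seeing a variant at least once in n independent draws."""
    return 1 - (1 - frequency) ** n_sequences


def sequences_needed(frequency, probability):
    """Smallest N such that a variant at `frequency` is detected with
    at least `probability`, under independent sampling."""
    return math.ceil(math.log(1 - probability) / math.log(1 - frequency))


n = sequences_needed(0.05, 0.95)
print(n, detection_probability(n, 0.05))
```

Under this model, detecting a mutation present at 5% with 95% probability requires 59 sequences.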

Mutations in G clades that are located in MHC-II HLA-DR T-cell and B-cell epitopes.
Nucleocapsid MHC-I/II T-cell and Spike B-cell epitopes with detected mutations are shown. Nonsynonymous mutations were detected in Clades GH and GR starting in Phase 2 through Phase 3 (highlighted). The amino acid position of the epitope is denoted next to N (Nucleocapsid) and S (Spike). The U.S. frequency of detection is noted for each Phase. Epitopes examined are defined in Table S5.

Table S1. Demographics and Regional Division based on the U.S. Census Bureau. The 2019 estimated residential population for each State is reported. The first reported confirmed plus probable cases of COVID-19 are included (https://covidtracking.com/). The U.S. is generally divided into the Northeast, South, West, and Midwest, and each major region can be divided into sub-regions. These estimated populations were then used to normalize the incidence of COVID-19 cases and deaths in the U.S. at the sub-regional level.
Table S2. Non-associated Clade G mutations that either persisted or emerged during 2020. Data were extracted where, during at least one Phase period, the frequency of a particular mutation was ≥5% compared to the clade Phase 1 majority consensus. The mutation specifies the nucleotide change as well as the amino acid change with coordinates of the gene and its product. Non-synonymous mutations are shown in red.
Table S3. Non-associated Clade GH mutations that either persisted or emerged during 2020. Data were extracted where, during at least one Phase period, the frequency of a particular mutation was ≥5% compared to the clade Phase 1 majority consensus. The mutation specifies the nucleotide change as well as the amino acid change with coordinates of the gene and its product. Non-synonymous mutations are shown in red.
Table S4. Non-associated Clade GR mutations that either persisted or emerged during 2020. Data were extracted where, during at least one Phase period, the frequency of a particular mutation was ≥5% compared to the clade Phase 1 majority consensus. The mutation specifies the nucleotide change as well as the amino acid change with coordinates of the gene and its product. Non-synonymous mutations are shown in red.