Genomic Surveillance of SARS-CoV-2 in Erie County, New York

Early in the SAR-CoV-2 pandemic, we established a whole genome sequencing pipeline to assess lineages circulating in Western New York. Initial sequences revealed entry into the region via Europe, similar to observations in New York City. However, as the pandemic progressed and variants of concern emerged, we observed distinct patterns in lineages relative to NYC. Notably, B.1.427 became dominant in Western New York, before it was displaced by B.1.1.7. Our hierarchical cluster analysis of B.1.1.7 lineages, which by May 2021 made up ~ 80% of all cases, indicated both multiple introductions and community spread. Our work highlights the importance of widespread, regional surveillance of SARS-CoV-2 across the United States.


Introduction
As SARS-CoV-2 moves through populations around the world, viral genomes accumulate genetic variants as a result of endogenous, low level of replication errors. Genomic information generated during the ongoing pandemic provides important insights into the mode and rate of viral evolution and can assist epidemiological efforts. It can also help support vaccine and diagnostic test development and can be a tool in assessing treatments. Faster and broader generation of genomic sequence can thus contribute to improved public health and can help to prepare for future outbreaks of this or other viruses.
As the SARS-CoV-2 virus spreads from person-to-person, it undergoes multiple rounds of replication, accumulating more mutations, which are specific to that lineage. Therefore, the pattern and combination of mutations provide information about 1) the molecular evolution of the virus and 2) its relatedness to other versions of the virus. Both allow us to develop phylogenetic trees, with which to infer the global path the virus has taken to infect an individual in a particular community. In this way, it is possible to predict that the virus that infected one person in the United States is derived from a lineage that came from China and is distinct from a virus that arrived via Europe. Additional mutations will accumulate within a community, making it possible to monitor community spread via viral genome sequencing. 1 However, we do not yet have a clear picture of the mutation profile of SARS-CoV-2 or rate of genome instability.
The mutations that have been observed through global genome sequencing efforts are typically single nucleotide variants (SNVs) or nucleotide insertions or deletions (in/dels) that lead to frame shift mutations; there is some evidence for mutation clusters (nextstrain.org). 2 There is little information about structural rearrangements that may be occurring as the virus spreads, although it is known that this type of virus does undergo recombination with some frequency. The presence of rearrangements would have significant implications for understanding the phylogeny and spread of the virus, as well as having potential functional consequences. The prediction is that beneficial (for the virus) mutations will become more prevalent in a population. However, their presence may simply be a result of genetic drift.
In coding regions, SNVs may result in synonymous changes (no change in coding sequence) or nonsynonymous changes (change in amino acid residue), which may have functional, biological relevance for viral entry into the cell, infection and disease, and ultimately transmission to a new host. 3 For example, there is some evidence that the increasingly prevalent D614G change in the Spike (S) protein, which arose in European populations, 4 affects the stability of the protein and thus infectivity, using both pseudovirus and live, infectious virus. [4][5][6][7] Non-coding mutations in 5' or 3' UTRs may also play a role in viral function by altering gene . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted July 6, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 expression of genome replication signals and has not been assessed. Furthermore, combinations of mutations could have additive or synergistic effects on viral biological function. Similarly, structural rearrangements may impact viral activities.

SARS-CoV-2 accumulates mutations as it infects new individuals -each round of infection (replication)
has the potential to introduce novel mutations. Those mutations that result in viable virus will be dispersed and allow the virus to continue to spread. The rate of mutation and the variant types have functional consequences for the virus, as well as for development of vaccines against the virus. Regional variations in mutation profiles will functionally distinguish viral lineages and are important for developing public health policy and clinical decisions based on the presence and prevalence of different variants of concern and variants of interest.

Methods and Materials
All SARS-CoV-2 samples were de-identified. This study was reviewed by the University at Buffalo Institutional Review Board and determined to be "Not Human Research" (IRB ID: STUDY00004515).

SARS-CoV-2 Amplicon Tiling, Library preparation and Sequencing
Patient nasopharyngeal swabs are collected from various locations throughout Erie County, New York and processed at the Erie County Public Health Lab (ECPHL) following standard pipelines for COVID PCR testing.
Positive samples are de-identified, re-numbered and the positive RNA samples are delivered to the University at Buffalo Genomics and Bioinformatics Core (UB GBC) for library preparation and sequencing. To ensure efficient sequencing library preparation, we presort positive samples with Ct values between 15-30 and proceed with cDNA manufacture. Samples with these Ct values provide enough template for optimal multiplex PCR whereas Ct values above 30 typically do not.

RNA extraction from ECPHL and Kaleida Health samples
Nasopharyngeal swabs were placed into viral storage media (KHL samples) or in viral lysis buffer (ECPHL) and delivered to the UB Genomics Core either fresh or frozen. All viral RNA extractions were performed in a BL2.1 level Biosafety cabinet. An aliquot (200 μl) of the viral media was removed and spun for 10 minutes at 1500 x g to pellet contaminating cells. Following spindown, 140 ul of the viral supernatant was transferred to a new tube with lysis buffer for RNA extraction using the Qiagen Viral RNA extraction kit following the manufacturer's protocol. Carrier RNA was added to facilitate efficient viral RNA isolation.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 6, 2021. ; https://doi.org /10.1101/10. /2021 Extracted RNA was eluted from the final column using EB buffer (60 μl) and RNA was stored at -80 o C until used for cDNA synthesis and library prep. Once quantified by qPCR, pooled libraries (4 nM) were denatured using 0.2 N sodium hydroxide.
Denatured libraries were diluted to a final concentration in Illumina HT1 buffer(12.5 pM for MiSeq and 1.5 pM for NextSeq) . The diluted libraries were loaded onto an Illumina MiSeq sequencer PE300 V3 sequencing or the NextSeq for PE75 V1.5 sequencing in mid-output mode. Control PhiX was added at 1% for loading alignment control. Upon completion of the sequencing run, data were transferred to the high-performance computing facility (Center for Computational Research) located in the Center of Excellence building at the University at Buffalo.

UB GBC SARS-COV-2 Bioinformatics Analysis
The GBC SARS-COV-2 analysis pipeline (https://github.com/UBGBC/fastq-to-consensus) is modelled off of the recommendations provided by the CDC SARS-COV-2 spheres working group (https://github.com/CDCgov/SARS-CoV-2_Sequencing), and is written in the python pipeline framework Snakemake. This rule-based framework establishes a complete chain of custody of all commands issued with version control, while integrating directly into CCR's SLURM workload manager. To briefly describe the pipeline, sequencer results are first demultiplexed using illumina's bcl2fastq utility, producing FASTQ format files. Then, reads are checked for initial quality using fastqc, fastq_screen, and multiqc, prior to adapter removal analysis via the tool Cutadapt. Once samples are fully evaluated for sequencing run quality, each sample is aligned to the SARS-COV-2 genome build MN908947.2 using the bwa alignment tool. Next, IVAR filters ARTIC primer locations, followed by bcftools variant calling, normalization, and filtering. The GBC pipeline produces highly accurate variant call set in VCF format, capable of detecting SNPS, insertions, and . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 6, 2021. ; https://doi.org /10.1101/10. /2021 deletions. In addition to calling consensus variants, low frequency variants are also captured in separate VCF format files to facilitate more in-depth downstream analysis.
Following consensus sequence generation, variant calls are inspected using Broad's Integrated Genomics Viewer (IGV) browser, files are added to a local instance of Nextstrain, 2 evaluated for quality and clade variants in Nextclade, and uploaded to the GISAID 10, 11 repository for public access. In addition to Nextclade, every sample is also analyzed using the online tool Pangolin COVID-19 Lineage Assigner in order to resolve lineage placement. 12

Hierarchical clustering
To characterize variants that are unique within distinct lineages, hierarchical cluster analysis was performed in the R statistical programming language. Variant call files for every sample belonging to the B.1.1.7 lineage were subset, and clustered with other members of the corresponding lineage. To detect inter-lineage variation, we compared each sample's spread of variants utilizing the Bedtools jaccard function, which generates a Jaccard index score between every sample. The Jaccard index is a measure of similarity and diversity between finite sets. The resulting similarity matrix was then used as input for hierarchical cluster analysis in RStudio. 13

Phylogenetic analysis
Consensus genomes were aligned using the command line version of the MAFFT multiple sequencing alignment algorithm. 14 The resulting alignment was then used as input into the FastTree algorithm [price, 2009; price, 2010}, inferring maximum-likelihood phylogeny using the jukes-cantor distance model of nucleotide evolution, generating a newick formatted phylogenetic tree. For data visualizations, the R packages TreeIO, 15 and ggtree, 16 using the pangolin lineage calls 12 as data overlays to highlight lineage specific branchpoints of the phylogenetic tree. In addition to R based phylogenetic trees, screen captures using our local instance of Nextstrain was used to highlight different variant spreads.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Results and Discussion
Early infections derived from Europe. In April 2020, through a partnership with Erie County Public Health Laboratories (ECPHL), 50 de-identified COVID-19+ patient samples from Erie County were prepared for sequencing. From the initial set of 50 samples, we obtained robust sequence data from 32 genomes. These data were integrated in to the Nextstrain database 2 to predict phylogenetic relationships with viral genomes worldwide. The majority of samples are predicted to have arrived in the United States via Europe (Fig.   1), while a handful arrived via China or Singapore. This was consistent with European origins of SARS-CoV-2 cases in New York City 17 and was distinct from migration patterns on the west coast of the United States, where early infections were predicted to have arrived via Asia. 18 Notably, all of the viruses derived from Europe contained the D614G Spike protein mutation (Fig. 2). 19 We received  Notably, positive SARS-CoV-2 cases were very low in Erie County and WNY during the summer of 2020. The Wadsworth Center, which was doing spot  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 6, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 sequencing from around New York State, similarly saw a significant drop-off in samples from Erie County to sequence over the summer. As a result, there is a significant gap in genomic sequence information.
Unfortunately, cases began to increase again in the fall of 2020, with significant spikes in late November and December, providing samples for genomic surveillance. Sequencing from late 2020 and into 2021 revealed that the majority of samples in Erie County could be assigned to the 20G clade, which is derived from 20C and is concentrated in the United States (Figure 3). 20G is characterized by mutations resulting in the following residues changes: ORF1b N1653D, ORF3a G172V, N67S and N199L.

Detection of variants of concern and variants of interest in early 2021
In 2021, we adopted the pangolin nomenclature in our bioinformatic pipeline, which provided more granularity as SARS-CoV-2 lineages have continued to emerge. A subset of those lineages, e.g. B.1.1.7, B.1.351, P.1, and so forth, are so-called variants of concern (VOC), as defined by WHO and the CDC. These lineages have been associated with increased transmissibility and resistance to some treatment options 20   Transitions from 20C/20G to variants of concern have occurred rapidly over the span of five months.

Fig. 4. Detection of VOCs in Erie County clade breakdown of Erie
County samples. Rapid transitions to variants of interest/concern occurred, with B.1.1.7 becoming the dominant variant as of April 2021.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 6, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 identical. We also identified one P.2 variant, currently designated a variant of interest by WHO and the CDC, originally identified in Brazil.
Through February and March, the number of variants continued to increase in WNY. We also began to sequence SARS-CoV-2 samples from Kaleida Health. In February, an additional thirteen B.1.427 samples and one B.1.429 samples were collected out of 99 samples (14%) from Erie County Public Health Laboratory (ECPHL); no other variants of concern were observed. From Kaleida Health Laboratory (KHL), three of the 40 samples sequenced (7.5%) were B.427, for an overall frequency of 12.2% variants of concern -all B.1.427/429. In March, the proportion of variants of concern increased significantly and additional variants emerged in the samples. From ECPHL, 59/85 (69%) were variants of concern (24 x B. 29/55 (52%) were variants of concern (13 x B.1.427, 11 x B.1.1.7, 5 x B.351 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 6, 2021. ; Although increases in VOCs and VOIs were observed across the country, there were clearly regional differences in the types of variants observed. For example, in WNY, primarily B. This highlights the importance of regional surveillance to provide the most relevant data to clinicians in an area.
We note that the samples sequenced are essentially "samples of convenience" and are therefore not necessarily representative of the absolute percentages of variants of concern and variants of interest in WNY.
We were able to sequence samples to which we had access, which was only a fraction of positive cases in the region in this time period. Indeed, out of the nearly 90,000 positive diagnoses in Erie County, we have profiled just over 1% of these cases. Nonetheless, this data does provide insight into COVID-19 strain dynamics in Erie County, and makes clear that there was a sudden increase in VOCs and VOIs in early 2021, similar to other regions of the United States. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 6, 2021. ; https://doi.org /10.1101/10. /2021 In addition to standard phylogenetic analysis, to gain additional insights into the heterogeneity within  in Erie County, we performed hierarchical cluster analysis as an alternative mode for data visualization. Cluster analysis allowed for the identification of groups of similar variants, which combined with collection date can inform how viral evolution occurred on a local scale. We evaluated the B.  . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)