Abstract
The use of SARS-CoV-2 metagenomics in wastewater can allow the detection of variants circulating at community level. After comparing with clinical databases, we identified three novel variants in the spike gene, and six new variants in the spike detected for the first time in Spain. We finally support the hypothesis that this approach allows the identification of unknown SARS-CoV-2 variants or detected at only low frequencies in clinical genomes.
The appearance of SARS-CoV-2 and its rapid spread worldwide is causing serious human and economic loss. Transmission of the virus occurs mainly through aerosol and respiratory secretions1, but it has also been seen that, due to its replication capacity in the gastrointestinal tract2, it is excreted in feces and urine, as previously reported for Severe Acute Respiratory Syndrome (SARS) and Middle-East Respiratory Syndrome (MERS) viruses. For this reason, it has been possible to detect the genetic material of the virus in the feces of both symptomatic and asymptomatic people3. These findings have led to the use of wastewater in SARS-CoV-2 monitoring. As with other pathogens, the use of Wastewater-Based Epidemiology (WBE) is a very useful tool for large-scale epidemiological control4–7. One of the reasons for the success of WBE is that wastewater samples are a non-invasive and inexpensive source of information to investigate the spread of different genetic variants of SARS-CoV-2 within a community. It provides real-time information on the circulating strains of SARS-CoV-2, which is essential for the development of vaccines and drugs. This is particularly relevant in view of the current situation where the worldwide population is being vaccinated against specific SARS-CoV-2 strains. Massive sequencing techniques allow us to analyze a large number of SARS-CoV-2 genomes, including those present in symptomatic and asymptomatic persons. Through the analysis of sequences, it is possible to detect low-frequency variants (LFV) and to know the gene variants that are circulating at a certain time and place8,9. These analyses will make it possible to detect the entry of new lineages into the population, as well as the appearance of polymorphic sites8,9.
In this work, 40 grab samples were collected from April to October 2020 from 14 wastewater treatment plants (WWTPs) located in three geographical regions (north: n=5, center: n=21, and south: n=14) within the Spanish peninsula (Table S1). For each sample, 200 mL of influent wastewater samples were concentrated following an aluminum-based adsorption precipitation method10,11. Nucleic acid was extracted from wastewater concentrates using an automated method, the Maxwell RSC Pure Food GMO and Authentication Kit (Promega) with slight modifications12. SARS-CoV-2 nucleic acid was detected by RT-qPCR using One Step PrimeScript™ RT-PCR Kit (Perfect Real Time) (Takara Bio, USA) targeting three genomic regions of the virus corresponding to N1 region of the nucleocapsid gene (N1), the envelope gene (E), and IP4 region of the RNA-dependent RNA polymerase gene (IP4) using primers, probes and conditions previously described13–15. Genomic sequencing of SARS-CoV-2 from wastewater samples was carried out following ARTIC protocol version 3 for retrotranscription and amplification by multiplex PCR (https://www.protocols.io/view/ncov-2019-sequencing-protocol-v3-locost-bh42j8ye). Sequencing libraries were built using the Nextera DNA Flex Library Prep kit (Illumina) and sequenced on the Illumina MiSeq platform by paired-end reads (2×200 bp). Raw reads were cleaned for adaptors and low quality nucleotides by using cutadapt software16 and reformat.sh from bbmap (sourceforge.net/projects/bbmap/), respectively. Nucleotides with Phred score lower than 30 were discarded. Clean reads were aligned to the genome of SARS-CoV-2 isolate Wuhan-Hu-1 (MN908947.3) using the Burrows-Wheeler Aligner v0.7.17-r118817 and indexed by samtools18. For the analysis of genomic coverage for each sample, only nucleotides with at least 20X depth were taken into account. Nucleotide variants of SARS-CoV-2 isolate Wuhan-Hu-1 (MN908947.3) were detected with the aligned reads using mpileup from samtools19 and the command variants of ivar software20. For the assumption of one variant, at least a 50X depth and quality score higher than 30 were used as cutoff. Alignments were manually curated to avoid nucleotide variants that corresponded to incorrectly trimmed adaptors8.
Among the 40 sequenced samples, 28 samples (70%) showed percentages of 20X coverage values higher than 18% of the reference genome length (Figure 1), while the remaining samples ranged from 1.3% to 13.6% (Figure S1). In order to study the potential correlation between viral loads and genome coverage in wastewater samples, Pearson correlation analyses between RT-qPCR outputs (genome copies per liter) versus genome coverage were carried out for each sample. No strong correlations were found for any of the analyzed targets.
Variant analysis showed a total of 238 nucleotide substitutions and 6 deletions in comparison with the reference genome of SARS-CoV-2 isolate Wuhan-Hu-1 (MN908947.3). Among detected nucleotide variants, 101 polymorphic sites were found in ORF1a polyprotein, 67 in ORF1b, 21 in the spike glycoprotein, 8 in ORF3a, one in the envelope protein, 13 in the membrane glycoprotein, one in ORF7a, 3 in ORF8, 10 in the nucleocapsid gene, one in ORF10, and 12 in intergenic regions (Table 1). In some samples, these nucleotide variants were found in combination with the reference nucleotide present in the genome MN908947.3, referred to as ‘mixed samples’ in Table 1. The percentage of non-synonymous substitutions ranged from 46.1% (in membrane gene) to 100% (in ORF7a and ORF10) (Table 1). Among the nucleotide variants detected in the spike glycoprotein (n=21), 13 of them corresponded to non-synonym substitutions already described (https://mendel.bii.a-star.edu.sg) with the exception of 3 variants that corresponded to the amino acid substitutions G648V, A893T, and L1152S. Seven out of 13 polymorphic sites detected in the spike glycoprotein were not previously described in Spanish genomes (Table 2) according to the database of the Agency of Science, Technology and Research of Singapore (https://mendel.bii.a-star.edu.sg). However, three of these nucleotide changes (amino acid substitutions G404V, G648V, and S884F) have been found at low frequencies among the reads obtained in the sequencing of Spanish genomes from clinical samples. Regarding nucleotide deletions, a total of 6 deletions were found, 4 of them in the ORF1a region (ΔL21/Q22/V23, ΔG82/H83/V84, ΔV84/M85/V86, Δ141K/142S/143), one in the spike glycoprotein (ΔK385), and one in the ORF3a (from amino acid G11 to I20).
Nucleotide variants were detected in positions that define SARS-CoV-2 clades, according to the classification of nextstrain.org, even though all the samples carried sequences corresponding to clade 20A (nucleotide substitutions C241T, C3037T, C14408T, and, A23403G). However, samples C1 and C10 showed mixed sequences of clades 20A and 20C at genomic position 25563 (G25563T). Unfortunately, the other nucleotide position that defines clade 20C (position 1059) was not sequenced in sample C10 and presented a low coverage in sample C1, avoiding the further verification of nucleotide substitution (C1059T) at this position.
These results confirm the potential of sewage sequencing to detect clades and new mutations and variants of SARS-CoV-2, which is of utmost relevance for the monitoring efforts for emerging vaccine-escape SARS-CoV-2 mutants in the forthcoming post-vaccination era. Additionally, genomic sequencing of viruses found in wastewater, yields complementary results to those of clinical laboratories, as has been demonstrated with the three novel nucleotide changes in the spike gene identified in wastewater, or the ones initially detected in low number of reads on genomes from clinical specimens and confirmed in wastewater samples. Nevertheless, the different coverage of the genome within individual samples suggests that analysis through massive sequencing focused on genomic regions of interest, such as the spike or clade-defining nucleotide positions, would allow a more robust characterization of the genomic variants spread in a defined geographical area or community.
Data Availability
Data availability under demand
Contributions
AP-C, AC-O, EC-F, AD-R: Investigation, formal analysis, writing, and reviewing. IF-F, IG-G: Resources, writing, and reviewing. WR, AA, MAB, IC: Writing, and reviewing. AP-C, GS: Conceptualization. GS: Funding acquisition, writing, and reviewing. All authors have read and agreed to the published version of the manuscript.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Names of specific vendors, manufacturers, or products are included for informational purposes only and does not imply endorsement by Authors or their affiliations.
Acknowledgements
This study was supported by projects “VIRIDIANA” (AGL2017-82909/ AEI/FEDER, UE) funded by Spanish Ministry of Science, Innovation and Universities; CSIC (202070E101), and Generalitat Valenciana (Covid_19-SCI). EC-F is recipient of a predoctoral contract from the MICINN, Call 2018. We acknowledge NILSA, FACSA, and MITECO for authorizing the sampling. The authors thank Agustin Garrido Fernández, and Andrea Lopez de Mota for their technical support.