Variant abundance estimation for SARS-CoV-2 in wastewater using RNA-Seq quantification

Effectively monitoring the spread of SARS-CoV-2 variants is essential to efforts to counter the ongoing pandemic. Wastewater monitoring of SARS-CoV-2 RNA has proven an effective and efficient technique to approximate COVID-19 case rates in the population. Predicting variant abundances from wastewater, however, is technically challenging. Here we show that by sequencing SARS-CoV-2 RNA in wastewater and applying computational techniques initially used for RNA-Seq quantification, we can estimate the abundance of variants in wastewater samples. We show by sequencing samples from wastewater and clinical isolates in Connecticut U.S.A. between January and April 2021 that the temporal dynamics of variant strains broadly correspond. We further show that this technique can be used with other wastewater sequencing techniques by expanding to samples taken across the United States in a similar timeframe. We find high variability in signal among individual samples, and limited ability to detect the presence of variants with clinical frequencies <10%; nevertheless, the overall trends match what we observed from sequencing clinical samples. Thus, while clinical sequencing remains a more sensitive technique for population surveillance, wastewater sequencing can be used to monitor trends in variant prevalence in situations where clinical sequencing is unavailable or impractical.

variant genome, as well as a collection of background (non-variant of concern/interest) sequences, such 133 that the variant abundance ranges from 0.05% to 100%. Analogously, we created a second series of 134 benchmarks, simulating reads only from the Spike gene of each SARS-CoV-2 genome. We refer to the 135 first set of benchmarks as "whole genome" and to the second set of benchmarks as "Spike-only". Finally,

136
we performed these benchmarking experiments at different sequencing depths: 100x and 1000x 137 coverage for the whole genome benchmarks, and 100x, 1000x, and 10,000x coverage for the Spike-only 138 benchmarks (Table 1 and Fig 2).

140
Predicting variant abundance can be difficult when a variant is present at very low frequency, because of 141 the high degree of similarity between lineages. On our simulated datasets, where we know the true 142 frequency of each variant, we observe a background noise of 0.01--0.09% (Fig S2), meaning that some 143 sequences are falsely predicted to be present at 0.01--0.09% abundance. These false positives are likely 144 due to shared mutations or conserved sequences between lineages. The level of background noise tends 145 to be higher for whole genome benchmarks than for Spike-only benchmarks, because the majority of 146 defining variant mutations are in the Spike gene (Figs S1 and 2). In both cases, we apply a threshold of 147 0.1% abundance to include a sequence in the lineage abundance computation and we only report the   additional results are shown in Figure S3. In general, variant frequencies tend to be underestimated, in 155 particular on whole genome data. This is another consequence of shared polymorphisms between 156 relatively closely related lineages: a fraction of the reads is assigned to other, locally identical genomes, lineages and the more unique polymorphisms are associated with it, the lower the number of false and estimated frequency, relative to the true frequency. We observe that relative frequency estimation 163 errors are highest at low frequencies, where a small deviation in the absolute sense makes a large 164 relative difference but are relatively stable at true variant frequencies of at least 1%.

166
Besides underestimation, we also notice overestimation, particularly at low variant frequencies. This is 167 due to shared sequence between the variant and the background lineages: in datasets where the variant 168 is present at a low frequency, the background lineages must be present at relatively high frequencies.

169
Any shared sequence between a background lineage and the variant will then lead to more reads being 170 assigned to the variant, hence overestimation. This effect only applies when background lineages with 171 shared sequence are more abundant than the variant. In Figure 2

183
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted September 2, 2021. ; https://doi.org/10.1101/2021.08.31.21262938 doi: medRxiv preprint

204
To evaluate false positive and false negative predictions, we computed for each experiment the overall 205 false positive rate (FPR), false negative rate (FNR), precision, and recall (

262
We then compared our wastewater abundance predictions to variant frequency estimates from data 263 generated by sequencing remnant clinical diagnostic samples (mostly nasal swabs) in New Haven

264
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted September 2, 2021. ; https://doi.org/10.1101/2021.08.31.21262938 doi: medRxiv preprint with stronger underestimation for lower frequencies. This is consistent with what we see in Figure  instead of the whole SARS-CoV-2 genome (Fig 2). In practice, however, using only reads aligning to the 273 Spike gene captures too little information due to incomplete genome coverage and amplification bias.

275
Kallisto offers a bootstrapping feature, through which the sequencing data is resampled at least 100 times

280
However, this type of analysis captures only computational noise, and not technical noise (e.g. sampling 281 bias). The fact that our predictions are more variable than the clinical data while confidence intervals from 282 bootstrapping are narrow suggests that wastewater sequencing data is highly stochastic and not always 283 representative of the infections in the population.

284
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

296
but reduced genome coverage compared to sludge samples. Nevertheless, the same quality filtering 297 parameters apply: Ct < 31 or Ct < 34 with at least 0.5M reads aligned to select for samples with high 298 coverage (Fig S7).

300
After this filtering step, we predicted variant abundance for the 30 remaining datasets (corresponding to 301 16 different locations across 8 states).

305
this is the only statistic we can compare against across all states. We observe that, while individual 306 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted September 2, 2021. ; https://doi.org/10.1101/2021.08.31.21262938 doi: medRxiv preprint primarily found in California; and B.1.526 was predicted most abundantly in New York and Connecticut.

309
Other variants (B.1.351, P.1) were not observed in GISAID for these states at the time of sampling and 310 our predictions for these variants agree: B.1.351 was predicted to be present at very low frequency in 4 311 samples and absent in all other samples; P.1 was predicted present in a single dataset at 1% abundance 312 and absent in all others (Fig S8). Although these predictions may be false positives, at the time P1 was 313 thought to be likely at such low prevalence that these cases were not picked up by the sequencing efforts 314 in place.

315
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

320
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted September 2, 2021. ; https://doi.org/10.1101/2021.08.31.21262938 doi: medRxiv preprint Our results show that methods for RNA transcript quantification can be applied to wastewater sequencing data to obtain consistent and relevant variant abundance estimates. This technique can be readily applied not appreciably more difficult than setting up a "reference set" of potential variants and running existing 325 tools on whatever the sequencing data happens to be. While this reference set approach allows easy 326 updating as new variants appear, it also means that this approach cannot be used to detect new variants, 327 but only to near-optimally impute the mixture of known variants most likely responsible for the observed 328 data. Further, the approach may be readily extended to the detection and quantification of other pathogen 329 lineages present in wastewater.

330
Detecting variants in this fashion appears to be near-optimal on simulated data, with a detection limit 331 under 1%, though when handling real data which is subject to multiple factors that can alter the quality of

364
In applying RNA-seq tools to variant prediction, reference set composition is central. We were able to 365 reduce variant-specific biases substantially by including multiple reference sequences per lineage, thus 366 capturing within-lineage variation to identify variants with highly similar genomes. Additional experiments 367 (data not shown) comparing a US-specific reference set to a global reference set showed that, 368 unsurprisingly, the US-specific reference set gives more accurate predictions for benchmarks with 369 sequences from US origin. This suggests that using a state-or county-specific reference set construction 370 could further improve results and that identification of variants is likely being aided by the presence of 371 hitchhiker mutations which are locally over-represented or non-defining.

372
In conclusion, we present a computational approach for estimating the percent abundances of SARSmunicipality in Connecticut matched those defined by compiled clinical sequence data, and sequencing 375 metrics were interrogated to define the Ct value and other confidence thresholds that ensure optimal 376 performance. In settings across the world where strong clinical variant sequencing programs do not exist, 377 wastewater sequencing can be an effective tool for low cost, efficient monitoring of variant abundance. As 378 this is unlikely to be the last viral pandemic, nor the last with variants of concern, extending these 379 approaches to other viruses and other sample types may allow broader monitoring of real-time pandemic

567
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

596
True predictions occur at higher abundances (beyond the x-axis limit of 1.0%). 597 598 599 600 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted September 2, 2021. ; https://doi.org/10.1101/2021.08.31.21262938 doi: medRxiv preprint 603 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted September 2, 2021. ; https://doi.org/10.1101/2021.08.31.21262938 doi: medRxiv preprint Figure S4: salmon versus kallisto abundance estimates per variant. 610 611 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted September 2, 2021. ; https://doi.org/10.1101/2021.08.31.21262938 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted September 2, 2021. ; https://doi.org/10.1101/2021.08.31.21262938 doi: medRxiv preprint

624
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted September 2, 2021. ; https://doi.org/10.1101/2021.08.31.21262938 doi: medRxiv preprint  628  629  630  631  632  633  634  635  636  637  638  639  640 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted September 2, 2021. ; https://doi.org/10.1101/2021.08.31.21262938 doi: medRxiv preprint Figure S8: Raw predictions per variant with confidence intervals based on bootstrap analysis for samples across the 642 US. Note that in all subplots the y-axis is capped at 1% for improved readability.

643
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted September 2, 2021. ; https://doi.org/10.1101/2021.08.31.21262938 doi: medRxiv preprint