Abstract
The use of RNA sequencing from wastewater samples is proven to be a valuable way for estimating infection dynamics and circulating lineages of SARS-CoV-2. This approach has the advantage of being independent from patient population testing and symptomatic disease courses. However, it is equally important to develop easily accessible and scalable tools which can highlight critical changes in infection rates and dynamics over time across different locations given sequencing data from wastewater. Here we provide an analysis of variant dynamics in Germany using wastewater sequencing and present PiGx SARS-CoV-2, a highly reproducible end-to-end pipeline with comprehensive reports. This complete pipeline includes all steps from raw-data to shareable reports, additional taxonomic analysis, deconvolution and geospatial time series analysis. Using our pipeline on a dataset of wastewater samples from different locations across Berlin from February 2021 to June 2021, we could reconstruct the dynamic of the Variant of Concern (VoC) B.1.1.7 (alpha). Additionally, we detected the unique signature mutation M:T26767C for the VoC delta B.1.617.2 (delta) and its rise in late May. This is around 1 week earlier than the increase of the proportion of detected delta cases with 6% in the first week of June and 18% in the second week. We also show that SARS-CoV-2 mutation load measured from wastewater sequencing is correlated with actual case numbers and it has potential to be used in a predictive manner. All in all, our study provides additional evidence that systematic wastewater analysis using sequencing and computational methods can be used for modeling the infection dynamics of SARS-CoV-2. In addition, the results show that our tool can be used to identify new mutations and to detect any emerging new lineages of concern. Our approach can support efforts for establishing continuous monitoring and early-warning projects for detecting SARS-CoV-2 or any other detectable pathogen.
Introduction
The ongoing COVID-19 pandemic highlighted the need for better monitoring systems for emerging pathogens and pathogenic variants in order to quickly respond to changing epidemic dynamics. Acknowledging the importance and potential impact of wastewater-borne epidemiological analysis, the European Commission has recently recommended to implement continuous monitoring on SARS-CoV-2 through wastewater in all member states [1]. SARS-CoV-2 is a positive strand RNA virus from the family Coronaviridae, genus Betacoronavirus [2, 3] and several studies showed that it can be shed in feces, urine, and saliva [4–6], raising concerns about possible environment-based transmission [7] but also opening up the possibility of Wastewater Based Epidemiology (WBE) vigilance. An alternative to individual patient tests that are expensive and have privacy consent issues, WBE has been used, on a small scale, for different enteric microorganisms such as vaccine and wildtype polioviruses [8], rotaviruses, hepatitis A, astroviruses, adenoviruses, and noroviruses [9]. In the COVID-19 pandemic, wastewater monitoring has been shown to be an effective tool for monitoring incidence rates. Multiple studies showed that it is possible to detect viral RNA even before widespread clinical reports [10–13], suggesting a potential as a monitoring and early alert system.
Several WBE initiatives for SARS-CoV-2 monitoring were established worldwide, and currently, the COVIDpoops19 initiative [14] lists 88 dashboards from 263 universities monitoring 2302 sites. However, most studies are based on RT-qPCR analyses, limited to quantifying the viral titer and/or tracking a few known variants, correlating the results with the reported number of cases in the area. A few studies have been using amplicon sequencing or metagenomics covering the whole viral genome, allowing to track the change of proportions on signature mutations [15–17]. However, quantifying Variants of Concern (VoC) by NGS reads remains challenging, because phasing is difficult with the fragmented sequences generated. Moreover, sequencing and quantifying variants are just the first steps in understanding the dynamics of the outbreaks. The sequencing results should be easily analyzed and combined with geospatial time series analysis. Tracking of VoCs over time and space, can inform policy-making decisions in order to control new outbreaks.
Overall, we aimed to build computational and methodological capacity to monitor emerging SARS-CoV-2 lineages and mutations via wastewater sequencing. For monitoring purposes, we sampled wastewater treatment plants from February 19th to June 10th, 2021. We used the ARTIC protocol [18] for Illumina amplicon sequencing covering the whole SARS-CoV-2 genome to sequence the samples. Finally, we developed a reproducible, open-source pipeline for analyzing continuous sampling of wastewater treatment plants to track signature mutations and Variants of Concern.
Results
A reproducible computational pipeline for tracking SARS-CoV-2 in wastewater
We developed a new pipeline - PiGx SARS-CoV-2 - in the framework of our previously published set of pipelines called PiGx [19]. They are designed with a special focus on usability and reproducibility. The new pipeline was added to the PiGx collection of pipelines and it is distributed together, using GNU Guix (See Figure 1 for a diagram of the workflow). The pipeline comes with all the needed tools and their dependencies and can thus be reproduced on different systems independent of any other installed software.
General description of the PiGx SARS-CoV-2 pipeline
The PiGx SARS-CoV-2 pipeline provides end-to-end data processing and analysis for wastewater RNA sequencing. The pipeline takes the input of targeted sequencing of SARS-CoV-2 RNA with geo-tagged samples. The pipeline takes a set of raw fastq read files, additional processing information for the reads and information about the lineages that should be tracked. After quality check and alignment, the variants are called and annotated. The samples from different timepoints are used to produce time-series reports that track trending mutations over time. We use a particular deconvolution step to also track the proportions of lineages representing Variants of Concern over time. Overall, the pipeline returns a set of reports that provide overviews over lineage and single-mutation abundance in each sample, a taxonomic classification analysis of unaligned reads, and detailed quality control information. Furthermore, all per-sample results are summarized as tables and also combined to visualize time-series and geo-location plots, making the pipeline suitable for continuous sampling.
The pipeline needs local databases (downloaded by the user) for some of the annotation and alignment tools, such as Ensembl VEP, Kraken2, and Krona tools, while the tools themselves are automatically installed. Furthermore, the user needs to provide: (i) a sample sheet (CSV format) containing information about sampling date and location; (ii) a settings file (YAML format) for specifying the experimental setup and optional custom parameter adjustments, (iii) a mutation sheet containing the lineages of interest and their signature mutations in nucleotide notation and and BED file containing their genomic coordinates; (iv) the reference genome of the target species (see Methods for a detailed description); (v) BED file containing the PCR primer locations (provided with the pipeline for ARTIC protocol).
To ensure reliable variant calling and robust lineage abundance prediction, the sample has to match stringent quality control measures. For this, information about the sequencing primers, adapters, and also a BED file containing the sites of the signature mutations is necessary. Specifically the latter is important to ensure comparability of the called variants across all processed samples.
Given these input files, the pipeline executes a series of quality check, alignment, variant calling, deconvolution and mutation trend analysis steps. In the end, it provides interactive reports with quality check, geospatial and time-series information for mutations and lineages, as well as downloadable files for the downstream analysis.
Berlin wastewater SARS-CoV-2 sequencing and analysis
We sequenced a total of 67,783,582 reads from 38 samples collected at four different wastewater treatment plants in Berlin operated by “Berliner Wasserbetriebe’’ during a 125 days interval from 06th of February to 10th of June 2021. We analyzed these RNA sequencing results using our pipeline.
The average number of covered signature mutation sites per sample was 81.3 (from a total of 94 tracked signature mutations, see mutations sheet in the Supplementary Table S1). From the 38 samples, 13 samples did not pass the defined quality control threshold (< 90% of the signature mutation sites covered). The Supplementary Table S2 shows the quality control results for each sample.
We were able to align from 22.3% to 99% of the reads to the reference SARS-CoV-2 genome, and the resulting alignments were used for variant calling. We were able to detect a total of 1,907 mutations (from those, 55 are signature mutations) across all the samples (See methods for details on alignment and variant calling). The overall frequency of mutations per sample is shown on Supplementary Table S3. All counts for the found signature mutations (per sample and pooled) can be found in Supplementary Table S4.
Relationship between genome coverage from wastewater sequencing and case numbers
In order to obtain comparable results for the lineage and mutation analysis over time, a minimum reference genome coverage despite varying infection dynamics is needed. To test the minimal coverage necessary, we compared how much genome coverage we get from our samples over time with the number of COVID-19 cases in Berlin shown in Figure 2. Different sampling locations or their characteristics were not taken into consideration.
Figure 2 B shows that the genome coverage (samples pooled by day) does not correlate (R2 = 0.023) with the infection dynamic. The genome coverage matches our minimal requirement for the analysis (>=90%) in 56% of the days (see Supplementary Table S5). It is also >=90% (with few exceptions) for most of the sample time (see Supplementary Figure 1) and consistentl drops below 90% when cases are fewer than ∼150 (Supplementary Table S5). From the data, it is not clear whether the dips in mid-March and mid-April are related to the following decreases in case numbers or sample quality.
Emerging mutations can be teased out from time-series analysis
The time-series nature of the data can be also used to identify trending mutations for SARS-CoV-2 in wastewater. By tracking the frequencies of mutations over time we were able to highlight any mutation which shows strong increasing trends. We applied a linear regression model for each mutation using the date of sampling as the independent variable to identify mutations with strong increasing or decreasing trends over time (see methods for details). We considered significant mutations where t-test p-value was smaller than 0.05 (Supplementary Table S6). Overall, nine mutations were significantly changing over time (Figure 3).
Seven of those mutations (denoted here with the resulting protein mutation as translated by Ensembl VEP in the following pattern gene:protein-mutation::nt-mutation) S:A570D::C23271A, ORF1ab:P4804-::C14676T, ORF8:Q27*::C27972T, S:D1118H::G24914C, N:D3E::T28282A, N:D3H::G28280C, N:D3V::A28281T are mutations characterizing uniquely for the alpha variant. The mutation M:I82T::T26767C is characterized uniquely for delta and ORF1ab:D1893-::C5944T is a mutation that is not tracked as a signature mutation for any Variant of Concern. However it is reported as a mutation characterizing a subclade of B.1.1.7.
Deconvolution of mutation frequencies infers SARS-CoV-2 lineage frequencies
We have also developed the capability to deconvolve the frequencies of VoCs from pooled sequencing reads. Briefly, the deconvolution method uses signature mutations for each VoC and tries to discern the proportions of these variants making up the observed mutation frequencies in the pooled (bulk) sequencing reads obtained from the wastewater. In this study, we tracked four lineages which are currently classified as VoCs: B.1.1.7 (alpha), B.1.351 (beta), P1 (gamma) and B.1.1617.2 (delta). We characterized the lineages with a mutation sheet (Supplementary Table S1) containing signature nucleotide mutations from covidCG [20]. We took a list of mutations with a sequence consensus threshold of 70%. We included mutations that are unique for each lineage as well as mutations that are shared by two or more lineages. However, the pipeline is flexible and can track more variants if the signature mutations are provided in nucleotide format.
Next, we applied this deconvolution method (based on the frequencies of the signature mutations) to infer the proportions of each lineage on each sample (Supplementary Table S7). The lineage frequencies are predicted using a regression model based on the observed frequencies of the signature mutations for each lineage. Additionally, during the deconvolution process, we weighted the tracked lineages differently based on how many signature mutations were found for each of them for a given sample. This step is necessary in order to get more precise predictions of lineages with low abundance and for which only few or only shared mutations were found (See Methods for details).
Figure 4A shows VoC proportion changes over time for each wastewater treatment plant in Berlin. Overall we can see an increase in B.1.1.7 (alpha) that had 29% on February 19th (beginning of sampling) and increased to 92% on June 10th (end of sampling) with a peak of 99% on May 25. Also B.1.351 (beta) increased from zero detection in February to 2% on June 10 peaking on May 27 with 6.8%. The B.1.617.2 (delta) lineage was barely detected with 3% over the sampling time increasing to 4% on June 10. For P1 we can detect a decrease from 8.6 % on February 19 to zero detection in June. Similarly the proportion of the calculated reference strain (labeled as “WT”, see Methods “Signature matrix construction” for details) decreased from 55% on February 19th to 3% on June 10 with an intermediate peak of 16% on May 12th. Unpooled results for single locations are attached as Supplemental material.
In order to see if the predicted results can represent the reality we compared the deconvolution results with lineage analysis data published by the Robert Koch-Institute (RKI) for Germany (Figure 4B). Hereby, lineage dynamics for Germany are very comparable to the dynamics within the city of Berlin only. We can see that our predicted lineage frequencies are very similar to the reported frequencies. Only B.1.1.7 shows mostly higher predicted values, but with very similar trends. Most importantly also the lineages with very low abundance and for which only very few signature mutations where found (data shown in Supplementary Table S4) could be predicted accurately.
SARS-CoV-2 mutation load in wastewater is correlated with the incidence rate
We hypothesized that more infections would lead to more mutations in the SARS-CoV-2 genome and therefore these two quantities will be correlated even though we calculate the mutation load from wastewater samples rather than the genetic material obtained from patients. In order to test that, we calculated mutation load as the number of non-signature mutations obtained from SARS-CoV-2 wastewater sequencing experiments. We correlated Berlin/Germany case numbers as a measure of incidence rate and mutation load and found a significant association between incidence rate and mutation load obtained from wastewater (adjusted R2 = 0.35, Pearson’s correlation coefficient = 0.63, t-test p-value = 0.03, see Figure 5 and Supplementary Table S8). We have also performed a cross-correlation analysis between case numbers and mutation load with different time lags. The highest correlation with different time lags were not better than the ones without the lag (See Supplementary Figure 1).
We also checked if RT-qPCR results behave the same way using the correlation analysis. For RT-qPCR, we used four pairs of primers for SARS-CoV-2 detection (RT-qPCR) on the wastewater samples. Due to the very low amount of viral particles present overall, we decided for a semi-quantitative approach, instead of using the Ct values, calculating the number of positive detections divided by the number of total reactions carried, grouping all the samples for each day (See Methods for details). The daily percentage of positive qPCR reactions ranges from 0 to 62.5% (Supplementary Table S9). We used the same approach as with the mutation load analysis. We also found positive but no significant correlation with RT-qPCR results and incidence rates (adjusted R2 = 0.15, coefficient =0.46, t-test p-value = 0.07, See Figure 5 and Supplementary Table S8). In addition, we have also repeated the cross-correlation analysis between incidence rate and RT-qPCR results with different time lags. In this case, lag= −1 week also had positive correlation with the incidence rate (adjusted R2 = 0.25, coefficient = 0.5, t-test p-value = 0.03, See Supplementary Figure 1).
The proportion of samples with SARS-CoV-2 RNA starts to strongly increase to 25% on February 25th from 8% on February 19th. The number of cases however are only starting to increase from ∼360 on March 11th to ∼548 on March 18th. This is an offset of nearly three weeks. The second increase in proportion was from 25% on March 25th to 58% on April 8th. And the following case increase was shown from ∼547 on April 8th to ∼841 cases on April 15th which is an offset of one week. So on average the increase of SARS-CoV-2 RNA in wastewater was shown two weeks earlier than the increase in detected COVID-19 cases (Supplementary Table S9).
Discussion
In many countries like Germany, epidemiological monitoring of SARS-CoV-2 is largely dependent on PCR-based methods without sequencing which is applied on patient samples. These techniques can be used for variant detection only after a concerning new lineage is detected and an appropriate assay was developed. In order to discover new lineages, we need to be able to call mutations of the SARS-CoV-2 genome which can be done using sequencing. However, sequencing-based techniques are deployed on only a fraction of the patient population. Wastewater monitoring emerged as a viable option to track the prevalence of COVID-19 and also for the emergence of different lineages [21] at the population level not only because it is faster and cheaper than sequencing of samples derived from testing but it can also be more representative (without bias through the choice of which samples are going to be sequenced). Furthermore it can also be used to track emerging mutations or lineages of SARS-CoV-2. However, sequencing of SARS-CoV-2 material obtained from wastewater presents data analysis challenges as the samples are potentially from numerous patients, and have lower quality than material obtained directly from patients. In addition, the analytical workflows should be able to deal with samples from multiple locations and time points and combine the information in an easily accessible manner.
In order to address these challenges, we have built a reproducible analytics pipeline that takes in raw sequencing reads and provides sharable interactive reports with geospatial information, and mutation and lineage tracking features over time. In comparison to other commonly used pipelines for variant analysis like V-pipe [22] or the recommended ARTIC bioinformatics pipeline [23], PiGx SARS-CoV-2 additional features (discussed below) improved usability, reproducibility, and application for environmental samples like wastewater. In addition, the geospatial tracking allows to compare and monitor infection dynamics from different locations (See example reports in Data access section). In terms of usability, the novelty with PiGx SARS-CoV-2 is that the output reports include result visualization for each sample individually and also for the overview and summary of all samples with a choice of visualization methods that are straightforward to interpret. Furthermore, all outputs relevant for the assessment for lineages, quality control and mutations are produced in human-readable format such as HTML reports from which CSV files can be extracted. That makes further data analysis easier by providing formatted tables. Last but not least, PiGx SARS-CoV-2 offers state-of-the-art software reproducibility thanks to GNU Guix [19].
The pipeline comes with built-in flexible quality control metrics since samples from wastewater can have more frequent quality issues. In our analysis, we applied a strict cutoff for genome coverage (>=90%) to reduce noise in our predictions. Our pipeline also allows the user to input their own reference genome and their own set of signature mutations and lineages. As an additional step for QC, we implemented a taxonomic classification of reads that did not align to the SARS-CoV-2 reference genome. Since we used a PCR based protocol, we expect some degree of nonspecific amplifications, so it is of great help to have an additional control by means of the taxonomic classification to assess potential biases [24]. Also since Kraken2 is a k-mer classifier, this method can reveal reads that match SARS-CoV-2 but are not aligned by stringent alignment tools. This is important to know because it provides insights about potential loss of new mutations missed on the alignment. This step allows the user to investigate potential issues and, if necessary, to adjust the alignment stringency. In our results, the taxonomy report shows a large amount of fragments (∼2 million reads across all samples) is mapped to SARS-CoV-2 (Supplement Table S10) and we will investigate this further.
One of the primary features of our approach is built-in tracking of emerging mutations. This feature allowed early prediction of lineages such as B.1.617.2 from a single signature mutation M:I82T::T26767C (Figure 3) in our dataset. We were able to detect the lineage before it was detected in the population (Figure 4 B). This specific mutation was described to be associated with critically increased viral fitness [25]. This analysis and results are also visualized without the need for any additional steps directly in the summarizing report. We showed that our pipeline and its reports can be a valuable tool for early warning predictions and to guide targeted further analysis.
Another key feature of our approach is the deconvolution method that helps us identify the proportion of lineages present in a pooled sample such as wastewater samples. By making use of a weighted regression method we were able to provide accurate estimates of lineage proportions for our samples over time. For the four VoCs that we tracked with signature mutations, we showed in Figure 4 that our model can accurately predict the composition of lineages when comparing with the number of cases reported during the same time frame, even with very low frequencies. It is important to note that the mutations commonly used for tracking B.1.1.7 in other studies, S:N501Y::A23063T and del69/70 [26, 27] were rare or not found, respectively. However, mutations S:A570D::C23271A, ORF1ab:P4804-::C14676T, ORF8:Q27*::C27972T, S:D1118H::G24914C, N:D3E::T28282A, N:D3H::G28280C, and N:D3V::A28281T were found and mostly important for our predictions. The method predicts the increase of the VoC B.1.617.2 (delta) in June 2021. Supplement Figure 2 shows that the single mutation M:I82T::T26767C—a unique signature mutation of B.1.617.2—can predict the following increase in delta-related case numbers two weeks earlier (see Supplementary Table 11). This increase cannot be seen with similar scale with the results from the deconvolution. Possible reasons are that the deconvolution is influenced by PCR bias and inaccuracies because only few (at most four) signature mutations for the B.1.617.2 lineage were found in June. Conclusively it can be said that there is room for improvement for the model’s accuracy on lineages with low abundance, but having the individual mutation detection in place can still guide early detection of increasing appearance of potential new lineages of concerning mutations.
As reported in previous studies in other cities around the globe [28], we showed that also for Berlin the quantification from wastewater can reveal increasing but also decreasing infection dynamics potentially earlier than it is possible from clinical testing. Although RT-qPCR results are not fully quantitative, observing this expected trend was important and paved the way for more robust lineage and mutation trend analysis using sequencing.
Interestingly, we have also shown that mutation load calculated from SARS-CoV-2 wastewater sequencing is predictive of incidence rates. The fact that mutation load is associated with case numbers (incidence rate) is also shown previously on mutation analysis of patient-derived SARS-CoV-2 samples [29]. However, to our knowledge this is the first time we are able to show the link between mutation load derived from wastewater samples and the incidence rates. In addition, in our dataset, mutation load has a slight advantage over RT-qPCR results when we compare the variance explained by the two models. We showed that mutation load from wastewater SARS-CoV-2 samples is at least as predictive as RT-qPCR for incidence rates. However, to make more conclusive statements we need to sample wastewater for a much longer duration and compare the results. Regardless of the methods used on wastewater, as previously published reports also indicate, wastewater monitoring may provide early warning for future case numbers and emerging mutations before these patients hit the healthcare system.
In conclusion, we present a reproducible, comprehensive workflow with a high level of usability that has features for tracking mutations and Variants of Concern over time and geographical locations. We highlight these points using real-world data from Berlin wastewater sequencing samples, and demonstrate the potential to provide more detailed and conclusive insights for SARS-CoV-2 wastewater sequencing efforts.
Methods
Experimental methods
Enrichment of viral particles from raw wastewater and RNA extraction
Raw wastewater samples were collected from four different wastewater treatment plants across Berlin, serving a population of approximately 3.4 mio in total. They were collected as composite two hour samples (8-10pm and 10-12pm) at the primary influent collector at the indicated wastewater treatment plants. Typical characteristics of Berlin wastewater treatment plant influents are 500-1500 mg/L chemical oxygen demand, 200-600 mg/L suspended solids, 40-80 mg/L ammonium-N, 2-8 mg/L orthophosphate-P, 1500-2000 µS/cm electrical conductivity.
Samples were stored and transported at four degrees, and processed about 12 hours after collection. Virus particles were enriched as previously described [30]. About 100ml sample was filtered through 2 glass fiber and 0.2 µM PVDF filters (Millipore, cat# AP2007500 and S2GVU02RE). Of this filtrate, 60 ml was transferred to a 10 kDa cutoff centricon unit, that was previously pre-conditioned with 50 ml ultrapure water and centrifugation with 3000 g for 15 minutes at 4 °C. After centrifugation of the samples for 30 minutes at 4 °C and again 3000 g, the unit was inverted and about 400 µl concentrate was collected by centrifugation for 1000g at 4 °C for 3 minutes. The concentrate was mixed with 3 volumes of Trizol LS (ThermoFisher cat# 10296-010), and the RNA extracted using the Direct-zol RNA miniprep kit (Zymo cat# R2052) including the DNase treatment and elution with 50 µl ultrapure water according to the manufacturer’s instruction. Absence of PCR inhibitors was confirmed by mixing the sample 1:1 with total RNA from human cells followed by amplification of a human transcript by RT-qPCR.
Reverse transcription / quantitative polymerase chain reaction (RT-qPCR)
The extracted RNA was amplified using the LunaScript reverse transcription mix (NEB cat# E3010L), with 16 µl RNA and 4 µl reaction master mix according to the manufacturer’s instructions, except for a 20 minutes incubation at 55 °C instead of 10 minutes. Afterwards, the cDNA was diluted 1:10 with ultrapure water, and 3.75 µl diluted cDNA used per qPCR reaction, using a SYBR green master mix (ThermoFisher cat# 43-643-46), and final concentrations of 250 nM of the primers on Supplementary Table S12. The presence of the proper amplicon was verified using a 2.5% TAE agarose gel.
ARTIC-seq of the SARS-CoV-2 genome
Amplicon sequencing libraries of the SARS-CoV-2 genome were generated using the ARTIC v3 protocol [18], using 6 µl of the cDNA generated as described above as an input. The primer pools were obtained from IDT. Amplicon libraries were sequenced on an Illumina Miseq or Novaseq device with 2×250 paired-end sequencing and 20% phiX spike-in.
Computational Methods
General Pipeline description
In the first step the pipeline takes the raw reads and the additional information about used primers and adapters to perform extensive quality control. Primer trimming is done with iVAR [31] and fastp [32] is used for adapter trimming and filtering. In order to make the read quality process comprehensible, fastQC reports are generated after each step and summarized with additional MultiQC reports. The processed reads are aligned to the reference genome by BWA Mem [33] and various coverage statistics are taken by SAMtools coverage/bedcov [34]. The alignment is used further for single nucleotide variant (SNV) calling using LoFreq [35]. For predicting the lineage abundances, a deconvolution matrix is generated by matching the set of mutations called by LoFreq against the provided mutation sheet. The SNVs are translated to protein mutations by Ensemble VEP [36]. Kraken2 [37] is used to get taxonomic classification of the unaligned reads as an additional quality measure and further insight in the samples. A deconvolution method is used to calculate the proportion of lineages (more details in the section Deconvolution analysis) for each sample. For summarizing and visualizing the deconvolution results as a time series, samples with SARS-CoV-2 genome coverage below 90% are discarded. For each mutation, linear regression models are used (more details in the section Regression analysis for mutations) to detect if any mutation is significantly increasing over time. Here discarded samples were also not included.
For each sample a set of four reports (multiQC, general qc report, taxonomic classification report, lineage report) is generated using R - markdown and knitr. The R-package of plotly is used for generating interactive visualizations. The relevant results across all provided samples are summarized by an extra report that provides insightful visualizations and accessible navigation linking to all the single reports. In this way the pipeline output provides both - an easily accessible overview about lineage and mutation dynamics in a communicable format but also enables extensive data exploration and access to sample wise tables and summaries without the need for running extra scripts. PiGx SARS-CoV-2 uses snakemake [38] for automatic workflow management.
Pipeline accessibility
The pipeline can be installed over GNU Guix and runs with the command [pigx-sars-cov2-ww –s {sample_sheet} {settings_file}]. A cloud version that does not require any installation is also already under development. Alternatively, the pipeline will be available through Docker packages. However, to ensure reproducibility using GNU Guix is recommended [39].
Deconvolution analysis
Model description
With m being a system of linear equations build by using B being a signature matrix constructed from the signature mutations provided as input and f being the proportions for the lineages the deconvolution approach can be represented as m = f x B. Similar to what has been shown before for deconvolution of cell types from gene expression profiles or methylation profiles [40], we follow the assumption that the frequency of signature mutations corresponds with the frequency of the actual lineage which is characterized by it. The difference in our approach is that we use sequence mutations and apply weights to the signature matrix in order to get more realistic prediction results.
Signature matrix construction
The signature matrix is obtained by matching the set of mutations found in the sample against the set of signature mutations provided as input. In case the mutation sheet contains mutations that are shared between lineages, it is possible that multiple lineages can not be distinguished from each other. In this case, the signature matrix will be deduplicated leaving only one column of the duplicated lineages which will be renamed with the grouped names of all lineages showing this duplicated signature mutation “pattern”.
To make the matrix more robust additional “reference mutations” are added as well as a reference column denoted as “WT”. Bulk frequencies for the “reference mutations” are the difference between 1 and the value of the related signature mutation.
We propose the assumption that the more signature mutations can be found for a specific lineage the higher the probability that this lineage is present with a higher proportion within the sample. We therefore weigh the signature matrix (without the reference mutations) for each lineage with the proportion of signature mutations that has been found for each specific lineage from the total number of signature mutations that was given to characterize it. Applying weights results into less variation and more accurate predictions.
Regression
To deconvolute the lineage abundance we performed robust regression analysis on the signature matrix and the bulk frequency values of the signature mutations using the “Robust Fitting of Linear Models” - rlm() function from the R library MASS [41] (default settings, maxit = 100). Similar to the deconvolution method CIBERSORT [40], we set negative coefficients to 0 and normalized all coefficients to add up to 1 which then form the output value providing the predicted lineage frequency values for the provided lineages and an additional “WT” (reference strain) estimation.
PCR bias as well as the number of signature mutations found influences the robustness of the results. We therefore added the additional constraint to only perform the deconvolution analysis on samples matching a minimum quality score.
Dealing with indistinguishable variants
After deconvolution grouped indistinguishable lineages have to be split again. There are three possible outcomes for those groups:
Firstly, when no signature mutations for a lineage could be found the group includes the “WT” column and is in fact “WT” only. So the grouped lineages are getting the proportion value 0, “WT” get’s the deconvoluted value. Secondly, the grouped lineages are deconvoluted to 0. In this case both lineages are assigned with the value 0. Thirdly, the grouped lineages are not equal to “WT” and are getting a deconvolution value above 0. In this case the assumption for normal distribution of the lineage abundances is applied and the deconvolution value is divided by the number of grouped lineages. Each lineage is assigned this adjusted value.
Regression analysis for mutations
For the regression analysis on mutation frequencies we applied a linear regression model using the “Fitting Linear Models” - lm - function of R base. The test was only performed on mutations if N(x>0) > 5 being the number of frequency values x that are above 0 across all samples. To get only increasing trends, the coefficient values were filtered for values x > 0 only. P-values were calculated by the lm-function using t-test and were filtered for p < 0.05. We report the mutation trend analysis together with and sorted by the regression coefficient as a comparable value for unstandardized effect size.
Pooling of samples for time series analysis and plots
For summarizing across daytime and location, the lineage frequencies are pooled by calculating the weighted average using the total number of reads of each sample as weights. The mutation frequencies are pooled by using the simple mean setting removal of missing values to FALSE. Figures and deconvolution plots are done with ggplot2 [42]. For the cross-correlation analysis samples were pooled by week and the pooled unique set of non-signature mutations was counted.
All details and code can be found on the pipelines repository: https://github.com/BIMSBbioinfo/pigx_sarscov2_ww
Sample scoring for quality check
With the provided BED file for the signature mutations listed in the mutation sheet a coverage analysis is performed using BEDtools coverage [43] within the pipeline. For the regression analysis and time series plots only samples are taken in concern that cover more than 90% of all provided signature mutation sites.
Correlation of mutation load analysis to case numbers
We checked if the number of non-signature mutations (here referred to mutation load) correlates with the number of cases in Berlin. For that, we performed a cross correlation analysis using the “CrossCorrelation Function” - ccf() from R “tseries” package (see Supplementary Figure 1) up to h = ± 7. Later we performed a time series intersection to calculate values for cases and mutations or qPCR for lag = −1. We did a standard linear regression on the intersection results using pearson correlation to get R2.
Reproducible environment
The presented results were produced using PiGx SARS-CoV-2 version 0.03 commit 2603b275106a2a96a422dcfba61554f4d9c0d780. The manifest and guix-version file to create a reproducible environment are provided as supplemental material.
Data Availability
Data is available within the links provided in the manuscript
Data access
The data will be deposited to the European Nucleotide Archive (ENA). The interactive report that was used and produced for this pipeline can be found here: https://bimsbstatic.mdc-berlin.de/akalin/AAkalin_pathogenomics/sarscov2_ww_reports/211104_pub_version/index.html
Acknowledgements
We thank Mrs. Burzyk, Cytner, Darre, Göldner, Grunow, Heinig, Horn, Kapczynski, Klawonn, Koch, Krug, Meyer, Neideck, Schmidt, Schwarzenberg, Stroede, Zühlsdorff, and Messrs. Armbrecht, Dombrowski, Frankenstein, Halatta, Hambarsomian, Linnek, Muss, Flatau, of the Berliner Wasserbetriebe for sampling and logistic support; and Dr. Meixner of amedes as well as Dr. Selinka of the Umweltbundesamt and also Mrs. Schumacher for helpful discussions. Also, we would like to thank Friederike Dündar for consultation on best practices for visualization techniques and Jonas Freimuth for code discussions and support with code development.
Footnotes
Text edited for clarity and methodological details