Unusual SARS-CoV-2 intra-host diversity reveals lineages superinfection

SARS-CoV-2 had infected almost 200 million people worldwide by July 2021, and the pandemic has been characterized by infection waves of viral lineages showing distinct fitness profiles. The simultaneous infection of a single individual by two distinct SARS-CoV-2 lineages provides a window of opportunity for viral recombination and the emergence of new lineages with differential phenotypes. Several hundred SARS-CoV-2 lineages are currently well characterized, but two main factors have precluded major coinfection/codetection analyses thus far: i) the low diversity of SARS-CoV-2 lineages during the first year of the pandemic, which limited the identification of the lineage-defining mutations necessary to distinguish coinfecting viral lineages; and ii) the limited availability of raw sequencing data in which the abundance and distribution of intrasample/intrahost variability can be accessed. Here, we assembled a large sequencing dataset from Brazilian samples covering the period from 18 May 2020 to 30 April 2021 and probed it for unexpected patterns of high intrasample/intrahost variability. This enabled us to detect nine cases of SARS-CoV-2 coinfection with well characterized lineage-defining mutations. In addition, we matched these SARS-CoV-2 coinfections with spatio-temporal epidemiological data, confirming their plausibility given the lineages co-circulating in the timeframe investigated. These coinfections represent around 0.61% of all samples investigated. Although our data suggest that coinfection with distinct SARS-CoV-2 lineages is a rare phenomenon, this is likely an underestimate, and coinfection rates warrant further investigation.

DATA SUMMARY
The raw fastq data of codetection cases are deposited on gisaid.org and correlated to

INTRODUCTION
SARS-CoV-2, the etiological agent of the COVID-19 pandemic, has a relatively low mutation rate compared to other RNA viruses 1, and most viral lineages are normally defined by only a few synapomorphic SNPs (n < 10) 2. However, the pervasiveness of SARS-CoV-2 infections during the COVID-19 pandemic provided substantial opportunities for the virus to explore the fitness landscape through single nucleotide substitutions and/or indels, giving rise to a range of more transmissible variants of concern (VOCs). These lineages are characterized by an unusually large number of lineage-defining SNPs along the genome (n > 15) 3,4,5.
Coinfection is defined as the simultaneous infection of a single cell/host by more than one virus lineage. Although it is a rare phenomenon, it may provide an opportunity for genetic recombination, an event known to occur in viruses of the Coronaviridae family 6,7.
Recombinant viruses may, in turn, trigger the emergence of new lineages with enhanced biological properties, including the capacity to infect new hosts (expansion of viral host range) [8][9][10][11]. The frequency of coinfected patients and their role in promoting recombination-driven SARS-CoV-2 evolution and the emergence of SARS-CoV-2 lineages are still poorly understood. The low variability found in SARS-CoV-2 lineages and the small number of well-defined lineage-specific SNPs until the second half of 2020 probably hindered the identification of coinfection and recombination events among SARS-CoV-2 lineages. In contrast, the emergence of VOC lineages carrying a substantial number of additional SNPs may now provide enough markers to detect these events. A number of coinfection cases have been reported for SARS-CoV-2, including lineages B. 14,15.
In this study, we assessed amplicon sequencing reads of 2,263 SARS-CoV-2 samples from Brazilian patients generated by the Fiocruz Genomic Surveillance Network. We [...] were also co-circulating at the time of sampling, thus providing further plausibility for our findings.

SARS-CoV-2 sequences and ethical aspects.
The sequencing data were obtained through the genomic survey of SARS-CoV-2-positive samples sequenced by the Fiocruz COVID-19 Genomic Surveillance Network between 18 May 2020 and 30 April 2021. The SARS-CoV-2 genomes were recovered using previously described Illumina protocols [16][17][18] (Table S1). The frequency of lineages by Brazilian state was evaluated using data recovered from GISAID (gisaid.org) on 23

Genome assembly and intrahost variant analysis.
The fastq reads were submitted to an in-house workflow available at https://github.com/dezordi/IAM_SARSCOV2 that performs the following steps: removal of duplicated reads, adapters and read ends with a Phred quality score below 20 using the fastp tool 19; reference-guided genome assembly with BWA 20, mapping reads against the SARS-CoV-2 Wuhan reference genome (NC_045512.2); and generation of consensus genomes with samtools mpileup 21 and iVar 22, using a quality score threshold of 30 and calling SNPs and indels present as major allele frequencies. After consensus generation, the bam-readcount tool 23 [...] positions. The consensus genomes were submitted to the PangoLineage tool v1.1.23, with the pangoLEARN update of 28 May 2021 24, and to Nextclade 25. Only genomes with more than 95% coverage breadth and 100 reads of average coverage depth (Table S2)
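The coverage criterion mentioned at the end of this section can be illustrated with a minimal Python sketch. It is not taken from the IAM_SARSCOV2 workflow: the function name `passes_qc`, the `covered_at` parameter, and the reading that genomes meeting both thresholds are the ones retained are assumptions made here for illustration, since the source sentence is incomplete.

```python
from statistics import mean


def passes_qc(depths, min_breadth=0.95, min_mean_depth=100, covered_at=1):
    """Return True if a genome meets the assumed breadth and depth thresholds.

    depths         -- per-position read depth along the reference genome
    min_breadth    -- required fraction of covered positions (0.95 = 95%)
    min_mean_depth -- required average depth across all positions
    covered_at     -- hypothetical minimum depth for a position to count as
                      covered; the text does not state the value used
    """
    if not depths:
        return False
    covered = sum(1 for d in depths if d >= covered_at)
    breadth = covered / len(depths)
    return breadth > min_breadth and mean(depths) >= min_mean_depth


# Toy example: 97% of positions covered at a depth of 200 reads.
print(passes_qc([200] * 97 + [0] * 3))
```

In practice the per-position depths would come from a tool such as `samtools depth`; the toy list above only stands in for that output.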

Phylogenetic Analysis.
A reference alignment was created using MAFFT 27 with the 6,167 genomes, which represent the genomes present in the nextstrain 28

RESULTS AND DISCUSSION
Our initial analysis revealed that 1,462 out of 2,263 genomes had sufficient sequencing breadth and depth to consistently detect and characterize viral genomic variability at the level of sequencing reads. Of these 1,462 SARS-CoV-2-positive samples, 1,150 showed at least one genomic site with supported intra-host variability, that is, at least one genomic position with more than 100 reads supporting a minimum of two alternative nucleotides. Those samples showed an average coverage depth of 1817.46 (stdev = 908.59) and an average coverage breadth supported by at least 100 reads of 99.66 (stdev = 1.10) (Table S2). In addition, we estimated a mean of 2.57 genomic sites showing intra-host variants (Table S3). Major and Minor consensus sequences were generated for all samples bearing well supported alternative nucleotides. These alternative consensus genomes, representing the viral genome variability found in each sample, were then assessed for lineage assignment using the PangoLineage tool. If the same lineage was recovered for both genomes, the Major and Minor variants did not differ at lineage-defining SNPs, and the variability observed likely resulted from de novo intra-host variants that emerged during viral replication. Conversely, if the Major and Minor genomic variants were assigned to different lineages, the intra-host variability observed is more likely derived from a codetection event. We detected 16 instances in which Major and Minor variants were assigned to distinct lineages (intra-host sites: mean = 24, stdev = 9.75), including the former Variants for Further Monitoring (VFM) N.9 and P.2 as well as the highly circulating VOC P.1 (Table S4) (Figure 2A, Table S5).
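The Major/Minor comparison described above can be sketched in Python as follows. This is a simplified illustration, not the authors' workflow. It assumes one reading of the support criterion (a nucleotide counts as supported when more than 100 reads carry it), and `assign_lineage` is a hypothetical callable standing in for the PangoLineage tool. All function names are introduced here for illustration only.

```python
def major_minor_base(counts, min_reads=100):
    """Return (major, minor) nucleotides for one position.

    counts is a dict mapping nucleotide -> number of supporting reads.
    minor is None unless a second nucleotide has more than min_reads reads.
    """
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    major = ranked[0][0]
    supported = [base for base, n in ranked if n > min_reads]
    minor = supported[1] if len(supported) >= 2 else None
    return major, minor


def build_consensuses(per_site_counts, min_reads=100):
    """Build Major and Minor consensus strings and count intra-host sites."""
    major_seq, minor_seq, n_sites = [], [], 0
    for counts in per_site_counts:
        major, minor = major_minor_base(counts, min_reads)
        major_seq.append(major)
        # Without a supported alternative, the Minor genome keeps the major base.
        minor_seq.append(major if minor is None else minor)
        if minor is not None:
            n_sites += 1
    return "".join(major_seq), "".join(minor_seq), n_sites


def classify(major_seq, minor_seq, assign_lineage):
    """Compare the lineages assigned to the Major and Minor consensus genomes."""
    if assign_lineage(major_seq) == assign_lineage(minor_seq):
        return "de novo intra-host variation"
    return "putative codetection"


# Toy example with three positions and a made-up lineage rule.
sites = [{"A": 1500, "G": 3}, {"C": 900, "T": 600}, {"G": 1200, "A": 101}]
major, minor, n = build_consensuses(sites)
toy_lineage = lambda seq: "lineage X" if seq[1] == "C" else "lineage Y"
print(major, minor, n, classify(major, minor, toy_lineage))
```

Passing the lineage caller in as a function keeps the sketch independent of any particular tool or version. A real analysis would write both consensus genomes to FASTA and run the lineage assignment on each.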
Seven out of nine putative coinfection events involve the VOC Gamma (P.1 lineage) (Table 1) [...] SNPs characteristic of this lineage that facilitate the distinction between coinfecting SARS-CoV-2 lineages. As more divergent lineages bearing many lineage-defining SNPs coinfect the same host, it becomes increasingly feasible to objectively distinguish the coinfecting lineages through the reconstruction of alternative intrasample viral genomes.
In order to assess whether codetection could be a result of sample contamination, we reprocessed sample AM-FIOCRUZ-21142481RG from RNA extraction through library preparation and sequencing. We confirmed the intrahost variability for 25 out of 31 sites present in the first sequencing run (Table S6). Moreover, lineage assignment, phylogenetic reconstruction and the detection of lineage-defining SNPs confirmed the codetection status of that sample (Table S4). This study reports that codetection/coinfection events occurred at a low rate in Brazil (0.61%; 9 of 1,462 samples). This is certainly an underestimate, owing to the difficulty of detecting true coinfection events among the earlier, weakly diverged SARS-CoV-2 lineages that dominated the first year of the pandemic. Despite that, considering that the lower bound of recorded SARS-CoV-2 cases worldwide until July 2021 was around 190 million (https://coronavirus.jhu.edu/map.html), we can infer that at least 1.1 million patients have been coinfected across the world, which in turn provides a substantial window of opportunity for SARS-CoV-2 recombination events. Moreover, this estimate is certainly downwardly biased because the number of asymptomatic infections is largely not accounted for.
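The worldwide extrapolation above is simple arithmetic and can be reproduced with a short Python sketch. The function name `extrapolate` is introduced here for illustration. The calculation assumes, as the text does, that the rate observed in the Brazilian samples applies uniformly to all recorded cases.

```python
def extrapolate(codetections, samples_analysed, recorded_cases):
    """Return the observed codetection rate and the implied number of cases."""
    rate = codetections / samples_analysed
    return rate, rate * recorded_cases


# Figures from the text: 9 codetections among 1,462 samples and
# around 190 million recorded cases worldwide until July 2021.
rate, estimated = extrapolate(9, 1462, 190_000_000)
print(f"rate = {rate:.4%}, estimated coinfected patients = {estimated:,.0f}")
```

The observed rate is roughly 0.6%, and the implied figure is a little above 1.1 million, consistent with the lower bound stated in the text.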

Conclusions
In line with other studies, we showed that SARS-CoV-2 has apparently low intrahost variability overall. Our in-depth analysis revealed at least nine codetection events, which are corroborated by epidemiological data on co-circulating lineages in different Brazilian states. Moreover, the lineages identified revealed the early emergence of cryptically

Conflicts of interest
The authors declare that there are no conflicts of interest.

Funding information
Figure 3. SARS-CoV-2 lineage proportion through time in different Brazilian states with codetection cases. Data were recovered from GISAID on 23 July 2021; raw data can be accessed in Table S7. Upper triangles are colored according to the lineage of the major consensus genomes and lower triangles according to the lineage of the minor consensus genomes.

medRxiv preprint, this version posted September 23, 2021; doi: https://doi.org/10.1101/2021.09.18.21263755. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.