PT - JOURNAL ARTICLE
AU - Farkas, Carlos
AU - Mella, Andy
AU - Haigh, Jody J.
TI - Large-scale population analysis of SARS-CoV-2 whole genome sequences reveals host-mediated viral evolution with emergence of mutations in the viral Spike protein associated with elevated mortality rates
AID - 10.1101/2020.10.23.20218511
DP - 2020 Jan 01
TA - medRxiv
PG - 2020.10.23.20218511
4099 - http://medrxiv.org/content/early/2020/10/27/2020.10.23.20218511.short
4100 - http://medrxiv.org/content/early/2020/10/27/2020.10.23.20218511.full
AB - Background: We aimed to further characterize in depth the intra-host variation and founder variants of SARS-CoV-2 worldwide up to August 2020 by examining more than 94,000 SARS-CoV-2 viral sequences, in order to understand SARS-CoV-2 variant evolution, determine how these variants arose, and identify any increased mortality associated with them. Methods and Findings: We combined worldwide sequencing data from the GISAID and Sequence Read Archive (SRA) repositories and discovered SARS-CoV-2 hypermutation occurring in less than 2% of COVID-19 patients, likely caused by host mechanisms involving APOBEC3G complexes and intra-host microdiversity. Most of the intra-host variation occurring in SARS-CoV-2 is predicted to change viral proteins with defined variant signatures, demonstrating that SARS-CoV-2 can be actively shaped by the host immune system to varying degrees. At the global population level, several SARS-CoV-2 proteins such as Nsp2, 3C-like proteinase, ORF3a and ORF8 are under active evolution, as evidenced by their increased πN/πS ratios per geographical region. Importantly, two emergent variants, V1176F (co-occurring with the D614G mutation in the viral Spike protein) and S477N (located in the Receptor Binding Domain (RBD) of the Spike protein), are associated with high fatality rates and are spreading increasingly throughout the world.
The S477N variant arose quickly in Australia, and experimental data support that this variant increases Spike protein fitness and its binding to ACE2. Conclusions: SARS-CoV-2 is evolving non-randomly, and human hosts shape emergent variants with positive fitness that can easily spread through the population. We propose that the V1176F and S477N variants in the Spike protein are two novel SARS-CoV-2 mutations that may pose significant public health concerns in the future. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: Powered@NLHPC: This research was partially supported by the supercomputing infrastructure of the NLHPC (ECM-02). This research was partially funded by the CIHR, Research Manitoba and the CancerCare MB Research Foundation. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: not applicable. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. Data and Code Availability: 76,553 FASTA genomes and associated sequencing metadata were downloaded from the GISAID database (https://www.gisaid.org/) for the period from January 1, 2019 to August 3, 2020, specifying human as the source host. The associated sequencing metadata, including major variants per sample, are available in Supplementary Table 1. Aggregated variants in VCF format for these genomes and associated consequence predictions are available here: https://usegalaxy.org/u/carlosfarkas/h/sars-cov-2-variants-gisaid-august-03-2020. 974 Brazilian FASTA sequences were downloaded from the GISAID database for the period from January 1, 2019 to September 25, 2020, specifying human as the source host and South America / Brazil as the location. These FASTA sequences and associated aggregated variants are available here: https://usegalaxy.org/u/carlosfarkas/h/brazil-genome-sequences-from-gisaid-sept25-2020. FASTA sequences of GISAID genomes with associated metadata up to September 28, 2020, together with the results of the snpFreq program (containing Deceased-Released SNP associations), are available here: https://usegalaxy.org/u/carlosfarkas/h/gisaid-patient-metadata-sept28-2020. Acknowledgements to all laboratories/consortia involved in the generation of the GISAID genomes used in this study are listed in Supplementary Table 2. 17,560 sequencing datasets were downloaded from the Sequence Read Archive repository (SRA, https://www.ncbi.nlm.nih.gov/sars-cov-2/) for the period from December 1, 2019 to July 28, 2020.
Associated sequencing run accessions, sequencing metadata and related BioProjects are listed in Supplementary Table 3. The code generated during this study to replicate most of the computational analyses performed in this manuscript is available in the following GitHub repository: https://github.com/cfarkas/SARS-CoV-2-freebayes.