Emergence of genomic diversity and recurrent mutations in SARS-CoV-2

https://doi.org/10.1016/j.meegid.2020.104351

Highlights

  • Phylogenetic estimates support that the COVID-19 pandemic started sometime between 6 October 2019 and 11 December 2019.

  • The diversity of SARS-CoV-2 strains in many countries recapitulates its entire global diversity.

  • 198 sites in the SARS-CoV-2 genome appear to have already undergone recurrent, independent mutations.

  • Detected recurrent mutations may indicate ongoing adaptation of SARS-CoV-2 to its novel human host.

  • Monitoring the build-up of genetic diversity in SARS-CoV-2 has potential to inform on targets for drugs and vaccines.

Abstract

SARS-CoV-2 is a SARS-like coronavirus of likely zoonotic origin first identified in December 2019 in Wuhan, the capital of China's Hubei province. The virus has since spread globally, resulting in the currently ongoing COVID-19 pandemic. The first whole genome sequence was published on January 5 2020, and thousands of genomes have been sequenced since this date. This resource allows unprecedented insights into the past demography of SARS-CoV-2 but also monitoring of how the virus is adapting to its novel human host, providing information to direct drug and vaccine design. We curated a dataset of 7666 public genome assemblies and analysed the emergence of genomic diversity over time. Our results are in line with previous estimates and point to all sequences sharing a common ancestor towards the end of 2019, supporting this as the period when SARS-CoV-2 jumped into its human host. Due to extensive transmission, the genetic diversity of the virus in several countries recapitulates a large fraction of its worldwide genetic diversity. We identify regions of the SARS-CoV-2 genome that have remained largely invariant to date, and others that have already accumulated diversity. By focusing on mutations which have emerged independently multiple times (homoplasies), we identify 198 filtered recurrent mutations in the SARS-CoV-2 genome. Nearly 80% of the recurrent mutations produced non-synonymous changes at the protein level, suggesting possible ongoing adaptation of SARS-CoV-2. Three sites in Orf1ab in the regions encoding Nsp6, Nsp11, Nsp13, and one in the Spike protein are characterised by a particularly large number of recurrent mutations (>15 events) which may signpost convergent evolution and are of particular interest in the context of adaptation of SARS-CoV-2 to the human host. We additionally provide an interactive user-friendly web-application to query the alignment of the 7666 SARS-CoV-2 genomes.

Introduction

On December 31 2019, China notified the World Health Organisation (WHO) about a cluster of pneumonia cases of unknown aetiology in Wuhan, the capital of the Hubei Province. The initial evidence was suggestive of the outbreak being associated with a seafood market in Wuhan, which was closed on January 1 2020. The aetiological agent was characterised as a SARS-like betacoronavirus, later named SARS-CoV-2, and the first whole genome sequence (Wuhan-HU-1) was deposited on NCBI Genbank on January 5 2020 (Wu et al., 2020). Human-to-human transmission was confirmed on January 14 2020, by which time SARS-CoV-2 had already spread to many countries throughout the world. Further extensive global transmission led to the WHO declaring COVID-19 as a pandemic on March 11 2020.

Coronaviridae comprise a large number of lineages that are found in a wide range of mammals and birds (Shaw et al., 2020), including the other human zoonotic pathogens SARS-CoV-1 and MERS-CoV. The propensity of betacoronaviruses to undergo frequent host jumps supports SARS-CoV-2 also being of zoonotic origin. To date, the genetically closest known lineage is found in horseshoe bats (BatCoV RaTG13) (Zhou et al., 2020). However, this lineage shares only 96% identity with SARS-CoV-2, which is not sufficiently high to implicate it as the immediate ancestor of SARS-CoV-2. The zoonotic source of the virus remains unidentified at the date of writing (April 23 2020).

The analysis of genetic sequence data from pathogens is increasingly recognised as an important tool in infectious disease epidemiology (Rambaut et al., 2008; Grenfell et al., 2004). Genetic sequence data sheds light on key epidemiological parameters such as doubling time of an outbreak/epidemic, reconstruction of transmission routes and the identification of possible sources and animal reservoirs. Additionally, whole-genome sequence data can inform drug and vaccine design. Indeed, genomic data can be used to identify pathogen genes interacting with the host and allows characterisation of the more evolutionary constrained regions of a pathogen genome, which should be preferentially targeted to avoid rapid drug and vaccine escape mutants.

There are thousands of global SARS-CoV-2 whole-genome sequences available on the rapid data sharing service hosted by the Global Initiative on Sharing All Influenza Data (GISAID; https://www.epicov.org) (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017). The extraordinary availability of genomic data during the COVID-19 pandemic has been made possible thanks to a tremendous effort by hundreds of researchers globally depositing SARS-CoV-2 assemblies (Table S1) and the proliferation of close to real time data visualisation and analysis tools including NextStrain (https://nextstrain.org) and CoV-GLUE (http://cov-glue.cvr.gla.ac.uk).

In this work we use this data to analyse the genomic diversity that has emerged in the global population of SARS-CoV-2 since the beginning of the COVID-19 pandemic, based on a download of 7710 assemblies. We focus in particular on mutations that have emerged independently multiple times (homoplasies) as these are likely candidates for ongoing adaptation of SARS-CoV-2 to its novel human host. After filtering, we characterise homoplasies at 198 sites in the SARS-CoV-2 genome. We identify a strong signal of recurrent mutation at nucleotide position 11,083 (Codon 3606 Orf1a), together with two further sites in Orf1ab encoding the non-structural proteins Nsp11 and Nsp13. These, together with a mutation in the Spike protein (21,575, Codon 5), comprise the strongest putative regions under selection in our dataset.
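The idea of a homoplasy can be illustrated with a minimal sketch, assuming a strictly bifurcating tree written as nested tuples and one observed base per sample at a single site. It counts the minimum number of changes with Fitch parsimony (Fitch, 1971) and flags the site when more changes are needed than the number of distinct bases minus one. The function names, the toy tree and the sample labels are hypothetical, and this is not the authors' pipeline; real phylogenies contain polytomies and missing data that this sketch ignores.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes at one site on a binary tree (Fitch parsimony)."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):           # leaf: the base observed in that sample
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:                          # children agree on at least one base
            return shared
        changes += 1                        # disjoint sets require one change
        return left | right

    visit(tree)
    return changes


def is_homoplasic(tree, states):
    """True if the site needs more changes than the minimum possible on any tree."""
    # k distinct bases need at least k - 1 changes; any excess means that
    # some base arose (or was lost) more than once on this tree.
    minimum = len(set(states.values())) - 1
    return fitch_changes(tree, states) > minimum


# Hypothetical toy tree with six samples.
toy_tree = ((("A", "B"), ("C", "D")), ("E", "F"))

# T appears in two separate clades: two independent emergences.
recurrent_site = {"A": "T", "B": "G", "C": "T", "D": "G", "E": "G", "F": "G"}

# T is confined to one clade: a single emergence.
clade_site = {"A": "T", "B": "T", "C": "G", "D": "G", "E": "G", "F": "G"}

print(fitch_changes(toy_tree, recurrent_site), is_homoplasic(toy_tree, recurrent_site))
print(fitch_changes(toy_tree, clade_site), is_homoplasic(toy_tree, clade_site))
```

In the first toy site the derived base sits in two separate clades, so parsimony needs two changes where one would suffice on a perfectly compatible tree; in the second it is confined to a single clade and one change explains it.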

The current distribution of genomic diversity, as well as ongoing allele frequency changes both between isolates and along the SARS-CoV-2 genome, are available as an open-access, interactive web resource at:

https://macman123.shinyapps.io/ugi-scov2-alignment-screen/.

Data acquisition

7710 SARS-CoV-2 assemblies flagged as “complete (>29,000 bp)”, “high coverage only”, “low coverage excl” were downloaded from the GISAID Initiative EpiCoV platform as of April 19 2020 (11:30 GMT). A full acknowledgements table of those labs which generated and uploaded data is provided in Table S1. Filtering was performed on the downloaded assemblies to exclude those deriving from animals (bat, pangolin), those with more than 1% missing sites, and otherwise spurious assemblies as also listed by
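The length and missing-site criteria described in this section can be sketched as follows. This is only an illustration under stated assumptions: "missing" is taken to mean any character other than an unambiguous A, C, G or T, which the text does not specify, and the function names and default thresholds are hypothetical wrappers around the ">29,000 bp" and "more than 1% missing sites" values quoted above. The exclusion of animal-derived and otherwise spurious assemblies is not covered.

```python
def missing_fraction(sequence):
    """Fraction of sites that are not an unambiguous A, C, G or T (assumed definition)."""
    if not sequence:
        return 1.0
    called = sum(base in "ACGT" for base in sequence.upper())
    return (len(sequence) - called) / len(sequence)


def keep_assembly(sequence, min_length=29_000, max_missing=0.01):
    """Keep assemblies longer than min_length with at most max_missing missing sites."""
    return len(sequence) > min_length and missing_fraction(sequence) <= max_missing


# Hypothetical toy assemblies, not real genomes.
clean = "ACGT" * 7500                        # 30,000 called sites
patchy = "ACGT" * 7350 + "N" * 600           # 30,000 sites, 2% missing
short = "ACGT" * 5000                        # 20,000 sites, below the length cut-off

for name, seq in [("clean", clean), ("patchy", patchy), ("short", short)]:
    print(name, len(seq), round(missing_fraction(seq), 4), keep_assembly(seq))
```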

Emergence of SARS-CoV-2 genomic diversity over time

The 7666 SARS-CoV-2 genomes offer an excellent geographical and temporal coverage of the COVID-19 pandemic (Fig. 1a-b). The genomic diversity of the 7666 SARS-CoV-2 genomes is represented as Maximum Likelihood phylogenies in a radial (Fig. 1c) and linear layout (Fig. S1-S2). There is a robust temporal signal in the data, captured by a statistically significant correlation between sampling dates and ‘root-to-tip’ distances for the 7666 SARS-CoV-2 genomes (Fig. S3; R² = 0.20, p < 0.001). Such positive
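The temporal-signal check mentioned here, a regression of root-to-tip distance on sampling date, can be sketched with ordinary least squares. This is a minimal illustration, assuming dates expressed as decimal years and distances in substitutions per site; the function name and the five data points are invented for the example and are not values from the study. The slope is a crude estimate of the substitution rate and R² measures how clock-like the data are.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance on sampling date.

    Returns (slope, intercept, r_squared).
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Invented toy values: sampling dates (decimal years) and distances (subs/site).
toy_dates = [2020.00, 2020.05, 2020.10, 2020.20, 2020.25]
toy_distances = [1.0e-4, 0.9e-4, 1.6e-4, 1.8e-4, 2.6e-4]

slope, intercept, r_squared = root_to_tip_regression(toy_dates, toy_distances)
print(f"rate ~ {slope:.2e} subs/site/year, R^2 = {r_squared:.2f}")
```

A positive slope with a significant correlation is what a measurably evolving population is expected to show; the significance test itself is not implemented in this sketch.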

Discussion

Pandemics have been affecting humanity for millennia (Balloux and van Dorp, 2017). Over the last century alone, several global epidemics have claimed millions of lives, including the 1957/58 influenza A (H2N2) pandemic, the sixth (1899–1923) and seventh ‘El Tor’ (1961–1975) cholera pandemics, as well as the HIV/AIDS pandemic (1981–today). COVID-19 acts as an unwelcome reminder of the major threat that infectious diseases represent in terms of deaths and disruption.

One positive aspect of the

Author contributions

L.v.D. and F.B. conceived and designed the study; L.v.D., M.A., D.R., L.P.S., C.E.F., L.O., C.J.O., J.P., C.C.S.T., F.A.T.B., and A.T.O. analysed data and performed computational analyses; L.v.D. and F.B. wrote the paper with inputs from all co-authors.

Acknowledgments and funding

L.v.D. and F.B. acknowledge financial support from the Newton Fund UK-China NSFC initiative (grant MR/P007597/1) and the BBSRC (equipment grant BB/R01356X/1). Computational analyses were performed on the UCL Computer Science cluster and the South Green bioinformatics platform hosted on the CIRAD HPC cluster. We thank Jaspal Puri for insights and assistance on the development of the alignment visualisation tool and Nicholas McGranahan and Rachel Rosenthal for their comments on the manuscript. We

Declaration of Competing Interest

The authors have no competing interests to declare.

References (40)

  • X.G. Li et al. Transmission dynamics and evolutionary history of 2019-nCoV. J. Med. Virol. (2020)

  • E.J. Snijder et al. Unique and conserved features of genome and proteome of SARS-coronavirus, an early split-off from the coronavirus group 2 lineage. J. Mol. Biol. (2003)

  • R.M. Anderson et al. Infectious Diseases of Humans. Dynamics and Control (1991)

  • F. Balloux et al. Q&A: what are pathogens, and what have they done to and for us? BMC Biol. (2017)

  • R. Cagliani et al. Computational inference of selection underlying the evolution of the novel coronavirus, SARS-CoV-2. J. Virol. (2020)

  • J. Crispell et al. HomoplasyFinder: a simple tool to identify homoplasies on a phylogeny. Microbial Genom. (2019)

  • B.L. Dearlove et al. A SARS-CoV-2 vaccine candidate would likely match all currently circulating strains. bioRxiv (2020)

  • X. Didelot et al. Bayesian inference of ancestral dates on bacterial phylogenetic trees. Nucleic Acids Res. (2018)

  • P. Domingo-Calap et al. An unusually high substitution rate in transplant-associated BK polyomavirus in vivo is further concentrated in HLA-C-bound viral peptides. PLoS Pathog. (2018)

  • G. Dudas et al. MERS-CoV recombination: implications about the reservoir and potential for adaptation. Virus Evol. (2016)

  • S. Elbe et al. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Chall. (2017)

  • W.M. Fitch. Toward defining course of evolution - minimum change for a specific tree topology. Syst. Zool. (1971)

  • C. Fraser et al. Pandemic potential of a strain of influenza A (H1N1): early findings. Science (2009)

  • M. Giovanetti et al. The first two cases of 2019-nCoV in Italy: where they come from? J. Med. Virol. (2020)

  • D.E. Gordon et al. A SARS-CoV-2-human protein-protein interaction map reveals drug targets and potential drug-repurposing. Nature (2020)

  • M. Gouy et al. SeaView version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol. Biol. Evol. (2010)

  • B.T. Grenfell et al. Unifying the epidemiological and evolutionary dynamics of pathogens. Science (2004)

  • A. Grifoni et al. A sequence homology and bioinformatic approach can predict candidate targets for immune responses to SARS-CoV-2. Cell Host Microbe (2020)

  • D.T. Hoang et al. MPBoot: fast phylogenetic maximum parsimony tree inference and bootstrap approximation. BMC Evol. Biol. (2018)

  • E.C. Holmes et al. The evolution of Ebola virus: insights from the 2013-2016 epidemic. Nature (2016)

1 Equal contribution.