Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
Introduction
On December 31 2019, China notified the World Health Organisation (WHO) about a cluster of pneumonia cases of unknown aetiology in Wuhan, the capital of the Hubei Province. The initial evidence suggested that the outbreak was associated with a seafood market in Wuhan, which was closed on January 1 2020. The aetiological agent was characterised as a SARS-like betacoronavirus, later named SARS-CoV-2, and the first whole genome sequence (Wuhan-HU-1) was deposited on NCBI GenBank on January 5 2020 (Wu et al., 2020). Human-to-human transmission was confirmed on January 14 2020, by which time SARS-CoV-2 had already spread to many countries throughout the world. Further extensive global transmission led to the WHO declaring COVID-19 a pandemic on March 11 2020.
Coronaviridae comprise a large number of lineages that are found in a wide range of mammals and birds (Shaw et al., 2020), including the other human zoonotic pathogens SARS-CoV-1 and MERS-CoV. The propensity of betacoronaviruses to undergo frequent host jumps supports SARS-CoV-2 also being of zoonotic origin. To date, the genetically closest-known lineage is found in horseshoe bats (BatCoV RaTG13) (Zhou et al., 2020). However, this lineage shares 96% identity with SARS-CoV-2, which is not sufficiently high to implicate it as the immediate ancestor of SARS-CoV-2. The zoonotic source of the virus remains unidentified at the date of writing (April 23 2020).
The analysis of genetic sequence data from pathogens is increasingly recognised as an important tool in infectious disease epidemiology (Rambaut et al., 2008; Grenfell et al., 2004). Genetic sequence data sheds light on key epidemiological parameters such as doubling time of an outbreak/epidemic, reconstruction of transmission routes and the identification of possible sources and animal reservoirs. Additionally, whole-genome sequence data can inform drug and vaccine design. Indeed, genomic data can be used to identify pathogen genes interacting with the host and allows characterisation of the more evolutionary constrained regions of a pathogen genome, which should be preferentially targeted to avoid rapid drug and vaccine escape mutants.
There are thousands of global SARS-CoV-2 whole-genome sequences available on the rapid data sharing service hosted by the Global Initiative on Sharing All Influenza Data (GISAID; https://www.epicov.org) (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017). The extraordinary availability of genomic data during the COVID-19 pandemic has been made possible thanks to a tremendous effort by hundreds of researchers globally depositing SARS-CoV-2 assemblies (Table S1) and the proliferation of close to real time data visualisation and analysis tools including NextStrain (https://nextstrain.org) and CoV-GLUE (http://cov-glue.cvr.gla.ac.uk).
In this work we use this data to analyse the genomic diversity that has emerged in the global population of SARS-CoV-2 since the beginning of the COVID-19 pandemic, based on a download of 7710 assemblies. We focus in particular on mutations that have emerged independently multiple times (homoplasies) as these are likely candidates for ongoing adaptation of SARS-CoV-2 to its novel human host. After filtering, we characterise homoplasies at 198 sites in the SARS-CoV-2 genome. We identify a strong signal of recurrent mutation at nucleotide position 11,083 (Codon 3606 Orf1a), together with two further sites in Orf1ab encoding the non-structural proteins Nsp11 and Nsp13. These, together with a mutation in the Spike protein (21,575, Codon 5), comprise the strongest putative regions under selection in our dataset.
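To make the notion of a homoplasy concrete, the following is a minimal sketch of how a recurrent mutation can be flagged at a single alignment site using Fitch parsimony on a fixed tree. It is an illustration under stated assumptions, not the pipeline used in this study: the tree is assumed to be strictly bifurcating and encoded as nested 2-tuples of tip names, and the function names (`fitch_changes`, `tip_names`, `is_homoplasic`) are hypothetical. A site is treated as homoplasic when the minimum number of changes on the tree exceeds the number of observed alleles minus one, i.e. when its consistency index is below 1.

```python
def fitch_changes(tree, alleles):
    """Fitch parsimony for one site on a binary tree of nested 2-tuples.

    Returns (candidate_states_at_node, minimum_number_of_changes_below_node).
    `alleles` maps each tip name to its base at this site (assumed input).
    """
    if isinstance(tree, str):  # a tip: its state is observed, no changes
        return {alleles[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_changes(left, alleles)
    right_set, right_changes = fitch_changes(right, alleles)
    changes = left_changes + right_changes
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra change
        return shared, changes
    return left_set | right_set, changes + 1  # disagreement costs one change


def tip_names(tree):
    """Collect tip names from the nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return tip_names(tree[0]) + tip_names(tree[1])


def is_homoplasic(tree, alleles):
    """True if the site needs more changes than (observed alleles - 1)."""
    _, changes = fitch_changes(tree, alleles)
    n_states = len({alleles[tip] for tip in tip_names(tree)})
    return changes > n_states - 1


# Toy tree with two clades: (A, B) and (C, D).
toy_tree = (("A", "B"), ("C", "D"))
# The same derived allele appears once in each clade: two independent changes.
recurrent = {"A": "G", "B": "T", "C": "G", "D": "T"}
# The derived allele is confined to one clade: a single change suffices.
single_origin = {"A": "T", "B": "T", "C": "G", "D": "G"}

print(is_homoplasic(toy_tree, recurrent))      # True
print(is_homoplasic(toy_tree, single_origin))  # False
```

On real data the tree is large and may contain multifurcations, so a dedicated tool such as HomoplasyFinder (listed in the references) is the more appropriate choice; the recursion above is only meant to show the underlying logic.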
The current distribution of genomic diversity as well as ongoing allele frequency changes both between isolates and along the SARS-CoV-2 genome are publicly available as an open access and interactive web-resource available here:
Data acquisition
7710 SARS-CoV-2 assemblies flagged as “complete (>29,000 bp)”, “high coverage only”, “low coverage excl” were downloaded from the GISAID Initiative EpiCoV platform as of April 19 2020 (11:30 GMT). A full acknowledgements table of those labs which generated and uploaded data is provided in Table S1. Filtering was performed on the downloaded assemblies to exclude those deriving from animals (bat, pangolin), those with more than 1% missing sites, and otherwise spurious assemblies as also listed by
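The filtering criteria stated above (length above 29,000 bp, animal-derived assemblies removed, no more than 1% missing sites) can be sketched as follows. This is an illustrative sketch rather than the authors' code. It assumes that a "missing site" is any character other than an unambiguous A, C, G or T, and that the host appears as its own slash-separated field in the assembly name (for example `hCoV-19/bat/...`); both are assumptions, as are the function names.

```python
def missing_fraction(seq):
    """Fraction of sites that are not an unambiguous A, C, G or T."""
    if not seq:
        return 1.0
    resolved = sum(seq.upper().count(base) for base in "ACGT")
    return 1.0 - resolved / len(seq)


def keep_assembly(name, seq, min_length=29_000, max_missing=0.01,
                  excluded_hosts=("bat", "pangolin")):
    """Apply the three filters described in the text to one assembly."""
    # Assumed naming convention: host given as a separate '/'-delimited field.
    fields = name.lower().split("/")
    if any(host in fields for host in excluded_hosts):
        return False
    if len(seq) <= min_length:  # "complete" is defined as >29,000 bp
        return False
    return missing_fraction(seq) <= max_missing


# Toy assemblies (synthetic sequences, hypothetical names).
assemblies = {
    "hCoV-19/England/01/2020": "ACGT" * 7500,              # 30,000 bp, clean
    "hCoV-19/bat/Yunnan/X/2013": "ACGT" * 7500,            # animal host
    "hCoV-19/Italy/02/2020": "ACGT" * 7350 + "N" * 600,    # 2% missing
    "hCoV-19/USA/03/2020": "ACGT" * 7000,                  # 28,000 bp
}
kept = [name for name, seq in assemblies.items() if keep_assembly(name, seq)]
print(kept)  # ['hCoV-19/England/01/2020']
```

Matching the host against whole name fields rather than substrings avoids accidentally discarding assemblies whose location happens to contain the letters "bat".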
Emergence of SARS-CoV-2 genomic diversity over time
The 7666 SARS-CoV-2 genomes offer an excellent geographical and temporal coverage of the COVID-19 pandemic (Fig. 1a-b). The genomic diversity of the 7666 SARS-CoV-2 genomes is represented as Maximum Likelihood phylogenies in a radial (Fig. 1c) and linear layout (Fig. S1-S2). There is a robust temporal signal in the data, captured by a statistically significant correlation between sampling dates and ‘root-to-tip’ distances for the 7666 SARS-CoV-2 genomes (Fig. S3; R² = 0.20, p < 0.001). Such positive
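The temporal signal described above comes from regressing root-to-tip distance on sampling date. A minimal sketch of that calculation follows, using ordinary least squares written out by hand. It assumes the root-to-tip distances (in substitutions per site) have already been extracted from a rooted tree; the function name and the toy values are hypothetical, and no p-value is computed here.

```python
from datetime import date


def root_to_tip_regression(sampling_dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    Returns (slope per day, intercept, R squared). The slope is a crude
    estimate of the substitution rate per site per day.
    """
    if len(sampling_dates) != len(distances) or len(distances) < 3:
        raise ValueError("need at least three paired observations")
    origin = min(sampling_dates)
    x = [(d - origin).days for d in sampling_dates]  # days since first sample
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(distances) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in distances)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, distances))
    if sxx == 0 or syy == 0:
        raise ValueError("dates or distances have zero variance")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Toy data: distances grow with time, with some scatter.
toy_dates = [date(2020, 1, 1), date(2020, 2, 1), date(2020, 3, 1), date(2020, 4, 1)]
toy_distances = [0.00005, 0.00012, 0.00010, 0.00021]
slope, intercept, r2 = root_to_tip_regression(toy_dates, toy_distances)
print(f"rate per day: {slope:.2e}, R^2: {r2:.2f}")
```

A positive slope with a non-trivial R² indicates that later samples tend to sit further from the root, which is what allows the tree to be calibrated in time.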
Discussion
Pandemics have been affecting humanity for millennia (Balloux and van Dorp, 2017). Over the last century alone, several global epidemics have claimed millions of lives, including the 1957/58 influenza A (H2N2) pandemic, the sixth (1899–1923) and seventh ‘El Tor’ (1961–1975) cholera pandemics, as well as the HIV/AIDS pandemic (1981–today). COVID-19 acts as an unwelcome reminder of the major threat that infectious diseases represent in terms of deaths and disruption.
One positive aspect of the
Author contributions
L.v.D. and F.B. conceived and designed the study; L.v.D., M.A., D.R., L.P.S., C.E.F., L.O., C.J.O., J.P., C.C.S.T., F.A.T.B., and A.T.O. analysed data and performed computational analyses; L.v.D. and F.B. wrote the paper with inputs from all co-authors.
Acknowledgments and funding
L.v.D. and F.B. acknowledge financial support from the Newton Fund UK-China NSFC initiative (grant MR/P007597/1) and the BBSRC (equipment grant BB/R01356X/1). Computational analyses were performed on the UCL Computer Science cluster and the South Green bioinformatics platform hosted on the CIRAD HPC cluster. We thank Jaspal Puri for insights and assistance on the development of the alignment visualisation tool and Nicholas McGranahan and Rachel Rosenthal for their comments on the manuscript. We
Declaration of Competing Interest
The authors have no competing interests to declare.
References (40)
- et al. Transmission dynamics and evolutionary history of 2019-nCoV. J. Med. Virol. (2020)
- et al. Unique and conserved features of genome and proteome of SARS-coronavirus, an early split-off from the coronavirus group 2 lineage. J. Mol. Biol. (2003)
- et al. Infectious Diseases of Humans. Dynamics and Control. (1991)
- et al. Q&A: what are pathogens, and what have they done to and for us? BMC Biol. (2017)
- et al. Computational inference of selection underlying the evolution of the novel coronavirus, SARS-CoV-2. J. Virol. (2020)
- et al. HomoplasyFinder: a simple tool to identify homoplasies on a phylogeny. Microbial Genom. (2019)
- et al. A SARS-CoV-2 vaccine candidate would likely match all currently circulating strains. bioRxiv (2020)
- et al. Bayesian inference of ancestral dates on bacterial phylogenetic trees. Nucleic Acids Res. (2018)
- et al. An unusually high substitution rate in transplant-associated BK polyomavirus in vivo is further concentrated in HLA-C-bound viral peptides. PLoS Pathog. (2018)
- et al. MERS-CoV recombination: implications about the reservoir and potential for adaptation. Virus Evol. (2016)
- Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Chall.
- Toward defining course of evolution - minimum change for a specific tree topology. Syst. Zool.
- Pandemic potential of a strain of influenza A (H1N1): early findings. Science
- The first two cases of 2019-nCoV in Italy: where they come from? J. Med. Virol.
- A SARS-CoV-2-human protein-protein interaction map reveals drug targets and potential drug-repurposing. Nature
- SeaView version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol. Biol. Evol.
- Unifying the epidemiological and evolutionary dynamics of pathogens. Science
- A sequence homology and bioinformatic approach can predict candidate targets for immune responses to SARS-CoV-2. Cell Host Microbe
- MPBoot: fast phylogenetic maximum parsimony tree inference and bootstrap approximation. BMC Evol. Biol.
- The evolution of Ebola virus: insights from the 2013-2016 epidemic. Nature
1 Equal contribution.