Abstract
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has produced five variants of concern (VOC) to date. The important Spike mutation ‘N501Y’ is common to Alpha, Beta, Gamma and Omicron VOC, while the ‘P681R’ is key to Delta’s spread. We have analysed circa 4.2 million SARS-CoV-2 genome sequences from the world’s largest repository ‘Global Initiative on Sharing All Influenza Data (GISAID)’ and demonstrated that these two mutations have cooccurred on the Spike ‘D614G’ mutation background at least 3,678 times from 17 October 2020 to 1 November 2021. In contrast, the Y501-H681 combination, which is common to Alpha and Omicron VOC, is present in circa 1.1 million entries. Two-thirds of the 3,678 cooccurrences were in France, Turkey or US (East Coast), and the rest across 57 other countries. 55.5% and 4.6% of the cooccurrences were Alpha’s Q.4 and Gamma’s P.1.8 sub-lineages acquiring the P681R; 10.7% and 3.8% were Delta’s B.1.617.2 lineage and AY.33 sub-lineage acquiring the N501Y; the remaining 10.2% were in other variants. Despite the selective advantages individually conferred by N501Y and P681R, the Y501-R681 combination counterintuitively did not outcompete other variants in every instance we have examined. While this is a relief to worldwide public health efforts, in vitro and in vivo studies are urgently required in the absence of a strong in silico explanation for this phenomenon. This study demonstrates a pipeline to analyse combinations of key mutations from public domain information in a systematic manner and provide early warnings of spread.
1. Introduction
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes the ongoing novel coronavirus disease 19 (COVID-19) pandemic, has a single-stranded positive-strand ribonucleic acid genome (ssRNA(+) genome) of size 26–32 kb, with high fidelity replication due to 3′-to-5′ exoribonuclease ‘proof-reading’ mechanism [P. Zhou 1, D.C. Bauer 2]. As this virus adapts to its human host, we have seen it evolve and present quasispecies diversity [D.C. Bauer 2, C.O. Wilke 3]. While most of the several thousands of mutations catalogued to date aren’t substantial functional changes they have proven aetiologically useful [N.D. Grubaugh 4, E. Callaway 5].
Two years on since the start of the COVID-19 pandemic, we now have a good idea of the key mutations, especially in the Spike protein, which are punctuations in the evolutionary story of this virus to date. The D614G mutation reported in April 2020 resulted in the ‘G-strain’ with increased infectivity replacing the genomic background of this virus globally, although there was no impact on vaccine efficacy [E. Callaway 5, B. Korber 6, A.J. McAuley 7]. However, the N501Y mutation common to the Alpha, Beta, Gamma and Omicron variants of concern (VOC), has contributed to enhanced infection and transmission, reduced vaccine efficacy, and the ability of SARS-CoV-2 to infect new species such as wild type mice [Y. Liu 8, S.S. Abdool Karim 9, S. Riddell 10, M.J. Kuiper 11]. Another key mutation is the P681R which alters the furin cleavage site, and has been responsible for increased infectivity, transmission and global impact of the Delta variant [E. Callaway 12, Y. Liu 13, T.P. Peacock 14].
2. Methods
2.1 Study Objectives
Our primary objective is to investigate the risk of the three aforementioned ‘mutations of current interest’ (MOCI) cooccurring naturally due to convergent evolution and resulting in a SARS-CoV-2 variant that is of greater concern than those declared to date, noting that the latest Omicron VOC has the N501Y but not the P681R mutation. Our secondary objective is to demonstrate a methodology and pipeline to analyse the spread of variants containing combinations of important mutations. We have achieved this by mining data from GISAID, the Global Initiative on Sharing All Influenza Data, which is the largest and the most comprehensive repository of SARS-CoV-2 sequences [S. Elbe 15, D.C. Bauer 16]. We have used quasispecies theory and in silico modelling to interpret our findings to the extent possible, and recommended future research directions.
2.2 Bioinformatics Analysis
We mined GISAID data on 3 November 2021 containing 4,835,087 entries, 99.9% (4,831,865) sequenced from people who tested COVID-19 positive. Of these, 4,748,462 contained over 29 thousand nucleotides and were deemed ‘complete’. GISAID has also defined a subset of these 4.8 million sequences, containing 3,519,085 entries, as ‘high coverage’. This is because 99% of their bases are defined; there is no unverified deletion or insertion; and unique mutations not seen in other sequences constitute fewer than 0.05% in these entries. Consequently, some of our analysis of the data downloaded on 1 November 2021 (4,766,139 sequences) required manual curation whenever we encountered incorrect dates, or low coverage sequencing that could skew our results. Supplementary Table S1 provides GISAID’s break down of these 4.8 million entries by variants containing the D614G (98.8%), N501Y (27.01%) and P681R (48.2%) mutations; by Alpha (23.9%), Beta (0.8%), Gamma (2.4%), Delta (47.8%), Omicron (0.0%) VOC; and by Lambda (0.2%) and Mu (0.3%) variants of interest (VOI).
Human-origin SARS-CoV-2 sequences over 29 kb length were aligned to the nominal reference (EPI_ISL_402123) using an in-house alignment pipeline that generated a file in the ‘variant call format’ (VCF) containing all the mutations as of 1 November 2021 [P. Zhou 1, D.C. Bauer 2]. In the VCF containing 4,294,469 genome sequences, we searched for entries with MOCI, for example, the Spike N501Y was searched at nucleotide position 23403 for an A to G transition. The EPI_ISL sample IDs containing our MOCI were merged with their respective GISAID metadata (location, date of collection, lineage) to create an annotated database. Supplementary Table S1 shows the split up of these 4.3 million sequences by variants containing the D614G (98.8%), N501Y (29.2%) and P681R (44.2%) mutations; and by Alpha (26.3%), Beta (0.9%), Gamma (2.2%), Delta (43.8%), Omicron (0.0%), Lambda (0.2%) and Mu (0.3%) VOC/VOI.
Using a custom Python code, we identified the presence of the D614G, N501Y and P681R mutations, and combinations thereof (c.f. ‘Data Availability’ for our code). Our final dataset contained 4,177,098 entries after filtering out GISAID records with incomplete dates, i.e. retaining those with a ‘YYYY-MM-DD’ format. We ascertained whether these isolates are one of the VOC (and if yes, which sub-lineage), their location (country, state, and city depending on information available), and sample collection date. Using the latter, we calculated rolling averages over 14-days, as this window has been shown from previous experience to be an optimal period to reduce background noise, especially as the data is discrete and highly variable in size across countries [M.J. Kuiper 11]. However, unlike our previous study in which we were interested in macroscopic trends, here we have looked at every instance of key mutations cooccurring; therefore, we did not use a threshold or filter based on minimal sample size. Our analysis presents the spread of these mutations from the notional start of the SARS-CoV-2 pandemic (31 December 2019) to the end of our observation period (01 November 2021) with appropriate classifications (e.g. Alpha, Beta, Gamma, Delta).
2.3 Biomolecular Modelling
Fully glycosylated in silico models of the SARS-CoV-2 Spike protein were constructed based on the ‘6VSB’ protein databank (PDB) structure [D. Wrapp 17], minus the transmembrane domain (residues 1161-1272) for additional computational efficiency as described previously [A.J. McAuley 7]. Models were simulated in aqueous solution (TIP3, water, 0.15M ions, NVT ensemble, 310K) using NAMD 2.14 software [J.C. Phillips 18]. Models were visualized with the visual molecular dynamics (VMD) program for large biomolecular systems which uses 3-D graphics and built-in scripting to assess the spatial arrangement of mutations and any complementary or inhibitory interactions [W. Humphrey 19].
3. Results: Frequency and Distribution of D614G, N501Y and P681R Mutations
The earliest record of these three MOCI coming together was in Slovenia on 17 October 2020, however, no further observations were recorded since then in that country. Mining of GISAID data found 3,678 entries (0.1%) containing the MOCI and a majority of these were in current VOC (Error! Reference source not found.) – Alpha (61.3%), Beta (0.4%), Gamma (6.4%), Delta (21.2%), Omicron (0.0%) – and a small proportion in current or former VOI – Mu (0.3%), Kappa (0.2%) and Iota (0.02%). A small proportion (10.2%) was also observed in other variants not classified as VOC, VOI, Variant Under Monitoring (VUM), or specific sub-lineages.
Figure 2 illustrates the MOCI frequency in countries with at least 50 recorded sequences – Brazil (5.8%), Denmark (1.9%), France (22.7%), Germany (3.6%), Sweden (5.2%), Turkey (21.4%), UK (3.6%) and USA (22.1%). These eight countries represent 86.4% of the cases containing the MOCI (Figure 3A, which shows a continuous timeline). Fifty-seven other countries recorded less than 50 entries for the period 17 October 2020 to 15 September 2021 (Supplementary Table S2).
Three countries – France, USA and Turkey – stand out as they each contribute to over 22% of the total instances of the MOCI cooccurring. It’s unsurprising that these trends overlap with the spread of VOC in these nations (Figures 3B-3D), because there would have been more opportunities for Delta to acquire N501Y, and for Alpha or Gamma to acquire P681R (Figures 3B-3F). Such cooccurrences also appear to have happened over short periods of time, for instance during March to May 2021 in France and USA with the Alpha VOC; and between June and August 2021 in Turkey with the Delta VOC (Figures 2 and 3B-3D). This could be due to founder effects, but we cannot establish this by mining GISAID data alone; it will require detailed epidemiological investigations by national public health authorities, which is beyond the scope of this work. Nevertheless, from Figure 2, we can be reasonably sure that a number of independent events of convergent evolution (i.e. MOCI cooccurring) have taken place. It is worth investigating why we do not see as many instances of Delta acquiring N501Y in France and USA, even though they had a very high number of this VOC from July 2021.
4. Discussion
4.1 Y501-R681 almost always in the G614 background
From Supplementary Table S3 we see that 3,688 entries contain N501Y+P681R, just ten more than the 3,678 entries that contain D614G+N501Y+P681R; in other words, the instances of N501Y cooccurring with P681R have almost always happened on the D614G background as predicted [E. Callaway 5, B. Korber 6, A.J. McAuley 7]. Although these three mutations are positioned sufficiently far apart in the 3-dimensional protein structure to suggest they don’t interact (Figure 4), further studies are required to understand whether there may be indirect functional links that could enhance viral efficiency.
4.2 Predominance of Alpha Q.4 acquiring P681R and Delta acquiring N501Y
We see that the cooccurrence of the MOCI is largely due to; (a) the Alpha VOC acquiring the P681R (61.3%), and this has happened overwhelmingly in Alpha’s Q.4 sub-lineage (89.9%) by definition; (b) the Delta VOC acquiring the N501Y (21.2%), and this has mainly happened in Delta’s parent lineage (B.1.617.2; 50.6%) and also the sub-lineage AY.33 (17.9%); (c) the Gamma VOC acquiring the P681R (6.4%), predominantly in its P.1.8 sub-lineage (71.3%). It is worth noting that the Q.4 sub-lineage has the P681R by definition (https://outbreak.info/situation-reports?pango=Q.4). The Beta VOC acquiring the P681R only constituted 0.4% of the total occurrences, and this was almost entirely in the B.351.2 parent lineage (93.3%). Supplementary Table S4 shows the results for other key Spike mutations in comparison to those obtained from GISAID. For H69del and Y145del, more samples were identified using our pipeline than on GISAID – 1,138,302 versus 1,123,439 before date filtering, and 2,563 versus 1,364 after date filtering.
4.3 Alpha acquiring P681R ahead of other VOC could be due to P/H681R
Alpha, Beta and Gamma each have N501Y, however Alpha was the first VOC to acquire P681R, both in terms of absolute (Figure 3E) and relative (Figure 3F) timelines, although Beta was reported four months before Alpha in May 2020. Alpha, especially its Q.4 sub-lineage, also contributes to most instances of the MOCI cooccurring. There is a possible biomolecular basis for this; P681H is present in Alpha, but not in the other two VOC. The Grantham scores associated with P681R (103) and P681H (77) are comparable and the H681R substitution is thus a conservative change (Grantham distance 29). There is only one way to get to Histidine (H) or Arginine (R) from Proline (P); and for H and R there is only one way to get to each other; and we see no structural reason from our in silico model as to why one should be preferred to the other. Thus, we could infer that some of the instances of Alpha acquiring P681R could have been due to a substitution of H with R, which is the signature of the Q.4 sub-lineage. Spike’s 681st position is the fifth substrate sequence for cleavage recognition; both furin and the transmembrane serine protease 2 (TMPRSS2) cleave the Spike at 685/686 position with H (and possibly R) enhancing this process [A.J. McAuley 7, A.G. Remacle 20, M. Mahoney 21]. It is worth noting that to penetrate host cells, the SARS-CoV-2 Spike protein must be cut twice by host proteins. In the SARS-CoV-1 (SARS), both incisions occur after the virus has locked on to a cell. But with SARS-CoV-2, the presence of the furin cleavage site enables host enzymes like furin to make the first cut as newly formed viral particles emerge from an infected cell. These pre-activated viral particles can then go on to infect cells more efficiently compared to particles requiring two cuts. Thus, the P681R increases the susceptibility of the furin cleavage site in Delta VOC, and allows the exposure of the Spike’s S2 subunit for better cell integration.
4.4 Cooccurrences have been reported predominantly in eight countries
Cooccurrences of the MOCI were observed with Alpha, Beta, Gamma and Delta VOC in 41, 8, 11 and 47 countries respectively, indicating convergent evolution, especially as much of the world was under lockdown during 2020-2021. Curiously, two thirds of the observations have been reported from just France, Turkey and USA; 86.3% from these three countries plus Brazil, Denmark, Sweden, UK and Germany (Supplementary Table 2). Several instances involve proximal regions, suggesting multiple founder effects (Figures 2 and 5). For example, in Denmark and Sweden, variants with the MOCI were concentrated in highly populous Hovedstaden and Svealand regions (Sörmland, Stockholm, Uppsala and Västmandland) [Statista website 22]; the MOCI frequencies were also correlated with the most common lineage in respective countries and regions, suggesting community transmission, although these MOCI did not go on to become the dominant variant in those locations. This is in line with most within-host variants getting lost during transmission, and only a few founding infections being maintained in a given population [K.A. Lythgoe 23].
Cooccurrence of the MOCI in the UK occurred mostly with the Alpha VOC’s B.1.1.7 and Q.4 from late 2020 to May 2021; followed by the second wave of the Delta VOC since April 2021, especially in AY.4 that had the highest (58%) prevalence (Figure 2). The former period had ten times the death rate (∼1200 per week between September 2020 and March 2021) compared to the latter (∼120 per week; probably due to lockdowns and 30% of the population already double-vaccinated). In the UK which has a very high rate of sequencing, the overall numbers of cooccurring MOCI were still low (133 records, 88% in England). Of these, there were respectively 52, 39 and 18 instances of the AY.4 acquiring N501Y, and the B.1.1.7 and Q.4 acquiring P681R [outbreak.info 24, Our World in Data 25]. It is possible that the lockdown protocols and vaccination coverage may have influenced the observed viral transmission dynamics [Institute for Government 26]. In the Americas, most of 1,078 cooccurrences of the MOCI were detected either in the US East Coast (673 entries), or in the Sul (Santa Catarina) and Sudeste (São Paulo, Rio de Janeiro) regions in the South and Southeast of Brazil (284 entries). These trends could be due to founder effects and lack of travel restrictions from early 2021, especially in the United States [Ballotpedia 27] (Figure 2 and 5). In Brazil, the cooccurrences of the MOCI also corresponded with the prevalence of P.1 until August 2021 after which the AY.99.2 became the dominant variant (Supplementary Figure S3). Unfortunately, regional and city level information was not available for Turkey on GISAID, preventing further analysis and insights and highlighting the importance of metadata [D.C. Bauer 16].
4.5 Y501-R681 doesn’t outcompete other variants
From our analysis, SARS-CoV-2 isolates containing Y501-R681 in Spike did not outcompete other variants in every single instance we have examined (Figures 2 and 3), which is both counterintuitive and a relief to worldwide public health efforts. Although individually these two mutations have been reported to confer selective advantage [Y. Liu 8, S.S. Abdool Karim 9, S. Riddell 10, M.J. Kuiper 11, E. Callaway 12, Y. Liu 13, T.P. Peacock 14], their combination apparently has not and we have not found any compelling biomolecular reason for this (Figure 4). It would be desirable to conduct in vitro and in vivo studies to gain a better understanding, either using infectious clones and reverse genetics [T.T.N. Thao 28, X. Xie 29, Y.J. Hou 30, X. Xie 31], or using naturally occurring comparable isolates with and without these mutations [see for instance A.J. McAuley 7].
4.6 Implications for the Omicron VOC
Alpha and Omicron each have the Y501-H681 combination which is present in 1,085,434 entries out of the 4.2 million we have examined in this study (i.e. over 25%); this contrasts with the Y501-R681 of which we have only found 3,688 observations. Just as the Q.4 sub-lineage with Y501-R681emerged when the Alpha VOC started to spread, it is possible that a new sub-lineage of Omicron with Y501-R681 could emerge given the rapid rise of this latest VOC [N. Ferguson 32, S. Cele 33, J.R.C. Pulliam 34]. The spread of VOC in immunocompromised patients could lead to mutations with a selective advantage for antibody escape and/or transmissibility (e.g. N501Y and P681H/R), in addition to a number of deletions in the N terminal domain [L. Corey 35], and this aspect needs further investigation given Omicron’s large number of mutations (many of them yet to be studied in depth). The adaptation and evolution of this virus in new hosts, for instance mice and rats, also pose additional risks such as additional reservoirs and reinfection of humans [M.J. Kuiper 11]. Investigations into the dynamics affecting the SARS-CoV-2 pandemic in each of the eight countries are also warranted, in the wider context of travel restrictions, population demographics and dynamics [K.A. Grépin 36, J. Liebig 37]. Lack of patient-deidentified metadata in a consistent format further complicates meaningful analysis as emphasized before [D.C. Bauer 16].
4.7 SARS-CoV-2 evolution and selective pressures on quasispecies
Any virus which is new to a host will undergo a period of adaptation, as numerous selective pressures come to bear. RNA viruses, such as coronaviruses, are error-prone in their replication and exist as a cloud of variants, with the dominant clade representing the majority population, but with a number of diverse variants also represented within the population, known as quasispecies [D.C. Bauer 2, C.O. Wilke 3, E. Domingo 38, T.W. Drew 39]. Among viruses which are established in their host, the dominant clade is generally relatively ‘fit’ in its ability to replicate and out-competes all the other clades. However, when a virus is new to a host, its relative fitness may be quite low, so, in any particular environment, different clades can have similar relative fitness. This means subtle differences might give a particular clade a small advantage enabling it to predominate as the population continues its host adaptation. The environment may comprise of host factors, such as host genetics, tissue tropism, age, immune status or competence, presence of other infections, as well as external factors, such as temperature and humidity. With SARS-CoV-2, the extraordinary public health measures in some regions of the world are likely exerting their own selective pressures on its evolution and creating numerous local environments. This phenomenon, called the ‘survival of the flattest’ [C.O. Wilke 3] can lead to the emergence of multiple clades dominating under different environments. These is also an increasing appreciation that, within the host, different clades may predominate in different tissues at the same time and that there may be interactions among the diverse quasispecies that facilitate the infection [F. Sun 40].
The pathway of adaptation is not always towards severe disease, since this may not lead to maximum replication and transmission. Indeed, many viruses with a long history of acute infection in an established host species, the disease may be mild or inapparent, reflecting a relationship more towards stasis, where the impact on the host does not compromise the chances of the progeny virus being transmitted to a new host. For many coronaviruses, we have seen an evolution towards milder infection, albeit with greater transmission rates than has been seen with SARS-CoV-2 to date. Certainly, some level of reduction in the efficacy of vaccine-induced antibody in neutralising the Delta and Omicron variants has been demonstrated, but the other feature of these viruses, which is less often discussed, is the ability of these VOC to replicate by means of syncytial formation, rather than the simple apoptotic cycle seen with earlier variants [L. Lin 41]. This may confer an important immune evasive aspect to replication, which might offer a distinct advantage, in immune avoidance. It might also explain why these clades are more productive and less pathogenic since systemic infection is not always involved.
4.8 Error catastrophe, Muller’s ratchet and implications for SARS-CoV-2
A phenomenon related to quasispecies evolution is that of error catastrophe, which may occur if the mutational rate exceeds the ability of the fitness landscape to accommodate the resultant evolving clades [H. Tejero 42]. The key triggers for this are not fully understood, but it is thought that the application of strict biosecurity measures may play a part by restricting the availability of new hosts. In such cases, the quasispecies becomes increasingly attenuated, with mutations accumulating via a mechanism called ‘Muller’s ratchet’, to the point where the epidemic may die out [J.G.G. de Alcañíz 43]. This has been seen in many prior outbreaks of Ebola, where the infection became increasingly less pathogenic and less transmissible. While the current environment of SARS-CoV-2 is complex and highly variable, the selective pressure on a virus is towards higher transmissibility and milder disease in all environments, and the current systems of control preferentially select for such. Once a virus develops either/both of these traits, it is very unusual for the reverse to occur because viruses with higher transmissibility out-compete those with lesser transmissibility and those which are milder disable the host less, so they continue to travel and mix with susceptible hosts. It remains to be seen if this is being observed with the Omicron VOC as early reports indicate higher transmissibility and less severe disease.
In comparing variants of SARS-CoV-2, it is tempting to focus on specific mutations and their location – also perhaps with overt attention being paid to the Spike protein, especially its receptor binding domain – and not consider that the observed changes could evolve independently of each other. The secondary and tertiary folding of proteins, as well as post translational modifications, can have a profound effect on function, such that a mutation at one site might require a complimentary change at another site in order to confer advantage, or be wiped out by another, seemingly unrelated change. For the same reason, convergent evolution, might also occur, exhibited by the co-existence of identical co-mutations in virus clades of different lineage [H. Brüssow 44]. Another phenomenon known to occur in many coronaviruses is recombination, where partial genomic exchange may occur between two coronaviruses when simultaneously infecting the same host. The extent to which this may have occurred in SARS-CoV-2 is the likely subject of future analysis [D. Haddad 45, S. Pollet 46].
5. Conclusions and Limitations
SARS-CoV-2 is the most sequenced virus in the world, however, there is an inherent bias introduced by the highly variable sequencing-to-COVID-19 positivity ratio (S:P) across countries (e.g., the UK has one of the highest P:S ratios in the world), and across time (S:P has generally gone up in all countries, e.g. India). Thus, the GISAID data is likely to contain non-random samples with a skew in favour of sequences associated with epidemiologically consequent cases, albeit this is difficult to decipher as most sequences have little or no meaningful patient-deidentified information. The lockdowns and vaccination coverage associated with this viral pandemic have also been unprecedented in human history and spatio-temporally variable, further complicating our analysis. Notwithstanding these limitations, which have led to the proliferation of certain clades in certain environments, we are still able to make broad conclusions thanks to the very large number of sequences available on GISAID. We are also able to demonstrate a process and pipeline to analyse mutations of structural, functional and epidemiological consequence from public domain information in a systematic manner and provide early warnings of certain combinations starting to spread. Though it is unclear which combinations of mutations provide the best selective advantage, there is concern that mutations previously observed in other dominant lineages may arise spontaneously in contemporary strains further increasing their potential for infectivity and adverse clinical outcomes. In this case-study we analysed three key mutations (viz. D614G, N501Y and P681R) cooccurring but found no evidence of this combination spreading. Although 3,678 sequences were found on GISAID between 17 October 2020 to 1 November 2021, mainly in France, Turkey and USA, the Y501-R681 combination hasn’t outcompeted other variants and this warrants further in silico, in vitro and in vivo investigations. With the latest VOC Omicron containing a large number of mutations, many yet to be studied in-depth, our methodology will be very useful to understand whether certain combinations of mutations are more transmissible. If GISAID entries are strengthened with patient-deidentified metadata, then this approach could enable early epidemiological intelligence, for instance on case severity, mortality, and factors such as age, gender, race and co-morbidities that could increase the infection risk.
Data Availability
This paper only uses publicly accessible data that can be downloaded from GISAID, the largest repository of SARS-CoV-2 genome sequences. The lists of GISAID EPI_ISL_ sample IDs and data used by our analysis can be requested from the corresponding author. Our Python code is available at https://github.com/Carol-Lee-gh/Covid-Mutation-Pipieline.git.
https://github.com/Carol-Lee-gh/Covid-Mutation-Pipieline.git
General
We are grateful for support from our CSIRO colleagues at the Australian Centre for Disease Preparedness (https://www.grid.ac/institutes/grid.413322.5), especially Kim Blasdell, Simran Chahal, Alexander McAuley and Nagendrakumar Singanallur, and the Australian e-Health Research Centre, especially Denis Bauer, David Hansen, Yatish Jain, Brendan Hosking and Aidan Tay.
Author contributions
Describe the contributions of each author (use initials) to the paper. Conceptualization and methodology, S.S.V.; bioinformatics, C.L. and L.O.W.W.; biomolecular modelling and interpretation, M.J.K. and T.W.D.; data analysis and original draft preparation, S.M., C.L. and S.S.V.; funding acquisition, S.S.V. and L.O.W.W.; manuscript review and editing, all authors.
Competing interests
None declared.
Data Availability
This paper only uses publicly accessible data that can be downloaded from GISAID, the world’s largest repository of SARS-CoV-2 genome sequences. The lists of GISAID’s EPI_ISL_ sample IDs and data used by our analysis can be requested from the corresponding author. Our Python code is available at https://github.com/Carol-Lee-gh/Covid-Mutation-Pipieline.git.
Ethics Statement, Identifying Information and Clinical Trial Registration
This study does not involve any human samples, identifiable human data or patient records. It only involves bioinformatics analysis of novel coronavirus genome sequences openly accessible from the public repository ‘GISAID’, and these sequences do not contain any metadata that can be identified with any individual. This includes sample collection dates and anonymous sample IDs that are openly accessible. Our study does not report any exact age or photographs. Clinical trial registration is not applicable as this is neither an interventional study nor an observational study involving humans.
Funding
This work was supported by funding (Principal Investigator: S.S.V.) from the CSIRO Future Science Platforms, National Health and Medical Research Council (MRF2009092), and United States Food and Drug Administration (FDA) Medical Countermeasures Initiative contract (75F40121C00144); and by funding (Principal Investigator: L.O.W.W.) from the Australian Academy of Science and the Australian Department of Industry, Science, Energy and Resources. The article reflects the views of the authors and does not represent the views or policies of the funding agencies including the FDA.