ABSTRACT
Never before has such a vast amount of data been collected for a viral pandemic as for COVID-19. This makes it possible to answer a number of highly relevant questions regarding the evolution of the virus and the role mutations play in its spread among the population. We focus on spike proteins, as they bear the main responsibility for the effectiveness of the virus diffusion by controlling the interactions with the host cells. Using the available temporal structure of the sequencing data for the SARS-CoV-2 spike protein in the UK, we demonstrate that every wave of the pandemic is dominated by a different variant. Consequently, the time evolution of each variant follows a temporal structure encoded in the epidemiological Renormalisation Group approach to compartmental models. Machine learning is the tool of choice to determine the variants at play, independent of (but complementary to) the virological classification. Our Machine Learning algorithm on spike protein sequencing provides a simple and unbiased way to identify, classify and track relevant virus variants without any prior knowledge of their characteristics. Hence, we propose a new tool that can help prevent and forecast the emergence of new waves, and that can be used by decision makers to define short- and long-term strategies to curb the current COVID-19 pandemic or future ones.
Highlights
Objectives To study the relation between mutations of SARS-CoV-2, the emergence of relevant variants and the multi-wave pattern of the COVID-19 pandemic.
Setting Genomic sequencing of the SARS-CoV-2 spike proteins in the UK nations (England, Scotland, Wales). Epidemiological data for the number of infections in the UK nations, South Africa, California and India.
Methodology We designed a simple Machine Learning algorithm based on the Levenshtein distance on the spike protein sequences to cluster the available dataset and define variants. We set up a time-sensitive procedure that allows a variant to be defined as a chain of subsequent clusters. The Mutation epidemiological Renormalisation Group (MeRG) framework is used to describe the epidemiological data.
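As an illustration of distance-based clustering on sequences, the following minimal sketch computes the Levenshtein distance with the standard dynamic-programming recurrence and groups sequences whose distance falls below a threshold. The single-linkage grouping rule, the `max_distance` threshold, the function names and the toy sequences are all assumptions made for this example; the clustering procedure actually used in this work is not specified in the abstract and may differ.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def cluster_by_distance(sequences, max_distance):
    """Hypothetical single-linkage grouping (an assumption, not the paper's method):
    two sequences share a cluster if a chain of pairs within max_distance links them."""
    parent = list(range(len(sequences)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(sequences)), 2):
        if levenshtein(sequences[i], sequences[j]) <= max_distance:
            parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(sequences):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


# Toy amino-acid fragments, invented for illustration only.
toy_sequences = [
    "MFVFLVLLPLVSSQ",
    "MFVFLVLLPLVSSQ",
    "MFVFFVLLPLVSSQ",
    "MKVFLVAAPLVTSQ",
]
clusters = cluster_by_distance(toy_sequences, max_distance=1)
```

The pairwise comparison is quadratic in the number of sequences, so a sketch like this is only practical on small samples or on deduplicated sequences.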
Results Our analysis of the sequencing data from England, Wales and Scotland shows that:
A Machine Learning analysis based only on the spike proteins efficiently identifies the variants of concern and of interest, as well as other variants relevant for the diffusion of the virus.
We identify a branching relation between variants, thus reconstructing the phylogeny of the main variants.
Comparison with the epidemiological data demonstrates that each new wave is dominated by a new emerging variant, thus confirming the hypothesis that there is a strong correlation between the emergence of variants and the multi-wave pattern.
The number of infections caused by each variant can be modelled via an independent logistic function (sigmoid), thus confirming the MeRG approach. Analyses of the epidemiological data for South Africa, California and India further corroborate this result.
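The last result can be pictured with a short sketch in which each variant contributes its own logistic curve and the total is their sum. The parameterisation (plateau `a`, growth rate `gamma`, midpoint `t0`), the function names and the numerical values are assumptions chosen for illustration; the MeRG equations themselves are not reproduced here and may use a different form.

```python
import math


def logistic(t, a, gamma, t0):
    """Sigmoid with plateau a, growth rate gamma and midpoint t0 (assumed form)."""
    return a / (1.0 + math.exp(-gamma * (t - t0)))


def total_infections(t, variants):
    """Sum of independent logistic curves, one per variant."""
    return sum(logistic(t, a, gamma, t0) for a, gamma, t0 in variants)


# Hypothetical parameters for two variants: (plateau, rate per day, midpoint day).
variants = [
    (1000.0, 0.10, 50.0),
    (3000.0, 0.08, 200.0),
]

# Sample the combined curve every 50 days.
curve = [(day, total_infections(day, variants)) for day in range(0, 401, 50)]
```

Under this picture, successive waves appear as successive steps in the cumulative curve, each step associated with one variant.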
Conclusions Using a simple Machine Learning algorithm, we are able to identify, classify and track relevant virus variants without any prior knowledge of their characteristics. While our analysis is based only on spike protein sequencing and is unbiased, the results are validated by other informed methods based on the complete genome. By correlating the variant definition to epidemiological data, we discover that each new wave of the COVID-19 pandemic is driven and dominated by a new emerging variant, as identified by our Machine Learning analysis. The results are seminal to the development of a new strategy to study how SARS-CoV-2 variants emerge and to predict the characteristics of future mutations of the spike proteins. Furthermore, the same methodology can be applied to other viral diseases, like influenza, if sufficient sequencing data is available. Hence, we provide an effective and unbiased method to identify new emerging variants that can be responsible for the onset of a new epidemiological wave. Our Machine Learning strategy is, in fact, a new tool that can help prevent and forecast the emergence of new waves, and it can be used by decision makers to define short- and long-term strategies to curb the current COVID-19 pandemic or future ones.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
None
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Not applicable.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
All raw data used in this work are obtained from open-source repositories: https://www.gisaid.org for the sequencing and https://ourworldindata.org/ for the epidemiological data. The Machine Learning code is available at https://github.com/AdeledeHoffer/ML-Covid