medRxiv
Variant-driven multi-wave pattern of COVID-19 via a Machine Learning analysis of spike protein mutations

Adele de Hoffer, Shahram Vatani, Corentin Cot, Giacomo Cacciapaglia, Maria Luisa Chiusano, Andrea Cimarelli, Francesco Conventi, Antonio Giannini, Stefan Hohenegger, Francesco Sannino
doi: https://doi.org/10.1101/2021.07.22.21260952
Adele de Hoffer
1Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
Shahram Vatani
2Institut de Physique des 2 Infinis (IP2I), CNRS/IN2P3, UMR5822, 69622 Villeurbanne, France
3Université de Lyon, Université Claude Bernard Lyon 1, 69001 Lyon, France
Corentin Cot
2Institut de Physique des 2 Infinis (IP2I), CNRS/IN2P3, UMR5822, 69622 Villeurbanne, France
3Université de Lyon, Université Claude Bernard Lyon 1, 69001 Lyon, France
Giacomo Cacciapaglia
2Institut de Physique des 2 Infinis (IP2I), CNRS/IN2P3, UMR5822, 69622 Villeurbanne, France
3Université de Lyon, Université Claude Bernard Lyon 1, 69001 Lyon, France
  • For correspondence: g.cacciapaglia@ipnl.in2p3.fr
Maria Luisa Chiusano
4Department of Agricultural Sciences, Università degli Studi di Napoli Federico II, Portici, 80055 Italy
5Department of Research Infrastructures for Marine Biological Resources (RIMAR), Stazione Zoologica “Anton Dohrn”, 80121 Napoli, Italy
Andrea Cimarelli
6Centre International de Recherche en Infectiologie (CIRI), Univ Lyon, Inserm, U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, ENS de Lyon, 46 Allée d’Italie, 69007 Lyon, France
Francesco Conventi
7INFN sezione di Napoli, Complesso Universitario di Monte S. Angelo Edificio 6, via Cintia, 80126 Napoli, Italy
8Dipartimento di Ingegneria Università degli studi di Napoli Parthenope, Centro Direzionale di Napoli, Isola C 4, lato Sud, 80143 Napoli, Italy
Antonio Giannini
9University of Science and Technology of China (USTC), No.96, JinZhai Road, Baohe District, Hefei, Anhui, 230026, China
Stefan Hohenegger
2Institut de Physique des 2 Infinis (IP2I), CNRS/IN2P3, UMR5822, 69622 Villeurbanne, France
3Université de Lyon, Université Claude Bernard Lyon 1, 69001 Lyon, France
Francesco Sannino
7INFN sezione di Napoli, Complesso Universitario di Monte S. Angelo Edificio 6, via Cintia, 80126 Napoli, Italy
10Dipartimento di Fisica E. Pancini, Università di Napoli Federico II, Complesso Universitario di Monte S. Angelo Edificio 6, via Cintia, 80126 Napoli, Italy
11Scuola Superiore Meridionale, Largo S. Marcellino 10, 80138 Napoli, Italy
12CP3-Origins & the Danish Institute for Advanced Study, University of Southern Denmark, Campusvej 55, DK-5230 Odense, Denmark

ABSTRACT

Never before has such a vast amount of data, including genome sequencing, been collected for a viral pandemic as for COVID-19. This offers the possibility to trace the evolution of the virus and to assess, in real time, the role mutations play in its spread within the population. To this end, we focused on the Spike protein for its central role in mediating viral outbreak and replication in host cells. Employing the Levenshtein distance on the Spike protein sequences, we designed a machine learning algorithm that yields a temporal clustering of the available dataset. From this, we were able to identify and define emerging persistent variants that are in agreement with known evidence. Our novel algorithm allowed us to define persistent variants as chains that remain stable over time and to highlight emerging variants of epidemiological interest as branching events that occur over time. Hence, we determined the relationship and temporal connection between variants of interest and the ensuing passage to dominance of the current variants of concern. Remarkably, the analysis and the relevant tools introduced in our work serve as an early warning for the emergence of a new persistent variant once the associated cluster reaches 1% of the time-binned sequence data. We validated our approach and its effectiveness on the onset of the Alpha variant of concern. We further predict that the recently identified lineage AY.4.2 (‘Delta plus’) is giving rise to a new emerging variant. Comparing our findings with the epidemiological data, we demonstrated that each new wave is dominated by a new emerging variant, thus confirming the hypothesis of a strong correlation between the birth of variants and the multi-wave temporal pattern of the pandemic. The above allows us to introduce the epidemiology of variants, which we describe via the Mutation epidemiological Renormalisation Group (MeRG) framework.
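To make the distance measure concrete, the following is a minimal sketch of the Levenshtein distance applied to amino-acid strings, together with a pairwise distance matrix of the kind a clustering step could consume. It is not the authors' implementation (their code is in the repository listed under Data Availability); the function names and the short toy fragments are assumptions made for illustration only, and the clustering itself is not reproduced here.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-residue insertions, deletions or
    substitutions needed to turn sequence a into sequence b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, residue_a in enumerate(a, start=1):
        current = [i]
        for j, residue_b in enumerate(b, start=1):
            cost = 0 if residue_a == residue_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def distance_matrix(sequences: List[str]) -> List[List[int]]:
    """Symmetric matrix of pairwise Levenshtein distances."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(sequences[i], sequences[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy fragments (hypothetical, far shorter than a real Spike sequence).
reference = "MFVFLVLLPLVSSQ"
substituted = "MFVFLVLLPLVSNQ"  # one residue replaced
deleted = "MFVFLVLPLVSSQ"       # one residue removed

matrix = distance_matrix([reference, substituted, deleted])
```

Because the distance counts insertions and deletions as well as substitutions, it can compare sequences of different lengths without a prior alignment.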

Highlights

  • Objectives To study the relation among Spike protein mutations, the emergence of relevant variants and the multi-wave pattern of the COVID-19 pandemic.

  • Setting Genomic sequencing of the SARS-CoV-2 Spike proteins in the UK nations (England, Scotland, Wales). Epidemiological data for the number of infections in the UK nations, South Africa, California and India.

  • Methodology We design a machine learning algorithm, based on the Levenshtein distance on the Spike protein sequences, that leads to a temporal clustering of the available dataset, from which we define emerging persistent variants. The above allows us to introduce the epidemiology of variants, which we describe via the Mutation epidemiological Renormalisation Group (MeRG) framework.

  • Results We show that:

    1. Our approach, based only on the Spike protein sequence, allows us to efficiently identify the variants of concern (VoCs) and of interest (VoIs), as well as other emerging variants occurring during the diffusion of the virus.

    2. Within our time-ordered chain analysis, a branching relation emerges, which makes it possible to reconstruct the evolutionary diversification of Spike variants and the establishment of the epidemiologically relevant ones.

    3. Our analysis provides an early warning for the emergence of a new persistent variant once its associated dominant Spike sequence reaches 1% of the time-binned sequence data. Validation on the onset of the Alpha VoC shows that our early warning is triggered 6 weeks before the WHO classification decision.

    4. Comparison with the epidemiological data demonstrates that each new wave is dominated by a new emerging variant, thus confirming the hypothesis that there is a strong correlation between the emergence of variants and the multi-wave temporal pattern depicting the viral spread.

    5. A theory of variant epidemiology is established, which describes the temporal evolution of the number of individuals infected by different emerging variants via the MeRG approach. This is corroborated by empirical data.

  • Conclusions Applying a ML approach to the temporal variability of the Spike protein sequence enables us to identify, classify and track emerging virus variants. Our analysis is unbiased, in the sense that it does not require any prior knowledge of the variant characteristics, and our results are validated by other informed methods that define variants based on the complete genome. Furthermore, correlating persistent variants of our approach to epidemiological data, we discover that each new wave of the COVID-19 pandemic is driven and dominated by a new emerging variant. Our results are therefore indispensable for further studies on the evolution of SARS-CoV-2 and the prediction of evolutionary patterns that determine current and future mutations of the Spike proteins, as well as their diversification and persistence during the viral spread. Moreover, our ML algorithm works as an efficient early warning system for the emergence of new persistent variants that may pose a threat of triggering a new wave of COVID-19. Capable of a timely identification of potential new epidemiological threats when the variant only represents 1% of the new sequences, our ML strategy is a crucial tool for decision makers to define short and long term strategies to curb future outbreaks. The same methodology can be applied to other viral diseases, influenza included, if sufficient sequencing data is available.
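The 1% early-warning criterion mentioned above can be illustrated with a short sketch. It assumes that every sequence has already been assigned a time bin (for instance a week index) and a cluster label by some upstream clustering step. The function name, the record layout and the toy counts are hypothetical and are not taken from the authors' code.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def first_warning_bins(
    records: Iterable[Tuple[int, str]],
    threshold: float = 0.01,
) -> Dict[str, int]:
    """Map each cluster label to the earliest time bin in which it
    accounts for at least `threshold` of the sequences in that bin."""
    counts_per_bin = defaultdict(Counter)
    for time_bin, cluster in records:
        counts_per_bin[time_bin][cluster] += 1

    warnings: Dict[str, int] = {}
    for time_bin in sorted(counts_per_bin):  # scan bins in chronological order
        total = sum(counts_per_bin[time_bin].values())
        for cluster, count in counts_per_bin[time_bin].items():
            if cluster not in warnings and count / total >= threshold:
                warnings[cluster] = time_bin
    return warnings


# Toy data (hypothetical): cluster "B" grows from 0% to 0.5% to 2% of a bin.
records = (
    [(0, "A")] * 1000
    + [(1, "A")] * 995 + [(1, "B")] * 5
    + [(2, "A")] * 980 + [(2, "B")] * 20
)
warnings = first_warning_bins(records)
```

In this toy input, cluster "B" is flagged only in the bin where its share first crosses the threshold. A real pipeline would also need to decide how to treat bins with very few sequences, which this sketch does not address.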

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

None

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Not applicable.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Footnotes

  • ↵* sannino{at}cp3.sdu.dk

  • Extended analysis with early warning performance results and study of the spike protein diversification.

Data Availability

All raw data used in this work are obtained from open-source repositories: https://www.gisaid.org for the sequencing and https://ourworldindata.org/ for the epidemiological data. The Machine Learning code is available at https://github.com/AdeledeHoffer/ML-Covid

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted October 22, 2021.

Subject Area

  • Epidemiology