PT - JOURNAL ARTICLE
AU - Adele de Hoffer
AU - Shahram Vatani
AU - Corentin Cot
AU - Giacomo Cacciapaglia
AU - Maria Luisa Chiusano
AU - Andrea Cimarelli
AU - Francesco Conventi
AU - Antonio Giannini
AU - Stefan Hohenegger
AU - Francesco Sannino
TI - Variant-driven multi-wave pattern of COVID-19 via a Machine Learning analysis of spike protein mutations
AID - 10.1101/2021.07.22.21260952
DP - 2021 Jan 01
TA - medRxiv
PG - 2021.07.22.21260952
4099 - http://medrxiv.org/content/early/2021/10/22/2021.07.22.21260952.short
4100 - http://medrxiv.org/content/early/2021/10/22/2021.07.22.21260952.full
AB - Never before has such a vast amount of data, including genome sequencing, been collected for a viral pandemic as for the current case of COVID-19. This offers the possibility to trace the evolution of the virus and to assess, in real time, the role mutations play in its spread within the population. To this end, we focused on the Spike protein for its central role in mediating viral outbreak and replication in host cells. Employing the Levenshtein distance on the Spike protein sequences, we designed a machine learning algorithm yielding a temporal clustering of the available dataset. From this, we were able to identify and define emerging persistent variants that are in agreement with known evidence. Our novel algorithm allowed us to define persistent variants as chains that remain stable over time and to highlight emerging variants of epidemiological interest as branching events that occur over time. Hence, we determined the relationship and temporal connection between variants of interest and the ensuing passage to dominance of the current variants of concern. Remarkably, the analysis and the relevant tools introduced in our work serve as an early warning for the emergence of new persistent variants once the associated cluster reaches 1% of the time-binned sequence data. We validated our approach and its effectiveness on the onset of the Alpha variant of concern.
We further predict that the recently identified lineage AY.4.2 (‘Delta plus’) is giving rise to a new emerging variant. Comparing our findings with the epidemiological data, we demonstrated that each new wave is dominated by a new emerging variant, thus confirming the hypothesis of a strong correlation between the birth of variants and the multi-wave temporal pattern of the pandemic. The above allows us to introduce the epidemiology of variants, which we describe via the Mutation epidemiological Renormalisation Group (MeRG) framework.

Highlights

Objectives: To study the relation among Spike protein mutations, the emergence of relevant variants and the multi-wave pattern of the COVID-19 pandemic.

Setting: Genomic sequencing of the SARS-CoV-2 Spike proteins in the UK nations (England, Scotland, Wales). Epidemiological data for the number of infections in the UK nations, South Africa, California and India.

Methodology: We design a machine learning algorithm, based on the Levenshtein distance on the Spike protein sequences, that leads to a temporal clustering of the available dataset, from which we define emerging persistent variants. The above allows us to introduce the epidemiology of variants, which we describe via the Mutation epidemiological Renormalisation Group (MeRG) framework.

Results: We show that:
Our approach, based only on the Spike protein sequence, allows us to efficiently identify the variants of concern (VoCs) and of interest (VoIs), as well as other emerging variants occurring during the diffusion of the virus.
Within our time-ordered chain analysis, a branching relation emerges, thus permitting the reconstruction of the evolutionary diversification of Spike variants and the establishment of the epidemiologically relevant ones.
Our analysis provides an early warning for the emergence of a new persistent variant once its associated dominant Spike sequence reaches 1% of the time-binned sequence data.
Validation on the onset of the Alpha VoC shows that our early warning is triggered 6 weeks before the WHO classification decision.
Comparison with the epidemiological data demonstrates that each new wave is dominated by a new emerging variant, thus confirming the hypothesis that there is a strong correlation between the emergence of variants and the multi-wave temporal pattern depicting the viral spread.
A theory of variant epidemiology is established, which describes the temporal evolution of the number of individuals infected by different emerging variants via the MeRG approach. This is corroborated by empirical data.

Conclusions: Applying an ML approach to the temporal variability of the Spike protein sequence enables us to identify, classify and track emerging virus variants. Our analysis is unbiased, in the sense that it does not require any prior knowledge of the variant characteristics, and our results are validated by other informed methods that define variants based on the complete genome. Furthermore, correlating the persistent variants found by our approach with epidemiological data, we discover that each new wave of the COVID-19 pandemic is driven and dominated by a new emerging variant. Our results are therefore indispensable for further studies on the evolution of SARS-CoV-2 and the prediction of evolutionary patterns that determine current and future mutations of the Spike proteins, as well as their diversification and persistence during the viral spread. Moreover, our ML algorithm works as an efficient early warning system for the emergence of new persistent variants that may pose a threat of triggering a new wave of COVID-19. Capable of a timely identification of potential new epidemiological threats when the variant represents only 1% of the new sequences, our ML strategy is a crucial tool for decision makers to define short- and long-term strategies to curb future outbreaks.
The same methodology can be applied to other viral diseases, influenza included, if sufficient sequencing data is available.

Competing Interest Statement: The authors have declared no competing interest.

Funding Statement: None

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Not applicable.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes

All raw data used in this work are obtained from open-source repositories: https://www.gisaid.org for the sequencing and https://ourworldindata.org/ for the epidemiological data. The Machine Learning code is available at https://github.com/AdeledeHoffer/ML-Covid
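
The abstract names two ingredients that a short example may help make concrete: the Levenshtein distance between Spike protein sequences, and an early warning raised once a cluster reaches 1% of the time-binned sequence data. The following is a minimal Python sketch of those two ideas only. It is not the authors' implementation (their code is at the repository linked above). The function names `levenshtein` and `early_warnings`, the `(time_bin, cluster_label)` input layout, and the assumption that cluster labels have already been assigned by some earlier clustering step are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def early_warnings(records, threshold=0.01):
    """Return {cluster_label: first time bin in which its share >= threshold}.

    `records` is an iterable of (time_bin, cluster_label) pairs, one per
    sequence. This layout is an assumption of the sketch, not the paper's format.
    """
    counts = defaultdict(Counter)
    for time_bin, label in records:
        counts[time_bin][label] += 1

    first_bin = {}
    for time_bin in sorted(counts):
        total = sum(counts[time_bin].values())
        for label, n in counts[time_bin].items():
            if label not in first_bin and n / total >= threshold:
                first_bin[label] = time_bin
    return first_bin


if __name__ == "__main__":
    # Toy amino-acid fragments, invented for illustration only.
    print(levenshtein("MFVFLVLLPLVSSQ", "MFVFLVLLPLVSRQ"))

    # Toy time-binned labels: cluster "B" grows from 0.5% to 1.5% of a bin.
    toy = (
        [(0, "A")] * 200
        + [(1, "A")] * 199 + [(1, "B")] * 1
        + [(2, "A")] * 197 + [(2, "B")] * 3
    )
    print(early_warnings(toy))
```

In this sketch the threshold is applied to the share of pre-labelled clusters within each time bin; how the clusters and chains are actually built from the distance matrix is described in the paper and is not reproduced here.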