Abstract
Profiling the genetic evolution and dynamic spreading of viruses is a crucial task when responding to epidemic outbreaks. We aim to devise novel ways to model, visualise and analyse the temporal dynamics of epidemic outbreaks in order to help researchers and other people involved in crisis response to make well-informed and targeted decisions about from which geographical locations and time periods more genetic samples may be required to fully understand the outbreak. Our approach relies on the application of Transcendental Information Cascades to a set of temporally ordered nucleotide sequences, and we apply it to real-world data that was collected during the currently ongoing outbreak of the novel 2019-nCoV coronavirus. We assess information-theoretic and network-theoretic measures that characterise the resulting complex network and identify touching points and temporal pathways that are candidates for deeper investigation by geneticists and epidemiologists.
- epidemic outbreaks
- genome analysis
- bioinformatics
- complex systems
- network analysis
- sequential data mining
Introduction
How far a virus will eventually spread at local, regional, national and international levels is of central concern during an epidemic outbreak due to the severe consequences large-scale epidemics have on human well-being [5], the stability and coherence of social systems [11], and the global economy [23]. A thorough understanding of the genetic characteristics of a virus is crucial to understand the way it is (or may be) transmitted (e.g. human-to-human transmissibility) in order to be able to anticipate the potential reach of an outbreak and develop effective countermeasures. The recent outbreak of a new type of coronavirus – 2019-nCoV – provides plenty of examples of the way researchers seek to quickly approach the aforementioned problems of genetic decomposition of the virus [2, 8] and modelling of its transmission dynamics [15].
Here we present a novel approach to gain insights into the transmission dynamics of an epidemic outbreak. Our method is based on Transcendental Information Cascades [16,17] and combines gene sequence analysis with temporal data mining to uncover potential relationships between a virus’ genetic evolution and distinct occurrences of infections. We apply this new approach to publicly available genomic sequence data from the currently ongoing outbreak of the 2019-nCoV coronavirus.
Our results show pathways of similarity between certain clusters of 2019-nCoV virus samples taken in different geographical locations, and indicate coronavirus cases that are candidates for further investigation into the human touch points of these patients in order to derive a detailed understanding of how the virus may have spread, and how and why it has genetically evolved. Our analysis provides a unique view to the dynamics of the 2019-nCoV coronavirus outbreak that complements knowledge obtained using other state-of-the-art methods such as Bayesian evolutionary analysis [9].
Materials and methods
Research data and preprocessing
We obtained a total of 87 genomic sequences from human subjects that were available via the GISAID EpiFlu Database™ 1 as of Monday, February 10th 2020.
These data were preprocessed using the R programming language. First, we derived a metadata object that allows to store the sequence identifier, the collection location, the collection date, and the raw nucleotide sequence. The metadata object was then filtered to keep only those sequences for which the collection date was not empty and which featured a length that fell within a margin of 1, 000 of the estimated 30, 000 nucleotides that are currently assumed to make up the genomic code of the 2019-nCoV coronavirus (i.e. we cover the range of nucleotide sequences from 29, 000 to 31, 000 in length). This left us with a total of 82 genomic sequences. We then exported a single CSV file containing only the raw nucleotide sequences in ascending order by the collection date of the sample. This structure is the standard input format for the genes-CODON-samplesequence tokeniser featured in the Transcendental Information Cascades R toolchain that is available as free and open scientific software2.
Construction of Transcendental Information Cascades
Transcendental Information Cascades (TICs) are a method that first transforms any kind of sequential data into a network of recurring information tokens, to then exploit this network as a kind of analytical middle ground between views to the source data that would otherwise only be accessible individually through distinct analytical frameworks [16, 17]. The approach falls into the category of contemporary approaches that make use of network theory for the study of dynamical complex systems [4, 19, 21]. In the field of phylogenetic and phylogeographic analysis our approach can be compared to haplotype networks for example [1, 18] but with the benefit that the resulting network preserves access to temporal patterns at both macroscopic (e.g. whole generation time periods) and microscopic (e.g. short time period bursts) resolution.
At an abstract level a Transcendental Information Cascade is constructed from some sequentially ordered source data in the following way: At first the data from each distinct sequence step is processed to extract unique information tokens. These tokens form the identifier set for that particular sequence step. Then a directed network is generated with one vertex for each sequence step and edges between any two vertices that have a particular token from their identifier set in common and that does not occur in any identifier set of vertices that represent sequence steps between these two. In other words, identifier sets represent information token co-occurrences and edges represent information token recurrences.
As the information tokens of interest in this study we encode unique codon identifiers as (a) their position when sliding a window of size 3 in steps of size 3 over the nucleotide sequence (e.g. “pos1”, “pos2”, …), (b) a flag that indicates the reading frame (“+1”, “+2” or “+3”) that captures the respective triplet in the current window, and (c) the actual matched nucleotide triplet that constitutes the matched codon. The use of this tokenisation is motivated by previous work on phylogenetic profiling and gene sequencing [6, 13] that suggested that codons suit well for assessing similarities (and dissimilarities) at the gene, chromosome, or genome levels [6, 14, 22].
An example of the particular encoding we use is visualised in Figure 1. It is this tokenisation of the original nucleotide sequences that we use in the TIC approach to construct the information token recurrence network from. This means in our TIC instantiation the nodes represent distinct nucleotide sequences of the virus obtained from patient samples ordered by the date of their collection by researchers (random order in case of same date collections). The edges are unique codon identifier recurrences matched between different nucleotide sequences.
Analysis of Transcendental Information Cascades
The TIC network we derive in the way described before can be analysed in a variety of ways. Information-theoretic and network-theoretic measures have been introduced as the base framework to start the analysis from [16]. In particular we perform (1) an analysis of the information token entropy and evenness over time, (2) an analysis of network clusters of sequences that can be detected in the TIC network, and (3) an analysis of the intra-cluster and inter-cluster nucleotide similarity of the original sequences.
Information token entropy and evenness
We perform an analysis of information token entropy and information token evenness in order to understand whether the virus evolution can be considered an open or a closed system. Therefore we assess both measures in an accumulated fashion for each progression step through the ordered nucleotide sequences. In a closed system the information token entropy should strive towards an equilibrium of maximum entropy and not feature any wave-like up and down patterns. If we observe wave-like patterns it means nucleotide sequences at later progression stages feature codon identifiers that have not been observed in earlier nucleotide sequences.
So to measure information token entropy we assess which unique codon identifiers were seen up to a progression step and how often these were seen. This allows to compute the probability for every codon identifier, which then can be used to determine the information token entropy up to that point following the definition of Shannon entropy given as follows:
Analog to this we can then assess the information token evenness as per Pielou’s species evenness index, which is defined as where H’ is the measured Shannon entropy, and is the maximum entropy that would be measured if all tokens would be equally likely to occur.
Network analysis
The Transcendental Information Cascade network we constructed can be mathematically and visually analysed in order to find structures that are of particular significance for a single epidemic outbreak or to compare different outbreaks structurally. In this study we focus on the former due to the lack of comparative data about other epidemic outbreaks.
We first perform a cluster analysis using the random walk apporach by Blondel et al. [3] on the weighted network that we can construct from our original TIC network by collapsing all edges between the same vertices to unique edges weighted by the sum of the collapsed edges (see Figure 2 for a schematic example and Figure 4 for the actual networks we construct following this approach). Afterwards, we analyse visually the network using the open source software Gephi. In Gephi we scale vertices by their degree and edges by their weight. Furthermore, we colour vertices by their cluster membership. We then run the Yifan Hu [12] layout algorithm (optimal distance: 10,000; relative strength: 0.2; initial step size: 20; step ratio: 0.95; quadtree max level: 50; theta: 0.8; convergence threshold: 1*10−4; adaptive cooling: enabled) to visualise the network. We repeat the exact same process to construct and visualise three further networks (cf. Figures 5-7) that are representative of the TIC when exclusively focusing on the codon identifiers of each of the reading frames individually.
Cluster evaluation
In order to evaluate the meaningfulness of the TIC network against some baseline, we computer the intra-cluster and inter-cluster similarity of the raw nucleotide sequences. This is done by computing all pairwise comparisons of sequences within individual clusters (intra-cluster similarity) and comparison between all sequences from a cluster with all sequences that are not within that cluster (inter-cluster similarity) using the compareStrings function provided by the Biostrings R package [20]. The similarity of two sequences is then simply defined as follows:
It is important to note that not all sequences we obtained had the exact same length (mean length of nucleotide sequences: 29861.95; standard deviation: 28.40705). Hence, we performed this comparison only on the sub-sequence from the first nucleotide position to the maximum nucleotide position of the shorter of the two sequences a and b.
Results and discussion
Information-theoretic analysis
Figure 3 shows the graphs for the information token entropy and information token evenness. As one can see, these feature the aforementioned wave-like patterns, which means the system can be regarded open in the sense that over time the number of unique codon identifiers grows.
It is important to note that the evenness of the distribution of codon identifiers is relatively high, converging around a value of 0.95. This means that those codon identifiers that have been observed are quite evenly distributed.
These results allow the interpretation that there is evolving variation in the genomic sequences we studied, but (b) there is no indication for the dominance of a particular variance.
Network analysis
The weighted network derived from the original TIC network features a modularity of 0.762 and a diameter of 7, which mainly reflects its small size and its particular structure that results from preserving temporal order in the sense that edges can only ever direct forward in time (i.e. to vertices that represent a later stage). However, with an average degree of 8.671 we find that there is certainly structure that differs from just simple step-by-step forward paths that lead to the consecutively next node. This latter result suggests that some codon paths branch and merge, emphasising that there is an evolving variation in the studied genomic sequences.
From the cluster analysis in combination with a visual inspection of the resulting graph (please refer to Figure 6 as well as Table 11 in the appendix) we find the following patterns that are noteworthy:
At the center of the network one cluster emerges (dark red, cluster 0) that has only few stronger connections within the cluster but features connections to all other clusters and within itself through lower weighted edges. This cluster has no temporal focus, it comprises samples that were collected early as well as later during the outbreak. While the genome sequences that represent the beginning of the progression stage path in this cluster are from China, it then transitions geographically to Australia, Japan, Korea, Germany, Taiwan and Singapore.
The light red cluster (cluster 1) comprises a coherent sequence of samples taken from patients in the original outbreak region (Wuhan) and is temporally focused on the early progression stages.
The orange cluster starting from the bottom center of the network (cluster 2) is again first a sequence of samples from the original outbreak region in Wuhan, transitioning to Thailand (Nonthaburi), then back to China (Guangdong), before finishing in France.
Cluster 3 (yellow) features samples taken in various places in China but also two later connections to the USA.
Cluster 4 (light green) is one of the smallest clusters but is spread fairly wide temporally. It features 4 samples taken in China and then transitions geographically to Singapore and France.
In dark green (cluster 5) we can identify a cluster of genome data that was collected in Chonggqing, the USA and Singapore.
The light blue cluster (cluster 6) is a temporally quite coherent sequence that links samples taken in Guangdong, Guangzhou and Foshan.
A kind of international cluster (samples from Australia, the USA, England, Finland and Belgium) is represented by the dark blue cluster (cluster 7).
We suggest that while there are some obvious pathways of high similarity due to temporal order of the samples taken, there are still good indications for geographical-dependencies and transition phases that deserve further investigation. Transition points into and out of cluster 0 are particularly interesting in this regard, especially when the clusters linking in our out feature samples taken outside of China. We suggest that it may be worth to qualitatively assess the human-to-human connections the respective subjects had from which these samples were taken, and the places they have visited if there are no known links between the samples so far.
Triangulating the full TIC network with the networks that focus on individual reading frame similarity only (please refer to Figures 6, 7 and 8 as well as Tables 13, 15 and 17 in the appendix) shows that the macroscopic structure is mostly consistent but much sparser in terms of low-weighted edges.
Inter- and intra-cluster similarity
From our measurement of the intra-cluster similarity (see Table 1) we find that the genomic similarity at the level of nucleotides is low in the central cluster 0, which is expected because this is the largest of all clusters with low temporal coherence. However, cluster 5 also features a slightly lower average similarity with comparably high standard deviation, which indicates that this cluster may be slightly less coherent.
The barchart of the inter-cluster analysis results shown in Figure 5 reveals that there are some cluster pairs with high inter-cluster similarity, that the inter-cluster similarity is different between the +3 reading frame network and the other two reading frame networks, and that the pattern of the +3 reading frame network inter-cluster similarity is similar to the one for the entire TIC network. In particular we find that in the +1 and +2 reading frame networks the clusters 1, 4, 6 and 7 are quite similar at the nucleotide level compared to all other clusters that are pairwise distinct. For the entire TIC network and the +3 reading frame network instead we find that clusters 1, 3, 5 and 7 show this kind of similarity. Altogether, these results indicate that (a) in some instances the clustering based on structural characteristics of the TIC network overrides actual genomic similarity, and (b) the choice of the reading frames during TIC construction has an impact on the resulting network that needs to be further investigated.
Conclusion
In this paper we presented a novel way to model, visualise and analyse the temporal dynamics of an epidemic outbreak. We applied Transcendental Information Cascades to capture the recurrence and co-occurrence of unique codon identifiers over distinct and temporally ordered nucleotide sequences derived from genetic samples taken from different human subjects. Our case study was focused on the recent outbreak of the novel 2019-nCoV coronavirus. We find clusters and evolution pathways of virus genomes in the constructed Transcendental Information Cascade that are candidates to be investigated further for potential human-to-human transmission paths and genetic evolution of the 2019-nCoV coronavirus.
Our study was performed as a snapshot case study on a limited sample of virus data. While this was at the time of this study all data that was available, we will need to re-execute this analysis continuously as more data becomes available to check whether any of our results change. Furthermore, our results indicate that experiments with alternative tokenisations are needed, such as weighting the +1, +2 and +3 reading frames differently (or omitting certain frames entirely) depending on their functional role [10]. We also suggest that it will be important to perform a wider variety of analysis of the generated TIC data (e.g. testing of different network clustering methods, application of dynamical systems analyses techniques) to get more profound insights. Our analysis also does not take measurement biases into account that may results from different techniques used by different laboratories for genome extraction and analysis [7].
In summary, the approach presented in this paper is exploratory and hypotheses generating in nature. It does not confirm or reject any of the hypotheses we generated about the potential transmission and evolution pathways nor does it provide causal explanation for the patterns found. However, the purpose of our work is to provide an alternative view to the epidemic dynamics of the 2019-nCoV outbreak (and potential other outbreaks in the future) that is hidden to state-of-the-art methods, and that may trigger geneticists and epidemiologists to look at pathways in epidemic outbreaks they would not normally be aware of.
Data Availability
This study uses secondary data that was obtained from the GISAID EpiFlu™ Database. The software used in this study is available as open source software to allow for reproduction of our analysis.
Acknowledgements
We want to thank the laboratories and researchers who contributed the genomic research data used in this study and obtained via the GISAID EpiFlu Database™ 3 on Monday, February 10th 2020. The following tables contain information about the origination of the research data used in this study and give credit to the collecting and submitting researchers and laboratories.
Appendix
The software developed for this particular analysis is part of the open scientific software “R toolchain to construct and analyse Transcendental Information Cascades” that is available via https://github.com/vuw-c2lab/transcendental-information-cascades/tree/master/input/2020-02-03-2019nCoV. For reproduction purposes it is necessary to obtain the raw genomic data used in this study from the GISAID EpiFlu Database™. For further reference we provide the raw results of some of our analysis. We particularly point to the tables that provide references for relationships between nucleotide sequences and the cluster membership of nucleotide sequences.
Footnotes
↵* markus.luczak-roesch{at}vuw.ac.nz
↵2 https://github.com/vuw-c2lab/transcendental-information-cascades