Networks of information token recurrences derived from genomic sequences may reveal hidden patterns in epidemic outbreaks: A case study of the 2019-nCoV coronavirus

Markus Luczak-Roesch

doi:10.1101/2020.02.07.20021139

Abstract

Profiling the genetic evolution and dynamic spreading of viruses is a crucial task when responding to epidemic outbreaks. We aim to devise novel ways to model, visualise and analyse the temporal dynamics of epidemic outbreaks in order to help researchers and other people involved in crisis response to make well-informed and targeted decisions about from which geographical locations and time periods more genetic samples may be required to fully understand the outbreak. Our approach relies on the application of Transcendental Information Cascades to a set of temporally ordered nucleotide sequences, and we apply it to real-world data that was collected during the currently ongoing outbreak of the novel 2019-nCoV coronavirus. We assess information-theoretic and network-theoretic measures that characterise the resulting complex network and identify touching points and temporal pathways that are candidates for deeper investigation by geneticists and epidemiologists.

epidemic outbreaks
genome analysis
bioinformatics
complex systems
network analysis
sequential data mining

Introduction

How far a virus will eventually spread at local, regional, national and international levels is of central concern during an epidemic outbreak due to the severe consequences large-scale epidemics have on human well-being [5], the stability and coherence of social systems [11], and the global economy [23]. A thorough understanding of the genetic characteristics of a virus is crucial to understand the way it is (or may be) transmitted (e.g. human-to-human transmissibility) in order to be able to anticipate the potential reach of an outbreak and develop effective countermeasures. The recent outbreak of a new type of coronavirus – 2019-nCoV – provides plenty of examples of the way researchers seek to quickly approach the aforementioned problems of genetic decomposition of the virus [2, 8] and modelling of its transmission dynamics [15].

Here we present a novel approach to gain insights into the transmission dynamics of an epidemic outbreak. Our method is based on Transcendental Information Cascades [16,17] and combines gene sequence analysis with temporal data mining to uncover potential relationships between a virus’ genetic evolution and distinct occurrences of infections. We apply this new approach to publicly available genomic sequence data from the currently ongoing outbreak of the 2019-nCoV coronavirus.

Our results show pathways of similarity between certain clusters of 2019-nCoV virus samples taken in different geographical locations, and indicate coronavirus cases that are candidates for further investigation into the human touch points of these patients in order to derive a detailed understanding of how the virus may have spread, and how and why it has genetically evolved. Our analysis provides a unique view to the dynamics of the 2019-nCoV coronavirus outbreak that complements knowledge obtained using other state-of-the-art methods such as Bayesian evolutionary analysis [9].

Materials and methods

Research data and preprocessing

We obtained a total of 87 genomic sequences from human subjects that were available via the GISAID EpiFlu Database™ ¹ as of Monday, February 10th 2020.

These data were preprocessed using the R programming language. First, we derived a metadata object that allows to store the sequence identifier, the collection location, the collection date, and the raw nucleotide sequence. The metadata object was then filtered to keep only those sequences for which the collection date was not empty and which featured a length that fell within a margin of 1, 000 of the estimated 30, 000 nucleotides that are currently assumed to make up the genomic code of the 2019-nCoV coronavirus (i.e. we cover the range of nucleotide sequences from 29, 000 to 31, 000 in length). This left us with a total of 82 genomic sequences. We then exported a single CSV file containing only the raw nucleotide sequences in ascending order by the collection date of the sample. This structure is the standard input format for the genes-CODON-samplesequence tokeniser featured in the Transcendental Information Cascades R toolchain that is available as free and open scientific software².

Construction of Transcendental Information Cascades

Transcendental Information Cascades (TICs) are a method that first transforms any kind of sequential data into a network of recurring information tokens, to then exploit this network as a kind of analytical middle ground between views to the source data that would otherwise only be accessible individually through distinct analytical frameworks [16, 17]. The approach falls into the category of contemporary approaches that make use of network theory for the study of dynamical complex systems [4, 19, 21]. In the field of phylogenetic and phylogeographic analysis our approach can be compared to haplotype networks for example [1, 18] but with the benefit that the resulting network preserves access to temporal patterns at both macroscopic (e.g. whole generation time periods) and microscopic (e.g. short time period bursts) resolution.

At an abstract level a Transcendental Information Cascade is constructed from some sequentially ordered source data in the following way: At first the data from each distinct sequence step is processed to extract unique information tokens. These tokens form the identifier set for that particular sequence step. Then a directed network is generated with one vertex for each sequence step and edges between any two vertices that have a particular token from their identifier set in common and that does not occur in any identifier set of vertices that represent sequence steps between these two. In other words, identifier sets represent information token co-occurrences and edges represent information token recurrences.

As the information tokens of interest in this study we encode unique codon identifiers as (a) their position when sliding a window of size 3 in steps of size 3 over the nucleotide sequence (e.g. “pos1”, “pos2”, …), (b) a flag that indicates the reading frame (“+1”, “+2” or “+3”) that captures the respective triplet in the current window, and (c) the actual matched nucleotide triplet that constitutes the matched codon. The use of this tokenisation is motivated by previous work on phylogenetic profiling and gene sequencing [6, 13] that suggested that codons suit well for assessing similarities (and dissimilarities) at the gene, chromosome, or genome levels [6, 14, 22].

An example of the particular encoding we use is visualised in Figure 1. It is this tokenisation of the original nucleotide sequences that we use in the TIC approach to construct the information token recurrence network from. This means in our TIC instantiation the nodes represent distinct nucleotide sequences of the virus obtained from patient samples ordered by the date of their collection by researchers (random order in case of same date collections). The edges are unique codon identifier recurrences matched between different nucleotide sequences.

Figure 1. Example of the unique codon tokenisation used in this study.

The example sequence comprises three window steps and shows how the +1 (marked in blue), +2 (marked in orange) and +3 (marked in black) reading frames lead to seven unique codon tokens extracted from the nucleotide sequence.

Analysis of Transcendental Information Cascades

The TIC network we derive in the way described before can be analysed in a variety of ways. Information-theoretic and network-theoretic measures have been introduced as the base framework to start the analysis from [16]. In particular we perform (1) an analysis of the information token entropy and evenness over time, (2) an analysis of network clusters of sequences that can be detected in the TIC network, and (3) an analysis of the intra-cluster and inter-cluster nucleotide similarity of the original sequences.

Information token entropy and evenness

We perform an analysis of information token entropy and information token evenness in order to understand whether the virus evolution can be considered an open or a closed system. Therefore we assess both measures in an accumulated fashion for each progression step through the ordered nucleotide sequences. In a closed system the information token entropy should strive towards an equilibrium of maximum entropy and not feature any wave-like up and down patterns. If we observe wave-like patterns it means nucleotide sequences at later progression stages feature codon identifiers that have not been observed in earlier nucleotide sequences.

So to measure information token entropy we assess which unique codon identifiers were seen up to a progression step and how often these were seen. This allows to compute the probability for every codon identifier, which then can be used to determine the information token entropy up to that point following the definition of Shannon entropy given as follows:

Analog to this we can then assess the information token evenness as per Pielou’s species evenness index, which is defined as where H^’ is the measured Shannon entropy, and is the maximum entropy that would be measured if all tokens would be equally likely to occur.

Network analysis

The Transcendental Information Cascade network we constructed can be mathematically and visually analysed in order to find structures that are of particular significance for a single epidemic outbreak or to compare different outbreaks structurally. In this study we focus on the former due to the lack of comparative data about other epidemic outbreaks.

We first perform a cluster analysis using the random walk apporach by Blondel et al. [3] on the weighted network that we can construct from our original TIC network by collapsing all edges between the same vertices to unique edges weighted by the sum of the collapsed edges (see Figure 2 for a schematic example and Figure 4 for the actual networks we construct following this approach). Afterwards, we analyse visually the network using the open source software Gephi. In Gephi we scale vertices by their degree and edges by their weight. Furthermore, we colour vertices by their cluster membership. We then run the Yifan Hu [12] layout algorithm (optimal distance: 10,000; relative strength: 0.2; initial step size: 20; step ratio: 0.95; quadtree max level: 50; theta: 0.8; convergence threshold: 1*10⁻⁴; adaptive cooling: enabled) to visualise the network. We repeat the exact same process to construct and visualise three further networks (cf. Figures 5-7) that are representative of the TIC when exclusively focusing on the codon identifiers of each of the reading frames individually.

Figure 2. Schematic example of the weighted TIC network constructed from nucleotide sequences with tokenised unique codon identifiers.

Figure 3. Information-theoretic assessment.

Both entropy and evenness graphs feature wave-like patterns, indicating that the set of observed unique codon identifiers grows with increasing progression stages while the distribution of all observed unique codon identifiers remains rather even.

Figure 4. Figure 4. Weighted TIC network.

From the original TIC network we generate a weighted network by collapsing all edges between the same vertices to unique edges weighted by the sum of the collapsed edges. The colours represent the nine identified network clusters. Node size represents the degree and edge thickness represents edge weight (in the sense that higher edge weight indicates a higher amount of codon identifiers shared by two vertices). Numeric indices before the genome identifiers represent the progression stage index of the particular vertex in the sequence of all ordered nucleotide sequences we analysed.

Figure 5. Overview of inter-cluster similarities for (a) the full TIC network, (b) the +1 reading frame network, (c) the +2 reading frame network, and (d) the +3 reading frame network.

Figure 6. +1 reading frame network.

Figure 7. +2 reading frame network.

Cluster evaluation

In order to evaluate the meaningfulness of the TIC network against some baseline, we computer the intra-cluster and inter-cluster similarity of the raw nucleotide sequences. This is done by computing all pairwise comparisons of sequences within individual clusters (intra-cluster similarity) and comparison between all sequences from a cluster with all sequences that are not within that cluster (inter-cluster similarity) using the compareStrings function provided by the Biostrings R package [20]. The similarity of two sequences is then simply defined as follows:

It is important to note that not all sequences we obtained had the exact same length (mean length of nucleotide sequences: 29861.95; standard deviation: 28.40705). Hence, we performed this comparison only on the sub-sequence from the first nucleotide position to the maximum nucleotide position of the shorter of the two sequences a and b.

Results and discussion

Information-theoretic analysis

Figure 3 shows the graphs for the information token entropy and information token evenness. As one can see, these feature the aforementioned wave-like patterns, which means the system can be regarded open in the sense that over time the number of unique codon identifiers grows.

It is important to note that the evenness of the distribution of codon identifiers is relatively high, converging around a value of 0.95. This means that those codon identifiers that have been observed are quite evenly distributed.

These results allow the interpretation that there is evolving variation in the genomic sequences we studied, but (b) there is no indication for the dominance of a particular variance.

Network analysis

The weighted network derived from the original TIC network features a modularity of 0.762 and a diameter of 7, which mainly reflects its small size and its particular structure that results from preserving temporal order in the sense that edges can only ever direct forward in time (i.e. to vertices that represent a later stage). However, with an average degree of 8.671 we find that there is certainly structure that differs from just simple step-by-step forward paths that lead to the consecutively next node. This latter result suggests that some codon paths branch and merge, emphasising that there is an evolving variation in the studied genomic sequences.

From the cluster analysis in combination with a visual inspection of the resulting graph (please refer to Figure 6 as well as Table 11 in the appendix) we find the following patterns that are noteworthy:

View this table:

Table 1.

Intra-cluster similarities.

View this table:

Table 2.