Elsevier

Information Systems

Volume 53, October–November 2015, Pages 16-38

Time-series clustering – A decade review

https://doi.org/10.1016/j.is.2015.04.007

Highlights

  • The anatomy of time-series clustering is revealed by introducing its four main components.

  • Research works in each of the four main components are reviewed in detail and compared.

  • Analysis of research works published in the last decade.

  • New paths for future work on time-series clustering and its components are highlighted.

Abstract

Clustering is a solution for classifying enormous amounts of data when there is no prior knowledge about the classes. With the emergence of concepts such as cloud computing and big data, and their vast applications in recent years, research on unsupervised solutions such as clustering algorithms has increased in order to extract knowledge from this avalanche of data. Clustering of time-series data has been used in diverse scientific areas to discover patterns that empower data analysts to extract valuable information from complex and massive datasets. For huge datasets, supervised classification is almost impossible, whereas clustering can address the problem with unsupervised approaches. This research work focuses on time-series data, one of the popular data types in clustering problems, which is used broadly from gene expression data in biology to stock market analysis in finance. This review exposes the four main components of time-series clustering and aims to present an updated investigation of the trend of improvements in efficiency, quality and complexity of time-series clustering approaches during the last decade, and to highlight new paths for future work.

Introduction

Clustering is a data mining technique in which similar data are placed into related or homogeneous groups without advance knowledge of the groups’ definitions [1]. In detail, clusters are formed by grouping objects that have maximum similarity with other objects within the group and minimum similarity with objects in other groups. It is a useful approach for exploratory data analysis, as it identifies structure(s) in an unlabelled dataset by objectively organizing data into similar groups. Moreover, clustering is used in exploratory data analysis for summary generation, as a pre-processing step for other data mining tasks, or as a part of a complex system.

With the increasing capacity of data storage and the power of processors, real-world applications have gained the ability to store and keep data for a long time. Hence, data in many applications are stored in the form of time-series, for example sales data, stock prices, exchange rates in finance, weather data, biomedical measurements (e.g., blood pressure and electrocardiogram measurements), biometrics data (image data for facial recognition), particle tracking in physics, etc. Accordingly, related works are found in a variety of domains such as Bioinformatics and Biology, Genetics, Multimedia [2], [3], [4] and Finance. This amount of time-series data has given many researchers in the data mining community the opportunity to analyse time-series during the last decade. Consequently, many studies and projects on time-series analysis have been carried out in various areas and for different purposes, such as subsequence matching, anomaly detection, motif discovery [5], indexing, clustering, classification [6], visualization [7], segmentation [8], pattern identification, trend analysis, summarization [9], and forecasting. Moreover, many on-going research projects aim to improve the existing techniques [10], [11].

In the last decade, there have been considerable changes and developments in the time-series clustering area, driven by emerging concepts such as big data and cloud computing, which have increased the size of datasets exponentially. For example, one hour of ECG (electrocardiogram) data occupies 1 gigabyte, a typical weblog requires 5 gigabytes per week, and the space shuttle database holds 200 gigabytes and requires 2 gigabytes per day to update [12]. Consequently, clustering has needed improvements in recent years to cope with this growing avalanche of data and to keep its reputation as a helpful data-mining tool for extracting useful patterns and knowledge from big datasets. This review is opportune because, despite the considerable changes in the area, there is no comprehensive review of the anatomy and structure of time-series clustering. Some surveys and reviews focus on comparative aspects of time-series clustering experiments [6], [13], [14], [15], [16], [17], but none of them is as comprehensive as this review. This research work aims to present an updated investigation of the trend of improvements in efficiency, quality and complexity of time-series clustering approaches during the last decade and to highlight new paths for future work.

A special type of clustering is time-series clustering. A sequence composed of a series of nominal symbols from a particular alphabet is usually called a temporal sequence, and a sequence of continuous, real-valued elements is known as a time-series [15]. A time-series is essentially classified as dynamic data because its feature values change as a function of time, which means that the value(s) of each point of a time-series is/are one or more observations made chronologically. Time-series data is a type of temporal data that is naturally high dimensional and large in data size [6], [17], [18]. Time-series data are of interest because of their ubiquity in areas ranging from science, engineering, business, finance, economics and healthcare to government [16]. Although each time-series consists of a large number of data points, it can also be seen as a single object [19]. Clustering such complex objects is particularly advantageous because it leads to the discovery of interesting patterns in time-series datasets. As these patterns can be either frequent or rare, several research challenges have arisen, such as developing methods to recognize dynamic changes in time-series, anomaly and intrusion detection, process control, and character recognition [20], [21], [22]. More applications of time-series data are discussed in Section 1.2. To highlight the importance of and the need for clustering time-series datasets, potentially overlapping objectives for clustering time-series data are given as follows:

  • 1.

    Time-series databases contain valuable information that can be obtained through pattern discovery. Clustering is a common solution performed to uncover these patterns on time-series datasets.

  • 2.

    Time-series databases are very large and cannot be handled well by human inspectors. Hence, many users prefer to deal with structured datasets rather than very large datasets. As a result, time-series data are represented as a set of groups of similar time-series by aggregation of data in non-overlapping clusters or by a taxonomy as a hierarchy of abstract concepts.

  • 3.

    Time-series clustering is the most-used approach as an exploratory technique, and also as a subroutine in more complex data mining algorithms, such as rule discovery, indexing, classification, and anomaly detection [22].

  • 4.

    Representing time-series cluster structures as visual images (visualization of time-series data) can help users quickly understand the structure of data, clusters, anomalies, and other regularities in datasets.

The problem of clustering of time-series data is formally defined as follows:

Definition 1:

Time-series clustering: given a dataset of $n$ time-series $D = \{F_1, F_2, \ldots, F_n\}$, the process of unsupervised partitioning of $D$ into $C = \{C_1, C_2, \ldots, C_k\}$, in such a way that homogeneous time-series are grouped together based on a certain similarity measure, is called time-series clustering. Then, $C_i$ is called a cluster, where $D = \bigcup_{i=1}^{k} C_i$ and $C_i \cap C_j = \emptyset$ for $i \neq j$.
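The two conditions in Definition 1 (the clusters together cover D, and no two clusters share a time-series) can be checked mechanically. The following is a minimal sketch only; the function name `is_partition` and the series identifiers are hypothetical and are not part of the reviewed literature.

```python
from itertools import combinations


def is_partition(dataset_ids, clusters):
    """Return True if `clusters` is a partition of `dataset_ids`."""
    # Non-overlap condition: no two clusters share a member.
    disjoint = all(a.isdisjoint(b) for a, b in combinations(clusters, 2))
    # Coverage condition: the union of all clusters equals the dataset.
    covers = set().union(*clusters) == set(dataset_ids)
    return disjoint and covers


# Hypothetical identifiers standing in for the time-series F1..F5.
D = ["F1", "F2", "F3", "F4", "F5"]
C = [{"F1", "F4"}, {"F2"}, {"F3", "F5"}]

print(is_partition(D, C))  # True: every series is in exactly one cluster
```

A result with overlapping groups, or one that leaves a series unassigned, fails this check; such outputs (for example, fuzzy memberships or noise points) fall outside the hard partitioning described by the definition.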

Time-series clustering is a challenging issue for three reasons. First, time-series data are often far larger than memory size and are consequently stored on disk, which leads to an exponential decrease in the speed of the clustering process. The second challenge is that time-series data are often high dimensional [23], [24], which makes handling these data difficult for many clustering algorithms [25] and also slows down the clustering process [26]. Finally, the third challenge concerns the similarity measures that are used to form the clusters. To form them, similar time-series must be found, which requires time-series similarity matching, that is, calculating the similarity between whole time-series using a similarity measure. This process is also known as “whole sequence matching”, in which the whole length of each time-series is considered during distance calculation. However, the process is complicated because time-series data are naturally noisy and include outliers and shifts [18]; moreover, the lengths of time-series vary, and the distance between them must still be calculated. These common issues have made the similarity measure a major challenge for data miners.
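As a small illustration of whole sequence matching and of why shifts complicate it, the sketch below compares two series that have the same shape but different offsets. Euclidean distance and z-normalization are assumptions chosen here for illustration; the text above does not prescribe a particular measure, and the helper names are hypothetical.

```python
import math
from statistics import mean, pstdev


def z_normalize(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]  # constant series: no shape to keep
    return [(x - mu) / sigma for x in series]


def euclidean(a, b):
    """Whole-sequence Euclidean distance; requires equal lengths."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0, 2.0, 1.0]
b = [11.0, 12.0, 13.0, 12.0, 11.0]  # same shape as a, shifted by +10

raw = euclidean(a, b)
normalized = euclidean(z_normalize(a), z_normalize(b))
print(f"raw distance: {raw:.2f}")                 # about 22.36
print(f"after z-normalization: {normalized:.2f}")  # 0.00
```

The raw distance is dominated by the offset, whereas the normalized distance reflects shape only. The `ValueError` for unequal lengths mirrors the other difficulty mentioned above: a lock-step measure such as this one cannot compare series of different lengths directly.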

Clustering of time-series data is mostly utilized for the discovery of interesting patterns in time-series datasets [27], [28]. This task itself falls into two categories. The first comprises methods that find patterns that appear frequently in the dataset [29], [30]. The second comprises methods that discover patterns that occur surprisingly in datasets [31], [32], [33], [34]. Briefly, finding clusters of time-series can be advantageous in different domains for answering the following real-world problems:

  • 1-

    Anomaly, novelty or discord detection: anomaly detection methods discover unusual and unexpected patterns that occur surprisingly in datasets [31], [32], [33], [34]. For example, in sensor databases, time-series produced by the sensor readings of a mobile robot are clustered in order to discover events [35].

  • 2-

    Recognizing dynamic changes in time-series: detection of correlation between time-series [36]. For example, in financial databases, it can be used to find companies with similar stock price movements.

  • 3-

    Prediction and recommendation: a hybrid technique combining clustering and function approximation per cluster can help users to predict and recommend [37], [38], [39], [40]. For example, in scientific databases, it can address problems such as finding patterns of solar magnetic wind to predict today’s pattern.

  • 4-

    Pattern discovery: discovering interesting patterns in databases. For example, in marketing databases, different daily sales patterns of a specific product in a store can be discovered.
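To make the stock-price example above concrete, the sketch below measures how similarly two price series move using the Pearson correlation coefficient. Pearson correlation is an assumption made here for illustration, and the company names and prices are invented.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_b)


# Invented daily closing prices for three hypothetical companies.
prices = {
    "A": [10, 11, 12, 11, 13],
    "B": [100, 110, 120, 110, 130],  # moves with A at a higher price level
    "C": [13, 11, 12, 11, 10],       # broadly moves against A
}

print(round(pearson(prices["A"], prices["B"]), 3))  # 1.0
print(pearson(prices["A"], prices["C"]) < 0)        # True
```

A quantity such as `1 - pearson(a, b)` could then serve as a dissimilarity for a clustering algorithm, so that companies whose prices move together fall into the same cluster regardless of price level.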

Table 1 depicts some applications of time-series data in different domains.

Reviewing the literature, one can conclude that most works related to time-series clustering fall into three categories: “whole time-series clustering”, “subsequence clustering” and “time point clustering”, as depicted in Fig. 1. The first two categories are mentioned by Keogh and Lin [242].

  • Whole time-series clustering is the clustering of a set of individual time-series with respect to their similarity. Here, clustering means applying (usually) conventional clustering to discrete objects, where the objects are time-series.

  • Subsequence clustering means clustering on a set of subsequences of a time-series that are extracted via a sliding window, that is, clustering of segments from a single long time-series.

  • Time point clustering is another category of clustering, which is seen in some papers [74], [75], [76]. It is the clustering of time points based on a combination of their temporal proximity and the similarity of the corresponding values. This approach is similar to time-series segmentation. However, it differs from segmentation in that not all points need to be assigned to clusters, i.e., some of them are considered noise.
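The sliding-window extraction used by subsequence clustering can be sketched as follows. This is an illustration of the extraction step only; the function name and the `width` and `step` parameters are hypothetical.

```python
def sliding_windows(series, width, step=1):
    """Return every length-`width` subsequence, moving `step` points at a time."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    last_start = len(series) - width
    return [series[i:i + width] for i in range(0, last_start + 1, step)]


series = [3, 5, 4, 6, 8, 7]
subsequences = sliding_windows(series, width=3)
print(subsequences)
# [[3, 5, 4], [5, 4, 6], [4, 6, 8], [6, 8, 7]]
```

With `step=1`, neighbouring windows share all but one point, so the extracted objects are heavily overlapping segments of the same single time-series rather than independent series.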

Essentially, subsequence clustering is performed on a single time-series, and Keogh and Lin [242] argued that this type of clustering is meaningless. Time-point clustering is also applied to a single time-series and is similar to time-series segmentation, as its objective is to find clusters of time points rather than clusters of time-series. The focus of this study is on “whole time-series clustering”. A complete review of whole time-series clustering is presented in Table 4. Reviewing the literature, it is noticeable that various techniques have been recommended for clustering whole time-series data. However, most of them take one of the following approaches:

  • 1.

    Customizing the existing conventional clustering algorithms (which work with static data) so that they become compatible with the nature of time-series data. In this approach, the distance measure of the conventional algorithm is usually modified to be compatible with raw time-series data [16].

  • 2.

    Converting time-series data into simple objects (static data) as input of conventional clustering algorithms [16].

  • 3.

    Using multiple resolutions of time-series as input to a multi-step approach. This approach is discussed further in Section 5.6.
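The second approach above, converting time-series into static objects, can be illustrated with a small sketch. The three features used here (mean, standard deviation and least-squares slope) are assumptions chosen for brevity and are not taken from the reviewed works.

```python
from statistics import mean, pstdev


def slope(series):
    """Least-squares slope of the series against time steps 0, 1, 2, ..."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = mean(series)
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den if den else 0.0


def to_feature_vector(series):
    """Map a series of any length to a fixed-length static object."""
    return (mean(series), pstdev(series), slope(series))


short_series = [1.0, 2.0, 3.0]
long_series = [2.0, 4.0, 6.0, 8.0, 10.0]

print(to_feature_vector(short_series))
print(to_feature_vector(long_series))
```

Both series map to three-element vectors even though their lengths differ, so a conventional clustering algorithm for static data can be applied to the vectors without modification.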

Besides these common approaches, there are generally three different ways to cluster time-series, namely shape-based, feature-based and model-based.

Fig. 2 gives a brief overview of these approaches. In the shape-based approach, the shapes of two time-series are matched as well as possible by non-linear stretching and contracting of the time axes. This approach has also been labelled the raw-data-based approach because it typically works directly with the raw time-series data. Shape-based algorithms usually employ conventional clustering methods that are compatible with static data, with the distance/similarity measure replaced by one appropriate for time-series. In the feature-based approach, the raw time-series are converted into feature vectors of lower dimension, and a conventional clustering algorithm is then applied to the extracted feature vectors. Usually in this approach, an equal-length feature vector is calculated from each time-series, followed by Euclidean distance measurement [77]. In model-based methods, a raw time-series is transformed into model parameters (a parametric model for each time-series), and then a suitable model distance and a clustering algorithm (usually a conventional one) are chosen and applied to the extracted model parameters [16]. However, it has been shown that model-based approaches usually have scalability problems [78], and their performance degrades when the clusters are close to each other [79].
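Dynamic time warping (DTW) is one well-known measure that matches shapes by non-linear stretching and contracting of the time axis. Using it here is a choice made for illustration; the paragraph above does not name a specific measure. The sketch below is the textbook dynamic-programming formulation without any windowing constraint or lower bounding.

```python
import math


def dtw_distance(a, b):
    """Accumulated cost of the cheapest warping path between a and b."""
    n, m = len(a), len(b)
    # cost[i][j]: minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] also matched the previous point of b
                cost[i][j - 1],      # b[j-1] also matched the previous point of a
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1, 0]  # same shape as a, with the first value repeated

print(dtw_distance(a, b))  # 0.0: warping absorbs the repeated point
```

The two series have different lengths, so a lock-step Euclidean comparison is not defined for them, whereas the warping path aligns the repeated point at no cost. The price is quadratic time and memory in the series lengths for this unconstrained version.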

Reviewing existing works in the literature suggests that time-series clustering essentially has four components: dimensionality reduction or representation method, distance measurement, clustering algorithm, and prototype definition, complemented by evaluation. Fig. 3 shows an overview of these components.

The general process of time-series clustering uses some or all of these components, depending on the problem. Usually, the data are approximated using a representation method in such a way that they fit in memory. Afterwards, a clustering algorithm is applied to the data using a distance measure. In the clustering process, a prototype is usually required to summarize the time-series in a cluster. Finally, the clusters are evaluated using evaluation measures. In the following sub-sections, each component is discussed, and several related works and methods are reviewed.
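The general process described above can be sketched end to end on a toy dataset. Every concrete choice below is an assumption made for illustration: piecewise aggregate approximation (PAA) as the representation, Euclidean distance as the measure, a k-means-style loop as the algorithm, the point-wise mean as the prototype, and the sum of squared errors as the evaluation measure. The data and the fixed initial prototypes are invented.

```python
import math


def paa(series, segments):
    """Representation: mean of each of `segments` equal-width chunks."""
    if len(series) % segments != 0:
        raise ValueError("this sketch assumes len(series) is a multiple of segments")
    width = len(series) // segments
    return [sum(series[i:i + width]) / width for i in range(0, len(series), width)]


def euclidean(a, b):
    """Distance measure between two equal-length representations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_prototype(members):
    """Prototype: point-wise mean of the cluster members."""
    return [sum(column) / len(members) for column in zip(*members)]


def k_means(data, init_indices, iterations=10):
    """Clustering algorithm: alternate assignment and prototype update."""
    prototypes = [list(data[i]) for i in init_indices]
    labels = [0] * len(data)
    for _ in range(iterations):
        labels = [
            min(range(len(prototypes)), key=lambda c: euclidean(x, prototypes[c]))
            for x in data
        ]
        for c in range(len(prototypes)):
            members = [x for x, label in zip(data, labels) if label == c]
            if members:  # keep the old prototype if a cluster is empty
                prototypes[c] = mean_prototype(members)
    return labels, prototypes


def sse(data, labels, prototypes):
    """Evaluation: sum of squared distances to the assigned prototype."""
    return sum(euclidean(x, prototypes[label]) ** 2 for x, label in zip(data, labels))


raw = [
    [1, 1, 2, 2, 3, 3, 4, 4],  # rising
    [1, 2, 2, 3, 3, 4, 4, 5],  # rising
    [4, 4, 3, 3, 2, 2, 1, 1],  # falling
    [5, 4, 4, 3, 3, 2, 2, 1],  # falling
]
reduced = [paa(s, segments=4) for s in raw]  # 8 points -> 4 points each
labels, prototypes = k_means(reduced, init_indices=[0, 2])
quality = sse(reduced, labels, prototypes)

print(labels)   # [0, 0, 1, 1]
print(quality)  # 1.0
```

Each function corresponds to one component, so any of them can be swapped independently, for instance replacing `euclidean` with an elastic measure. Note that a point-wise mean is a sensible prototype only when the distance is lock-step, which is one reason prototype definition is treated as a component in its own right.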

In the rest of this paper, we provide a state-of-the-art review of the main components of time-series clustering, together with the evaluation methods and measures available for validating time-series clustering. In Section 2, time-series representation is discussed. Similarity and dissimilarity measures are presented in Section 3. Sections 4 and 5 are dedicated to cluster prototypes and clustering algorithms, respectively. In Section 6, evaluation measures are discussed, and finally the paper is concluded in Section 7.

Section snippets

Representation methods for time series clustering

The first component of time-series clustering explained here is dimension reduction which is a common solution for most whole time-series clustering approaches proposed in the literature [9], [80], [81], [82]. This section reviews methods of time-series dimension reduction which is known as time-series representation as well. Dimensionality reduction represents the raw time-series in another space by transforming time-series to a lower dimensional space or by feature extraction. The reason that

Similarity/dissimilarity measures in time-series clustering

This section reviews distance measurement approaches for time-series. The theoretical issue of time-series similarity/dissimilarity search was proposed by Agrawal et al. [108] and subsequently became a basic theoretical issue in the data mining community. Time-series clustering relies on the distance measure to a high extent. There are different measures that can be applied to measure the distance between time-series. Some similarity measures are proposed based on a specific time-series

Time-series cluster prototypes

Finding the cluster prototype or cluster representative is an essential subroutine in time-series clustering approaches [3], [86], [112], [114], [146], [147]. One of the approaches to address the low quality problem in time-series clustering is remedying the issue of inaccurate prototypes of clusters, especially in partitioning clustering algorithms such as k-Means, k-Medoids, Fuzzy C-Means (FCM), or even Ascendant Hierarchical Clustering which requires a prototype. In these algorithms, the

Time-series clustering algorithms

In this section, the existing works on clustering of time-series data are collected and discussed. Some of them use raw time-series, and some apply reduction methods before clustering. As demonstrated in Fig. 6, clustering algorithms can be broadly classified into six groups: Partitioning, Hierarchical, Grid-based, Model-based, Density-based and Multi-step clustering algorithms. In the following, the application of each group in

Time-series clustering evaluation measures

In this section, evaluation methods for clustering algorithms are discussed. Keogh and Kasetty [6] surveyed different articles in time-series mining and concluded that the evaluation of time-series mining should follow some disciplines, which are recommended as:

  • The validation of algorithms should be performed on various ranges of datasets (unless the algorithm is created only for a specific set). The datasets used should be published and freely available.

  • Implementation bias

Conclusion

Although various studies have been conducted on time-series clustering, the unique characteristics of time-series data are barriers that prevent most conventional clustering algorithms from working well for time-series. In particular, the high dimensionality, very high feature correlation, and typically large amount of noise that characterize time-series data have been viewed as an interesting research challenge in time-series clustering. Accordingly, most of the studies in the literature have

Acknowledgements

This research is supported by University of Malaya Research Grant no vote RP0061-13ICT.

References (242)

  • P. Rai et al.

    A survey of clustering techniques

    Int. J. Comput. Appl.

    (2010)
  • V. Niennattrakul, C. Ratanamahatana, On clustering multimedia time series data using k-means and dynamic time warping,...
  • C. Ratanamahatana, Multimedia retrieval using time series representation and relevance feedback, in: Proceedings of 8th...
  • C. Ratanamahatana, V. Niennattrakul, Clustering multimedia data using time series, in: Proceedings of the International...
  • J. Lin, E. Keogh, S. Lonardi, J. Lankford, D. Nystrom, Visually mining and monitoring massive time series, in:...
  • E. Keogh et al.

    On the need for time series data mining benchmarks: a survey and empirical demonstration

    Data Min. Knowl. Discov.

    (2003)
  • K. Haigh, W. Foslien, and V. Guralnik, Visual query language: finding patterns in and relationships among time series...
  • E. Keogh et al.

    Segmenting time series: a survey and novel approach

    Data Min. Time Ser. Databases

    (2004)
  • J. Lin, E. Keogh, S. Lonardi, and B. Chiu, A symbolic representation of time series, with implications for streaming...
  • J. Zakaria, S. Rotschafer, A. Mueen, K. Razak, E. Keogh, Mining massive archives of mice sounds with symbolized...
  • T. Rakthanmanon, A.B. Campana, G. Batista, J. Zakaria, E. Keogh, Searching and mining trillions of time series...
  • E. Keogh, A decade of progress in indexing and mining large time series databases, in: Proceedings of the International...
  • S. Laxman et al.

    A survey of temporal data mining

    Sadhana

    (2006)
  • V. Kavitha et al.

    Clustering time series data stream—a literature survey

    Int. J. Comput. Sci. Inf. Secur.

    (2010)
  • C. Antunes, A.L. Oliveira, Temporal data mining: an overview, in: KDD Workshop on Temporal Data Mining, 2001, pp....
  • S. Rani et al.

    Recent techniques of clustering of time series data: a survey

    Int. J. Comput. Appl

    (2012)
  • J. Lin et al.

    Iterative incremental clustering of time series

    Adv. Database Technol

    (2004)
  • R. Kumar, P. Nagabhushan, Time series as a point—a novel approach for time series cluster visualization in: Proceedings...
  • C. Faloutsos et al.

    Fast subsequence matching in time-series databases

    ACM SIGMOD Rec.

    (1994)
  • X. Wang et al.

    Characteristic-based clustering for time series data

    Data Min. Knowl. Discov.

    (2006)
  • M. Chiş et al.

    Clustering time series data: an evolutionary approach

    Found. Comput. Intell.

    (2009)
  • J. Lin, E. Keogh, W. Truppel, Clustering of streaming time series is meaningless, in: Proceedings of 8th ACM SIGMOD...
  • E. Keogh et al.

    A simple dimensionality reduction technique for fast similarity search in large time series databases

    Knowl. Inf. Syst.

    (2000)
  • X. Wang, K.A. Smith, R. Hyndman, D. Alahakoon, A Scalable Method for Time Series Clustering,...
  • H. Zhang et al.

    Unsupervised feature extraction for time series clustering using orthogonal wavelet transform

    Informatica

    (2006)
  • H. Wang, W. Wang, J. Yang, P.P.S. Yu, Clustering by pattern similarity in large data sets, in: Proceedings of 2002 ACM...
  • G. Das et al.

    Rule discovery from time series

    Knowl. Discov. Data Min

    (1998)
  • T.C. Fu, F.L. Chung, V. Ng, R. Luk, Pattern discovery from stock time series using self-organizing maps, in: Workshop...
  • B. Chiu, E. Keogh, S. Lonardi, Probabilistic discovery of time series motifs, in: Proceedings of the Ninth ACM SIGKDD...
  • E. Keogh, S. Lonardi, B.Y. Chiu, Finding surprising patterns in a time series database in linear time and space, in:...
  • P.K. Chan, M.V. Mahoney, Modeling multiple time series for anomaly detection, in: Proceedings of Fifth IEEE...
  • L. Wei, N. Kumar, V. Lolla, E. Keogh, Assumption-free anomaly detection in time series, in: Proceedings of the 17th...
  • M. Leng, X. Lai, G. Tan, X. Xu, Time series representation for anomaly detection, in: Proceedings of 2nd IEEE...
  • P.M. Polz, E. Hortnagl, E. Prem, Processing and Clustering Time Series of Mobile Robot Sensory Data. Technical Report,...
  • W. He et al.

    A new method for abrupt dynamic change detection of correlated time series

    Int. J. Climatol.

    (2011)
  • A. Sfetsos et al.

    Time series forecasting with a hybrid clustering scheme and pattern recognition

    IEEE Trans. Syst. Man Cybern

    (2004)
  • N. Pavlidis et al.

    Financial forecasting through unsupervised clustering and neural networks

    Oper. Res.

    (2006)
  • F. Ito, T. Hiroyasu, M. Miki, H. Yokouchi, Detection of Preference Shift Timing using Time-Series Clustering, 2009, pp....
  • D. Graves, W. Pedrycz, Proximity fuzzy clustering and its application to time series clustering and prediction, in:...
  • U. Rebbapragada et al.

    Finding anomalous periodic time series

    Mach. Learn.

    (2009)