Time-series clustering – A decade review

doi:10.1016/j.is.2015.04.007

Information Systems

Volume 53, October–November 2015, Pages 16-38

https://doi.org/10.1016/j.is.2015.04.007

Highlights

• The anatomy of time-series clustering is revealed by introducing its four main components.
• Research works in each of the four main components are reviewed in detail and compared.
• Research works published in the last decade are analysed.
• New paths are highlighted for future work on time-series clustering and its components.

Abstract

Clustering is a solution for classifying enormous data when there is no prior knowledge about the classes. With the emergence of concepts such as cloud computing and big data and their vast applications in recent years, research on unsupervised solutions such as clustering algorithms has increased, with the aim of extracting knowledge from this avalanche of data. Clustering time-series data has been used in diverse scientific areas to discover patterns that empower data analysts to extract valuable information from complex and massive datasets. For huge datasets, using supervised classification solutions is almost impossible, while clustering can solve this problem using unsupervised approaches. In this research work, the focus is on time-series data, which is one of the popular data types in clustering problems and is broadly used, from gene expression data in biology to stock market analysis in finance. This review exposes four main components of time-series clustering and aims to present an updated investigation of the trend of improvements in the efficiency, quality and complexity of time-series clustering approaches during the last decade, and to highlight new paths for future work.

Introduction

Clustering is a data mining technique where similar data are placed into related or homogeneous groups without advance knowledge of the groups’ definitions [1]. In detail, clusters are formed by grouping objects that have maximum similarity with other objects within the group, and minimum similarity with objects in other groups. It is a useful approach for exploratory data analysis as it identifies structure(s) in an unlabelled dataset by objectively organizing data into similar groups. Moreover, clustering is used for summary generation and as a pre-processing step for other data mining tasks or as a part of a complex system.

With the increasing power of data storage and processors, real-world applications have found the chance to store and keep data for a long time. Hence, data in many applications are stored in the form of time-series, for example sales data, stock prices, exchange rates in finance, weather data, biomedical measurements (e.g., blood pressure and electrocardiogram measurements), biometrics data (image data for facial recognition), particle tracking in physics, etc. Accordingly, different works are found in a variety of domains such as Bioinformatics and Biology, Genetics, Multimedia [2], [3], [4] and Finance. This amount of time-series data has given many researchers in the data mining community the opportunity to analyse time-series in the last decade. Consequently, much research relevant to analysing time-series has been performed in various areas for different purposes such as: subsequence matching, anomaly detection, motif discovery [5], indexing, clustering, classification [6], visualization [7], segmentation [8], identifying patterns, trend analysis, summarization [9], and forecasting. Moreover, there are many on-going research projects aimed at improving the existing techniques [10], [11].

In the recent decade, there have been considerable changes and developments in the time-series clustering area, caused by emerging concepts such as big data and cloud computing, which have increased the size of datasets exponentially. For example, one hour of ECG (electrocardiogram) data occupies 1 gigabyte, a typical weblog requires 5 gigabytes per week, and the space shuttle database has 200 gigabytes and updating it requires 2 gigabytes per day [12]. Consequently, clustering has needed improvements in recent years to cope with this growing avalanche of data and to keep its reputation as a helpful data-mining tool for extracting useful patterns and knowledge from big datasets. This review is opportune because, despite the considerable changes in the area, there is no comprehensive review of the anatomy and structure of time-series clustering. There are some surveys and reviews that focus on comparative aspects of time-series clustering experiments [6], [13], [14], [15], [16], [17], but none of them is as comprehensive as this review. This research work aims to present an updated investigation of the trend of improvements in the efficiency, quality and complexity of time-series clustering approaches during the last decade and to highlight new paths for future work.

A special type of clustering is time-series clustering. A sequence composed of a series of nominal symbols from a particular alphabet is usually called a temporal sequence, and a sequence of continuous, real-valued elements is known as a time-series [15]. A time-series is essentially classified as dynamic data because its feature values change as a function of time, which means that the value(s) of each point of a time-series is/are one or more observations that are made chronologically. Time-series data is a type of temporal data which is naturally high dimensional and large in data size [6], [17], [18]. Time-series data are of interest due to their ubiquity in various areas ranging from science, engineering, business, finance, economics and healthcare to government [16]. While each time-series consists of a large number of data points, it can also be seen as a single object [19]. Clustering such complex objects is particularly advantageous because it leads to the discovery of interesting patterns in time-series datasets. As these patterns can be either frequent or rare, several research challenges have arisen, such as: developing methods to recognize dynamic changes in time-series, anomaly and intrusion detection, process control, and character recognition [20], [21], [22]. More applications of time-series data are discussed in Section 1.2. To highlight the importance of and the need for clustering time-series datasets, potentially overlapping objectives for clustering time-series data are given as follows:

1. Time-series databases contain valuable information that can be obtained through pattern discovery. Clustering is a common solution performed to uncover these patterns on time-series datasets.
2. Time-series databases are very large and cannot be handled well by human inspectors. Hence, many users prefer to deal with structured datasets rather than very large datasets. As a result, time-series data are represented as a set of groups of similar time-series by aggregation of data in non-overlapping clusters or by a taxonomy as a hierarchy of abstract concepts.
3. Time-series clustering is the most-used approach as an exploratory technique, and also as a subroutine in more complex data mining algorithms, such as rule discovery, indexing, classification, and anomaly detection [22].
4. Representing time-series cluster structures as visual images (visualization of time-series data) can help users quickly understand the structure of data, clusters, anomalies, and other regularities in datasets.

The problem of clustering of time-series data is formally defined as follows:

Definition 1:

Time-series clustering, given a dataset of n time-series data $D = \{F_{1}, F_{2}, \ldots, F_{n}\}$, the process of unsupervised partitioning of $D$ into $C = \{C_{1}, C_{2}, \ldots, C_{k}\}$, in such a way that homogenous time-series are grouped together based on a certain similarity measure, is called time-series clustering. Then, $C_{i}$ is called a cluster, where $D = \bigcup_{i = 1}^{k} C_{i}$ and $C_{i} \cap C_{j} = \emptyset$ for $i \neq j$.
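The two set conditions in Definition 1 (the clusters cover D, and no two clusters share a member) can be checked mechanically. The following is a minimal sketch, not part of the original paper: the function name `is_valid_partition` and the identifiers `F1` to `F5` are hypothetical and stand in for the time-series of a dataset.

```python
from typing import Hashable, Iterable, List, Set


def is_valid_partition(dataset_ids: Iterable[Hashable],
                       clusters: List[Set[Hashable]]) -> bool:
    """Return True if `clusters` is a hard partition of `dataset_ids`.

    Checks the two conditions of Definition 1: the union of all clusters
    equals D, and the clusters are pairwise disjoint.
    """
    all_ids = set(dataset_ids)
    seen: Set[Hashable] = set()
    for cluster in clusters:
        if seen & cluster:      # a shared member means C_i and C_j overlap
            return False
        seen |= cluster
    return seen == all_ids      # the union of the clusters must equal D


# Hypothetical identifiers standing in for the series F_1 .. F_5.
D = ["F1", "F2", "F3", "F4", "F5"]
print(is_valid_partition(D, [{"F1", "F2"}, {"F3"}, {"F4", "F5"}]))        # True
print(is_valid_partition(D, [{"F1", "F2"}, {"F2", "F3"}, {"F4", "F5"}]))  # False
```

The second call fails because `F2` appears in two clusters, which violates the disjointness condition.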

Time-series clustering is a challenging issue for three reasons. First, time-series data are often far larger than memory size and consequently they are stored on disks. This leads to an exponential decrease in the speed of the clustering process. Second, time-series data are often high dimensional [23], [24], which makes handling these data difficult for many clustering algorithms [25] and also slows down the process of clustering [26]. Third, the similarity measures that are used to form the clusters pose a challenge. To form clusters, similar time-series must be found, which requires time-series similarity matching, that is, the process of calculating the similarity among whole time-series using a similarity measure. This process is also known as “whole sequence matching”, where the whole lengths of the time-series are considered during distance calculation. However, the process is complicated because time-series data are naturally noisy and include outliers and shifts [18]; on the other hand, the lengths of time-series vary and the distance among them still needs to be calculated. These common issues have made the similarity measure a major challenge for data miners.
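To make the third challenge concrete, the sketch below compares two whole series with the Euclidean distance, before and after z-normalization. Both choices are illustrative assumptions rather than methods prescribed by this paragraph, and the helper names are hypothetical. The example shows how a constant shift dominates the raw distance, and why plain Euclidean distance cannot be applied directly to series of different lengths.

```python
import math
from typing import List, Sequence


def z_normalize(series: Sequence[float]) -> List[float]:
    """Rescale a series to zero mean and unit variance."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0.0:
        return [0.0] * n  # a constant series carries no shape information
    return [(x - mean) / std for x in series]


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Whole-sequence Euclidean distance; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0, 4.0]
b = [11.0, 12.0, 13.0, 14.0]   # same shape as `a`, shifted by a constant

print(euclidean_distance(a, b))                            # dominated by the shift
print(euclidean_distance(z_normalize(a), z_normalize(b)))  # shape-only comparison
```

Passing series of unequal length raises `ValueError`, which mirrors the point in the text that varying lengths need a measure designed for them.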

Clustering of time-series data is mostly utilized for the discovery of interesting patterns in time-series datasets [27], [28]. This task itself falls into two categories. The first group is used to find patterns that frequently appear in the dataset [29], [30]. The second group comprises methods to discover patterns that occur surprisingly in datasets [31], [32], [33], [34]. Briefly, finding the clusters of time-series can be advantageous in different domains to answer the following real-world problems:

1. Anomaly, novelty or discord detection: anomaly detection methods discover unusual and unexpected patterns that occur surprisingly in datasets [31], [32], [33], [34]. For example, in sensor databases, time-series produced by the sensor readings of a mobile robot are clustered in order to discover events [35].
2. Recognizing dynamic changes in time-series: detection of correlation between time-series [36]. For example, in financial databases, it can be used to find companies with similar stock price movements.
3. Prediction and recommendation: a hybrid technique combining clustering and function approximation per cluster can help users to predict and recommend [37], [38], [39], [40]. For example, in scientific databases, it can address problems such as finding the patterns of solar magnetic wind to predict today’s pattern.
4. Pattern discovery: to discover interesting patterns in databases. For example, in a marketing database, different daily patterns of sales of a specific product in a store can be discovered.

Table 1 depicts some applications of time-series data in different domains.

Reviewing the literature, one can conclude that most works related to clustering time-series are classified into three categories: “whole time-series clustering”, “subsequence clustering” and “time point clustering”, as depicted in Fig. 1. The first two categories are mentioned by Keogh and Lin [242].

• Whole time-series clustering is the clustering of a set of individual time-series with respect to their similarity. Here, clustering means applying (usually conventional) clustering to discrete objects, where the objects are time-series.
• Subsequence clustering means clustering a set of subsequences of a time-series that are extracted via a sliding window, that is, clustering of segments from a single long time-series.
• Time point clustering is another category of clustering which is seen in some papers [74], [75], [76]. It is the clustering of time points based on a combination of their temporal proximity and the similarity of the corresponding values. This approach is similar to time-series segmentation. However, it differs from segmentation in that not all points need to be assigned to clusters, i.e., some of them are considered as noise.
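The sliding-window extraction mentioned for subsequence clustering can be sketched as follows. This is an illustrative helper only: the function name and the `window` and `step` parameters are assumptions introduced here, not notation from the paper.

```python
from typing import List, Sequence


def sliding_window_subsequences(series: Sequence[float],
                                window: int,
                                step: int = 1) -> List[List[float]]:
    """Extract every subsequence of length `window`, moving `step` points at a time."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(series) - window
    # An empty list is returned when the series is shorter than the window.
    return [list(series[i:i + window]) for i in range(0, last_start + 1, step)]


long_series = [1, 2, 3, 4, 5]
print(sliding_window_subsequences(long_series, window=3))
# [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

With `step=1`, consecutive subsequences overlap in all but one point, so the extracted objects are strongly dependent on one another.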

Essentially, subsequence clustering is performed on a single time-series, and Keogh and Lin [242] showed that this type of clustering is meaningless. Time-point clustering is also applied to a single time-series, and is similar to time-series segmentation, as the objective of time-point clustering is finding clusters of time points instead of clusters of time-series data. The focus of this study is on “whole time-series clustering”. A complete review of whole time-series clustering is performed and shown in Table 4. Reviewing the literature, it is noticeable that various techniques have been recommended for the clustering of whole time-series data. However, most of them take one of the following approaches to cluster time-series data:

1. Customizing the existing conventional clustering algorithms (which work with static data) such that they become compatible with the nature of time-series data. In this approach, usually their distance measure (in conventional algorithms) is modified to be compatible with the raw time-series data [16].
2. Converting time-series data into simple objects (static data) as input of conventional clustering algorithms [16].
3. Using multi resolutions of time-series as input of a multi-step approach. This approach is discussed further in Section 5.6.
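The second approach in the list above, converting each time-series into a static object, can be sketched with a small hand-picked feature vector. The three features used here (mean, standard deviation and least-squares trend slope) are assumptions chosen for illustration; the paper does not prescribe a particular feature set at this point.

```python
import math
from typing import List, Sequence


def extract_features(series: Sequence[float]) -> List[float]:
    """Map a series of any length to a fixed-length vector [mean, std, slope]."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    # Least-squares slope of the values against their time index 0..n-1.
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    if denom == 0:
        slope = 0.0  # a single point has no trend
    else:
        slope = sum((t - t_mean) * (x - mean) for t, x in enumerate(series)) / denom
    return [mean, std, slope]


# Series of different lengths become vectors of the same length, so a
# conventional algorithm for static data can take them as input.
print(extract_features([1, 2, 3, 4, 5]))
print(extract_features([0, 2, 4]))
```

Because every series maps to a vector of the same length, the length mismatch between raw series no longer matters once the features are extracted.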

Besides this common characteristic, there are generally three different ways to cluster time-series, namely shape-based, feature-based and model-based.

Fig. 2 shows a brief overview of these approaches. In the shape-based approach, the shapes of two time-series are matched as well as possible by a non-linear stretching and contracting of the time axes. This approach has also been labelled a raw-data-based approach because it typically works directly with the raw time-series data. Shape-based algorithms usually employ conventional clustering methods, which are compatible with static data, while their distance/similarity measure is replaced with one appropriate for time-series. In the feature-based approach, the raw time-series are converted into feature vectors of lower dimension. Then, a conventional clustering algorithm is applied to the extracted feature vectors. Usually in this approach, an equal-length feature vector is calculated from each time-series, followed by the Euclidean distance measurement [77]. In model-based methods, a raw time-series is transformed into model parameters (a parametric model for each time-series), and then a suitable model distance and a clustering algorithm (usually a conventional clustering algorithm) are chosen and applied to the extracted model parameters [16]. However, it has been shown that model-based approaches usually have scalability problems [78], and their performance degrades when the clusters are close to each other [79].
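Dynamic time warping (DTW) is one widely used measure that performs the kind of non-linear stretching and contracting of the time axis described for the shape-based approach. The paragraph itself does not name a specific measure, so the following textbook dynamic-programming formulation should be read as an illustrative assumption, written without any windowing constraint or speed-up.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) DTW with absolute difference as the local cost."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both series must be non-empty")
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# The second series is the first one with its middle value held one step longer.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

Unlike the Euclidean distance, this measure accepts series of different lengths, at the price of quadratic time per comparison.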

Reviewing existing works in the literature, it is implied that essentially time-series clustering has four components: dimensionality reduction or representation method, distance measurement, clustering algorithm, and prototype definition, in addition to evaluation. Fig. 3 shows an overview of these components.

The general process of time-series clustering uses some or all of these components depending on the problem. Usually, data is approximated using a representation method in such a way that it can fit in memory. Afterwards, a clustering algorithm is applied to the data using a distance measure. In the clustering process, a prototype is usually required for summarization of the time-series. Finally, the clusters are evaluated using evaluation criteria. In the following sub-sections, each component is discussed, and several related works and methods are reviewed.
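The process described above can be sketched end to end on toy data. Every concrete choice below is an assumption made for illustration and is not prescribed by the paper at this point: piecewise aggregate approximation (PAA) as the representation, Euclidean distance as the measure, a small k-medoids loop as the clustering algorithm, the medoid as the prototype, and the summed distance to the prototypes as a simple internal evaluation criterion. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def paa(series: Sequence[float], segments: int) -> List[float]:
    """Representation: mean of each of `segments` equal-sized chunks."""
    if len(series) % segments != 0:
        raise ValueError("length must be divisible by the number of segments")
    size = len(series) // segments
    return [sum(series[i:i + size]) / size for i in range(0, len(series), size)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance measure between two equal-length representations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def medoid(members: List[List[float]]) -> List[float]:
    """Prototype: the member with the smallest total distance to the others."""
    return min(members, key=lambda c: sum(euclidean(c, o) for o in members))


def k_medoids(points: List[List[float]], k: int,
              max_iter: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Tiny k-medoids loop; the first k points seed the medoids (an assumption)."""
    medoids = [list(p) for p in points[:k]]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = [min(range(k), key=lambda c: euclidean(p, medoids[c]))
                  for p in points]
        # Update step: recompute each cluster's medoid (keep the old one if empty).
        new_medoids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_medoids.append(medoid(members) if members else medoids[c])
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


def within_cluster_cost(points, labels, medoids) -> float:
    """Evaluation: total distance from each point to its cluster prototype."""
    return sum(euclidean(p, medoids[label]) for p, label in zip(points, labels))


raw = [
    [0, 0, 1, 1, 0, 0, 1, 1],          # low-level series
    [10, 10, 11, 11, 10, 10, 11, 11],  # high-level series
    [0, 1, 0, 1, 0, 1, 0, 1],          # low-level series
    [10, 11, 10, 11, 10, 11, 10, 11],  # high-level series
]
reduced = [paa(s, 4) for s in raw]      # 8 points per series reduced to 4
labels, prototypes = k_medoids(reduced, k=2)
print(labels)                           # [0, 1, 0, 1]
print(within_cluster_cost(reduced, labels, prototypes))
```

Seeding with the first k points keeps the sketch deterministic; a practical implementation would need a more careful initialization.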

In the rest of this paper, we provide a state-of-the-art review of the main components of time-series clustering, plus the evaluation methods and measures available for validating time-series clustering. In Section 2, time-series representation is discussed. Similarity and dissimilarity measures are presented in Section 3. Sections 4 and 5 are dedicated to cluster prototypes and clustering algorithms, respectively. In Section 6, evaluation measures are discussed, and finally the paper is concluded in Section 7.

Section snippets

Representation methods for time series clustering

The first component of time-series clustering explained here is dimension reduction which is a common solution for most whole time-series clustering approaches proposed in the literature [9], [80], [81], [82]. This section reviews methods of time-series dimension reduction which is known as time-series representation as well. Dimensionality reduction represents the raw time-series in another space by transforming time-series to a lower dimensional space or by feature extraction. The reason that

Similarity/dissimilarity measures in time-series clustering

This section reviews distance measurement approaches for time-series. The theoretical issue of time-series similarity/dissimilarity search was proposed by Agrawal et al. [108] and subsequently became a basic theoretical issue in the data mining community. Time-series clustering relies heavily on the distance measure. There are different measures which can be applied to measure the distance among time-series. Some of similarity measures are proposed based on a specific time-series

Time-series cluster prototypes

Finding the cluster prototype or cluster representative is an essential subroutine in time-series clustering approaches [3], [86], [112], [114], [146], [147]. One of the approaches to address the low quality problem in time-series clustering is remedying the issue of inaccurate prototypes of clusters, especially in partitioning clustering algorithms such as k-Means, k-Medoids, Fuzzy C-Means (FCM), or even Ascendant Hierarchical Clustering which requires a prototype. In these algorithms, the

Time-series clustering algorithms

In this section, the existing works related to clustering of time-series data are reviewed and discussed. Some of them use raw time-series and some use reduction methods before clustering the time-series data. As demonstrated in Fig. 6, clustering can be broadly classified into six groups: Partitioning, Hierarchical, Grid-based, Model-based, Density-based clustering and Multi-step clustering algorithms. In the following, the application of each group in

Time-series clustering evaluation measures

In this section, evaluation methods for clustering algorithms are discussed. Keogh and Kasetty [6] conducted an interesting study of different articles in time-series mining and concluded that the evaluation of time-series mining should follow some disciplines, which are recommended as:

• The validation of algorithms should be performed on various ranges of datasets (unless the algorithm is created only for a specific set). The used dataset should be published and freely available
• Implementation bias

Conclusion

Although different studies have been conducted on time-series clustering, the unique characteristics of time-series data are barriers that prevent most conventional clustering algorithms from working well for time-series. In particular, the high dimensionality, very high feature correlation, and typically large amount of noise that characterize time-series data have been viewed as an interesting research challenge in time-series clustering. Accordingly, most of the studies in the literature have

Acknowledgements

This research is supported by University of Malaya Research Grant no vote RP0061-13ICT.

References (242)

T. Warren Liao, Clustering of time series data—a survey, Pattern Recognit. (2005)

M.A. Elangasinghe et al., Complex time series analysis of PM10 and PM2.5 for a coastal site using artificial neural network modelling and k-means clustering, Atmos. Environ. (2014)

R.H.R. Shumway, Time-frequency clustering and discriminant analysis, Stat. Probab. Lett. (2003)

Shen Liu et al., Polarization of forecast densities: a new approach to time series classification, Comput. Stat. Data Anal. (2014)

Y. Sadahiro et al., Exploratory analysis of time series data: detection of partial similarities, clustering, and visualization, Comput. Environ. Urban Syst. (2014)

S. Aghabozorgi et al., Stock market co-movement assessment using a three-phase clustering method, Expert Syst. Appl. (2014)

Y.-C. Hsu et al., A clustering time series model for the optimal hedge ratio decision making, Neurocomputing (2014)

J. Zhu et al., Social network users clustering based on multivariate time series of emotional behavior, J. China Univ. Posts Telecommun. (2014)

E. Ghysels et al., Predicting volatility: getting the most out of return data sampled at different frequencies, J. Econom. (2006)

F. Portet et al., Automatic generation of textual summaries from neonatal intensive care data, Artif. Intell. (2009)

P. Rai et al., A survey of clustering techniques, Int. J. Comput. Appl. (2010)

V. Niennattrakul, C. Ratanamahatana, On clustering multimedia time series data using k-means and dynamic time warping,...

C. Ratanamahatana, Multimedia retrieval using time series representation and relevance feedback, in: Proceedings of 8th...

C. Ratanamahatana, V. Niennattrakul, Clustering multimedia data using time series, in: Proceedings of the International...

J. Lin, E. Keogh, S. Lonardi, J. Lankford, D. Nystrom, Visually mining and monitoring massive time series, in:...

E. Keogh et al., On the need for time series data mining benchmarks: a survey and empirical demonstration, Data Min. Knowl. Discov. (2003)

K. Haigh, W. Foslien, and V. Guralnik, Visual query language: finding patterns in and relationships among time series...

E. Keogh et al., Segmenting time series: a survey and novel approach, Data Min. Time Ser. Databases (2004)

J. Lin, E. Keogh, S. Lonardi, and B. Chiu, A symbolic representation of time series, with implications for streaming...

J. Zakaria, S. Rotschafer, A. Mueen, K. Razak, E. Keogh, Mining massive archives of mice sounds with symbolized...

T. Rakthanmanon, A.B. Campana, G. Batista, J. Zakaria, E. Keogh, Searching and mining trillions of time series...

E. Keogh, A decade of progress in indexing and mining large time series databases, in: Proceedings of the International...

S. Laxman et al., A survey of temporal data mining, Sadhana (2006)

V. Kavitha et al., Clustering time series data stream—a literature survey, Int. J. Comput. Sci. Inf. Secur. (2010)

C. Antunes, A.L. Oliveira, Temporal data mining: an overview, in: KDD Workshop on Temporal Data Mining, 2001, pp....

S. Rani et al., Recent techniques of clustering of time series data: a survey, Int. J. Comput. Appl. (2012)

J. Lin et al., Iterative incremental clustering of time series, Adv. Database Technol. (2004)

R. Kumar, P. Nagabhushan, Time series as a point—a novel approach for time series cluster visualization, in: Proceedings...

C. Faloutsos et al., Fast subsequence matching in time-series databases, ACM SIGMOD Rec. (1994)

X. Wang et al., Characteristic-based clustering for time series data, Data Min. Knowl. Discov. (2006)

M. Chiş et al., Clustering time series data: an evolutionary approach, Found. Comput. Intell. (2009)

J. Lin, E. Keogh, W. Truppel, Clustering of streaming time series is meaningless, in: Proceedings of 8th ACM SIGMOD...

E. Keogh et al., A simple dimensionality reduction technique for fast similarity search in large time series databases, Knowl. Inf. Syst. (2000)

X. Wang, K.A. Smith, R. Hyndman, D. Alahakoon, A Scalable Method for Time Series Clustering,...

H. Zhang et al., Unsupervised feature extraction for time series clustering using orthogonal wavelet transform, Informatica (2006)

H. Wang, W. Wang, J. Yang, P.P.S. Yu, Clustering by pattern similarity in large data sets, in: Proceedings of 2002 ACM...

G. Das et al., Rule discovery from time series, Knowl. Discov. Data Min. (1998)

T.C. Fu, F.L. Chung, V. Ng, R. Luk, Pattern discovery from stock time series using self-organizing maps, in: Workshop...

B. Chiu, E. Keogh, S. Lonardi, Probabilistic discovery of time series motifs, in: Proceedings of the Ninth ACM SIGKDD...

E. Keogh, S. Lonardi, B.Y. Chiu, Finding surprising patterns in a time series database in linear time and space, in:...

P.K. Chan, M.V. Mahoney, Modeling multiple time series for anomaly detection, in: Proceedings of Fifth IEEE...

L. Wei, N. Kumar, V. Lolla, E. Keogh, Assumption-free anomaly detection in time series, in: Proceedings of the 17th...

M. Leng, X. Lai, G. Tan, X. Xu, Time series representation for anomaly detection, in: Proceedings of 2nd IEEE...

P.M. Polz, E. Hortnagl, E. Prem, Processing and Clustering Time Series of Mobile Robot Sensory Data. Technical Report,...

W. He et al., A new method for abrupt dynamic change detection of correlated time series, Int. J. Climatol. (2011)

A. Sfetsos et al., Time series forecasting with a hybrid clustering scheme and pattern recognition, IEEE Trans. Syst. Man Cybern. (2004)

N. Pavlidis et al., Financial forecasting through unsupervised clustering and neural networks, Oper. Res. (2006)

F. Ito, T. Hiroyasu, M. Miki, H. Yokouchi, Detection of Preference Shift Timing using Time-Series Clustering, 2009, pp....

D. Graves, W. Pedrycz, Proximity fuzzy clustering and its application to time series clustering and prediction, in:...

U. Rebbapragada et al., Finding anomalous periodic time series, Mach. Learn. (2009)

Cited by (1356)

Metro Station functional clustering and dual-view recurrent graph convolutional network for metro passenger flow prediction
2024, Expert Systems with Applications
The metro system is indispensable for alleviating traffic congestion in the urban transportation system. Precise metro passenger flow (MPF) prediction is crucial in ensuring smooth operations of the metro system. Recently, the graph convolutional network (GCN), which is effective in the spatial feature extraction, has been applied in traffic prediction. However, most existing GCN-based methods construct the empirical graphs based on distance and adjacency, which cannot fully express the correlations of metro stations. This paper proposes a novel MPF prediction method consisting of three parts: K-means-based metro station functional clustering (KMSFC), external feature fusion, and dual-view recurrent GCN (DVRGCN). The KMSFC identifies the metro stations both having similar MPF changing tendencies and being located in similar urban functional areas. Furthermore, the DVRGCN is designed to simultaneously capture the spatiotemporal and external features. The dual-view GCN module in the DVRGCN captures both explicit and implicit spatial features of the metro traffic network. To demonstrate the capability for making accurate MPF predictions, the experiments using a real-world metro traffic dataset are conducted. The ablation experiments are also performed to prove the contribution of each module in the proposed method. The experimental results show that the proposed method outperforms other state-of-the-art traffic prediction methods.
Data-driven time series analysis of sensory cortical processing using high-resolution fMRI across different studies
2024, Biomedical Signal Processing and Control
Time series analysis of heterogeneous preclinical functional magnetic resonance imaging (fMRI) studies poses challenges due to data volume and method heterogeneity. Recent advances in machine learning (ML) and artificial intelligence (AI) allow for addressing such challenges in complex datasets. These approaches can uncover patterns, including temporal kinetics, within blood-oxygen-level-dependent (BOLD) time series with a reduced workload. However, the typically low temporal resolution and signal-to-noise ratio (SNR) of fMRI time series have so far limited progress in this area.
Therefore, we used ultrafast 1D line-scanning and 2D-fMRI data for this study to assess whether enhanced spatial and temporal resolution of fMRI data, along with a sophisticated metric design for clustering, was sufficient to detect differences in BOLD response characteristics in the time domain across cortical layers. Next, we compared consistency of the produced results across four independent studies conducted at two different imaging centers, each utilizing distinct study protocols and, finally, we combined line-scanning data from different studies for time-domain clustering analysis to facilitate a cross-study examination of somatosensory information processing during sensory stimulation of the forepaw.
By adopting a voxel-based and purely data-driven approach, we systematically explored different clustering techniques and analyzed somatosensory cortex fMRI data obtained during forepaw stimulation in rats. We established and validated an unsupervised workflow capable of detecting BOLD response latencies between different stimulus modalities, producing consistent results across different study protocols, indicating robustness, reproducibility, and generalizability of our framework.
Application of time series analysis to classify therapeutic breathing patterns
2024, Smart Health
Compare various methods for measuring time series similarity in order to classify referenced therapeutic breathing patterns (BP) used in respiratory disorder rehabilitation.
This experimental study involved the collection of respiratory signals during specified breathing exercises conducted with healthy volunteers. The study employed a screening phase using a k-NN classifier and eight distance measurement methods, including Minkowski Distance, Dynamic Time Warping-DTW (including FastDTW and constrained-cDTW variations), Longest Common Subsequence-LCSS, Edit Distance on Real Sequences-EDR, Time Warp Edit Distance-TWEED, and Minimum Jump Costs-MJC. Two distinct approaches were employed for classifying therapeutic BP based on time series similarity: (1) using the k-Shape algorithm for clustering, and 2) integrating methods to represent therapeutic BP and classify test curves using the most relevant measurement methods obtained from the first approach.
Among the two tested approaches, the combination of the cDTW algorithm and Minkowski distance (p = 2), using the 1-NN classifier, achieved the highest scores in this study, closely matching the metrics obtained from visual inspection conducted by human evaluators.
The use of combined classification methods in the analysis of flow curves referring to therapeutic breathing patterns improves the classification results, with metrics closely aligned with those obtained through visual evaluation conducted by individuals.
Time series analysis methods proved to be sensitive to classify respiratory flow curves equivalent to therapeutic breathing patterns used in respiratory disorder rehabilitation. This methodology can be used to monitor respiratory curves in new applications and implementation in devices for evaluating and treating the ventilatory pattern.
Time series clustering to improve one-class classifier performance[Formula presented]
2024, Expert Systems with Applications
The improvement of one-class classifiers’ performance through clustering of multivariate time series is considered in this paper. Datasets arising from real processes come from the available sensors and are affected by many factors, such as aging of the process, changes in the operation region, and equipment malfunction. Despite that, one expects that the classes represented by such diverse data can be unveiled via trained classifiers. This work hypothesizes that the overall performance can be improved by training sets of one-class classifiers with subsets of data clustered by similarity. The proposed method is applied to one class classifiers since they are trained only with the target class, which is clustered based on time series similarity using Dynamic Time Warping and k-means. The advantages of the techniques are illustrated through their application to a public dataset from the oil industry with instances characterizing eight classes of data represented by five time series. Seven classes are selected to train LSTM classifiers using the variables and instances clustered using time series clustering algorithms. The results show that the increase in the similarity of training data tends to improve the performance of the LSTM classifier, achieving an increase of 10% in the overall performance. In a specific case, where the clustering model raised the similarity by 84%, the classification performance improved by 21%.
Unraveling fundamental properties of power system resilience curves using unsupervised machine learning
2024, Energy and AI
Power systems are vital to modern societies, yet they are susceptible to hazard events. Thus, analyzing the resilience characteristics of power systems is important. The standard model of infrastructure resilience, the resilience triangle, has been the primary way of characterizing and quantifying resilience in infrastructure systems for more than two decades. However, the theoretical model provides a one-size-fits-all framework for all infrastructure systems and specifies only general characteristics of resilience curves (e.g., residual performance and duration of recovery). Little empirical work has been done to delineate infrastructure resilience curve archetypes and their fundamental properties based on observational data. Most existing studies examine the characteristics of infrastructure resilience curves based on analytical models constructed upon simulated system performance. The dearth of empirical studies in the field has hindered our ability to fully understand and predict resilience characteristics in infrastructure systems. To address this gap, this study examined more than two hundred power-grid resilience curves related to power outages in three major extreme weather events in the United States. Through the use of unsupervised machine learning, we examined different curve archetypes, as well as the fundamental properties of each resilience curve archetype. The results show two primary archetypes for power-grid resilience curves: triangular curves and trapezoidal curves. Triangular curves characterize resilience behavior based on three fundamental properties: (1) critical functionality threshold, (2) critical functionality recovery rate, and (3) recovery pivot point. Trapezoidal archetypes explain resilience curves based on (1) duration of sustained function loss and (2) constant recovery rate. The longer the duration of sustained function loss, the slower the constant rate of recovery. The findings of this study provide novel perspectives enabling better understanding and prediction of the resilience performance of power system infrastructure in extreme weather events.
A statistical analysis of COVID-19 mortality dynamics: Unraveling the interplay between vaccination trends, socioeconomic factors, and government interventions in Brazilian states
2024, Socio-Economic Planning Sciences
The main challenges in fighting the COVID-19 pandemic in Brazil included socioeconomic inequality among different states and the lack of consensus on the measures implemented to contain and prevent the pandemic. This study analyzes the dynamics of COVID-19-related deaths in Brazilian states associated with the evolution of vaccination trends, socioeconomic factors, and government interventions through population mobility. By applying time series clustering techniques and regression modeling, the insights obtained show the commonalities and differences among the 27 Brazilian states and how the different vaccination temporal patterns, socioeconomic factors, and mobility rates correlated to the evolution of COVID-19 deaths.

