KmL: A package to cluster longitudinal data
Introduction
Cohort studies are becoming essential tools in epidemiological research. In these studies, the measurements collected for a single subject can be seen as a trajectory. Thus, an important question concerns the existence of homogeneous groups of patient trajectories. From a statistical point of view, many methods have been developed to deal with this issue [1], [2], [3], [4]. In a survey [5], Warren-Liao divides these methods into five families: partitioning methods construct k clusters, each containing at least one individual; hierarchical methods work by grouping data objects into a tree of clusters; density-based methods make clusters grow as long as the density in the “neighborhood” exceeds a certain threshold; grid-based methods quantize the object space and perform the clustering operation on the resulting finite grid structure; model-based methods assume a model for each cluster and look for the best fit of data to the model.
The pros and cons of these approaches are regularly discussed [6], [7], even if there is little data to show which method is preferable in which situation. In this paper, we consider k-means, a well-known partitioning method [8], [9]. The following points can be cited in favor of an algorithm of this type: (1) it does not require any normality or parametric assumptions within clusters (although it might be more efficient under certain assumptions), which might be of great interest when the aim is to cluster data on which no prior information is available; (2) it is likely to be more robust with regard to numerical convergence; (3) in the particular context of longitudinal data, it does not require any assumption regarding the shape of the trajectory (this is likely to be an important point, since the clustering of longitudinal data is basically an exploratory approach); (4) also in the longitudinal context, it is independent of time scaling.
On the other hand, it also suffers from some drawbacks: (1) formal tests cannot be used to check the validity of the partition; (2) the number of clusters needs to be known a priori; (3) the algorithm is not deterministic, because the starting condition is often determined at random, so it may converge to a local optimum and one cannot be sure that the best partition has been found; (4) the estimation of a quality criterion cannot be performed if there are missing values in the trajectories.
Regarding software, numerous versions of k-means exist, some with a traditional approach [10], [11], some with variations [12], [13], [14], [15], [16], [17]. However, they have several weaknesses: (1) they are not able to deal with missing values; (2) since determining the number of clusters is still an open question, they require the user to manually re-run k-means several times.
KmL is a new implementation of k-means specifically designed to analyze longitudinal data. Our package is designed for the R platform and is available on CRAN [18]. It is able to deal with missing values; it also provides an easy way to run the algorithm several times, varying the starting conditions and/or the number of clusters looked for; and its graphical interface helps the user to choose the appropriate number of clusters when the classic criterion is not efficient.
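The idea of re-running k-means from several random starting conditions and keeping the best partition can be sketched in a few lines of Python. This is only an illustration and not the package's code: the function names (`kmeans_once`, `best_of_restarts`), the toy trajectories and the use of the within-cluster sum of squares as the selection criterion are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two trajectories of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_trajectory(trajs):
    """Pointwise mean of a non-empty list of trajectories."""
    return [sum(col) / len(trajs) for col in zip(*trajs)]

def kmeans_once(trajs, k, rng, max_iter=100):
    """One k-means run whose starting centers are k trajectories drawn at random."""
    centers = rng.sample(trajs, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(t, centers[c]))
                      for t in trajs]
        if new_labels == labels:
            break  # the partition no longer changes
        labels = new_labels
        for c in range(k):
            members = [t for t, l in zip(trajs, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean_trajectory(members)
    # Within-cluster sum of squares: an assumed criterion, lower is better.
    wss = sum(sq_dist(t, centers[l]) for t, l in zip(trajs, labels))
    return labels, wss

def best_of_restarts(trajs, k, n_restarts=20, seed=0):
    """Run k-means n_restarts times and keep the run with the lowest criterion."""
    rng = random.Random(seed)
    runs = [kmeans_once(trajs, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])

# Toy data (hypothetical): six trajectories measured at four time points.
trajs = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 1, 0],
         [10, 10, 10, 10], [10, 11, 10, 11], [11, 10, 11, 10]]
labels, wss = best_of_restarts(trajs, k=2)
print(labels, wss)
```

The within-cluster sum of squares tends to shrink as the number of clusters grows, so it is only suitable for comparing runs that use the same k; comparing different numbers of clusters requires a dedicated quality criterion.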
Section 2 presents theoretical aspects of KmL: the algorithm, different solutions to deal with missing values and quality criteria to select the best number of clusters. Section 3 gives a description of the package. Section 4 compares the impact of the different starting conditions. Section 5 is the discussion.
Introduction to k-means
k-Means is a hill-climbing algorithm [7] belonging to the EM class (Expectation–Maximization) [11]. EM algorithms work as follows: initially, each observation is assigned to a cluster; then the optimal partition is reached by alternating two phases called respectively “Expectation” and “Maximization”. During the Expectation phase, the center of each cluster is determined. Then the Maximization phase consists in assigning each observation to its “nearest cluster”. The alternation of the two phases is
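A minimal Python sketch of the two alternating phases, following the naming used in this section, might look as follows. It assumes complete trajectories of equal length, the Euclidean distance and a user-supplied initial assignment; the function names are hypothetical and do not belong to the package.

```python
def compute_centers(trajs, labels, k):
    """'Expectation' phase as named above: center of each cluster (pointwise mean)."""
    centers = []
    for c in range(k):
        members = [t for t, l in zip(trajs, labels) if l == c]
        # An empty cluster has no mean; None marks it as unavailable.
        centers.append([sum(col) / len(members) for col in zip(*members)]
                       if members else None)
    return centers

def assign_to_nearest(trajs, centers):
    """'Maximization' phase as named above: each trajectory joins its nearest center."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    available = [c for c, center in enumerate(centers) if center is not None]
    return [min(available, key=lambda c: sq_dist(t, centers[c])) for t in trajs]

def kmeans(trajs, initial_labels, k, max_iter=100):
    """Alternate the two phases until the partition stops changing."""
    labels = list(initial_labels)
    for _ in range(max_iter):
        centers = compute_centers(trajs, labels, k)
        new_labels = assign_to_nearest(trajs, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, compute_centers(trajs, labels, k)

# Toy data (hypothetical) with a deliberately poor initial assignment.
trajs = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 1, 0],
         [10, 10, 10, 10], [10, 11, 10, 11], [11, 10, 11, 10]]
labels, centers = kmeans(trajs, initial_labels=[0, 1, 0, 1, 0, 1], k=2)
print(labels)
```

On this toy data the alternating initial assignment is corrected after a single pass, because the two groups of trajectories are far apart; less clearly separated data may stop at a partition that depends on the starting condition.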
Package description
In this section, the content of the package is presented (see Fig. 4).
Artificial data sets
To test kml() and compare the efficiency of its various options, we used simulated longitudinal data. We constructed the data as follows: a data set is the mixture of several subgroups. A subgroup m is defined by a function f_m(x) called the theoretical trajectory. Each subject i of a subgroup follows the theoretical trajectory of its subgroup plus a personal variation ε_i(x). The mixture of different theoretical trajectories is called the data set shape. The final construction is performed
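The construction described so far (a theoretical trajectory per subgroup plus a personal variation per subject) can be illustrated with a short Python sketch. The Gaussian form of the personal variation, the subgroup sizes, the time points and the three example trajectories are assumptions chosen for the illustration; they are not taken from the text.

```python
import random

def simulate_dataset(theoretical, sizes, times, noise_sd=1.0, seed=0):
    """Build trajectories y_i(x) = f_m(x) + eps_i(x) for every subject i of subgroup m.

    theoretical: one function f_m per subgroup (the data set shape)
    sizes: number of subjects in each subgroup
    times: measurement times x
    noise_sd: standard deviation of the personal variation (assumed Gaussian)
    """
    rng = random.Random(seed)
    trajectories, groups = [], []
    for m, (f_m, n_m) in enumerate(zip(theoretical, sizes)):
        for _ in range(n_m):
            trajectories.append([f_m(x) + rng.gauss(0.0, noise_sd) for x in times])
            groups.append(m)  # true subgroup, kept to evaluate a clustering later
    return trajectories, groups

# Hypothetical data set shape: one flat, one increasing, one decreasing trajectory.
shape = [lambda x: 0.0, lambda x: float(x), lambda x: 10.0 - x]
times = list(range(11))
trajs, groups = simulate_dataset(shape, sizes=[5, 5, 5], times=times, noise_sd=0.5)
print(len(trajs), len(trajs[0]))
```

Keeping the true subgroup of each subject alongside the simulated trajectories makes it possible to compare a partition found by clustering with the one used to generate the data.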
Overview
KmL is a new implementation of k-means specifically designed to cluster longitudinal data. It can work either with a classical distance (Euclidean, Manhattan, Minkowski, etc.), with a distance dedicated to longitudinal data (Fréchet, dynamic time warping) or with any user-defined distance. It is able to deal with missing values, using either the Gower adjustment or several imputation methods that are provided. It also provides an easy way to run the algorithm several times, varying the starting
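To give an idea of how a distance can be computed when trajectories contain missing values, the Python sketch below uses only the time points observed in both trajectories and rescales by their number. This is one possible reading of an adjustment in the spirit of Gower; the exact formula used by the package is not given in this excerpt, so the scaling, the function name and the use of `None` for a missing value are assumptions.

```python
import math

def distance_with_missing(a, b):
    """Euclidean-type distance restricted to time points observed in both trajectories.

    Missing values are encoded as None (an assumption of this sketch). The sum of
    squared differences is divided by the number of shared time points, so that
    pairs of trajectories with many missing values are not artificially close.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.nan  # no shared time point: the distance is undefined
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

print(distance_with_missing([0, None, 0], [3, 5, None]))
```

An imputation method takes the opposite route: the missing values are first replaced by plausible ones, after which any ordinary distance can be applied.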
Conflict of interest statement
The author Genolini declares that he has certified that no potential conflicts of interest exist regarding this manuscript, that he has no relevant financial interests in this manuscript, and that he had full access to all the real data used in the study and takes responsibility for the integrity of the data analysis.
References (34)
Clustering of time series data—a survey
Pattern Recognition
(2005)

et al.
A classification EM algorithm for clustering and two stochastic versions
Computational Statistics and Data Analysis
(1992)

et al.
An empirical comparison of four initialization methods for the K-means algorithm
Pattern Recognition Letters
(1999)

et al.
Optimising k-means clustering results with standard software packages
Computational Statistics and Data Analysis
(2005)

et al.
Mixture model clustering for mixed data with missing information
Computational Statistics and Data Analysis
(2003)

et al.
Clustering functional data
Journal of Classification
(2003)

et al.
Clustering functional data with the SOM algorithm

et al.
Unsupervised curve clustering using B-splines
Scandinavian Journal of Statistics
(2003)

et al.
Clustering for sparsely sampled functional data
Journal of the American Statistical Association
(2003)

et al.
Latent class models for clustering: a comparison with K-means
Canadian Journal of Marketing Research
(2002)