KmL: A package to cluster longitudinal data

https://doi.org/10.1016/j.cmpb.2011.05.008

Abstract

Cohort studies are becoming essential tools in epidemiological research. In these studies, measurements are not restricted to single variables but can be seen as trajectories. Thus, an important question concerns the existence of homogeneous patient trajectories.

KmL is an R package providing an implementation of k-means designed to work specifically on longitudinal data. It provides several different techniques for dealing with missing values in trajectories (classical ones like linear interpolation or LOCF but also new ones like copyMean). It can run k-means with distances specifically designed for longitudinal data (like Frechet distance or any user-defined distance). Its graphical interface helps the user to choose the appropriate number of clusters when classic criteria are not efficient. It also provides an easy way to export graphical representations of the mean trajectories resulting from the clustering. Finally, it runs the algorithm several times, using various kinds of starting conditions and/or numbers of clusters to be sought, thus sparing the user a lot of manual re-sampling.

Introduction

Cohort studies are becoming essential tools in epidemiological research. In these studies, measurements collected for a single subject can be seen as trajectories. Thus, an important question concerns the existence of homogeneous patient trajectories. From a statistical point of view, many methods have been developed to deal with this issue [1], [2], [3], [4]. In a survey [5], Warren-Liao divides these methods into five families: partitioning methods construct k clusters, each containing at least one individual; hierarchical methods work by grouping data objects into a tree of clusters; density-based methods make clusters grow as long as the density in the “neighborhood” exceeds a certain threshold; grid-based methods quantize the object space and perform the clustering operation on the resulting finite grid structure; model-based methods assume a model for each cluster and look for the best fit of the data to the model.

The pros and cons of these approaches are regularly discussed [6], [7] even if there is little data to show which method is indeed preferable in which situation. In this paper, we consider k-means, a well-known partitioning method [8], [9]. In favor of an algorithm of this type the following points can be cited: (1) it does not require any normality or parametric assumptions within clusters (although it might be more efficient under certain assumptions). This might be of great interest when the aim is to cluster data on which no prior information is available; (2) it is likely to be more robust as regards numerical convergence; (3) in the particular context of longitudinal data, it does not require any assumption regarding the shape of the trajectory (this is likely to be an important point: the clustering of longitudinal data is basically an exploratory approach); (4) also in the longitudinal context, it is independent from time scaling.

On the other hand, it also suffers from some drawbacks: (1) formal tests cannot be used to check the validity of the partition; (2) the number of clusters needs to be known a priori; (3) the algorithm is not deterministic, because the starting condition is often determined at random, so it may converge to a local optimum and one cannot be sure that the best partition has been found; (4) the estimation of a quality criterion cannot be performed if there are missing values in the trajectories.

Regarding software, numerous versions of k-means exist, some with a traditional approach [10], [11], some with variations [12], [13], [14], [15], [16], [17]. However, they have several weaknesses: (1) they are not able to deal with missing values; (2) since determining the number of clusters is still an open question, they require the user to manually re-run k-means several times.

KmL is a new implementation of k-means specifically designed to analyze longitudinal data. Our package is designed for the R platform and is available on CRAN [18]. It is able to deal with missing values; it also provides an easy way to run the algorithm several times, varying the starting conditions and/or the number of clusters looked for; its graphical interface helps the user to choose the appropriate number of clusters when the classic criterion is not efficient.

Section 2 presents theoretical aspects of KmL: the algorithm, different solutions to deal with missing values and quality criteria to select the best number of clusters. Section 3 gives a description of the package. Section 4 compares the impact of the different starting conditions. Section 5 is the discussion.

Section snippets

Introduction to k-means

k-Means is a hill-climbing algorithm [7] belonging to the EM class (Expectation–Maximization) [11]. EM algorithms work as follows: initially, each observation is assigned to a cluster; then the optimal partition is reached by alternating two phases called respectively “Expectation” and “Maximization”. During the Expectation phase, the center of each cluster is determined. The Maximization phase then consists in assigning each observation to its “nearest cluster”. The alternation of the two phases is
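The two alternating phases can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not the package's code: the function names, the shuffled initial assignment, the Euclidean distance and the handling of empty clusters are all assumptions made for illustration.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two trajectories of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean_trajectory(trajs):
    """Point-by-point mean of a non-empty list of trajectories."""
    return [sum(values) / len(values) for values in zip(*trajs)]


def kmeans_trajectories(trajs, k, seed=0, max_iter=100):
    """Hypothetical helper: cluster trajectories, return (labels, centers)."""
    rng = random.Random(seed)
    # Initial step: every trajectory is assigned to a cluster.
    # Assumption: a shuffled round-robin assignment, so that no cluster
    # starts empty when there are at least k trajectories.
    labels = [i % k for i in range(len(trajs))]
    rng.shuffle(labels)
    centers = []
    for _ in range(max_iter):
        # "Expectation" phase (naming as in the text): compute each center.
        centers = []
        for c in range(k):
            members = [t for t, lab in zip(trajs, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random trajectory.
            centers.append(mean_trajectory(members) if members
                           else list(rng.choice(trajs)))
        # "Maximization" phase: assign each trajectory to its nearest center.
        new_labels = [min(range(k), key=lambda c: euclidean(t, centers[c]))
                      for t in trajs]
        if new_labels == labels:  # no trajectory changed cluster: stop
            break
        labels = new_labels
    return labels, centers


# Usage: two well-separated groups of three-point trajectories.
low = [[0, 0, 0], [1, 1, 1], [0, 1, 0]]
high = [[10, 10, 10], [9, 10, 11], [10, 9, 10]]
labels, centers = kmeans_trajectories(low + high, k=2, seed=1)
```

Because the starting assignment is random, different seeds can lead to different partitions on less clearly separated data, which is why repeated runs are useful.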

Package description

In this section, the content of the package is presented (see Fig. 4).

Artificial data sets

To test kml() and compare the efficiency of its various options, we used simulated longitudinal data. We constructed the data as follows: a data set is the mixture of several sub-groups. A sub-group m is defined by a function f_m(x) called the theoretical trajectory. Each subject i of a sub-group follows the theoretical trajectory of its sub-group plus a personal variation ε_i(x). The mixture of different theoretical trajectories is called the data set shape. The final construction is performed
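The construction described above (theoretical trajectory plus personal variation) can be sketched as follows. This is a plain Python illustration, not the simulation code used in the paper: the function name `simulate_dataset`, the independent Gaussian noise at each time point, and the three example shapes are assumptions.

```python
import random


def simulate_dataset(theoretical, n_per_group, times, noise_sd=1.0, seed=0):
    """Hypothetical helper: build a mixture of sub-groups.

    theoretical -- list of functions f_m(x), one per sub-group
    n_per_group -- number of subjects simulated in each sub-group
    times       -- measurement times x
    noise_sd    -- standard deviation of the personal variation
                   (assumption: independent Gaussian noise at each time)
    """
    rng = random.Random(seed)
    data, groups = [], []
    for m, f_m in enumerate(theoretical):
        for _ in range(n_per_group):
            # Subject trajectory = theoretical trajectory + personal variation.
            data.append([f_m(x) + rng.gauss(0.0, noise_sd) for x in times])
            groups.append(m)  # true sub-group, kept to evaluate a clustering
    return data, groups


# Usage: an illustrative "data set shape" with three theoretical trajectories.
shapes = [lambda x: 10.0, lambda x: float(x), lambda x: 10.0 - x]
data, groups = simulate_dataset(shapes, n_per_group=50, times=range(11))
```

Keeping the true sub-group of each subject makes it possible to compare a partition found by clustering against the one used to generate the data.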

Overview

KmL is a new implementation of k-means specifically designed to cluster longitudinal data. It can work either with a classical distance (Euclidean, Manhattan, Minkowski, etc.), with a distance dedicated to longitudinal data (Frechet, dynamic time warping) or with any user-defined distance. It is able to deal with missing values, using either the Gower adjustment or one of several imputation methods that are provided. It also provides an easy way to run the algorithm several times, varying the starting
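Two of the classical imputation methods named in the abstract (LOCF and linear interpolation) and a distance that tolerates missing values can be sketched in a few lines. This is a plain Python illustration under stated assumptions, not the package's implementation: missing values are represented by `None`, time points are assumed equally spaced, and `gower_euclidean` uses one common form of the Gower adjustment (distance computed over jointly observed time points, then rescaled to the full trajectory length). The copyMean method is not shown.

```python
def locf(traj):
    """Last observation carried forward; leading gaps stay None."""
    filled, last = [], None
    for value in traj:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def linear_interpolation(traj):
    """Fill interior gaps on the line joining the neighbouring observations.

    Assumption: equally spaced time points. Leading and trailing gaps
    are left as None.
    """
    filled = list(traj)
    known = [i for i, value in enumerate(traj) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (traj[right] - traj[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = traj[left] + step * (i - left)
    return filled


def gower_euclidean(a, b):
    """Euclidean distance restricted to time points observed in both
    trajectories, rescaled by (total length / number of usable points)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None  # no time point in common: distance undefined
    scale = len(a) / len(pairs)
    return (scale * sum((x - y) ** 2 for x, y in pairs)) ** 0.5


# Usage on a trajectory with an interior gap and a trailing gap.
traj = [1, None, None, 4, None]
print(locf(traj))                  # [1, 1, 1, 4, 4]
print(linear_interpolation(traj))  # [1, 2.0, 3.0, 4, None]
```

With no missing values the rescaling factor is 1, so `gower_euclidean` reduces to the ordinary Euclidean distance.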

Conflict of interest statement

The author Genolini declares that no potential conflicts of interest exist regarding this manuscript, that he has no relevant financial interests in this manuscript, and that he had full access to all the real data used in the study and takes responsibility for the integrity of the data analysis.

References (34)

  • B.S. Everitt et al., Cluster Analysis (2001)
  • J. MacQueen, Some methods for classification and analysis of multivariate observations
  • J. Hartigan et al., A K-means clustering algorithm, Journal of the Royal Statistical Society Series C (Applied Statistics) (1979)
  • Rousseeuw et al., Finding Groups in Data: An Introduction to Cluster Analysis (1990)
  • S. Tokushige et al., Crisp and fuzzy k-means clustering algorithms for multivariate functional data, Computational Statistics (2007)
  • T. Tarpey, Linear transformations and the k-means clustering algorithm: applications to clustering curves, The American Statistician (2007)
  • L.A. García-Escudero et al., A proposal for robust curve clustering, Journal of Classification (2005)