Solved – How to do time series (longitudinal) clustering based entirely on the shape of the curves

clustering, k-means, panel data, time series

I have a longitudinal (panel) dataset of investment growth for 120 countries, covering 1960 to 2008. Essentially, it can be viewed as 120 time series.

I am interested in grouping countries based on the shape of their growth curves over time. Whether two countries share a similar curve shape is the only criterion I need for grouping them.

I have tried the KmL package (K-means for Longitudinal Data), but it seems (please correct me if I am wrong) that this method groups countries by the mean value (magnitude) of their investment growth rather than by the shape of the curve. For example, KmL tends to form groups of countries with high, medium, and low investment growth. The countries within each group may have very different curve shapes over time.

What I am looking for should ignore the absolute value of investment growth. As long as two countries exhibit a similar pattern in their growth curves over time, they should be placed in the same group.

Could anyone tell me a way to implement this clustering? I have noticed from previous posts that a cointegration test may work. Any suggestions will be greatly appreciated!

Best Answer

If you z-standardize each of your series, $(X_i-\bar{X})/\sigma$, with $\bar{X}$ and $\sigma$ the mean and standard deviation of that series, you first equalize the level of the series and then their swing, so the only difference that remains is the difference in shape. Compute Euclidean distances (or a similar measure) between the 120 series and perform hierarchical clustering. You might also want to smooth the curves mildly before doing any of this.
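A minimal sketch of these steps in base R, under stated assumptions: the matrix `growth` and its six rows are made-up stand-in data (not the real investment series), and average linkage with a cut at two clusters is an arbitrary choice for illustration.

```r
set.seed(101)
n_time <- length(1960:2008)

# Hypothetical stand-in data: one row per "country", one column per year.
# Three series rise linearly, three are hump-shaped, all at different levels and scales.
trend <- seq(0, 1, length.out = n_time)
hump  <- sin(pi * trend)
growth <- rbind(
  A =  2 + 1.0 * trend, B = 10 + 5.0 * trend, C = -3 + 0.5 * trend,
  D =  1 + 2.0 * hump,  E =  8 + 0.3 * hump,  F = -5 + 4.0 * hump
)
growth <- growth + rnorm(length(growth), sd = 0.01)  # small noise

# z-standardize each series by its own mean and standard deviation.
# scale() works column-wise, hence the double transpose.
z <- t(scale(t(growth)))

# Euclidean distances between the standardized series, then hierarchical clustering.
d  <- dist(z)
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

With the sample standard deviation that `scale()` uses, the squared Euclidean distance between two standardized series of length $n$ equals $2(n-1)(1-r)$, where $r$ is their Pearson correlation, so `as.dist(1 - cor(t(growth)))` ranks the pairs of series in the same order.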

Related Solutions

How to Conduct Time-Series Clustering Based on Curve Shape

Several directions for analyzing longitudinal data were discussed in the link provided by @Jeromy, so I suggest you read them carefully, especially those on functional data analysis. Try googling for "Functional Clustering of Longitudinal Data", or look at the PACE Matlab toolbox, which is specifically concerned with model-based clustering of irregularly sampled trajectories (Peng and Müller, Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions, Annals of Applied Statistics 2008 2: 1056). I can imagine that there may be a good statistical framework for financial time series, but I don't know about that.

The kml package basically relies on k-means, working (by default) on Euclidean distances between the $t$ measurements observed on $n$ individuals. What is called a trajectory is just the series of observed values for individual $i$, $y_i=(y_{i1},y_{i2},\dots,y_{it})$, and $d(y_i,y_j)=\sqrt{t^{-1}\sum_{k=1}^t(y_{ik}-y_{jk})^2}$. Missing data are handled through a slight modification of this distance measure (Gower adjustment), combined with a nearest-neighbor-like imputation scheme (used for computing the Calinski criterion). As I cannot picture what your real data look like, I cannot say whether it will work. At least, it works with longitudinal growth curves of "polynomial" shape, but I doubt it will let you detect very specific patterns (such as local minima/maxima at specific time points, with those time points differing between clusters, by a translation for example). If you are interested in clustering possibly misaligned curves, then you definitely have to look at other solutions; Functional clustering and alignment, by Sangalli et al., and the references therein may provide a good starting point.
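To see why this distance tends to group trajectories by level rather than by shape, here is a small sketch that codes the formula directly. The helper `traj_dist` and the three toy trajectories are hypothetical, written only for illustration; this is not kml's internal code.

```r
# Distance as defined above: root mean squared difference over the t time points.
# (traj_dist is a hypothetical helper, not a kml function.)
traj_dist <- function(y_i, y_j) sqrt(mean((y_i - y_j)^2))

tp <- 1:20
y1 <- 0.5 * tp            # rising line at a low level
y2 <- 0.5 * tp + 30       # identical shape, shifted up by 30
y3 <- rep(mean(y1), 20)   # flat line at the average level of y1

d_same_shape <- traj_dist(y1, y2)  # a pure level shift of 30 gives a distance of 30
d_same_level <- traj_dist(y1, y3)  # different shape, same average level

# The same quantity via dist(), rescaled by sqrt(t)
d_check <- as.numeric(dist(rbind(y1, y2))) / sqrt(length(tp))

print(c(same_shape = d_same_shape, same_level = d_same_level))
```

The flat line ends up much closer to `y1` than the shifted copy of `y1` does, which is the behavior described in the question. Standardizing each trajectory before computing distances removes that effect.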

Below is some code that may help you experiment with it (my seed is generally set at 101, if you want to reproduce the results). To use kml, you just have to construct a clusterizLongData object (an id number in the first column and the $t$ measurements in the following columns).

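library(lattice)
# test.data: long-format data frame with columns store (id), date and var0 (value)
xyplot(var0 ~ date, data=test.data, groups=store, type=c("l","g"))

# Reshape to wide format: one row per store, one column per date
tw <- reshape(test.data, timevar="date", idvar="store", direction="wide")
# parallel() was renamed parallelplot() in later lattice releases
parallelplot(tw[,-1], horizontal.axis=FALSE,
             scales=list(x=list(rot=45,
                                at=seq(1,ncol(tw)-1,by=2),
                                labels=substr(names(tw[,-1])[seq(1,ncol(tw)-1,by=2)],6,100),
                                cex=.5)))

library(kml)
set.seed(101)  # seed mentioned above, for reproducibility
# First column is the id, the remaining columns are t1, t2, ...
names(tw) <- c("id", paste0("t", 1:(ncol(tw)-1)))
# as.cld() belongs to the kml version used here; later releases appear to
# build the object with clusterLongData() / cld() instead
tw.cld <- as.cld(tw)
cld.res <- kml(tw.cld,nbRedrawing=5)
plot(tw.cld)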

The next two figures show the raw simulated data and the five-cluster solution (chosen according to the Calinski criterion, also used in the fpc package). I don't show the scaled version.

[Figure: raw simulated data]

[Figure: five-cluster solution]

Longitudinal Data – Understanding Time Series, Repeated Measures, and Other Longitudinal Data Types

As Jeromy Anglim said, it would help to know the number of time points you have for each individual; since you said "many", I would venture that functional data analysis might be a viable alternative. You might want to check the R package fda and look at the book by Ramsay and Silverman.
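As a rough illustration of the functional view, and explicitly not the fda package's API, the sketch below treats each series as a smooth function using base R's `smooth.spline`, evaluates its first derivative on the common time grid, and clusters on those derivatives. The four curves, the fixed `df = 5`, and the choice of average linkage with two clusters are all assumptions made for the example.

```r
set.seed(101)
tp <- 1:49                      # e.g. 49 yearly observations
u  <- (tp - 1) / 48

# Hypothetical curves: two rising, two hump-shaped, at very different levels.
curves <- rbind(
  up1   =  1 + 2.0 * u,
  up2   = 10 + 2.5 * u,
  hump1 =  3 + 2.0 * sin(pi * u),
  hump2 = -4 + 2.5 * sin(pi * u)
)
curves <- curves + rnorm(length(curves), sd = 0.02)

# Smooth each series, then evaluate the first derivative at every time point.
# Differentiating removes the level of a curve (but not its swing).
deriv1 <- t(apply(curves, 1, function(y) {
  fit <- smooth.spline(tp, y, df = 5)
  predict(fit, x = tp, deriv = 1)$y
}))

# Cluster the derivative curves instead of the raw values.
hc <- hclust(dist(deriv1), method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

Working on derivatives is only one possible way to make the comparison insensitive to level; the fda package offers basis expansions and registration tools that go well beyond this.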
