Time Series – Clustering Time Series Data in R

clustering, cointegration, r, time series

I have a set of time series data. Each series covers the same period, although the actual dates in each time series may not all 'line up' exactly.

That is to say, if the time series were read into a 2D matrix, it would look something like this:

date      T1   T2   T3   .... TN
1/1/01    100  59   42        N/A
2/1/01    120  29   N/A       42.5
3/1/01    110  N/A  12        36.82
4/1/01    N/A  59   40        61.82
5/1/01    05   99   42        23.68
...
31/12/01  100  59   42        N/A

etc 

I want to write an R script that will segregate the time series {T1, T2, … TN} into 'families' where a family is defined as a set of series which "tend to move in sympathy" with each other.

For the 'clustering' part, I will need to select or define some kind of distance measure. I am not quite sure how to go about this: I am dealing with time series, and a pair of series that moves in sympathy over one interval may not do so over a subsequent interval.

I am sure there are far more experienced/clever people than me on here, so I would be grateful for any suggestions or ideas on which algorithm/heuristic to use for the distance measure, and on how to use it to cluster the time series.

My guess is that there is NOT an established, robust statistical method for doing this, so I would be very interested to see how people approach/solve this problem – thinking like a statistician.

Best Answer

In data streaming and mining of time series databases, a common approach is to transform each series to a symbolic representation, then use a distance measure, such as Euclidean distance, to cluster the series. The most popular representations are SAX (Keogh & Lin) and the newer iSAX (Shieh & Keogh).

The SAX and iSAX pages also contain references to distance metrics and clustering. Keogh and crew are into reproducible research and pretty receptive to releasing their code, so you could email them and ask. I believe they tend to work in MATLAB/C++ though.

There was a recent effort to produce a Java and R implementation:

I don't know how far along it is. It's geared towards motif finding, but depending on how far they've gotten, it should have the bits you need to put something together (iSAX and distance metrics, since these are common to both clustering and motif finding).
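
To make the idea concrete, here is a rough base-R sketch of the approach described in this answer: z-normalise each series, reduce it with a piecewise aggregate approximation (PAA), map the segment means to SAX symbols, and cluster on the SAX lower-bounding distance (MINDIST). This is not Keogh's code or the Java/R implementation mentioned above. The helper names (`znorm`, `paa`, `sax`, `mindist`), the word length `w = 12`, the alphabet size `a = 5`, and the simulated data are all assumptions made for illustration. Missing values are simply skipped when averaging, which is one possible way to cope with dates that do not line up.

```r
# Illustrative SAX-style clustering in base R (hypothetical helpers, simulated data).

# z-normalise a series, ignoring missing values
znorm <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Piecewise Aggregate Approximation: mean of each of w roughly equal segments.
# A segment that is entirely NA yields NaN and is ignored in the distance below.
paa <- function(x, w) {
  seg <- cut(seq_along(x), breaks = w, labels = FALSE)
  as.numeric(tapply(x, seg, mean, na.rm = TRUE))
}

# SAX word: integer symbols 1..a, using equiprobable N(0, 1) breakpoints
sax <- function(x, w, a) {
  breakpoints <- qnorm(seq_len(a - 1) / a)
  findInterval(paa(znorm(x), w), breakpoints) + 1L
}

# MINDIST between two SAX words built from series of original length n.
# Equal or adjacent symbols contribute 0; others contribute the breakpoint gap.
mindist <- function(s1, s2, n, a) {
  b  <- qnorm(seq_len(a - 1) / a)
  hi <- pmax(s1, s2)
  lo <- pmin(s1, s2)
  gap  <- b[pmax(hi - 1L, 1L)] - b[pmin(lo, length(b))]
  cell <- ifelse(hi - lo <= 1L, 0, gap)
  sqrt(n / length(s1)) * sqrt(sum(cell^2, na.rm = TRUE))
}

# Simulated example: three noisy sine-shaped series and three noisy trends
set.seed(1)
n <- 120
t <- seq_len(n)
base_a <- sin(2 * pi * t / n)
base_b <- seq(-1, 1, length.out = n)
series <- cbind(
  sapply(1:3, function(i) base_a + rnorm(n, sd = 0.1)),
  sapply(1:3, function(i) base_b + rnorm(n, sd = 0.1))
)
colnames(series) <- paste0("T", 1:6)
series[sample(length(series), 30)] <- NA  # mimic dates that do not line up

w <- 12  # SAX word length (assumed)
a <- 5   # alphabet size (assumed)
words <- apply(series, 2, sax, w = w, a = a)  # one SAX word per column

# Pairwise MINDIST matrix, then average-linkage hierarchical clustering
N <- ncol(series)
d <- matrix(0, N, N, dimnames = list(colnames(series), colnames(series)))
for (i in seq_len(N)) {
  for (j in seq_len(N)) {
    d[i, j] <- mindist(words[, i], words[, j], n, a)
  }
}
families <- cutree(hclust(as.dist(d), method = "average"), k = 2)
print(families)
```

MINDIST is deliberately coarse (adjacent symbols count as distance zero), so in practice you would probably tune `w` and `a`, and choose the number of families by inspecting the dendrogram rather than fixing `k = 2`. To handle series that move together in one interval but not in another, the same steps could be applied to sliding windows instead of the whole period.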
