Solved – Clustering time series with wavelets in R

clusteringfeature selectionrtime serieswavelet

Can discrete wavelet trasform be used for feature extraction from time series in order to cluster them? Any R code how to do this will be appreciated.

Best Answer

You might find this useful:

# EXAMPLE OF BIAS CORRECTION:
install.packages("biwavelet")
require(biwavelet)
# Generate a synthetic time series ’s’ with the same power at three distinct periods
t1=sin(seq(from=0, to=2*5*pi, length=1000))
t2=sin(seq(from=0, to=2*15*pi, length=1000))
t3=sin(seq(from=0, to=2*40*pi, length=1000))
s=t1+t2+t3
# Compare non-corrected vs. corrected wavelet spectrum
wt1=wt(cbind(1:1000, s))
par(mfrow=c(1,2))
plot(wt1, type="power.corr.norm", main="Bias-corrected")
plot(wt1, type="power.norm", main="Not-corrected")
# Compare non-corrected vs. corrected cross-wavelet spectrum
x1=xwt(cbind(1:1000, s), cbind(1:1000, s))
par(mfrow=c(1,2))
plot(x1, type="power.corr.norm", main="Bias-corrected")
plot(x1, type="power.norm", main="Not-corrected")
# ADDITIONAL EXAMPLES
t1=cbind(1:100, rnorm(100))
t2=cbind(1:100, rnorm(100))
# Continuous wavelet transform
wt.t1=wt(t1)
# Plot power
# Make room to the right for the color bar
par(oma=c(0, 0, 0, 1), mar=c(5, 4, 4, 5) + 0.1)
plot(wt.t1, plot.cb=TRUE, plot.phase=FALSE)
# Cross-wavelet
xwt.t1t2=xwt(t1, t2)
# Plot cross wavelet power and phase difference (arrows)
plot(xwt.t1t2, plot.cb=TRUE)
# Wavelet coherence; nrands should be large (>= 1000)
wtc.t1t2=wtc(t1, t2, nrands=10)
# Plot wavelet coherence and phase difference (arrows)
# Make room to the right for the color bar
par(oma=c(0, 0, 0, 1), mar=c(5, 4, 4, 5) + 0.1)
plot(wtc.t1t2, plot.cb=TRUE)
# Perform wavelet clustering of three time series
t1=cbind(1:100, sin(seq(from=0, to=10*2*pi, length.out=100)))
t2=cbind(1:100, sin(seq(from=0, to=10*2*pi, length.out=100)+0.1*pi))
t3=cbind(1:100, rnorm(100))
# Compute wavelet spectra
wt.t1=wt(t1)
wt.t2=wt(t2)
wt.t3=wt(t3)
# Store all wavelet spectra into array
w.arr=array(NA, dim=c(3, NROW(wt.t1$wave), NCOL(wt.t1$wave)))
w.arr[1, , ]=wt.t1$wave
w.arr[2, , ]=wt.t2$wave
w.arr[3, , ]=wt.t3$wave
# Compute dissimilarity and distance matrices
w.arr.dis=wclust(w.arr)
plot(hclust(w.arr.dis$dist.mat, method="ward"), sub="", main="",
     ylab="Dissimilarity", hang=-1)

and the output looks like: enter image description here

I found it here.

Related Solutions

Wavelets – Application of Wavelets to Time-Series-Based Anomaly Detection Algorithms

Wavelets are useful to detect singularities in a signal (see for example the paper here (see figure 3 for an illustration) and the references mentioned in this paper. I guess singularities can sometimes be an anomaly?

The idea here is that the Continuous wavelet transform (CWT) has maxima lines that propagates along frequencies, i.e. the longer the line is, the higher is the singularity. See Figure 3 in the paper to see what I mean! note that there is free Matlab code related to that paper, it should be here.

Additionally, I can give you some heuristics detailing why the DISCRETE (preceding example is about the continuous one) wavelet transform (DWT) is interesting for a statistician (excuse non-exhaustivity) :

There is a wide class of (realistic (Besov space)) signals that are transformed into a sparse sequence by the wavelet transform. (compression property)
A wide class of (quasi-stationary) processes that are transformed into a sequence with almost uncorrelated features (decorrelation property)
Wavelet coefficients contain information that is localized in time and in frequency (at different scales). (multi-scale property)
Wavelet coefficients of a signal concentrate on its singularities.

How to Conduct Time-Series Clustering Based on Curve Shape

Several directions for analyzing longitudinal data were discussed in the link provided by @Jeromy, so I would suggest you to read them carefully, especially those on functional data analysis. Try googling for "Functional Clustering of Longitudinal Data", or the PACE Matlab toolbox which is specifically concerned with model-based clustering of irregularly sampled trajectories (Peng and Müller, Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions, Annals of Applied Statistics 2008 2: 1056). I can imagine that there may be a good statistical framework for financial time series, but I don't know about that.

The kml package basically relies on k-means, working (by default) on euclidean distances between the $t$ measurements observed on $n$ individuals. What is called a trajectory is just the series of observed values for individual $i$, $y_i=(y_{i1},y_{i2},\dots,y_{it})$, and $d(y_i,y_j)=\sqrt{t^{-1}\sum_{k=1}^t(y_{ik}-y_{jk})^2}$. Missing data are handled through a slight modification of the preceding distance measure (Gower adjustment) associated to a nearest neighbor-like imputation scheme (for computing Calinski criterion). As I don't represent myself what you real data would look like, I cannot say if it will work. At least, it work with longitudinal growth curves, "polynomial" shape, but I doubt it will allow you to detect very specific patterns (like local minima/maxima at specific time-points with time-points differing between clusters, by a translation for example). If you are interested in clustering possibly misaligned curves, then you definitively have to look at other solutions; Functional clustering and alignment, from Sangalli et al., and references therein may provide a good starting point.

Below, I show you some code that may help to experiment with it (my seed is generally set at 101, if you want to reproduce the results). Basically, for using kml you just have to construct a clusterizLongData object (an id number for the first column, and the $t$ measurements in the next columns).

library(lattice)
xyplot(var0 ~ date, data=test.data, groups=store, type=c("l","g"))

tw <- reshape(test.data, timevar="date", idvar="store", direction="wide")
parallel(tw[,-1], horizontal.axis=F, 
         scales=list(x=list(rot=45, 
                            at=seq(1,ncol(tw)-1,by=2), 
                            labels=substr(names(tw[,-1])[seq(1,ncol(tw)-1,by=2)],6,100), 
                            cex=.5)))

library(kml)
names(tw) <- c("id", paste("t", 1:(ncol(tw)-1)))
tw.cld <- as.cld(tw)
cld.res <- kml(tw.cld,nbRedrawing=5)
plot(tw.cld)

The next two figures are the raw simulated data and the five-cluster solution (according to Calinski criterion, also used in the fpc package). I don't show the scaled version.

alt text

Best Answer

Related Solutions

Wavelets – Application of Wavelets to Time-Series-Based Anomaly Detection Algorithms

How to Conduct Time-Series Clustering Based on Curve Shape

Related Question