Solved – Multidimensional dynamic time warping

I am trying to understand how to extend the idea of one dimensional dynamic time warping to the multidimensional case.

Let's assume I have a dataset with two dimensions, where TrainA holds dimension 1 and TrainB holds dimension 2. It seems that the simplest case would be

distA = dtw(TrainA) 
distB = dtw(TrainB) 
dist = distA + distB  // or maybe distA*distB

Is this the right approach? I know there are packages that do this for you but I want to understand what is actually being done.

Best Answer

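There are two ways to do it. The way you describe is $DTW_I$ (each dimension is warped independently and the distances are summed). The other way, $DTW_D$ (a single warping shared by all dimensions), can be better because it pools the information across dimensions before warping.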

An explanation of the differences, along with an empirical study, is here: http://www.cs.ucr.edu/~eamonn/Multi-Dimensional_DTW_Journal.pdf
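To make the difference concrete, here is a minimal sketch of both variants in plain Python. It assumes a squared-difference local cost and no warping-window constraint; the function names (`dtw`, `dtw_independent`, `dtw_dependent`) and the toy data are my own illustrative choices, not taken from the linked paper or any package.

```python
import math

def dtw(a, b, dist):
    """Cumulative cost of the cheapest warping path between sequences a and b."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def sq(x, y):
    """Squared difference between two scalar samples."""
    return (x - y) ** 2

def sq_vec(x, y):
    """Squared Euclidean distance between two vector samples."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def dtw_independent(q, c):
    """DTW_I: warp every dimension on its own, then add the distances."""
    return sum(dtw(qd, cd, sq) for qd, cd in zip(q, c))

def dtw_dependent(q, c):
    """DTW_D: one warping path over vector-valued samples."""
    q_points = list(zip(*q))  # [(dim1[t], dim2[t]), ...]
    c_points = list(zip(*c))
    return dtw(q_points, c_points, sq_vec)

# Hypothetical toy data: each series is a list of per-dimension lists.
# In c the bump in dimension 1 comes late and the bump in dimension 2 comes early.
q = [[0, 1, 2, 1, 0, 0], [0, 0, 1, 2, 1, 0]]
c = [[0, 0, 1, 2, 1, 0], [0, 1, 2, 1, 0, 0]]

d_i = dtw_independent(q, c)
d_d = dtw_dependent(q, c)
```

On this data the independent version can align each dimension perfectly, while the dependent version cannot, because no single warping path fixes both dimensions at once. With this cost, `d_i` can never exceed `d_d`; whether that extra freedom helps or hurts depends on whether the dimensions really share a common time distortion.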

Related Solutions

Dynamic Time Warping – Dynamic Time Warping and Normalization in Time Series

No "general approach" exists for this, at least to my knowledge. Besides, you are trying to minimize a distance metric anyway. For example, in the granddaddy of DTW papers, Sakoe & Chiba (1978) use $\| a_i - b_i \|$ as the measure of difference between two feature vectors.

As you correctly identified, you (usually) need to have the same number of points for this to work out of the box. I would propose using a lowess() smoother/interpolator over your curves to make them of equal size first. It's pretty standard stuff for "curve statistics". You can see an example application in Chiou et al. (2003); the authors don't care about DTW as such in this work, but it is a good example of how to deal with readings of unequal size.
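As a rough illustration of the "make them equal size first" step, the sketch below resamples a curve onto a fixed number of points. Note that it swaps in plain linear interpolation instead of lowess (so it does no smoothing), and it assumes the input samples are evenly spaced; `resample_linear` is a hypothetical helper name.

```python
def resample_linear(values, n_out):
    """Linearly interpolate evenly spaced samples onto n_out evenly spaced points."""
    n_in = len(values)
    if n_in < 2 or n_out < 2:
        raise ValueError("need at least two input and two output points")
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out

short_curve = [0, 2, 4]
resampled = resample_linear(short_curve, 5)  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

For noisy readings a smoother such as lowess is the better choice, since interpolation alone carries the noise straight into the resampled curve.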

Additionally, as you say, "amplitude" is an issue. This is a bit more open-ended, to be honest. You can try an Area-Under-the-Curve approach like the one proposed by Zhang and Mueller (2011) to take care of this, but really, for the purposes of time warping, even sup-norm normalization (i.e. replace $f(x)$ with $\frac{f(x)}{\sup_y |f(y)|}$) could do, as in this paper by Tang and Mueller (2009). I would follow the second, but in any case, as you also noticed, normalization of samples is a necessity.
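For sampled curves the sup-norm normalization amounts to dividing by the largest absolute value. A minimal sketch, assuming the curve is given as a plain list of numbers (`sup_normalize` is a hypothetical name):

```python
def sup_normalize(values):
    """Divide a sampled curve by its largest absolute value (discrete sup-norm)."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)  # an all-zero curve is left unchanged
    return [v / peak for v in values]

normalized = sup_normalize([1, -4, 2])  # [0.25, -1.0, 0.5]
```

After this step every curve lies in $[-1, 1]$, so differences in amplitude no longer dominate the warping cost.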

Depending on the nature of your data, you can find more application-specific literature. I personally find the approach of minimizing with respect to a target pairwise warping function $g$ the most intuitive of all. So the target function to minimize is: $C_\lambda(Y_i, Y_k, g) = E\{ \int_T (Y_i(g(t)) - Y_k(t))^2 + \lambda (g(t) - t)^2 \, dt \mid Y_i, Y_k \}$, where the whole thing, despite its uncanniness, is actually quite straightforward: you try to find the warping function $g$ that minimizes the expected sum of the mismatch of the warped query curve $Y_i(g(t))$ to the reference curve $Y_k(t)$ (the term $Y_i(g(t)) - Y_k(t)$), subject to some regularization of the time distortion you impose by that warping (the term $g(t) - t$). This is what the MATLAB package PACE is implementing. I know that there exists an R package fda by J. O. Ramsay et al. that might be of help also, but I have not personally used it (a bit annoyingly, the standard reference for that package's methods is in many cases Ramsay and Silverman's excellent book, Functional Data Analysis (2006) 2nd ed., and you have to scour a 400-page book to get what you are looking for; at least it's a good read anyway).
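To see how the two terms of $C_\lambda$ trade off, here is a small numerical sketch. It is only an illustration under strong assumptions: the expectation is dropped (one noise-free pair of curves), the integral is replaced by a Riemann sum on an evenly spaced grid, and the candidate warps are given rather than optimized. It is not how PACE or fda compute anything; all names and toy curves are hypothetical.

```python
import math

def warping_cost(y_i, y_k, g, lam, t_grid):
    """Riemann-sum approximation of the integral of
    (y_i(g(t)) - y_k(t))^2 + lam * (g(t) - t)^2 over an evenly spaced grid."""
    dt = t_grid[1] - t_grid[0]
    total = 0.0
    for t in t_grid:
        mismatch = (y_i(g(t)) - y_k(t)) ** 2
        distortion = lam * (g(t) - t) ** 2
        total += (mismatch + distortion) * dt
    return total

# Toy curves on T = [0, 1]: y_i is y_k with its time axis distorted (t -> t^2),
# so the warp g(t) = sqrt(t) undoes the distortion exactly.
def y_k(t):
    return math.sin(2 * math.pi * t)

def y_i(t):
    return math.sin(2 * math.pi * t * t)

def identity(t):
    return t

t_grid = [k / 200 for k in range(201)]
lam = 0.1

cost_identity = warping_cost(y_i, y_k, identity, lam, t_grid)   # mismatch only
cost_sqrt = warping_cost(y_i, y_k, math.sqrt, lam, t_grid)      # distortion penalty only
```

Here the square-root warp wins because the penalty it pays for distorting time is much smaller than the mismatch left by not warping at all. Raising $\lambda$ makes warping more expensive and pushes the minimizer back toward the identity.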

The problem you are describing is widely known in the statistics literature as "curve registration" (for example, see Gasser and Kneip (1995) for an early treatment of the issue) and falls under the general umbrella of Functional Data Analysis techniques.

(Where I could find the original paper available online, the link directs there; otherwise the link directs to a general digital library. Draft versions of almost all the papers mentioned can be found for free. I deleted my original comment as it is superseded by this post.)

Dynamic Time Warping – Dynamic Time Warping Clustering Explained

Do not use k-means for timeseries.

DTW is not minimized by the mean; k-means may not converge, and even if it converges it will not yield a very good result. The mean is a least-squares estimator on the coordinates. It minimizes variance, not arbitrary distances, and k-means is designed for minimizing variance, not arbitrary distances.

Assume you have two time series: two sine waves of the same frequency, observed over a rather long period, but offset by $\pi$. Since DTW does time warping, it can align them so they match perfectly, except at the beginning and end. DTW will assign a rather small distance to these two series. However, if you compute the mean of the two series, it will be a flat 0, because they cancel out. The mean does not do dynamic time warping, and loses all the value that DTW got. On such data, k-means may fail to converge, and the results will be meaningless. K-means really should only be used with variance (= squared Euclidean), or in some cases that are equivalent (like cosine on L2-normalized data, where squared Euclidean distance equals $2 - 2 \times$ cosine similarity).
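The sine example is easy to check numerically. The sketch below assumes four periods sampled at 401 points and a squared-difference local cost; these numbers and the names are arbitrary choices for illustration.

```python
import math

def dtw(a, b):
    """Unconstrained DTW with a squared-difference local cost."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Two sine waves with the same frequency, offset by pi.
t = [8 * math.pi * k / 400 for k in range(401)]
a = [math.sin(x) for x in t]
b = [math.sin(x + math.pi) for x in t]

# Pointwise mean, as k-means would compute it for a cluster holding a and b.
mean = [(x + y) / 2 for x, y in zip(a, b)]

# Distance without warping (lock-step) versus distance with warping.
lockstep = sum((x - y) ** 2 for x, y in zip(a, b))
warped = dtw(a, b)
```

The mean is zero up to rounding error, so it resembles neither series, while `warped` is only a small fraction of `lockstep` because DTW only pays for the unmatched half period at each end.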

Instead, compute a distance matrix using DTW, then run hierarchical clustering such as single-link. In contrast to k-means, the series may even have different lengths.
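A minimal sketch of that recipe, using only the standard library: a pairwise DTW matrix followed by a naive single-link agglomeration that stops at `k` clusters. The absolute-difference cost, the stopping rule, the function names and the toy series are all assumptions made for illustration; for real data a library routine such as `scipy.cluster.hierarchy.linkage` with `method='single'` would be the usual choice.

```python
import math

def dtw(a, b):
    """Unconstrained DTW with an absolute-difference local cost."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def dtw_matrix(series):
    """Symmetric matrix of pairwise DTW distances."""
    n = len(series)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = dtw(series[i], series[j])
    return dist

def single_link(dist, k):
    """Merge the two closest clusters until k remain.
    The distance between clusters is the smallest pairwise distance."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                link = min(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or link < best[0]:
                    best = (link, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Toy series of different lengths: three "bumps" and two "valleys".
series = [
    [0, 1, 2, 1, 0],
    [0, 0, 1, 2, 2, 1, 0],
    [0, 1, 2, 1, 0, 0],
    [2, 1, 0, 1, 2],
    [2, 2, 1, 0, 0, 1, 2],
]
clusters = single_link(dtw_matrix(series), 2)  # [[0, 1, 2], [3, 4]]
```

Nothing here ever averages two series, which is why the differing lengths cause no trouble.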
