Solved – How to correlate two time series with gaps and different time bases

correlationtime seriesunevenly-spaced-time-series

I asked this question over on StackOverflow, and was recommended to ask it here.


I have two time series of 3D accelerometer data that have different time bases (clocks started at different times, with some very slight creep during the sampling time), as well as containing many gaps of different size (due to delays associated with writing to separate flash devices).

The accelerometers I'm using are the inexpensive GCDC X250-2. I'm running the accelerometers at their highest gain, so the data has a significant noise floor.

The time series each have about 2 million data points (over an hour at 512 samples/sec), and contain about 500 events of interest, where a typical event spans 100-150 samples (200-300 ms each). Many of these events are affected by data outages during flash writes.

So, the data isn't pristine, and isn't even very pretty. But my eyeball inspection shows it clearly contains the information I'm interested in. (I can post plots, if needed.)

The accelerometers are in similar environments but are only moderately coupled, meaning that I can tell by eye which events match from each accelerometer, but I have been unsuccessful so far doing so in software. Due to physical limitations, the devices are also mounted in different orientations, where the axes don't match, but they are as close to orthogonal as I could make them. So, for example, for 3-axis accelerometers A & B, +Ax maps to -By (up-down), +Az maps to -Bx (left-right), and +Ay maps to -Bz (front-back).

My initial goal is to correlate shock events on the vertical axis, though I would eventually like to a) automatically discover the axis mapping, b) correlate activity on the mapped aces, and c) extract behavior differences between the two accelerometers (such as twisting or flexing).

The nature of the time series data makes Python's numpy.correlate() unusable. I've also looked at R's Zoo package, but have made no headway with it. I've looked to different fields of signal analysis for help, but I've made no progress.

Anyone have any clues for what I can do, or approaches I should research?

Update 28 Feb 2011: Added some plots here showing examples of the data.

Best Answer

The question concerns calculating the correlation between two irregularly sampled time series (one-dimensional stochastic processes) and using that to find the time offset where they are maximally correlated (their "phase difference").

This problem is not usually addressed in time series analysis, because time series data are presumed to be collected systematically (at regular intervals of time). It is rather the province of geostatistics, which concerns the multidimensional generalizations of time series. The archetypal geostatistical dataset consists of measurements of geological samples at irregularly spaced locations.

With irregular spacing, the distances among pairs of locations vary: no two distances may be the same. Geostatistics overcomes this with the empirical variogram. This computes a "typical" (often the mean or median) value of $(z(p) - z(q))^2 / 2$--the "semivariance"--where $z(p)$ denotes a measured value at point $p$ and the distance between $p$ and $q$ is constrained to lie within an interval called a "lag". If we assume the process $Z$ is stationary and has a covariance, then the expectation of the semivariance equals the maximum covariance (equal to $Var(Z(p))$ for any $p$) minus the covariance between $Z(p)$ and $Z(q)$. This binning into lags copes with the irregular spacing problem.

When an ordered pair of measurements $(z(p), w(p))$ is made at each point, one can similarly compute the empirical cross-variogram between the $z$'s and the $w$'s and thereby estimate the covariance at any lag. You want the one-dimensional version of the cross-variogram. The R packages gstat and sgeostat, among others, will estimate cross-variograms. Don't worry that your data are one-dimensional; if the software won't work with them directly, just introduce a constant second coordinate: that will make them appear two-dimensional.

With two million points you should be able to detect small deviations from stationarity. It's possible the phase difference between the two time series could vary over time, too. Cope with this by computing the cross-variogram separately for different windows spaced throughout the time period.

@cardinal has already brought up most of these points in comments. The main contribution of this reply is to point towards the use of spatial statistics packages to do your work for you and to use techniques of geostatistics to analyze these data. As far as computational efficiency goes, note that the full convolution (cross-variogram) is not needed: you only need its values near the phase difference. This makes the effort $O(nk)$, not $O(n^2)$, where $k$ is the number of lags to compute, so it might be feasible even with out-of-the-box software. If not, the direct convolution algorithm is easy to implement.