Solved – Variance and autocorrelation with missing and/or unevenly spaced data in time series

autocorrelation, missing data, time series, unevenly-spaced-time-series

This question concerns the general problem of working with data that might have missing and/or unevenly spaced values. Let’s call this real data.

Specifically, I am calculating the rolling variance and autocorrelation of time series obtained from geological processes. The most direct approach I have come up with so far is:

  1. Re-calculate the size (in points) of the moving window as you move along the data, so that the window always has the same size measured in time,
  2. For every window that contains at least one NaN (missing data point), return a NaN for the entire window (see the sketch after this list).
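For illustration, here is a minimal pandas sketch of these two steps. It assumes the series is stored with a sorted DatetimeIndex; the window length and the function name rolling_var_time_window are placeholders of my own.

```python
import pandas as pd

def rolling_var_time_window(s: pd.Series, window: str = "1000s") -> pd.Series:
    """Rolling variance over a window of fixed *temporal* width.

    `s` needs a sorted DatetimeIndex, so the number of points per window
    adapts automatically to the sampling spacing (step 1). Any window
    containing at least one NaN is blanked out entirely (step 2).
    """
    # Unbiased rolling variance; pandas ignores NaNs inside the window.
    var = s.rolling(window).var(ddof=1)
    # Flag windows that contain at least one missing point ...
    has_gap = s.isna().astype(float).rolling(window).sum() > 0
    # ... and set those windows to NaN as a whole.
    return var.mask(has_gap)
```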

Questions:

  1. Does the varying window size (measured in points) adversely affect these statistics somehow?

  2. Say your data is missing every 20th point or so; linearly interpolating these points will bias your ACF towards higher values since the interpolated points trivially depend on their neighbors. Any ideas for filling in the gaps without causing this problem?
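As a quick illustration of the bias described in question 2 (synthetic white noise, with roughly every 20th point removed and then linearly interpolated; all names are my own):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.standard_normal(2000))   # white noise: true ACF(1) = 0

# Remove every 20th point and fill the gaps by linear interpolation.
gappy = x.copy()
gappy.iloc[10::20] = np.nan
filled = gappy.interpolate(method="linear")

# Lag-1 autocorrelation: near zero for the original series, but
# slightly and systematically positive after interpolation.
print(x.autocorr(lag=1), filled.autocorr(lag=1))
```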

Best Answer

Does the varying window size (measured in points) adversely affect these statistics somehow?

Strictly speaking, variance is a property of the distribution of your data points, and all you can do is estimate it using a variance estimator. The latter is normalised to the number of samples and thus independent of the window you apply it to, assuming that you use the unbiased estimator and do not try to fill any gaps.
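For reference, the unbiased estimator meant here is presumably the usual sample variance

$$\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i-\bar{x}\bigr)^2,$$

where $n$ is the number of non-missing points in the window and $\bar{x}$ is their mean; the $1/(n-1)$ normalisation is what makes the estimate comparable across windows containing different numbers of points.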

However, all of this is implicitly based on the assumption that each of your data points is an independent sample from the same distribution, which may not even be a good approximation for real data (in which case variance may not be a good measure anymore anyway). As a pathological example, suppose that your data points depend linearly on time. In this case, increasing the temporal width of the window increases the variance. The same holds whenever your data is temporally correlated.
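To make the pathological example concrete: for a pure linear trend $x(t) = a\,t$ sampled densely and evenly over a window of temporal width $T$, the sampling times are effectively uniform on an interval of length $T$, so

$$\operatorname{Var}\bigl[x(t)\bigr] \approx a^2\,\operatorname{Var}[t] = \frac{a^2 T^2}{12},$$

which grows quadratically with the window width.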

Taking another point of view: If assuming that your data points are independent samples from the same distribution is actually appropriate, the time at which a data point was sampled does not matter for estimating that distribution’s variance, and there is no difference between sampling equidistantly and at random points. However, this assumption often does not hold in real applications, and a variance estimator may serve other purposes than estimating a distribution’s variance.

This problem becomes less severe if your gaps are short and essentially random in position. Or, from another point of view: if you let the number of data points go to infinity and it is random which data points are missing, the gaps have no effect.

The estimator of the autocorrelation function is based on variance estimators and averages, which are both normalised to the sample size. Thus, if you calculate these means ignoring the missing points, there is no effect in the limit of infinitely many data points, provided the missing data points are random.
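Concretely, one common adaptation of the standard estimator at integer lag $k$ restricts every sum to the available terms and normalises each by its own count:

$$\hat\rho(k) = \frac{\frac{1}{N_k}\sum_{t\in P_k}\bigl(x_t-\bar{x}\bigr)\bigl(x_{t+k}-\bar{x}\bigr)}{\frac{1}{N_0}\sum_{t\in P_0}\bigl(x_t-\bar{x}\bigr)^2},$$

where $P_k$ is the set of times $t$ for which both $x_t$ and $x_{t+k}$ are present and $N_k$ is its size.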

However, if many data points are missing or there is no regular rhythm to your sampling times, you will hardly ever find a pair of points separated by exactly a given time lag, and thus you cannot estimate the autocorrelation directly anymore.

Any ideas for filling in the gaps without causing this problem?

Variance: Don’t fill the gaps. You may not have a problem in the first place, and if you do, filling the gaps won’t fix it. Do not increase the temporal size of your window either; if there is any correlation in your data, this may introduce a bias.

Autocorrelation: If you have a few randomly (unbiasedly) missing data points in otherwise evenly sampled data, the above applies. You can estimate the components of the autocorrelation estimator ignoring the missing points, as in the formula above and the sketch below.
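A minimal NumPy sketch of that pair-wise estimator (the function name acf_ignore_missing is my own; NaN marks the missing points in otherwise evenly sampled data):

```python
import numpy as np

def acf_ignore_missing(x: np.ndarray, max_lag: int) -> np.ndarray:
    """Autocorrelation for evenly sampled data with NaN gaps.

    Each lag uses only those pairs where both points are present,
    and every sum is normalised by its own number of terms.
    """
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)
    mean = x[valid].mean()
    var = x[valid].var()                  # denominator, gaps ignored
    dev = np.where(valid, x - mean, 0.0)

    acf = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        pairs = valid[: len(x) - k] & valid[k:]
        n_pairs = pairs.sum()
        if n_pairs == 0 or var == 0:
            acf[k] = np.nan
            continue
        acf[k] = (dev[: len(x) - k] * dev[k:])[pairs].sum() / n_pairs / var
    return acf
```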

If you have a lot of missing data points or there is no even sampling in the first place, I would try to first obtain an estimate of the frequency spectrum using Lomb–Scargle periodograms and then estimate the autocorrelation function from this using the Wiener–Khinchin theorem. I am no expert on these methods, so there might be problems with this. I suggest testing this approach with artificial data first or finding literature about it.
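In that spirit, here is a rough sketch of how such a pipeline could look with SciPy; the frequency grid, the function name acf_via_lomb_scargle and the normalisation are only illustrative choices, and the caveat about testing on artificial data applies in full:

```python
import numpy as np
from scipy.signal import lombscargle

def acf_via_lomb_scargle(t, y, lags, n_freq=2000):
    """Rough ACF estimate for unevenly sampled data.

    Estimates the power spectrum with a Lomb-Scargle periodogram and
    converts it to an autocorrelation via the Wiener-Khinchin theorem
    (a cosine transform of the spectrum), scaled so that lag 0 gives 1.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                      # Lomb-Scargle expects centred data

    # Angular-frequency grid: from ~1/span up to a pseudo-Nyquist
    # frequency based on the median sampling interval.
    span = t.max() - t.min()
    dt_med = np.median(np.diff(np.sort(t)))
    omega = np.linspace(2 * np.pi / span, np.pi / dt_med, n_freq)

    power = lombscargle(t, y, omega)

    # Wiener-Khinchin: ACF(tau) is proportional to the integral of
    # P(omega) * cos(omega * tau) over omega.
    acf = np.array([np.trapz(power * np.cos(omega * tau), omega)
                    for tau in lags])
    return acf / np.trapz(power, omega)   # lag 0 would map to exactly 1
```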

In neither case do I see a reason to discard entire windows just because they contain missing data points.
