Solved – apply Kalman smoothing to irregularly spaced time series

kalman filter, python

My data is an irregularly spaced time series:

        date    adate
0   2012-03-30  0.0
1   2012-03-30  1.0
2   2012-03-31  19.0
3   2012-04-19  1.0
4   2012-04-20  1.0
... ... ...
240 2019-11-08  6.0
241 2019-11-14  0.0
242 2019-11-14  1.0
243 2019-11-24  13.0
244 2019-12-07  NaN

Since I want to perform some sort of time series analysis on the data (preferably ARIMA), I want to interpolate it so that the data points are evenly spaced. I have read that Kalman smoothing can be applied to a series sampled at irregular time points.
I have read a few papers and found a number of libraries that implement Kalman filters, such as pykalman, but I haven't understood how to apply one simply, the way you can apply a linear or cubic interpolation using scipy or pandas.

Best Answer

Setting aside the repeated measures for now, the easiest way to deal with an irregularly spaced time series with relatively regular "small" gaps is to view it as a regularly spaced time series with missing data. Here, since your smallest gap is 1 day, you can consider it as daily data but with some days missing:

date    adate
2012-03-30  1.0
2012-03-31  19.0
2012-04-01  NA
2012-04-02  NA
...         ...
2012-04-18  NA
2012-04-19  1.0
...         ...
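
As a rough illustration of this reshaping step, here is a minimal pandas sketch. The sample rows mirror the first few lines of the question's table; averaging same-day duplicates is only one possible choice, made here so that each day has a single value, and the variable names are illustrative.

```python
import pandas as pd

# Sample rows copied from the start of the question's table (illustration only).
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2012-03-30", "2012-03-30", "2012-03-31", "2012-04-19", "2012-04-20",
    ]),
    "adate": [0.0, 1.0, 19.0, 1.0, 1.0],
})

# Assumption: collapse repeated measures on the same day by averaging.
per_day = df.groupby("date")["adate"].mean()

# Put the series on a regular daily grid; days without data become NaN.
daily = per_day.asfreq("D")

print(daily.head())
```

The result is a daily series in which unobserved days are explicitly marked as missing rather than silently skipped.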

The situation is a little bit different if you have a very large variance in the size of the gaps, for example if you had millisecond-level time stamps but sometimes go a whole year without any observation; in that case it can be handled more efficiently in another way (e.g. by having time-varying matrices in the state space model used by the Kalman filter).

The Kalman filter allows you to fit an ARIMA model in the presence of missing values: it computes the likelihood of the observed data, which you can then maximize over the model parameters. You can then use the fitted model to forecast. If you need to, you can also use the Kalman filter or smoother to get the distribution of the missing values conditional on your data and the parameters (the filter conditions only on past data, the smoother also on future data).

But you do not need to impute these values first. Imputation is not a preliminary step to the analysis; it is the analysis, since by the time you can impute you have already picked an ARIMA model.

As for the repeated measures, if it makes sense for the domain you can sum or average those values on a given day. If it doesn't and you have no way to differentiate those records in a given day, you can set up a state space model where the state is, for example, given by:

$$X_t = \phi X_{t-1} + \eta_t$$

And the observation equation is:

$$Y_t^{(i)} = X_t + \varepsilon_t^{(i)}, \quad i = 1, \dots, n_t$$

This would be an ARIMA(1,0,0) model for the state $X_t$, observed through repeated noisy measures whose number $n_t$ varies from day to day (a day with no observations simply has $n_t = 0$). The Kalman filter can accommodate state space models with varying observation dimension.
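
To make the varying observation dimension concrete, here is a minimal NumPy sketch of a scalar Kalman filter for the two equations above. The function name, its arguments, and the example numbers are all illustrative assumptions. It further assumes that $\eta_t$ and $\varepsilon_t^{(i)}$ are independent Gaussian noises with variances `q` and `r`, and that $|\phi| < 1$ so the filter can start from the stationary distribution. Because the measurement errors are independent, the readings within a day can be processed one at a time.

```python
import numpy as np

def kalman_filter_repeated(obs_by_day, phi, q, r):
    """Filter X_t = phi * X_{t-1} + eta_t with n_t noisy readings of X_t per day.

    obs_by_day[t] is the (possibly empty) list of readings for day t.
    Returns filtered means, filtered variances, and the log-likelihood.
    """
    # Start from the stationary distribution of the AR(1) state.
    m, p = 0.0, q / (1.0 - phi**2)
    means, variances, loglik = [], [], 0.0
    for t, readings in enumerate(obs_by_day):
        if t > 0:
            # Prediction step: propagate the state one day ahead.
            m, p = phi * m, phi**2 * p + q
        # Update step: one scalar update per reading; skipped when n_t = 0.
        for y in readings:
            f = p + r          # variance of the prediction error
            v = y - m          # prediction error
            k = p / f          # Kalman gain
            loglik += -0.5 * (np.log(2.0 * np.pi * f) + v**2 / f)
            m = m + k * v
            p = (1.0 - k) * p
        means.append(m)
        variances.append(p)
    return np.array(means), np.array(variances), loglik

# Made-up example: one reading, a missing day, three readings, one reading.
obs = [[1.0], [], [2.0, 2.2, 1.8], [0.5]]
means, variances, loglik = kalman_filter_repeated(obs, phi=0.8, q=1.0, r=0.5)
print(means, variances, loglik)
```

Days with more readings end up with a smaller filtered variance, and a day with no readings just carries the prediction forward. Maximizing the returned log-likelihood over `phi`, `q`, and `r` would give parameter estimates.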