Solved – Using Kalman filters to impute Missing Values in Time Series

data-imputation, kalman-filter

I am interested in how Kalman Filters can be used to impute missing values in Time Series Data. Is it also applicable if some consecutive time points are missing? I cannot find much on this topic. Any explanations, comments and links are welcome and appreciated!

Best Answer

Preliminaries: Kalman filtering:

Kalman filters operate on state-space models of the form (there are several ways to write it; this is an easy one based on Durbin and Koopman (2012); all of the following is based on that book, which is excellent):

$$ \begin{align} y_t & = Z \alpha_t + \varepsilon_t \qquad & \varepsilon_t \sim N(0, H) \\ \alpha_{t+1} & = T \alpha_t + \eta_t & \eta_t \sim N(0, Q) \\ \alpha_1 & \sim N(a_1, P_1) \end{align} $$

where $y_t$ is the observed series (possibly with missing values) but $\alpha_t$ is fully unobserved. The first equation (the "measurement" equation) says that the observed data is related to the unobserved states in a particular way. The second equation (the "transition" equation) says that the unobserved states evolve over time in a particular way.
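To make the two equations concrete, here is a minimal sketch that simulates the simplest special case, a local level model with scalar $Z = T = 1$, and then blanks out a run of consecutive observations. The function name `simulate_local_level`, the parameter values, and the position of the gap are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def simulate_local_level(n, sigma_eps=1.0, sigma_eta=0.5, a1=0.0, p1=1.0, seed=0):
    """Simulate y_t = alpha_t + eps_t, alpha_{t+1} = alpha_t + eta_t (Z = T = 1)."""
    rng = np.random.default_rng(seed)
    alpha = np.empty(n)
    alpha[0] = rng.normal(a1, np.sqrt(p1))            # alpha_1 ~ N(a1, P1)
    for t in range(n - 1):
        # Transition equation: the unobserved state evolves as a random walk.
        alpha[t + 1] = alpha[t] + rng.normal(0.0, sigma_eta)
    # Measurement equation: the observation is the state plus noise.
    y = alpha + rng.normal(0.0, sigma_eps, size=n)
    return alpha, y

alpha, y = simulate_local_level(100)
y_obs = y.copy()
y_obs[40:50] = np.nan   # hypothetical gap: ten consecutive missing observations
```

Here `alpha` is never seen by the analyst; only `y_obs`, with its gap, would be available in practice.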

The Kalman filter computes optimal estimates of $\alpha_t$. Under the model, $\alpha_t$ is Normal conditional on the data, $\alpha_t \sim N(a_t, P_t)$, so what the filter actually computes is the conditional mean $a_t$ and variance $P_t$ of $\alpha_t$ given the observations up to time $t-1$ (the one-step-ahead form, which matches the recursions below).

In the typical case (when observations are available) the Kalman filter uses the estimate of the current state and the current observation $y_t$ to do the best it can to estimate the next state $\alpha_{t+1}$, as follows:

$$ \begin{align} a_{t+1} & = T a_t + K_t (y_t - Z a_t) \\ P_{t+1} & = T P_t (T - K_t Z)' + Q \end{align} $$

where $K_t$ is the "Kalman gain".
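For completeness, in the notation of Durbin and Koopman (2012) with time-invariant system matrices, the gain is built from the one-step prediction error $v_t$ and its variance $F_t$. This is the standard definition rather than something spelled out in the answer itself:

$$ \begin{align} v_t & = y_t - Z a_t \\ F_t & = Z P_t Z' + H \\ K_t & = T P_t Z' F_t^{-1} \end{align} $$

A large $F_t$ (noisy measurements or an uncertain state prediction passed through $Z$) relative to $P_t$ shrinks $K_t$, so the observation moves the state estimate less.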

When there is not an observation, the Kalman filter still wants to compute $a_{t+1}$ and $P_{t+1}$ in the best possible way. Since $y_t$ is unavailable, it cannot make use of the measurement equation, but it can still use the transition equation. Thus, when $y_t$ is missing, the Kalman filter instead computes:

$$ \begin{align} a_{t+1} & = T a_t \\ P_{t+1} & = T P_t T' + Q \end{align} $$

Essentially, it says that given $\alpha_t$, my best guess as to $\alpha_{t+1}$ without data is just the evolution specified in the transition equation. This can be performed for any number of time periods with missing data.

If $y_t$ is observed, the first set of filtering equations takes this no-data best guess and adds a correction, $K_t (y_t - Z a_t)$, proportional to how far the observation falls from its prediction $Z a_t$.
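The two branches of the recursion can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `kalman_filter` is made up, the system matrices are assumed time-invariant, the gain is taken as $K_t = T P_t Z' (Z P_t Z' + H)^{-1}$ as in Durbin and Koopman (2012), and any row of `y` containing a NaN is treated as entirely missing (a simplification for partially observed vectors).

```python
import numpy as np

def kalman_filter(y, Z, T, H, Q, a1, P1):
    """One-step-ahead Kalman filter that skips the update when y[t] is missing.

    y: (n, p), Z: (p, m), T: (m, m), H: (p, p), Q: (m, m), a1: (m,), P1: (m, m).
    Returns a with shape (n + 1, m) and P with shape (n + 1, m, m); a[t], P[t]
    are the predicted mean and variance of the state at 0-based time t.
    """
    n, m = y.shape[0], a1.shape[0]
    a = np.zeros((n + 1, m))
    P = np.zeros((n + 1, m, m))
    a[0], P[0] = a1, P1
    for t in range(n):
        if np.any(np.isnan(y[t])):
            # Missing observation: propagate through the transition equation only.
            a[t + 1] = T @ a[t]
            P[t + 1] = T @ P[t] @ T.T + Q
        else:
            v = y[t] - Z @ a[t]                      # prediction error
            F = Z @ P[t] @ Z.T + H                   # prediction error variance
            K = T @ P[t] @ Z.T @ np.linalg.inv(F)    # Kalman gain
            a[t + 1] = T @ a[t] + K @ v
            P[t + 1] = T @ P[t] @ (T - K @ Z).T + Q
    return a, P

# Local level example (Z = T = 1) with two consecutive missing points.
I = np.eye(1)
y = np.array([[1.0], [np.nan], [np.nan], [2.0]])
a, P = kalman_filter(y, Z=I, T=I, H=I, Q=I, a1=np.zeros(1), P1=I)
```

In this example the state mean stays flat across the gap while its variance grows by $Q$ at every missing step, which is exactly the behaviour described above.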

Imputing data:

Once the Kalman filter has been applied to the entire time range, you have optimal estimates of the states $a_t, P_t$ for $t = 1, 2, \dots, n$ (writing $n$ for the number of time points, to avoid a clash with the transition matrix $T$). Imputing data is then simple via the measurement equation. In particular, you just calculate:

$$\hat y_t = Z a_t $$
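As a rough end-to-end sketch, the imputation step can be written for the scalar local level model ($Z = T = 1$), where $\hat y_t = Z a_t$ reduces to $\hat y_t = a_t$. The function name `impute_local_level` and the variance values in the example are assumptions chosen for illustration; in practice the variances would be estimated, for instance by maximum likelihood.

```python
import numpy as np

def impute_local_level(y, sigma2_eps, sigma2_eta, a1=0.0, p1=1.0):
    """Fill NaNs in y with y_hat_t = Z a_t for a local level model (Z = T = 1)."""
    y = np.asarray(y, dtype=float)
    a, p = a1, p1                       # predicted state mean and variance
    y_hat = np.empty_like(y)
    for t, obs in enumerate(y):
        y_hat[t] = a                    # y_hat_t = Z a_t with Z = 1
        if np.isnan(obs):
            # Missing: transition equation only, uncertainty grows.
            p = p + sigma2_eta
        else:
            k = p / (p + sigma2_eps)    # Kalman gain for T = Z = 1
            a = a + k * (obs - a)
            p = p * (1.0 - k) + sigma2_eta
    # Keep observed values, use Z a_t only where data are missing.
    return np.where(np.isnan(y), y_hat, y)

series = [1.0, np.nan, np.nan, 2.0]
filled = impute_local_level(series, sigma2_eps=1.0, sigma2_eta=1.0)
```

With these settings the two consecutive gaps are both filled with 0.5, the state estimate carried forward from the last observed point, so `filled` is `[1.0, 0.5, 0.5, 2.0]`.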

As for a reference, Durbin and Koopman (2012) is excellent; section 4.10 discusses missing observations.

Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (No. 38). Oxford University Press.
