Gaussian Process Regression vs Kalman Filter (for time series)

gaussian-process · kalman-filter · time-series

I'm curious about the similarities and differences between the Kalman Filter (KF) and Gaussian Process Regression (GPR). From various sources, I've pieced together that the KF is analogous to a Hidden Markov Model (HMM), but where both X and Y are continuous. If this is accurate, I'd assume that the KF makes two assumptions:

  1. the Markov assumption that $P(x_{i} \mid x_{1:i-1}) = P(x_{i} \mid x_{i-1})$ and
  2. that each observed emission is influenced only by the current latent state, $P(y_{i} \mid x_{i})$.

GPR, on the other hand, uses a covariance function to determine the relationship between any two points. Various kernels exist, such as the radial basis kernel, where the closer two points are, the stronger the relationship. (Other kernels exist as well, for example ones more sensitive to seasonal patterns.)
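As a minimal illustration of this idea (a sketch in Python; the parameter names `sigma2` and `ell` for the kernel variance and lengthscale are my own choice), the radial basis covariance between two time points decays smoothly as the points move apart:

```python
import numpy as np

def rbf_kernel(t1, t2, sigma2=1.0, ell=1.0):
    """Radial basis (squared-exponential) covariance between two time points.

    Nearby points get a covariance close to sigma2; distant points get a
    covariance close to zero, encoding "closer points are more related".
    """
    return sigma2 * np.exp(-0.5 * (t1 - t2) ** 2 / ell ** 2)

print(rbf_kernel(0.0, 0.1))  # ~0.995: strongly related
print(rbf_kernel(0.0, 3.0))  # ~0.011: nearly unrelated
```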

My question is: which of these methods is better for time-series data?

Say it's time to forecast sales, for example, and there's no more data available. In either case, I imagine that the further into the future you project, the wider the confidence/credible intervals become. But my intuition says that since GPR does not make the Markov assumption, the variance around its predictions will stay tighter for longer over an arbitrarily long forecast window than it would with the KF.

Thoughts?

Best Answer

TL;DR: Kalman filters and smoothers can be viewed as solvers for a family of Gaussian process regression models, in particular, Markov Gaussian processes.

Say, for example, we have a GP regression model

$$ \begin{split} U(t) &\sim \mathrm{GP}(0, C(t, t')),\\ Y_k &= U(t_k) + \xi_k, \end{split}\tag{1} $$

where $\xi_k$ denotes (typically Gaussian) measurement noise at time $t_k$.

Now, what is the goal of GP regression? The goal is to learn the posterior distribution $p(u_{1:T} \mid y_{1:T})$ jointly at any times $t_1,\ldots,t_T$ given data $y_{1:T}$. However, this is known to be expensive, as you need to invert a $T \times T$ matrix, which costs $O(T^3)$ in general. Also, in practice, we are mostly interested in the marginal posteriors $p(u_k \mid y_{1:T})$ for $k=1,2,\ldots,T$ instead of the joint one. So, is there an efficient solver for $\{p(u_k \mid y_{1:T})\}_{k=1}^T$?
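To make the cost concrete, here is a minimal sketch of the standard GP regression equations evaluated at the data points (my own illustration, not code from [1]); the Cholesky factorisation of the $T \times T$ matrix is the cubic-cost step:

```python
import numpy as np

def gp_marginal_posterior(ts, ys, kernel, noise_var):
    """Standard GP regression at the data points t_1, ..., t_T.

    Returns the marginal posterior means and variances of u_k given y_{1:T}.
    The Cholesky factorisation below is the O(T^3) bottleneck.
    """
    K = kernel(ts[:, None], ts[None, :])                  # T x T covariance matrix
    L = np.linalg.cholesky(K + noise_var * np.eye(len(ts)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ys))  # (K + noise I)^{-1} y
    mean = K @ alpha
    V = np.linalg.solve(L, K)
    var = np.diag(K) - np.sum(V ** 2, axis=0)             # diag(K - K (K + noise I)^{-1} K)
    return mean, var
```

(It can be called with, for instance, the `rbf_kernel` sketch above, since that kernel broadcasts over arrays.)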

Yes, suppose that the covariance function $C$ is chosen properly, in the sense that $U$ is a Markov (Gaussian) process and is governed by a stochastic differential equation (SDE)

$$ \begin{split} \mathrm{d} \overline{U}(t) &= A \, \overline{U}(t) \,\mathrm{d}t + B \, \mathrm{d}W(t), \\ U(t_0) &= U_0 \sim N(0, P_0) \end{split}\tag{2} $$

and that $U = H \, \overline{U}$ for some matrix $H$. Then estimating $p(\overline{u}_k \mid y_{1:T})$ from the data is called a smoothing problem in stochastic calculus. This can be solved by using Kalman filters and smoothers in linear computational time. You can also discretise the SDE above into

$$ \overline{U}_k = F \, \overline{U}_{k-1} + q_{k-1},\tag{3} $$ for some matrix $F$ and Gaussian random variable $q_{k-1}$, if this kind of state-space form looks more familiar to you.
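As a concrete scalar example (not spelled out in the original answer): the Ornstein-Uhlenbeck process $\mathrm{d}U(t) = -\lambda\,U(t)\,\mathrm{d}t + b\,\mathrm{d}W(t)$ with $\lambda > 0$, started from its stationary distribution $N(0, b^2/(2\lambda))$, is a Markov GP whose covariance function is the exponential kernel $C(t, t') = \frac{b^2}{2\lambda}\,e^{-\lambda|t - t'|}$, and discretising over a step $\Delta t_k = t_k - t_{k-1}$ gives (3) with

$$ F = e^{-\lambda\,\Delta t_k}, \qquad q_{k-1} \sim N\!\left(0,\ \frac{b^2}{2\lambda}\left(1 - e^{-2\lambda\,\Delta t_k}\right)\right). $$

(Here $H = 1$ and $\overline{U} = U$.)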

By running the Kalman filter and smoother on (2) or (3), you will get exactly the same results as solving the GP model (1) with the standard GP regression equations (you could check Figure 4.1 in [1]).
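Here is a small numerical sketch of that equivalence for the scalar Ornstein-Uhlenbeck / exponential-kernel case above (my own illustration, not code from [1]; the data and the values of `lam`, `b`, and `noise_var` are made up):

```python
import numpy as np

lam, b, noise_var = 1.0, np.sqrt(2.0), 0.1        # stationary variance b^2/(2*lam) = 1
rng = np.random.default_rng(0)
ts = np.sort(rng.uniform(0.0, 5.0, size=30))      # irregular observation times
ys = np.sin(ts) + np.sqrt(noise_var) * rng.standard_normal(len(ts))

# Route 1: standard GP regression with the exponential (OU) kernel.
K = (b**2 / (2*lam)) * np.exp(-lam * np.abs(ts[:, None] - ts[None, :]))
Kn = K + noise_var * np.eye(len(ts))
gp_mean = K @ np.linalg.solve(Kn, ys)
gp_var = np.diag(K - K @ np.linalg.solve(Kn, K))

# Route 2: Kalman filter + Rauch-Tung-Striebel smoother on the discretised SDE (3).
P0 = b**2 / (2*lam)                               # stationary prior variance
dts = np.diff(ts, prepend=ts[0])                  # first "step" has length 0
m_f, P_f = np.zeros(len(ts)), np.zeros(len(ts))   # filtered means and variances
m, P = 0.0, P0
for k, (dt, y) in enumerate(zip(dts, ys)):
    F = np.exp(-lam * dt)                         # transition (scalar here)
    Q = P0 * (1.0 - np.exp(-2*lam*dt))            # process-noise variance
    m, P = F*m, F*P*F + Q                         # predict
    S = P + noise_var                             # innovation variance
    G = P / S                                     # Kalman gain
    m, P = m + G*(y - m), (1.0 - G)*P             # update
    m_f[k], P_f[k] = m, P

m_s, P_s = m_f.copy(), P_f.copy()                 # smoothing pass (backwards)
for k in range(len(ts) - 2, -1, -1):
    F = np.exp(-lam * dts[k+1])
    Q = P0 * (1.0 - np.exp(-2*lam*dts[k+1]))
    P_pred = F*P_f[k]*F + Q
    G = P_f[k]*F / P_pred                         # smoother gain
    m_s[k] = m_f[k] + G*(m_s[k+1] - F*m_f[k])
    P_s[k] = P_f[k] + G*(P_s[k+1] - P_pred)*G

print(np.allclose(gp_mean, m_s), np.allclose(gp_var, P_s))   # both should be True
```

The filtering/smoothing route touches each data point once, which is where the linear-in-$T$ cost comes from.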

So, eventually, what are the key differences between the conventional GPs and state-space GPs?

  1. In conventional GPs, you specify their mean and covariance functions. In state-space GPs, you specify the SDE coefficients instead, and the mean and covariance functions are implicitly defined by these SDE coefficients.

  2. In conventional GPs, you usually solve the regression problem jointly at the data points. In state-space GPs, you solve the regression problem marginally at the data points, in linear computational time. This benefit comes from the Markov property of the state-space models.

Also, please note that not all GPs are Markov (Gaussian) processes; thus, not all GPs can be represented by SDEs! For instance, the radial basis (squared-exponential) kernel mentioned above does not have a rational spectral density, so it only admits approximate finite-dimensional state-space representations.

For more detailed expositions, you are welcome to check my dissertation. Feel free to ask me more questions on this topic if you have any.


Some heuristic explanations of (2), as per the request from @jbuddy_13, for those who are not familiar with SDEs:

Solutions of SDEs are stochastic processes, in particular, continuous-time Markov processes (they are also semimartingales, but don't think Markov <=> semimartingale). You can think of SDEs as a means to construct stochastic processes; Itô's theory was really devoted to constructing diffusion processes. Now, since GPs are genuinely stochastic processes, it makes perfect sense to construct GPs via SDEs. It turns out that the solutions to linear SDEs like (2) are indeed (Markov) GPs, and the mean and covariance functions of the GP $\overline{U}$ are governed by certain ODEs (see Eq. 4.9 in [1]).
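Concretely (this is the generic linear-SDE result, so the notation may differ slightly from Eq. 4.9 in [1]), the mean $m(t) = \mathrm{E}[\overline{U}(t)]$ and covariance $P(t) = \mathrm{Cov}[\overline{U}(t)]$ of the solution of (2) satisfy

$$ \frac{\mathrm{d}m(t)}{\mathrm{d}t} = A\,m(t), \qquad \frac{\mathrm{d}P(t)}{\mathrm{d}t} = A\,P(t) + P(t)\,A^\top + B\,B^\top, $$

with initial conditions $m(t_0) = 0$ and $P(t_0) = P_0$.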

The SDE in (2) has three main components:

  1. $A \, \overline{U}$ is called the drift of the SDE (I will call $A$ the drift matrix), and $B$ is called the dispersion of the SDE. The drift term models the infinitesimal change of $\overline{U}$ over time, while the dispersion term adds stochastic volatility to that change.

  2. $t\mapsto W(t)$ is a Wiener process (also interchangeably called Brownian motion). This is one source of the randomness in (2).

  3. $U_0$ is the initial condition, a Gaussian random variable. This is the other source of randomness in (2).

More intuitively, if we "divide both sides of (2) by $\mathrm{d}t$", we get:

$$ \frac{\mathrm{d} \overline{U}(t)}{\mathrm{d}t} = A \, \overline{U}(t) + w(t), $$

where $w(t)$ informally stands for $\mathrm{d}W(t) / \mathrm{d}t$. You can see that the SDE is essentially an ODE driven by a white noise process.
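If you want to see this reading in action, here is a tiny simulation sketch (illustration only; the scalar values of `A`, `B`, `P0`, and the step size are made up) using the Euler-Maruyama discretisation of (2), where each step adds the drift $A\,\overline{U}\,\mathrm{d}t$ plus a noise increment $B\,\mathrm{d}W$ with $\mathrm{d}W \sim N(0, \mathrm{d}t)$:

```python
import numpy as np

rng = np.random.default_rng(42)
A, B, P0 = -1.0, 1.0, 1.0                      # made-up scalar drift, dispersion, initial variance
dt, n_steps = 0.01, 1000

u = np.zeros(n_steps + 1)
u[0] = rng.normal(0.0, np.sqrt(P0))            # U_0 ~ N(0, P_0)
for k in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt))          # Wiener increment over one step
    u[k + 1] = u[k] + A * u[k] * dt + B * dW   # drift step + "white noise" dispersion step
# u now holds one sample path of the (Markov, Gaussian) process U.
```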