Equivalence of Gaussian process and Bayesian linear regression by inspecting the covariance matrix

Tags: covariance-matrix, gaussian-process, linear-algebra

I'm aware that a Gaussian process is equivalent to Bayesian linear regression for the kernel $K(x_i,x_j) = x_i x_j$ (assume scalar $x$ here). However, the proof itself didn't give me much intuition.

If I imagine sampling a function from the linear GP as sampling one point from an uncountably-infinite-dimensional Gaussian RV with covariance matrix $K$ (I'm aware this isn't mathematically rigorous, but bear with me), it is very unintuitive to me why all the points should lie on a line, that is, why the function should be linear.

All I know about this "covariance matrix" $K = xx^T$ (where $x$ is a vector containing all the real numbers) is that it has rank 1 and is symmetric. I should be able to diagonalize it as $K = Q^T\Lambda Q$ with the eigenvalues on the diagonal of $\Lambda$. Since the rank is 1, it has exactly one non-zero eigenvalue, which I can move to the top-left entry simply by permuting rows/columns of $Q$ and $\Lambda$. So now I can imagine sampling with covariance $\Lambda$, and then applying the rotation $Q^T$.

If I fix $f(0)$ as the "first" dimension of our uncountable Gaussian, this means I can sample $f(0)$ from some univariate Gaussian, and then $f(x) = 0$ for all other $x$, since all the other entries of $\Lambda$ are 0 and the mean is 0.
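To make my mental picture concrete, here is a minimal numpy sketch of the finite-dimensional analogue (the grid of inputs standing in for "all the real numbers" is an arbitrary choice of mine):

```python
import numpy as np

# Finite-dimensional analogue of the construction above; the grid of
# inputs standing in for "all the real numbers" is an arbitrary choice.
xs = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
K = np.outer(xs, xs)                    # "covariance matrix" K = x x^T

# Diagonalize the symmetric K; eigh returns K = V diag(lam) V^T, i.e. V = Q^T
lam, V = np.linalg.eigh(K)
print(np.round(lam, 6))                 # rank 1: exactly one non-zero eigenvalue

# Sampling with covariance Lambda: only the coordinate belonging to the
# non-zero eigenvalue is random; all other coordinates are identically 0.
rng = np.random.default_rng(0)
y = np.sqrt(np.clip(lam, 0.0, None)) * rng.standard_normal(len(xs))
print(np.round(y, 3))                   # a single non-zero entry

f = V @ y                               # the rotation Q^T still has to be applied
```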

This definitely doesn't look like a linear function to me; it looks like a constant function with a discontinuity at 0. Furthermore, I'm not sure how the rotation $Q^T$ affects the function (surely it doesn't correspond to rotating a plot of the function in the 2D plane).

I think I've gone wrong with the math somewhere, so the question is: is there a way to show that a rank-1 GP kernel corresponds to linear functions? And does a rank-2 kernel correspond to quadratic functions?

Best Answer

Judging from your question, you are considering the Bayesian linear regression model $$f(\boldsymbol{x}) = \boldsymbol{w}^{T}\boldsymbol{x}$$ with a standard Gaussian prior $\boldsymbol{w} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_{d})$. As you correctly remark, this is equivalent to the function-space view where we specify a prior directly on $f$ as $$f \sim \mathcal{GP}(0, K)$$ where we have the linear kernel $K(\boldsymbol{x}, \boldsymbol{x}')=\boldsymbol{x}^{T}\boldsymbol{x}'$.
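As a quick sanity check of this equivalence, here is a minimal numpy sketch (the inputs, dimensions, and number of draws are arbitrary choices of mine): the empirical covariance of the function values $f(\boldsymbol{x}) = \boldsymbol{w}^{T}\boldsymbol{x}$ under the weight prior matches the linear kernel matrix.

```python
import numpy as np

# Monte Carlo sanity check: under w ~ N(0, I_d), the covariance of
# f(x) = w^T x across draws of w is Cov(f(x), f(x')) = x^T x'.
# Inputs, dimensions, and number of draws are arbitrary choices.
rng = np.random.default_rng(0)
d, m, n_draws = 2, 5, 200_000
X_star = rng.normal(size=(m, d))        # m test inputs in R^d

W = rng.standard_normal((n_draws, d))   # draws from the weight prior
F = W @ X_star.T                        # row i holds f(X_star) under draw w_i

emp_cov = np.cov(F, rowvar=False)       # empirical covariance across draws
print(np.round(emp_cov - X_star @ X_star.T, 2))   # ~0 up to Monte Carlo error
```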

Let's think about how we produce a random draw of the function evaluated on some fixed set of points that we collect into a matrix $\boldsymbol{X}_{*} \in \mathbb{R}^{m \times d}$.

  1. $\textit{Bayesian Linear Regression}$: Here it is pretty straightforward: just sample one realization $$\boldsymbol{w} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_{d})$$ and compute the corresponding function values $$\boldsymbol{f}_{*} = \boldsymbol{X}_{*}\boldsymbol{w} \in \mathbb{R}^{m}$$ Here it is as expected: all points lie in the same plane, determined by $\boldsymbol{w}$.

  2. $\textit{Function Space View}$: This is slightly more involved. As you correctly say, to obtain a random draw of prior function values $\boldsymbol{f}_{*}$ at $\boldsymbol{X}_{*}$, we need to sample $$\boldsymbol{f}_{*} \sim \mathcal{N}(\boldsymbol{0}, K(\boldsymbol{X}_{*}, \boldsymbol{X}_{*})) \stackrel{(d)}{=}\mathcal{N}(\boldsymbol{0}, \boldsymbol{X}_{*}\boldsymbol{X}_{*}^{T})$$ The question is now: $\textit{Why should those samples lie in a plane?}$ As you remarked, the covariance $\boldsymbol{C}=\boldsymbol{X}_{*}\boldsymbol{X}_{*}^{T}$ is degenerate, as its rank is $k=\min(d, m)$ (assuming $\boldsymbol{X}_{*}$ has full rank). Let us assume that $m>d$; otherwise any $m$ points trivially lie in a plane of dimension $m-1 \leq d-1$. So we have $k=d$. As a consequence of the low-rank structure, we do not get the "full" $m$-dimensional randomness but only the push-forward of a lower-dimensional random variable.

     Let's make this low-dimensional randomness explicit before giving the final answer. Since $\boldsymbol{C}$ is positive semi-definite, we can diagonalize it as $$\boldsymbol{C} = \boldsymbol{V}\boldsymbol{\Lambda}\boldsymbol{V}^{T}$$ where $\boldsymbol{V} \in \mathbb{R}^{m \times m}$ is orthogonal and $\boldsymbol{\Lambda} \in \mathbb{R}^{m \times m}$ is diagonal with only the first $d$ diagonal entries non-zero. Take a random variable $\boldsymbol{\tilde{z}} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_{d})$ and pad it with $m-d$ zeros to get $\boldsymbol{z} \in \mathbb{R}^{m}$. We can then rewrite $$\boldsymbol{f}_{*}\stackrel{(d)}{=} \boldsymbol{V}\boldsymbol{\Lambda}^{\frac{1}{2}}\boldsymbol{z}$$ due to the formula $\text{cov}(\boldsymbol{A}\boldsymbol{z}) = \boldsymbol{A}\,\text{cov}(\boldsymbol{z})\,\boldsymbol{A}^{T}$ and the invariance of the Gaussian family under linear transformations. We thus see that the randomness stems only from the $d$-dimensional random variable $\boldsymbol{\tilde{z}}$.

     Now to the final answer: by the same two facts, $\text{cov}(\boldsymbol{X}_{*}\boldsymbol{\tilde{z}}) = \boldsymbol{X}_{*}\boldsymbol{X}_{*}^{T} = \boldsymbol{C}$, so $$\boldsymbol{f}_{*} \stackrel{(d)}{=}\boldsymbol{X}_{*}\boldsymbol{\tilde{z}}$$ We hence see that $\boldsymbol{f}_{*}$ lives in the $d$-dimensional column space of $\boldsymbol{X}_{*}$, i.e. in a $d$-dimensional plane. Both sampling routes are compared numerically in the sketch below.
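To make the two routes concrete, here is a minimal numpy sketch (dimensions and inputs are again arbitrary choices) that draws a function from each view and verifies that both draws lie in the $d$-dimensional column space of $\boldsymbol{X}_{*}$:

```python
import numpy as np

# Draw one function from each view and check that both draws lie in the
# column space of X_*; dimensions and inputs are arbitrary choices.
rng = np.random.default_rng(1)
d, m = 2, 6
X_star = rng.normal(size=(m, d))        # fixed test inputs, full column rank a.s.

# Route 1, weight space: w ~ N(0, I_d), then f_* = X_* w.
w = rng.standard_normal(d)
f_weight = X_star @ w

# Route 2, function space: f_* ~ N(0, C) with C = X_* X_*^T, sampled via
# the eigendecomposition C = V diag(lam) V^T.
C = X_star @ X_star.T
lam, V = np.linalg.eigh(C)
lam = np.clip(lam, 0.0, None)
lam[lam < 1e-10] = 0.0                  # zero out the m - d null eigenvalues
f_func = V @ (np.sqrt(lam) * rng.standard_normal(m))

# The residual after projecting onto col(X_*) vanishes for both draws.
Q, _ = np.linalg.qr(X_star)             # orthonormal basis of col(X_*)
for f in (f_weight, f_func):
    print(np.round(np.linalg.norm(f - Q @ (Q.T @ f)), 8))   # 0.0 for both
```

In the scalar rank-1 case ($d=1$), this column space is just the span of the single input vector, which is exactly why every draw is a line through the origin.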