Deriving the posterior predictive mean for a Gaussian process with nonzero mean function

Tags: bayesian, normal-distribution

$\require{color}$
$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{{#1}}}}}$
$\newcommand{\x}{\textbf{x}}$
$\newcommand{\y}{\textbf{y}}$
$\newcommand{\w}{\textbf{w}}$
$\newcommand{\wpriorcov}{\Sigma_p}$
$\newcommand{\wpriorcovi}{\Sigma_p^{-1}}$
$\newcommand{\wpriormean}{\boldsymbol\mu}$
$\newcommand{\wd}{\overline{\textbf{w}}}$
$\newcommand{\c}{-\frac{1}{2}}$
$\newcommand{\s}{\sigma_{\varepsilon}^2}$
$\newcommand{\si}{\sigma_{\varepsilon}^{-2}}$
$\newcommand{\siD}{\si K - \si K \Phi_*\left(\s I + \Phi_*^T K \Phi_*\right)^{-1}\Phi_*^T K}$
$\newcommand{\xtest}{X_*}$
$\newcommand{\phixtest}{\phi(\textbf{x}_*)}$
$\newcommand{\phixtests}{\phi_*}$
$\newcommand{\ytest}{\y_*}$
$\newcommand{\brace}[1]{{\left({#1}\right)}}$
$\newcommand{\bracek}[1]{{\left[{#1}\right]}}$
$\newcommand{\bracec}[1]{{\left\{{#1}\right\}}}$
$\newcommand{\mutest}{\mu_{\ytest}}$
$\newcommand{\stest}{\sigma_{\ytest}^{2}}$
$\newcommand{\D}{\bracec{K + \si\phixtests\phixtests^T}^{-1}}$

I'm learning about Gaussian processes, and I have attempted to derive the mean and covariance of the corresponding posterior predictive distribution. My derivation uses Gaussian conditioning properties, the Woodbury matrix identity, and the following relation:

$${\si\left(\si\Phi\Phi^T+\wpriorcovi\right)^{-1}\Phi = \wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}},$$
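As a quick numerical sanity check of this relation, here is a minimal NumPy sketch with hypothetical random matrices (the dimensions and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 7                                  # feature dimension, number of training points
sigma2 = 0.3                                 # noise variance sigma_eps^2
Phi = rng.normal(size=(d, n))                # Phi in R^{d x n}, one column per input
B = rng.normal(size=(d, d))
Sigma_p = B @ B.T + d * np.eye(d)            # symmetric positive definite prior covariance
Sigma_p_inv = np.linalg.inv(Sigma_p)

# Left-hand side: sigma^{-2} (sigma^{-2} Phi Phi^T + Sigma_p^{-1})^{-1} Phi
lhs = np.linalg.inv(Phi @ Phi.T / sigma2 + Sigma_p_inv) @ Phi / sigma2
# Right-hand side: Sigma_p Phi (Phi^T Sigma_p Phi + sigma^2 I)^{-1}
rhs = Sigma_p @ Phi @ np.linalg.inv(Phi.T @ Sigma_p @ Phi + sigma2 * np.eye(n))

assert np.allclose(lhs, rhs)                 # the two sides agree numerically
```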

With these, I get the following mean and covariance for the posterior predictive distribution (using, e.g., Rasmussen's book as a reference):

$$E\left[\ytest|X_*,\y,X\right] = \Phi_*^T\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}\y + \s\Phi_*^T\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}\Phi^{-1}\wpriorcovi\wpriormean$$

$$\text{Cov}\left[\ytest|X_*,\y,X\right] =\s I + \Phi_*^T\wpriorcov\Phi_* -\Phi_*^T\wpriorcov\Phi\left[\Phi^T\wpriorcov\Phi + \s I\right]^{-1}\Phi^T\wpriorcov\Phi_*,$$

where:

$$\x\in \mathbb{R}^q, \phi(\x)\in \mathbb{R}^d, \w\in \mathbb{R}^d$$

$$f(\textbf{x})=\phi(\textbf{x})^T \textbf{w}$$

$$y = f(\textbf{x}) + \varepsilon$$

$$\textbf{w}\sim \mathcal{N}(\wpriormean, \Sigma_p)$$

$$\varepsilon \sim \mathcal{N}(0, \sigma_{\varepsilon}^2)$$

$$\phixtests = \phixtest$$

$$\phi(\x_i) = \begin{pmatrix}\phi_{1i} \\ \phi_{2i} \\ \vdots \\ \phi_{di}\end{pmatrix},\;\;\;\phi: \mathbb{R}^q \to \mathbb{R}^d $$

$$\Phi = \Phi(X) = \begin{pmatrix}\phi(\x_1), \phi(\x_2), \dots, \phi(\x_n)\end{pmatrix} = \begin{pmatrix}\phi_{11} & \phi_{12} & \cdots & \phi_{1n} \\ \phi_{21} & \phi_{22} & \cdots & \phi_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{d1} & \phi_{d2} & \cdots & \phi_{dn}\end{pmatrix} \in \mathbb{R}^{d\times n}$$
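To make the setup concrete, here is a minimal sketch of the generative model above; the scalar input and polynomial feature map are purely hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
q, d, n = 1, 3, 10                           # input dim, feature dim, sample size
sigma2 = 0.1                                 # noise variance sigma_eps^2

def phi(x):
    # hypothetical feature map R^q -> R^d: polynomial features of a scalar input
    return np.array([1.0, x[0], x[0] ** 2])

X = rng.uniform(-1.0, 1.0, size=(n, q))
Phi = np.stack([phi(x) for x in X], axis=1)  # d x n: phi(x_i) is the i-th column

mu = rng.normal(size=d)                      # (possibly nonzero) prior mean of w
Sigma_p = np.eye(d)                          # prior covariance of w
w = rng.multivariate_normal(mu, Sigma_p)     # w ~ N(mu, Sigma_p)
y = Phi.T @ w + rng.normal(0.0, np.sqrt(sigma2), size=n)   # y_i = phi(x_i)^T w + eps_i
```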

A $*$-symbol denotes that we are dealing with new (test) data. If I compare my result with a derivation such as the one in this source, https://see.stanford.edu/materials/aimlcs229/cs229-gp.pdf, or in Rasmussen's book, I get exactly the same answer in the case of a zero mean function, $\wpriormean=\textbf{0}$. The covariance is unaffected when $\wpriormean\neq\textbf{0}$, but the mean obviously is. From the literature, I have noticed that in the case $\wpriormean\neq\textbf{0}$ the posterior predictive mean is written as:

$$\textcolor{blue}{E\left[\ytest|X_*,\y,X\right] = \Phi_*^T\wpriormean + \Phi_*^T\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}\left(\y-\Phi^T\wpriormean\right)},\;\;\;(1)$$

which looks intuitive and clean compared to what I got:

$$E\left[\ytest|X_*,\y,X\right] = \Phi_*^T\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}\y + \s\Phi_*^T\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}\Phi^{-1}\wpriorcovi\wpriormean\;\;\;(2)$$

So either:

$$\textcolor{red}{\s\Phi_*^T\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}\Phi^{-1}\wpriorcovi = I - \Phi_*^T\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}}?$$

or I have made a mistake somewhere in the derivation. I am a bit stuck at the moment: my derivation agrees with the literature on the covariance, and on the posterior predictive mean when the prior mean function is zero, yet I cannot find my mistake. I have now done the derivation two or three times from the beginning, so perhaps I am missing something.

My questions thus are: 1) Did I make a mistake, and if I did, what did I miss? 2) If I did not make a mistake, how do I show that $(1)$ and $(2)$ are the same? In short, I am looking for a derivation of the posterior predictive mean when the prior mean is nonzero, with the derivation steps spelled out more or less exactly. Thank you for your help!

P.S.

In many sources, $\Phi^T\wpriorcov\Phi$ is denoted by the kernel covariance function, $K(X,X)=\Phi^T\wpriorcov\Phi$, et cetera.
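Using this kernel notation, one can also verify numerically that the covariance above coincides with the weight-space form $\s I + \Phi_*^T A^{-1}\Phi_*$, where $A = \si\Phi\Phi^T + \wpriorcovi$ is the posterior precision of the weights in standard Bayesian linear regression. Here is a hypothetical sketch with random matrices; the two forms agree via the Woodbury identity:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, n_star = 4, 7, 3
sigma2 = 0.3
Phi = rng.normal(size=(d, n))                # training features, d x n
Phi_s = rng.normal(size=(d, n_star))         # test features Phi_*, d x n_star
B = rng.normal(size=(d, d))
Sigma_p = B @ B.T + d * np.eye(d)            # SPD prior covariance

# Weight-space form: sigma^2 I + Phi_*^T A^{-1} Phi_*, with A the weight-posterior precision
A = Phi @ Phi.T / sigma2 + np.linalg.inv(Sigma_p)
cov_w = sigma2 * np.eye(n_star) + Phi_s.T @ np.linalg.solve(A, Phi_s)

# Function-space (kernel) form, as in the covariance formula above
K = Phi.T @ Sigma_p @ Phi                    # K(X, X)
K_s = Phi.T @ Sigma_p @ Phi_s                # K(X, X_*)
K_ss = Phi_s.T @ Sigma_p @ Phi_s             # K(X_*, X_*)
cov_f = sigma2 * np.eye(n_star) + K_ss - K_s.T @ np.linalg.solve(K + sigma2 * np.eye(n), K_s)

assert np.allclose(cov_w, cov_f)             # the two forms agree numerically
```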

UPDATE: I think my problem might be at the very beginning, where I start to derive the posterior predictive distribution after deriving the posterior over the model weights. I attached a low-resolution image in case it helps. I suspect the issue is that I am not forming the joint distribution of the old and new data and then applying Gaussian conditioning, as is done, e.g., here: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote15.html


Best Answer

I got this checked myself. Indeed, even though I derived this a couple of times, in both cases I made a minor error in the algebra, but it was enough :) After doing it again, I got the following result for the posterior predictive mean.

Given the posterior mean for the model weights, $\wd = \left(\si\Phi\Phi^T + \wpriorcovi\right)^{-1}\left(\si\Phi\y + \wpriorcovi\wpriormean\right)$, the posterior predictive mean is $\Phi_*^T\wd$:

$$\begin{aligned}
\Phi_*^T\wd &= \Phi_*^T\left(\si\Phi\Phi^T + \wpriorcovi\right)^{-1}\left(\si\Phi\y + \wpriorcovi\wpriormean\right)\\
&= \si\Phi_*^T\left(\si\Phi\Phi^T + \wpriorcovi\right)^{-1}\Phi\y + \Phi_*^T\left(\si\Phi\Phi^T + \wpriorcovi\right)^{-1}\wpriorcovi\wpriormean\\
&= \Phi_*^T\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}\y + \Phi_*^T\left(\si\Phi\Phi^T + \wpriorcovi\right)^{-1}\wpriorcovi\wpriormean\\
&= \Phi_*^T\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}\y + \Phi_*^T\left[\wpriorcov-\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi+\s I\right)^{-1}\Phi^T\wpriorcov\right]\wpriorcovi\wpriormean\\
&= \Phi_*^T\wpriormean + \Phi_*^T\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}\left(\y-\Phi^T\wpriormean\right),
\end{aligned}$$

where the third equality uses the matrix identity stated in the question and the fourth applies the Woodbury identity to $\left(\si\Phi\Phi^T + \wpriorcovi\right)^{-1}$. Therefore,

$$E\left[\ytest|X_*,\y,X\right] = \Phi_*^T\wpriormean + \Phi_*^T\wpriorcov\Phi\left(\Phi^T\wpriorcov\Phi + \s I\right)^{-1}\left(\y-\Phi^T\wpriormean\right),$$

which now agrees with the literature.
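A numerical sketch of the same equivalence, assuming the weight-posterior mean $\wd$ given above (hypothetical random data, included only as a sanity check):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, n_star = 4, 7, 3
sigma2 = 0.3
Phi = rng.normal(size=(d, n))                # training features, d x n
Phi_s = rng.normal(size=(d, n_star))         # test features Phi_*
y = rng.normal(size=n)
mu = rng.normal(size=d)                      # nonzero prior mean of the weights
B = rng.normal(size=(d, d))
Sigma_p = B @ B.T + d * np.eye(d)            # SPD prior covariance
Sigma_p_inv = np.linalg.inv(Sigma_p)

# Weight-space route: posterior mean of w, projected onto the test features
A = Phi @ Phi.T / sigma2 + Sigma_p_inv       # posterior precision of w
w_bar = np.linalg.solve(A, Phi @ y / sigma2 + Sigma_p_inv @ mu)
mean_weight_space = Phi_s.T @ w_bar

# Function-space route: formula (1) from the question
G = Sigma_p @ Phi @ np.linalg.inv(Phi.T @ Sigma_p @ Phi + sigma2 * np.eye(n))
mean_function_space = Phi_s.T @ mu + Phi_s.T @ G @ (y - Phi.T @ mu)

assert np.allclose(mean_weight_space, mean_function_space)   # the two routes agree
```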
