Solved – Gaussian process regression: leave-one-out prediction

cross-validation, gaussian process, kriging, prediction, regression

According to Dubrule's Cross validation of kriging in a unique neighborhood, it is possible to compute the leave-one-out Gaussian process prediction $\hat{Y}_{-i}(x_i)$ at a point $x_i$ of the design of experiments $\mathcal{X}$, i.e. the kriging prediction obtained by removing $x_i$ from $\mathcal{X}$. The formula for the prediction mean is
$$
\hat{Y}_{-i}(x_i) = y_i - [K^{-1}y]_i/[K^{-1}]_{ii}
$$
where $y$ is the vector of measurements and $K$ the covariance matrix $R(\mathcal{X},\mathcal{X})$, $R$ being the covariance function. I would like to compute this quantity in the case of noisy measurements, i.e. $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim_{\text{i.i.d.}} \mathcal{N}(0,\sigma_\epsilon^2)$ independent of $x$.
Should there be any significant modification, or does it suffice to replace $K$ by $K + \sigma_\epsilon^2 I_n$?

I ask this because all Gaussian process toolboxes seem to ignore the case of noisy measurements. Perhaps this is because the above generalization no longer holds?
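For reference, a small numerical check of the noise-free formula (a minimal sketch with numpy; the squared-exponential covariance and the test function are arbitrary choices, not part of Dubrule's paper):

```python
import numpy as np

# Minimal sketch: check Dubrule's noise-free LOO formula against brute force.
# The squared-exponential covariance and the test data are assumptions.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(8, 1))        # design of experiments
y = np.sin(6.0 * X[:, 0])                     # noise-free measurements

def cov(A, B, ell=0.3):
    """Squared-exponential covariance R(A, B)."""
    d = A[:, None, 0] - B[None, :, 0]
    return np.exp(-0.5 * (d / ell) ** 2)

K = cov(X, X) + 1e-10 * np.eye(len(X))        # small jitter for conditioning
Kinv = np.linalg.inv(K)

# Dubrule's closed-form LOO means
loo_fast = y - (Kinv @ y) / np.diag(Kinv)

# Brute-force LOO: re-predict at x_i from all the other points
loo_slow = np.empty_like(y)
for i in range(len(y)):
    m = np.arange(len(y)) != i
    loo_slow[i] = K[i, m] @ np.linalg.solve(K[np.ix_(m, m)], y[m])

print(np.allclose(loo_fast, loo_slow))        # expected: True
```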

Best Answer

In the general noisy or "signal $+$ noise" framework $y_i = f(\mathbf{x}_i) + \epsilon_i$, several observations can be made at the same location $\mathbf{x}_i$, so the notations $Y(\mathbf{x}_i)$ and $f_{-i}$ can then be misleading.

Suppose first that the $n$ locations $\mathbf{x}_i$ are distinct, so that deleting an observation and deleting a location are the same thing. Using Gaussian conditional expectation, the predicted (or smoothed) value of the signal vector $\mathbf{f}$ is $\widehat{\mathbf{f}} = \mathbf{H} \mathbf{y}$, where $\mathbf{H} := \mathbf{K}(\mathbf{K} + \sigma^2_\epsilon \mathbf{I}_n)^{-1}$. Then we can use Cook and Weisberg's formula for a linear smoother
$$
y_i - \widehat{f}_{-i} = \frac{y_i - \widehat{f}_i}{1 - H_{ii}},
$$
which requires $H_{ii} \neq 1$. This can be rewritten in a form similar to yours, $\widehat{f}_{-i} = y_i - [\mathbf{B}\mathbf{y}]_i / [\mathbf{B}]_{ii}$ with $\mathbf{B} := \mathbf{I}_n - \mathbf{H} = \sigma^2_\epsilon (\mathbf{K} + \sigma^2_\epsilon \mathbf{I}_n)^{-1}$, and we get your formula after cancelling $\sigma^2_\epsilon$ in the fraction. So in this case it is enough to replace the covariance $\mathbf{K}$ by $\mathbf{K} + \sigma^2_\epsilon \mathbf{I}_n$.
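As a sanity check, here is a minimal sketch (assumed squared-exponential kernel, test data and known noise variance) comparing Dubrule's formula applied to $\mathbf{K} + \sigma^2_\epsilon \mathbf{I}_n$ with a brute-force leave-one-out computation of the smoother:

```python
import numpy as np

# Minimal sketch: in the noisy case, apply Dubrule's formula with K replaced
# by K + sigma_eps^2 I and compare against brute-force LOO of the smoother.
# Kernel, data and noise level are assumptions.
rng = np.random.default_rng(1)
n, sigma2 = 10, 0.05
X = rng.uniform(0.0, 1.0, size=(n, 1))
y = np.sin(6.0 * X[:, 0]) + np.sqrt(sigma2) * rng.standard_normal(n)

def cov(A, B, ell=0.3):
    """Squared-exponential covariance."""
    d = A[:, None, 0] - B[None, :, 0]
    return np.exp(-0.5 * (d / ell) ** 2)

K = cov(X, X)
C = K + sigma2 * np.eye(n)                    # K + sigma_eps^2 I_n
Cinv = np.linalg.inv(C)

loo_fast = y - (Cinv @ y) / np.diag(Cinv)     # Dubrule's formula applied to C

loo_slow = np.empty(n)                        # brute force: drop y_i, predict f(x_i)
for i in range(n):
    m = np.arange(n) != i
    loo_slow[i] = K[i, m] @ np.linalg.solve(C[np.ix_(m, m)], y[m])

print(np.allclose(loo_fast, loo_slow))        # expected: True
```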

In the general case, the $n$ observations $y_i$ are related to $p \leqslant n$ signal values $f_k := f(\mathbf{x}_k)$ at $p$ distinct locations. Most likely, a cross-validation will then leave out all the $y_i$ at the same location $\mathbf{x}_k$ (and not only one $y_i$). We are then back to the first case by averaging the observations at the same location and modifying the variance accordingly: for location $\mathbf{x}_k$, the variance $\sigma^2_\epsilon$ must be replaced by $\sigma^2_\epsilon / n_k$, where $n_k$ is the number of observations at site $k$, so the matrix $\sigma^2_\epsilon \mathbf{I}_n$ above becomes a diagonal matrix. This general context can be regarded as a Bayesian linear regression where the parameter $\mathbf{f}$ has prior $\text{Norm}(\mathbf{0},\,\mathbf{K})$ and $\sigma^2_\epsilon$ is known, so conjugacy holds and we have update formulas that can be used for the deletion of one observation if needed.
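A rough sketch of this reduction (the helper below is hypothetical and assumes a known $\sigma^2_\epsilon$): average the replicates at each distinct location and build the diagonal noise matrix with entries $\sigma^2_\epsilon / n_k$.

```python
import numpy as np

def collapse_replicates(X, y, sigma2):
    """Hypothetical helper: reduce replicated observations to one averaged
    observation per distinct location, with noise variance sigma2 / n_k."""
    Xu, inv, counts = np.unique(X, axis=0, return_inverse=True,
                                return_counts=True)
    inv = np.ravel(inv)                       # guard against shape quirks
    y_bar = np.bincount(inv, weights=y) / counts
    D = np.diag(sigma2 / counts)              # diagonal noise on the means
    return Xu, y_bar, D

# LOO then proceeds as before, with K(Xu, Xu) + D in place of K + sigma2 * I.
```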

Cook and Weisberg's formula above (p. 33 of their famous 1982 book Residuals and Influence in Regression) holds for smoothing splines as well as for linear regression. It can be proved by using the Woodbury matrix identity, or by purely statistical arguments involving optimality (e.g. minimal variance). In
$$
\widehat{f}_i = H_{ii} \, y_i + (1 - H_{ii}) \sum_{j \neq i} \frac{H_{ij}}{1 - H_{ii}} \, y_j,
$$
the sum on the right-hand side can be identified as the optimal prediction of $f_i$ based on the $y_j$ with $j \neq i$: it is unbiased, and it must be optimal because otherwise $\widehat{f}_i$ itself would no longer be optimal. The formula does not hold as such for non-noisy (interpolating) kriging, because then $\mathbf{H} = \mathbf{K}\mathbf{K}^{-1} = \mathbf{I}_n$ and $H_{ii} = 1$.
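Spelling out the rearrangement behind this identification (a short worked step, writing the weighted sum as $\widehat{f}_{-i}$ as above):
$$
\widehat{f}_i = H_{ii}\,y_i + \sum_{j \neq i} H_{ij}\,y_j
             = H_{ii}\,y_i + (1 - H_{ii})\,\widehat{f}_{-i},
\qquad \text{hence} \qquad
y_i - \widehat{f}_i = (1 - H_{ii})\,\bigl(y_i - \widehat{f}_{-i}\bigr),
$$
which, after dividing by $1 - H_{ii}$, is exactly the formula quoted above.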
