Solved – Estimating the prediction variance in kernel ridge regression

degrees of freedom, kernel trick, kernel smoothing, ridge regression, variance

I'm trying to estimate the variance of the predictions of a kernel ridge regression model. The model is:
$$\hat{y} = K(K+\lambda I)^{-1}y = A y$$
$K$ is the $n \times n$ kernel matrix (any kernel in principle, but I am especially interested in the linear and RBF kernels), $\lambda > 0$ is the regularization parameter, $y$ is the vector of regression targets, and $\hat{y}$ is the vector of predictions. $A$ is comparable to the hat matrix $H$ in regular linear regression, except that it is not idempotent (it is still symmetric when $K$ is).
I guess using this model my assumptions are that:
$$ y_i = f(x_i) + \epsilon$$
$$\epsilon \sim N(0,\sigma^2)$$
Here $x_i$ is the feature vector of object $i$, and $f$ is a function that is linear in the feature space induced by the kernel (I think?).
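For concreteness, here is a minimal sketch of this smoother-matrix view in Python/NumPy (the toy data and the hyper-parameters `gamma` and `lam` are assumptions for illustration only, not part of the question):

```python
import numpy as np

# Toy data (assumed for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

gamma, lam = 0.5, 1e-2                                   # RBF width and ridge penalty (assumed)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)                            # n x n RBF kernel matrix

A = K @ np.linalg.inv(K + lam * np.eye(len(y)))          # smoother matrix A = K (K + lambda I)^{-1}
y_hat = A @ y                                            # fitted values, y_hat = A y
df = np.trace(A)                                         # effective degrees of freedom, trace(A)
```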

For the variance of the predictions we then get:
$$\text{Cov}(\hat{y}) = A \,\text{Cov}(y)\, A^T = A A^T \sigma^{*2}$$
Here, $\sigma^{*2}$ is the error variance of the non-regularized model.
I now want to estimate $\sigma^*$. In regular linear regression we can estimate $\sigma^2$ using the formula:
$$\sigma^2 \approx S_E^2 = \frac{e^T e }{n-df}$$
Where $e$ is the vector of residuals and $df$ is the degrees of freedom of the model. However, I'm not entirely sure if I can use the formula for $S_E^2$, and I'm not sure how to derive it for the kernelised case. $df$ is:
$$df = \text{trace}(A)$$
However, if we use no regularization, I find that:
$$df = \text{trace}(KK^{-1}) = \text{trace}(I) = n$$
Therefore I cannot calculate $S_E^2$, since I divide by zero. However, notice that without regularization we also get:
$$e = \hat{y} - y = Ay - y = 0$$
since $A = I$ in the case of no regularization. So in fact it is zero divided by zero, and I thought that by taking the limit $\lambda \to 0$ we might still be able to use the formula for $S_E$. However, my intuition tells me that $e^T e$ will go to zero faster, since it is quadratic in $A$, while the denominator $n - \text{trace}(A)$ is only linear (?) in $A$. I guess you could make this more exact by using the eigenvalue decomposition of $K$.
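Making this precise with the eigendecomposition $K = U \operatorname{diag}(\mu_1,\dots,\mu_n) U^T$ (assuming $K$ is invertible, and writing $u_i$ for the eigenvectors): since $I - A = \lambda (K + \lambda I)^{-1}$,
$$e^T e = \lambda^2\, y^T (K+\lambda I)^{-2} y = \lambda^2 \sum_i \frac{(u_i^T y)^2}{(\mu_i+\lambda)^2} = O(\lambda^2),$$
$$n - \text{trace}(A) = \sum_i \left(1 - \frac{\mu_i}{\mu_i+\lambda}\right) = \sum_i \frac{\lambda}{\mu_i+\lambda} = O(\lambda),$$
so
$$S_E^2 = \frac{e^T e}{n - \text{trace}(A)} = O(\lambda) \to 0 \quad \text{as } \lambda \to 0,$$
which is consistent with the intuition above.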
I performed a small experiment in Matlab where I computed $S_E^2$ for values of $\lambda$ going to zero, and I found that $S_E^2$ indeed goes to zero very quickly for small $\lambda$, so it appears that $\sigma = 0$.
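A rough Python/NumPy equivalent of that Matlab experiment (reusing the toy `K` and `y` from the sketch above; the grid of $\lambda$ values is arbitrary) would be:

```python
import numpy as np

# Sweep lambda towards zero and watch S_E^2 = e'e / (n - trace(A)) vanish
n = len(y)
for lam in [1e0, 1e-2, 1e-4, 1e-6, 1e-8]:
    A = K @ np.linalg.inv(K + lam * np.eye(n))   # smoother matrix for this lambda
    e = A @ y - y                                # residuals e = y_hat - y
    s2 = (e @ e) / (n - np.trace(A))             # S_E^2 estimate
    print(f"lambda = {lam:.0e}   S_E^2 = {s2:.3e}")
```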

However, concluding that $\sigma = 0$, and thus that the variance of $\hat{y}$ is zero for the kernel ridge regression model, seems implausible to me. A different approach would be to use bootstrapping to compute the variances of $\hat{y}$, but it feels like there should be a better way to attack this problem (I would like to compute it analytically if possible). In any case, with a linear kernel I don't have this problem (since then $n - df \neq 0$ and $e \neq 0$), but I would like to use the RBF kernel as well.

Best Answer

For anyone interested, this paper helped me a great deal:

Estimating Predictive Variances with Kernel Ridge Regression by G. C. Cawley, N. L. C. Talbot and O. Chapelle

Note that the paper explains how to do this for heteroscedastic noise; however, you can also use it for homoscedastic noise as follows.

The first method is KRR + LOO:

  1. Train the KRR model on the data
  2. For each sample, compute the LOO error
  3. To estimate $\sigma^2$, average the squared LOO residuals found in step 2 (a sketch of this follows below)
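As a hedged sketch (reusing `A` and `y` from the earlier snippet): for a linear smoother such as KRR, the LOO residuals can be obtained with the usual shortcut $e^{LOO}_i = e_i / (1 - A_{ii})$, so no retraining is needed, and $\sigma^2$ is then estimated as their mean square:

```python
import numpy as np

# Virtual leave-one-out residuals via the linear-smoother shortcut
# e_loo_i = e_i / (1 - A_ii), then a homoscedastic sigma^2 estimate.
e = A @ y - y                      # ordinary residuals of the fitted KRR model
a_diag = np.diag(A)                # diagonal of the smoother matrix (leverages)
e_loo = e / (1.0 - a_diag)         # LOO residuals (no retraining needed)
sigma2_hat = np.mean(e_loo ** 2)   # estimate of the noise variance sigma^2
```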

A more elaborate method is KRR + LOO + KRR:

  1. Train the KRR model on the data
  2. For each sample, compute the LOO error
  3. Train a KRR model on the squared residuals that were found in step 2
  4. The predictions of the second model are the prediction variances (see the sketch after this list). We now have to regularize two models, but we do get heteroscedastic variances. I guess I could also take the mean prediction of the second KRR model to obtain the homoscedastic noise term $\sigma^2$, but I'm unsure whether this is useful or better than the method above.
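A minimal sketch of that second stage (again reusing `K`, `y` and `e_loo` from the earlier snippets; the penalty `lam2` is an assumption, and this plain version can produce negative variance predictions that may need clipping):

```python
import numpy as np

lam2 = 1e-1                                        # penalty for the second KRR model (assumed)
alpha2 = np.linalg.solve(K + lam2 * np.eye(len(y)), e_loo ** 2)
var_hat_train = K @ alpha2                         # heteroscedastic variance predictions on the training inputs
sigma2_homosc = var_hat_train.mean()               # optional homoscedastic summary
```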

Finally, they propose a new model that takes heteroscedastic noise into account directly. This way the model can be regularized more strongly in noisy regions of feature space, so it is expected to perform better.

In the end, what I am really interested in is estimating the probability distribution of the prediction at unseen / new data points. The methods I am now considering are:

  1. Performing the KRR + LOO method to estimate $\sigma^2$. I can then estimate the variance at new data points using: $$ \hat{y}_{new} = K_{x_{new},X_{trainset}} (K_{X_{trainset},X_{trainset}} + \lambda I)^{-1} y_{train} = P y_{train}$$ Using the identity $$ \text{Cov}(Xa) = X \,\text{Cov}(a)\, X^T$$ I obtain $$ \text{Cov}(\hat{y}_{new}) = P P^T \sigma^2 $$ (a sketch of this follows after this list). I use the prediction of the regular KRR model on the unseen data as the mean of the distribution.

  2. Performing KRR + LOO + KRR. I have two choices: either I estimate $\sigma^2$ by averaging the outputs of the second KRR model and use the formula above, or I directly use the prediction of the second KRR model at new data points as the prediction variance of each new point. I use the prediction of the regular KRR model on the unseen data as the mean of the distribution.

  3. I use their new model to estimate the mean and variance at unseen data.
  4. I use bootstrapping to train multiple models, and I compute the variance of their predictions on the training data. This way I can estimate $\sigma^2$. Then I use the same formula as in 1. I use the prediction of the regular KRR model on the unseen data as the mean of the distribution.
  5. I use bootstrapping to train multiple models, and I compute their predictions on my new data. I take the variance of those predictions as the prediction variance of each new point. I use the prediction of the regular KRR model on the unseen data as the mean of the distribution.
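A hedged sketch of option 1 (reusing `X`, `y`, `K`, `gamma`, `lam` and `sigma2_hat` from the earlier snippets; the query points `X_new` are an assumption):

```python
import numpy as np

X_new = np.linspace(-3, 3, 20)[:, None]                           # new / unseen inputs (assumed)
sq_d = np.sum((X_new[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_new = np.exp(-gamma * sq_d)                                     # cross-kernel K(x_new, X_train)
P = K_new @ np.linalg.inv(K + lam * np.eye(len(y)))               # prediction operator P
y_hat_new = P @ y                                                 # predictive means
cov_new = sigma2_hat * (P @ P.T)                                  # Cov(y_hat_new) = sigma^2 P P^T
var_new = np.diag(cov_new)                                        # per-point prediction variances
```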

What I am a bit confused about is the actual distribution of the prediction at a new data point. Is this distribution actually Gaussian / normally distributed?