I'm trying to estimate the variance of predictions for a kernel ridge regression model.
The model is simply kernel ridge regression:
$$\hat{y} = K(K+\lambda I)^{-1}y = A y$$
$K$ is the $n \times n$ kernel matrix (for an arbitrary kernel, though I am especially interested in the linear and RBF kernels), $\lambda > 0$ is the regularization parameter, $y$ is the vector of regression targets, and $\hat{y}$ is the vector of predictions. $A$ plays the role of the hat matrix $H$ in ordinary linear regression, except that it is not idempotent (for a symmetric kernel matrix it is still symmetric, since $K$ commutes with $(K+\lambda I)^{-1}$).
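To make the setup concrete, here is a minimal Python/NumPy sketch of the model (the RBF kernel, the toy data, and the value of $\lambda$ are placeholders of my own, not anything from a specific application):

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))            # toy inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

lam = 0.1                                       # regularization parameter
K = rbf_kernel(X, X)
A = K @ np.linalg.inv(K + lam * np.eye(len(X)))  # the "hat"-like matrix
y_hat = A @ y                                    # fitted values

# A is symmetric (K commutes with (K + lam*I)^{-1}) but, unlike the
# OLS hat matrix, it is not idempotent: its eigenvalues mu/(mu+lam)
# lie strictly between 0 and 1.
print(np.allclose(A, A.T), np.allclose(A @ A, A))
```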
I guess that using this model my assumptions are:
$$ y_i = f(x_i) + \epsilon_i$$
$$\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2)$$
Here $x_i$ is the feature vector of object $i$, and $f$ is a function that is linear in the kernel-induced feature space (I think?).
To find the variance of the predictions, since $\hat{y} = Ay$ and $\text{Cov}(y) = \sigma^{*2} I$, we get:
$$\text{Cov}(\hat{y}) = A\,\text{Cov}(y)\,A^T = \sigma^{*2} A A^T$$
Here, $\sigma^{*2}$ is the error variance of the non-regularized model.
I now want to estimate $\sigma^*$. In regular linear regression we can estimate $\sigma^2$ using the formula:
$$\sigma^2 \approx S_E^2 = \frac{e^T e }{n-df}$$
where $e$ is the vector of residuals and $df$ is the degrees of freedom of the model. However, I'm not entirely sure whether this formula for $S_E^2$ carries over to the kernelised case, or how to derive it there. Here $df$ is:
$$df = \text{trace}(A)$$
However, if we use no regularization, I find that:
$$df = \text{trace}(KK^{-1}) = \text{trace}(I) = n$$
Therefore I cannot calculate $S_E^2$, since I divide by zero. However, notice that without regularization we also get:
$$e = \hat{y} - y = Ay - y = 0$$
since $A = I$ in that case. So it is really zero divided by zero, and I thought that by taking the limit $\lambda \to 0$ we might still be able to use the formula for $S_E^2$. However, my intuition tells me that $e^T e$ will go to zero faster, since it is quadratic in $A$, while the denominator $n - \text{trace}(A)$ is only linear in $A$. I guess you could make this more exact by using the eigenvalue decomposition of $K$.
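Following that suggestion, here is a sketch of the limit using the eigendecomposition $K = V M V^T$, assuming all eigenvalues $\mu_i$ are strictly positive, and writing $c = V^T y$. The residual in the eigenbasis has components $-\lambda c_i/(\mu_i + \lambda)$, so:
$$ e^T e = \sum_i \frac{\lambda^2 c_i^2}{(\mu_i+\lambda)^2}, \qquad n - \text{trace}(A) = \sum_i \frac{\lambda}{\mu_i+\lambda} $$
and therefore
$$ S_E^2 = \lambda \cdot \frac{\sum_i c_i^2/(\mu_i+\lambda)^2}{\sum_i 1/(\mu_i+\lambda)} \;\to\; 0 \quad \text{as } \lambda \to 0, $$
i.e. $S_E^2$ vanishes linearly in $\lambda$ when $K$ is strictly positive definite.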
I performed a small experiment in Matlab where I computed $S_E^2$ for values of $\lambda$ approaching zero, and I found that $S_E^2$ goes to zero very quickly for small values of $\lambda$, so it appears that $\sigma^* = 0$.
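For reference, a Python/NumPy sketch of that experiment (this is my own stand-in, not the original Matlab code; the kernel, the toy data, and the grid of $\lambda$ values are placeholders):

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n = 40
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(n)
K = rbf_kernel(X, X)

results = {}
for lam in (1.0, 1e-1, 1e-2, 1e-3):
    A = K @ np.linalg.inv(K + lam * np.eye(n))
    e = A @ y - y
    df = np.trace(A)                 # effective degrees of freedom
    s2 = (e @ e) / (n - df)          # S_E^2
    results[lam] = (df, s2)
    print(f"lambda={lam:g}  df={df:6.2f}  S_E^2={s2:.6f}")
```

As $\lambda$ shrinks, $df = \text{trace}(A)$ grows toward $n$ while the residuals shrink, which is exactly the zero-over-zero behavior described above.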
However, concluding that $\sigma^* = 0$, and thus that the variance of $\hat{y}$ is zero for the kernel ridge regression model, seems implausible to me. A different approach would be to use bootstrapping to compute the variances of $\hat{y}$, but it feels like there should be a better way to attack this problem (I would like to compute it analytically if possible). In any case, with a linear kernel I don't have this problem (since then $df \neq n$ and $e \neq 0$), but I would like to use the RBF kernel as well.
Best Answer
For anyone interested, this paper helped me a lot:
"Estimating Predictive Variances with Kernel Ridge Regression" by G. C. Cawley, N. L. C. Talbot and O. Chapelle
Note that this paper explains how to do it for heteroscedastic noise; however, you can also use it for homoscedastic noise as follows.
The first method, KRR + LOO, estimates the noise variance from the leave-one-out residuals of the fitted KRR model.
A more detailed method, KRR + LOO + KRR, fits a second KRR model to the leave-one-out residuals, so that the estimated noise variance can vary over feature space.
Finally, they propose a new model that directly takes heteroscedastic noise into account. This way the model can be regularized more heavily in noisy regions of feature space, and is thus expected to perform better.
In the end what I am really interested in is to estimate the probability distribution of the prediction on unseen / new data. The methods that I am now considering are:
Performing the KRR + LOO method to estimate $\sigma^2$. Now I can estimate the variance at new data points using: $$ \hat{y}_{\text{new}} = K_{X_{\text{new}},X_{\text{train}}} (K_{X_{\text{train}},X_{\text{train}}} + \lambda I)^{-1} y_{\text{train}} = P y_{\text{train}}$$ Using the identity: $$ \text{Cov}(Xa) = X \,\text{Cov}(a)\, X^T$$ I obtain: $$ \text{Cov}(\hat{y}_{\text{new}}) = \sigma^2 P P^T $$ I use the prediction of the regular KRR model on the unseen data as the mean of the distribution.
Performing KRR + LOO + KRR. I have two choices: I either estimate $\sigma^2$ by averaging the outputs of the second KRR model and use the formula above, or I directly use the prediction of the second KRR model at the new point as its prediction variance. I use the prediction of the regular KRR model on the unseen data as the mean of the distribution.
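The first option can be sketched in Python/NumPy as follows. The closed-form leave-one-out residual $e_i^{\text{loo}} = e_i/(1 - A_{ii})$ for linear smoothers is my assumption for the "KRR + LOO" step (it stands in for the paper's procedure), and the kernel and data are placeholders:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
n = 60
X_train = rng.uniform(-3, 3, size=(n, 1))
y_train = np.sin(X_train[:, 0]) + 0.2 * rng.standard_normal(n)
lam = 1e-2

K = rbf_kernel(X_train, X_train)
C_inv = np.linalg.inv(K + lam * np.eye(n))
A = K @ C_inv

# homoscedastic noise estimate from leave-one-out residuals
# (assumed shortcut: e_loo_i = e_i / (1 - A_ii) for a linear smoother)
e = A @ y_train - y_train
e_loo = e / (1.0 - np.diag(A))
sigma2 = np.mean(e_loo ** 2)

# predictive mean and covariance at new points
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
P = rbf_kernel(X_new, X_train) @ C_inv
y_hat_new = P @ y_train
cov_new = sigma2 * P @ P.T          # Cov(P y) = P Cov(y) P^T
std_new = np.sqrt(np.diag(cov_new))
```

The diagonal of `cov_new` gives per-point prediction variances; note that $P$ is rectangular (new points by training points), so $P P^T$ has the shape of the new data.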
What I am a bit confused about is the actual distribution of the prediction at a new point. Is this distribution actually Gaussian / normally distributed?