Solved – Gaussian process prediction interval

confidence interval, gaussian process, prediction interval

How can the prediction interval of a Gaussian process be evaluated? I don't know how to estimate this interval, though I can find a 95% confidence interval for the mean line.

Best Answer

I will answer your question within the Bayesian framework. If you specifically need a frequentist solution, you could get one by slightly modifying my answer, but I think it would underestimate the actual uncertainty: you would need a fully frequentist approach, and I don't know how to do that in this specific case.

To briefly recap the Bayesian GPR (Gaussian Process Regression) framework, you assume the model

$$y=f(x|\boldsymbol{\theta})+\epsilon$$

where $f(x|\boldsymbol{\theta})\sim \mathcal{GP}(\mu(x|\boldsymbol{\theta}),k(x,x'|\boldsymbol{\theta}))$, i.e., the latent variables or function values are distributed as a Gaussian Process, conditionally on the hyperparameters $\boldsymbol{\theta}$, and $\epsilon\sim\mathcal{N}(0,\sigma^2)$ is the usual iid Gaussian noise.


Actually, $\sigma^2$ is a hyperparameter too, so it really belongs in $\boldsymbol{\theta}$, but I wanted to underscore that GPR usually assumes a trivial covariance structure for the noise.
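To make the setup concrete, here is a minimal numpy sketch of this model, assuming a squared-exponential (RBF) covariance function; the hyperparameter values and the training inputs are illustrative choices, not anything prescribed by the question.

```python
import numpy as np

def sq_exp_kernel(xa, xb, signal_var=1.0, lengthscale=1.0):
    """k(x, x') = signal_var * exp(-(x - x')^2 / (2 * lengthscale^2))."""
    d = xa[:, None] - xb[None, :]
    return signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)                 # training inputs (illustrative)
K = sq_exp_kernel(x, x) + 1e-9 * np.eye(len(x))   # prior covariance of f, with jitter for stability
f = rng.multivariate_normal(np.zeros(len(x)), K)  # one draw from the zero-mean GP prior
sigma2 = 0.05                                # noise variance (illustrative)
y = f + rng.normal(0.0, np.sqrt(sigma2), size=len(x))   # y = f(x) + eps, eps ~ N(0, sigma^2)
```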


The posterior predictive distribution of $y_*$ at a new point $x_*$, conditionally on data $\{(x_1,y_1),\dots,(x_d,y_d)\}=(\mathbf{x},\mathbf{y})$ and on hyperparameters $\boldsymbol{\theta}$, is $p(y_*|\boldsymbol{\theta},\mathbf{y})$. Now, assume that the mean function of the Gaussian Process is zero: the general case can also be dealt with, but let's keep things simple. Then, using the usual GPR machinery, we get

$$p(f_*|\boldsymbol{\theta},\mathbf{y}) = \mathcal{N}(\mathbf{k}_*^T(K+\sigma^2I)^{-1}\mathbf{y},k(x_*,x_*)-\mathbf{k}_*^T(K+\sigma^2I)^{-1}\mathbf{k}_*)$$

where

$$K=\pmatrix{k(x_{1},x_{1};\boldsymbol{\theta})& \cdots & k(x_{1},x_{d};\boldsymbol{\theta}) \\ \vdots & \ddots & \vdots \\ k(x_{d},x_{1};\boldsymbol{\theta})& \cdots & k(x_{d},x_{d};\boldsymbol{\theta}) }$$

$$\mathbf{k}_*=\pmatrix{k(x_*,x_{1};\boldsymbol{\theta}) \\ \vdots \\k(x_*,x_{d};\boldsymbol{\theta})}$$

i.e., conditional on observed data and hyperparameters, the distribution of the latent variable at a new point is still Gaussian, with the mean and variance given above.

However, we are interested in the distribution of a new observation $y_*$, not of a new latent variable. This is easy: in our model the noise is additive, independent of everything else, and normally distributed with zero mean and variance $\sigma^2$, so we only need to add the noise variance to the predictive variance of the latent variable:

$$p(y_*|\boldsymbol{\theta},\mathbf{y}) = \mathcal{N}(\mathbf{k}_*^T(K+\sigma^2I)^{-1}\mathbf{y},k(x_*,x_*)-\mathbf{k}_*^T(K+\sigma^2I)^{-1}\mathbf{k}_*+\sigma^2)$$

Note that I'm considering a single new observation $y_*$, so the distribution $p(y_*|\boldsymbol{\theta},\mathbf{y})$ is just a univariate Gaussian, and the variance is actually a variance and not a variance-covariance matrix.
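A minimal numpy sketch of these formulas, assuming the hyperparameters are known and reusing the squared-exponential kernel from the sketch above (the 95% level and the test point in the commented usage are illustrative):

```python
import numpy as np
from scipy.stats import norm

def gp_predict(x_train, y_train, x_star, kernel, sigma2):
    """Predictive mean and variance of y_* at a single new input x_*,
    for fixed hyperparameters (baked into `kernel` and `sigma2`)."""
    K = kernel(x_train, x_train)                              # d x d Gram matrix
    k_star = kernel(x_train, np.atleast_1d(x_star)).ravel()   # vector k_*
    A = K + sigma2 * np.eye(len(x_train))
    alpha = np.linalg.solve(A, y_train)                       # (K + sigma^2 I)^{-1} y
    mean = k_star @ alpha
    v = np.linalg.solve(A, k_star)                            # (K + sigma^2 I)^{-1} k_*
    var_f = kernel(np.atleast_1d(x_star), np.atleast_1d(x_star))[0, 0] - k_star @ v
    var_y = var_f + sigma2                                    # add the noise variance for y_*
    return mean, var_y

# 95% prediction interval from the Gaussian predictive distribution:
# mean, var_y = gp_predict(x, y, 0.5, sq_exp_kernel, sigma2)
# lo, hi = norm.ppf([0.025, 0.975], loc=mean, scale=np.sqrt(var_y))
```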

To actually use this expression, you need values for the hyperparameters, which are unknown. There are two ways out of this:

  1. (the most common solution) the hyperparameters are estimated by MLE or MAP, and the above expression is used with the estimates plugged in. This approach completely neglects the uncertainty in the estimation of the hyperparameters, so it doesn't seem very safe (a minimal sketch of this plug-in route is given right after this list).

  2. in a fully Bayesian approach, you are not really interested in $p(y_*|\boldsymbol{\theta},\mathbf{y})$, but in the predictive distribution of $y_*$ given $\mathbf{y}$, which is obtained from $p(y_*|\boldsymbol{\theta},\mathbf{y})$ by integrating out the hyperparameters:

    $$p(y_*|\mathbf{y})=\int{p(y_*,\boldsymbol{\theta}|\mathbf{y})}d\boldsymbol{\theta}=\int{p(y_*|\boldsymbol{\theta},\mathbf{y})p(\boldsymbol{\theta}|\mathbf{y})}d\boldsymbol{\theta}$$
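Before turning to the fully Bayesian route, here is a minimal sketch of the plug-in approach in point 1: the hyperparameters of the squared-exponential kernel are estimated by numerically maximizing the log marginal likelihood (the log-scale parameterisation and the starting point are arbitrary choices), and the estimates are then inserted into the predictive formula above.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, x_train, y_train):
    """Negative log p(y | theta) for theta = (signal_var, lengthscale, sigma2),
    parameterised on the log scale so that all values stay positive."""
    signal_var, lengthscale, sigma2 = np.exp(log_theta)
    d = x_train[:, None] - x_train[None, :]
    K = signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)
    A = K + sigma2 * np.eye(len(x_train))
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # A^{-1} y
    return (0.5 * y_train @ alpha
            + np.sum(np.log(np.diag(L)))                        # 0.5 * log|A|
            + 0.5 * len(y_train) * np.log(2.0 * np.pi))

# res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(x, y))
# signal_var_hat, lengthscale_hat, sigma2_hat = np.exp(res.x)
# These plug-in estimates then go straight into gp_predict above.
```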

The fully Bayesian route raises two problems. First, given a prior distribution $p(\boldsymbol{\theta})$ for the hyperparameters, the posterior $p(\boldsymbol{\theta}|\mathbf{y})$, which appears in the integral, is not known in closed form but must be obtained from Bayes' theorem, which for most hyperpriors means having to run an MCMC; we therefore only have samples from $p(\boldsymbol{\theta}|\mathbf{y})$, not an explicit expression. Second, even if we had such an expression, the integral giving $p(y_*|\mathbf{y})$ would still be impossible to evaluate in closed form in most cases. The solution is a hierarchical Bayes simulation: for each sample $\hat{\boldsymbol{\theta}}_i$ drawn from $p(\boldsymbol{\theta}|\mathbf{y})$ by the MCMC, you draw a sample $y^*_i$ from $p(y_*|\hat{\boldsymbol{\theta}}_i,\mathbf{y})$. Use these $m$ samples $y^*_i$ to estimate an HPD interval for $y_*$, and there you are.
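To make the hierarchical simulation concrete, here is a minimal numpy sketch; it assumes the squared-exponential kernel used above and that `theta_samples` already holds the MCMC draws of (signal variance, lengthscale, $\sigma^2$), obtained however you like. The `hpd_interval` helper is a simple empirical shortest-interval estimate.

```python
import numpy as np

def hpd_interval(samples, prob=0.95):
    """Shortest interval containing `prob` of the samples (empirical HPD)."""
    s = np.sort(samples)
    n_in = int(np.ceil(prob * len(s)))
    widths = s[n_in - 1:] - s[:len(s) - n_in + 1]
    i = np.argmin(widths)
    return s[i], s[i + n_in - 1]

def sample_y_star(theta_samples, x_train, y_train, x_star, rng=None):
    """For each posterior draw theta_i (rows of theta_samples, assumed to come
    from an MCMC run: signal_var, lengthscale, sigma2), draw one y*_i from
    p(y_* | theta_i, y). The resulting samples approximate p(y_* | y)."""
    if rng is None:
        rng = np.random.default_rng()
    draws = []
    for signal_var, lengthscale, sigma2 in theta_samples:
        d = x_train[:, None] - x_train[None, :]
        K = signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)
        k_star = signal_var * np.exp(-0.5 * ((x_star - x_train) / lengthscale) ** 2)
        A = K + sigma2 * np.eye(len(x_train))
        mean = k_star @ np.linalg.solve(A, y_train)
        var = signal_var - k_star @ np.linalg.solve(A, k_star) + sigma2
        draws.append(rng.normal(mean, np.sqrt(var)))
    return np.array(draws)

# lo, hi = hpd_interval(sample_y_star(theta_samples, x, y, x_star=0.5))
```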

From an intuitive point of view, the second solution draws samples from a distribution where the hyperparameters are "not fixed", but allowed to vary randomly according to their posterior distribution $p(\boldsymbol{\theta}|\mathbf{y})$. Thus the prediction interval obtained in the second case takes into account the uncertainty due to our lack of knowledge about the hyperparameters.