Solved – Gaussian process prediction interval

confidence interval, gaussian process, prediction interval

How can the prediction interval of a Gaussian process be evaluated? I don't know how to estimate this interval, though I can find a 95% confidence interval for the mean line.

Best Answer

I will answer your question within the Bayesian framework. If you specifically need a frequentist solution, you could get one by slightly modifying my answer, but I think it would underestimate the actual uncertainty: you would need a fully frequentist approach, and I don't know how to do that in this specific case.

To briefly recap the Bayesian GPR (Gaussian Process Regression) framework, you assume the model

$$y=f(x|\boldsymbol{\theta})+\epsilon$$

where $f(x|\boldsymbol{\theta})\sim \mathcal{GP}(\mu(x|\boldsymbol{\theta}),k(x,x'|\boldsymbol{\theta}))$, i.e., the latent variables or function values are distributed as a Gaussian Process, conditionally on the hyperparameters $\boldsymbol{\theta}$, and $\epsilon\sim\mathcal{N}(0,\sigma^2)$ is the usual iid Gaussian noise.


Actually, $\sigma^2$ is a hyperparameter too, so it really belongs in $\boldsymbol{\theta}$, but I wanted to underscore that GPR usually assumes a trivial covariance structure for the noise.
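To make the setup concrete, here is a minimal numpy sketch of this model, assuming a squared-exponential (RBF) covariance function; the hyperparameter values and the training inputs are illustrative choices, not anything prescribed by the question.

```python
import numpy as np

def sq_exp_kernel(xa, xb, signal_var=1.0, lengthscale=1.0):
    """k(x, x') = signal_var * exp(-(x - x')^2 / (2 * lengthscale^2))."""
    d = xa[:, None] - xb[None, :]
    return signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)                 # training inputs (illustrative)
K = sq_exp_kernel(x, x) + 1e-9 * np.eye(len(x))   # prior covariance of f, with jitter for stability
f = rng.multivariate_normal(np.zeros(len(x)), K)  # one draw from the zero-mean GP prior
sigma2 = 0.05                                # noise variance (illustrative)
y = f + rng.normal(0.0, np.sqrt(sigma2), size=len(x))   # y = f(x) + eps, eps ~ N(0, sigma^2)
```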


The posterior predictive distribution of $y_*$ at a new point $x_*$, conditionally on data $\{(x_1,y_1),\dots,(x_d,y_d)\}=(\mathbf{x},\mathbf{y})$ and on hyperparameters $\boldsymbol{\theta}$, is $p(y_*|\boldsymbol{\theta},\mathbf{y})$. Now, assume that the mean function of the Gaussian Process is zero: the general case can also be dealt with, but let's keep things simple. Then, using the usual GPR machinery, we get

$$p(f_*|\boldsymbol{\theta},\mathbf{y}) = \mathcal{N}(\mathbf{k}_*^T(K+\sigma^2I)^{-1}\mathbf{y},k(x_*,x_*)-\mathbf{k}_*^T(K+\sigma^2I)^{-1}\mathbf{k}_*)$$

where

$$K=\pmatrix{k(x_{1},x_{1};\boldsymbol{\theta})& \cdots & k(x_{1},x_{d};\boldsymbol{\theta}) \\ \vdots & \ddots & \vdots \\ k(x_{d},x_{1};\boldsymbol{\theta})& \cdots & k(x_{d},x_{d};\boldsymbol{\theta}) }$$

$$\mathbf{k}_*=\pmatrix{k(x_*,x_{1};\boldsymbol{\theta}) \\ \vdots \\k(x_*,x_{d};\boldsymbol{\theta})}$$

i.e., conditional on observed data and hyperparameters, the distribution of the latent variable at a new point is still Gaussian, with the mean and variance given above.

However, we are interested in the distribution of a new observation $y_*$, not of a new latent variable. This is easy: in our model the noise is additive, independent of everything else, and normally distributed with zero mean and variance $\sigma^2$, so we only need to add the noise variance to the predictive variance of the latent variable:

$$p(y_*|\boldsymbol{\theta},\mathbf{y}) = \mathcal{N}(\mathbf{k}_*^T(K+\sigma^2I)^{-1}\mathbf{y},k(x_*,x_*)-\mathbf{k}_*^T(K+\sigma^2I)^{-1}\mathbf{k}_*+\sigma^2)$$

Note that I'm considering a single new observation $y_*$, so the distribution $p(y_*|\boldsymbol{\theta},\mathbf{y})$ is just a univariate Gaussian, and the variance is actually a variance and not a variance-covariance matrix.
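A minimal numpy sketch of these formulas, assuming the hyperparameters are known and reusing the squared-exponential kernel from the sketch above (the 95% level and the test point in the commented usage are illustrative):

```python
import numpy as np
from scipy.stats import norm

def gp_predict(x_train, y_train, x_star, kernel, sigma2):
    """Predictive mean and variance of y_* at a single new input x_*,
    for fixed hyperparameters (baked into `kernel` and `sigma2`)."""
    K = kernel(x_train, x_train)                              # d x d Gram matrix
    k_star = kernel(x_train, np.atleast_1d(x_star)).ravel()   # vector k_*
    A = K + sigma2 * np.eye(len(x_train))
    alpha = np.linalg.solve(A, y_train)                       # (K + sigma^2 I)^{-1} y
    mean = k_star @ alpha
    v = np.linalg.solve(A, k_star)                            # (K + sigma^2 I)^{-1} k_*
    var_f = kernel(np.atleast_1d(x_star), np.atleast_1d(x_star))[0, 0] - k_star @ v
    var_y = var_f + sigma2                                    # add the noise variance for y_*
    return mean, var_y

# 95% prediction interval from the Gaussian predictive distribution:
# mean, var_y = gp_predict(x, y, 0.5, sq_exp_kernel, sigma2)
# lo, hi = norm.ppf([0.025, 0.975], loc=mean, scale=np.sqrt(var_y))
```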

To actually use this expression, you need values for the hyperparameters, which are unknown. There are two ways out of this:

  1. (the most common solution) the hyperparameters are estimated by MLE or MAP, and the above expression is used with the estimates plugged in. This approach completely neglects the uncertainty in the estimation of the hyperparameters, so it doesn't seem very safe (a minimal sketch of this plug-in route is given right after this list).

  2. in a fully Bayesian approach, you are not really interested in $p(y_*|\boldsymbol{\theta},\mathbf{y})$, but in the predictive distribution of $y_*$ given $\mathbf{y}$, which is obtained from $p(y_*|\boldsymbol{\theta},\mathbf{y})$ by integrating out the hyperparameters:

    $$p(y_*|\mathbf{y})=\int{p(y_*,\boldsymbol{\theta}|\mathbf{y})}d\boldsymbol{\theta}=\int{p(y_*|\boldsymbol{\theta},\mathbf{y})p(\boldsymbol{\theta}|\mathbf{y})}d\boldsymbol{\theta}$$
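Before turning to the fully Bayesian route, here is a minimal sketch of the plug-in approach in point 1: the hyperparameters of the squared-exponential kernel are estimated by numerically maximizing the log marginal likelihood (the log-scale parameterisation and the starting point are arbitrary choices), and the estimates are then inserted into the predictive formula above.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, x_train, y_train):
    """Negative log p(y | theta) for theta = (signal_var, lengthscale, sigma2),
    parameterised on the log scale so that all values stay positive."""
    signal_var, lengthscale, sigma2 = np.exp(log_theta)
    d = x_train[:, None] - x_train[None, :]
    K = signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)
    A = K + sigma2 * np.eye(len(x_train))
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # A^{-1} y
    return (0.5 * y_train @ alpha
            + np.sum(np.log(np.diag(L)))                        # 0.5 * log|A|
            + 0.5 * len(y_train) * np.log(2.0 * np.pi))

# res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(x, y))
# signal_var_hat, lengthscale_hat, sigma2_hat = np.exp(res.x)
# These plug-in estimates then go straight into gp_predict above.
```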

The fully Bayesian route raises two problems. First, given a prior distribution $p(\boldsymbol{\theta})$ for the hyperparameters, the posterior $p(\boldsymbol{\theta}|\mathbf{y})$, which appears in the integral, is not known in closed form but must be obtained from Bayes' theorem, which for most hyperpriors means having to run an MCMC; we therefore only have samples from $p(\boldsymbol{\theta}|\mathbf{y})$, not an explicit expression. Second, even if we had such an expression, the integral giving $p(y_*|\mathbf{y})$ would still be impossible to evaluate in closed form in most cases. The solution is a hierarchical Bayes simulation: for each sample $\hat{\boldsymbol{\theta}}_i$ drawn from $p(\boldsymbol{\theta}|\mathbf{y})$ by the MCMC, you draw a sample $y^*_i$ from $p(y_*|\hat{\boldsymbol{\theta}}_i,\mathbf{y})$. Use these $m$ samples $y^*_i$ to estimate an HPD interval for $y_*$, and there you are.
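To make the hierarchical simulation concrete, here is a minimal numpy sketch; it assumes the squared-exponential kernel used above and that `theta_samples` already holds the MCMC draws of (signal variance, lengthscale, $\sigma^2$), obtained however you like. The `hpd_interval` helper is a simple empirical shortest-interval estimate.

```python
import numpy as np

def hpd_interval(samples, prob=0.95):
    """Shortest interval containing `prob` of the samples (empirical HPD)."""
    s = np.sort(samples)
    n_in = int(np.ceil(prob * len(s)))
    widths = s[n_in - 1:] - s[:len(s) - n_in + 1]
    i = np.argmin(widths)
    return s[i], s[i + n_in - 1]

def sample_y_star(theta_samples, x_train, y_train, x_star, rng=None):
    """For each posterior draw theta_i (rows of theta_samples, assumed to come
    from an MCMC run: signal_var, lengthscale, sigma2), draw one y*_i from
    p(y_* | theta_i, y). The resulting samples approximate p(y_* | y)."""
    if rng is None:
        rng = np.random.default_rng()
    draws = []
    for signal_var, lengthscale, sigma2 in theta_samples:
        d = x_train[:, None] - x_train[None, :]
        K = signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)
        k_star = signal_var * np.exp(-0.5 * ((x_star - x_train) / lengthscale) ** 2)
        A = K + sigma2 * np.eye(len(x_train))
        mean = k_star @ np.linalg.solve(A, y_train)
        var = signal_var - k_star @ np.linalg.solve(A, k_star) + sigma2
        draws.append(rng.normal(mean, np.sqrt(var)))
    return np.array(draws)

# lo, hi = hpd_interval(sample_y_star(theta_samples, x, y, x_star=0.5))
```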

From an intuitive point of view, the second solution draws samples from a distribution where the hyperparameters are "not fixed", but allowed to vary randomly according to their posterior distribution $p(\boldsymbol{\theta}|\mathbf{y})$. Thus the prediction interval obtained in the second case takes into account the uncertainty due to our lack of knowledge about the hyperparameters.