Prediction Intervals – Understanding Prediction and Tolerance Intervals in Statistical Analysis

predictionprediction intervaltolerance-interval

I have a couple of questions for prediction and tolerance intervals.

Let's agree on the definition of the tolerance intervals first: We are given a confidence level, say 90%, the percentage of the population to capture, say 99%, and a sample size, say 20. The probability distribution is known, say normal for convenience. Now, given the above three numbers (90%, 99% and 20) and the fact that the underlying distribution is normal, we can compute the tolerance number $k$. Given a sample $(x_1,x_2,\ldots,x_{20})$ with mean $\bar{x}$ and standard deviation $s$, the tolerance interval is $\bar{x}\pm ks$. If this tolerance interval captures 99% of the population, then the sample $(x_1,x_2,\ldots,x_{20})$ is called a success and the requirement is that 90% of the samples are successes.

Comment: 90% is the a priori probability for a sample to be a success. 99% is the conditional probability that a future observation will be in the tolerance interval, given that the sample is a success.

My questions:
Can we see prediction intervals as tolerance intervals? Looking on the web I got conflicting answers on this, not to mention that nobody really defined the prediction intervals carefully. So, if you have a precise definition of the prediction interval (or a reference), I would appreciate it.

What I understood is that a 99% prediction interval for instance, does not capture 99% of all future values for all samples. This would be the same as a tolerance interval that captures 99% of the population with 100% probability.

In the definitions I found for a 90% prediction interval, 90% is the a priori probability given a sample, say $(x_1,x_2,\ldots,x_{20})$ (size is fixed) and a single future observation $y$, that $y$ will be in the prediction interval. So, it seems that both the sample and the future value are both given at the same time, in contrast to the tolerance interval, where the sample is given and with a certain probability it is a success, and under the condition that the sample is a success, a future value is given and with a certain probability falls into the tolerance interval. I am not sure if the above definition of the prediction interval is right or not, but it seems counterintuitive (at least).

Any help?

Best Answer

Your definitions appear to be correct.

The book to consult about these matters is Statistical Intervals (Gerald Hahn & William Meeker), 1991. I quote:

A prediction interval for a single future observation is an interval that will, with a specified degree of confidence, contain the next (or some other prespecified) randomly selected observation from a population.

[A] tolerance interval is an interval that one can claim to contain at least a specified proportion, p, of the population with a specified degree of confidence, $100(1-\alpha)\%$.

Here are restatements in standard mathematical terminology. Let the data $\mathbf{x}=(x_1,\ldots,x_n)$ be considered a realization of independent random variables $\mathbf{X}=(X_1,\ldots,X_n)$ with common cumulative distribution function $F_\theta$. ($\theta$ appears as a reminder that $F$ may be unknown but is assumed to lie in a given set of distributions ${F_\theta \vert \theta \in \Theta}$). Let $X_0$ be another random variable with the same distribution $F_\theta$ and independent of the first $n$ variables.

A prediction interval (for a single future observation), given by endpoints $[l(\mathbf{x}), u(\mathbf{x})]$, has the defining property that

$$ \inf_\theta\{{\Pr}_\theta(X_0 \in [l(\mathbf{X}), u(\mathbf{X})])\}= 100(1-\alpha)\%.$$

Specifically, ${\Pr}_\theta$ refers to the $n+1$ variate distribution of $(X_0, X_1, \ldots, X_n)$ determined by the law $F_\theta$. Note the absence of any conditional probabilities: this is a full joint probability. Note, too, the absence of any reference to a temporal sequence: $X_0$ very well may be observed in time before the other values. It does not matter.

I'm not sure which aspect(s) of this may be "counterintuitive." If we conceive of selecting a statistical procedure as an activity to be pursued before collecting data, then this is a natural and reasonable formulation of a planned two-step process, because both the data ($X_i, i=1,\ldots,n$) and the "future value" $X_0$ need to be modeled as random.
A tolerance interval, given by endpoints $(L(\mathbf{x}), U(\mathbf{x})]$, has the defining property that

$$ \inf_\theta\{{\Pr}_\theta\left(F_\theta(U(\mathbf{X})) - F_\theta(L(\mathbf{X})\right) \ge p)\} = 100(1-\alpha)\%.$$

Note the absence of any reference to $X_0$: it plays no role.

When $\{F_\theta\}$ is the set of Normal distributions, there exist prediction intervals of the form

$$l(\mathbf{x}) = \bar{x} - k(\alpha, n) s, \quad u(\mathbf{x}) = \bar{x} + k(\alpha, n) s$$

($\bar{x}$ is the sample mean and $s$ is the sample standard deviation). Values of the function $k$, which Hahn & Meeker tabulate, do not depend on the data $\mathbf{x}$. There are other prediction interval procedures, even in the Normal case: these are not the only ones.

Similarly, there exist tolerance intervals of the form

$$L(\mathbf{x}) = \bar{x} - K(\alpha, n, p) s, \quad U(\mathbf{x}) = \bar{x} + K(\alpha, n, p) s.$$

There are other tolerance interval procedures: these are not the only ones.

Noting the similarity among these pairs of formulas, we may solve the equation

$$k(\alpha, n) = K(\alpha', n, p).$$

This allows one to reinterpret a prediction interval as a tolerance interval (in many different possible ways by varying $\alpha'$ and $p$) or to reinterpret a tolerance interval as a prediction interval (only now $\alpha$ usually is uniquely determined by $\alpha'$ and $p$). This may be one origin of the confusion.

Related Solutions

Solved – Does more variables mean tighter confidence intervals

do I ALWAYS get a tighter confidence interval if I include more variables in my model?

Yes, you do (EDIT: ...basically. Subject to some caveats. See comments below). Here's why: adding more variables reduces the SSE and thereby the variance of the model, on which your confidence and prediction intervals depend. This even happens (to a lesser extent) when the variables you are adding are completely independent of the response:

a=rnorm(100)
b=rnorm(100)
c=rnorm(100)
d=rnorm(100)
e=rnorm(100)

summary(lm(a~b))$sigma       # 0.9634881
summary(lm(a~b+c))$sigma     # 0.961776
summary(lm(a~b+c+d))$sigma   # 0.9640104 (Went up a smidgen)
summary(lm(a~b+c+d+e))$sigma # 0.9588491 (and down we go again...)

But this does not mean you have a better model. In fact, this is how overfitting happens.

Consider the following example: let's say we draw a sample from a quadratic function with noise added.

enter image description here

A first order model will fit this poorly and have very high bias.

enter image description here

A second order model fits well, which is not surprising since this is how the data was generated in the first place.

enter image description here

But let's say we don't know that's how the data was generated, so we fit increasingly higher order models.

enter image description here

As the complexity of the model increases, we're able to capture more of the fluctuations in the data, effectively fitting our model to the noise, to patterns that aren't really there. With enough complexity, we can build a model that will go through each point in our data nearly exactly.

enter image description here

As you can see, as the order of the model increases, so does the fit. We can see this quantitatively by plotting the training error:

enter image description here

But if we draw more points from our generating function, we will observe the test error diverges rapidly.

enter image description here

The moral of the story is to be wary of overfitting your model. Don't just rely on metrics like adjusted-R2, consider validating your model against held out data (a "test" set) or evaluating your model using techniques like cross validation.

For posterity, here's the code for this tutorial:

set.seed(123)
xv = seq(-5,15,length.out=1e4)
X=sample(xv,20)
gen=function(v){v^2 + 7*rnorm(length(v))}
Y=gen(X)
df = data.frame(x=X,y=Y)
plot(X,Y)
lines(xv,xv^2, col="blue") # true model
legend("topleft", "True Model", lty=1, col="blue")    

build_formula = function(N){ 
  paste('y~poly(x,',N,',raw=T)')
}

deg=c(1,2,10,20)
formulas = sapply(deg[2:4], build_formula)
formulas = c('y~x', formulas)
pred = lapply(formulas
              ,function(f){
                predict(
                  lm(f, data=df)
                  ,newdata=list(x=xv))
                            })
# Progressively add fit lines to the plot
lapply(1:length(pred), function(i){
  plot(df, main=paste(deg[i],"-Degree"))
  lapply(1:i,function(n){
    lines(xv,pred[[n]], col=n)
  })
  })

# let's actually generate models from poly 1:20 to calculate MSE
deg=seq(1,20)
formulas = sapply(deg, build_formula)
pred.train = lapply(formulas
                   ,function(f){
                     predict(
                       lm(f, data=df)
                       ,newdata=list(x=df$x))
                   })

pred.test = lapply(formulas
              ,function(f){
                predict(
                  lm(f, data=df)
                  ,newdata=list(x=xv))
              })

rmse.train = unlist(lapply(pred.train,function(P){
  regr.eval(df$y,P, stats="rmse")
}))    

yv=gen(xv)
rmse.test = unlist(lapply(pred.test,function(P){
  regr.eval(yv,P, stats="rmse")
}))    

plot(rmse.train, type='l', col='blue'
     , main="Training Error"
     ,xlab="Model Complexity")

plot(rmse.test, type='l', col='red'
     , main="Train vs. Test Error"
     ,xlab="Model Complexity")
lines(rmse.train, type='l', col='blue')
legend("topleft", c("Test","Train"), lty=c(1,1), col=c("red","blue"))

Solved – Making a forecast: Confidence/Prediction/Tolerance Interval

This will depend on what you want to use the forecast and the interval for.

For instance, I do forecasting for retail, and our prediction intervals (more precisely: high quantile forecasts) are used for replenishment. In such a use case, the relevant thing is, say, a 95% quantile forecast, because this will translate more-or-less into a specific service level, which will hopefully balance the costs of understock and overstock.

As an example, let's look at the AirPassengers dataset. Maybe we want to provide enough capacity to achieve a service level of 95%. We can do this by fitting a model and looking at a 90% prediction interval. This is comprised of the 5% and the 95% quantile forecasts. Below, the "Lo 90%" is the number such that we expect a 5% chance of observing fewer passengers, while the "Hi 90%" is the number such that we expect a 95% chance of observing fewer passengers - together, the two numbers bracket off a prediction interval that we expect a 90% chance of covering the next observation.

> library(forecast)
> (foo <- forecast(auto.arima(AirPassengers),h=1,level=0.90))
         Point Forecast   Lo 90    Hi 90
Jan 1961       446.7582 427.413 466.1034
> plot(foo)

So, to achieve 95% service level, we would plan for providing capacity for 466.1 (thousand) passengers. (I'm glossing over different possible definitions of the service level, which here don't really make a difference. Plus, in planning for a longer lead time, we would need to take remaining autocorrelation into account, and so forth.)

Most forecasters I work with are only interested in prediction intervals, because you can verify them to a certain extent, simply by observing the next realization. Confidence intervals can never be verified, and you will have to trust that you have a correctly specified model for the mathematics to work. (And I have never seen the tolerance interval used in forecasting.)

When you are predicting the winner of an election, a prediction interval does not really make sense, because this random variable is not numeric. Unless you are not predicting the winner, but, say, a candidate's share of the vote.

Best Answer

Related Solutions

Solved – Does more variables mean tighter confidence intervals

Solved – Making a forecast: Confidence/Prediction/Tolerance Interval

Related Question