Prediction Interval – Calculating Prediction Interval for Binomial Random Variables

binomial distributionconfidence intervalprediction interval

What is the formula (approximate or exact) for a prediction interval for a Binomial random variable?

Assume $Y \sim \mathsf{Binom}(n, p)$, and we observe $y$ (drawn from $Y$). The $n$ is known.

Our goal is to obtain a 95% prediction interval for a new draw from $Y$.

The point estimate is $n\hat{p}$, where $\hat{p}=\frac{y}{n}$. A confidence interval for $\hat{p}$ is straightforward, but I cannot find a formula for a prediction interval for $Y$. If we knew $p$ (rather than $\hat{p}$), then a 95% prediction interval just involves finding the quantiles of a binomial. Is there something obvious I'm overlooking?

Best Answer

Ok, let's try this. I'll give two answers - the Bayesian one, which is in my opinion simple and natural, and one of the possible frequentist ones.

Bayesian solution

We assume a Beta prior on $p$, i,e., $p \sim Beta(\alpha,\beta)$, because the Beta-Binomial model is conjugate, which means that the posterior distribution is also a Beta distribution with parameters $\hat{\alpha}=\alpha+k,\hat{\beta}=\beta+n-k$, (I'm using $k$ to denote the number of successes in $n$ trials, instead of $y$). Thus, inference is greatly simplified. Now, if you have some prior knowledge on the likely values of $p$, you could use it to set the values of $\alpha$ and $\beta$, i.e., to define your Beta prior, otherwise you could assume a uniform (noninformative) prior, with $\alpha=\beta=1$, or other noninformative priors (see for example here). In any case, your posterior is

$Pr(p|n,k)=Beta(\alpha+k,\beta+n-k)$

In Bayesian inference, all that matters is the posterior probability, meaning that once you know that, you can make inferences for all other quantities in your model. You want to make inference on the observables $y$: in particular, on a vector of new results $\mathbf{y}=y_1,\dots,y_m$, where $m$ is not necessarily equal to $n$. Specifically, for each $j=0,\dots,m$, we want to compute the probability of having exactly $j$ successes in the next $m$ trials, given that we got $k$ successes in the preceding $n$ trials; the posterior predictive mass function:

However, our Binomial model for $Y$ means that, conditionally on $p$ having a certain value, the probability of having $j$ successes in $m$ trials doesn't depend on past results: it's simply

$f(j|m,p)=\binom{j}{m} p^j(1-p)^j$

Thus the expression becomes

$Pr(j|m,n,k)=\int_0^1 \binom{j}{m} p^j(1-p)^j Pr(p|n,k)dp=\int_0^1 \binom{j}{m} p^j(1-p)^j Beta(\alpha+k,\beta+n-k)dp$

The result of this integral is a well-known distribution called the Beta-Binomial distribution: skipping the passages, we get the horrible expression

$Pr(j|m,n,k)=\frac{m!}{j!(m-j)!}\frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+k)\Gamma(\beta+n-k)}\frac{\Gamma(\alpha+k+j)\Gamma(\beta+n+m-k-j)}{\Gamma(\alpha+\beta+n+m)}$

Our point estimate for $j$, given quadratic loss, is of course the mean of this distribution, i.e.,

$\mu=\frac{m(\alpha+k)}{(\alpha+\beta+n)}$

Now, let's look for a prediction interval. Since this is a discrete distribution, we don't have a closed form expression for $[j_1,j_2]$, such that $Pr(j_1\leq j \leq j_2)= 0.95$. The reason is that, depending on how you define a quantile, for a discrete distribution the quantile function is either not a function or is a discontinuous function. But this is not a big problem: for small $m$, you can just write down the $m$ probabilities $Pr(j=0|m,n,k),Pr(j\leq 1|m,n,k),\dots,Pr(j \leq m-1|m,n,k)$ and from here find $j_1,j_2$ such that

$Pr(j_1\leq j \leq j_2)=Pr(j\leq j_2|m,n,k)-Pr(j < j_1|m,n,k)\geq 0.95$

Of course you would find more than one couple, so you would ideally look for the smallest $[j_1,j_2]$ such that the above is satisfied. Note that

$Pr(j=0|m,n,k)=p_0,Pr(j\leq 1|m,n,k)=p_1,\dots,Pr(j \leq m-1|m,n,k)=p_{m-1}$

are just the values of the CMF (Cumulative Mass Function) of the Beta-Binomial distribution, and as such there is a closed form expression, but this is in terms of the generalized hypergeometric function and thus is quite complicated. I'd rather just install the R package extraDistr and call pbbinom to compute the CMF of the Beta-Binomial distribution. Specifically, if you want to compute all the probabilities $p_0,\dots,p_{m-1}$ in one go, just write:

library(extraDistr)  
jvec <- seq(0, m-1, by = 1) 
probs <- pbbinom(jvec, m, alpha = alpha + k, beta = beta + n - k)

where alpha and beta are the values of the parameters of your Beta prior, i.e., $\alpha$ and $\beta$ (thus 1 if you're using a uniform prior over $p$). Of course it would all be much simpler if R provided a quantile function for the Beta-Binomial distribution, but unfortunately it doesn't.

Practical example with the Bayesian solution

Let $n=100$, $k=70$ (thus we initially observed 70 successes in 100 trials). We want a point estimate and a 95%-prediction interval for the number of successes $j$ in the next $m=20$ trials. Then

n <- 100
k <- 70
m <- 20
alpha <- 1
beta  <- 1

where I assumed a uniform prior on $p$: depending on the prior knowledge for your specific application, this may or may not be a good prior. Thus

bayesian_point_estimate <- m * (alpha + k)/(alpha + beta + n) #13.92157

Clearly a non-integer estimate for $j$ doesn't make sense, so we could just round to the nearest integer (14). Then, for the prediction interval:

jvec <- seq(0, m-1, by = 1)
library(extraDistr)
probabilities <- pbbinom(jvec, m, alpha = alpha + k, beta = beta + n - k)

The probabilities are

> probabilities
 [1] 1.335244e-09 3.925617e-08 5.686014e-07 5.398876e-06
 [5] 3.772061e-05 2.063557e-04 9.183707e-04 3.410423e-03
 [9] 1.075618e-02 2.917888e-02 6.872028e-02 1.415124e-01
[13] 2.563000e-01 4.105894e-01 5.857286e-01 7.511380e-01
[17] 8.781487e-01 9.546188e-01 9.886056e-01 9.985556e-01

For an equal-tail probabilities interval, we want the smallest $j_2$ such that $Pr(j\leq j_2|m,n,k)\ge 0.975$ and the largest $j_1$ such that $Pr(j < j_1|m,n,k)=Pr(j \le j_1-1|m,n,k)\le 0.025$. This way, we will have

$Pr(j_1\leq j \leq j_2|m,n,k)=Pr(j\leq j_2|m,n,k)-Pr(j < j_1|m,n,k)\ge 0.975-0.025=0.95$

Thus, by looking at the above probabilities, we see that $j_2=18$ and $j_1=9$. The probability of this Bayesian prediction interval is 0.9778494, which is larger than 0.95. We could find shorter intervals such that $Pr(j_1\leq j \leq j_2|m,n,k)\ge 0.95$, but in that case at least one of the two inequalities for the tail probabilities wouldn't be satisfied.

Frequentist solution

I'll follow the treatment of Krishnamoorthy and Peng, 2011. Let $Y\sim Binom(m,p)$ and $X\sim Binom(n,p)$ be independently Binominally distributed. We want a $1-2\alpha-$prediction interval for $Y$, based on a observation of $X$. In other words we look for $I=[L(X;n,m,\alpha),U(X;n,m,\alpha)]$ such that:

$Pr_{X,Y}(Y\in I)=Pr_{X,Y}(L(X;n,m,\alpha)\leq Y\leq U(X;n,m,\alpha)]\geq 1-2\alpha$

The "$\geq 1-2\alpha$" is due to the fact that we are dealing with a discrete random variable, and thus we cannot expect to get exact coverage...but we can look for an interval which has always at least the nominal coverage, thus a conservative interval. Now, it can be proved that the conditional distribution of $X$ given $X+Y=k+j=s$ is hypergeometric with sample size $s$, number of successes in the population $n$ and population size $n+m$. Thus the conditional pmf is

$Pr(X=k|X+Y=s,n,n+m)=\frac{\binom{n}{k}\binom{m}{s-k}}{\binom{m+n}{s}}$

The conditional CDF of $X$ given $X+Y=s$ is thus

$Pr(X\leq k|s,n,n+m)=H(k;s,n,n+m)=\sum_{i=0}^k\frac{\binom{n}{i}\binom{m}{s-i}}{\binom{m+n}{s}}$

The first great thing about this CDF is that it doesn't depend on $p$, which we don't know. The second great thing is that it allows to easily find our PI: as a matter of fact, if we observed a value $k$ of X, then the $1-\alpha$ lower prediction limit is the smallest integer $L$ such that

$Pr(X\geq k|k+L,n,n+m)=1-H(k-1;k+L,n,n+m)>\alpha$

correspondingly, the the $1-\alpha$ upper prediction limit is the largest integer such that

$Pr(X\leq k|k+U,n,n+m)=H(k;k+U,n,n+m)>\alpha$

Thus, $[L,U]$ is a prediction interval for $Y$ of coverage at least $1-2\alpha$. Note that when $p$ is close to 0 or 1, this interval is conservative even for large $n$, $m$, i.e., its coverage is quite larger than $1-2\alpha$.

Practical example with the Frequentist solution

Same setting as before, but we don't need to specify $\alpha$ and $\beta$ (there are no priors in the Frequentist framework):

n <- 100
k <- 70
m <- 20

The point estimate is now obtained using the MLE estimate for the probability of successes, $\hat{p}=\frac{k}{n}$, which in turns leads to the following estimate for the number of successes in $m$ trials:

frequentist_point_estimate <- m * k/n #14

For the prediction interval, the procedure is a bit different. We look for the largest $U$ such that $Pr(X\leq k|k+U,n,n+m)=H(k;k+U,n,n+m)>\alpha$, thus let's compute the above expression for all $U$ in $[0,m]$:

jvec <- seq(0, m, by = 1)
probabilities <- phyper(k,n,m,k+jvec)

We can see that the largest $U$ such that the probability is still larger than 0.025 is

jvec[which.min(probabilities > 0.025) - 1] # 18

Same as for the Bayesian approach. The lower prediction bound $L$ is the smallest integer such that $Pr(X\geq k|k+L,n,n+m)=1-H(k-1;k+L,n,n+m)>\alpha$, thus

probabilities <- 1-phyper(k-1,n,m,k+jvec)
jvec[which.max(probabilities > 0.025) - 1] # 8

Thus our frequentist "exact" prediction interval is $[L,U]=[8,18]$.

Related Solutions

Gamma Distribution – Confidence Interval for a Random Sample from Gamma Distribution

Edit: Time to add details, I think. The OP has long since worked it out but hasn't taken the invitation to write up a more complete solution, so I shall, in the interest of having a full answer to the question.

A pivot is a function of the data and the statistic whose distribution doesn't depend on the value of the statistic.

So consider:

(1) what would the distribution of a statistic consisting of the sum of the observations ($T=\sum_i x_i$) be?

A sum of $n$ i.i.d. $\text{gamma}(\alpha,\theta)$ random variables has the $\text{gamma}(n\alpha,\theta)$ distribution (for the shape-rate form of the gamma).

Here $n=6$ and $\alpha=2$, so the sum, $T$ has a $\text{gamma}(12,\theta)$ distribution.

(2) Note that the distribution in (1) does depend on $\theta$ and the form of the statistic doesn't. You need to modify the statistic ($Q=f(T,\theta)$) in such a way that both of those change. (This part is trivial!)

Let $Q=T/\theta$. Then $Q\sim \text{gamma}(12,1)$.

$Q$ satisfies the conditions required for a pivotal quantity.

(3) Once you have a pivotal quantity (i.e. $Q$), write down an interval for the pivotal quantity (in the form of a pair of inequalities, $a< Q< b$) with the given coverage. Since the distribution doesn't depend on the parameter, this interval is always the same (at a given sample size) no matter what the value of $\theta$.

One such interval is $(a,b)$, where $P(a<Q<b)=0.95$, when $a$ is the 0.025 point of the $\text{gamma}(12,1)$ distribution and $b$ is the 0.975 point.

(4) Now write the interval involving the pivotal quantity back in terms of the data and $\theta$. Back out an interval for the parameter, for which the corresponding probability statement must still hold (keeping in mind that the random quantity is not $\theta$ but the interval).

$P(a<T/\theta<b)=0.95$ implies $P(1/b < \theta/T < 1/a)=0.95$, so $P(T/b < \theta < T/a)=0.95$. Therefore $(T/b,T/a)$ is a 95% interval for $\theta$.

Our observed total, $t = 4.91$. The 0.025 point of a gamma(12,1) is 6.2006 and the 0.975 point is 19.682. Hence a 95% interval for $\theta$ is (4.91/19.682,4.91/6.200)
= $(0.249, 0.792)$.

Solved – Can we make probabilistic statements with prediction intervals

First, on the use of the word probability, frequentists don't have a problem with using the word probability when predicting something where the random piece has not taken place yet. We don't like the word probability for a confidence interval because the true parameter is not changing (we are assuming it is a fixed, though unknown, value) and the interval is fixed because it is based on data that we have already collected. For example if our data comes from a random sample of adult male humans and x is their height and y is their weight and we fit the general regression model then we don't use probability when talking about the confidence intervals. But if I want to talk about what is the probability of a 65 inch tall male chosen at random from all 65 inch tall males having a weight within a certain interval, then it is fine to use probability in that context (because the random selection has not yet been made, so probability makes sense).

So I would say that the answer to the bonus question is "Yes". If we knew enough information, then we could compute the probability of seeing a y value within an interval (or find an interval with the desired probability).

For your statement labeled "1." I would say that it is OK if you use a word like "approximate" when talking about the interval or probability. Like you mention in the bonus question, we can decompose the uncertainty into a piece about the center of the prediction and a piece about the randomness around the true mean. When we combine these to cover all our uncertainty (and assuming we have the model/normality correct) we have an interval that will tend to be too wide (though can be too narrow as well), so the probability of a new randomly chosen point falling into the prediction interval is not going to be exactly 95%. You can see this by simulation. Start with a known regression model with all the parameters known. Choose a sample (across many x values) from this relationship, fit a regression, and compute the prediction interval(s). Now generate a large number of new data points from the true model again and compare them to the prediction intervals. I did this a few times using the following R code:

x <- 1:25
y <- 5 + 3*x + rnorm(25, 0, 5)
plot(x,y)

fit <- lm(y~x)
tmp <- predict(fit, data.frame(x=1:25), interval='prediction')

sapply( 1:25, function(x){ 
    y <- rnorm(10000, 5+3*x, 5)
    mean( tmp[x,2] <= y & y <= tmp[x,3] )
})

I ran the above code a few times (around 10, but I did not keep careful count) and most of the time the proportion of new values falling in the intervals ranged in the 96% to 98% range. I did have one case where the estimated standard deviation was very low that the proportions were in the 93% to 94% range, but all the rest were above 95%. So I would be happy with your statement 1 with the change to "approximately 95%" (assuming all the assumptions are true, or close enough to be covered in the approximately).

Similarly, statement 2 needs an "approximately" or similar, because to cover our uncertainty we are capturing on average more than the 95%.