Solved – Prediction interval for a future proportion of successes under Binomial setting

beta-binomial distributionbeta-regressionbinomial distributionconfidence intervalgamlss

Suppose I fit a Binomial regression and obtain the point estimates and variance-covariance matrix of the regression coefficients. That will allow me to get a CI for the expected proportion of successes in a future experiment, $p$, but I need a CI for the observed proportion. There have been a few related answers posted, including simulation (suppose I don't want to do that) and a link to Krishnamoorthya et al (which does not quite answer my question).

My reasoning is as follows: if we use just the Binomial model, we are forced to assume that $p$ is sampled from Normal distribution (with the corresponding Wald CI) and therefore it is impossible to get CI for the observed proportion in closed form. If we assume that $p$ is sampled from beta distribution, then things are much easier because the count of successes will follow Beta-Binomial distribution. We will have to assume that there is no uncertainty in the estimated beta parameters, $\alpha$ and $\beta$.

There are three questions:

1) A theoretical one: is it ok to use just the point estimates of beta parameters? I know that to construct a CI for future observation in multiple linear regression

$Y = x'\beta + \epsilon, \epsilon \sim N(0, \sigma^2)$

they do that w.r.t. error term variance, $\sigma^2$. I take it (correct me if I am wrong) that the justification is that in practice $\sigma^2$ is estimated with a far greater precision than the regression coefficients and we won't gain much by trying to incorporate the uncertainty of $\sigma^2$. Is a similar justification applicable to the estimated beta parameters, $\alpha$ and $\beta$?

2) What package is better (R: gamlss-bb, betareg, aod?; I also have access to SAS).

3) Given the estimated beta parameters, is there an (approximate) shortcut to get the quantiles (2.5%, 97.5%) for the count of future successes or, better yet, for the proportion of future successes under Beta-Binomial distribution.

Best Answer

I will address all 3 parts to the question.

There are two conflated issues, first is the method you use to fit a regression model in this case. The second is how to interval estimates from your estimates to predict a new estimate.

if your response variables are Binomially distributed you would typically use either a logistic regression or a probit regression (glm with normal cdf as a link function).

If you do a logistic regression, take the response to be the ratio of the observed counts divided by the the known upper bound i.e. $y_i/n_i$. Then take your predictors/covariates and put these into your R call to a glm function. The returned object has everything you need to do the rest of your calculations.

x<- rnorm(100, sd=2)
prob_true <- 1/(1+exp(-(1+5*x)))
counts <- rbinom(100, 50,prob_true)
print(d.AD <- data.frame(counts,x))
glm.D93 <- glm(counts/50 ~ x, family = binomial() )

For a linear regression model the formula for a prediction interval is:

$\hat{y}_i \pm t_{n-p}s_y\sqrt{1+\frac{1}{n}+\frac{(x_i-\bar{x})^2}{(n-1)s^2_x}}$

You can use the linear regression model as an approximation for the glm. To do this you would linear regression formula for the linear combination of predictors before you do the inverse link transformation to get the probabilities back on the 0-1 scale. The code to do this is baked into the predict.glm() R function. Here is some example code that will also make a nice plot. (EDIT: This code is for confidence interval, not for prediction interval)

y_hat <- predict(glm.D93, type="link", se.fit=TRUE)
t_np<- qt(.975, 100-2, ncp=0)

ub <- y_hat$fit + t_np * y_hat$se.fit
lb <- y_hat$fit - t_np * y_hat$se.fit

point <- y_hat$fit

p_hat <- glm.D93$family$linkinv(point)
p_hat_lb <- glm.D93$family$linkinv(lb)
p_hat_ub <- glm.D93$family$linkinv(ub)

plot(x,p_hat)
points(x, p_hat_ub, col='red')
points(x, p_hat_lb, col='blue')

You can do the same thing for any glm, e.g. Poisson, inverse Gaussian, gamma, etc. In each case do the prediction interval on the scale of the linear combination of the predictors. After you get the two end points of the prediction interval you convert these end points via the inverse link. For each of the glms I mentioned the inverse link might be different than the logit case I wrote here. Hope this helps.

Related Solutions

Solved – Probability distribution for a binomial proportion ‘derived’ from serially dependent data

You could model this data using BUGS (or JAGS or Stan), which would allow you to model the data with any distribution you want (binomial model for observations and beta for the parameters (p) of a binomial distribution) and in any way you want = basically you build model for p taking into account dependence between days + treatment. See the wealth of examples at:

BUGS book examples http://www.mrc-bsu.cam.ac.uk/software/bugs/the-bugs-project-the-bugs-book/bugs-book-examples/
BUGS Volume I-III examples http://www.mrc-bsu.cam.ac.uk/software/bugs/the-bugs-project-the-bugs-book/bugs-book-examples/

Binomial Distribution – Accounting for Uncertainty of p When Estimating Mean of Binomial Distribution

There are several problems with your approach. First, you want to use confidence intervals for something that they were not designed for. If $p$ varies, then confidence interval will not show you how does it vary. Check Why does a 95% Confidence Interval (CI) not imply a 95% chance of containing the mean? to learn more about confidence intervals. Moreover, using normal approximation for binomial proportion and it's confidence intervals is not a good idea, as described by Brown et al (2001).

In fact, from your description it sounds like you want to estimate the Bayesian credible interval, i.e. interval that will contain certain fraction of $p$'s distribution. Yes, I said Bayesian, since in fact you already defined your problem as a Bayesian model. You say that you assume that $p$ is a random variable, while in frequentist setting $p$ would be a fixed parameter. If you already assumed it, why not use a Bayesian model for your data? You would be using beta-binomial model (see also An introduction to the Beta-Binomial model paper by Dan Navarro and Amy Perfors). In cases like this it is extremely easy to estimate such model. We can define it as follows:

$$ X \sim \mathrm{Binomial}(N, p) \\ p \sim \mathrm{Beta}(\alpha, \beta) $$

so, your data $X$ follows binomial distribution parametrized by $N$ and $p$, where $p$ is a random variable. We assume beta distribution with parameters $\alpha$ and $\beta$ as a prior for $p$. I guess that if you wanted to use frequentist method, you do not have any prior knowledge about possible distribution of $p$, so you would choose "uninformative" prior parametrized by $\alpha = \beta = 1$, or $\alpha = \beta = 0.5$ (if you prefer, you may translate those parameters to mean and precision, or mean and variance). After updating your prior, posterior distribution of $p$ is simply a beta distribution parametrized by

$$ \alpha' = \alpha + \text{total number of successes} \\ \beta' = \beta + \text{total number of failures} $$

with mean

$$ E(X) = N \frac{\alpha'}{\alpha'+\beta'} $$

To read more about calculating other quantities of this distribution check Wikipedia article on beta-binomial distribution. You can compute credible intervals numerically either by (a) inverting numerically the cumulative distribution function of beta-binomial distribution, or by (b) sampling large number of random values from beta-binomial distribution and then computing sample quantiles from it. Second approach is pretty easy since you only need to sequentially repeat the following procedure:

draw $p$ from beta distribution parametrized by $\alpha'$ and $\beta'$,

draw $x$ from binomial distribution parametrized by $p$ and $N$.

until you draw sample large enough to find it confident for calculating the quantities of interest.

Of course if you know mean and standard deviation of $p$ and you insist on using normal distribution for it, you can use simulation as well, but with using normal distribution for simulating the values of $p$. Below I provide code example in R for such simulation.

R <- 1e5                       # number of samples to draw in simulation
N <- 500                       # known N
mu <- 0.3                      # known mean of p
sigma <- 0.07                  # known standard deviation of p
p <- rnorm(R, mu, sigma)       # simulate p
x <- rbinom(R, N, p)           # simulate X
mean(x)                        # estimate for mean of X
quantile(p*N, c(0.025, 0.975)) # 95% interval estimate for variability of E(X)

Or you can simply take appropriate quantiles using inverse of normal cumulative distribution function and multiply them by $N$. Remember however that this is not a confidence interval, but a credible interval.

Brown, L.D., Cai, T.T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical science, 101-117.

Best Answer

Related Solutions

Solved – Probability distribution for a binomial proportion ‘derived’ from serially dependent data

Binomial Distribution – Accounting for Uncertainty of p When Estimating Mean of Binomial Distribution

Related Question