Solved – Confidence interval for Bernoulli sampling


I have a random sample of Bernoulli random variables $X_1, \dots, X_N$, where the $X_i$ are i.i.d. with $P(X_i = 1) = p$, and $p$ is an unknown parameter.

Obviously, one can find an estimate for $p$: $\hat{p}:=(X_1+\dots+X_N)/N$.

My question is how can I build a confidence interval for $p$?

Best Answer

If the sample mean $\hat{p}$ is not near $1$ or $0$ and the sample size $n$ (the question's $N$) is sufficiently large (i.e. $n\hat{p}>5$ and $n(1-\hat{p})>5$), the sampling distribution of $\hat{p}$ is approximately normal, and the confidence interval can be constructed as:

$$\hat{p}\pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
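
As a minimal sketch, the normal-approximation formula can be written out directly in base R. The function name `wald_ci` and the example counts (40 successes in 100 trials) are assumptions chosen for illustration, not part of the original answer.

```r
# Normal-approximation ("Wald") confidence interval for a binomial proportion.
# x: number of successes, n: number of trials (hypothetical helper).
wald_ci <- function(x, n, conf.level = 0.95) {
  p_hat <- x / n
  z <- qnorm(1 - (1 - conf.level) / 2)          # z_{1 - alpha/2}
  half_width <- z * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - half_width, upper = p_hat + half_width)
}

# Illustrative data: n * p_hat = 40 and n * (1 - p_hat) = 60, both above 5
wald_ci(40, 100)
```

The interval is symmetric around $\hat{p}$ by construction, which is why it can spill outside $[0, 1]$ when $\hat{p}$ is close to a boundary.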
If $\hat{p} = 0$ and $n>30$, the $95\%$ confidence interval is approximately $[0,\frac{3}{n}]$ (Jovanovic and Levy, 1997); by symmetry, for $\hat{p}=1$ it is approximately $[1-\frac{3}{n},1]$. The reference also discusses using $n+1$ and $n+b$ in place of $n$ (the latter to incorporate prior information).
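
The rule of three can be checked numerically. With zero successes, the one-sided $95\%$ upper limit solves $(1-p)^n = 0.05$, so $p = 1 - 0.05^{1/n} \approx -\ln(0.05)/n \approx 3/n$. The sketch below compares the two; the sample sizes are arbitrary illustrative values, and reading the rule as a one-sided bound is the assumption made here.

```r
n <- c(30, 100, 1000)                 # illustrative sample sizes

rule_of_three <- 3 / n

# Exact one-sided 95% upper limit when zero successes are observed
exact_upper <- 1 - 0.05^(1 / n)

# The same limit taken from base R's one-sided Clopper-Pearson interval
cp_upper <- sapply(n, function(k) {
  binom.test(0, k, alternative = "less", conf.level = 0.95)$conf.int[2]
})

cbind(n, rule_of_three, exact_upper, cp_upper)
```

The approximation is slightly conservative (a little wider than the exact limit), and the gap shrinks as $n$ grows.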
Otherwise, Wikipedia provides a good overview and points to Agresti and Coull (1998) and Ross (2003) for details about alternatives to the normal approximation, such as the Wilson score, Clopper-Pearson, or Agresti-Coull intervals. These can be more accurate when the above assumptions about $n$ and $\hat{p}$ are not met.
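
To make one of these alternatives concrete, here is a sketch of the Wilson score interval written out by hand and compared with base R's `prop.test()` (called without continuity correction). The helper name `wilson_ci` and the counts (2 successes in 20 trials) are assumptions for illustration.

```r
# Wilson score interval for x successes in n trials (hypothetical helper).
wilson_ci <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p_hat <- x / n
  denom <- 1 + z^2 / n
  centre <- (p_hat + z^2 / (2 * n)) / denom
  half_width <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / denom
  c(lower = centre - half_width, upper = centre + half_width)
}

# Small sample where n * p_hat = 2, so the normal approximation is doubtful
wilson_ci(2, 20)

# Base R reports the same interval when the continuity correction is off
prop.test(2, 20, correct = FALSE)$conf.int
```

Unlike the normal approximation, the Wilson interval is centred on a value shrunk towards $1/2$, so it stays inside $[0, 1]$.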

The R packages Hmisc and binom provide the functions binconf and binom.confint, which can be used in the following manner:

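set.seed(0)
p <- runif(1, 0, 1)  # the "unknown" success probability
# Simulate 100 Bernoulli(p) trials
X <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(1 - p, p))

library(Hmisc)
# Wilson, exact (Clopper-Pearson) and asymptotic (normal) intervals
binconf(sum(X), length(X), alpha = 0.05, method = "all")

library(binom)
# The argument is named `methods`; "all" reports every interval the package implements
binom.confint(sum(X), length(X), conf.level = 0.95, methods = "all")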

Agresti, A. and Coull, B. A. (1998). "Approximate is better than 'exact' for interval estimation of binomial proportions". The American Statistician 52: 119–126.

Jovanovic, B. D. and Levy, P. S. (1997). "A look at the rule of three". The American Statistician 51(2): 137–139.

Ross, T. D. (2003). "Accurate confidence intervals for binomial proportion and Poisson rate estimation". Computers in Biology and Medicine 33: 509–531.

Related Solutions

Solved – What does a confidence interval with a negative endpoint mean

When the procedure you have used to calculate a confidence interval gives an interval including impossible values, that is an indication of problems with the method. In your case, you have used a normal (central limit theorem-based) CI with so few observations that the approximation is invalid. You can test that easily in R, say:

We have plotted the log-likelihood function for your case. If this is (close to) quadratic, the normal approximation will be good. That is clearly not the case here!

As @Glen_b says in a comment, you need to read up on binomial confidence intervals; see for instance Wikipedia or the question "Binomial confidence interval estimation - why is it not symmetric?".

R code used for the plot:

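# Return the binomial log-likelihood as a function of p, for x successes in n trials
make_loglik <- function(n, x) {
  function(p) dbinom(x, n, p, log = TRUE)
}

loglik <- make_loglik(10, 1)

# plot() on a function evaluates it over [from, to] and draws the curve
plot(loglik, from = 0, to = 1, xlab = "p", col = "blue",
     main = "log likelihood function\nBernoulli, n=10, x=1")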
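
The negative endpoint itself can be reproduced for the case in the plot title (n = 10, x = 1). This is a sketch using base R only; the comparison with `binom.test()`, which returns the Clopper-Pearson interval, is an illustration added here rather than part of the original answer.

```r
n <- 10
x <- 1
p_hat <- x / n
z <- qnorm(0.975)

# Normal-approximation interval: symmetric around p_hat, so it can leave [0, 1]
wald <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

# Exact (Clopper-Pearson) interval: always stays inside [0, 1]
exact <- as.numeric(binom.test(x, n, conf.level = 0.95)$conf.int)

rbind(wald = wald, exact = exact)
```

The lower normal-approximation limit comes out below zero, while the exact interval is asymmetric around $\hat{p}$ and respects the parameter space.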

Solved – Confidence interval for the “shift parameter” of a non-central exponential distribution

Note that it's quite easy to work out the distribution of $K(X)=\min_i X_i$, and so to identify $Q(X,k)=k-K(X)$ as a pivotal quantity.

From there, you can immediately pass to a confidence interval, by placing the limits on $Q$ so that you get 95% of the probability inside them and then manipulating the resulting interval to make $k$ the subject of the pair of inequalities.

You might put all the risk on that one side, as you did, or split the probability evenly, or whatever other approach you choose (as you always can with confidence intervals).

Your choice should (on average) produce the shortest interval, I think, and makes good sense in this case, but it's not the only choice.
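
As a minimal sketch of that recipe, assume the density is $\lambda e^{-\lambda(x-k)}$ for $x>k$ with a known rate $\lambda$ (the known rate, the function name `shift_ci`, and the data are assumptions for illustration, not taken from the question). Then $K(X)-k$ is exponential with rate $n\lambda$, and putting all the risk on one side gives the interval $[K(X)+\ln(\alpha)/(n\lambda),\ K(X)]$.

```r
# One-sided-risk confidence interval for the shift k of a shifted exponential.
# ASSUMPTION: the rate is known; x is the observed sample (hypothetical helper).
shift_ci <- function(x, rate = 1, conf.level = 0.95) {
  n <- length(x)
  k_min <- min(x)                               # K(X), always >= k
  width <- -log(1 - conf.level) / (n * rate)    # solves 1 - exp(-n * rate * width) = conf.level
  c(lower = k_min - width, upper = k_min)
}

# Simulated example with true shift k = 2 and rate 1
set.seed(1)
x <- 2 + rexp(50, rate = 1)
shift_ci(x)
```

Because the minimum can never fall below $k$, the upper limit is the sample minimum itself, and the interval width shrinks at rate $1/n$ rather than $1/\sqrt{n}$.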
