Solved – Calculate the confidence interval for the mean of a beta distribution

beta distributionmean

Consider a beta distribution for a given set of ratings in [0,1]. After having calculated the mean:

$$
\mu = \frac{\alpha}{\alpha+\beta}
$$

Is there a way to provide a confidence interval around this mean?

Best Answer

While there are specific methods for calculating confidence intervals for the parameters in a beta distribution, I’ll describe a few general methods, that can be used for (almost) all sorts of distributions, including the beta distribution, and are easily implemented in R.

Profile likelihood confidence intervals

Let’s begin with maximum likelihood estimation with corresponding profile likelihood confidence intervals. First we need some sample data:

# Sample size
n = 10

# Parameters of the beta distribution
alpha = 10
beta = 1.4

# Simulate some data
set.seed(1)
x = rbeta(n, alpha, beta)

# Note that the distribution is not symmetrical
curve(dbeta(x,alpha,beta))

Probability density function for the beta distribution.

The real/theoretical mean is

> alpha/(alpha+beta)
0.877193

Now we have to create a function for calculating the negative log likelihood function for a sample from the beta distribution, with the mean as one of the parameters. We can use the dbeta() function, but since this doesn’t use a parametrisation involving the mean, we have have to express its parameters (α and β) as a function of the mean and some other parameter (like the standard deviation):

# Negative log likelihood for the beta distribution
nloglikbeta = function(mu, sig) {
  alpha = mu^2*(1-mu)/sig^2-mu
  beta = alpha*(1/mu-1)
  -sum(dbeta(x, alpha, beta, log=TRUE))
}

To find the maximum likelihood estimate, we can use the mle() function in the stats4 library:

library(stats4)
est = mle(nloglikbeta, start=list(mu=mean(x), sig=sd(x)))

Just ignore the warnings for now. They’re caused by the optimisation algorithms trying invalid values for the parameters, giving negative values for α and/or β. (To avoid the warning, you can add a lower argument and change the optimisation method used.)

Now we have both estimates and confidence intervals for our two parameters:

> est
Call:
mle(minuslogl = nloglikbeta, start = list(mu = mean(x), sig = sd(x)))

Coefficients:
        mu        sig 
0.87304148 0.07129112

> confint(est)
Profiling...
         2.5 %    97.5 %
mu  0.81336555 0.9120350
sig 0.04679421 0.1276783

Note that, as expected, the confidence intervals are not symmetrical:

par(mfrow=c(1,2))
plot(profile(est)) # Profile likelihood plot

Profile likelihood plot for the beta distribution.

(The second-outer magenta lines show the 95% confidence interval.)

Also note that even with just 10 observations, we get very good estimates (a narrow confidence interval).

As an alternative to mle(), you can use the fitdistr() function from the MASS package. This too calculates the maximum likelihood estimator, and has the advantage that you only need to supply the density, not the negative log likelihood, but doesn’t give you profile likelihood confidence intervals, only asymptotic (symmetrical) confidence intervals.

A better option is mle2() (and related functions) from the bbmle package, which is somewhat more flexible and powerful than mle(), and gives slightly nicer plots.

Bootstrap confidence intervals

Another option is to use the bootstrap. It’s extremely easy to use in R, and you don’t even have to supply a density function:

> library(simpleboot)
> x.boot = one.boot(x, mean, R=10^4)
> hist(x.boot)                # Looks good
> boot.ci(x.boot, type="bca") # Confidence interval
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = x.boot, type = "bca")

Intervals : 
Level       BCa          
95%   ( 0.8246,  0.9132 )  
Calculations and Intervals on Original Scale

The bootstrap has the added advantage that it works even if your data doesn’t come from a beta distribution.

Asymptotic confidence intervals

For confidence intervals on the mean, let’s not forget the good old asymptotic confidence intervals based on the central limit theorem (and the t-distribution). As long as we have either a large sample size (so the CLT applies and the distribution of the sample mean is approximately normal) or large values of both α and β (so that the beta distribution itself is approximately normal), it works well. Here we have neither, but the confidence interval still isn’t too bad:

> t.test(x)$conf.int
[1] 0.8190565 0.9268349

For just slightly larges values of n (and not too extreme values of the two parameters), the asymptotic confidence interval works exceedingly well.

Related Question