Solved – How to calculate confidence intervals for ratios


Consider an experiment that outputs a ratio $X_i$ between 0 and 1. How this ratio is obtained should not be relevant in this context. It was elaborated in a previous version of this question, but removed for clarity after a discussion on meta.

This experiment is repeated $n$ times, where $n$ is small (about 3-10). The $X_i$ are assumed to be independent and identically distributed. From these we estimate the mean by calculating the average $\overline X$, but how do we calculate a corresponding confidence interval $[U,V]$?

When using the standard approach for calculating confidence intervals, $V$ is sometimes larger than 1. However, my intuition is that the correct confidence interval…

  1. … should lie within the range 0 to 1
  2. … should get smaller with increasing $n$
  3. … should be roughly of the same order as the one calculated using the standard approach
  4. … should be calculated by a mathematically sound method

These are not absolute requirements, but I would at least like to understand why my intuition is wrong.

Calculations based on existing answers

In the following, the confidence intervals resulting from the existing answers are compared for $\{X_i\} = \{0.985,0.986,0.935,0.890,0.999\}$.

Standard Approach (aka "School Math")

$\overline X = 0.959$, with sample standard deviation $s = 0.046$ and standard error $s/\sqrt{n} = 0.0204$; with $t_{0.995,\,4} = 4.60$, the 99% confidence interval is thus $[0.865,1.053]$. This contradicts intuition 1.
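
For reference, a minimal R sketch of this standard calculation (the variable names are mine; the five example values are hard-coded):

# standard t-based 99% CI, ignoring that the data live in (0,1)
x <- c(0.985, 0.986, 0.935, 0.890, 0.999)
m    <- mean(x)                     # 0.959
se   <- sd(x) / sqrt(length(x))     # about 0.0204
crit <- qt(.995, df = length(x) - 1)
c(m - crit * se, m + crit * se)     # about [0.865, 1.053]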

Cropping (suggested by @soakley in the comments)

Just using the standard approach and then reporting $[0.865,1.000]$ as the result is easy to do. But are we allowed to do that? I am not yet convinced that the lower boundary should simply stay the same (–> 4.)

Logistic Regression Model (suggested by @Rose Hartman)

Transformed data (logits): $\{4.18, 4.25, 2.67, 2.09, 6.91\}$
This gives a 99% interval of $[0.173,7.87]$ on the logit scale; transforming back yields $[0.543,0.999]$.
Obviously, the 6.91 is an outlier in the transformed data while the corresponding 0.999 is not in the untransformed data, resulting in a very large confidence interval. (–> 3.)
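
These numbers can be reproduced with a short R sketch mirroring the approach from the answer below (qlogis/plogis are the built-in logit and inverse-logit functions):

# logit-transform, compute a t-based 99% CI, transform back
x <- c(0.985, 0.986, 0.935, 0.890, 0.999)
lx   <- qlogis(x)                        # log(x / (1 - x))
se   <- sd(lx) / sqrt(length(lx))
crit <- qt(.995, df = length(lx) - 1)
plogis(mean(lx) + c(-1, 1) * crit * se)  # about [0.543, 0.999]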

Binomial proportion confidence interval (suggested by @Tim)

The approach looks quite good, but unfortunately it does not fit the experiment. Just combining the results and interpreting them as one large repeated Bernoulli experiment, as suggested by @ZahavaKor, yields the following:

$985+986+890+935+999 = 4795$ out of $5 \times 1000$ trials in total.
Feeding this into the Adj. Wald calculator gives $[0.9511,0.9657]$. This does not seem realistic, because not a single $X_i$ lies inside that interval! (–> 3.)
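
For reproducibility, here is a sketch of the adjusted Wald (Agresti-Coull) calculation in R, assuming the calculator used the 99% level (which matches the quoted numbers):

# adjusted Wald (Agresti-Coull) 99% interval for 4795 successes in 5000 trials
x <- 4795; n <- 5000
z <- qnorm(.995)
n_adj <- n + z^2
p_adj <- (x + z^2 / 2) / n_adj
p_adj + c(-1, 1) * z * sqrt(p_adj * (1 - p_adj) / n_adj)  # about [0.951, 0.966]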

Bootstrapping (suggested by @soakley)

With $n=5$ there are $5^5 = 3125$ possible resamples (with replacement). Taking the middle $\frac{3093}{3125} \approx 0.99$ of the resample means, we get $[0.91,0.99]$. This does not look that bad, though I would expect a larger interval (–> 3.). However, by construction it can never be larger than $[\min(X_i),\max(X_i)]$. Thus for a small sample it will tend to grow rather than shrink with increasing $n$ (–> 2.). This is at least what happens with the samples given above.
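
A sketch of that exhaustive bootstrap in R (all $5^5 = 3125$ resamples with replacement, keeping the central 3093 means as described above):

# exhaustive bootstrap over all 5^5 = 3125 resamples with replacement
x <- c(0.985, 0.986, 0.935, 0.890, 0.999)
resamples  <- do.call(expand.grid, rep(list(x), 5))  # one row per resample
boot_means <- sort(rowMeans(resamples))
k <- (length(boot_means) - 3093) / 2                 # drop 16 means from each tail
c(boot_means[k + 1], boot_means[length(boot_means) - k])  # roughly [0.91, 0.99]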

Best Answer

First, to clarify, what you're dealing with is not quite a binomial distribution, as your question suggests (you refer to it as a Bernoulli experiment). Binomial distributions are discrete --- the outcome is either success or failure. Your outcome is a ratio each time you run your experiment, not a set of successes and failures that you then calculate one summary ratio on. Because of that, methods for calculating a binomial proportion confidence interval will throw away a lot of your information. And yet you're correct that it's problematic to treat this as though it's normally distributed since you can get a CI that extends past the possible range of your variable.

I recommend thinking about this in terms of logistic regression. Run a logistic regression model with your ratio variable as the outcome and no predictors. The intercept and its CI will give you what you need in logits, and then you can convert back to proportions. You can also just do the logit transformation yourself, calculate the CI, and then convert back to the original scale (a sketch of the regression version follows the output at the end). My python is terrible, but here's how you could do that in R:

set.seed(24601)
data <- rbeta(100, 10, 3)
hist(data)

histogram of raw data

data_logits <- log(data/(1-data)) # logit transform: maps (0,1) onto the real line
hist(data_logits)

histogram of logit transformed data

# calculate CI for the transformed data
mean_logits <- mean(data_logits)
sd_logits <- sd(data_logits)
n <- length(data_logits)
crit_t <- qt(.995, df = n-1) # critical t value for a 99% CI
ci_lo_logits <- mean_logits - crit_t * sd_logits/sqrt(n)
ci_hi_logits <- mean_logits + crit_t * sd_logits/sqrt(n)

# convert back to ratio
mean_ratio <- exp(mean_logits)/(1 + exp(mean_logits))
ci_lo <- exp(ci_lo_logits)/(1 + exp(ci_lo_logits))
ci_hi <- exp(ci_hi_logits)/(1 + exp(ci_hi_logits))

Here are the lower and upper bounds on a 99% CI for these data:

> ci_lo
[1] 0.7738327
> ci_hi
[1] 0.8207924
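
If you prefer to let the regression machinery do it, as mentioned above, here is a sketch of the intercept-only model on the same simulated data. I'm using a quasi-binomial family so that glm accepts a continuous proportion as the outcome, and a Wald interval on the link scale via confint.default; the numbers will differ slightly from the manual calculation above because the variance assumptions are not identical.

# intercept-only model: the intercept estimates the overall proportion on the logit scale
fit <- glm(data ~ 1, family = quasibinomial)
ci_logit <- confint.default(fit, level = 0.99)  # Wald 99% CI for the intercept (logits)
plogis(ci_logit)                                # back-transform to the proportion scale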