[Math] estimate population percentage within an interval, given a small sample

probability-distributions, sampling, statistics

Given a small sample from a normally-distributed population, how do I calculate the confidence that a specified percentage of the population is within some bounds [A,B]?

To make it concrete, if I get a sample of [50,51,52] from a normally-distributed set, I should be able to calculate a fairly high confidence that 50% of the population lies within the range of 0-100, even with such a small sample.

This is certainly related to the "tolerance interval", but differs in an important way. In all of the examples I can find for tolerance intervals, the required percentile and confidence are given, and the interval is found. In my problem, the interval and percentile are given, and I need to find the confidence.

The relevant equation is this one (Guttman 1970):

$$1 - \gamma = P\left[P(X \geqq t_0) \geqq 1 - p\right] = P\left[T_{n-1}^*(\sqrt n z_p) \leqq \sqrt n K\right]$$

With definitions:

  • $1 - \gamma$ is the confidence
  • $n$ is the number of samples
  • $100p$ is the percentage of the population required to be within the interval, as estimated from the sample mean and sample variance.
  • $t_0 = \bar{x} - K s$, where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation
  • $z_p$ is the $(1 - p) 100$th percentile of the standard normal distribution
  • $T_v^*(\delta)$ is the noncentral Student's t distribution with $v$ degrees of freedom and noncentrality parameter $\delta$.

This solves the one-sided problem, but I'm having trouble extending this to the two-sided problem. In confidence-interval land, I'd use the fact that $P(t_1 \leqq X \leqq t_2) = 1 - P(t_1 \gt X) - P(X \gt t_2)$ to break this into two one-sided problems, but in tolerance-interval land I need to relate these back to the confidence ($1-\gamma$), and I don't see how.

$$1 - \gamma = P\left[P(t_1 \leqq X \leqq t_2) \geqq 1 - p\right] = ??? $$

If I attempt to turn this into two one-sided problems:

$$1 - \gamma = P\left[1 - P(X \lt t_1) - P(X \gt t_2) \geqq 1 - p\right] = ??? $$

And I'm utterly stuck there. I don't see how to relate this back to the one-sided tolerance interval solution.
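For reference, the two-sided quantity above can at least be estimated numerically. The sketch below is not the closed form the question asks for: it is a Monte Carlo estimate under the assumption that the bounds are written as $t_1 = \bar{x} - K_1 s$ and $t_2 = \bar{x} + K_2 s$ with the multipliers treated as fixed in advance. The function name `two_sided_confidence` and the parameters `k1`, `k2`, `content` are hypothetical names introduced for this illustration. It relies on the fact that the distribution of the covered fraction does not depend on the true mean or standard deviation, so standard normal samples suffice.

```python
import numpy as np
from scipy import stats

def two_sided_confidence(k1, k2, n, content, trials=200_000, seed=0):
    """Estimate P[ P(xbar - k1*s <= X <= xbar + k2*s) >= content ] by simulation.

    k1, k2 are hypothetical fixed multipliers (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    # Coverage is invariant to location and scale, so draw from N(0, 1).
    samples = rng.standard_normal((trials, n))
    xbar = samples.mean(axis=1)
    s = samples.std(axis=1, ddof=1)  # sample standard deviation
    # True fraction of the population inside each simulated interval.
    covered = stats.norm.cdf(xbar + k2 * s) - stats.norm.cdf(xbar - k1 * s)
    return float(np.mean(covered >= content))

if __name__ == "__main__":
    # Sample [50, 51, 52]: xbar = 51, s = 1, so [0, 100] is xbar - 51*s .. xbar + 49*s.
    print(two_sided_confidence(k1=51, k2=49, n=3, content=0.5))
```

For the sample [50,51,52] and the interval 0-100, this treats the observed multipliers 51 and 49 as if they had been chosen before sampling, which is a simplification.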


I'm not certain this helps in understanding the question, but it might, so I'm including it as an addendum.

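In scipy, I'm able to pretty easily calculate $K$ given $p$, $\gamma$, and $n$ as:

from math import sqrt
from scipy import stats

def K(p, gamma, n):
    # Solve 1 - gamma = P[T <= sqrt(n) K] for K using the noncentral-t quantile.
    return stats.nct.ppf(1 - gamma, n - 1, sqrt(n) * stats.norm.ppf(1 - p)) / sqrt(n)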

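I'm also able to find $\gamma$ given $K$, $p$, and $n$ as:

from math import sqrt
from scipy import stats

def gamma(p, n, K):
    z_p = stats.norm.ppf(1 - p)  # (1 - p) quantile of the standard normal
    # gamma is the upper tail of the noncentral t beyond sqrt(n) K.
    return 1 - stats.nct.cdf(sqrt(n) * K, n - 1, sqrt(n) * z_p)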
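As a sanity check on the one-sided formula, the noncentral-t expression can be compared against a direct simulation. This is a minimal sketch, assuming $t_0 = \bar{x} - K s$ with $s$ the sample standard deviation; the names `formula_confidence` and `simulated_confidence` are hypothetical and introduced only for this check. Because the result does not depend on the true mean or scale, the simulation draws from a standard normal.

```python
import numpy as np
from scipy import stats

def formula_confidence(p, n, k):
    """1 - gamma from the noncentral-t expression in the question."""
    z_p = stats.norm.ppf(1 - p)
    return float(stats.nct.cdf(np.sqrt(n) * k, n - 1, np.sqrt(n) * z_p))

def simulated_confidence(p, n, k, trials=100_000, seed=0):
    """Monte Carlo estimate of P[ P(X >= xbar - k*s) >= 1 - p ]."""
    rng = np.random.default_rng(seed)
    # The probability is location/scale invariant, so N(0, 1) is enough.
    samples = rng.standard_normal((trials, n))
    t0 = samples.mean(axis=1) - k * samples.std(axis=1, ddof=1)
    content = stats.norm.sf(t0)  # true P(X >= t0) for each simulated sample
    return float(np.mean(content >= 1 - p))

if __name__ == "__main__":
    p, n, k = 0.1, 10, 2.355  # illustrative values, not from the question
    print(formula_confidence(p, n, k), simulated_confidence(p, n, k))
```

The two printed numbers should agree to within Monte Carlo noise if the formula and the definition of $t_0$ assumed here match.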

Much less important, but is this a valid simplification of Guttman's formula?

$$1 - \gamma = P\left[P(X \geqq t_0) \geqq 1 - p\right] = P\left[T_{n-1}^*(\sqrt n z_p) \leqq \sqrt n K\right]$$
$$\gamma = P\left[P(X \geqq t_0) \lt 1 - p\right] = P\left[T_{n-1}^*(\sqrt n z_p) \gt \sqrt n K\right]$$
$$\gamma = P\left[P(X \lt t_0) \lt p\right] = P\left[T_{n-1}^*(\sqrt n z_p) \gt \sqrt n K\right]$$

If so, this form seems way easier to understand, to me.

Best Answer

One possibility: treat each sample as a success (inside $[A,B]$) or a failure (outside), and estimate the success proportion in the underlying population. You calculate the Agresti-Coull interval (see Brown, Cai, & DasGupta 2001) with the required parameters.

On the bright side, this is easy and makes no assumptions about the type of distribution. On the downside, it doesn't take advantage of the extra information about where the samples fall within the interval. If you know something about how they are distributed, that might get you a more precise estimate from the data you have.
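A minimal sketch of the Agresti-Coull calculation, using the standard form of the interval: add $z^2$ pseudo-trials, half of them successes, then apply the usual normal-approximation interval to the adjusted proportion. The function name `agresti_coull_interval` is hypothetical, and clipping the result to $[0, 1]$ is a choice made for this sketch rather than part of the formula.

```python
from math import sqrt
from scipy import stats

def agresti_coull_interval(successes, trials, confidence=0.95):
    """Agresti-Coull interval for a binomial proportion."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # two-sided normal quantile
    n_adj = trials + z**2                          # adjusted number of trials
    p_adj = (successes + z**2 / 2) / n_adj         # adjusted proportion
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clip to [0, 1] (a choice of this sketch; the raw interval can overshoot).
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

if __name__ == "__main__":
    # All 3 of the samples [50, 51, 52] fall inside [0, 100].
    print(agresti_coull_interval(successes=3, trials=3))
```

With only three samples the interval is wide, which illustrates the downside mentioned above: the method ignores how far inside the bounds the samples sit.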

Edit: for a brief overview see http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Agresti-Coull_Interval
