Solved – Confidence interval for poisson distributed data

confidence intervalpoisson distributionstandard deviation

I'd like to calculate the confidence interval for a variable with a lower bound and I can't seem to figure out how to do so. I've seen several similar posts, but none answered my question, at least not directly. I use a toy example here for simplicity.

To be concrete lets consider the distribution (I chose these numbers myself so it kind of looks like a Poisson distribution),

{1,2,3,5,1,2,2,3,7,2,3,4,1,5,7,6,4,1,2,2,3,9,2,1,2,2,3}

which is shown below,

enter image description here

If I calculate the standard deviation and mean for this distribution I get 3.1 and 2.4 respectively, but since the distribution isn't a normal distribution this doesn't really give me the confidence interval. How would I go about calculating the confidence interval properly?

Best Answer

This answer is based on the clarification offered in comments:

I'd like to make a statement such as ... "I am 68% sure the mean is between $3.1−σ_−$ and $3.1+σ_+$", and I want to calculate $σ_+$ and $σ_−$.

I think that at least in the physics world this is called a confidence interval.

Let's take it as given that in response to my question "confidence interval for what?" you responded that want a confidence interval for the mean (and as made clear, not some other interval for values from the distribution).

There's one issue to clear up first - "I am 68% sure the mean is between" isn't really the usual interpretation placed on a confidence interval. Rather, it's that if you repeated the procedure that generated the interval many times, 68% of such intervals would contain the parameter.

Now to address the confidence interval for the mean.

I agree with your calculation of mean and sd of the data:

> x=c(1,2,3,5,1,2,2,3,7,2,3,4,1,5,7,6,4,1,2,2,3,9,2,1,2,2,3)
> mean(x);sd(x)
[1] 3.148148
[1] 2.106833

However, the mean doesn't have the same sd as the population the data was drawn from.

The standard error of the mean is $\sigma/\sqrt{n}$. We could estimate that from the sample sd (though if the data were truly Poisson, this isn't the most efficient method):

> sd(x)/sqrt(length(x))
[1] 0.4054603

If we assumed that the sample mean was approximately normally distributed (but did not take advantage of the possible Poisson assumption for the original data), and assumed that $\sigma=s$ (invoking Slutsky, in effect) then an approximate 68% interval for the mean would be $3.15\pm 0.41$.

However, the sample isn't really large enough for Slutsky. A better interval would take account of the uncertainty in $\hat \sigma$, which is to say, a 68% t$_{26}$-interval for the mean would be

$3.15\pm 1.013843\times 0.41$

which is just a fraction wider.

Now, as for whether the sample size is large enough to apply the normal theory CI we just used, that depends on your criteria. Simulations at similar Poisson means (in particular, ones deliberately chosen to be somewhat smaller than the observed one) at this sample size suggest that using a t-interval will work quite well for similar Poisson rates and 27 observations or more.

If we take account of the fact that the data are (supposedly) Poisson, we can get a more efficient estimate of the standard deviation and an interval for $\mu$, but if there's any risk the Poisson assumption could be wrong - a chance of overdispersion caused by some homogeneity of Poisson parameters, say - then the t-interval would probably be better.

Nevertheless, we should consider that specific question - "how to get a confidence interval for the population mean of Poisson variables" -- but this more specific question has already been answered here on CV - for example, see the fine answers here.

Related Solutions

Solved – confusion regarding confidence interval of normal distribution

It sounds like the statement you are quoting is not about confidence intervals but rather about the probability content within k standard deviations of the mean for k=1, 2, 3. This result plays a role in constructing confidence intervals for the mean of a normal sample though, because if Xi iid N(μ,σ) the sample mean Xb is N(μ,σ/√n) So for known σ [Xb-2σ/√n,Xb+2σ/√n] is approximately a 95% confidence interval for μ (actually 95.4%). When σ is unknown replacing σ with the sample standard deviation s will still give an approximate 95% confidence interval for μ when n is large. For small n the exact distribution to construct the confidence interval is student t with n-1 degrees of freedom. So 2 should be replaced by the (larger) appropriate percentile from the t distribution to get the 95% confidence.

Solved – Calculating a confidence interval and required sample size for the mean time to next event of a Poisson process

The first thing is that with a constant intensity Poisson-process with rate $\lambda$ per unit time, the time between events has an exponential distribution with mean $\mu=1/\lambda$.

Second, you actually ask two separate questions, so let's deal with them one at a time.

if I calculate the average time to occurrence based on 10 such events, how do I calculate the upper and lower error bound (in other words, the confidence interval) of the estimated average.

If you average ten such events, the mean is distributed as $\text{gamma}(10,\mu/10)$ (this is the gamma parameterized in the shape-scale form, not the shape-rate form)

If you want a $1-\alpha$ interval for the population mean, then one approach is to construct a pivotal quantity for the mean, which is to say a quantity $Q$ which is a function of the data and the parameter whose distribution doesn't depend on the parameter.

Note that $\hat{\mu}/\mu$ will also have a gamma distribution, it will be distributed as $\text{gamma}(10,1/10)$, so that is a pivotal quantity (and works just fine for this), but it may be easier to work with its inverse $\mu/\hat{\mu}$, which will be distributed as an inverse-gamma distribution, specifically $\text{inverse-gamma}(10,1/10)$.

Similarly, for a sample of size $n$, the distribution of $\mu/\hat{\mu}$ will be $\text{inverse-gamma}(n,1/n)$.

(The inverse chi-square is a particular case of this distribution, and it, too can be used for this purpose, with a suitable rescaling. So the advice you were given was correct, if not immediately helpful to you.)

(I'll assume you want a symmetric (in probability) interval.)

Let $l$ be the $\alpha/2$ quantile of an $\text{inverse-gamma}(n,1/n)$ distribution, and let $u$ be the $1-\alpha/2$ quantile - that is, the values that cut off a lower and upper tail area of $\alpha/2$.

Then $P(l<Q<u) = 1-\alpha$, so a $1-\alpha$ interval for $Q$ is $(l,u)$. Hence we can work out the interval for $\mu$:

$$l<\mu/\hat{\mu}<u$$

$$l\,\hat{\mu}<\mu<u\,\hat{\mu}$$

where $\hat{\mu}=\bar x$, which is to say a $1-\alpha$ interval for $\mu$ is $(l\,\bar x,u\, \bar x)$

Imagine the sample size was 10, and that the mean time was 3.21 seconds. Further, imagine you wanted a 95% interval.

I'm going to work in R, but anything that will give you gamma, inverse gamma, chi-square or inverse-chi-square quantiles will work.

So here's the 0.025 and 0.975 quantiles of a gamma(10,1/10):

 qgamma(.025,10,scale=.1)
[1] 0.4795389
> qgamma(.975,10,scale=.1)
[1] 1.70848

The corresponding cutoffs for an inverse gamma will be the inverses of these, flipped around:

> 1/qgamma(.975,10,scale=.1)
[1] 0.5853155
> 1/qgamma(.025,10,scale=.1)
[1] 2.085337

So that's our $l$ and $u$ respectively.

Hence a 95% interval for $\mu$ in this example is $(3.21\times 0.5853, 3.21\times 2.0853)$, or $(1.879, 6.694)$

That interval is far bigger that $\pm 20\%$, of course.

If you want to do it in terms of an inverse-chi-square instead, note that a $χ²(\nu)$ is a $\text{gamma}(\nu/2, 1/2)$; from there it's reasonably easy to convert the above calculations.

if I want the estimated average time to occurrence to be within ±20% of the actual mean based on a confidence level, how many samples do I need to record?

This is basically the inverse of the above problem, where we specify information about the interval and solve for $n$

Edit: the approach I took here before didn't quite solve the problem as stated, but a related problem (I solved the true mean being within 20% of the estimates). This version should be right:

An interval that's like this: $0.8 \mu <\hat \mu < 1.2 \mu$

is equivalent to (after dividing by $\mu\,\hat\mu$ and inverting):

$0.833 \hat\mu < \mu < 1.25 \hat\mu$

which is in the right form to play with.

So our interval is $l_n \hat{\mu}<\mu<u_n \hat{\mu}$.

You want to choose $n$ that $u_n>1.25$ and that $l_n<0.833$.

Unfortunately, the distribution of a $\text{gamma}(n,1/n)$ (and, correspondingly, and inverse gamma) is different for every $n$.

One could of course proceed by trial and error and this will work perfectly well, if sometimes being a little tedious.

Half a dozen guesses for the lower limit gets us to:

> n=86;1/qgamma(.025,n,scale=1/n)
[1] 1.250202
> n=87;1/qgamma(.025,n,scale=1/n)
[1] 1.248503

While for the other end:

> n=105;1/qgamma(.975,n,scale=1/n)
[1] 0.8332447
> n=106;1/qgamma(.975,n,scale=1/n)
[1] 0.8339306

Which means we need 106 observations to be within the desired limit on both ends and get the desired coverage.

Note that for the lower end we needed a larger sample size to achieve the desired width. [If you are willing to use an asymmetric interval (asymmetric in probability, I mean), in order to get a symmetric-in-percentage-error interval -- i.e. to be substantially more likely to be wrong out one end of the interval than the other -- then you can achieve a smaller sample size while still getting $1-\alpha$ coverage.]

Computing an asymmetric interval

If we try different values of $n$, we can find how much tail area is cut off either side of the +/-20%, and then find the smallest $n$ that totals to no more than $\alpha$ in the tails. In our example, the first place to try is to 'split the difference' and try $n=96$:

 n=96;pgamma(.8,n,scale=1/n);pgamma(1.2,n,scale=1/n,lower.tail=FALSE)
[1] 0.01907025
[1] 0.03033813

Those two tail areas give just under 0.05. Trying n=95 gives more than 0.05 in the tails combined, so n=96 is as small as you can go.

(Note that the 0.03033 value corresponds to the lower tail for the inverse gamma, but it's the probability of the estimate being at least 20% too high)

So if you are happy with more probability of being too high than too low, you can do n=96 for a 95% interval. [A higher coverage ($1-\alpha$) would give a larger $n$, and a lower coverage would give a smaller $n$.]

Simulation example in R to show that this approach is doing what it should. In this example, we calculate the limits ($l$ and $u$) for a symmetric-in-probability (2.5% in each tail) interval based on the 106 just computed:

n=106;l=1/qgamma(.975,n,scale=1/n);u=1/qgamma(.025,n,scale=1/n)
l;u
[1] 0.8339306
[1] 1.221422

Since $l$ and $u$ are within $(0.8333, 1.25)$ if the true mean lies inside that interval the estimate can't be further than 20% from the true mean and by construction, 95% of the time the generated interval includes the true mean, then at least 95% of the time we generate such an interval, the estimate must be within 20% of the true mean.

Now we simulate ten thousand intervals to show it does have the required coverage:

sum(replicate(10000,
    isTRUE(all.equal(c(l,u)*mean(rexp(106,rate=1/3.21))<3.21,c(TRUE,FALSE)))))
[1] 9507

- looks like it's right.

To avoid (or at least greatly reduce) the trial and error (not that it takes more than a few moments), we can make use of an approximation that will get us very close to the right answer - the Wilson-Hilferty transformation:

If $X \sim χ²(k)$ then $\scriptstyle\sqrt[3]{X/k}$ is approximately normally distributed with mean $\scriptstyle 1-2/(9k)$ and variance $\scriptstyle 2/(9k)$.

Best Answer

Related Solutions

Solved – confusion regarding confidence interval of normal distribution

Solved – Calculating a confidence interval and required sample size for the mean time to next event of a Poisson process

Related Question