[Math] Calculating the test statistic for a Poisson distribution

hypothesis testing, poisson distribution, probability, statistics

I'm in an intro stats class, and I'm wondering how I can argue or prove the question below regarding sample size and the Poisson distribution.

Suppose a company that produces fire alarms has claimed that its fire alarms give only one false alarm per year, on average. Let $X$ denote the number of false alarms per year. Assume $X \sim \text{Poisson}(\lambda)$. Under the company's claim, the probability of observing $x$ false alarms per year is

$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!} = \frac{e^{-1}}{x!}, x = 0,1,…$

A customer had a bad experience with the fire alarm he purchased before. He allows a 1% chance of falsely rejecting $H_0$. He gathers one hundred people who each observed three or more false alarms and records:

$(X_1,X_2,…,X_{100}) = (4,6,…,3)$ with

$\bar{X}_{100} = \frac{1}{100} \sum_{i=1}^{100}X_i = 3.32$

a. Ignoring any flaw of data collection, calculate the test statistic (which is compared to the standard normal distribution) and the approximate p-value for testing

$H_0 : \lambda = 1$ versus $H_1 : \lambda > 1$

Also note that the population mean and the population variance are equal to $\lambda = 1$

b. In two sentences, argue why the sample size of $n = 100$ is not useful for the hypothesis testing.


a. For part a, I am doing this, because I am supposed to use a normal distribution.

$T = \frac{\bar{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} = \frac{\bar{X} - \lambda}{\sqrt{\frac{\lambda}{n}}} = \frac{3.32 - 1}{\sqrt{\frac{1}{100}}} = 23.2$

The value I got (using R) was just $\text{qnorm}(0.99, 0, 1) \approx 2.326$, which is the critical value at the 1% level rather than a p-value.
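As a cross-check on the arithmetic in part (a), here is a minimal Python sketch using only the standard library. It keeps three quantities separate: the test statistic, the 1% critical value (what `qnorm(0.99, 0, 1)` returns in R), and the upper-tail p-value (the analogue of `1 - pnorm(z)` in R). The variable names are illustrative, and the calculation takes the question's instruction to ignore any flaw in the data collection at face value.

```python
from math import erfc, sqrt
from statistics import NormalDist

lam0 = 1.0    # null value of lambda (both mean and variance of the Poisson)
n = 100       # sample size
xbar = 3.32   # observed sample mean

# z statistic: (sample mean - null mean) / standard error under H0
z = (xbar - lam0) / sqrt(lam0 / n)

# 99th percentile of the standard normal: the cutoff for a 1% one-sided test
critical = NormalDist().inv_cdf(0.99)

# Upper-tail probability P(Z > z); erfc stays accurate far out in the tail,
# where 1 - cdf(z) would round to exactly 0 in floating point
p_value = 0.5 * erfc(z / sqrt(2))

print(f"z = {z:.1f}")
print(f"critical value = {critical:.3f}")
print(f"p-value = {p_value:.3g}")
```

The statistic is far beyond the critical value, so the corresponding p-value is vanishingly small.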

b. Part b is where I get confused. I read that having a large sample size is actually good when testing a hypothesis. However, I also realize that as $n$ increases, so does the test statistic while $\lambda$ remains fixed. I'm just wondering what I can argue mathematically, and why it is like this for Poisson distributions.

Best Answer

According to the statement of the question, the sample that is collected is only representative of those individuals who experienced at least three alarms: therefore, the salient question is, what is the distribution of the sample mean under the assumption of the null hypothesis? That is, what is the distribution of $\bar X \mid H_0$ given that $X_i \ge 3$ for each $i = 1, 2, \ldots, 100$?

To this end, we observe that if $Y_i = (X_i \mid X_i \ge 3)$, then

$$\Pr[Y_i = x] = \frac{\Pr[(X_i = x) \cap (X_i \ge 3)]}{\Pr[X_i \ge 3]} = \frac{e^{-\lambda} \lambda^x/x!}{1-e^{-\lambda}(1+\lambda+\lambda^2/2)}, \quad x \ge 3.$$

Under the null hypothesis that $\lambda = 1$, this simply becomes

$$\Pr[Y_i = x] = \frac{2}{(2e-5)x!}.$$

The expected value is

$$\mu = \operatorname{E}[Y_i] = \sum_{x=3}^\infty x \Pr[Y_i = x] = 1 + \frac{1}{2e-5} \approx 3.29062.$$

The variance is

$$\sigma^2 = \operatorname{Var}[Y_i] = \frac{2(7-8e+2e^2)}{(2e-5)^2} \approx 0.334309.$$

Therefore, the expected value of the sample mean under the null hypothesis is

$$\mu_0 = \mu \approx 3.29062,$$

and the variance of the sample mean under the null hypothesis is

$$\sigma_0^2 = \frac{\sigma^2}{100} \approx 0.00334309.$$

Using the Central Limit Theorem,

$$\bar X \mid H_0 \, \dot\sim \operatorname{Normal}(\mu_0, \sigma_0^2),$$

hence we can perform a $z$-test: if

$$Z = \frac{\bar X - \mu_0}{\sigma_0}$$

exceeds the $99^{\rm th}$ percentile of the standard normal distribution, then the sample contains sufficient evidence to suggest that the true value of $\lambda$ is strictly greater than $1$ with at most a Type I error probability of $\alpha = 0.01$.
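The truncated-Poisson moments and the resulting $z$ statistic can be checked numerically. The following is a minimal Python sketch, not part of the original answer: it sums the conditional pmf directly instead of using the closed forms, and it cuts the infinite sum off at $x = 60$, which is an assumption that is harmless for $\lambda = 1$ because the remaining terms are far below floating-point precision. All names are illustrative.

```python
from math import erfc, exp, factorial, sqrt

lam0 = 1.0    # null value of lambda
n = 100       # sample size
xbar = 3.32   # observed sample mean
cutoff = 3    # only people with at least this many alarms were sampled

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

# P(X >= cutoff) under H0, via the complement of the first few terms
tail = 1.0 - sum(poisson_pmf(x, lam0) for x in range(cutoff))

# Conditional pmf of Y = (X | X >= cutoff); the sum is truncated at x = 60
# (assumption: terms beyond that are negligible for lam0 = 1)
pmf = {x: poisson_pmf(x, lam0) / tail for x in range(cutoff, 61)}

mu = sum(x * p for x, p in pmf.items())               # E[Y] under H0
var = sum((x - mu) ** 2 * p for x, p in pmf.items())  # Var[Y] under H0

se = sqrt(var / n)                 # standard deviation of the sample mean
z = (xbar - mu) / se               # z statistic for the truncated sample
p_value = 0.5 * erfc(z / sqrt(2))  # one-sided upper-tail p-value

print(f"mu = {mu:.5f}, var = {var:.6f}")
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

The numerical mean and variance should agree with the closed-form values above to the digits shown there, and the $z$ statistic comes out well below the 1% critical value of about $2.326$.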

Note that you cannot use the calculation you performed for part (a). The reason should now be clear: such a calculation would only be applicable when the sample is drawn from all people, not just those who observed at least 3 alarms. What we can see from the correct calculation is that if the mean number of alarms among those who had at least 3 is more than $3.42512$, then you would reject the null hypothesis in the one-sided test at the 1% significance level. The observed mean of $3.32$ gives $Z \approx 0.51$, so the sample just isn't extreme enough.
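The rejection threshold quoted above follows from $\mu_0 + z_{0.99}\,\sigma_0$. Here is a short Python sketch of that arithmetic, using the closed-form mean and variance of the truncated distribution under $H_0$; the variable names are illustrative, and `statistics.NormalDist` stands in for R's `qnorm`.

```python
from math import e, sqrt
from statistics import NormalDist

n = 100        # sample size
alpha = 0.01   # allowed Type I error probability
xbar = 3.32    # observed sample mean

# Closed-form mean and variance of (X | X >= 3) when lambda = 1
mu0 = 1 + 1 / (2 * e - 5)
var_y = 2 * (7 - 8 * e + 2 * e**2) / (2 * e - 5) ** 2
sigma0 = sqrt(var_y / n)   # standard deviation of the sample mean

# Reject H0 when the sample mean exceeds mu0 + z_{1 - alpha} * sigma0
z_crit = NormalDist().inv_cdf(1 - alpha)
threshold = mu0 + z_crit * sigma0
reject = xbar > threshold

print(f"threshold = {threshold:.5f}")
print(f"reject H0: {reject}")
```

With the observed mean of 3.32 sitting below the threshold, the sketch reports that $H_0$ is not rejected.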

As for part (b), I don't understand the intent here. As you can see, we are easily able to perform a meaningful hypothesis test with $n = 100$, which is large enough to justify the use of a normal approximation to the sampling distribution. The test statistic that you used, which seems to be implied by the question, is not the correct one to apply for the given hypothesis.