First-year statistics: hypothesis test for the mean number of accidents at the 5 percent significance level

hypothesis testing, poisson distribution, probability, statistics

The following table gives the number of fatal accidents of U.S. commercial airline carriers in the $16$ years from 1980 to 1995. [See screenshot below] Do these data disprove, at the $5$ percent level of significance, the hypothesis that the mean number of accidents in a year is greater than or equal to $4.5$? What is the p-value? (Hint: First formulate a model for the number of accidents.)

My attempt

Let $Y$ be the number of accidents in the $16$-year period. Since the chance of an accident is low, and supposing the hypothesis holds with a mean of exactly $4.5$ per year, I will assume $Y \sim \operatorname{Pois}(\lambda = 4.5(16) = 72)$.

$$ H_0: \lambda \ge 72$$
$$ H_1: \lambda < 72$$

Since the observed number of accidents is $58$, the p-value is $P(Y \le 58) = 0.052$, which means the data do not disprove the hypothesis at the $5$ percent level.

    > 4.5*16
    [1] 72
    > x <- c(0,4,4,4,1,4,2,4,3,11,6,4,4,1,4,2)
    > sum(x)
    [1] 58
    > sum(x)/length(x)
    [1] 3.625
    > ppois(58,4.5*16)
    [1] 0.05211354

My question: is the above reasoning correct? I have doubts because neither the growing number of departures per year nor the number of fatalities factors in, although the book question seems to care only about the number of accidents. I don't think the $16$ yearly counts are iid draws from exactly the same distribution.


Book question

[Image not available: the book's table of yearly accident data for 1980 to 1995.]

Best Answer

Note that you've aggregated the Poisson intensity across the $16$-year observation period, rather than treated the data as a sample of size $n = 16$ of annual rates. Since the research question is whether the true annual rate is below a certain hypothesized rate, it seems more natural to work with annual rates, which is also how the data are provided.

In such a case, the hypothesis test would be $$H_0 : \lambda \ge 4.5 \quad \text{vs.} \quad H_1 : \lambda < 4.5,$$ where $\lambda$ represents the true annual rate of fatal accidents. The data we are given are counts of such accidents, which supports modeling the annual number of accidents as Poisson distributed with intensity $\lambda$. So we can construct a test that rejects $H_0$ at the $\alpha = 0.05$ level if the number of accidents is "sufficiently" low. How would we do this? Well, it seems natural to consider the total number of accidents in a sample of size $n$, since at the boundary value $\lambda_0 = 4.5$ of the null hypothesis this total is also Poisson distributed, with intensity $\lambda_T = n\lambda_0 = (16)(4.5) = 72$. If we denote by $T$ the sample total, then we seek a critical value $t_{\text{crit}}$ such that $$\Pr[T < t_{\text{crit}} \mid H_0] \le 0.05;$$ in other words, we want a threshold such that, assuming the sample was drawn from a Poisson distribution with rate $4.5$ per year, the chance of seeing a total below this threshold is no more than the allowed Type I error rate $\alpha$. This ensures that we do not easily make the error of rejecting $H_0$ when it is in fact true.
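The step from annual counts to the sample total rests on the fact that a sum of independent Poisson counts is again Poisson. Here is a minimal R sketch that checks this numerically for two years by convolving the annual pmf with itself. The variable names are illustrative, and the range `0:30` is an arbitrary choice of totals to compare.

```r
lambda0 <- 4.5   # hypothesized annual rate at the boundary of H0
k <- 0:30        # totals over two years at which to compare the pmfs

# Pr[X1 + X2 = s] by direct convolution of two Poisson(lambda0) pmfs
conv2 <- sapply(k, function(s) {
  j <- 0:s
  sum(dpois(j, lambda0) * dpois(s - j, lambda0))
})

# Largest discrepancy from the Poisson(2 * lambda0) pmf; should be
# zero up to floating-point rounding
max_diff <- max(abs(conv2 - dpois(k, 2 * lambda0)))
print(max_diff)
```

Repeating the same argument one year at a time gives the Poisson distribution with intensity $72$ for the $16$-year total, under the assumption that the years are independent with a common rate.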

To find this critical value, we must find the largest $t$ such that $\Pr[T < t] \le 0.05$, where $T \sim \operatorname{Poisson}(\lambda_T = 72)$. A good first guess is to use a normal approximation, in which case $\frac{T - 72}{\sqrt{72}}$ is approximately standard normal, and since the $5^{\rm th}$ percentile of a standard normal is $-1.645$, we solve the equation $$\frac{t - 72}{\sqrt{72}} = -1.645$$ to get $t \approx 58$. Then using a computer, we calculate the exact $t$: $$\sum_{k=0}^{57} e^{-72} \frac{72^k}{k!} = 0.0399542,$$ and since $$e^{-72} \frac{72^{58}}{58!} = 0.0121594,$$ adding in the next term will make the sum exceed $0.05$, so we know $$t_{\text{crit}} = 58.$$ So you will reject $H_0$ at $\alpha = 0.05$ if the sample total is less than $58$.
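The critical-value calculation above can be reproduced in R along these lines. This is a sketch using the base functions `qnorm`, `ppois`, and `dpois`; the search grid `0:200` is an arbitrary range that comfortably contains the answer.

```r
lambda_T <- 16 * 4.5   # intensity of the 16-year total under lambda = 4.5

# Normal approximation: solve (t - 72) / sqrt(72) = 5th percentile of N(0, 1)
t_normal <- lambda_T + qnorm(0.05) * sqrt(lambda_T)   # about 58.04

# Exact Poisson probabilities around the guess
p_below_58 <- ppois(57, lambda_T)   # Pr[T < 58] = Pr[T <= 57]
p_at_58    <- dpois(58, lambda_T)   # Pr[T = 58]

# Largest t with Pr[T < t] <= 0.05, found by a direct search
t_grid <- 0:200
t_crit <- max(t_grid[ppois(t_grid - 1, lambda_T) <= 0.05])

print(c(t_normal = t_normal, p_below_58 = p_below_58,
        p_at_58 = p_at_58, t_crit = t_crit))
```

Because $\Pr[T < 58] \le 0.05 < \Pr[T < 59]$, the search returns $58$, in agreement with the hand calculation.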

As it turns out, the sample total is exactly $T = 58$, so you fail to reject. The $p$-value you calculated is simply $\Pr[T \le 58 \mid \lambda_T = 72]$, which is correct. So yes, your method is computationally correct, although perhaps a bit imprecisely reasoned. It is correct because you've exploited the property that the sum of iid Poisson variables is also Poisson, just as I have done by stating the distribution of the test statistic under the null hypothesis.
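As a cross-check, the same $p$-value can be obtained both by hand and with `poisson.test` from R's `stats` package. This is a sketch that treats the $16$ years as the time base `T` and $4.5$ as the hypothesized rate `r`; the data vector is the one from the question.

```r
# Yearly counts of fatal accidents, 1980 to 1995 (from the question)
x <- c(0, 4, 4, 4, 1, 4, 2, 4, 3, 11, 6, 4, 4, 1, 4, 2)
total <- sum(x)                     # observed sample total, 58

# p-value by hand: Pr[T <= 58] with T ~ Poisson(72)
p_manual <- ppois(total, length(x) * 4.5)

# Same one-sided exact test via poisson.test
res <- poisson.test(total, T = length(x), r = 4.5, alternative = "less")

print(c(manual = p_manual, poisson_test = res$p.value))
reject <- p_manual <= 0.05          # FALSE here: fail to reject H0
print(reject)
```

Both routes give a $p$-value of about $0.052$, just above $0.05$, so $H_0$ is not rejected.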
