Solved – The impact of non-normality (contaminated normal) on type 1 error and power when sigma is known

Tags: confidence-interval, statistical-power, type-i-and-ii-errors

When sigma is unknown, sampling from a contaminated normal distribution tends to produce (1) an actual Type 1 error rate below the nominal level and (2) reduced power. Data simulations confirm this pattern.

The question then is: does knowing the population variance help with (1) keeping the Type 1 error rate at the nominal level and (2) retaining good power?

Thank you in advance for your help!


To just add some info:

By power, I simply mean correctly rejecting the null hypothesis that mu = mu0 when that null is actually false. I thought that contaminated normal distributions tend to produce wider CIs, which makes correct rejections harder.

I did some data simulations to check contaminated normal distributions. Say I repeatedly sample from a contaminated normal distribution with mu = 0 and sigma = 3, 1000 times. For each sample I build a 90% confidence interval and then check how many of the 1000 CIs contain the population mean. As it turned out (I tried this several times), the percentage of CIs containing the population mean is approximately 90% when sigma is known. So in this case the actual Type 1 error rate seems comparable to the nominal level of 0.1.
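For concreteness, here is a minimal sketch of that known-sigma simulation. The contamination mixture (90% N(0, 1) plus 10% N(0, 9²), which happens to give an overall sigma of exactly 3) and the sample size n = 30 are my own assumptions, not details from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

def rcontam(n, eps=0.1, sd_core=1.0, sd_contam=9.0):
    """Hypothetical contaminated normal: (1-eps) N(0,1) + eps N(0, 9^2).
    Overall variance = 0.9*1 + 0.1*81 = 9, so sigma = 3."""
    is_contam = rng.random(n) < eps
    return np.where(is_contam, rng.normal(0, sd_contam, n), rng.normal(0, sd_core, n))

mu, sigma, n, reps = 0.0, 3.0, 30, 1000
z = 1.6449                                    # z critical value for a two-sided 90% CI
cover = 0
for _ in range(reps):
    x = rcontam(n)
    half = z * sigma / np.sqrt(n)             # known-sigma (z) interval half-width
    cover += (x.mean() - half) <= mu <= (x.mean() + half)
print(cover / reps)                           # coverage comes out close to the nominal 0.90
```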

I also tried this assuming sigma is unknown (so using t-statistics rather than z). There I consistently got more than 90% of CIs containing the population mean, so in that case the actual Type 1 error rate seems to be lower than the nominal level of 0.1.
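Here is the matching unknown-sigma sketch (same hypothetical mixture and n = 30 as above), now using the sample standard deviation and a t critical value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def rcontam(n, eps=0.1, sd_core=1.0, sd_contam=9.0):
    # same hypothetical mixture as before: overall sigma = 3
    is_contam = rng.random(n) < eps
    return np.where(is_contam, rng.normal(0, sd_contam, n), rng.normal(0, sd_core, n))

mu, n, reps = 0.0, 30, 1000
tcrit = stats.t.ppf(0.95, df=n - 1)              # t critical value for a two-sided 90% CI
cover = 0
for _ in range(reps):
    x = rcontam(n)
    half = tcrit * x.std(ddof=1) / np.sqrt(n)    # sigma estimated from the sample
    cover += (x.mean() - half) <= mu <= (x.mean() + half)
print(cover / reps)                              # typically above 0.90, i.e. conservative coverage
```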

But I thought the Type 1 error rate and power would suffer even when sigma is known (e.g., the actual Type 1 error rate could be lower than the nominal level). My simulations (assuming I did them correctly) did not show this, so I want to check with you guys to see what's going on. Thanks 🙂

Best Answer

Are you testing 2 populations for the same mean, or one population against a theoretical mean?

Whatever ..... a contaminated normal will tend to inflate the variance estimate (and standard deviation) and thus shrink the t-statistic (by inflating the denominator). It can wreak havoc with the numerator as well, but typically the inflated denominator will pull the t-statistic towards 0, dragging the actual Type 1 error rate below the nominal level and reducing power.
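To make that denominator effect concrete, here is the variance of a generic contaminated normal; the mixing weight $\varepsilon$ and scale factor $k$ are placeholders, not values from the question:
$$X \sim (1-\varepsilon)\,N(\mu,\sigma_0^2) + \varepsilon\,N(\mu, k^2\sigma_0^2), \qquad \operatorname{Var}(X) = \bigl[(1-\varepsilon) + \varepsilon k^2\bigr]\sigma_0^2 .$$
With, say, $\varepsilon = 0.1$ and $k = 9$, the sample variance targets $9\sigma_0^2$ rather than $\sigma_0^2$, so the denominator of $t = \dfrac{\bar x - \mu_0}{s/\sqrt{n}}$ is inflated roughly threefold on average.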

If $\sigma$ is known, then your test statistic will be a $z$ score, based on the sample mean. The mean is non-robust in this situation, so spurious large values could inflate that score and increase the type-1 error.

I am assuming here that the mean and variance in question are those of the "true" distribution, absent contamination. But concretely, in the real world, what does that actually mean? Suppose I have 2 industrial processes A and B that make widgets. Suppose the diameter of widget A is less than that of widget B (which I want, for some reason), but 10% of the A widgets are way off because of "contamination". Which process should I prefer? And knowing the extent of the contamination - which in a sense is what knowing the variance amounts to in this case - is not going to help very much.

Rather than hope your type 1 and type 2 errors don't lead you astray, would it not be better to cull the outliers, use a robust estimator, or ensure the contamination does not occur in the first place?
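For what a robust alternative can look like, here is a minimal sketch using a 20% trimmed mean; the data and the trimming fraction are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical sample with one gross outlier (the 12.0)
x = np.array([0.2, -0.5, 0.1, 0.4, -0.3, 12.0, 0.0, -0.1])

print(np.mean(x))               # ordinary mean, dragged upward by the outlier (~1.48)
print(stats.trim_mean(x, 0.2))  # 20% trimmed mean, far less affected (~0.05)
```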

Just sayin'
