Hypothesis Testing – Does P-Value Mean Anything in Bimodal Test Statistic Distribution?

bimodal, descriptive statistics, hypothesis testing, p-value, statistical significance

The p-value is defined as the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. In other words,

$$P(X \ge t \mid H_0)$$
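For a familiar unimodal null, this tail probability can be computed directly. A quick R illustration (the standard normal null and the observed value of 2 are made up for this example):

```r
# Hypothetical: suppose the test statistic is N(0, 1) under H0
# and we observe t = 2
t_obs <- 2

# P(X >= t | H0): upper-tail probability
pnorm(t_obs, mean = 0, sd = 1, lower.tail = FALSE)
# [1] 0.02275013
```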
But what if the distribution of the test statistic is bimodal? Does the p-value mean anything in this context? For example, I am going to simulate some bimodal data in R:

set.seed(0)
# Generate bi-modal distribution
bimodal <- c(rnorm(n=100,mean=25,sd=3),rnorm(n=100,mean=100,sd=5)) 
hist(bimodal, breaks=100)

[Figure: histogram of bimodal, showing two modes, one near 25 and one near 100]

Now let's assume we observe a test statistic value of 60. From the picture we know this value is very unlikely, so ideally the statistical procedure I use (say, a p-value) would reveal this. But if we compute the p-value as defined, we get a pretty high value:

observed <- 60

# Empirical p-value: proportion of simulated values
# at least as extreme as the observed one
mean(bimodal >= observed)
[1] 0.5

If I did not know the distribution, I would conclude that what I observed arose simply by random chance. But we know this is not true.

I guess the question I have is: why, when computing a p-value, do we compute the probability of the values "at least as extreme as" the one observed?
And if I encounter a situation like the one I simulated above, what is the alternative solution?

Best Answer

What makes a test statistic "extreme" depends on your alternative, which imposes an ordering (or at least a partial order) on the sample space - you seek to reject those cases most consistent (in the sense being measured by a test statistic) with the alternative.

When you don't really have an alternative to give you something to be most consistent with, you're essentially left with the likelihood to provide the ordering, as most often seen in Fisher's exact test. There, the probability of the outcomes (the 2x2 tables) under the null orders the test statistic (so that 'extreme' means 'low probability').
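A small sketch of that probability ordering, using a hypothetical 2x2 table with fixed margins (row totals 10 and 10, column totals 8 and 12; the numbers are illustrative):

```r
# All possible counts for cell (1,1) given the fixed margins
a <- 0:8

# Null probability of each table: hypergeometric with
# m = 8 (column-1 total), n = 12 (column-2 total), k = 10 (row-1 total)
pr <- dhyper(a, m = 8, n = 12, k = 10)

# Observed count in cell (1,1)
a_obs <- 7

# Fisher's p-value: total probability of all tables
# at least as improbable as the observed one
sum(pr[pr <= dhyper(a_obs, m = 8, n = 12, k = 10)])
```

This matches the two-sided p-value from `fisher.test(matrix(c(7, 1, 3, 9), 2, 2))`, which uses the same 'low probability = extreme' ordering.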

If you were in a situation where the far left (or far right, or both) of your bimodal null distribution was associated with the kind of alternative you were interested in, you wouldn't seek to reject a test statistic of 60. But if you're in a situation where you don't have an alternative like that, then 60 is unusual - it has low likelihood; a value of 60 is inconsistent with your model and would lead you to reject.
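For the simulated example above, that likelihood ordering can be implemented directly: call a value 'extreme' when it is at least as improbable as 60 under the null. Here is a sketch, using a kernel density estimate of the null distribution as a stand-in for the true likelihood (the bandwidth is left at R's default, which is an assumption worth checking):

```r
set.seed(0)
bimodal <- c(rnorm(n = 100, mean = 25, sd = 3), rnorm(n = 100, mean = 100, sd = 5))

# Kernel density estimate of the null distribution of the statistic
dens <- density(bimodal)
f <- approxfun(dens$x, dens$y)  # interpolate the estimated density

# Likelihood-ordered p-value: proportion of the null sample
# whose estimated density is <= the density at the observed value
mean(f(bimodal) <= f(60))
# [1] 0  -- no simulated value is as improbable as 60
```

Now the p-value is essentially zero, matching the intuition that 60 falls in the trough between the two modes.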

[This would be seen by some as one central difference between Fisherian and Neyman-Pearson hypothesis testing. By introducing an explicit alternative, and a ratio of likelihoods, a low likelihood under the null won't necessarily cause you to reject in a Neyman-Pearson framework (as long as it performs relatively well compared to the alternative), while for Fisher, you don't really have an alternative, and the likelihood under the null is the thing you're interested in.]

I'm not suggesting either approach is right or wrong here - you go ahead and work out for yourself what kind of alternatives you seek power against, whether it's a specific one, or just anything that's unlikely enough under the null. Once you know what you want, the rest (including what 'at least as extreme' means) pretty much follows from that.