Solved – Bayesian criticism of frequentist p-value

bayesian, p-value, statistical-significance

I recently started reading the Bayesian criticism of the p-value, and much of the discussion seems to centre on the claim that the frequentist approach performs poorly when the null hypothesis is true.

For instance, in this paper the authors write that "p-values overstate the evidence against the null […] this does not have to do with type-I or type-II errors; it is an “independent” property of p-value."

To illustrate this point, the authors show that when the null is true, the p-value has a uniform distribution.

What I do not get is that, even when the null is true, a frequentist approach, thanks to the Central Limit Theorem, is still able to construct confidence intervals that include 0 (non-significance) at the appropriate $\alpha$ level.

I do not see why the uniformity of the p-value under a true null shows that the frequentist approach is biased. And what does an "independent property of the p-value" mean?


library(tidyverse)
library(broom)

# Simulate data under an exactly true null: the true slope d is 0, so y is unrelated to x
n = 1000
x = rnorm(n, 100, 30)
d = 0
y = x*d + rnorm(n, 0, 20)
df = data.frame(y, x)
plot(x, y)
abline(lm(y ~ x), col = 'red')

# Draw 1000 subsamples of size 50 and fit the regression y ~ x to each
r = replicate(1000, sample_n(df, size = 50), simplify = F)
m = r %>% map(~ lm(y ~ x, data = .)) %>% map(tidy)

# Central Limit Theorem: the slope estimates are approximately normal around 0
bind_rows(.id = 'sample', m) %>% filter(term == 'x') %>%
  ggplot(aes(estimate)) + facet_grid(~term) + geom_histogram()

# Proportion of false positives at alpha = 0.05 (should be close to 0.05)
s = bind_rows(.id = 'sample', m) %>% filter(term == 'x')
s$false_positive = ifelse(s$p.value < 0.05, 1, 0)
prop.table(table(s$false_positive))

# Under the null, the p-values are uniformly distributed
hist(s$p.value, breaks = 50)
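
To connect this to the confidence-interval point above, here is a minimal sketch (my addition, building on the s data frame from the simulation) that computes each subsample's 95% confidence interval for the slope by hand from broom's estimate and std.error columns, using a t quantile with 50 − 2 = 48 degrees of freedom, and checks how often the intervals contain the true slope of 0:

# Coverage check (sketch): with the null exactly true, about 95% of the
# per-sample 95% CIs should contain the true slope of 0 -- the mirror image
# of the ~5% false-positive rate above.
s$ci_lower = s$estimate - qt(0.975, df = 50 - 2) * s$std.error
s$ci_upper = s$estimate + qt(0.975, df = 50 - 2) * s$std.error
mean(s$ci_lower <= 0 & s$ci_upper >= 0)   # expected to be roughly 0.95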

Best Answer

The point that the authors are trying to make is a subtle one: they see it as a failure of NHST that, as $n$ gets arbitrarily large, the $p$-value does not tend to 1 when the null is true. (It is a bit surprising that the paper contains no discussion of equivalence testing.) To me it is neither surprising nor troubling that the $p$-value keeps its uniform distribution under a true null as $n$ grows larger and larger. A large $n$ buys sensitivity to detect smaller and smaller effects, while the false-positive rate stays fixed at $\alpha$. So in the rather special setting where the null is exactly true, the distribution of the $p$-value does not depend on $n$ at all.
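
A quick way to see that last claim (my own illustration, not from the paper; the sample sizes are arbitrary): simulate one-sample t-tests under an exactly true null for several values of $n$ and look at the p-values. The rejection rate stays at about 5% and the histograms stay flat for every $n$.

# Sketch: under an exactly true null, the p-value is Uniform(0,1) for any n
set.seed(1)
for (n_i in c(20, 200, 2000)) {
  p = replicate(5000, t.test(rnorm(n_i))$p.value)  # true mean is 0
  cat("n =", n_i, " rejection rate at 0.05:", mean(p < 0.05), "\n")
  hist(p, breaks = 50, main = paste("n =", n_i))   # each histogram is flat
}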

  1. NHST is, in my mind, desirable specifically because there is no way of declaring a null hypothesis to be true: my experimental design is set up specifically to disprove it. A non-significant result may simply mean that my experiment was underpowered or that my assumptions were wrong, so there are risks associated with accepting the null that I would rather not incur.

  2. We never actually believe that the null hypothesis is exactly true. Typically, failed designs arise because the truth is too close to the null to be detectable. Having too much data can even be a drawback here; there is a subtle art to designing a study with just enough sample size to reject the null when a meaningful difference is present.

  3. One can design a frequentist procedure that first tests for a difference (one- or two-tailed) and, conditional on a non-significant result, performs an equivalence test, so that "the null is practically true" can itself be declared as a significant result. In the latter case one can show that the power of the equivalence test goes to 1 as $n$ grows when the null is in fact true; a sketch follows below.
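
To illustrate point 3, here is a minimal sketch of an equivalence test, using the two one-sided tests (TOST) procedure for a one-sample mean. The equivalence margin of ±0.2 and the sample sizes are arbitrary choices of mine, not from the answer. With the null exactly true (true mean 0, sd 1), the probability of declaring equivalence approaches 1 as $n$ grows.

# Sketch: TOST equivalence test with margin +/- 0.2 at alpha = 0.05.
# Declare equivalence if we reject mu <= -0.2 AND reject mu >= +0.2.
set.seed(1)
tost_equivalent = function(x, margin = 0.2, alpha = 0.05) {
  p_lower = t.test(x, mu = -margin, alternative = "greater")$p.value
  p_upper = t.test(x, mu =  margin, alternative = "less")$p.value
  max(p_lower, p_upper) < alpha
}
for (n_i in c(100, 400, 1600)) {
  # Data generated with the null exactly true: mean 0, sd 1
  power = mean(replicate(2000, tost_equivalent(rnorm(n_i))))
  cat("n =", n_i, " P(declare equivalence):", round(power, 3), "\n")
}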