Is this really how p-values work? Can a million research papers per year be based on pure randomness?

hypothesis testing, p-value, statistical significance

I'm very new to statistics, and I'm just learning to understand the basics, including $p$-values. But there is a huge question mark in my mind right now, and I kind of hope my understanding is wrong. Here's my thought process:

Aren't all researchers around the world somewhat like the monkeys in the "infinite monkey theorem"? Consider that there are 23,887 universities in the world. If each university has 1,000 students, that's nearly 24 million students each year.

Let's say that each year, each student does at least one piece of research, using hypothesis testing with $\alpha=0.05$.

Doesn't that mean that even if all the research samples were drawn from purely random populations, about 5% of them would still "reject the null hypothesis"? Wow. Think about that. That's about a million research papers per year getting published on the strength of "significant" results.

If this is how it works, this is scary. It means that a lot of the "scientific truth" we take for granted is based on pure randomness.

A simple chunk of R code seems to support my understanding:

library(data.table)
# 100,000 one-sample t-tests on samples of size 10 from N(0, 1),
# so the null hypothesis (true mean = 0) holds every single time
dt <- data.table(p = sapply(1:100000, function(x) t.test(rnorm(10, 0, 1))$p.value))
# ~5% of the p-values fall below 0.05 purely by chance
dt[p < 0.05, ]
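
To make that rate explicit rather than eyeballing the filtered rows, the same table can be summarized in one line; the result is typically very close to 0.05:

dt[, mean(p < 0.05)]   # fraction of "significant" tests, ~0.05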

So does this article on successful $p$-fishing: "I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How."
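
That hoax worked largely by measuring many outcomes and reporting whichever came out significant. As a rough sketch of why that matters (the 18 outcomes here is an illustrative figure, in the ballpark of what that article describes; under a true null, p-values are uniform on (0, 1)):

# With k independent null outcomes per study, the chance of at
# least one p < 0.05 is 1 - 0.95^k
k <- 18
1 - 0.95^k   # about 0.60

# Simulation check: 10,000 studies, each measuring k null outcomes
set.seed(1)
mean(replicate(10000, any(runif(k) < 0.05)))   # also about 0.60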

Is this really all there is to it? Is this how "science" is supposed to work?

Best Answer

This is certainly a valid concern, but this isn't quite right.

If 1,000,000 studies are done and all the null hypotheses are true, then approximately 50,000 will have significant results at $p < 0.05$. That's what a $p$-value means. However, the null is essentially never strictly true. And even if we loosen it to "almost true" or "about right" or some such, having all 1,000,000 nulls hold would mean the studies were all about things like

  • The relationship between social security number and IQ
  • Is the length of your toes related to the state of your birth?

and so on. Nonsense.
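
As a quick sanity check on that arithmetic (assuming only that the studies are independent), a one-liner in R:

# 1,000,000 true nulls tested at alpha = 0.05: the count of significant
# results is Binomial(1e6, 0.05), i.e., about 50,000 give or take a few hundred
qbinom(c(0.025, 0.975), size = 1e6, prob = 0.05)   # roughly 50,000 +/- 430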

One trouble is, of course, that we don't know which nulls are true. Another problem is the one @Glen_b mentioned in his comment: the file drawer problem (studies with nonsignificant results tend to stay unpublished, in the file drawer).
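
To see why not knowing which nulls are true matters so much, here is a back-of-the-envelope sketch; the 10% base rate of real effects and the power of 0.8 are purely illustrative assumptions, not estimates:

# What fraction of "significant" findings are false positives?
alpha <- 0.05   # significance level
power <- 0.8    # assumed power when the effect is real (illustrative)
prior <- 0.1    # assumed share of studies chasing a real effect (illustrative)

false_pos <- (1 - prior) * alpha     # true null, significant by chance
true_pos  <- prior * power           # real effect, correctly detected
false_pos / (false_pos + true_pos)   # ~0.36 under these assumptions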

This is why I am so fond of the ideas Robert Abelson puts forth in Statistics as Principled Argument. That is, statistical evidence should be part of a principled argument as to why something is the case, and should be judged on the MAGIC criteria:

  • Magnitude: How big is the effect?
  • Articulation: Is it full of "ifs", "ands" and "buts"? (That's bad.)
  • Generality: How widely does it apply?
  • Interestingness: How interesting or important is the finding?
  • Credibility: Incredible claims require a lot of evidence.