Hypothesis Testing – Why Will a Statistical Comparison Be Significant with Large Samples Unless the Population Effect Is Zero?

hypothesis testing

From Wikipedia

Given a sufficiently large sample size, a statistical comparison will always show a significant difference unless the population effect size is exactly zero.

For example, a sample Pearson correlation coefficient of 0.1 is strongly statistically significant if the sample size is 1000. Reporting only the significant p-value from this analysis could be misleading if a correlation of 0.1 is too small to be of interest in a particular application.
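As a quick numerical check of this example (a minimal sketch in Python, assuming SciPy is available), the p-value for a sample correlation of r = 0.1 with n = 1000 can be computed from the usual t-statistic for a Pearson correlation:

```python
# Two-sided p-value for a sample Pearson correlation r with n observations,
# using t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.
import numpy as np
from scipy import stats

r, n = 0.1, 1000
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.2f}, p = {p:.4f}")  # t is about 3.2, p is about 0.0015
```

So a correlation that explains only 1% of the variance is indeed well below the conventional 0.05 threshold at this sample size.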

I was wondering why "given a sufficiently large sample size, a statistical comparison will always show a significant difference unless the population effect size is exactly zero"?

Thanks and regards!

Best Answer

As the sample size increases, the statistical power (see below) to detect even the smallest effect size increases as well, so tiny effect sizes end up being statistically significant even though they may have no practical relevance at all. As a thought experiment to illustrate this further: what if you could include every person of interest in your study? All statistics calculated from that complete "sample" would reflect the true population values without error. So if the population effect size were exactly 0, then, and only then, would you find it to be exactly 0. Otherwise you would find some tiny difference, or correlation, or whatever your effect size is.
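To see this numerically, here is a rough Python sketch (assuming SciPy is available) that holds the sample correlation fixed at a tiny value, r = 0.05, and computes the two-sided p-value for increasing sample sizes:

```python
# The same small sample correlation, evaluated at increasing sample sizes:
# the p-value shrinks purely because n grows.
import numpy as np
from scipy import stats

r = 0.05
for n in (100, 1000, 10_000, 100_000):
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    print(f"n = {n:>6}: p = {p:.2e}")
```

The p-value drops from roughly 0.6 at n = 100 to around 0.1 at n = 1000 and then far below any conventional threshold by n = 10,000, even though the correlation itself (and its practical relevance) has not changed.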

This post might also be interesting in that context.

Addition

I found this wonderful analogy for statistical power in Harvey Motulsky's book Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking (the analogy was originally developed by John Hartung):

Suppose you send your child into your basement to fetch a tool, say a hammer. The child comes back and says, "The hammer isn't there." What is your conclusion? Is the hammer in the basement or not? We cannot be 100% sure, so the answer must be a probability. The question you really want to answer is, "What is the probability that the hammer is in the basement?" To answer this question, we would need a prior probability and thus Bayesian statistics. But we can ask a different question: "If the hammer really is in the basement, what is the chance that your child would have found it?" It is immediately clear that the answer depends on several things:

  • How long did your child spend looking? This is analogous to sample size. The longer the child keeps looking, the more likely it is to find the hammer. And importantly: even if the hammer is really small, a child who spends hours looking is still likely to find it. This is also true for studies: the larger the sample size, the smaller the effect sizes ("hammers") that can be detected.
  • How big is the hammer? This is analogous to the effect size. A sledgehammer is easier (i.e. faster) to find than a tiny hammer. A study has more power if the effect size is large.
  • How messy is the basement? It is easier to find the hammer in an organized basement than in a messy one. This is analogous to experimental scatter (variation). A study has more power when the data show little variation.

Your child has a hard time if it has to find a tiny hammer in a short time in a messy basement. On the other hand, your child has a good chance of finding a sledgehammer if it spends a long time searching in a tidy basement (so clean up your basement before sending your child looking for something!).
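To put rough numbers on the three factors above, here is a minimal Python sketch (assuming SciPy; it uses the normal approximation for a two-sided one-sample z-test, which is my simplification, not Motulsky's) in which sample size, effect size, and scatter each enter the power calculation explicitly:

```python
# Approximate power of a two-sided one-sample z-test at level alpha:
# power = P(|Z + d*sqrt(n)| > z_crit) for Z ~ N(0, 1),
# with standardized effect d = effect / sigma.
import numpy as np
from scipy.stats import norm

def power(effect, sigma, n, alpha=0.05):
    d = effect / sigma                    # "hammer size" relative to the "mess"
    z_crit = norm.ppf(1 - alpha / 2)      # two-sided critical value
    return (norm.cdf(d * np.sqrt(n) - z_crit)
            + norm.cdf(-d * np.sqrt(n) - z_crit))

print(power(effect=0.2, sigma=1.0, n=50))    # small hammer, short search: low power (~0.29)
print(power(effect=0.2, sigma=1.0, n=1000))  # same hammer, long search: power near 1
print(power(effect=0.8, sigma=1.0, n=50))    # sledgehammer: high power even with n = 50
print(power(effect=0.2, sigma=2.0, n=1000))  # messier basement (larger sigma): power drops (~0.89)
```

Note that power depends on the product effect * sqrt(n) / sigma, so doubling the scatter costs as much power as quartering the sample size, which is why a tidy basement (low variation) is worth so much.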