Solved – Is the p-value essentially useless and dangerous to use?

bayesian, hypothesis-testing, p-value, reproducible-research, statistical-significance

The article "The Odds, Continually Updated" from the NY Times happened to catch my attention. In short, it states that

[Bayesian statistics] is proving especially useful in approaching complex problems, including searches like the one the Coast Guard used in 2013 to find the missing fisherman, John Aldridge (though not, so far, in the hunt for Malaysia Airlines Flight 370)…….., Bayesian statistics are rippling through everything from physics to cancer research, ecology to psychology…

In the article, there are also some criticisms of the frequentist p-value, for example:

Results are usually considered “statistically significant” if the p-value is less than 5 percent. But there is a danger in this tradition, said Andrew Gelman, a statistics professor at Columbia. Even if scientists always did the calculations correctly — and they don’t, he argues — accepting everything with a p-value of 5 percent means that one in 20 “statistically significant” results are nothing but random noise.

Besides the above, perhaps the most famous paper criticizing the p-value is "Scientific method: Statistical errors" by Regina Nuzzo in Nature, which discusses many of the scientific issues raised by the p-value approach, such as reproducibility concerns, p-hacking, etc.

P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. …… Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. “P-hacking,” says Simonsohn, “is trying multiple things until you get the desired result” — even unconsciously. …… “That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall p-value would be less than .05”, and “She is a p-hacker, she always monitors data while it is being collected.”

Another thing is the following interesting plot (from here), together with this comment about it:

No matter how small your effect may be, you can always do the hard work of gathering data in order to pass the threshold of p < .05. As long as the effect you're studying isn't non-existent, p-values just measure how much effort you've put into collecting data.

[plot omitted]
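Here is a minimal simulation sketch of that claim (the effect size of 0.02 and the sample sizes are my own illustrative assumptions, not taken from the plot): with a tiny but non-zero effect, the p-value eventually drops below .05 once enough data have been collected.

```python
import numpy as np
from scipy import stats

# A tiny but non-zero effect: true mean 0.02 versus a null mean of 0.
# As the sample grows, the one-sample t-test eventually gives p < .05.
rng = np.random.default_rng(seed=1)
true_mean = 0.02                      # assumed small effect, for illustration
data = np.array([])

for n in [100, 1_000, 10_000, 100_000]:
    # Grow the sample to size n, then test against the null mean of 0
    data = np.concatenate([data, rng.normal(true_mean, 1.0, n - data.size)])
    res = stats.ttest_1samp(data, popmean=0.0)
    print(f"n = {n:>7,}: p = {res.pvalue:.4f}")
```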

With all above, my questions are:

  1. What does Andrew Gelman's argument in the second block quote mean, precisely? Why did he interpret a 5-percent p-value threshold as meaning that "one in 20 statistically significant results are nothing but random noise"? I am not convinced, since to me a p-value is used to make an inference about a single study. His point seems related to multiple testing.

    Update: Check Andrew Gelman's blog about this: No, I didn't say that! (Credits to @Scortchi, @whuber).

  2. Given the criticisms of the p-value, and given that there are many information criteria, such as AIC, BIC, and Mallows's $C_p$, for evaluating the significance of a model (and hence of its variables), should we not use the p-value for variable selection at all and instead use those model-selection criteria?

  3. Is there any good practical guidance on using p-values in statistical analysis that could lead to more reliable research results?
  4. Would a Bayesian modeling framework be a better way to pursue, as some statisticians advocate? Specifically, would a Bayesian approach be more likely to prevent false findings or data manipulation? I am not convinced here either, since the prior is very subjective in the Bayesian approach. Are there any practical and well-known studies showing that the Bayesian approach is better than the frequentist p-value, at least in some particular cases?

    Update: I would be particularly interested in whether there are cases where the Bayesian approach is more reliable than the frequentist p-value approach. By "reliable", I mean that the Bayesian approach is less open to manipulating the data for a desired result. Any suggestions?


Update 6/9/2015

I just noticed this news item and thought it would be good to post it here for discussion.

Psychology journal bans P values

A controversial statistical test has finally met its end, at least in one journal. Earlier this month, the editors of Basic and Applied Social Psychology (BASP) announced that the journal would no longer publish papers containing P values because the statistics were too often used to support lower-quality research.

Along with a recent paper from Nature about the P value, "The fickle P value generates irreproducible results".

Update 5/8/2016

Back in March, the American Statistical Association (ASA) released a statement on statistical significance and p-values: "…The ASA statement is intended to steer research into a ‘post p<0.05 era.’"

This statement contains 6 principles that address the misuse of the p-value:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Details:
"The ASA's statement on p-values: context, process, and purpose".

Best Answer

Here are some thoughts:

  1. As @whuber notes, I doubt Gelman said that (although he may have said something similar-sounding). Five percent of cases where the null is true will yield significant results (type I errors) using an alpha of .05. If we assume that the true power for all studies where the null was false was $80\%$, the statement could only be true if the ratio of studies undertaken where the null was true to studies in which the null was false was $100/118.75 \approx 84\%$ (see the worked calculation after this list).
  2. Model selection criteria, such as the AIC, can be seen as a way of selecting an appropriate $p$-value. To understand this more fully, it may help to read @Glen_b's answer here: Stepwise regression in R – Critical p-value. Moreover, nothing would prevent people from 'AIC-hacking' if the AIC became the requirement for publication.
  3. A good guide to fitting models in such a manner that you don't invalidate your $p$-values would be Frank Harrell's book, Regression Modeling Strategies.
  4. I am not dogmatically opposed to using Bayesian methods, but I do not believe they would solve this problem. For example, you can just keep collecting data until the credible interval no longer includes whatever value you wanted to reject. Thus you have 'credible interval-hacking' (see the simulation sketch after this list). As I see it, the issue is that many practitioners are not intrinsically interested in the statistical analyses they use, so they will use whichever method is required of them in an unthinking and mechanical way. For more on my perspective here, it may help to read my answer to: Effect size as the hypothesis for significance testing.
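To spell out the arithmetic behind point 1: let $T$ be the number of studies undertaken where the null is true and $F$ the number where it is false. With $\alpha = .05$ and power $0.80$, the fraction of significant results that are pure noise equals $1/20$ exactly when

$$\frac{0.05\,T}{0.05\,T + 0.80\,F} = \frac{1}{20} \;\Longleftrightarrow\; 0.95\,T = 0.80\,F \;\Longleftrightarrow\; \frac{T}{F} = \frac{0.80}{0.95} = \frac{100}{118.75} \approx 84\%.$$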
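And here is a minimal simulation sketch of the 'credible interval-hacking' in point 4. The conjugate normal model, the prior, and the stopping rule are my own illustrative assumptions; the point is only that, even with a true null, stopping as soon as the 95% credible interval excludes zero "succeeds" in a non-trivial fraction of runs.

```python
import numpy as np

# Optional stopping with a conjugate normal model (known noise variance).
# The true mean is exactly 0, yet we stop and declare a finding as soon as
# the 95% posterior credible interval for the mean excludes 0.
rng = np.random.default_rng(seed=2)
prior_mean, prior_var, noise_var = 0.0, 10.0, 1.0   # assumed prior and noise
n_sims, max_n = 500, 2_000
hacked = 0

for _ in range(n_sims):
    data_sum, n = 0.0, 0
    while n < max_n:
        data_sum += rng.normal(0.0, np.sqrt(noise_var))   # null is true
        n += 1
        # Conjugate posterior for the mean after n observations
        post_var = 1.0 / (1.0 / prior_var + n / noise_var)
        post_mean = post_var * (prior_mean / prior_var + data_sum / noise_var)
        if abs(post_mean) > 1.96 * np.sqrt(post_var):     # 95% CI excludes 0
            hacked += 1
            break

print(f"95% credible interval excluded 0 before n={max_n} "
      f"in {hacked / n_sims:.0%} of runs, despite a true mean of 0")
```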