Hypothesis Testing – ASA Discusses Limitations of $p$-values: What Are the Alternatives?

Tags: bayesian, frequentist, hypothesis testing, p-value

We already have multiple threads tagged as p-values that reveal lots of misunderstandings about them. Ten months ago we had a thread about a psychological journal that "banned" $p$-values; now the American Statistical Association (2016) says that our analysis "should not end with the calculation of a $p$-value".

American Statistical Association (ASA) believes that the scientific
community could benefit from a formal statement clarifying several
widely agreed upon principles underlying the proper use and
interpretation of the $p$-value.

The committee lists other approaches as possible alternatives or supplements to $p$-values:

In view of the prevalent misuses of and misconceptions concerning
$p$-values, some statisticians prefer to supplement or even replace
$p$-values with other approaches. These include methods that emphasize
estimation over testing, such as confidence, credibility, or
prediction intervals; Bayesian methods; alternative measures of
evidence, such as likelihood ratios or Bayes Factors; and other
approaches such as decision-theoretic modeling and false discovery
rates. All these measures and approaches rely on further assumptions,
but they may more directly address the size of an effect (and its
associated uncertainty) or whether the hypothesis is correct.

So let's imagine a post-$p$-values reality. The ASA lists some methods that can be used in place of $p$-values, but why are they better? Which of them can be a real-life replacement for a researcher who has used $p$-values all their life? I imagine that questions like these will appear in a post-$p$-values reality, so let's try to be one step ahead of them. What is a reasonable alternative that can be applied out of the box? Why should this approach convince your lead researcher, editor, or readers?

As this follow-up blog entry suggests, $p$-values are unbeatable in their simplicity:

The p-value requires only a statistical model for the behavior of a
statistic under the null hypothesis to hold. Even if a model of an
alternative hypothesis is used for choosing a “good” statistic (which
would be used for constructing the p-value), this alternative model
does not have to be correct in order for the p-value to be valid and
useful (i.e.: control type I error at the desired level while offering
some power to detect a real effect). In contrast, other (wonderful and
useful) statistical methods such as Likelihood ratios, effect size
estimation, confidence intervals, or Bayesian methods all need the
assumed models to hold over a wider range of situations, not merely
under the tested null.
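To make the quoted point concrete, here is a minimal sketch (with made-up data) of a permutation test: the $p$-value is obtained from the behaviour of a statistic under the null alone (exchangeability of the group labels), without assuming any model for the alternative.

```python
# Minimal permutation-test sketch; the data and group sizes are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.5, 1.0, 20)   # hypothetical "treatment" group
b = rng.normal(0.0, 1.0, 20)   # hypothetical "control" group

observed = a.mean() - b.mean()
pooled = np.concatenate([a, b])

n_perm = 10_000
null_stats = np.empty(n_perm)
for i in range(n_perm):
    perm = rng.permutation(pooled)                       # relabel under H0: labels are exchangeable
    null_stats[i] = perm[:20].mean() - perm[20:].mean()  # statistic under the null

p_value = np.mean(np.abs(null_stats) >= abs(observed))   # two-sided p-value
print(f"observed difference = {observed:.2f}, permutation p-value = {p_value:.3f}")
```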

Are they really, or is this not true and can they easily be replaced?

I know this is broad, but the main question is simple: what is the best real-life alternative to $p$-values that can be used as a replacement, and why?


ASA (2016). ASA Statement on Statistical Significance and $P$-values. The American Statistician. (in press)

Best Answer

I will focus this answer on the specific question of what are the alternatives to $p$-values.

There are 21 discussion papers published along with the ASA statement (as Supplemental Materials): by Naomi Altman, Douglas Altman, Daniel J. Benjamin, Yoav Benjamini, Jim Berger, Don Berry, John Carlin, George Cobb, Andrew Gelman, Steve Goodman, Sander Greenland, John Ioannidis, Joseph Horowitz, Valen Johnson, Michael Lavine, Michael Lew, Rod Little, Deborah Mayo, Michele Millar, Charles Poole, Ken Rothman, Stephen Senn, Dalene Stangl, Philip Stark and Steve Ziliak (some of them wrote together; I list all for future searches). These people probably cover all existing opinions about $p$-values and statistical inference.

I have looked through all 21 papers.

Unfortunately, most of them do not discuss any real alternatives, even though the majority are about the limitations, misunderstandings, and various other problems with $p$-values (for a defense of $p$-values, see Benjamini, Mayo, and Senn). This already suggests that alternatives, if any, are not easy to find and/or to defend.

So let us look at the list of "other approaches" given in the ASA statement itself (as quoted in your question):

[Other approaches] include methods that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes Factors; and other approaches such as decision-theoretic modeling and false discovery rates.

  1. Confidence intervals

Confidence intervals are a frequentist tool that goes hand-in-hand with $p$-values; reporting a confidence interval (or some equivalent, e.g., mean $\pm$ standard error of the mean) together with the $p$-value is almost always a good idea.
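As a concrete illustration of reporting both, here is a minimal sketch (hypothetical data) that reports the two-sample $t$-test $p$-value together with a 95% confidence interval for the mean difference; the group labels and sample sizes are made up.

```python
# Minimal sketch: report a p-value together with a 95% CI for the mean difference.
# The data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.5, scale=1.0, size=20)   # hypothetical "treatment" group
b = rng.normal(loc=0.0, scale=1.0, size=20)   # hypothetical "control" group

t_stat, p_value = stats.ttest_ind(a, b)       # classical equal-variance two-sample t-test

diff = a.mean() - b.mean()
df = len(a) + len(b) - 2
pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / df
se = np.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
half_width = stats.t.ppf(0.975, df) * se      # 95% CI half-width (same pooled SE as the test)

print(f"p = {p_value:.3f}, difference = {diff:.2f}, "
      f"95% CI = [{diff - half_width:.2f}, {diff + half_width:.2f}]")
```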

Some people (not among the ASA disputants) suggest that confidence intervals should replace $p$-values. One of the most outspoken proponents of this approach is Geoff Cumming, who calls it the "new statistics" (a name that I find appalling). See, e.g., this blog post by Ulrich Schimmack for a detailed critique: A Critical Review of Cumming’s (2014) New Statistics: Reselling Old Statistics as New Statistics. See also the We cannot afford to study effect size in the lab blog post by Uri Simonsohn for a related point.

See also this thread (and my answer therein) about the similar suggestion by Norm Matloff, where I argue that when reporting CIs one would still like to have the $p$-values reported as well: What is a good, convincing example in which p-values are useful?

Some other people (not among the ASA disputants either), however, argue that confidence intervals, being a frequentist tool, are as misguided as $p$-values and should also be disposed of. See, e.g., Morey et al. 2015, The Fallacy of Placing Confidence in Confidence Intervals linked by @Tim here in the comments. This is a very old debate.

  2. Bayesian methods

(I don't like how the ASA statement formulates the list. Credible intervals and Bayes factors are listed separately from "Bayesian methods", but they are obviously Bayesian tools. So I count them together here.)

  • There is a huge and very opinionated literature on the Bayesian vs. frequentist debate. See, e.g., this recent thread for some thoughts: When (if ever) is a frequentist approach substantively better than a Bayesian? Bayesian analysis makes total sense if one has good informative priors, and everybody would be only happy to compute and report $p(\theta|\text{data})$ or $p(H_0:\theta=0|\text{data})$ instead of $p(\text{data at least as extreme}|H_0)$—but alas, people usually do not have good priors. An experimenter records 20 rats doing something in one condition and 20 rats doing the same thing in another condition; the prediction is that the performance of the former rats will exceed the performance of the latter rats, but nobody would be willing or indeed able to state a clear prior over the performance differences. (But see @FrankHarrell's answer where he advocates using "skeptical priors".)

  • Die-hard Bayesians suggest using Bayesian methods even if one does not have any informative priors. One recent example is Kruschke, 2012, Bayesian estimation supersedes the $t$-test, humbly abbreviated as BEST. The idea is to use a Bayesian model with weak uninformative priors to compute the posterior for the effect of interest (such as, e.g., a group difference). The practical difference with frequentist reasoning usually seems to be minor, and as far as I can see this approach remains unpopular. See What is an "uninformative prior"? Can we ever have one with truly no information? for the discussion of what is "uninformative" (answer: there is no such thing, hence the controversy).

  • An alternative approach, going back to Harold Jeffreys, is based on Bayesian testing (as opposed to Bayesian estimation) and uses Bayes factors. One of the more eloquent and prolific proponents is Eric-Jan Wagenmakers, who has published a lot on this topic in recent years. Two features of this approach are worth emphasizing here. First, see Wetzels et al., 2012, A Default Bayesian Hypothesis Test for ANOVA Designs for an illustration of just how strongly the outcome of such a Bayesian test can depend on the specific choice of the alternative hypothesis $H_1$ and the parameter distribution ("prior") it posits. Second, once a "reasonable" prior is chosen (Wagenmakers advertises Jeffreys' so-called "default" priors), the resulting Bayes factors often turn out to be quite consistent with the standard $p$-values, see e.g. this figure from this preprint by Marsman & Wagenmakers:

    [Figure: Bayes factors vs. $p$-values]

    So while Wagenmakers et al. keep insisting that $p$-values are deeply flawed and Bayes factors are the way to go, one cannot but wonder... (To be fair, the point of Wetzels et al. 2011 is that for $p$-values close to $0.05$ Bayes factors only indicate very weak evidence against the null; but note that this can be easily dealt with in a frequentist paradigm simply by using a more stringent $\alpha$, something that a lot of people are advocating anyway.)

    One of the more popular papers by Wagenmakers et al. in defense of Bayes factors is the 2011 Why psychologists must change the way they analyze their data: The case of psi, where they argue that Bem's infamous paper on predicting the future would not have reached its faulty conclusions if Bem had only used Bayes factors instead of $p$-values. See this thoughtful blog post by Ulrich Schimmack for a detailed (and IMHO convincing) counter-argument: Why Psychologists Should Not Change The Way They Analyze Their Data: The Devil is in the Default Prior.

    See also The Default Bayesian Test is Prejudiced Against Small Effects blog post by Uri Simonsohn.

  • For completeness, I mention that Wagenmakers 2007, A practical solution to the pervasive problems of $p$-values, suggested using BIC as an approximation to the Bayes factor to replace $p$-values. BIC does not depend on the prior and hence, despite its name, is not really Bayesian; I am not sure what to think about this proposal. It seems that more recently Wagenmakers has been more in favour of Bayesian tests with uninformative Jeffreys priors, see above. (A small numerical sketch of the BIC approximation follows right after this list.)
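As a small numerical sketch of that BIC approximation, $\mathrm{BF}_{01}\approx\exp\{(\mathrm{BIC}_1-\mathrm{BIC}_0)/2\}$, here is a made-up two-group comparison framed as two nested Gaussian linear models; the data and the bookkeeping are my own illustration, not code from Wagenmakers (2007).

```python
# Minimal sketch of the BIC approximation to the Bayes factor (hypothetical data):
# BF01 ~= exp((BIC_1 - BIC_0) / 2), with H0 = one common mean, H1 = two separate means.
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(0.4, 1.0, 20)          # hypothetical group 1
b = rng.normal(0.0, 1.0, 20)          # hypothetical group 2
y = np.concatenate([a, b])
n = len(y)

def bic_gaussian(rss, n, k):
    """BIC of a Gaussian linear model with k free parameters,
    up to an additive constant shared by both models (it cancels in the difference)."""
    return n * np.log(rss / n) + k * np.log(n)

rss0 = np.sum((y - y.mean()) ** 2)                                 # H0: one common mean
rss1 = np.sum((a - a.mean()) ** 2) + np.sum((b - b.mean()) ** 2)   # H1: separate group means

bic0 = bic_gaussian(rss0, n, k=2)     # mean + variance
bic1 = bic_gaussian(rss1, n, k=3)     # two means + variance

bf01 = np.exp((bic1 - bic0) / 2)      # approximate Bayes factor in favour of H0
print(f"BIC0 = {bic0:.1f}, BIC1 = {bic1:.1f}, approximate BF01 = {bf01:.2f}")
```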


For further discussion of Bayes estimation vs. Bayesian testing, see Bayesian parameter estimation or Bayesian hypothesis testing? and links therein.

  3. Minimum Bayes factors

Among the ASA disputants, this is explicitly suggested by Benjamin & Berger and by Valen Johnson (the only two papers that are all about suggesting a concrete alternative). Their specific suggestions are a bit different but they are similar in spirit.

  • The ideas of Berger go back to Berger & Sellke 1987, and there are a number of papers by Berger, Sellke, and collaborators up until last year elaborating on this work. The idea is that under a spike-and-slab prior, where the point null hypothesis $\mu=0$ gets probability $0.5$ and all other values of $\mu$ get probability $0.5$ spread symmetrically around $0$ ("local alternative"), the minimal posterior probability $p(H_0|\text{data})$ over all local alternatives, i.e. the minimal Bayes factor, is much higher than the $p$-value. This is the basis of the (much contested) claim that $p$-values "overstate the evidence" against the null. The suggestion is to use a lower bound on the Bayes factor in favour of the null instead of the $p$-value; under some broad assumptions this lower bound turns out to be given by $-e\,p\log(p)$, i.e., the $p$-value is effectively multiplied by $-e\log(p)$, which is a factor of around $10$ to $20$ for the common range of $p$-values. This approach has been endorsed by Steven Goodman too. (A short numerical illustration follows this list.)

    Later update: See a nice cartoon explaining these ideas in a simple way.

    Even later update: See Held & Ott, 2018, On $p$-Values and Bayes Factors for a comprehensive review and further analysis of converting $p$-values to minimum Bayes factors. Here is one table from there:

    [Table: minimum Bayes factors]

  • Valen Johnson suggested something similar in his PNAS 2013 paper; his suggestion approximately boils down to multiplying $p$-values by $\sqrt{-4\pi\log(p)}$ which is around $5$ to $10$.
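Here is a short numerical illustration of the two conversions just described: the $-e\,p\log(p)$ lower bound on the Bayes factor in favour of the null, and the $p\sqrt{-4\pi\log(p)}$ approximation attributed above to Johnson. The helper names are mine; the formulas are used exactly as quoted.

```python
# Minimal sketch: convert p-values into (approximate) minimum Bayes factors.
import numpy as np

def min_bf_sellke(p):
    """Lower bound on the Bayes factor in favour of H0: -e * p * log(p), valid for p < 1/e."""
    return -np.e * p * np.log(p)

def approx_bf_johnson(p):
    """Johnson-style approximation as quoted above: p multiplied by sqrt(-4 * pi * log(p))."""
    return p * np.sqrt(-4 * np.pi * np.log(p))

for p in [0.05, 0.01, 0.005, 0.001]:
    print(f"p = {p:<6} -e*p*log(p) = {min_bf_sellke(p):.3f}   "
          f"p*sqrt(-4*pi*log(p)) = {approx_bf_johnson(p):.3f}")
```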


For a brief critique of Johnson's paper, see Andrew Gelman's and @Xi'an's reply in PNAS. For the counter-argument to Berger & Sellke 1987, see Casella & Berger 1987 (different Berger!). Among the ASA discussion papers, Stephen Senn argues explicitly against any of these approaches:

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than $P$-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities.

See also references in Senn's paper, including the ones to Mayo's blog.

  4. The ASA statement lists "decision-theoretic modeling and false discovery rates" as another alternative. I have no idea what they are talking about, and I was happy to see this stated in the discussion paper by Stark:

The "other approaches" section ignores the fact that the assumptions of some of those methods are identical to those of $p$-values. Indeed, some of the methods use $p$-values as input (e.g., the False Discovery Rate).


I am highly skeptical that there is anything that can replace $p$-values in actual scientific practice such that the problems that are often associated with $p$-values (replication crisis, $p$-hacking, etc.) would go away. Any fixed decision procedure, e.g. a Bayesian one, can probably be "hacked" in the same way as $p$-values can be $p$-hacked (for some discussion and demonstration of this see this 2014 blog post by Uri Simonsohn).

To quote from Andrew Gelman's discussion paper:

In summary, I agree with most of the ASA’s statement on $p$-values but I feel that the problems are deeper, and that the solution is not to reform $p$-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

And from Stephen Senn:

In short, the problem is less with $P$-values per se but with making an idol of them. Substituting another false god will not help.

And here is how Cohen put it in his well-known and highly cited (3.5k citations) 1994 paper The Earth is round ($p<0.05$), where he argued very strongly against $p$-values:

[...] don't look for a magic alternative to NHST, some other objective mechanical ritual to replace it. It doesn't exist.