Alternatives to the P-Value – Ziliak’s Opposition and Suggested Methods for Hypothesis Testing

bayesian, hypothesis-testing, p-value, r, statistical-significance

In a recent article on the demerits of relying on the p-value for statistical inference, "Matrixx v. Siracusano and Student v. Fisher: Statistical significance on trial" (DOI: 10.1111/j.1740-9713.2011.00511.x), Stephen T. Ziliak opposes the use of p-values. In the concluding paragraphs he says:

The data is the one thing that we already do know, and for certain.
What we actually want to know is something quite different: the
probability of a hypothesis being true (or at least practically
useful), given the data we have. We want to know the probability that
the two drugs are different, and by how much, given the available
evidence. The significance test – based as it is on the fallacy of the
transposed conditional, the trap that Fisher fell into – does not and
cannot tell us that probability. The power function, the expected
loss function, and many other decision-theoretic and Bayesian methods
descending from Student and Jeffreys, now widely available and free
on-line, do.

What are the power function, the expected loss function, and the "other decision-theoretic and Bayesian methods"? Are these methods widely used? Are they available in R? How are these suggested methods implemented? How, for instance, would I use them to test a hypothesis on a dataset for which I would otherwise use a conventional two-sample t-test and p-value?

Best Answer

This sounds like another strident paper by a confused individual. Fisher didn't fall into any such trap, though many students of statistics do.

Hypothesis testing is a decision-theoretic problem. Generally, you end up with a test with a given threshold between the two decisions (hypothesis true or hypothesis false). If you have a hypothesis which corresponds to a single point, such as $\theta = 0$, then you can calculate the probability of your data arising when it is true. But what do you do if the hypothesis is not a single point? You get a function of $\theta$: for each value of $\theta$, the probability that the test rejects the null. The composite hypothesis $\theta \neq 0$ is such a hypothesis, and the function you get for it is the power function. It's very classical; Fisher knew all about it.
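To make that concrete, here is a minimal sketch in R (the language the question asks about) that traces the power function of a two-sample t-test using power.t.test() from the base stats package. The per-group sample size, standard deviation, and significance level are illustrative assumptions of mine, not values from the article or the answer:

```r
# Power function of the two-sample t-test: P(reject H0) as a function
# of the true difference theta. n = 20 per group, sd = 1, and
# alpha = 0.05 are illustrative choices.
theta <- seq(0, 2, by = 0.05)
pow <- sapply(theta, function(d)
  power.t.test(n = 20, delta = d, sd = 1, sig.level = 0.05,
               type = "two.sample", strict = TRUE)$power)

plot(theta, pow, type = "l",
     xlab = expression(theta), ylab = "Power",
     main = "Power function of the two-sample t-test")
abline(h = 0.05, lty = 2)  # with strict = TRUE, power at theta = 0 equals alpha
```

At $\theta = 0$ the power function equals the Type I error rate $\alpha$, and it climbs toward 1 as the true difference grows.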

The expected loss is a part of the basic machinery of decision theory. You have various states of nature, and various possible data resulting from them, and some possible decisions you can make, and you want to find a good function from data to decision. How do you define good? Given a particular state of nature underlying the data you have obtained, and the decision made by that procedure, what is your expected loss? This is most simply understood in business problems (if I do this based on the sales I observed in the past three quarters, what is the expected monetary loss?).
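Here is a toy sketch of that calculation in R, under assumptions of my own: two states of nature ($\theta = 0$ and $\theta = 0.5$), a made-up loss table, and the usual two-sample t-test at $\alpha = 0.05$ as the decision rule. The expected loss (risk) of the rule at each state is estimated by simulation:

```r
# Loss table: rows are states of nature, columns are decisions.
# The numbers are invented for illustration.
loss <- matrix(c(0, 1,    # truth theta = 0:   lose 1 if we wrongly reject
                 5, 0),   # truth theta = 0.5: lose 5 if we wrongly accept
               nrow = 2, byrow = TRUE,
               dimnames = list(c("theta=0", "theta=0.5"),
                               c("accept", "reject")))

# Expected loss of the alpha = 0.05 t-test rule at a given state of
# nature, estimated by simulating datasets and averaging the losses.
risk <- function(theta, n = 20, reps = 5000) {
  rejects <- replicate(reps, {
    x <- rnorm(n, mean = 0)
    y <- rnorm(n, mean = theta)
    t.test(x, y)$p.value < 0.05          # the decision rule
  })
  p_reject <- mean(rejects)
  state <- if (theta == 0) "theta=0" else "theta=0.5"
  (1 - p_reject) * loss[state, "accept"] + p_reject * loss[state, "reject"]
}

risk(0)    # roughly alpha * 1: the cost of false alarms
risk(0.5)  # roughly (1 - power) * 5: the cost of missed effects
```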

Bayesian procedures are a subset of decision theoretic procedures. The expected loss is insufficient to specify uniquely best procedures in all but trivial cases. If one procedure is better than another in both state A and B, obviously you'll prefer it, but if one is better in state A and one is better in state B, which do you choose? This is where ancillary ideas like Bayes procedures, minimaxity, and unbiasedness enter.
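A self-contained toy in R, with invented risk numbers, shows how those ancillary ideas break the tie when two rules' risks cross:

```r
# risk_tbl[rule, state]: expected loss of each rule in each state of
# nature. The numbers are made up to create the crossing described above.
risk_tbl <- matrix(c(0.10, 0.40,   # rule 1: good in state A, poor in state B
                     0.30, 0.20),  # rule 2: the reverse
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("rule1", "rule2"),
                                   c("stateA", "stateB")))

# Bayes: weight the states by a prior and minimize the average risk.
prior <- c(stateA = 0.7, stateB = 0.3)
bayes_risk <- risk_tbl %*% prior
rownames(bayes_risk)[which.min(bayes_risk)]  # "rule1" under this prior

# Minimax: minimize the worst-case risk over states.
worst <- apply(risk_tbl, 1, max)
names(which.min(worst))                      # "rule2" wins the worst case
```

A different prior can flip the Bayes choice, which is exactly why the expected loss alone does not single out a best procedure.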

The t-test is actually a perfectly good solution to a decision-theoretic problem. The question is how you choose the cutoff on the $t$ you calculate. A given cutoff corresponds to a given value of $\alpha$, the probability of a Type I error, and to a power $\beta(\theta)$ that depends on the size of the underlying parameter you are estimating. Is it an approximation to use a point null hypothesis? Yes. Is it usually a problem in practice? No, just as using Bernoulli's approximate theory for beam deflection is usually just fine in structural engineering. Is having the $p$-value useless? No. Another person looking at your data may want to use a different $\alpha$ than you, and the $p$-value accommodates that use.
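A small simulated illustration of that last point (the data and thresholds are mine, not the answer's): the analysis reports a single p-value, and readers with different values of $\alpha$ each make their own decision from it:

```r
# One reported p-value serves readers with different alpha thresholds.
set.seed(1)
x <- rnorm(25, mean = 0)
y <- rnorm(25, mean = 0.6)
p <- t.test(x, y)$p.value  # the conventional two-sample t-test

p < 0.05  # one reader rejects at alpha = 0.05
p < 0.01  # a stricter reader may reach a different decision from the same p
```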

I'm also a little confused as to why he names Student and Jeffreys together, considering that Fisher was responsible for the wide dissemination of Student's work.

Basically, the blind use of p-values is a bad idea, and they are a rather subtle concept, but that doesn't make them useless. Should we object to their misuse by researchers with poor mathematical backgrounds? Absolutely, but let's remember what statistical practice looked like before Fisher tried to distill something down for the man in the field to use.