A) What is the best single index of the degree to which the data violates normality?
B) Or is it just better to talk about multiple indices of normality violation (e.g., skewness, kurtosis, outlier prevalence)?
I would vote for B. Different violations have different consequences. For example, unimodal, symmetric distributions with heavy tails make your CIs very wide and presumably reduce the power to detect any effects; the mean, however, still hits the "typical" value. For very skewed distributions, the mean, for example, might not be a very sensible index of "the typical value".
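To make that distinction concrete, here is a small R sketch (my own toy simulation, not taken from any of the papers cited below): symmetric but heavy-tailed data mainly inflate the width of the usual CI for the mean, whereas skewed data pull the mean away from the median, i.e. away from the "typical" value.
set.seed(1)
# Heavy tails: t with 3 df vs. standard normal, same location
ci_width <- function(x) diff(t.test(x)$conf.int)
mean(replicate(1000, ci_width(rnorm(50))))       # average CI width under normality
mean(replicate(1000, ci_width(rt(50, df = 3))))  # noticeably wider under heavy tails
# Skew: log-normal sample, the mean overshoots the "typical" value
x <- rlnorm(10000)
mean(x)    # pulled up by the long right tail (expected value is about 1.65)
median(x)  # close to 1, where the bulk of the data sits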
C) How can confidence intervals be calculated (or perhaps a Bayesian approach) for the index?
I don't know about Bayesian statistics, but concerning classical tests of normality, I'd like to cite Erceg-Hurn and Mirosevich (2008) [2]:
Another problem is that assumption tests have their own assumptions. Normality tests usually assume that data are homoscedastic; tests of homoscedasticity assume that data are normally distributed. If the normality and homoscedasticity assumptions are violated, the validity of the assumption tests can be seriously compromised. Prominent statisticians have described the assumption tests (e.g., Levene’s test, the Kolmogorov–Smirnov test) built into software such as SPSS as fatally flawed and recommended that these tests never be used (D’Agostino, 1986; Glass & Hopkins, 1996).
D) What kind of verbal labels could you assign to points on that index to indicate the degree of violation of normality (e.g., mild, moderate, strong, extreme, etc.)?
Micceri (1989) [1] analyzed 440 large-scale data sets in psychology. He assessed symmetry and tail weight and defined criteria and labels for both. Labels for asymmetry range from 'relatively symmetric' to 'moderate --> extreme --> exponential asymmetry'. Labels for tail weight range from 'Uniform --> less than Gaussian --> About Gaussian --> Moderate --> Extreme --> Double exponential contamination'.
Each classification is based on multiple, robust criteria.
He found that of these 440 data sets only 28% were relatively symmetric, and only 15% were about Gaussian in terms of tail weight. Hence the apt title of the paper:
The unicorn, the normal curve, and other improbable creatures
I wrote an R function that automatically assesses Micceri's criteria and prints out the labels:
# Assesses Micceri's (1989) criteria for tail weight and symmetry of a distribution
# and prints the corresponding category labels.
micceri <- function(x, plot = FALSE) {
  library(fBasics)  # provides skewness()

  n   <- length(x)
  x.s <- sort(x)

  # QS: distances of the .975/.95/.90 quantiles from the median,
  # scaled by the distance of the .75 quantile from the median
  QS <- (quantile(x, prob = c(.975, .95, .90)) - median(x)) /
        (quantile(x, prob = .75) - median(x))

  # Means of the upper/lower tails and of the middle 25% of the sorted data
  U05 <- mean(x.s[(.95 * n):n])
  L05 <- mean(x.s[1:(.05 * n)])
  U20 <- mean(x.s[(.80 * n):n])
  L20 <- mean(x.s[1:(.20 * n)])
  U50 <- mean(x.s[(.50 * n):n])
  L50 <- mean(x.s[1:(.50 * n)])
  M25 <- mean(x.s[(.375 * n):(.625 * n)])

  Q  <- (U05 - L05) / (U50 - L50)
  Q1 <- (U20 - L20) / (U50 - L50)
  Q2 <- (U05 - M25) / (M25 - L05)

  # Mean/median discrepancy, standardized by half the interquartile range
  QR <- quantile(x, prob = c(.25, .75))
  MM <- abs(mean(x) - median(x)) / (1.4807 * (abs(QR[2] - QR[1]) / 2))

  SKEW <- skewness(x)

  if (plot) plot(density(x))

  tail_weight <- round(c(QS, Q = Q, Q1 = Q1), 2)
  symmetry    <- round(c(Skewness = SKEW, MM = MM, Q2 = Q2), 2)

  # Cut-offs for the category labels; column i holds the cut-offs for index i
  cat.tail <- matrix(c(1.9, 2.75, 3.05, 3.9, 4.3,
                       1.8, 2.3, 2.5, 2.8, 3.3,
                       1.6, 1.85, 1.93, 2, 2.3,
                       1.9, 2.5, 2.65, 2.73, 3.3,
                       1.6, 1.7, 1.8, 1.85, 1.93), ncol = 5, nrow = 5)
  cat.sym <- matrix(c(0.31, 0.71, 2,
                      0.05, 0.18, 0.37,
                      1.25, 1.75, 4.70), ncol = 3, nrow = 3)

  # For each index, count how many cut-offs are exceeded;
  # the reported category is the most extreme one across indexes
  ts <- c()
  for (i in 1:5) ts <- c(ts, sum(abs(tail_weight[i]) > cat.tail[, i]) + 1)
  ss <- c()
  for (i in 1:3) ss <- c(ss, sum(abs(symmetry[i]) > cat.sym[, i]) + 1)

  tlabels <- c("Uniform", "Less than Gaussian", "About Gaussian",
               "Moderate contamination", "Extreme contamination",
               "Double exponential contamination")
  slabels <- c("Relatively symmetric", "Moderate asymmetry",
               "Extreme asymmetry", "Exponential asymmetry")

  cat("Tail weight indexes:\n")
  print(tail_weight)
  cat(paste("\nMicceri category:", tlabels[max(ts)], "\n"))
  cat("\n\nAsymmetry indexes:\n")
  print(symmetry)
  cat(paste("\nMicceri category:", slabels[max(ss)]))

  tail.cat <- factor(max(ts), levels = 1:length(tlabels), labels = tlabels, ordered = TRUE)
  sym.cat  <- factor(max(ss), levels = 1:length(slabels), labels = slabels, ordered = TRUE)

  invisible(list(tail_weight = tail_weight, symmetry = symmetry,
                 tail.cat = tail.cat, sym.cat = sym.cat))
}
Here's a test for the standard normal distribution, a $t$ with 8 df, and a log-normal:
> micceri(rnorm(10000))
Tail weight indexes:
97.5% 95% 90% Q Q1
2.86 2.42 1.88 2.59 1.76
Micceri category: About Gaussian
Asymmetry indexes:
Skewness MM.75% Q2
0.01 0.00 1.00
Micceri category: Relatively symmetric
> micceri(rt(10000, 8))
Tail weight indexes:
97.5% 95% 90% Q Q1
3.19 2.57 1.94 2.81 1.79
Micceri category: Extreme contamination
Asymmetry indexes:
Skewness MM.75% Q2
-0.03 0.00 0.98
Micceri category: Relatively symmetric
> micceri(rlnorm(10000))
Tail weight indexes:
97.5% 95% 90% Q Q1
6.24 4.30 2.67 3.72 1.93
Micceri category: Double exponential contamination
Asymmetry indexes:
Skewness MM.75% Q2
5.28 0.59 8.37
Micceri category: Exponential asymmetry
[1] Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166. doi:10.1037/0033-2909.105.1.156
[2] Erceg-Hurn, D. M., & Mirosevich, V. M. (2008). Modern robust statistical methods: An easy way to maximize the accuracy and power of your research. American Psychologist, 63, 591-601.
In univariate interval estimation, the set of possible actions is the set of ordered pairs specifying the endpoints of the interval. Let an element of that set be represented by $(a, b)$ with $a \le b$.
Highest posterior density intervals
Let the posterior density be $f(\theta)$. The highest posterior density intervals correspond to the loss function that penalizes an interval that fails to contain the true value and also penalizes intervals in proportion to their length:
$L_{HPD}(\theta, (a, b); k) = I(\theta \notin [a, b]) + k(b - a), \quad 0 < k \le \max_{\theta} f(\theta)$,
where $I(\cdot)$ is the indicator function. This gives the expected posterior loss
$\tilde{L}_{HPD}((a, b); k) = 1 - \Pr(a \le \theta \le b|D) + k(b - a)$.
Setting $\frac{\partial}{\partial a}\tilde{L}_{HPD} = \frac{\partial}{\partial b}\tilde{L}_{HPD} = 0$ yields the necessary condition for a local optimum in the interior of the parameter space: $f(a) = f(b) = k$, which is exactly the rule for HPD intervals, as expected.
The form of $\tilde{L}_{HPD}((a, b); k)$ gives some insight into why HPD intervals are not invariant to a monotone increasing transformation $g(\theta)$ of the parameter. The $\theta$-space HPD interval transformed into $g(\theta)$ space is different from the $g(\theta)$-space HPD interval because the two intervals correspond to different loss functions: the $g(\theta)$-space HPD interval corresponds to a transformed length penalty $k(g(b) - g(a))$.
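To make the non-invariance concrete, here is a small numerical sketch in R for an assumed Beta(3, 9) posterior (the helper functions hpd_beta and shortest are mine, purely for illustration): the shortest 95% interval has approximately equal density at its two endpoints, as the rule above implies, while the HPD computed on the log-odds scale is not the log-odds transform of the $\theta$-scale HPD.
set.seed(123)
a_shape <- 3; b_shape <- 9   # assumed Beta(3, 9) posterior for theta
# Shortest 95% interval: slide a window of posterior mass 0.95 and minimize its length
hpd_beta <- function(mass = 0.95) {
  len <- function(w) qbeta(w + mass, a_shape, b_shape) - qbeta(w, a_shape, b_shape)
  w   <- optimize(len, c(0, 1 - mass))$minimum
  qbeta(c(w, w + mass), a_shape, b_shape)
}
hpd <- hpd_beta()
dbeta(hpd, a_shape, b_shape)   # both endpoints have (about) the same density: f(a) = f(b) = k
# Non-invariance: the HPD for log-odds(theta), computed from posterior draws,
# differs from the theta-scale HPD pushed through the log-odds transform
draws <- rbeta(1e5, a_shape, b_shape)
shortest <- function(s, mass = 0.95) {   # shortest interval containing `mass` of the draws
  s <- sort(s)
  k <- floor(mass * length(s))
  i <- which.min(s[(k + 1):length(s)] - s[1:(length(s) - k)])
  c(s[i], s[i + k])
}
qlogis(hpd)              # theta-scale HPD, transformed to the log-odds scale
shortest(qlogis(draws))  # HPD computed directly on the log-odds scale: not the same interval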
Quantile-based credible intervals
Consider point estimation with the loss function
$L_q(\theta, \hat{\theta};p) = (1-p)(\hat{\theta} - \theta)I(\theta < \hat{\theta}) + p(\theta - \hat{\theta})I(\theta \ge \hat{\theta}), \quad 0 \le p \le 1$, the familiar check (pinball) loss, which penalizes overestimation with weight $1-p$ and underestimation with weight $p$.
The posterior expected loss is
$\tilde{L}_q(\hat{\theta};p)=(1-p)\Pr(\theta < \hat{\theta}|D)\left(\hat{\theta}-\text{E}(\theta|\theta < \hat{\theta}, D)\right) + p\Pr(\theta \ge \hat{\theta}|D)\left(\text{E}(\theta | \theta \ge \hat{\theta}, D)-\hat{\theta}\right)$.
Setting $\frac{d}{d\hat{\theta}}\tilde{L}_q = \Pr(\theta < \hat{\theta}|D) - p = 0$ yields the implicit equation
$\Pr(\theta < \hat{\theta}|D) = p$,
that is, the optimal $\hat{\theta}$ is the $(100p)$% quantile of the posterior distribution, as expected.
Thus to get quantile-based interval estimates, the loss function is
$L_{qCI}(\theta, (a,b); p_L, p_U) = L_q(\theta, a;p_L) + L_q(\theta, b;p_U)$.
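So, for a central 95% interval, one would take $p_L = 0.025$ and $p_U = 0.975$. A quick Monte Carlo check in R (with an assumed Gamma(2, 1) posterior and my own little helper Lq, purely for illustration) confirms that minimizing the posterior expected loss recovers the corresponding posterior quantiles:
set.seed(2)
theta <- rgamma(1e5, shape = 2, rate = 1)   # draws from the assumed posterior
# Posterior expected quantile loss, estimated from the draws
Lq <- function(est, p) {
  mean((1 - p) * (est - theta) * (theta < est) + p * (theta - est) * (theta >= est))
}
optimize(Lq, range(theta), p = 0.025)$minimum  # close to the 2.5% posterior quantile
optimize(Lq, range(theta), p = 0.975)$minimum  # close to the 97.5% posterior quantile
quantile(theta, c(0.025, 0.975))               # the direct quantiles, for comparison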
Best Answer
This sounds like another strident paper by a confused individual. Fisher didn't fall into any such trap, though many students of statistics do.
Hypothesis testing is a decision theoretic problem. Generally, you end up with a test with a given threshold between the two decisions (hypothesis true or hypothesis false). If your hypothesis corresponds to a single point, such as $\theta=0$, then you can calculate the probability of observing your data when it is true. But what do you do if it's not a single point? You get a function of $\theta$. The hypothesis $\theta\not= 0$ is such a composite hypothesis, and what you get is the probability that your test rejects, as a function of the true $\theta$. That function is the power function. It's very classical. Fisher knew all about it.
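For instance, here is a minimal R sketch of such a power function for a one-sample $t$-test ($n = 20$, $\sigma = 1$ and $\alpha = .05$ are assumed purely for illustration); the point is that you get a rejection probability for every possible effect size, not a single number:
# Power of the one-sample t-test as a function of the true effect delta
delta <- seq(0, 1.5, by = 0.25)
power <- sapply(delta, function(d)
  power.t.test(n = 20, delta = d, sd = 1, sig.level = 0.05,
               type = "one.sample")$power)
round(data.frame(delta, power), 3)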
The expected loss is a part of the basic machinery of decision theory. You have various states of nature, and various possible data resulting from them, and some possible decisions you can make, and you want to find a good function from data to decision. How do you define good? Given a particular state of nature underlying the data you have obtained, and the decision made by that procedure, what is your expected loss? This is most simply understood in business problems (if I do this based on the sales I observed in the past three quarters, what is the expected monetary loss?).
Bayesian procedures are a subset of decision theoretic procedures. Expected loss alone is insufficient to single out a uniquely best procedure in all but trivial cases, because it depends on the unknown state of nature. If one procedure is better than another in both state A and state B, obviously you'll prefer it; but if one is better in state A and the other is better in state B, which do you choose? This is where ancillary ideas like Bayes procedures, minimaxity, and unbiasedness enter.
The t-test is actually a perfectly good solution to a decision theoretic problem. The question is how you choose the cutoff on the $t$ you calculate. A given cutoff on $t$ corresponds to a given value of $\alpha$, the probability of a Type I error, and to a corresponding set of Type II error rates $\beta$ (equivalently, powers $1-\beta$), one for each value of the underlying parameter you are estimating. Is it an approximation to use a point null hypothesis? Yes. Is it usually a problem in practice? No, just as using Bernoulli's approximate theory for beam deflection is usually just fine in structural engineering. Is having the $p$-value useless? No. Another person looking at your data may want to use a different $\alpha$ than you, and the $p$-value accommodates that use.
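As a small illustration of that bookkeeping in R (the degrees of freedom and the observed $t$ are made up): each choice of $\alpha$ has its own critical value, and reporting the $p$-value is what lets a second reader apply a different $\alpha$ to the same data.
dof <- 19
qt(1 - 0.05 / 2, dof)    # two-sided critical value for alpha = .05
qt(1 - 0.01 / 2, dof)    # the stricter cutoff for alpha = .01
t_obs <- 2.4             # an assumed observed t statistic
2 * pt(-abs(t_obs), dof) # its two-sided p-value: compare it to whichever alpha you prefer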
I'm also a little confused about why he names Student and Jeffreys together, considering that Fisher was responsible for the wide dissemination of Student's work.
Basically, the blind use of p-values is a bad idea; they are a rather subtle concept, but that doesn't make them useless. Should we object to their misuse by researchers with poor mathematical backgrounds? Absolutely, but let's remember what statistical practice looked like before Fisher tried to distill something down for the man in the field to use.