A) What is the best single index of the degree to which the data violates normality?
B) Or is it just better to talk about multiple indices of normality violation (e.g., skewness, kurtosis, outlier prevalence)?
I would vote for B. Different violations have different consequences. For example, unimodal, symmetric distributions with heavy tails make your CIs very wide and presumably reduce the power to detect any effects; the mean, however, still hits the "typical" value. For very skewed distributions, the mean, for example, might not be a very sensible index of "the typical value".
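To make that distinction concrete, here is a small R sketch (my own toy simulation, not taken from any of the papers cited below): symmetric but heavy-tailed data mainly inflate the width of the usual CI for the mean, whereas skewed data pull the mean away from the median, i.e. away from the "typical" value.
set.seed(1)
# Heavy tails: t with 3 df vs. standard normal, same location
ci_width <- function(x) diff(t.test(x)$conf.int)
mean(replicate(1000, ci_width(rnorm(50))))       # average CI width under normality
mean(replicate(1000, ci_width(rt(50, df = 3))))  # noticeably wider under heavy tails
# Skew: log-normal sample, the mean overshoots the "typical" value
x <- rlnorm(10000)
mean(x)    # pulled up by the long right tail (expected value is about 1.65)
median(x)  # close to 1, where the bulk of the data sits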
C) How can confidence intervals be calculated (or perhaps a Bayesian approach) for the index?
I don't know about Bayesian statistics, but concerning classical tests of normality, I'd like to cite Erceg-Hurn and Mirosevich (2008) [2]:
Another problem is that assumption tests have their own assumptions. Normality tests usually assume that data are homoscedastic; tests of homoscedasticity assume that data are normally distributed. If the normality and homoscedasticity assumptions are violated, the validity of the assumption tests can be seriously compromised. Prominent statisticians have described the assumption tests (e.g., Levene’s test, the Kolmogorov–Smirnov test) built into software such as SPSS as fatally flawed and recommended that these tests never be used (D’Agostino, 1986; Glass & Hopkins, 1996).
D) What kind of verbal labels could you assign to points on that index to indicate the degree of violation of normality (e.g., mild, moderate, strong, extreme, etc.)?
Micceri (1989) [1] analyzed 440 large-scale data sets in psychology. He assessed symmetry and tail weight and defined criteria and labels for both. Labels for asymmetry range from 'relatively symmetric' to 'moderate --> extreme --> exponential asymmetry'. Labels for tail weight range from 'Uniform --> less than Gaussian --> About Gaussian --> Moderate --> Extreme --> Double exponential contamination'.
Each classification is based on multiple, robust criteria.
He found that of these 440 data sets only 28% were relatively symmetric, and only 15% were about Gaussian in terms of tail weight. Hence the apt title of the paper:
The unicorn, the normal curve, and other improbable creatures
I wrote an R function that automatically assesses Micceri's criteria and prints out the labels:
# Assesses Micceri's (1989) criteria for tail weight and symmetry of a distribution
# and prints the corresponding category labels.
micceri <- function(x, plot = FALSE) {
  library(fBasics)  # provides skewness()

  n   <- length(x)
  x.s <- sort(x)

  # QS: distances of the .975/.95/.90 quantiles from the median,
  # scaled by the distance of the .75 quantile from the median
  QS <- (quantile(x, prob = c(.975, .95, .90)) - median(x)) /
        (quantile(x, prob = .75) - median(x))

  # Means of the upper/lower tails and of the middle 25% of the sorted data
  U05 <- mean(x.s[(.95 * n):n])
  L05 <- mean(x.s[1:(.05 * n)])
  U20 <- mean(x.s[(.80 * n):n])
  L20 <- mean(x.s[1:(.20 * n)])
  U50 <- mean(x.s[(.50 * n):n])
  L50 <- mean(x.s[1:(.50 * n)])
  M25 <- mean(x.s[(.375 * n):(.625 * n)])

  Q  <- (U05 - L05) / (U50 - L50)
  Q1 <- (U20 - L20) / (U50 - L50)
  Q2 <- (U05 - M25) / (M25 - L05)

  # Mean/median discrepancy, standardized by half the interquartile range
  QR <- quantile(x, prob = c(.25, .75))
  MM <- abs(mean(x) - median(x)) / (1.4807 * (abs(QR[2] - QR[1]) / 2))

  SKEW <- skewness(x)

  if (plot) plot(density(x))

  tail_weight <- round(c(QS, Q = Q, Q1 = Q1), 2)
  symmetry    <- round(c(Skewness = SKEW, MM = MM, Q2 = Q2), 2)

  # Cut-offs for the category labels; column i holds the cut-offs for index i
  cat.tail <- matrix(c(1.9, 2.75, 3.05, 3.9, 4.3,
                       1.8, 2.3, 2.5, 2.8, 3.3,
                       1.6, 1.85, 1.93, 2, 2.3,
                       1.9, 2.5, 2.65, 2.73, 3.3,
                       1.6, 1.7, 1.8, 1.85, 1.93), ncol = 5, nrow = 5)
  cat.sym <- matrix(c(0.31, 0.71, 2,
                      0.05, 0.18, 0.37,
                      1.25, 1.75, 4.70), ncol = 3, nrow = 3)

  # For each index, count how many cut-offs are exceeded;
  # the reported category is the most extreme one across indexes
  ts <- c()
  for (i in 1:5) ts <- c(ts, sum(abs(tail_weight[i]) > cat.tail[, i]) + 1)
  ss <- c()
  for (i in 1:3) ss <- c(ss, sum(abs(symmetry[i]) > cat.sym[, i]) + 1)

  tlabels <- c("Uniform", "Less than Gaussian", "About Gaussian",
               "Moderate contamination", "Extreme contamination",
               "Double exponential contamination")
  slabels <- c("Relatively symmetric", "Moderate asymmetry",
               "Extreme asymmetry", "Exponential asymmetry")

  cat("Tail weight indexes:\n")
  print(tail_weight)
  cat(paste("\nMicceri category:", tlabels[max(ts)], "\n"))
  cat("\n\nAsymmetry indexes:\n")
  print(symmetry)
  cat(paste("\nMicceri category:", slabels[max(ss)]))

  tail.cat <- factor(max(ts), levels = 1:length(tlabels), labels = tlabels, ordered = TRUE)
  sym.cat  <- factor(max(ss), levels = 1:length(slabels), labels = slabels, ordered = TRUE)

  invisible(list(tail_weight = tail_weight, symmetry = symmetry,
                 tail.cat = tail.cat, sym.cat = sym.cat))
}
Here's a test for the standard normal distribution, a $t$ with 8 df, and a log-normal:
> micceri(rnorm(10000))
Tail weight indexes:
97.5% 95% 90% Q Q1
2.86 2.42 1.88 2.59 1.76
Micceri category: About Gaussian
Asymmetry indexes:
Skewness MM.75% Q2
0.01 0.00 1.00
Micceri category: Relatively symmetric
> micceri(rt(10000, 8))
Tail weight indexes:
97.5% 95% 90% Q Q1
3.19 2.57 1.94 2.81 1.79
Micceri category: Extreme contamination
Asymmetry indexes:
Skewness MM.75% Q2
-0.03 0.00 0.98
Micceri category: Relatively symmetric
> micceri(rlnorm(10000))
Tail weight indexes:
97.5% 95% 90% Q Q1
6.24 4.30 2.67 3.72 1.93
Micceri category: Double exponential contamination
Asymmetry indexes:
Skewness MM.75% Q2
5.28 0.59 8.37
Micceri category: Exponential asymmetry
[1] Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166. doi:10.1037/0033-2909.105.1.156
[2] Erceg-Hurn, D. M., & Mirosevich, V. M. (2008). Modern robust statistical methods: An easy way to maximize the accuracy and power of your research. American Psychologist, 63, 591-601.
In univariate interval estimation, the set of possible actions is the set of ordered pairs specifying the endpoints of the interval. Let an element of that set be represented by $(a, b)$ with $a \le b$.
Highest posterior density intervals
Let the posterior density be $f(\theta)$. The highest posterior density intervals correspond to the loss function that penalizes an interval that fails to contain the true value and also penalizes intervals in proportion to their length:
$L_{HPD}(\theta, (a, b); k) = I(\theta \notin [a, b]) + k(b - a), \quad 0 < k \le \max_{\theta} f(\theta)$,
where $I(\cdot)$ is the indicator function. This gives the expected posterior loss
$\tilde{L}_{HPD}((a, b); k) = 1 - \Pr(a \le \theta \le b|D) + k(b - a)$.
Setting $\frac{\partial}{\partial a}\tilde{L}_{HPD} = \frac{\partial}{\partial b}\tilde{L}_{HPD} = 0$ yields the necessary condition for a local optimum in the interior of the parameter space: $f(a) = f(b) = k$, which is exactly the rule for HPD intervals, as expected.
The form of $\tilde{L}_{HPD}((a, b); k)$ gives some insight into why HPD intervals are not invariant to a monotone increasing transformation $g(\theta)$ of the parameter. The $\theta$-space HPD interval transformed into $g(\theta)$ space is different from the $g(\theta)$-space HPD interval because the two intervals correspond to different loss functions: the $g(\theta)$-space HPD interval corresponds to a transformed length penalty $k(g(b) - g(a))$.
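To make the non-invariance concrete, here is a small numerical sketch in R for an assumed Beta(3, 9) posterior (the helper functions hpd_beta and shortest are mine, purely for illustration): the shortest 95% interval has approximately equal density at its two endpoints, as the rule above implies, while the HPD computed on the log-odds scale is not the log-odds transform of the $\theta$-scale HPD.
set.seed(123)
a_shape <- 3; b_shape <- 9   # assumed Beta(3, 9) posterior for theta
# Shortest 95% interval: slide a window of posterior mass 0.95 and minimize its length
hpd_beta <- function(mass = 0.95) {
  len <- function(w) qbeta(w + mass, a_shape, b_shape) - qbeta(w, a_shape, b_shape)
  w   <- optimize(len, c(0, 1 - mass))$minimum
  qbeta(c(w, w + mass), a_shape, b_shape)
}
hpd <- hpd_beta()
dbeta(hpd, a_shape, b_shape)   # both endpoints have (about) the same density: f(a) = f(b) = k
# Non-invariance: the HPD for log-odds(theta), computed from posterior draws,
# differs from the theta-scale HPD pushed through the log-odds transform
draws <- rbeta(1e5, a_shape, b_shape)
shortest <- function(s, mass = 0.95) {   # shortest interval containing `mass` of the draws
  s <- sort(s)
  k <- floor(mass * length(s))
  i <- which.min(s[(k + 1):length(s)] - s[1:(length(s) - k)])
  c(s[i], s[i + k])
}
qlogis(hpd)              # theta-scale HPD, transformed to the log-odds scale
shortest(qlogis(draws))  # HPD computed directly on the log-odds scale: not the same interval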
Quantile-based credible intervals
Consider point estimation with the loss function
$L_q(\theta, \hat{\theta};p) = (1-p)(\hat{\theta} - \theta)I(\theta < \hat{\theta}) + p(\theta - \hat{\theta})I(\theta \ge \hat{\theta}), \quad 0 \le p \le 1$, the familiar check (pinball) loss, which penalizes overestimation with weight $1-p$ and underestimation with weight $p$.
The posterior expected loss is
$\tilde{L}_q(\hat{\theta};p)=(1-p)\Pr(\theta < \hat{\theta}|D)\left(\hat{\theta}-\text{E}(\theta|\theta < \hat{\theta}, D)\right) + p\Pr(\theta \ge \hat{\theta}|D)\left(\text{E}(\theta | \theta \ge \hat{\theta}, D)-\hat{\theta}\right)$.
Setting $\frac{d}{d\hat{\theta}}\tilde{L}_q = \Pr(\theta < \hat{\theta}|D) - p = 0$ yields the implicit equation
$\Pr(\theta < \hat{\theta}|D) = p$,
that is, the optimal $\hat{\theta}$ is the $(100p)$% quantile of the posterior distribution, as expected.
Thus to get quantile-based interval estimates, the loss function is
$L_{qCI}(\theta, (a,b); p_L, p_U) = L_q(\theta, a;p_L) + L_q(\theta, b;p_U)$.
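So, for a central 95% interval, one would take $p_L = 0.025$ and $p_U = 0.975$. A quick Monte Carlo check in R (with an assumed Gamma(2, 1) posterior and my own little helper Lq, purely for illustration) confirms that minimizing the posterior expected loss recovers the corresponding posterior quantiles:
set.seed(2)
theta <- rgamma(1e5, shape = 2, rate = 1)   # draws from the assumed posterior
# Posterior expected quantile loss, estimated from the draws
Lq <- function(est, p) {
  mean((1 - p) * (est - theta) * (theta < est) + p * (theta - est) * (theta >= est))
}
optimize(Lq, range(theta), p = 0.025)$minimum  # close to the 2.5% posterior quantile
optimize(Lq, range(theta), p = 0.975)$minimum  # close to the 97.5% posterior quantile
quantile(theta, c(0.025, 0.975))               # the direct quantiles, for comparison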
Best Answer
This sounds like another strident paper by a confused individual. Fisher didn't fall into any such trap, though many students of statistics do.
Hypothesis testing is a decision theoretic problem. Generally, you end up with a test with a given threshold between the two decisions (hypothesis true or hypothesis false). If your hypothesis corresponds to a single point, such as $\theta=0$, then you can calculate the probability of observing your data when it is true. But what do you do if it's not a single point? You get a function of $\theta$. The hypothesis $\theta\not= 0$ is such a composite hypothesis, and what you get is the probability that your test rejects, as a function of the true $\theta$. That function is the power function. It's very classical. Fisher knew all about it.
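For instance, here is a minimal R sketch of such a power function for a one-sample $t$-test ($n = 20$, $\sigma = 1$ and $\alpha = .05$ are assumed purely for illustration); the point is that you get a rejection probability for every possible effect size, not a single number:
# Power of the one-sample t-test as a function of the true effect delta
delta <- seq(0, 1.5, by = 0.25)
power <- sapply(delta, function(d)
  power.t.test(n = 20, delta = d, sd = 1, sig.level = 0.05,
               type = "one.sample")$power)
round(data.frame(delta, power), 3)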
The expected loss is a part of the basic machinery of decision theory. You have various states of nature, and various possible data resulting from them, and some possible decisions you can make, and you want to find a good function from data to decision. How do you define good? Given a particular state of nature underlying the data you have obtained, and the decision made by that procedure, what is your expected loss? This is most simply understood in business problems (if I do this based on the sales I observed in the past three quarters, what is the expected monetary loss?).
Bayesian procedures are a subset of decision theoretic procedures. Expected loss alone is insufficient to single out a uniquely best procedure in all but trivial cases, because it depends on the unknown state of nature. If one procedure is better than another in both state A and state B, obviously you'll prefer it; but if one is better in state A and the other is better in state B, which do you choose? This is where ancillary ideas like Bayes procedures, minimaxity, and unbiasedness enter.
The t-test is actually a perfectly good solution to a decision theoretic problem. The question is how you choose the cutoff on the $t$ you calculate. A given cutoff on $t$ corresponds to a given value of $\alpha$, the probability of a Type I error, and to a corresponding set of Type II error rates $\beta$ (equivalently, powers $1-\beta$), one for each value of the underlying parameter you are estimating. Is it an approximation to use a point null hypothesis? Yes. Is it usually a problem in practice? No, just as using Bernoulli's approximate theory for beam deflection is usually just fine in structural engineering. Is having the $p$-value useless? No. Another person looking at your data may want to use a different $\alpha$ than you, and the $p$-value accommodates that use.
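As a small illustration of that bookkeeping in R (the degrees of freedom and the observed $t$ are made up): each choice of $\alpha$ has its own critical value, and reporting the $p$-value is what lets a second reader apply a different $\alpha$ to the same data.
dof <- 19
qt(1 - 0.05 / 2, dof)    # two-sided critical value for alpha = .05
qt(1 - 0.01 / 2, dof)    # the stricter cutoff for alpha = .01
t_obs <- 2.4             # an assumed observed t statistic
2 * pt(-abs(t_obs), dof) # its two-sided p-value: compare it to whichever alpha you prefer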
I'm also a little confused about why he names Student and Jeffreys together, considering that Fisher was responsible for the wide dissemination of Student's work.
Basically, the blind use of p-values is a bad idea; they are a rather subtle concept, but that doesn't make them useless. Should we object to their misuse by researchers with poor mathematical backgrounds? Absolutely, but let's remember what statistical practice looked like before Fisher tried to distill something down for the man in the field to use.