Normality tests are conducted on a sample to assess whether it was drawn from a normally distributed population.
Why would you want 'descriptive' methods like plotting a histogram or a QQ plot? Because sometimes the tests can just be 'wrong', and you have to check visually. Remember that in a goodness-of-fit test you don't want to reject $H_0$; but if, for example, you have a very large sample size, the power of the test (the probability of rejecting a false null hypothesis) may be so high that you find yourself rejecting $H_0$ because of tiny deviations, even if the data are not really that different from the theoretical distribution. If that is the case, a histogram together with a QQ plot may help you decide that you can work as if the sample were drawn from a normal distribution.
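As an illustrative sketch (using simulated data as a stand-in for a real sample), the two descriptive checks can be produced in R like this:

```r
# Illustrative sketch: histogram and QQ plot as visual normality checks.
# x is simulated stand-in data, not a real sample.
set.seed(1)
x <- rnorm(200)

hist(x, freq = FALSE, main = "Histogram with density scale")
qqnorm(x)   # sample quantiles against theoretical normal quantiles
qqline(x)   # reference line; points near it suggest normality
```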
It actually isn't important that the differences be normally distributed, just that they are close enough that the Central Limit Theorem has "taken over" at your sample size, so the sample mean is close to normally distributed. "Close enough" can be quite far away even for small sample sizes; in the early days of random number generation, means of samples of size $12$ from a Uniform distribution were used to generate normally distributed variates.
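That early-RNG trick can be sketched in R: the sum of $12$ independent Uniform$(0,1)$ draws has variance $12 \times 1/12 = 1$ and mean $6$, so subtracting $6$ gives an approximately standard normal variate.

```r
# Classic approximate normal generator: sum of 12 uniforms, minus 6
set.seed(1)
z <- replicate(10000, sum(runif(12)) - 6)
c(mean = mean(z), sd = sd(z))   # both should be close to 0 and 1
```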
To see whether this is the case in an informal way, there are several checks we can do. One is to look at what the skewness and kurtosis of the sample mean would be for samples of size $n=66$ from a population with the same skewness and kurtosis as you've observed. For the mean of $n$ i.i.d. observations, the skewness shrinks toward $0$ at the rate $1/\sqrt{n}$ and the excess kurtosis toward $0$ at the rate $1/n$, so in your case the sample mean would have an estimated skewness of $0.084$ and kurtosis of $3.14$, neither of which is far from the Normal distribution values of $0$ and $3$. This would indicate, again in an informal way, that the $t$-test should work reasonably well for you.
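The scaling argument is a one-liner in R; the sample skewness and excess kurtosis below (`g1`, `g2`) are hypothetical values backed out from the figures quoted above, not actual data:

```r
# Hypothetical observed sample moments (chosen to match the quoted figures)
n  <- 66
g1 <- 0.68    # sample skewness (hypothetical)
g2 <- 9.2     # sample excess kurtosis (hypothetical)

skew_of_mean <- g1 / sqrt(n)    # ~0.084: skewness shrinks at 1/sqrt(n)
kurt_of_mean <- 3 + g2 / n      # ~3.14:  excess kurtosis shrinks at 1/n
```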
Another approach is to see how well the $t$-test performs on randomly generated data distributed like your observed sample. Since I don't have access to your data, I'm a little limited in this regard; however, as an outline of one such approach, I'll assume your data follows a Pearson Type IV distribution (the member of the Pearson family of distributions that can have the given skewness and kurtosis) with parameters such that the true mean is $0$, the standard deviation is $1$, and the skewness and kurtosis are as given. We generate $1000$ samples from the distribution, shift them by true means ranging from $0$ to $1$, and see how well the $t$-test does in terms of Type I and Type II errors:
```r
library(PearsonDS)   # provides rpearsonIV

tstat <- matrix(0, 1000, 11)
for (j in 1:nrow(tstat)) {
  # One sample of size 66 with mean 0, SD 1, and the given skewness/kurtosis
  x <- rpearsonIV(66, m = 8.06, nu = 8.7,
                  location = 6.0087, scale = 9.752)
  for (i in 1:ncol(tstat)) {
    delta <- (i - 1) / 10          # true mean of the shifted sample
    tstat[j, i] <- t.test(x + delta,
                          alternative = "greater")$p.value
  }
}
tpower <- colMeans(tstat < 0.05)   # rejection rate at each true mean
```
And a plot of the power curve:
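Assuming the simulation above has been run, the curve can be drawn with something like:

```r
# Plot rejection rate against the true mean shift (0, 0.1, ..., 1.0),
# using tpower from the simulation above
delta <- seq(0, 1, by = 0.1)
plot(delta, tpower, type = "b",
     xlab = "True mean", ylab = "Rejection rate",
     main = "Estimated power of the one-sided t-test")
abline(h = 0.05, lty = 2)   # nominal Type I error rate at delta = 0
```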
Overall, it appears, in an informal way, that the $t$-test should work well for you, although there may be tests that slightly outperform it.
One issue with finding alternative tests is making sure that you are testing what you want to test. For example, the Wilcoxon one-sample test assumes symmetry, so, in effect, it tests for symmetry of the data around a median of $0$, which is not likely what you want. (Running it on the data generated above gives an 11% rejection rate when the null hypothesis is true and we are testing at the 5% significance level.) The sign test is highly robust, but tests whether the median of the data equals (in your case) $0$, which may not be what you want. So care is needed in formulating your null hypothesis to match your actual problem!
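That inflated rejection rate can be checked with a sketch along the same lines (this again assumes the `PearsonDS` package, which provides `rpearsonIV`):

```r
library(PearsonDS)

# Null rejection rate of the Wilcoxon signed-rank test on the same
# skewed Pearson IV samples (true mean 0, so the null hypothesis is true)
set.seed(17)
pw <- replicate(1000, wilcox.test(
  rpearsonIV(66, m = 8.06, nu = 8.7, location = 6.0087, scale = 9.752),
  mu = 0)$p.value)
mean(pw < 0.05)   # well above the nominal 0.05 because the data are skewed
```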
You should test the residuals in a one-way ANOVA to see if they are normal. Especially if the levels of the factor are significantly different, there is no reason to expect the aggregate data to be normal.
As an example, suppose the factor has three levels. Then the data for the three levels separately might be as generated below in R, so that we know the conditions for a one-way ANOVA are precisely met:
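The original generating code is not shown here; a sketch of the kind of simulation described (group means and common SD chosen arbitrarily for illustration, with one level shifted to the right to produce the skewness mentioned below) might look like:

```r
# Three normal groups with different means but a common SD, so the
# one-way ANOVA assumptions hold exactly (values are illustrative)
set.seed(123)
x <- c(rnorm(50, 10, 1), rnorm(50, 11, 1), rnorm(50, 14, 1))
g <- factor(rep(1:3, each = 50))

shapiro.test(x)$p.value   # aggregate data: small p-value, i.e. non-normal
```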
However, the Shapiro-Wilk test suggests that the aggregate data are not normal. Also, the kernel density estimator plotted through their histogram below shows right skewness somewhat similar to the data you show.
The residuals for this model are $X_{ij} - A_i,$ for $i = 1,2,3;\ j = 1, \dots, 50;$ where $A_i = \frac{1}{50}\sum_{j=1}^{50} X_{ij}$ is the sample mean of level $i.$
Specifically, for my fake data, the Shapiro-Wilk test shows that the 150 residuals are consistent with normality:
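A self-contained sketch of that residual check (again with illustrative simulated data, since the original generating code is not shown):

```r
# Simulate three normal groups (illustrative values), fit the one-way
# ANOVA, and test the 150 residuals for normality
set.seed(123)
x <- c(rnorm(50, 10, 1), rnorm(50, 11, 1), rnorm(50, 14, 1))
g <- factor(rep(1:3, each = 50))

fit <- aov(x ~ g)
shapiro.test(residuals(fit))   # residuals are genuinely normal here, so
                               # the test should usually fail to reject
```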