Solved – Shapiro-Wilk test with multiple conditions in R

normality-assumption, r

I have to perform a Shapiro-Wilk normality test. My dataset is built like this:

Condition0 <- c(0.9201584, 0.8386860, 0.8092635, 0.8166590, 0.8653545)
Condition1 <- c(0.9905397, 0.9498400, 1.0378111, 0.9740314, 1.0355568)
Condition2 <- c(0.9529179, 1.0919274, 1.0089470, 1.0067670, 0.9686904)
Condition3 <- c(0.7402958, 0.7890059, 0.8060471, 0.8020820, 0.7931508)
Condition4 <- c(0.7725662, 0.6916708, 0.7698080, 0.7476060, 0.7602339)
Control <- c(0.7707546, 0.7035131, 0.7268695, 0.8217838, 0.7641010)

dataset <- data.frame(Condition0, Condition1, Condition2, Condition3, Condition4, Control, row.names = c("d1", "d2", "d3", "d4", "d5"))

As you can see, there are 6 conditions with 5 results for each condition.

Is it right to do the Shapiro-Wilk test in this way, dividing the dataset for every single condition?

shapiro.test(dataset$Condition0)
shapiro.test(dataset$Condition1)
shapiro.test(dataset$Condition2)
shapiro.test(dataset$Condition3)
shapiro.test(dataset$Condition4)
shapiro.test(dataset$Control)

Or should I build my dataset in a different way and run the test on the whole dataset, regardless of the factor "condition"?
Like this:

my_data <- c(0.9201584, 0.8386860, 0.8092635, 0.8166590, 0.8653545,
             0.9905397, 0.9498400, 1.0378111, 0.9740314, 1.0355568,
             0.9529179, 1.0919274, 1.0089470, 1.0067670, 0.9686904,
             0.7402958, 0.7890059, 0.8060471, 0.8020820, 0.7931508,
             0.7725662, 0.6916708, 0.7698080, 0.7476060, 0.7602339,
             0.7707546, 0.7035131, 0.7268695, 0.8217838, 0.7641010)
cycle <- rep(c("d1", "d2", "d3", "d4", "d5"), 6)
Condition <- rep(c("Condition0","Condition1", "Condition2", "Condition3", "Condition4", "Control"), each = 5)

dataset2 <- data.frame(my_data, cycle, Condition)

shapiro.test(dataset2$my_data)

After the Shapiro-Wilk test I'll run an ANOVA or a Kruskal-Wallis test (depending on the result) to see if there is any difference among the conditions. Like this:

my_lm <- lm(my_data ~ Condition, data= dataset2)
anova(my_lm)

Which one do you think is the correct one? Thank you for the answer!

Best Answer

What matters for ANOVA is the normality of the residuals rather than of the raw data. The Shapiro-Wilk test is only one of several ways of checking normality; others include boxplots, residual plots such as plot(resid(model)), and z-scores of skewness and kurtosis, which you can get from stat.desc(resid(model), norm = TRUE) in the pastecs package (stat.desc() takes a vector or data frame, not the fitted model itself). Never rely on Shapiro-Wilk alone. In fact, the z-scores are possibly the go-to check, as they are reasonably robust to sample size and don't require you to be terribly experienced with instances of non-normality. The key figures are 'skew.2SE' and 'kurt.2SE', i.e., skewness and kurtosis divided by twice their standard error, and their absolute values are expected to lie below 1 if the distribution is normal.
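As a rough illustration of checking the residuals rather than the raw values, here is a minimal sketch using the question's data in long format. The object names `model`, `res`, and `sw` are just illustrative, and the pastecs part only runs if that package happens to be installed.

```r
# Data from the question, in long format (one row per measurement)
my_data <- c(0.9201584, 0.8386860, 0.8092635, 0.8166590, 0.8653545,
             0.9905397, 0.9498400, 1.0378111, 0.9740314, 1.0355568,
             0.9529179, 1.0919274, 1.0089470, 1.0067670, 0.9686904,
             0.7402958, 0.7890059, 0.8060471, 0.8020820, 0.7931508,
             0.7725662, 0.6916708, 0.7698080, 0.7476060, 0.7602339,
             0.7707546, 0.7035131, 0.7268695, 0.8217838, 0.7641010)
Condition <- factor(rep(c("Condition0", "Condition1", "Condition2",
                          "Condition3", "Condition4", "Control"), each = 5))
dataset2 <- data.frame(my_data, Condition)

# Fit the one-way ANOVA model and extract its residuals
model <- lm(my_data ~ Condition, data = dataset2)
res <- resid(model)

# Shapiro-Wilk on the residuals, not on the raw response
sw <- shapiro.test(res)
print(sw)

# Visual check (interactive sessions only): points should lie near the line
if (interactive()) {
  qqnorm(res)
  qqline(res)
}

# Optional: skewness/kurtosis z-scores, only if pastecs is available
if (requireNamespace("pastecs", quietly = TRUE)) {
  sd_tab <- pastecs::stat.desc(data.frame(res = res), basic = FALSE, norm = TRUE)
  print(sd_tab[c("skew.2SE", "kurt.2SE"), , drop = FALSE])
}
```

This pools all 30 residuals into one test, which is the quantity the ANOVA assumption is actually about.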

Now, if those residuals are non-normal, consider your options. Besides non-parametric alternatives to ANOVA, you might try replacing your dependent variable with a transformation of it, e.g., log(), and then checking the residuals again.
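Both options mentioned here can be sketched in a few lines. This is only an illustration under the assumption that the question's long-format data is used; `log_model`, `sw_log`, and `kw` are hypothetical names, and whether a log transform is appropriate depends on the data.

```r
# Data from the question, in long format
my_data <- c(0.9201584, 0.8386860, 0.8092635, 0.8166590, 0.8653545,
             0.9905397, 0.9498400, 1.0378111, 0.9740314, 1.0355568,
             0.9529179, 1.0919274, 1.0089470, 1.0067670, 0.9686904,
             0.7402958, 0.7890059, 0.8060471, 0.8020820, 0.7931508,
             0.7725662, 0.6916708, 0.7698080, 0.7476060, 0.7602339,
             0.7707546, 0.7035131, 0.7268695, 0.8217838, 0.7641010)
Condition <- factor(rep(c("Condition0", "Condition1", "Condition2",
                          "Condition3", "Condition4", "Control"), each = 5))
dataset2 <- data.frame(my_data, Condition)

# Option 1: transform the response, refit, and re-check the residuals
log_model <- lm(log(my_data) ~ Condition, data = dataset2)
sw_log <- shapiro.test(resid(log_model))
print(sw_log)

# Option 2: a non-parametric alternative to one-way ANOVA
kw <- kruskal.test(my_data ~ Condition, data = dataset2)
print(kw)
```

If the transformed residuals look acceptable, `anova(log_model)` would then be the test of interest; otherwise the Kruskal-Wallis result is the fallback.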

Aside from that, if you want to check the normality of the variables themselves, do it per condition. Personally, I would enter the data in long format, with condition as one column, and then use subsetting, i.e., dataset[dataset$condition == '0', ], dataset[dataset$condition == '1', ], and so on.
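Rather than subsetting by hand for each level, the per-condition check could be done in one call. The sketch below assumes the question's long-format layout (columns `my_data` and `Condition`); `p_by_condition` is an illustrative name.

```r
# Data from the question, in long format
my_data <- c(0.9201584, 0.8386860, 0.8092635, 0.8166590, 0.8653545,
             0.9905397, 0.9498400, 1.0378111, 0.9740314, 1.0355568,
             0.9529179, 1.0919274, 1.0089470, 1.0067670, 0.9686904,
             0.7402958, 0.7890059, 0.8060471, 0.8020820, 0.7931508,
             0.7725662, 0.6916708, 0.7698080, 0.7476060, 0.7602339,
             0.7707546, 0.7035131, 0.7268695, 0.8217838, 0.7641010)
Condition <- factor(rep(c("Condition0", "Condition1", "Condition2",
                          "Condition3", "Condition4", "Control"), each = 5))
dataset2 <- data.frame(my_data, Condition)

# Run Shapiro-Wilk separately within each condition and keep the p-values
p_by_condition <- tapply(dataset2$my_data, dataset2$Condition,
                         function(x) shapiro.test(x)$p.value)
print(round(p_by_condition, 3))
```

With only 5 observations per condition, each of these tests has little power, so a non-significant result here is weak evidence of normality.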

For a more advanced discussion of normality (along the lines of @Glen_b's comments), see this question.
