Solved – Shapiro-Wilk test with multiple conditions in R

normality-assumption, r

I have to perform a Shapiro-Wilk normality test. My dataset is built like this:

Condition0 <- c(0.9201584, 0.8386860, 0.8092635, 0.8166590, 0.8653545)
Condition1 <- c(0.9905397, 0.9498400, 1.0378111, 0.9740314, 1.0355568)
Condition2 <- c(0.9529179, 1.0919274, 1.0089470, 1.0067670, 0.9686904)
Condition3 <- c(0.7402958, 0.7890059, 0.8060471, 0.8020820, 0.7931508)
Condition4 <- c(0.7725662, 0.6916708, 0.7698080, 0.7476060, 0.7602339)
Control <- c(0.7707546, 0.7035131, 0.7268695, 0.8217838, 0.7641010)

dataset <- data.frame(Condition0, Condition1, Condition2, Condition3, Condition4, Control, row.names = c("d1", "d2", "d3", "d4", "d5"))

As you can see, there are 6 conditions with 5 results for each condition.

Is it right to do the Shapiro-Wilk test in this way, dividing the dataset for every single condition?

shapiro.test(dataset$Condition0)
shapiro.test(dataset$Condition1)
shapiro.test(dataset$Condition2)
shapiro.test(dataset$Condition3)
shapiro.test(dataset$Condition4)
shapiro.test(dataset$Control)

Or should I build my dataset in a different way and run the test on the whole dataset, regardless of the factor "condition"?
Like this:

my_data <- c(0.9201584, 0.8386860, 0.8092635, 0.8166590, 0.8653545,
             0.9905397, 0.9498400, 1.0378111, 0.9740314, 1.0355568,
             0.9529179, 1.0919274, 1.0089470, 1.0067670, 0.9686904,
             0.7402958, 0.7890059, 0.8060471, 0.8020820, 0.7931508,
             0.7725662, 0.6916708, 0.7698080, 0.7476060, 0.7602339,
             0.7707546, 0.7035131, 0.7268695, 0.8217838, 0.7641010)
cycle <- rep(c("d1", "d2", "d3", "d4", "d5"), 6)
Condition <- rep(c("Condition0","Condition1", "Condition2", "Condition3", "Condition4", "Control"), each = 5)

dataset2 <- data.frame(my_data, cycle, Condition)

shapiro.test(dataset2$my_data)

After the Shapiro-Wilk test I'll run an ANOVA or a Kruskal-Wallis test (depending on the result) to see if there is any difference among the conditions. Like this:

my_lm <- lm(my_data ~ Condition, data= dataset2)
anova(my_lm)

Which one do you think is the correct one? Thank you for the answer!

Best Answer

What matters for ANOVA is the normality of the residuals rather than of the raw data. The Shapiro-Wilk test is only one of several ways of checking normality; others include boxplots, plot(resid(model)), and z-scores of skewness and kurtosis, available from stat.desc(resid(model), norm = TRUE) in the pastecs package (stat.desc expects a vector or data frame, not the model object itself). Never rely on Shapiro-Wilk alone. In fact, the z-scores are possibly the go-to measures, as they are fairly robust to sample size and don't require much experience with recognizing non-normality. The key figures are 'skew.2SE' and 'kurt.2SE', that is, skewness and kurtosis each divided by twice its standard error. According to the pastecs documentation, an absolute value above 1 indicates skewness or kurtosis significantly different from zero, so for a normal distribution both figures are expected to lie below 1 in absolute value.
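The residual check described above can be sketched in base R as follows. This is a minimal illustration using the data from the question; the names fit and res are chosen here for the example and do not come from the original post.

```r
# Data from the question, in long format
my_data <- c(0.9201584, 0.8386860, 0.8092635, 0.8166590, 0.8653545,
             0.9905397, 0.9498400, 1.0378111, 0.9740314, 1.0355568,
             0.9529179, 1.0919274, 1.0089470, 1.0067670, 0.9686904,
             0.7402958, 0.7890059, 0.8060471, 0.8020820, 0.7931508,
             0.7725662, 0.6916708, 0.7698080, 0.7476060, 0.7602339,
             0.7707546, 0.7035131, 0.7268695, 0.8217838, 0.7641010)
Condition <- factor(rep(c("Condition0", "Condition1", "Condition2",
                          "Condition3", "Condition4", "Control"), each = 5))
dataset2 <- data.frame(my_data, Condition)

# Fit the one-way model and extract its residuals
fit <- lm(my_data ~ Condition, data = dataset2)
res <- resid(fit)

# Formal test on the residuals (one test on all 30 residuals)
sw <- shapiro.test(res)
print(sw)

# Graphical checks: normal Q-Q plot and residuals by condition
qqnorm(res)
qqline(res)
boxplot(res ~ dataset2$Condition)
```

Because the model contains only the condition factor, each residual is simply the observation minus its condition mean, so this tests all 30 values at once after removing the differences between conditions.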

Now, if those residuals are non-normal, then consider solutions. Besides non-parametric alternatives to ANOVA, you might try replacing your dependent variable with its transformation, e.g., log(), and then checking the residuals again.
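Both options mentioned above might look like this in R. It is only a sketch on the question's data: log() is just one candidate transformation, and the names fit_log and kw are made up for the example.

```r
# Data from the question, in long format
my_data <- c(0.9201584, 0.8386860, 0.8092635, 0.8166590, 0.8653545,
             0.9905397, 0.9498400, 1.0378111, 0.9740314, 1.0355568,
             0.9529179, 1.0919274, 1.0089470, 1.0067670, 0.9686904,
             0.7402958, 0.7890059, 0.8060471, 0.8020820, 0.7931508,
             0.7725662, 0.6916708, 0.7698080, 0.7476060, 0.7602339,
             0.7707546, 0.7035131, 0.7268695, 0.8217838, 0.7641010)
Condition <- factor(rep(c("Condition0", "Condition1", "Condition2",
                          "Condition3", "Condition4", "Control"), each = 5))
dataset2 <- data.frame(my_data, Condition)

# Option 1: transform the dependent variable, refit, and recheck the residuals
fit_log <- lm(log(my_data) ~ Condition, data = dataset2)
sw_log <- shapiro.test(resid(fit_log))
print(sw_log)

# Option 2: non-parametric alternative to one-way ANOVA
kw <- kruskal.test(my_data ~ Condition, data = dataset2)
print(kw)
```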

Aside from that, if you want to check the normality of the variables themselves, do it per condition. Personally, I would enter the data in long format, with condition as one column, and then use subsetting, i.e., dataset[dataset$condition=='0',], dataset[dataset$condition=='1',]...
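With the long-format data from the question, the per-condition check can be written without repeating the subsetting by hand. The sketch below uses split() and lapply() from base R; the names per_condition and p_values are illustrative.

```r
# Data from the question, in long format
my_data <- c(0.9201584, 0.8386860, 0.8092635, 0.8166590, 0.8653545,
             0.9905397, 0.9498400, 1.0378111, 0.9740314, 1.0355568,
             0.9529179, 1.0919274, 1.0089470, 1.0067670, 0.9686904,
             0.7402958, 0.7890059, 0.8060471, 0.8020820, 0.7931508,
             0.7725662, 0.6916708, 0.7698080, 0.7476060, 0.7602339,
             0.7707546, 0.7035131, 0.7268695, 0.8217838, 0.7641010)
Condition <- factor(rep(c("Condition0", "Condition1", "Condition2",
                          "Condition3", "Condition4", "Control"), each = 5))
dataset2 <- data.frame(my_data, Condition)

# Split the response by condition and run one Shapiro-Wilk test per group
per_condition <- lapply(split(dataset2$my_data, dataset2$Condition),
                        shapiro.test)

# Collect the p-values into a named vector, one entry per condition
p_values <- sapply(per_condition, function(test) test$p.value)
print(p_values)
```

With only 5 observations per condition, each of these tests has little power, so a non-significant result is weak evidence of normality.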

For a more advanced discussion of normality (along the lines of @Glen_b's comments), see this question.

Related Solutions

Solved – Some of the data is not normally distributed, what test should i use

With such relatively small samples, I would not expect definitive results from either the Shapiro-Wilk or the Kolmogorov-Smirnov tests. Usually, the latter has poorer power than the former so I wonder why K-S (alone) finds group M data non-normal. Even though all six of the P-values for normality tests are about the same, I would want to see whether there are far outliers in any of the three groups; if not, I would not worry much about nonnormality.

I think your main problem may be heteroscedasticity, and I would use an ANOVA procedure designed to take possibly-unequal group variances into account. You may be familiar with the Welch two-sample t test, which does not assume equal variances of the two groups. In its procedure 'oneway.test', R implements a one-way ANOVA that does not assume equal variances. (Adjustments for unequal variances are similar to those of the Welch t test.) I would use this test in preference to a Kruskal-Wallis test because that test explicitly requires populations to be of the 'same shape', which implies 'equal variances'.

I do not know whether SPSS has implemented a one-way ANOVA procedure that does not require homoscedasticity.

The following normal data are simulated (in R) to have relatively modest differences among group means and markedly different group variances.

set.seed(2020)  # for reproducibility
a = rnorm(20, 100, 10)
b = rnorm(20, 105, 5)
c = rnorm(20, 112, 15)
x = c(a,b,c)
g = as.factor(rep(1:3, each=20))

boxplot(x ~ g, col="skyblue2")

The "Welchified" one-way ANOVA test finds significant differences among groups at about the 2% level of significance. (In a standard one-way ANOVA the denominator df would be 57; here ddf are about 31, adjusting for heteroscedasticity.)

oneway.test(x ~ g)

        One-way analysis of means (not assuming equal variances)

data:  x and g
F = 4.5939, num df = 2.000, denom df = 31.383, p-value = 0.01779

Ad hoc Welch two-sample t tests show groups A and B to differ at the 2% level (so, of course, A and C differ also). There is no significant difference between B and C. According to the Bonferroni method of protecting against false discovery, it is reasonable to conclude that A differs from B and C.
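One way to run such pairwise comparisons in a single call is pairwise.t.test() from base R. This is a sketch and not necessarily how the comparisons above were computed: pool.sd = FALSE requests separate-variance (Welch) tests for each pair, and the Bonferroni adjustment is applied to the p-values. The data are regenerated with the same seed and call order as above, with the groups renamed grp_a, grp_b, grp_c for the example.

```r
set.seed(2020)  # same seed and order of calls as the simulation above
grp_a <- rnorm(20, 100, 10)
grp_b <- rnorm(20, 105, 5)
grp_c <- rnorm(20, 112, 15)
x <- c(grp_a, grp_b, grp_c)
g <- as.factor(rep(1:3, each = 20))

# Pairwise Welch t tests (no pooled SD) with Bonferroni-adjusted p-values
pw <- pairwise.t.test(x, g, pool.sd = FALSE,
                      p.adjust.method = "bonferroni")
print(pw)
```

The adjusted p-values in pw$p.value can be compared directly with the chosen significance level, since the Bonferroni correction is already applied.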

Perhaps your data are sufficiently similar to my simulated data that your data can be profitably analyzed using the methods I show above.
