Solved – Normality assumption and sample size

nonparametric, normality-assumption, sample-size

I know this is a very debated topic, even on this site, but I still couldn't find an answer to my problem.

Recently I have been working with large samples (300, 400 and more). For now, I am trying to use simple techniques such as correlation, t-tests, and ANOVA, all of which require the normality assumption (from what I have read so far in textbooks, online, etc.). I have also read that, if the sample size is large enough, the normality assumption is not much of a problem and these techniques are robust to violations of normality.
Is normality a problem given my sample size? Should the data be at least bell-shaped, even if formal tests reject normality? Or could I get away even with extremely skewed or lumpy data when using these techniques?

Should I apply parametric techniques or should I just stick to the non-parametric ones, which, from what I know, have lower power?

LATER EDIT, for those who want to know more about my data:

I have a variable which represents the number of days a user has been employed in the program (mean = 176, median = 167, SD = 87, IQR = 113, skewness = 0.61, kurtosis = -1.64; sample = 340 users). The histogram does suggest that the variable might split into 'fairly normal' subgroups (although I have not found a factor that would do so).

This is the variable that I am currently trying to explain in terms of 'the number of weeks the user has worked in the first 8 weeks', which takes values from 0 to 8, so I assume it is ordinal, and it has a negative skew (most users have worked 8 weeks out of 8). So the main question is: is there a relationship between how long a user stays employed and how much he works in the first 8 weeks?

Later, I would also like to compare the length of employment with other possible factors for which I currently don't have the data (education, gender, age, ...) and try to build a more 'elaborate' model, but for now I am analysing what I have with 'simple' statistics.

For now, I have computed a Spearman correlation between length of employment and weeks worked out of the first 8, and obtained a coefficient of 0.450 (which I read as a low correlation). So I am trying to see whether those who worked a certain number of weeks in the first 8 differ in length of employment from those who worked fewer/more weeks. I therefore studied the distributions of length of employment for each group (a group being a certain number of weeks worked in the first 8). Each group has around 25-40 cases (except weeks worked = 0, where I have 9 cases, and weeks worked = 8, where there are 156).
The normality test (Shapiro-Wilk, which I understand is suitable for such sample sizes) showed that only 2 groups out of 9 are normally distributed. So my thought was to drop ANOVA and t-tests and head for the Mann-Whitney U test and the Kruskal-Wallis test. However, I am now reconsidering ANOVA and t-tests, because the groups do not look that non-normal. A sketch of the analysis so far follows below.
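
For reference, this is roughly what I ran in R. The data frame and column names (df, days_employed, weeks_worked) are made-up stand-ins, not my real data; the simulated values only loosely mimic the summary statistics above.

set.seed(1)
# made-up stand-in data, just to show the calls used
df <- data.frame(
    weeks_worked  = sample(0:8, 340, replace = TRUE),
    days_employed = round(rgamma(340, shape = 4, scale = 44))
)

# Spearman correlation between the two variables
cor.test(df$days_employed, df$weeks_worked, method = "spearman")

# Shapiro-Wilk test of length of employment within each group
by(df$days_employed, df$weeks_worked, shapiro.test)

# Kruskal-Wallis test across the weeks-worked groups
kruskal.test(days_employed ~ factor(weeks_worked), data = df)
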
Thanks.

Best Answer

Disputes about normality with large N often have to do with tests of normality, not normality per se. For larger sample sizes, passing a test of normality such as Shapiro-Wilk is not required. Consider the following in R.

# keep drawing truly normal samples until one "fails" the
# Shapiro-Wilk test, i.e. the test declares it non-normal
findNonNormal <- function(n = 5000) {
    p <- 1
    while (p > 0.05) {
        y <- rnorm(n)                  # a genuinely normal sample
        p <- shapiro.test(y)$p.value   # test it for normality
    }
    y                                  # return the flagged sample
}

y <- findNonNormal()
hist(y)     # the histogram looks perfectly normal
qqnorm(y)   # and the QQ-plot is essentially a straight line

The results show a remarkably normal-looking distribution that the test nevertheless declares non-normal. That's because the power of the test is so high at that N that it flags distributions with very small deviations from normality. You could easily find similar results with the Ns you mentioned.

Generally, passing an eyeball test of normality is all that's needed, but the eyeball test has to be calibrated to N. If you feel you cannot make the assessment, just run some simulations with a similar N and see what typical data from a truly normal distribution look like.
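
For example, here is a minimal sketch of such a calibration: draw several truly normal samples of a size similar to your groups and see how much their QQ-plots wobble (the sample size of 40 is an assumption matching your smaller groups).

# nine truly normal samples at a small-group N, for eyeball calibration
par(mfrow = c(3, 3))
for (i in 1:9) {
    y <- rnorm(40)   # 40 is roughly the size of your smaller groups
    qqnorm(y)
    qqline(y)        # reference line through the quartiles
}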

If your data really are not normal, don't use the parametric tests. But, contrary to your belief, a large N with reasonably normal distributions is exactly when the power of a parametric test becomes most valuable. It allows you to estimate the parameters of the population, and the better and larger the sample, the more accurate those estimates will be.
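
As a concrete sketch (with simulated data, borrowing roughly the mean and SD you reported), the 95% confidence interval for a mean tightens considerably as N grows:

set.seed(42)
# same underlying population, two different sample sizes
small <- rnorm(30,  mean = 176, sd = 87)
large <- rnorm(340, mean = 176, sd = 87)

t.test(small)$conf.int   # wide interval at N = 30
t.test(large)$conf.int   # much narrower at N = 340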

Additionally, if you're looking at a t-test, for example, and there is a strong effect, the pooled distribution of the data will be bimodal, simply because it mixes two groups with different means. So the requirement is not that the raw data look normal but that the residuals look normal. This is true for your ANOVA as well.
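
A quick simulation makes the point: with a strong group effect the pooled data are clearly bimodal, yet the residuals from the group model are normal (a sketch with made-up data, not your data):

set.seed(1)
group <- rep(c("A", "B"), each = 200)
y <- rnorm(400, mean = ifelse(group == "A", 0, 4))   # strong group effect

hist(y)                    # pooled data: two humps

fit <- lm(y ~ group)
hist(resid(fit))           # residuals: a single normal hump
shapiro.test(resid(fit))   # and they pass Shapiro-Wilk here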
