Skewed but bell-shaped: still considered a normal distribution for ANOVA?

anova, normal distribution, skewness

This could be a pretty basic question; I'm a little rusty on my stats knowledge.

Background: I am monitoring website load time performance. To do so, I have a script that captures data points (about 400) on load time through various Agents. Each Agent is in a different geographic location, but they all measure the same steps.

I would like to determine whether there is a statistically significant difference between the Agents. If one is consistently reporting slower load times, I would like to know whether that is because of the Agent or not. I would include images but I need 10 reputation points and I just found out about this website.

Problem: I have two sets of data from different Agents measuring the seconds it takes a website to download. Both are bell-shaped but heavily skewed to the right. Can I still perform ANOVA to determine whether there is a difference, even though they are skewed?

Thanks in advance

Best Answer

If the distributions are similar (in particular, have the same variance) and the group sizes are identical (balanced design), you probably have no reason to worry. Formally, the normality assumption is violated, and that can matter, but it is less important than the equality-of-variance assumption. Simulation studies have shown ANOVA to be quite robust to such violations as long as the sample size and the variance are the same across all cells of the design. If you combine several violations (say, non-normality and heteroscedasticity) or have an unbalanced design, you can no longer trust the F test.
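
To get a feel for that robustness claim, here is a minimal simulation sketch. It assumes NumPy and SciPy are available, and the lognormal shape, group size, and number of repetitions are my own illustrative choices, not values taken from your data. Both groups are drawn from the same right-skewed distribution, so every rejection is a false positive.

```python
import numpy as np
from scipy import stats

def false_positive_rate(n_per_group=400, n_groups=2, n_sims=2000,
                        alpha=0.05, seed=0):
    """Share of simulated data sets in which ANOVA rejects a true null."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # Balanced design, identical right-skewed distribution in every group
        # (hypothetical lognormal load times; sigma chosen for illustration).
        groups = [rng.lognormal(mean=0.0, sigma=0.75, size=n_per_group)
                  for _ in range(n_groups)]
        _, p_value = stats.f_oneway(*groups)
        rejections += p_value < alpha
    return rejections / n_sims

rate = false_positive_rate()
print(f"empirical type I error rate: {rate:.3f} (nominal 0.05)")
```

If the rate stays close to the nominal level, the F test is behaving as advertised for this particular setup. You could vary `sigma` or make the group sizes unequal to see where it starts to break down.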

That said, the distribution also affects the error variance, and even if the nominal error level is preserved, non-normal data can severely reduce the power to detect a given difference. Also, when you are looking at skewed distributions, a few large values can have a big influence on the mean. Consequently, it's possible that two groups really have different means (in the sample and in the population) but that most of the observations (i.e. most of the test runs in your case) are in fact very similar. The mean might therefore not be what you are interested in (or at least not all you are interested in).
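
A small made-up example shows how a handful of slow runs can move the mean while the typical run barely changes. The numbers below are invented for illustration (two hypothetical agents, 400 runs each) and only the standard library is used.

```python
import statistics

# Hypothetical load times in seconds: agent_a is steady around 1.0-1.09 s.
agent_a = [1.0 + 0.01 * (i % 10) for i in range(400)]

# agent_b matches agent_a on 390 runs but has 10 very slow runs (30 s each).
agent_b = agent_a[:390] + [30.0] * 10

mean_a, mean_b = statistics.mean(agent_a), statistics.mean(agent_b)
median_a, median_b = statistics.median(agent_a), statistics.median(agent_b)

print(f"means:   a = {mean_a:.3f}  b = {mean_b:.3f}")
print(f"medians: a = {median_a:.3f}  b = {median_b:.3f}")
```

The means differ noticeably even though 97.5% of the runs are the same, while the medians are almost identical. Which of the two summaries matters depends on whether the rare slow runs are what you care about.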

In a nutshell, you could probably still use ANOVA, as inference will not necessarily be threatened, but you might also want to consider alternatives to increase power or learn more about your data.
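
I did not name specific alternatives above. Two common candidates for right-skewed timings are running ANOVA on log-transformed values and using the rank-based Kruskal-Wallis test. The sketch below assumes SciPy and uses simulated lognormal data with a built-in difference, so treat it as an illustration of the calls and not as a recommendation for your data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical load times (seconds): agent_b is slower by a multiplicative factor.
agent_a = rng.lognormal(mean=0.0, sigma=0.75, size=400)
agent_b = rng.lognormal(mean=0.3, sigma=0.75, size=400)

# 1) Classic one-way ANOVA on the raw, skewed values.
#    Note: a multiplicative shift also changes the raw-scale variance.
_, p_raw = stats.f_oneway(agent_a, agent_b)

# 2) ANOVA after a log transform (compares geometric rather than arithmetic means).
_, p_log = stats.f_oneway(np.log(agent_a), np.log(agent_b))

# 3) Kruskal-Wallis: rank-based, no normality assumption.
_, p_kw = stats.kruskal(agent_a, agent_b)

print(f"ANOVA (raw):    p = {p_raw:.4g}")
print(f"ANOVA (log):    p = {p_log:.4g}")
print(f"Kruskal-Wallis: p = {p_kw:.4g}")
```

Keep in mind that the three tests do not answer exactly the same question: the raw ANOVA is about arithmetic means, the log version about geometric means, and Kruskal-Wallis about whether one group tends to produce larger values than the other.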

Also note that, strictly speaking, the normality assumption applies to the distribution of the residuals. You should therefore look at residual plots, or at least at the distribution in each cell, not at the whole data set at once.
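
As a rough sketch of that check, the residuals of a one-way ANOVA are simply each observation minus its own group mean. The example below assumes NumPy and SciPy and uses simulated data with hypothetical agent names. It prints per-group skewness and variance and adds Levene's test as one possible check of the equal-variance assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical load times (seconds) for two agents, 400 runs each.
groups = {
    "agent_a": rng.lognormal(mean=0.0, sigma=0.75, size=400),
    "agent_b": rng.lognormal(mean=0.0, sigma=0.75, size=400),
}

# One-way ANOVA residual: observation minus the mean of its own group.
residuals = {name: x - x.mean() for name, x in groups.items()}

for name, r in residuals.items():
    print(f"{name}: residual skewness = {stats.skew(r):.2f}, "
          f"variance = {r.var(ddof=1):.2f}")

pooled = np.concatenate(list(residuals.values()))
pooled_skew = stats.skew(pooled)
print(f"pooled residual skewness = {pooled_skew:.2f}")

# Levene's test (median-centred) for equality of variances across groups.
levene_stat, levene_p = stats.levene(*groups.values(), center="median")
print(f"Levene: W = {levene_stat:.2f}, p = {levene_p:.3f}")
```

In practice a histogram or Q-Q plot of `pooled` and of each group's residuals tells you more than a single skewness number. The numbers here are just a compact stand-in for those plots.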
