ANOVA with outlier group

anova, heteroscedasticity, outliers, r

This is my first question (previously the search function has been enough), so please bear with me. I have a very simple experimental design with one outcome variable and 5 groups. My typical strategy in this case would be to run a simple ANOVA and then use something like a Tukey's test to calculate significance between groups.

In this case, one of the groups has a mean that is far above the other 4. If I exclude the group with the very high mean, there are many significant pairs in the data. Including the very high group gives significance only in comparisons with that group. The groups don't have equal variance, which I know is a problem, but I'm not sure how to deal with it. I've tried a "robust" ANOVA using the package robustbase, but I haven't been able to figure out a suitable post-hoc test. Any help you can offer on how to analyze something like this would be greatly appreciated.

Here's a simplified version of the code:

#Baseline condition (this is what the Test conditions need to be compared with)
Baseline = c(450,400,200,250)

#Negative control
Control = c(13,22,17,20)

#Test conditions
Test1 = c(200,400,450,300) 
Test2 = c(120,140,90,80) 
VeryHighTest = c(2700,2500,1800,1750)

#Constructing a data frame for ANOVA including all data########
#Labels must be a factor: since R 4.0.0 data.frame() no longer converts strings
#to factors by default, and TukeyHSD() fails on a character predictor
Labels.all = c(rep('Baseline',4),rep('Control',4),rep('Test1',4),rep('Test2',4),rep('VeryHighTest',4))
data.all = c(Baseline,Control,Test1,Test2,VeryHighTest)
df.allValues = data.frame(Labels=factor(Labels.all), Values=data.all)

#Constructing data frame for ANOVA excluding the VeryHigh group#######
Labels.low = Labels.all[1:16]
data.low = data.all[1:16]
df.lowValues = data.frame(Labels=factor(Labels.low), Values=data.low)

############ANOVAs##############
anova.all = aov(Values ~ Labels, data = df.allValues)
summary(anova.all) #P value on the order of 10^-9
anova.low = aov(Values ~ Labels, data = df.lowValues)
summary(anova.low) #P value < 0.0001

##########Post-hocs##############
phoc.low = TukeyHSD(anova.low)         #Many comparisons are significant  
phoc.low
phoc.all = TukeyHSD(anova.all)         #!!!!Only comparisons with VERYHIGHTEST are significant!!!!#
phoc.all

Would I be justified in excluding the VeryHighTest group from my analysis because its variance is so high, and then maybe running a single t-test between VeryHighTest and all the other groups combined? Clearly, I'm out of my statistical depth.

Here's the residual plot for each group.

[Figure: residual plot for each group]

Best Answer

I don't think this calls for robust statistics or post hoc tests at all. The elephant in the room is that the data are best analysed on a logarithmic scale.
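As a rough check of that claim, the sketch below rebuilds the data from the question, compares the per-group standard deviations on the raw and log scales, and fits a linear model to log(Values). The names `dat`, `sd.raw`, `sd.log` and `fit.log` are illustrative, and fitting `lm()` to the logged values is only one way to work on that scale, not necessarily the analysis the answer has in mind.

```r
# Data from the question, with Labels stored as a factor
dat <- data.frame(
  Labels = factor(rep(c("Baseline", "Control", "Test1", "Test2", "VeryHighTest"),
                      each = 4)),
  Values = c(450, 400, 200, 250,      # Baseline
             13, 22, 17, 20,          # Control
             200, 400, 450, 300,      # Test1
             120, 140, 90, 80,        # Test2
             2700, 2500, 1800, 1750)  # VeryHighTest
)

# Per-group standard deviations on each scale
sd.raw <- tapply(dat$Values, dat$Labels, sd)
sd.log <- tapply(log(dat$Values), dat$Labels, sd)
print(round(sd.raw, 1))
print(round(sd.log, 2))

# Ratio of largest to smallest SD: a crude measure of heteroscedasticity
print(max(sd.raw) / min(sd.raw))
print(max(sd.log) / min(sd.log))

# Linear model on the log scale. "Baseline" is the first factor level,
# so each coefficient is a difference in mean log(Values) from Baseline.
fit.log <- lm(log(Values) ~ Labels, data = dat)
print(summary(fit.log))

# Back-transformed: ratios of geometric means relative to Baseline
print(exp(coef(fit.log)[-1]))
```

The spread of the standard deviations shrinks from more than a hundredfold on the raw scale to less than twofold on the log scale, and the coefficient table gives each group's comparison with Baseline directly.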

Here is a plot of the raw data. I ordered the replicates by magnitude to make it clearer that the variability is roughly constant on the logarithmic scale. In practice I wouldn't transform; I would use a generalised linear model with a logarithmic link.

[Figure: plot of the raw data, replicates ordered by magnitude]
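A minimal sketch of such a model in R, using the data from the question. The answer does not say which error family to use; the Gamma family here is an assumption, chosen because it suits data whose standard deviation grows roughly in proportion to the mean. The names `dat` and `fit.glm` are illustrative.

```r
# Data from the question, with Labels stored as a factor
dat <- data.frame(
  Labels = factor(rep(c("Baseline", "Control", "Test1", "Test2", "VeryHighTest"),
                      each = 4)),
  Values = c(450, 400, 200, 250,      # Baseline
             13, 22, 17, 20,          # Control
             200, 400, 450, 300,      # Test1
             120, 140, 90, 80,        # Test2
             2700, 2500, 1800, 1750)  # VeryHighTest
)

# GLM with a log link; the Gamma family is an assumption (see above).
# "Baseline" is the first factor level and therefore the reference group.
fit.glm <- glm(Values ~ Labels, family = Gamma(link = "log"), data = dat)
print(summary(fit.glm))

# Exponentiated coefficients: the intercept is the Baseline mean,
# the others are ratios of each group's mean to the Baseline mean
print(exp(coef(fit.glm)))

# Approximate (Wald) 95% intervals for those ratios
print(exp(confint.default(fit.glm)))
```

Because of the log link, the model works with the means on the original scale rather than with means of logged values, so the exponentiated coefficients are ratios of arithmetic group means to the Baseline mean, which is the comparison the question asks for.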
