Solved – Running several one-way ANOVA tests on different groups of the same data without inflating type I error

anova, post-hoc, r, t-test

I have two categorical independent variables (industry and location) and one continuous dependent variable (a performance metric). I need to find which industries differ significantly in mean performance within each location separately. This sounds like a task for ANOVA, but as I understand it, running a separate one-way ANOVA for each location inflates the type I error. A two-way ANOVA, on the other hand, compares mean performance by location, or by industry, or across all possible industry-location combinations, and I'm not interested in comparing industries across different locations. For example, I want to compare Canada:Energy to Canada:Basic Materials, but not Mexico:Energy to Canada:Basic Materials. In addition, the sample sizes differ between locations, although the share of observations from each industry is the same in every location, so I'm not sure how suitable the data are for a two-way ANOVA.

Sample dataset (contingency table of the counts):

         Basic Materials Energy Financials
  Canada              10     10         20
  Mexico              15     15         30
  USA                  5      5         10

Sample R code:

DATA <- data.frame(performance=rnorm(120),
               location=c(rep('USA',20),rep('Canada',40),rep('Mexico',60)),
               industry=rep(c('Basic Materials','Energy','Financials','Financials'),30))
table(DATA[,-1])
TukeyHSD(aov(performance~location*industry,data=DATA))

Any suggestions (preferably accompanied by some R code)?

Best Answer

Running the full model with the interaction is informative because it tells you whether the differences in performance between the three industries vary across the three countries. Together with plots of the data, this will tell you whether it is worth running post-hoc tests or contrasts, which need to be corrected for the additional type I error that comes with multiple testing.

You could do this in R as follows:

lm1 <- lm(performance ~industry*location, data=DATA)
lm2 <- lm(performance ~industry+location, data=DATA)
anova(lm1,lm2)
library(effects)
plot(effect("industry*location", lm1))

The ANOVA and the plot suggest that the pattern of performance across industries does not differ between the three countries (for this random data example):

Model 1: performance ~ industry * location
Model 2: performance ~ industry + location
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    111 104.08                           
2    115 108.19 -4   -4.1008 1.0933 0.3635
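
To make the table easier to read, here is a minimal sketch that recomputes the F statistic and p-value by hand from the numbers printed above. The variable names are my own; the values are copied from the output, so the result should agree with the table only up to rounding.

```r
# Values copied from the anova() table above (rounded as printed)
rss_full    <- 104.08   # residual sum of squares, performance ~ industry * location
df_full     <- 111      # residual degrees of freedom of the full model
ss_interact <- 4.1008   # sum of squares explained by the interaction terms
df_interact <- 4        # degrees of freedom of the interaction

# F = (interaction mean square) / (residual mean square of the full model)
f_stat <- (ss_interact / df_interact) / (rss_full / df_full)

# Upper-tail probability of the F distribution with (4, 111) degrees of freedom
p_val <- pf(f_stat, df_interact, df_full, lower.tail = FALSE)

round(c(F = f_stat, p = p_val), 4)
```

A large p-value here means the interaction terms add little beyond the additive model, which is what the plot shows as well.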

Testing industry separately within each of the three countries is easy with the phia package, whose testInteractions function adjusts the p-values for multiple tests (Holm's method by default). For example, to test an industry contrast within each country you can do:

library(phia)
# One indicator contrast per country, plus a single contrast among industries
custom.contr <- contrastCoefficients(location ~ USA, location ~ Mexico, 
                location ~ Canada, industry ~ Basics - Financials - Energy, 
                data=DATA, normalize=TRUE)
names(custom.contr$location) <- c("USA", "Mexico","Canada")
names(custom.contr$industry) <- c("industry")
testInteractions(lm1,custom=custom.contr)

This shows that the industry contrast is not significant in any of the three countries after the Holm adjustment:

F Test: 
P-value adjustment method: holm
                     Value  Df Sum of Sq      F Pr(>F)
   USA : industry  0.53426   1     1.713 1.5093 0.6655
Mexico : industry -0.04385   1     0.035 0.0305 0.8617
Canada : industry  0.29764   1     1.063 0.9369 0.6703
Residuals                  111   125.949  
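
If you want every pair of industries compared within each location, rather than a single contrast, one possible base-R sketch is shown below. It is not the phia approach above: it runs unadjusted pairwise t-tests separately per location (so the pooled standard deviation comes from that location only, not from the full model) and then applies one Holm correction across all within-location comparisons. The seed and the names raw, p.raw and p.holm are my own choices for illustration.

```r
# Simulated data with the same layout as in the question (pure noise)
set.seed(1)  # arbitrary seed, only so the example is reproducible
DATA <- data.frame(
  performance = rnorm(120),
  location = c(rep("USA", 20), rep("Canada", 40), rep("Mexico", 60)),
  industry = rep(c("Basic Materials", "Energy", "Financials", "Financials"), 30)
)

# Unadjusted pairwise t-tests between industries, one location at a time
raw <- do.call(rbind, lapply(split(DATA, DATA$location), function(d) {
  pt <- pairwise.t.test(d$performance, d$industry, p.adjust.method = "none")
  # Flatten the lower-triangular p-value matrix into one row per comparison
  p <- as.data.frame(as.table(pt$p.value), stringsAsFactors = FALSE)
  p <- p[!is.na(p$Freq), ]
  data.frame(location = d$location[1],
             comparison = paste(p$Var1, "vs", p$Var2),
             p.raw = p$Freq)
}))
rownames(raw) <- NULL

# A single Holm adjustment over all within-location comparisons
raw$p.holm <- p.adjust(raw$p.raw, method = "holm")
print(raw)
```

Only the three industry pairs inside each location enter the correction, so cross-location pairs such as Mexico:Energy vs Canada:Basic Materials are never tested and do not cost any power.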