The question in short
I run a regression in R and made a boxplot of the response variable with grouping by one of the predictor variables. On this boxplot I'd like to add some information about the statistical model. What information (and how to display it (it is not an issue of programming)) would you suggest me to provide?
The developed question
I have several predictors: two categorical, non-ordinal predictors and one continuous predictor (below coded in R)
set.seed(81)
pred1 = rep(c('Car', 'Bike', 'Train', 'Airplane'), 6)
pred2 = rep(c('High', 'Low', 'Middle'), 8)
pred3 = rnorm(24)
resp = c(rnorm(12, sd = 1), rnorm(12, sd = 5))
resp
is the response variable. I ran a regression with sandwich estimates:
require(sandwich)
require(lmtest)
m = aov(resp ~ pred1 + pred2)
coeftest(m, sandwich)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.49642 0.73911 -0.6716 0.51034
pred1Bike 1.55917 1.16568 1.3376 0.19769
pred1Car 1.23873 1.24080 0.9983 0.33135
pred1Train 2.50882 0.91468 2.7428 0.01338 *
pred2Low 0.11613 1.00540 0.1155 0.90932
pred2Middle 0.51476 0.90924 0.5661 0.57829
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And I boxplotted the groups for pred1
:
require(ggplot2)
ggplot(data.frame(pred1, resp), aes(x=pred1, y=resp)) + geom_boxplot()
On this plot I'd like to add some letters in order to indicate the groups which are statistically similar (p.value < 0.05) as discussed here. Something like this:
ggplot(data.frame(pred1, resp), aes(x=pred1, y=resp)) + geom_boxplot() + annotate('text', x=1:4, y=6, label=c('a','b','a','b'), size = 8, color='red')
My question is:
How can I find these p.values for the pairwise comparisons with my robust regression? I can do what follows where m
is a simple aov model:
TukeyHSD(m)
But the following doesn't work:
TukeyHSD(coeftest(m, sandwich))
I might missunderstand what these pairwise comparisons are, and what the results I currently have mean! Please let me know if you feel this! The aim of my question is for me to understand what is the best way to display the results of my statistical model on a boxplot.
Note: the variables pred2
and pred3
are used to withdraw some part of variance that I don't want to be accounted to the effect of pred1
(as pred1
, pred2
and pred3
are correlated in my case). Therefore, I guess it is better not to run simple pairwise t-test in order to get the p.values I'd like to add at the top of each boxplot.
Best Answer
One solution is actually given as an example in the book on the
multcomp
package, section 4.6:Bretz, F., Hothorn, T., & Westfall, P. H. (2011). Multiple comparisons using R. Boca Raton, FL: CRC Press.
One only needs to slightly adapt your code (everything needs to be in one data.frame instead of floating around):
Note that
glht
as default uses the single-step method as adjustment for alpha-error accumulation. If you want something else, you need to feed it toadjusted()
For example, using the Bonferroni-Holm correction (which I tend to use as I don't understand what single-step actually does):
If you want no alpha error correction, which I do not recommend, this is also possible: