Solved – Pairwise comparisons for a regression with sandwich estimates (in R)

boxplot, data visualization, p-value, regression, sandwich

The question in short

I ran a regression in R and made a boxplot of the response variable, grouped by one of the predictor variables. On this boxplot I'd like to add some information about the statistical model. What information would you suggest I provide, and how should I display it? (This is not a programming issue.)


The developed question

I have several predictors: two categorical, non-ordinal predictors and one continuous predictor (coded in R below):

set.seed(81)
pred1 = rep(c('Car', 'Bike', 'Train', 'Airplane'), 6)
pred2 = rep(c('High', 'Low', 'Middle'), 8)
pred3 = rnorm(24)
resp = c(rnorm(12, sd = 1), rnorm(12, sd = 5))

resp is the response variable. I ran a regression with sandwich estimates:

require(sandwich)
require(lmtest)


m = aov(resp ~ pred1 + pred2)
coeftest(m, sandwich)

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -0.49642    0.73911 -0.6716  0.51034  
pred1Bike    1.55917    1.16568  1.3376  0.19769  
pred1Car     1.23873    1.24080  0.9983  0.33135  
pred1Train   2.50882    0.91468  2.7428  0.01338 *
pred2Low     0.11613    1.00540  0.1155  0.90932  
pred2Middle  0.51476    0.90924  0.5661  0.57829  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

And I made a boxplot of resp grouped by pred1:

require(ggplot2)
ggplot(data.frame(pred1, resp), aes(x=pred1, y=resp)) + geom_boxplot()

[Boxplot of resp grouped by pred1]

On this plot I'd like to add letters indicating which groups are statistically similar, i.e. whose pairwise differences are not significant at the 0.05 level, as discussed here. Something like this:

ggplot(data.frame(pred1, resp), aes(x=pred1, y=resp)) + geom_boxplot() + annotate('text', x=1:4, y=6, label=c('a','b','a','b'), size = 8, color='red')

[Boxplot of resp grouped by pred1, with group letters added above the boxes]

My question is:

How can I get the p-values for these pairwise comparisons from my robust regression? With m being a plain aov model, I can do the following:

TukeyHSD(m)

But the following doesn't work:

TukeyHSD(coeftest(m, sandwich))

I might misunderstand what these pairwise comparisons are and what my current results actually mean; please let me know if you think so! The aim of my question is to understand the best way to display the results of my statistical model on a boxplot.

Note: the variables pred2 and pred3 are there to absorb some of the variance that I don't want attributed to the effect of pred1 (as pred1, pred2 and pred3 are correlated in my case). Therefore, I guess it is better not to run simple pairwise t-tests to get the p-values I'd like to add at the top of each boxplot.
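Just for contrast, by "simple pairwise t-tests" I mean something like base R's pairwise.t.test, which ignores pred2 and pred3 entirely (sketched here only to show what I'd rather avoid):

# Naive pairwise comparisons of resp between pred1 groups,
# ignoring pred2 and pred3 (for contrast only, not what I want)
pairwise.t.test(resp, pred1, p.adjust.method = 'holm')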

Best Answer

One solution is actually given as an example in the book on the multcomp package, section 4.6:

Bretz, F., Hothorn, T., & Westfall, P. H. (2011). Multiple comparisons using R. Boca Raton, FL: CRC Press.

Your code only needs a slight adaptation (everything should go into one data.frame instead of floating around as loose vectors):

require(multcomp)
require(sandwich)

set.seed(81)
pred3 = rnorm(24)
df <- data.frame(pred1 = rep(c('Car', 'Bike', 'Train', 'Airplane'), 6),
                 pred2 = rep(c('High', 'Low', 'Middle'), 8),
                 resp  = c(rnorm(12, sd = 1), rnorm(12, sd = 5)))

m <- aov(resp ~ pred1 + pred2, df)

tukey <- glht(m, linfct = mcp(pred1 = "Tukey") , vcov = sandwich)

summary(tukey, test = adjusted())

##          Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = resp ~ pred1 + pred2, data = df)
## 
## Linear Hypotheses:
##                       Estimate Std. Error t value Pr(>|t|)  
## Bike - Airplane == 0     1.559      1.166    1.34    0.547  
## Car - Airplane == 0      1.239      1.241    1.00    0.748  
## Train - Airplane == 0    2.509      0.915    2.74    0.058 .
## Car - Bike == 0         -0.320      1.422   -0.23    0.996  
## Train - Bike == 0        0.950      1.149    0.83    0.838  
## Train - Car == 0         1.270      1.225    1.04    0.726  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## (Adjusted p values reported -- single-step method)

Note that glht by default uses the single-step method to adjust for alpha-error accumulation. If you want something else, you need to pass it to adjusted().

For example, using the Bonferroni-Holm correction (which I tend to use as I don't understand what single-step actually does):

summary(tukey, test = adjusted("holm"))

If you want no alpha error correction, which I do not recommend, this is also possible:

summary(tukey, test = adjusted("none"))
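
To tie this back to the boxplot in your question: a compact letter display can be derived from the glht object with multcomp's cld() and added with annotate(), much as in your own example. A rough sketch, assuming the df and tukey objects from the code above are still in the workspace (groups sharing a letter are not significantly different at the 0.05 level):

require(ggplot2)

letters_cld <- cld(tukey)                # compact letter display from the sandwich-based comparisons
labs <- letters_cld$mcletters$Letters    # named character vector, one entry per level of pred1

ggplot(df, aes(x = pred1, y = resp)) +
  geom_boxplot() +
  annotate('text', x = names(labs), y = max(df$resp) + 1,
           label = labs, size = 8, color = 'red')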