Solved – Pairwise comparisons for a regression with sandwich estimates (in R)

boxplot, data visualization, p-value, regression, sandwich

The question in short

I ran a regression in R and made a boxplot of the response variable, grouped by one of the predictor variables. On this boxplot I'd like to add some information about the statistical model. What information would you suggest I provide, and how should I display it? (This is not a programming issue.)


The developed question

I have several predictors: two categorical, non-ordinal predictors and one continuous predictor (coded in R below):

set.seed(81)
pred1 = rep(c('Car', 'Bike', 'Train', 'Airplane'), 6)
pred2 = rep(c('High', 'Low', 'Middle'), 8)
pred3 = rnorm(24)
resp = c(rnorm(12, sd = 1), rnorm(12, sd = 5))

resp is the response variable. I ran a regression with sandwich estimates:

require(sandwich)
require(lmtest)


m = aov(resp ~ pred1 + pred2)
coeftest(m, sandwich)

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -0.49642    0.73911 -0.6716  0.51034  
pred1Bike    1.55917    1.16568  1.3376  0.19769  
pred1Car     1.23873    1.24080  0.9983  0.33135  
pred1Train   2.50882    0.91468  2.7428  0.01338 *
pred2Low     0.11613    1.00540  0.1155  0.90932  
pred2Middle  0.51476    0.90924  0.5661  0.57829  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

And I made a boxplot of resp grouped by pred1:

require(ggplot2)
ggplot(data.frame(pred1, resp), aes(x=pred1, y=resp)) + geom_boxplot()

[Boxplot of resp grouped by pred1]

On this plot I'd like to add letters indicating which groups are statistically similar, i.e. whose pairwise differences are not significant at the 0.05 level, as discussed here. Something like this:

ggplot(data.frame(pred1, resp), aes(x=pred1, y=resp)) + geom_boxplot() + annotate('text', x=1:4, y=6, label=c('a','b','a','b'), size = 8, color='red')

[Boxplot of resp grouped by pred1, with group letters added above the boxes]

My question is:

How can I get the p-values for these pairwise comparisons from my robust regression? With m being a plain aov model, I can do the following:

TukeyHSD(m)

But the following doesn't work:

TukeyHSD(coeftest(m, sandwich))

I might misunderstand what these pairwise comparisons are and what my current results actually mean; please let me know if you think so! The aim of my question is to understand the best way to display the results of my statistical model on a boxplot.

Note: the variables pred2 and pred3 are there to absorb some of the variance that I don't want attributed to the effect of pred1 (as pred1, pred2 and pred3 are correlated in my case). Therefore, I guess it is better not to run simple pairwise t-tests to get the p-values I'd like to add at the top of each boxplot.
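Just for contrast, by "simple pairwise t-tests" I mean something like base R's pairwise.t.test, which ignores pred2 and pred3 entirely (sketched here only to show what I'd rather avoid):

# Naive pairwise comparisons of resp between pred1 groups,
# ignoring pred2 and pred3 (for contrast only, not what I want)
pairwise.t.test(resp, pred1, p.adjust.method = 'holm')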

Best Answer

One solution is actually given as an example in the book on the multcomp package, section 4.6:

Bretz, F., Hothorn, T., & Westfall, P. H. (2011). Multiple comparisons using R. Boca Raton, FL: CRC Press.

Your code only needs a slight adaptation (everything should go into one data.frame instead of floating around as loose vectors):

require(multcomp)
require(sandwich)

set.seed(81)
pred3 = rnorm(24)
df <- data.frame(pred1 = rep(c('Car', 'Bike', 'Train', 'Airplane'), 6),
                 pred2 = rep(c('High', 'Low', 'Middle'), 8),
                 resp  = c(rnorm(12, sd = 1), rnorm(12, sd = 5)))

m <- aov(resp ~ pred1 + pred2, df)

tukey <- glht(m, linfct = mcp(pred1 = "Tukey") , vcov = sandwich)

summary(tukey, test = adjusted())

##          Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = resp ~ pred1 + pred2, data = df)
## 
## Linear Hypotheses:
##                       Estimate Std. Error t value Pr(>|t|)  
## Bike - Airplane == 0     1.559      1.166    1.34    0.547  
## Car - Airplane == 0      1.239      1.241    1.00    0.748  
## Train - Airplane == 0    2.509      0.915    2.74    0.058 .
## Car - Bike == 0         -0.320      1.422   -0.23    0.996  
## Train - Bike == 0        0.950      1.149    0.83    0.838  
## Train - Car == 0         1.270      1.225    1.04    0.726  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## (Adjusted p values reported -- single-step method)

Note that glht by default uses the single-step method to adjust for alpha-error accumulation. If you want something else, you need to pass it to adjusted().

For example, using the Bonferroni-Holm correction (which I tend to use as I don't understand what single-step actually does):

summary(tukey, test = adjusted("holm"))

If you want no alpha error correction, which I do not recommend, this is also possible:

summary(tukey, test = adjusted("none"))
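
To tie this back to the boxplot in your question: a compact letter display can be derived from the glht object with multcomp's cld() and added with annotate(), much as in your own example. A rough sketch, assuming the df and tukey objects from the code above are still in the workspace (groups sharing a letter are not significantly different at the 0.05 level):

require(ggplot2)

letters_cld <- cld(tukey)                # compact letter display from the sandwich-based comparisons
labs <- letters_cld$mcletters$Letters    # named character vector, one entry per level of pred1

ggplot(df, aes(x = pred1, y = resp)) +
  geom_boxplot() +
  annotate('text', x = names(labs), y = max(df$resp) + 1,
           label = labs, size = 8, color = 'red')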