Solved – How are robust standard errors calculated in the case of logistic regression

robust-standard-error

I mean the Huber/White/sandwich estimator of standard errors. In the case of continuous outcomes, robust estimation of standard errors seems rather simple to me: the variance of the residual for each observation is estimated by the squared (estimated) residual from the regression.
But I can't figure out how this applies to a binary outcome: after estimating a probability for each observation, how can I estimate the variance of the residual for that observation, given that the outcome is necessarily 0 or 1, corresponding, on the logit scale, to either $+\infty$ or $-\infty$? In particular, I'm thinking of the case where I have continuous predictors (so there is just one occurrence of each value of my set of regressors).

PS: I have read some criticism of the use of robust standard errors for logistic regression: if the variance estimates are biased, then the parameter estimates themselves are biased as well (since mean and variance are linked in the binomial case). However, I'm not asking whether robust standard errors make sense here, but simply how they are calculated.

Best Answer

The definition is completely analogous if you use the so-called working residuals and if regressors and residuals are weighted appropriately. The working weights are part of the usual iteratively weighted least squares (IWLS) algorithm employed for generalized linear models (including logistic regression).
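To make the IWLS connection concrete, here is a minimal textbook sketch of the algorithm for logistic regression on simulated data (just the idea, not the actual glm internals):

```r
## minimal IWLS sketch for logistic regression (simulated data for illustration)
set.seed(1)
x <- cbind(1, rnorm(200))
y <- rbinom(200, 1, plogis(x %*% c(-0.5, 1)))
beta <- c(0, 0)
for (i in 1:25) {
  eta <- drop(x %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)        # working weights
  z   <- eta + (y - mu) / w   # working response
  ## weighted least squares step
  beta <- solve(t(x) %*% (w * x), t(x) %*% (w * z))
}
## should agree with glm's estimates
all.equal(drop(beta), unname(coef(glm(y ~ x[, 2], family = binomial))))
```

The working residuals used below are exactly the residuals z - eta from the final iteration of this loop.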

The main idea is to use the derivative of the log-likelihood with respect to the mean or to the corresponding linear predictor as a measure of deviation. Alternatively, you can use the so-called scores or estimating functions, i.e., the derivatives of the log-likelihood with respect to the coefficients of the linear predictor. All of these ideas are employed in our sandwich package for R to enable object-oriented implementations of the different sandwich estimators. See Zeileis (2006, Object-Oriented Computation of Sandwich Estimators, doi:10.18637/jss.v016.i09) for the details.
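To illustrate the score-based view, a small sketch with the sandwich package (using the SwissLabor example from below): estfun() extracts the observation-wise scores, and their scaled cross-product forms the "meat" of the sandwich.

```r
library("sandwich")
data("SwissLabor", package = "AER")
m <- glm(participation ~ ., data = SwissLabor, family = binomial)
## estimating functions: derivative of the log-likelihood with respect
## to the coefficients, one row per observation
S <- estfun(m)
## the "meat" of the sandwich is the scaled cross-product of the scores
all.equal(meat(m), crossprod(S) / nrow(S))
## [1] TRUE
```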

A worked illustration in R using a logistic regression for labor market participation is:

data("SwissLabor", package = "AER")
m <- glm(participation ~ ., data = SwissLabor, family = binomial)

The classical Eicker/Huber/White sandwich covariance matrix can be computed using our sandwich package:

library("sandwich")
vcov_sandwich <- sandwich(m)

And by hand, you can easily extract the working residuals and regressors and weight both with (the square root of) the working weights:

r <- residuals(m, "working") * sqrt(weights(m, "working"))
X <- model.matrix(m) * sqrt(weights(m, "working"))

Then the formula for the covariance matrix is just the same as in the linear regression model:

vcov_byhand <- solve(t(X) %*% X) %*% t(X) %*% diag(r^2) %*% X %*% solve(t(X) %*% X)
all.equal(vcov_sandwich, vcov_byhand)
## [1] TRUE
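For the logit link specifically, these working quantities have simple closed forms, which also answers the original question: the relevant residual is $(y - \hat\mu)/(\hat\mu(1 - \hat\mu))$ on the working scale, which is finite even though $y$ is binary. A quick check (refitting the model from above so the snippet runs on its own):

```r
data("SwissLabor", package = "AER")
m  <- glm(participation ~ ., data = SwissLabor, family = binomial)
mu <- fitted(m)
y  <- as.numeric(SwissLabor$participation == "yes")
## working residual for the logit link: (y - mu) * d(eta)/d(mu) = (y - mu) / (mu * (1 - mu))
all.equal(residuals(m, "working"), (y - mu) / (mu * (1 - mu)),
          check.attributes = FALSE)
## [1] TRUE
## working weight for the canonical logit link: the variance function mu * (1 - mu)
all.equal(weights(m, "working"), mu * (1 - mu), check.attributes = FALSE)
## [1] TRUE
```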

Two caveats: (1) Of course, one shouldn't compute the sandwich like this in practice; more efficient and numerically more stable matrix computations should be employed. (2) For data with independent binary responses, these sandwich covariances are not robust against anything: they are consistent, which is fine, but there is no way to misspecify the likelihood without also misspecifying the model equation.
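To illustrate caveat (1), one somewhat more sensible variant might look like the following sketch: crossprod() avoids forming the explicit n x n diagonal matrix, and a Cholesky-based inverse replaces the repeated solve() calls.

```r
library("sandwich")
data("SwissLabor", package = "AER")
m <- glm(participation ~ ., data = SwissLabor, family = binomial)
r <- residuals(m, "working") * sqrt(weights(m, "working"))
X <- model.matrix(m) * sqrt(weights(m, "working"))
## t(X) %*% diag(r^2) %*% X is just crossprod(r * X): no n x n matrix needed
XtX_inv <- chol2inv(chol(crossprod(X)))
vcov_stable <- XtX_inv %*% crossprod(r * X) %*% XtX_inv
all.equal(vcov_stable, sandwich(m), check.attributes = FALSE)
## [1] TRUE
```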