Solved – Inverse probability weighting (IPW): standard errors after weighting observations

propensity-scores, r, standard-error, weighted-regression

When using propensity scores for inverse probability weighting (IPW), the standard errors for the parameters in the regression model may be affected. I have seen several examples of people using different types of standard errors (classical, robust, bootstrap) and am unsure which ones are correct to use and why. Classical weighting uses weights to indicate the precision of individual observations – this is not the case for IPW, where weights indicate the importance of observations (but not their precision).

If you want to add references to R packages, that would be appreciated, but I am primarily interested in the methods and why they should or should not be used.

Best Answer

Lunceford and Davidian (2004) derive the asymptotic standard errors for IPW estimators. These rely on a generalized estimating equations approach and assume the propensity scores are estimated using a method that can be represented as a system of estimating equations (e.g., logistic regression, but not random forests). Their proof also indicates that IPW estimators are smooth and asymptotically normal, making them amenable to bootstrapping. They also find that excluding the propensity score estimation from the estimating equations and treating the weights as fixed yields conservative estimates of the standard error. This is equivalent to using a robust standard error in the outcome regression model.
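To make the setup concrete, here is a minimal sketch of the basic IPW estimator being discussed (in Python/NumPy for self-containedness rather than the R packages named below; the function names are my own, and the logistic fit is a bare-bones Newton–Raphson rather than a production routine):

```python
import numpy as np

def fit_logistic(X, t, iters=50):
    """Logistic regression coefficients via Newton-Raphson (IRLS)."""
    Xd = np.column_stack([np.ones(len(t)), X])   # add an intercept
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xd @ beta))
        grad = Xd.T @ (t - p)                     # score
        H = Xd.T @ (Xd * (p * (1 - p))[:, None])  # observed information
        beta += np.linalg.solve(H, grad)
    return beta

def ipw_ate(X, t, y):
    """Hajek-style IPW estimate of the average treatment effect."""
    Xd = np.column_stack([np.ones(len(t)), X])
    ps = 1 / (1 + np.exp(-Xd @ fit_logistic(X, t)))  # propensity scores
    w = t / ps + (1 - t) / (1 - ps)                  # IPW weights
    mu1 = np.sum(w * t * y) / np.sum(w * t)          # weighted treated mean
    mu0 = np.sum(w * (1 - t) * y) / np.sum(w * (1 - t))
    return mu1 - mu0
```

The weights `w` are exactly the "importance" weights from the question: 1/e(x) for treated units and 1/(1−e(x)) for controls, with e(x) the estimated propensity score. The three approaches below differ only in how the standard error of this estimate is obtained.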

This leads to three ways to validly estimate the standard error in IPW:

  1. Using generalized estimating equations with the propensity score and outcome models included together. This can be manually programmed using geex in R, and some R packages like PSweight can also compute them. In SAS, PROC CAUSALTRT automatically computes the correct standard errors, and in Stata, teffects ipw uses the same approach.
  2. Using a robust standard error for the outcome model. This will generally be conservative and is the simplest and most flexible approach because it can be used with weight-estimation methods that are not implemented in those packages or can't be represented as systems of estimating equations, like generalized boosted modeling, which is a somewhat popular method. To do this in R, you would use sandwich::vcovHC() after a glm() or lm() call with the outcome model, survey::svyglm(), which is recommended in the twang and WeightIt documentation, or geepack::geeglm() as recommended by Hernán and Robins (2020). In SAS, you would use PROC SURVEYREG, and in Stata you would supply the weights as pweights in any regression model, which automatically requests robust standard errors.
  3. Using the bootstrap. The bootstrap, in which you include both the propensity score estimation and the effect estimation within each replication, is a very effective method because it does not rely on asymptotic arguments, can be used with weight-estimation methods that can't be or aren't implemented in the packages named above, and can be used for any estimand, regardless of whether analytical standard errors have been derived for it (e.g., for the rate ratio in a negative binomial outcome model). The difficulties are that one needs to know how to program a bootstrap and must be prepared to wait a potentially long time when the estimation procedure is slow, e.g., with some machine learning methods. Also, the bootstrap will tend to yield slightly different estimates each run, adding a layer of simulation uncertainty to the estimation. Using the bootstrap is easiest in R with the boot package.
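As a sketch of option 2, this is what the conservative fixed-weight sandwich variance looks like computed by hand (Python/NumPy for illustration; the function name is mine, and this is the plain HC0 flavor with the IPW weights treated as known constants):

```python
import numpy as np

def wls_robust(X, y, w):
    """Weighted least squares with an HC0 sandwich (robust) variance,
    treating the weights as fixed -- the conservative option."""
    XtW = X.T * w                           # X' diag(w)
    bread = np.linalg.inv(XtW @ X)          # (X' W X)^{-1}
    beta = bread @ (XtW @ y)                # WLS coefficients
    e = y - X @ beta                        # residuals
    meat = (X.T * (w * e) ** 2) @ X         # sum_i w_i^2 e_i^2 x_i x_i'
    V = bread @ meat @ bread                # sandwich variance
    return beta, np.sqrt(np.diag(V))        # coefficients and robust SEs
```

With a design matrix of just an intercept and the treatment indicator, the coefficient on treatment is the weighted difference in means, and its robust SE is the conservative estimate described in point 2.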
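And a minimal sketch of option 3, with the key detail that the propensity model is refit inside every bootstrap replicate (again Python/NumPy rather than R's boot; the names are illustrative and the logistic fit is a bare Newton–Raphson):

```python
import numpy as np

def logit_ps(X, t, iters=50):
    """Propensity scores from a Newton-Raphson logistic fit."""
    Xd = np.column_stack([np.ones(len(t)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xd @ beta))
        beta += np.linalg.solve(Xd.T @ (Xd * (p * (1 - p))[:, None]),
                                Xd.T @ (t - p))
    return 1 / (1 + np.exp(-Xd @ beta))

def ipw_estimate(X, t, y):
    """Hajek-style IPW average treatment effect estimate."""
    ps = logit_ps(X, t)
    w = t / ps + (1 - t) / (1 - ps)
    return (np.sum(w * t * y) / np.sum(w * t)
            - np.sum(w * (1 - t) * y) / np.sum(w * (1 - t)))

def bootstrap_ci(X, t, y, reps=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI; the propensity model is re-estimated
    in each replicate so its uncertainty is propagated."""
    rng = np.random.default_rng(seed)
    n = len(y)
    est = np.empty(reps)
    for b in range(reps):
        i = rng.integers(0, n, n)           # resample rows with replacement
        est[b] = ipw_estimate(X[i], t[i], y[i])
    return np.quantile(est, [alpha / 2, 1 - alpha / 2])
```

A percentile interval is the simplest choice here; in R, boot::boot.ci() also offers bias-corrected (BCa) intervals, which are often preferable.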

I am most inclined to use robust standard errors because of their flexibility and ease of use. For a serious project where conservative standard errors could be a liability and I had a lot of time, I would use the bootstrap.
