Solved – Inverse probability weighting (IPW): standard errors after weighting observations

propensity-scores, r, standard-error, weighted-regression

When using propensity scores for inverse probability weighting (IPW), the standard errors for the parameters in the regression model may be affected. I have seen several examples of people using different types of standard errors (classical, robust, bootstrap) and am unsure which ones are correct to use and why. Classical weighting uses weights to indicate the precision of individual observations – this is not the case for IPW, where weights indicate the importance of observations (but not their precision).

If you want to add references to R packages, that would be appreciated, but I am primarily interested in the methods and why they should or should not be used.

Best Answer

Lunceford and Davidian (2004) derive the asymptotic standard errors for IPW estimators. These rely on a generalized estimating equations approach and assume the propensity scores are estimated using a method that can be represented as a system of estimating equations (e.g., logistic regression, but not random forests). Their proof also indicates that IPW estimators are smooth and asymptotically normal, making them amenable to bootstrapping. They also find that excluding the propensity score estimation from the estimating equations and treating the weights as fixed yields conservative estimates of the standard error. This is equivalent to using a robust standard error in the outcome regression model.
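To make the setup concrete, here is a minimal sketch of the basic IPW estimator being discussed (in Python/NumPy for self-containedness rather than the R packages named below; the function names are my own, and the logistic fit is a bare-bones Newton–Raphson rather than a production routine):

```python
import numpy as np

def fit_logistic(X, t, iters=50):
    """Logistic regression coefficients via Newton-Raphson (IRLS)."""
    Xd = np.column_stack([np.ones(len(t)), X])   # add an intercept
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xd @ beta))
        grad = Xd.T @ (t - p)                     # score
        H = Xd.T @ (Xd * (p * (1 - p))[:, None])  # observed information
        beta += np.linalg.solve(H, grad)
    return beta

def ipw_ate(X, t, y):
    """Hajek-style IPW estimate of the average treatment effect."""
    Xd = np.column_stack([np.ones(len(t)), X])
    ps = 1 / (1 + np.exp(-Xd @ fit_logistic(X, t)))  # propensity scores
    w = t / ps + (1 - t) / (1 - ps)                  # IPW weights
    mu1 = np.sum(w * t * y) / np.sum(w * t)          # weighted treated mean
    mu0 = np.sum(w * (1 - t) * y) / np.sum(w * (1 - t))
    return mu1 - mu0
```

The weights `w` are exactly the "importance" weights from the question: 1/e(x) for treated units and 1/(1−e(x)) for controls, with e(x) the estimated propensity score. The three approaches below differ only in how the standard error of this estimate is obtained.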

This leads to three ways to validly estimate the standard error in IPW:

  1. Using generalized estimating equations with the propensity score and outcome models included together. This can be manually programmed using geex in R, and some R packages like PSweight can also compute them. In SAS, PROC CAUSALTRT automatically computes the correct standard errors, and in Stata, teffects ipw uses the same approach.
  2. Using a robust standard error for the outcome model. This will generally be conservative and is the simplest and most flexible approach because it can be used with weight-estimation methods that are not implemented in those packages or can't be represented as systems of estimating equations, like generalized boosted modeling, which is a somewhat popular method. To do this in R, you would use sandwich::vcovHC() after a glm() or lm() call with the outcome model, survey::svyglm(), which is recommended in the twang and WeightIt documentation, or geepack::geeglm() as recommended by Hernán and Robins (2020). In SAS, you would use PROC SURVEYREG, and in Stata you would supply the weights as pweights in any regression model, which automatically requests robust standard errors.
  3. Using the bootstrap. The bootstrap, in which you include both the propensity score estimation and the effect estimation within each replication, is a very effective method because it does not rely on asymptotic arguments, can be used with weight-estimation methods that can't be or aren't implemented in the packages named above, and can be used for any estimand, regardless of whether analytical standard errors have been derived for it (e.g., for the rate ratio in a negative binomial outcome model). The difficulties are that one needs to know how to program a bootstrap and must be prepared to wait a potentially long time when the estimation procedure is slow, e.g., with some machine learning methods. Also, the bootstrap will tend to yield slightly different estimates each run, adding a layer of simulation uncertainty to the estimation. Using the bootstrap is easiest in R with the boot package.
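As a sketch of option 2, this is what the conservative fixed-weight sandwich variance looks like computed by hand (Python/NumPy for illustration; the function name is mine, and this is the plain HC0 flavor with the IPW weights treated as known constants):

```python
import numpy as np

def wls_robust(X, y, w):
    """Weighted least squares with an HC0 sandwich (robust) variance,
    treating the weights as fixed -- the conservative option."""
    XtW = X.T * w                           # X' diag(w)
    bread = np.linalg.inv(XtW @ X)          # (X' W X)^{-1}
    beta = bread @ (XtW @ y)                # WLS coefficients
    e = y - X @ beta                        # residuals
    meat = (X.T * (w * e) ** 2) @ X         # sum_i w_i^2 e_i^2 x_i x_i'
    V = bread @ meat @ bread                # sandwich variance
    return beta, np.sqrt(np.diag(V))        # coefficients and robust SEs
```

With a design matrix of just an intercept and the treatment indicator, the coefficient on treatment is the weighted difference in means, and its robust SE is the conservative estimate described in point 2.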
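And a minimal sketch of option 3, with the key detail that the propensity model is refit inside every bootstrap replicate (again Python/NumPy rather than R's boot; the names are illustrative and the logistic fit is a bare Newton–Raphson):

```python
import numpy as np

def logit_ps(X, t, iters=50):
    """Propensity scores from a Newton-Raphson logistic fit."""
    Xd = np.column_stack([np.ones(len(t)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xd @ beta))
        beta += np.linalg.solve(Xd.T @ (Xd * (p * (1 - p))[:, None]),
                                Xd.T @ (t - p))
    return 1 / (1 + np.exp(-Xd @ beta))

def ipw_estimate(X, t, y):
    """Hajek-style IPW average treatment effect estimate."""
    ps = logit_ps(X, t)
    w = t / ps + (1 - t) / (1 - ps)
    return (np.sum(w * t * y) / np.sum(w * t)
            - np.sum(w * (1 - t) * y) / np.sum(w * (1 - t)))

def bootstrap_ci(X, t, y, reps=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI; the propensity model is re-estimated
    in each replicate so its uncertainty is propagated."""
    rng = np.random.default_rng(seed)
    n = len(y)
    est = np.empty(reps)
    for b in range(reps):
        i = rng.integers(0, n, n)           # resample rows with replacement
        est[b] = ipw_estimate(X[i], t[i], y[i])
    return np.quantile(est, [alpha / 2, 1 - alpha / 2])
```

A percentile interval is the simplest choice here; in R, boot::boot.ci() also offers bias-corrected (BCa) intervals, which are often preferable.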

I am most inclined to use robust standard errors because of their flexibility and ease of use. For a serious project where conservative standard errors could be a liability and I had a lot of time, I would use the bootstrap.
