propensity-scores – Handling Unbalanced Variables After IPTW with Entropy Balancing

propensity-scoresrtreatment-effectweighted-data

After using inverse probability of treatment weighting (IPTW) on the variables of my dataset, there is still an imbalance in one covariate between the two groups. My outcome is binary (yes/no) and it is not a longitudinal study.

One example is:

library(WeightIt)
W.out <- weightit(treat ~ age + married + race,
                  data = lalonde, estimand = "ATE", method = "ps")
bal.tab(W.out, threshold=0.1)

Age is not balanced.

How can I make all the variables balanced? Is it possible to "re-weight"? How?
Is it possible to apply directly "entropy balancing" instead of IPTW
in this case? Can somebody explain to me entropy balancing? I tried
reading the original paper (here) but I didn't understand it so
much. How is entropy balance computed? Can it be always used at the same conditions as IPTW or are there particular conditions?
If entropy balancing is able to adjust with Standardized differences of almost 0, then why is it so little used in the medical field?
I noticed that in some papers there is the cohort after 1st weighting, then 2nd weighting, etc.. can someone explain how you obtain this? how many weighting do you have to do?

For instance, if I want to use this code:

W.out <- weightit(treat ~ age + married + race,
                  data = lalonde, estimand = "ATE", method = "ebal")

What are the parameters that I have to set and that I have to pay attention for in order to know that I applied the method correctly? Is there a way to visualize the scores from which the weights were obtained from? as in the case of IPTW (W.out$ps)

Best Answer

These are some good questions. I'll do my best to give simple answers to them.

Entropy balancing (EB) for the ATT (which is not your query) is IPTW. It implicitly estimates a propensity score (PS) using logistic regression, but instead of doing so with maximum likelihood, it does so using a different algorithm that yields exact mean balance on the included covariates. This is described in Zhao & Percival (2017) and Zhou (2019), among others.

However, it was not known that this was what EB was when it was first described in Hainmueller (2012). Hainmueller considered EB an optimization problem: estimate weights for each individual such that the following characteristics hold: the covariate means are exactly balanced after weighting, the weights are positive, and the "negative entropy" of the weights is minimized. The negative entropy is a measure of variability, so EB weights are meant to be less extreme than standard IPTW weights. Instead of having to do the optimization problem and estimate $n$ parameters (i.e., a weight for each individual in the sample), Hainmueller discovered a trick where you can just estimate one parameter for each variable to be balanced. The reason this trick is possible is because of the later-discovered fact that EB is a special kind of logistic regression, and in logistic regression you just estimate one parameter for each variable (i.e., the regression coefficient).

For the ATE, unfortunately, it's a different story. The nice equivalence between logistic regression and EB doesn't hold, but WeightIt still relies on the trick of estimating one parameter per variable (actually two, one for each treatment group) instead of estimating a weight for each unit. How WeightIt does it is irrelevant, but to summarize, it performs EB twice, once for each treatment group, and estimates weights for each treatment group that yield exact mean balance on the covariates between each treatment group and the overall sample.

Since the goal of IPTW is to achieve balance, EB skips the step of estimating a PS and goes straight to balance, while ensuring the weights have minimal variability. For this reason, it performs excellently in simulations and real data. It is in line with the philosophy of matching as nonparametric preprocessing described by Ho et al. (2007), who identify the PS tautology, which is that a good PS achieves balance, but the only way to evaluate a PS is to assess whether it has achieved balance. So EB skips the middleman and goes straight to balance, skipping over the steps of estimating a PS, checking balance, if balance isn't good, choosing a different PS specification, etc. EB guarantees exact mean balance on the covariates right away.

There are two philosophies to estimating PSs, which I described in detail in this post, which mentions EB and its alternatives. First, there is the philosophy of trying to estimate the PS as accurately as possible, because then the "magical" properties of the PS that guarantee unbiasedness in large samples come into play. Second, there is the philosophy of estimating PSs that yield balance with no attempt to estimate the true PS or even an accurate one. EB falls squarely in the second camp, omitting a PS entirely. However, one weakness of this is that the magical properties of the PS cannot come into play: you can only balance the terms you request to be balanced, and there is no guarantee the rest of the covariate distribution (i.e., moments beyond the means, features of the joint distribution like covariances) will be balanced unless those are specifically requested, too. An analyst at SAS said, wisely, "When a metric becomes a target, it ceases to be a metric"; that is, measured covariate balance is a metric of the PS's ability to balance unmeasured features of the covariate distribution (and by unmeasured I mean unseen features of the distribution of observed covariates, not unmeasured covariates), and achieving measured balance automatically using EB doesn't tell you about the unmeasured features of the covariate distribution. You can no longer rely on the theoretical properties of the PS to balance the distributions.

Okay, I know I've been a little theoretical and technical here. I'l bring it back to answering your questions directly.

How can I make all the variables balanced? Is it possible to "re-weight"? How?

You can use EB directly on the covariates; you don't need to re-weight (i.e., apply entropy balancing to the propensity score-weighted sample). That is, if your IPTWs didn't yield balance, toss them out and use a different method of estimating weights. EB is one, but there are others. My favorite is energy balancing, which is also implemented in WeightIt. (It actually is possible to combine IPTW and EB, which was one of the winning methods in the 2016 ACIC data competition. It has not been studied beyond that, though.)

Is it possible to apply directly "entropy balancing" instead of IPTW in this case? Can somebody explain to me entropy balancing? I tried reading the original paper (here) but I didn't understand it so much. How is entropy balance computed? Can it be always used at the same conditions as IPTW or are there particular conditions?

I attempted to answer this above, but I'll summarize. EB for the ATE skips the PS and estimates weights that exactly balance the covariate means and ensure the weights have minimal variability. The specific method of estimation is a very simple optimization that runs extremely fast. For the ATT, the story is slightly different, and more connections to standard IPTW exist. For a treatment at a single time point, EB can be used in the exact same situations IPTW can, including for binary, multi-category, and continuous treatments, for the ATT or ATE, for subgroup analysis, etc. The estimates from EB have the exact same interpretations as those from IPTW. There are many extensions to entropy balancing, including for longitudinal treatments and when you have a single treated unit and multiple controls (this is called the synthetic control method). For the ATT, it performs almost uniformly better than logistic regression-based PS weighting except in pathological circumstances.

If entropy balancing is able to adjust with Standardized differences of almost 0, then why is it so little used in the medical field?

Mostly because medical researchers have not heard of it, and even if they have, they might be scared to use it because it sounds complicated, even though it isn't. It is very popular in labor economics and is getting more popular in medicine and other fields as well, slowly. It deserves way more attention and, in my opinion, should be the first method a researcher tries, not a backup when IPTW fails. It must be accompanied by a robust assessment of balance because the theoretical properties of the propensity score do not apply (for the ATE, but they actually do for the ATT); this includes assessing balance beyond the means using, e.g., KS statistics and balance statistics for interactions and polynomial terms, which are all available in cobalt.

I noticed that in some papers there is the cohort after 1st weighting, then 2nd weighting, etc.. can someone explain how you obtain this? how many weighting do you have to do?

I'm not exactly sure what you're referring to, but this is probably multiple attempts to estimate a single set of weights that balance the covariates. E.g., you try a logistic regression, then a logistic regression with squared terms added, then with some interactions added, etc. Only the properties of the final set of weights (i.e., those that yield the best balance without sacrificing precision) should be reported and used in effect estimation, but it is important to describe your process of estimating weights in your manuscript to ensure your procedure is replicable. (There are some contexts where multiple sets of weights are combined together, but that is an advanced matter that is beyond the scope of your question.)

Go forth, and use entropy balancing!

Related Solutions

R – Reduced Effective Sample Size in Balanced Population After Inverse Probability Treatment Weighting

First let's talk about weighting. Weighting involves estimating a weight for each unit in the sample and then estimating the treatment effect in a way that incorproates the weights, such as weighted linear regression, weighted maximum likelihood, or a weighted difference (or ratio, etc.) in means. The weights also enter the balance statistics (e.g., standardized mean difference and KS statistics).

Separate from IPTW, it's important to understand that weighted samples have lower precision than the corresponding unweighted sample. By precision, I mean the standard error of the estimate of the desired quantity (e.g., the causal mean or difference between causal means). If we think of uncertainty arising due to the sampling of individuals, the estimate will change a lot if a unit with a large weight is swapped out for another unit in the population. Similarly, having a unit with a weight of 2 is not the same as having 2 units; choosing to use one unit's information twice doesn't give you more information about your sample; you are just changing the importance of a unit's contribution to your estimates.

The ratio of the variance (i.e., square of the standard error) of the estimate of a mean from a weighted sample to the estimate of a mean from an unweighted sample is known as the "design effect" (i.e., the effect of the "design" on the precision of the estimate) and was derived by Shook-Sa and Hudgens (2020) to be $$ \text{deff}_w = \frac{N \sum_{i}w_i^2}{(\sum_{i} w_i)^2} $$ where $N$ is the sample size of the group and $w_i$ is the weight for unit $i$. The effective sample size (ESS) is defined as $$ \text{ESS} = \frac{(\sum_i w_i)^2}{\sum_i w_i^2} = \frac{N}{\text{deff}_w} $$ and represents the size of an unweighted sample that contains the same precision as a weighted sample. When the weights are scaled to have an average of 1 (i.e., and a sum of $N$), the ESS can be equivalently written as $$ \text{ESS} = \frac{N}{1 + \text{Var}(w^*)} $$ where $\text{Var}(w^*)$ is the variance of the scaled weights computed using the population formula. This latter formula makes it easy to see that as the variance of the weights increases, the ESS gets smaller.

The ESS isn't the same as the sample size, but it functions like it. That is, if you have a sample of 1000 units but after weighting the sample has an ESS of 500, then you will have (approximately) the same precision as if you only had 500 units in an unweighed sample. That is the price we pay for using the weights to remove confounding. In this way, matching and weighting both function the same way, which is to trade precision for unbiasedness. The fact that weighting retains all units doesn't mean the estimates will be as precise as had you not done weighting at all; in fact, the ESS from IPTW can be smaller than the remaining sample size after matching.

The frequencies of events/conditions in the dataset is not a useful way to think about the changes caused by the weighting. The frequencies are the same as they prior to weighting; you will still have the same number of events, the same number of patients with conditions, etc., but those patients will contribute different amounts of information to the estimation of the effect and change the precision accordingly. It is useful, though, to think about weighted means and proportions of variables, since these represent balance in the weighted sample. So while the proportion of married units in one group may change from .2 to .4, that doesn't mean the actual number of married units has changed. It means the weighted sample is meant to represent a population in which 40% are married.

So, to sum up:

The size of the weighted sample and the number of units with each condition (e.g., the number of events) is the same as the unweighted sample. Weighting does not change the sample size or number of events; it changes the relative contribution of each individual to the estimation of the treatment effect
The effective sample size (ESS) is a measure that approximately captures the degree of precision remaining in the sample after weighting; it is not a literal sample size. It is a diagnostic statistic used for analysts to decide whether their weights have degraded their precision to an unacceptable degree. It should be reported in papers but very rarely is.
The frequency of events is not a useful way to think about prevalence in a weighted sample. The frequency is the same in the weighted and unweighted samples, but the amount of information contained in each event changes depending on its weight. The weighted proportions/rates capture this information. Balance should be assessed on the proportion of events, not their frequency, so computing the weighted mean of a binary variable is the appropriate way to characterize prevalence after weighting, not a frequency count.

IPTW – Calculating Outcomes in Inverse Probability Treatment Weighting (IPTW) Using R

Weights are not applied to individual variables. They are applied to the whole sample once estimated. So question 1 doesn't make sense. Instead of using svyglm(), let's use lm(), which has a simpler interface. Running lm(Y ~ treat, data = data, weights = weights), which fits a weighted least squares regression, and looking at the coefficient on treat is the same as computing the weighted difference in outcome means

with(data, 
  weighted.mean(Y[treat == 1], weights[treat == 1]) - 
    weighted.mean(Y[treat == 0], weights[treat == 0]))

They are two ways of doing the same thing. This is called the Hajek estimator of the treatment effect. We use the former because it is more straightforward to compute standard errors. Using svyglm() does the same thing but uses a different interface. (Note the standard errors from lm() are incorrect and need to be adjusted, but the ones from svyglm() are approximately correct.)

In question 2, you ask why we would further adjust for covariates after weighting and why the estimate changes. We further adjust for covariates for two reasons: 1) to reduce bias due to remaining imbalance, i.e., when the weights don't exactly balance the covariate means, which is always the case for standard IPTW (but not for all method like entropy balancing, which does perfectly balance the means), and 2) to increase the precision of the effect estimate (decrease the standard error) by explaining variability in the outcome. If you exactly balance your covariate means, then it doesn't matter whether you include covariates in the outcome model or not; the estimate will be the same, which is demonstrated in Hainmueller (2012) who proposes entropy balancing.

For question 3, a rate is a mean. It is the frequency divided by the sample size. We can compute a weighted rate for a binary variable by computing the weighted mean of that variable. It doesn't matter whether the weights were designed to balance a variable or not, you can still compute a weighted mean using those weights. That is, to compute the weighted death rate under control in the weighted sample, you just run

with(data,
  weighted.mean(Y[treat == 0], weights[treat == 0]))

This is also equal to the intercept in the weighted least squares model for the outcome when no covariates are included. So the weights are the IPTW weights, the only weights that are being estimated.

It might be that the ease of using the software is obfuscating some of the details for you. I recommend doing everything manually so you understand what each step is actually doing. For example, estimate the propensity scores (PS) manually:

ps.fit <- glm(treat ~ age + educ + married + re74, data = lalonde,
              family = binomial)
ps <- ps.fit$fitted.values

You can see we get one PS for each unit. Now, compute the IPTW ATT weights from the PS using the formula:

weights <- ifelse(lalonde$treat == 1, 1, ps / (1 - ps))

You can see we get one weights for each unit. Hopefully this makes it clear the we don't have weights for variables; we have one weight for each unit, and this set of weights balances the covariates and is used in the outcome model to estimate the treatment effect. The reason we use it in the outcome model is because it balances the covariates.

We can assess balance using the weights by computing the weighted difference in proportion or the standardized mean difference after weighting:

# Weighted difference in proportion for `married`
with(lalonde,
  weighted.mean(married[treat == 1], weights[treat == 1]) -
    weighted.mean(married[treat == 0], weights[treat == 0]))

# Weighted SMD for `age`
with(lalonde,
  (weighted.mean(age[treat == 1], weights[treat == 1]) -
     weighted.mean(age[treat == 0], weights[treat == 0])) /
       sd(age[treat == 1]))

You should see that these align with the bal.tab() output.

Finally, if balance is acceptable, we can computed the weighted outcome means and their difference or use lm() or svyglm() to estimate the treatment effect as a coefficient in the outcome regression model:

# Weighted outcome mean under control
m0 <- with(lalonde, weighted.mean(re78[treat == 0], weights[treat == 0]))

# Weighted outcome mean under treatment
m1 <- with(lalonde, weighted.mean(re78[treat == 1], weights[treat == 1]))

# Difference in weighted means: the treatment effect estimate
m1 - m0

# Using linear regression to estimate treatment effect
lm(re78 ~ treat, data = lalonde, weights = weights) |>
  coef()

The first step of estimating the propensity scores and weights is done by weightit(), and the second step is done by bal.tab(). But it is important to run these analyses yourself manually to understand where these values are coming from. Hopefully that elucidates the method for you.

Best Answer

Related Solutions

R – Reduced Effective Sample Size in Balanced Population After Inverse Probability Treatment Weighting

IPTW – Calculating Outcomes in Inverse Probability Treatment Weighting (IPTW) Using R

Related Question