As some of the information you provided states, the two are not the same. I prefer the terminology of conditional (on covariates) and unconditional (marginal) estimates. There is a very subtle language problem that clouds the issue greatly. Analysts who favor "population average effects" have a dangerous tendency to estimate such effects from a sample with no reference to any population distribution of subject characteristics. In this sense the estimates should not be called population average estimates but rather sample average estimates. It is important to note that sample average estimates have a low chance of being transportable to the population from which the sample came, or in fact to any population. One reason for this is the somewhat arbitrary selection criteria for how subjects get into studies.
As an example, if one compares treatment A and treatment B in a binary logistic model adjusted for sex, one obtains a treatment effect that is specific to both males and females. If the sex variable is omitted from the model, a sample average odds ratio for treatment is obtained. This is in effect a comparison of some of the males on treatment A with some of the females on treatment B, due to non-collapsibility of the odds ratio. If one had a population with a different female:male frequency, this average treatment effect, coming from a marginal odds ratio for treatment, would no longer apply.
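A tiny numeric sketch of this non-collapsibility (the logistic coefficients and the 50/50 sex split below are made up purely for illustration): even when the within-sex treatment odds ratio is identical for males and females, the marginal odds ratio, formed by averaging probabilities over the sexes first, is pulled toward 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical model: logit P(Y=1) = -1 + 1*treat + 2*female,
# so the conditional (sex-specific) treatment odds ratio is exp(1) in both sexes.
def p(treat, female):
    return sigmoid(-1 + 1 * treat + 2 * female)

def odds(prob):
    return prob / (1 - prob)

# Conditional odds ratios: identical for males and females
or_male = odds(p(1, 0)) / odds(p(0, 0))
or_female = odds(p(1, 1)) / odds(p(0, 1))

# Marginal odds ratio in a 50/50 male/female population:
# average the probabilities first, then form the odds ratio
p1 = 0.5 * p(1, 0) + 0.5 * p(1, 1)
p0 = 0.5 * p(0, 0) + 0.5 * p(0, 1)
or_marginal = odds(p1) / odds(p0)

print(or_male, or_female, or_marginal)  # marginal OR is closer to 1
```

The conditional odds ratio is exp(1) ≈ 2.72 in each sex, but the marginal odds ratio is about 2.23; with a different female:male mix it would shift again, which is exactly the transportability problem described above.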
So if one wants a quantity that pertains to individual subjects, full conditioning on covariates is required. And these conditional estimates are the ones that transport to populations, not the so-called "population average" estimates.
Another way to think about it: think of an ideal study for comparing treatment to no treatment. This would be a multi-period randomized crossover study. Then think about the next best study: a randomized trial on identical twins where one of the twins in each pair is randomly selected to get treatment A and the other is selected to get treatment B. Both of these ideal studies are mimicked by full conditioning, i.e., full covariate adjustment to get conditional and not marginal effects from the more usual parallel group randomized controlled trial.
You'll want to check out McCaffrey et al. (2013) for advice on this, not Austin & Stuart (2015), which is for binary treatments only. It's not clear to me which causal estimand you want, so I'll explain how to get weights for both.
The ATE for any pair of treatments is the effect of moving everyone from one treatment to another. In your example, one ATE would be the effect of moving the entire population from A to B, while another might be the effect of moving the entire population from B to D.
To estimate ATE weights, you take the inverse of the estimated probability of being in the group actually assigned. So, for an individual in group A, their weight would be $w_{ATE,i}=\frac{1}{e_{A,i}}$. More generally, the weights are
$$w_{ATE,i} = \sum_{j=1}^p{\frac{I(Z_i=j)}{e_{j,i}}}$$
where $j$ indexes treatment group, $I(Z_i=j)=1$ if $Z_i=j$ and $0$ otherwise, and $e_{j,i}=P(Z_i=j|X_i)$.
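As a quick sketch of this formula (the generalized propensity scores and treatment assignments below are made up), note that the sum over $j$ collapses to the single term with $I(Z_i=j)=1$, so each unit's ATE weight is just one over its probability of receiving its own treatment:

```python
# Hypothetical generalized propensity scores: e[i][j] = P(Z_i = j | X_i),
# with each unit's probabilities summing to 1 across the three groups.
e = [
    {"A": 0.2, "B": 0.5, "D": 0.3},
    {"A": 0.6, "B": 0.1, "D": 0.3},
    {"A": 0.25, "B": 0.25, "D": 0.5},
]
z = ["A", "D", "B"]  # observed treatment assignments

# ATE weight: inverse probability of the group actually assigned
w_ate = [1.0 / e_i[z_i] for e_i, z_i in zip(e, z)]
print(w_ate)  # [5.0, ~3.33, 4.0]
```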
The ATT involves choosing one group to be the "treated" or focal group. Each ATT is a comparison between another treatment group and this focal group for members of the focal group. If we let group B be the focal group, one ATT is the effect of moving from A to B for those in group B. Another ATT is the effect of moving from D to B for those in group B.
The weights for the focal group are equal to 1, and the weights for the non-focal groups are equal to the probability of being in the focal group divided by the probability of being in the group actually assigned. So,
$$w_{ATT(f),i} = I(Z_i=f)+e_{f,i}\sum_{j \ne f}{\frac{I(Z_i=j)}{e_{j,i}}}= e_{f,i}\, w_{ATE,i}$$
where $f$ is the focal group. So, just as in the binary ATT case, the ATT weights are formed by multiplying the ATE weights by the propensity score for the focal group (i.e., the probability of being in the "treated" group). In the binary ATT case, the focal group is group 1, so the probability of being in the focal group is just the propensity score.
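A small sketch of the ATT weights with made-up generalized propensity scores and group B as the focal group: focal-group units get weight 1, and every other unit gets $e_{f,i}/e_{j,i}$, which is the same as multiplying its ATE weight $1/e_{j,i}$ by $e_{f,i}$:

```python
# Hypothetical generalized propensity scores: e[i][j] = P(Z_i = j | X_i)
e = [
    {"A": 0.2, "B": 0.5, "D": 0.3},
    {"A": 0.6, "B": 0.1, "D": 0.3},
    {"A": 0.25, "B": 0.25, "D": 0.5},
]
z = ["A", "D", "B"]  # observed treatment assignments
focal = "B"

# ATT weight: P(focal group) / P(group actually assigned);
# for focal-group units this ratio is exactly 1
w_att = [e_i[focal] / e_i[z_i] for e_i, z_i in zip(e, z)]
print(w_att)  # the third unit is in the focal group and gets weight 1
```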
Note that all of these formulas apply to the binary treatment case as well.
Using the WeightIt package in R, you would specify
w.out <- weightit(Treatment ~ X1 + X2 + X3, data = data, estimand = "ATT", focal = "B")
to estimate the ATT weights for B as the focal group using multinomial logistic regression. After checking balance (e.g., using cobalt), you can estimate the outcome model as
fit <- glm(Y ~ relevel(Treatment, "B"), data = data, weights = w.out$weights)
You need to make sure the focal group is the reference level of the treatment variable for the coefficients to be valid ATT estimates.
The method to estimate representative treatment effects using regression is called g-computation and works with any outcome type as long as the effect measure can be specified as a contrast between means (e.g., a mean difference, a ratio between marginal probabilities, a ratio between marginal odds, etc.). Here's how this works:

1. Fit a model for the outcome as a function of the treatment and the covariates.
2. For each unit, use the model to compute a predicted outcome under each treatment level, setting treatment to that level while leaving the covariates at their observed values.
3. Average the predicted outcomes within each treatment level and compute the contrast of interest between these average predictions.
This method of g-computation estimates the ATE. To estimate the ATT, steps 2 and 3 should be done using only the treated units. The control units are still used to fit the model in step 1, but only the treated units are used to compute the predicted values.
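Here is a minimal sketch of steps 2 and 3, assuming step 1 has already produced a fitted logistic outcome model (the coefficients and data below are invented; in practice `predict` would be your actual fitted model):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Step 1 (assumed done): a fitted outcome model, here a made-up
# logistic model of Y on a binary treatment and one covariate x
def predict(treat, x):
    return sigmoid(-0.5 + 0.8 * treat + 1.2 * x)

treat = [1, 1, 0, 0, 0]
x = [0.2, -0.4, 0.9, 0.0, -1.1]
n = len(x)

# Steps 2-3 for the ATE: predict for *everyone* under each treatment
# level, then contrast the average predictions (here a risk difference)
p1 = sum(predict(1, xi) for xi in x) / n
p0 = sum(predict(0, xi) for xi in x) / n
ate = p1 - p0

# For the ATT, average the predictions over the treated units only
treated_x = [xi for ti, xi in zip(treat, x) if ti == 1]
att = (sum(predict(1, xi) for xi in treated_x)
       - sum(predict(0, xi) for xi in treated_x)) / len(treated_x)
print(ate, att)
```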
To get standard errors, you can use bootstrapping or the delta method (the latter of which is exactly accurate when the outcome model is linear and the contrast is the difference in means but only an approximation otherwise).
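A bare-bones sketch of the bootstrap approach on toy data (here `estimate` is just a difference in group means, standing in for the full g-computation pipeline, which would refit the model on each resample):

```python
import random

random.seed(0)

# Toy outcome and binary treatment indicator
y = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
t = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]

def estimate(idx):
    """Effect estimate on a resampled index set (difference in means)."""
    y1 = [y[i] for i in idx if t[i] == 1]
    y0 = [y[i] for i in idx if t[i] == 0]
    if not y1 or not y0:
        return None  # degenerate resample with an empty group: skip
    return sum(y1) / len(y1) - sum(y0) / len(y0)

# Resample units with replacement and recompute the estimate each time
boots = []
while len(boots) < 500:
    est = estimate([random.randrange(len(y)) for _ in range(len(y))])
    if est is not None:
        boots.append(est)

# The bootstrap SE is the standard deviation of the resampled estimates
mean_b = sum(boots) / len(boots)
se = (sum((b - mean_b) ** 2 for b in boots) / (len(boots) - 1)) ** 0.5
print(se)
```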
In R, this is really easy using the marginaleffects package (e.g., its avg_comparisons() function). This works for any GLM, e.g., logistic regression, Poisson regression, etc. To compute contrasts other than the difference in means/risk difference, just supply arguments to comparison and transform (e.g., to get the risk ratio/relative risk, you would set comparison = "lnratioavg", transform = "exp").

This quantity is related to an AME, though that term is a bit ambiguous because of the multiple meanings of the word "marginal". The word "marginal" in AME means the instantaneous rate of change when the predictor is changed by a tiny amount. For a binary predictor, we are not changing it by a tiny amount; we are going from 0 to 1 (or whatever values you have). So AME is not an accurate way to describe this contrast, though I often use it because it is very closely related in computation and concept to a true AME. Strictly, this is a "contrast between the average adjusted predictions", which is kind of a mouthful.
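A small numeric illustration of that terminology point (the logistic coefficients below are made up): for a nonlinear model, the derivative-based "marginal effect" at a covariate value is not the same number as the discrete 0-to-1 contrast in predicted probabilities.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up logistic model: logit P(Y=1) = -0.2 + 1.5 * x
b0, b1 = -0.2, 1.5

# "Marginal" in the AME sense: instantaneous rate of change at x = 0,
# i.e., the derivative of sigmoid, p * (1 - p), times the coefficient
p = sigmoid(b0)
ame_at_0 = p * (1 - p) * b1

# For a binary predictor we instead take a discrete contrast:
# the change in predicted probability going from x = 0 to x = 1
contrast = sigmoid(b0 + b1) - sigmoid(b0)
print(ame_at_0, contrast)  # related but not equal
```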