Problem
I am playing around with what I thought would be a simple, straightforward simulation: recovering a null treatment effect under confounding in a longitudinal setting. I am trying to recover this null effect using inverse probability weighting (IPW) and a marginal structural model (MSM), as described in Chapter 21.2 of Hernán and Robins (2020). However, I don't seem to be able to get an unbiased estimate, and I am at a loss as to why.
Setup
In the most basic form of my simulations, there are three types of variables at each time point t: confounder L, treatment A, and outcome Y. The following DAG represents the setup for two time steps $t \in \{0, 1\}$, with potentially further steps following:
Data from this DAG, for a slightly longer period of $t \in \{0, \dots, 5\}$, can be simulated, for example, using the R package simcausal:
library(tidyverse)
library(glue)
library(simcausal)
set.seed(42)
D <- DAG.empty() +
  # Confounder: L_0 ~ N(0, 0.5); L_t depends on L_{t-1}
  node("L", t = 0:5, distr = "rnorm",
       mean = {if (t == 0) {0} else {0.5 * L[t-1]}}, sd = 0.5) +
  # Treatment: probability depends on the current L only
  node("A", t = 0:5, distr = "rbern", prob = plogis(-1 + L[t] * 1.06)) +
  # Outcome: depends on the current L only -- the true effect of A is null
  node("Y", t = 0:5, distr = "rnorm", mean = 10 + 1.5 * L[t], sd = 1)
D <- set.DAG(D)
#> ...automatically assigning order attribute to some nodes...
#> node L_0, order:1
#> node A_0, order:2
#> node Y_0, order:3
#> node L_1, order:4
#> ...
#> node Y_4, order:15
#> node L_5, order:16
#> node A_5, order:17
#> node Y_5, order:18
dat <- as_tibble(sim(D,n=1000))
#> simulating observed dataset from the DAG object
dat
#> # A tibble: 1,000 x 19
#> ID L_0 A_0 Y_0 L_1 A_1 Y_1 L_2 A_2 Y_2 L_3
#> <int> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 -0.0203 0 10.0 -0.237 0 8.24 -0.953 0 7.52 -0.0643
#> 2 2 -0.776 0 7.28 0.0400 0 10.9 -0.0523 0 12.8 0.318
#> 3 3 0.584 1 10.7 0.175 1 11.0 -0.455 1 9.78 0.440
#> 4 4 -0.137 0 9.38 -0.558 0 8.26 -0.902 0 9.04 -0.855
#> 5 5 -0.234 0 10.1 0.252 1 9.38 0.236 0 8.81 0.0685
#> 6 6 -0.619 0 8.54 0.911 1 11.7 0.0840 0 10.8 -0.357
#> 7 7 -0.00388 1 10.5 -0.498 0 8.17 0.00416 0 10.9 -0.857
#> 8 8 -0.400 1 9.94 -0.420 0 9.54 0.0131 0 10.0 -0.149
#> 9 9 -0.267 1 9.32 -0.718 0 9.56 -0.174 0 7.95 0.154
#> 10 10 0.644 1 10.9 1.12 0 11.4 0.625 0 12.1 0.512
#> # ... with 990 more rows, and 8 more variables: A_3 <int>, Y_3 <dbl>,
#> # L_4 <dbl>, A_4 <int>, Y_4 <dbl>, L_5 <dbl>, A_5 <int>, Y_5 <dbl>
Single time step
From this data, I am able to recover the true null effect for a single time step. Limiting the analysis to a single time point, we know from the DAG that $Y^{a} \perp A \mid L$. Therefore, conditioning on $L$ should be sufficient to control for the confounding. Following the description in Chapter 12.4 of Hernán and Robins (2020), I can use logistic regression to obtain nonstabilised IPWs $W^{A} = 1 / f(A|L)$ and fit the MSM $\mathbb{E}[Y^{a}]=\beta_0 + \beta_1 a$ using a linear regression weighted by $W^{A}$. Doing this for $t=5$ gives the correct answer:
# Propensity score model for A_5 given L_5
ps_mod <- glm(A_5 ~ L_5, data = dat, family = binomial)
ps <- predict(ps_mod, type = "response")
# Nonstabilised weight: 1 / P(A_5 = observed a | L_5)
ipw <- 1 / if_else(dat$A_5 == 0, 1 - ps, ps)
# Unweighted (confounded) regression for comparison
lm(Y_5 ~ A_5, data = dat) %>% summary()
#>
#> Call:
#> lm(formula = Y_5 ~ A_5, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.8492 -0.8697 -0.0358 0.9217 3.8088
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.86092 0.04907 200.953 < 2e-16 ***
#> A_5 0.50499 0.09097 5.551 3.63e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.307 on 998 degrees of freedom
#> Multiple R-squared: 0.02996, Adjusted R-squared: 0.02898
#> F-statistic: 30.82 on 1 and 998 DF, p-value: 3.631e-08
# IPW-weighted regression: the coefficient on A_5 is now close to zero
lm(Y_5 ~ A_5, data = dat, weights = ipw) %>% summary()
#>
#> Call:
#> lm(formula = Y_5 ~ A_5, data = dat, weights = ipw)
#>
#> Weighted Residuals:
#> Min 1Q Median 3Q Max
#> -6.7440 -1.1592 -0.0744 1.1355 5.2034
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10.01113 0.05736 174.522 <2e-16 ***
#> A_5 0.07162 0.08120 0.882 0.378
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.813 on 998 degrees of freedom
#> Multiple R-squared: 0.0007788, Adjusted R-squared: -0.0002224
#> F-statistic: 0.7779 on 1 and 998 DF, p-value: 0.378
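As a sanity check (not part of the IPW workflow), direct outcome regression adjusting for $L_5$ should also recover the null here, since $L_5$ is the only confounder at a single time point:
# Sanity check: adjusting for L_5 directly should likewise give a
# near-zero coefficient on A_5 under this DAG
lm(Y_5 ~ A_5 + L_5, data = dat) %>% summary()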
All time steps
Chapter 21.2 of Hernán and Robins (2020) explains how the above can be extended to multiple time steps. The weight for $A_t$ is now calculated at each step conditional on the treatment and covariate histories $\bar A_{t-1}$ and $\bar L_{t}$. The nonstabilised weights for the entire treatment sequence $\bar A$ are then obtained through multiplication as:
$$
W^{\bar A} = \prod_{t=0}^T \frac{1}{f(A_t|\bar A_{t-1}, \bar L_{t})}
$$
In this example, $f(A_t|\bar A_{t-1}, \bar L_{t}) = f(A_t|L_t)$, which I estimate again using logistic regression. I should then be able to estimate the MSM $\mathbb{E}[Y^{\bar a}]=\beta_0 + \beta_1 \, cum(\bar a)$, where $cum(\bar a) = \sum_{t=0}^{5} a_t$, with an appropriately weighted linear regression. Note that I hypothesised the treatment effect to depend on the cumulative number of treatments received. Unfortunately, this strategy does not seem to work:
# Cumulative product of per-time-point nonstabilised weights
ipw <- rep(1, 1000)
for (i in 0:5) {
  form <- as.formula(glue("A_{i} ~ L_{i}"))
  ps_mod <- glm(form, data = dat, family = binomial)
  ps <- predict(ps_mod, type = "response")
  ipw <- ipw * 1 / if_else(dat[[glue("A_{i}")]] == 0, 1 - ps, ps)
}
summary(ipw)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2.409 12.118 27.875 68.280 63.604 4585.662
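The nonstabilised weights are extremely variable (max ≈ 4586). Just to see whether the heavy tail alone is the issue, one can truncate the weights at extreme percentiles; the 1st/99th cutoffs below are an arbitrary illustrative choice:
# Illustrative truncation of the weights at the 1st/99th percentiles;
# the cutoffs are arbitrary and only meant to tame the extreme tail
cuts <- quantile(ipw, c(0.01, 0.99))
ipw_trunc <- pmin(pmax(ipw, cuts[1]), cuts[2])
summary(ipw_trunc)
Refitting the weighted model with ipw_trunc (not shown) is a quick way to see whether the extreme weights alone drive the estimate.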
# Unweighted (confounded) MSM fit
lm(Y_5 ~ I(A_5 + A_4 + A_3 + A_2 + A_1 + A_0), data = dat) %>% summary()
#>
#> Call:
#> lm(formula = Y_5 ~ I(A_5 + A_4 + A_3 + A_2 + A_1 + A_0), data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.9975 -0.8653 -0.0425 0.8785 4.1043
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.66836 0.07233 133.674 < 2e-16 ***
#> I(A_5 + A_4 + A_3 + A_2 + A_1 + A_0) 0.20102 0.03516 5.717 1.43e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.305 on 998 degrees of freedom
#> Multiple R-squared: 0.03171, Adjusted R-squared: 0.03074
#> F-statistic: 32.68 on 1 and 998 DF, p-value: 1.434e-08
# IPW-weighted MSM fit -- the confounded effect persists
lm(Y_5 ~ I(A_5 + A_4 + A_3 + A_2 + A_1 + A_0), data = dat, weights = ipw) %>% summary()
#>
#> Call:
#> lm(formula = Y_5 ~ I(A_5 + A_4 + A_3 + A_2 + A_1 + A_0), data = dat,
#> weights = ipw)
#>
#> Weighted Residuals:
#> Min 1Q Median 3Q Max
#> -62.204 -4.413 -0.365 4.310 81.938
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.64302 0.11080 87.029 < 2e-16 ***
#> I(A_5 + A_4 + A_3 + A_2 + A_1 + A_0) 0.23199 0.03438 6.748 2.54e-11 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 10.15 on 998 degrees of freedom
#> Multiple R-squared: 0.04363, Adjusted R-squared: 0.04267
#> F-statistic: 45.53 on 1 and 998 DF, p-value: 2.544e-11
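For reference, the true null can be verified directly from the structural equations: neither $L$ nor $Y$ depends on $A$, so the counterfactual outcome has the same distribution under every treatment regime. A minimal base-R sketch of mine, mirroring the simcausal definitions above:
# Ground-truth check: A enters neither the L- nor the Y-equation, so the
# counterfactual Y_5 is identical in distribution under any regime
simulate_Y5 <- function(n) {
  L <- rnorm(n, mean = 0, sd = 0.5)                       # L_0
  for (t in 1:5) L <- rnorm(n, mean = 0.5 * L, sd = 0.5)  # L_1, ..., L_5
  rnorm(n, mean = 10 + 1.5 * L, sd = 1)                   # Y_5: no A anywhere
}
mean(simulate_Y5(1e5))  # E[Y_5^a] is the same for all regimes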
Question
I would be extremely grateful for any pointers as to why the confounded effect persists in the last model above. I've been racking my brain for the last 3 days, but I can't figure out where I am going/thinking wrong!
I've tried various simulations (finally ending up at the simple version presented here). I've also calculated stabilised IPWs and included each treatment individually in the MSM, to the same effect. I've also calculated IPWs using the functions in the ipw package, always ending up with essentially the same result.
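A minimal sketch of such stabilised weights, assuming an intercept-only numerator model (one standard choice; not necessarily identical in detail to what I ran):
# Sketch: stabilised weights SW = prod_t f(A_t) / f(A_t | L_t), with an
# intercept-only numerator model
sw <- rep(1, 1000)
for (i in 0:5) {
  num_mod <- glm(as.formula(glue("A_{i} ~ 1")), data = dat, family = binomial)
  den_mod <- glm(as.formula(glue("A_{i} ~ L_{i}")), data = dat, family = binomial)
  p_num <- predict(num_mod, type = "response")
  p_den <- predict(den_mod, type = "response")
  a <- dat[[glue("A_{i}")]]
  sw <- sw * if_else(a == 1, p_num / p_den, (1 - p_num) / (1 - p_den))
}
summary(sw)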
Best Answer
One potential thought is that you are not estimating your standard errors correctly (you need to use robust standard errors). It may be that correcting the standard errors yields confidence intervals that contain the true (null) value. You can also try stabilizing the weights, which can improve their performance.
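For example, with the sandwich and lmtest packages (a sketch; the choice of HC flavour is a matter of taste):
# Sketch: robust (sandwich) standard errors for the weighted MSM fit
library(sandwich)
library(lmtest)
msm <- lm(Y_5 ~ I(A_5 + A_4 + A_3 + A_2 + A_1 + A_0), data = dat, weights = ipw)
coeftest(msm, vcov = vcovHC(msm, type = "HC1"))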
Another thought is that the PS model at each time point should include all past covariates and treatments as predictors, not just the covariate at the current time point. That is, the model for A_1 should be A_1 ~ L_1 + A_0 + L_0 + Y_0. I agree it shouldn't be necessary given the DAG, but I was thinking there could be something I missed, and it would otherwise be standard practice.
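Concretely, the weight loop could be adapted along these lines (a sketch; $t = 0$ keeps the original model since there is no history yet):
# Sketch: denominator models conditioning on the full measured history
ipw2 <- rep(1, 1000)
for (i in 0:5) {
  past <- if (i == 0) character(0) else paste0(c("A_", "L_", "Y_"), rep(0:(i-1), each = 3))
  rhs  <- paste(c(glue("L_{i}"), past), collapse = " + ")
  ps_mod <- glm(as.formula(glue("A_{i} ~ {rhs}")), data = dat, family = binomial)
  ps <- predict(ps_mod, type = "response")
  ipw2 <- ipw2 * 1 / if_else(dat[[glue("A_{i}")]] == 0, 1 - ps, ps)
}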