Including confounders in a model

confounding, model selection, regression

Suppose that you are performing a linear regression examining the main effect of $x_1$ and want to adjust for possible confounders $x_2, x_3, x_4$. Is it better to fit just two models, one unadjusted and one adjusted for all potential confounders? Or should you also consider models adjusted for only some of the confounders (e.g. $x_2$ alone, $x_2$ and $x_3$, etc.)?

Best Answer

I assume you're trying to estimate the causal effect of $x_1$ on $y$, rather than just trying to predict $y$. In general, to find a correct conditioning set (if one exists), you need to know the causal relationships among the variables. Here are some examples to illustrate why:

  1. You must be sure that the variables you're controlling for are actually confounders. If any one of them is not a confounder but instead a common effect, then you must not control for it. For example, say that $x_1$ and $y$ both influence $x_2$; then controlling for $x_2$ will induce an association between $x_1$ and $y$. This association will bias your estimate of the true effect of $x_1$ on $y$.

X2 is a common effect of X1 and Y

  2. If $x_2$ mediates the relationship between $x_1$ and $y$ – that is, if $x_1$ influences $x_2$, and then $x_2$ influences $y$ – then conditioning on $x_2$ will remove this indirect effect of $x_1$ on $y$ from your estimate. If you are interested in only the direct effect of $x_1$ on $y$, then this is the right approach, but if you want to estimate the total effect, then you should not control for $x_2$.

X2 mediates the effect of X1 on Y

  3. In some cases there is no set of observed variables we can condition on to get an unbiased estimate of the effect. Here is an example: an "M-structure" in which the middle variable $X_2$ also influences $X_1$ and $Y$.

U1 influences X1 and X2; U2 influences X2 and Y; X2 influences X1 and Y

In this case, the true effect of $X_1$ on $Y$ is zero. However, there are two unobserved confounders, $U_1$ and $U_2$, and one observed confounder, $X_2$. If we had observed $U_1$ and $U_2$, we could condition on all three confounders and get an unbiased estimate. Since we only observed $X_2$, we are stuck. If we don't control for $X_2$, it will confound our estimate of the effect of $X_1$ on $Y$. But if we do control for it, then because $X_2$ is also a common effect of $U_1$ and $U_2$, we induce an association between $U_1$ and $U_2$, and therefore between $X_1$ and $Y$.
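The three structures above can be checked with a small simulation. The sketch below is only an illustration: the linear-Gaussian data-generating process, the specific coefficients, and the helper `ols_coef` are assumptions made for this example and are not part of the original answer. For each structure it compares the estimated coefficient on $x_1$ without and with adjustment for $x_2$.

```python
import numpy as np


def ols_coef(y, *covariates):
    """OLS coefficient on the first covariate (an intercept is always included)."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(beta[1])


rng = np.random.default_rng(0)
n = 100_000  # large sample so estimates sit close to their population values


def noise():
    """Independent standard normal noise (an assumption of this sketch)."""
    return rng.normal(size=n)


results = {}

# Case 1: x2 is a common effect of x1 and y. True effect of x1 on y is 0.
x1, y = noise(), noise()
x2 = x1 + y + noise()
results["common effect"] = (ols_coef(y, x1), ols_coef(y, x1, x2))

# Case 2: x2 mediates part of the effect. Direct effect 0.5, total effect 0.5 + 0.8 * 0.6.
x1 = noise()
x2 = 0.8 * x1 + noise()
y = 0.5 * x1 + 0.6 * x2 + noise()
results["mediator"] = (ols_coef(y, x1), ols_coef(y, x1, x2))

# Case 3: u1 -> x1, x2; u2 -> x2, y; x2 -> x1, y. True effect of x1 on y is 0.
u1, u2 = noise(), noise()
x2 = u1 + u2 + noise()
x1 = u1 + x2 + noise()
y = u2 + x2 + noise()
results["unobserved u1, u2"] = (ols_coef(y, x1), ols_coef(y, x1, x2))

# Only possible if u1 and u2 had been observed: adjust for all three confounders.
oracle = ols_coef(y, x1, x2, u1, u2)

for name, (unadjusted, adjusted) in results.items():
    print(f"{name:>18}: unadjusted={unadjusted:+.3f}  adjusted for x2={adjusted:+.3f}")
print(f"{'case 3 oracle':>18}: adjusted for x2, u1, u2={oracle:+.3f}")
```

Under these assumed coefficients, adjusting for the common effect moves the estimate from roughly 0 to roughly -0.5, and adjusting for the mediator returns the direct effect (about 0.5) rather than the total effect (about 0.98). In the third case both feasible models are biased (about 0.71 unadjusted, about -0.2 adjusted for $x_2$), and only the infeasible model that also includes $U_1$ and $U_2$ recovers an estimate near zero.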

There are many cases in which you should not control for a particular covariate. If you don't know the causal structure, you may accidentally bias your estimate by controlling for the wrong set of covariates. In this case you can apply a causal structure learning algorithm as a first step, before you try to estimate the causal effect of $x_1$ on $y$.
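Constraint-based structure learning methods (the PC algorithm is one well-known example) are built on conditional independence tests. As a rough sketch of that single building block, and not a full structure learning algorithm, the code below tests independence with a partial correlation and a Fisher z-transform, which assumes linear relationships and Gaussian noise. The function names `partial_corr` and `fisher_z_pvalue` and the simulated data are hypothetical, chosen only for this illustration.

```python
import numpy as np
from math import erf, log, sqrt


def partial_corr(a, b, cond=()):
    """Correlation between a and b after linearly regressing out the variables in cond."""
    Z = np.column_stack([np.ones(len(a)), *cond])
    res_a = a - Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
    res_b = b - Z @ np.linalg.lstsq(Z, b, rcond=None)[0]
    return float(np.corrcoef(res_a, res_b)[0, 1])


def fisher_z_pvalue(r, n, k):
    """Two-sided p-value for H0: partial correlation is 0, given k conditioning variables."""
    z = 0.5 * log((1 + r) / (1 - r)) * sqrt(n - k - 3)
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - normal_cdf)


# Simulated data (an assumption): x2 is a common effect of independent x1 and y.
rng = np.random.default_rng(1)
n = 20_000
x1 = rng.normal(size=n)
y = rng.normal(size=n)
x2 = x1 + y + rng.normal(size=n)

r_marginal = partial_corr(x1, y)
r_given_x2 = partial_corr(x1, y, cond=(x2,))
p_marginal = fisher_z_pvalue(r_marginal, n, k=0)
p_given_x2 = fisher_z_pvalue(r_given_x2, n, k=1)

print(f"x1 vs y:            r={r_marginal:+.3f}, p={p_marginal:.3g}")
print(f"x1 vs y given x2:   r={r_given_x2:+.3f}, p={p_given_x2:.3g}")
```

A pattern like this one, where $x_1$ and $y$ look independent marginally but become dependent once $x_2$ is conditioned on, is the signature such algorithms use to identify $x_2$ as a common effect rather than a confounder.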
