The interpretation of the coefficient of a covariate (control variable) in a multiple linear regression

multiple regression, regression coefficients

I was reading Rubin's Causal Inference and Angrist and Pischke's Mostly Harmless Econometrics. Both are great textbooks. While reading, I came up with the following question: what is the correct interpretation of the coefficient of a covariate (rather than of the treatment variable) in a multiple linear regression model?

Let's look at a simple randomized experiment. Let $D_i$ be the treatment variable of subject $i$, where $D_i=1$ means treatment and $D_i=0$ means control. Let $X_i$ be a covariate, e.g., age. Let $Y_i(1)$ be the value of the response variable if subject $i$ is in the treatment group, i.e., $D_i=1$; $Y_i(0)$ is defined similarly. Then we write $Y^{obs}_i=D_iY_i(1)+(1-D_i)Y_i(0)$, which is the observed value of the response variable for subject $i$. So we are working under the Rubin potential outcomes framework.

As the data come from a simple randomized experiment, one could directly run a linear regression without the covariate:
\begin{equation}
Y^{obs}_i=\alpha+\tau D_i+\epsilon_i.
\end{equation}
One can show that $\tau=E[Y^{obs}_i|D_i=1]-E[Y^{obs}_i|D_i=0]$. Note that this is always true, even if the data do not come from a randomized experiment. So $\tau$ means: if you could draw infinitely many samples from the joint distribution of $(D_i,Y^{obs}_i)$, compute the average observed response for the group with $D=1$ and for the group with $D=0$, and take the difference, you would get $\tau$. If the data do come from a randomized experiment, then we further have $\tau=E[Y_i(1)-Y_i(0)]$, which is the average treatment effect. Whether or not the data come from a randomized experiment, the OLS estimator $\hat{\tau}$ is unbiased and consistent for $\tau$.
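As a quick numerical check of the first claim, here is a minimal sketch on simulated data. The data-generating process, the seed, and all variable names are made up for illustration. The treatment is deliberately not randomized, to show that the identity between the OLS slope and the difference in observed group means still holds, even though neither equals the causal effect in that case.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: treatment is NOT randomized, it depends on an unobserved u.
u = rng.normal(size=n)
d = (u + rng.normal(size=n) > 0).astype(float)

# Outcome with a causal treatment effect of 2.0 (an assumed value) plus confounding via u.
y = 1.0 + 2.0 * d + 1.5 * u + rng.normal(size=n)

# OLS of y on a constant and d.
design = np.column_stack([np.ones(n), d])
alpha_hat, tau_hat = np.linalg.lstsq(design, y, rcond=None)[0]

# Difference in observed group means.
diff_in_means = y[d == 1].mean() - y[d == 0].mean()

# The two numbers agree up to floating-point error; both are far from 2.0
# because treatment assignment is confounded by u.
print(tau_hat, diff_in_means)
```

With a randomized `d` (for example `rng.integers(0, 2, size=n)`), the same two quantities would still coincide and would then also estimate the average treatment effect.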

Now, we know that we could include the covariate $X_i$ in our regression and get
\begin{equation}
Y^{obs}_i=\alpha'+\tau D_i+\beta X_i+\epsilon_i.
\end{equation}
Page 122 of Rubin's Causal Inference then says "…irrespective of whether the regression function is truly linear in the covariates in the population, the OLS $\hat{\tau}$ is consistent for $\tau$". I understand that. But the book never says anything about $\beta$, the coefficient of the covariate $X_i$. In practice, if someone tells you $\beta$ is negative and significant, what exactly does that mean? For example, suppose $Y$ is the wage and $D$ indicates whether the subject attends a specific training program. If $\beta$ is negative and significant, can we say there is discrimination based on age? What is the correct meaning of $\beta$? And if the data do not come from a randomized experiment, does the interpretation of $\beta$ change, just as the interpretation of $\tau$ does?

Best Answer

In the statistical sense of this (regression) model, there is no difference between treatment $D_i$ and covariate $X_i$. Aside from the type of variable (continuous/categorical), they are both predictors/independent variables (this would also apply if treatment $D_i$ were continuous, or covariate $X_i$ categorical). Moreover, 'statistically' speaking, everything you can infer from $\tau$ applies to $\beta$ as well.
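To make the symmetry concrete, here is a sketch using the standard partialling-out (Frisch-Waugh-Lovell) result; the notation $\tilde{X}_i$ is introduced here for illustration and does not appear in the question. The population coefficient on $X_i$ in the second regression is
\begin{equation}
\beta=\frac{\mathrm{Cov}(\tilde{X}_i,Y^{obs}_i)}{\mathrm{Var}(\tilde{X}_i)},\qquad \tilde{X}_i=X_i-E[X_i|D_i].
\end{equation}
In a randomized experiment $D_i$ is independent of $X_i$, so $\tilde{X}_i=X_i-E[X_i]$ and $\beta=\mathrm{Cov}(X_i,Y^{obs}_i)/\mathrm{Var}(X_i)$. In other words, $\beta$ is a descriptive linear association between the covariate and the observed outcome, in the same way that $\tau$ without randomization is a descriptive difference in observed means.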

Now comes the less statistical and more methodological part: under the theory or hypothesis you are studying, these variables are not equal. One may be of particular interest. Especially when trying to make causal inferences, you want to obtain an 'as pure as possible' estimate of its effect on the outcome of interest and (if you are a frequentist) its significance. That is why you correct/adjust for the effects of other variables (often called confounders; this is correcting for confounding bias). The model is therefore built around correcting for the variables that can confound the association of interest. If this is done correctly, you might get a good, (approximately) unbiased estimate of the 'true' effect of treatment $D_i$ on the outcome. However, you have only selected confounders of treatment $D_i$'s effect on the outcome. You might have omitted some confounders of covariate $X_i$ from the model, because you did not expect them to influence the association between $D_i$ and the outcome.

Because of this, causal inferences based on covariate $X_i$'s $\beta$ are not completely corrected for confounding (i.e., they may still be biased by it).

If in your example the effect of the training program is confounded only by age (because we have some theory about this), causal inference for $D_i$'s effect on wages becomes possible. For age, however, treatment $D_i$ might not be the only confounder; ergo*, the estimate $\beta$ might not be 'pure' and would not be an unbiased estimate of the effect of covariate $X_i$ (age) on wages.

*(always wanted to use that word)
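As a rough illustration of the point about age, here is a minimal simulation sketch. Everything in it is hypothetical: the unobserved variable `u`, the effect sizes, and the variable names are assumptions chosen for illustration, not taken from the question. Treatment is randomized, so $\hat{\tau}$ recovers the assumed treatment effect, but age is correlated with an omitted variable that also affects wages, so $\hat{\beta}$ does not recover the assumed causal effect of age.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical unobserved factor that is correlated with age and lowers wages.
u = rng.normal(size=n)
age = 40 + 5 * u + 5 * rng.normal(size=n)

# Randomized treatment: independent of age and of u.
d = rng.integers(0, 2, size=n).astype(float)

TRUE_TAU = 2.0          # assumed causal effect of the training program
TRUE_AGE_EFFECT = 0.1   # assumed causal effect of one year of age
wage = 10 + TRUE_TAU * d + TRUE_AGE_EFFECT * age - 3.0 * u + rng.normal(size=n)

# OLS of wage on a constant, treatment, and age (u is omitted).
design = np.column_stack([np.ones(n), d, age])
_, tau_hat, beta_hat = np.linalg.lstsq(design, wage, rcond=None)[0]

# In the population, the coefficient on age is
#   0.1 + (-3.0) * Cov(age, u) / Var(age) = 0.1 - 3.0 * 5 / 50 = -0.2,
# so beta_hat is negative although the assumed causal effect of age is positive.
print(tau_hat, beta_hat)
```

In this setup a negative and highly significant $\hat{\beta}$ says nothing about discrimination against age; it only reflects that the confounders of age were never adjusted for, while $\hat{\tau}$ remains a valid estimate thanks to randomization.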