Solved – adding control variables to the regression in an experimental setting

Tags: controlling-for-a-variable, experiment-design, linear-model, multiple-regression, regression

After conducting an experiment, for the statistical analysis we usually run a linear regression $Y_i=\alpha+\beta X_i+\epsilon_i$, where $X_i$ is the treatment indicator (subjects are randomly assigned to the treatment or control group). Often the researcher also observes another attribute of the subjects, say gender, and may incorporate this variable into the regression as well, i.e., $Y_i=\alpha'+\beta' X_i+\gamma' Z_i+\epsilon'_i$, where $Z_i$ is the attribute of subject $i$ (e.g., gender).

My questions are:

(1) If the randomization procedure is effective, then the distribution of the attribute $Z$ should be roughly the same in the two groups, right? If so, what do we gain from the statistical analysis by incorporating the attribute variable $Z$ into the regression?

(2) Before adding $Z$, we can show that $E[\epsilon_i|X_i]=0$ and $\beta=E[Y_i|X_i=1]-E[Y_i|X_i=0]$, and by randomization we know that $E[Y_i|X_i=1]-E[Y_i|X_i=0]=E[Y_{1i}-Y_{0i}]$, where $Y_{1i}$ is subject $i$'s potential outcome under treatment and $Y_{0i}$ is the potential outcome under control. Now, if we add $Z$ to the regression, can we prove that in the new regression $E[\epsilon'_i|X_i,Z_i]=0$? Also, will $\beta'=\beta$?
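(For reference, here is the standard argument for the regression without $Z$, which is the part I already know how to prove: randomization gives $(Y_{0i},Y_{1i})\perp\!\!\perp X_i$, so
$$ E[Y_i\mid X_i=1]-E[Y_i\mid X_i=0]=E[Y_{1i}\mid X_i=1]-E[Y_{0i}\mid X_i=0]=E[Y_{1i}]-E[Y_{0i}]=E[Y_{1i}-Y_{0i}], $$
where the second equality uses the independence.)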

I couldn't understand this at an intuitive level, so I made the following attempt to work through the problem. Here is what I did:

Let $Z$ also be a dummy variable taking values 0 or 1, so there are 4 combinations of $(X,Z)$. I generate 4 populations in R:

Y11 = rnorm(N, m11, s11);  Y10 = rnorm(N, m10, s10)

Y01 = rnorm(N, m01, s01);  Y00 = rnorm(N, m00, s00)

where m11, m10, m01, m00 and s11, s10, s01, s00 are different constants. Then I compute the means of these 4 populations:

M11 = mean(Y11); M10 = mean(Y10); M01 = mean(Y01); M00 = mean(Y00)
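(For the code above to actually run, N and the means/standard deviations need concrete values; the exact numbers do not matter, so one can take, for example:)

N = 10000                              # population size (any reasonably large number)
m11 = 5; m10 = 3; m01 = 2; m00 = 1     # illustrative group means
s11 = 1; s10 = 2; s01 = 1.5; s00 = 1   # illustrative group standard deviations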

Then I generate a random variable $Z$ (say, the gender of each individual; I know it is not a great example) such that $Z$ is explicitly correlated with some of the $Y_{ab}$'s. Specifically, I did

Z = (Y11-M11>0)&(Y01-M01>0)

Then I create two mixture distributions:

Mix1 = ifelse(Z>0,Y11,Y10)

Mix2 = ifelse(Z>0,Y01,Y00)

One can think of the distribution of Mix1 as follows: you have a population of men and women (for whatever reason some individuals are men and others are women); you force all of them to take treatment A and observe the distribution of the response variable, which gives you the distribution of Mix1. For Mix2, imagine making a copy of that same population, forcing everyone to take treatment B instead, and observing the resulting distribution of the response variable, which gives you the distribution of Mix2.

Then I generate independent Bernoulli random variables, one per individual, to represent the randomization procedure used in the experiment:

D = rbinom(N,1,0.4) # each individual is assigned to treatment A with probability 0.4

Then I created the "observed" data set from the experiment:

data = ifelse(D>0,Mix1,Mix2)

Now I run two regressions, one with $Z$ and one without:

Res1 = lm(data~D+Z)

Res2 = lm(data~D)

The results are that $\hat{\beta}\approx\hat{\beta}'\approx$ mean(Mix1-Mix2), so I would claim that $\beta=\beta'=$ mean(Mix1-Mix2). I also found that $\hat{\beta}'$, the estimate from the regression with $Z$ included, is closer to the truth, mean(Mix1-Mix2), and has a smaller standard error than $\hat{\beta}$.

So this R exercise seems to suggest that, as long as the treatment variable is randomized, you can add whatever attribute variables you observe on the subjects to the regression, and this will help you estimate the true counterfactual mean treatment effect better than a simple linear regression with the treatment variable only. (Of course, you cannot add an interaction term, since in that case the meaning of the coefficient on $X$ changes.)

At an intuitive level, why is this true? And can someone provide a mathematical proof? I know how to prove it when $Z$ is not in the regression.

Best Answer

If randomization was successful, then the potential outcomes are independent of the treatment not only unconditionally but also conditionally on covariates. That is, $$ (Y_{i0}, Y_{i1}, Z_i)\perp\!\!\perp X_i \quad \text{which implies} \quad (Y_{i0}, Y_{i1}) \perp\!\!\perp X_i \mid Z_i. $$ For more information on this, have a look at the lecture notes by Clément de Chaisemartin, section 5.2 on page 79.
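(To spell out the implication, which is just the weak-union property of independence: if $X_i$ is independent of the joint vector $(Y_{i0}, Y_{i1}, Z_i)$, then
$$ p(y_0, y_1 \mid x, z) = \frac{p(y_0, y_1, z \mid x)}{p(z \mid x)} = \frac{p(y_0, y_1, z)}{p(z)} = p(y_0, y_1 \mid z), $$
so $(Y_{i0}, Y_{i1}) \perp\!\!\perp X_i \mid Z_i$.)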

Using the omitted variable bias formula (equivalently, the Frisch–Waugh theorem), the coefficient on $X_i$ in the short regression (without $Z_i$) is related to the coefficients of the long regression by $$ \beta = \beta' + \gamma' \frac{Cov(X_i,Z_i)}{Var(X_i)}. $$

Since randomization makes $X_i$ independent of $Z_i$, $Cov(X_i,Z_i) = 0$, and therefore the coefficient on the treatment is the same whether you include or exclude $Z_i$; in a finite sample the two estimates differ only because the sample covariance of $X_i$ and $Z_i$ is not exactly zero.
In your simulation exercise you are making one mistake: you should run the simulation repeatedly and see what happens to your $\widehat{\beta}$ and $\widehat{\beta}'$. Once you do this, say, a hundred times, you will find that both estimates are centered on the true population value.
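A minimal sketch of such a repeated-simulation check, reusing the question's setup with my own illustrative parameter values (not the original numbers):

set.seed(1)
N = 2000; B = 100                        # sample size per replication; number of replications
bhat_short = numeric(B); bhat_long = numeric(B); truth = numeric(B)
for (b in 1:B) {
  Y11 = rnorm(N, 5, 1);   Y10 = rnorm(N, 3, 2)          # illustrative means/SDs
  Y01 = rnorm(N, 2, 1.5); Y00 = rnorm(N, 1, 1)
  Z    = (Y11 - mean(Y11) > 0) & (Y01 - mean(Y01) > 0)  # Z correlated with Y11 and Y01
  Mix1 = ifelse(Z, Y11, Y10)             # response if everyone got treatment A
  Mix2 = ifelse(Z, Y01, Y00)             # response if everyone got treatment B
  D    = rbinom(N, 1, 0.4)               # randomized treatment assignment
  Yobs = ifelse(D > 0, Mix1, Mix2)       # observed outcomes
  truth[b]      = mean(Mix1 - Mix2)      # average treatment effect in this replication
  bhat_short[b] = coef(lm(Yobs ~ D))["D"]
  bhat_long[b]  = coef(lm(Yobs ~ D + Z))["D"]
}
c(truth = mean(truth), short = mean(bhat_short), long = mean(bhat_long))  # both averages should be near the truth
c(sd_short = sd(bhat_short), sd_long = sd(bhat_long))                     # the regression with Z has smaller spread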

However, in randomized experiments there is a big advantage to including other covariates: it reduces the standard errors and makes inference on the treatment effect more precise. This is because the additional covariates soak up part of the residual variance, which reduces the overall estimation error (that's the intuitive explanation).
The mathematical proof for this is provided in this answer.
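A rough sketch of that intuition, using the usual OLS standard errors and the approximation that $X_i$ and $Z_i$ are uncorrelated in the sample (which randomization makes nearly true): $$ \widehat{Var}(\widehat{\beta}) = \frac{\widehat{\sigma}^2_{\epsilon}}{\sum_i (X_i-\bar X)^2}, \qquad \widehat{Var}(\widehat{\beta}') \approx \frac{\widehat{\sigma}^2_{\epsilon'}}{\sum_i (X_i-\bar X)^2}, $$ and because $Z_i$ explains part of the variation in $Y_i$, the residual variance $\widehat{\sigma}^2_{\epsilon'}$ from the regression that includes $Z_i$ is smaller than $\widehat{\sigma}^2_{\epsilon}$, so the standard error of $\widehat{\beta}'$ is smaller.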
