Regression – How to Extract Dependence on a Single Variable When Independent Variables Are Correlated

correlation, regression

I have a dataset in which I have measured a dependent variable (let's call it $Y$) along with several independent variables $(X_1, X_2, X_3)$. The independent variables are correlated with one another to some extent. I would like to understand how $Y$ varies with $X_2$ when $X_1$ and $X_3$ are held constant. What approach will allow me to extract this relationship given the correlation between the independent variables?

I have looked into principal component analysis, but that casts the data in terms of linear combinations that include $X_2$, thus not separating the $X_2$ dependence.

An example dataset (CSV format).

Best Answer

Aksakal's answer is correct. By including all of the variables in one regression, you control for them ("keep them constant"), so the coefficient on your regressor of interest captures its partial association with $Y$. Let me give you an example to make this clearer.

First, let us create some correlated $X$s.

 ex <- rnorm(1000)
 x1 <- 5*ex + rnorm(1000)
 x2 <- -3*ex + rnorm(1000)
 x3 <- 4*ex + rnorm(1000)

Now, since all these variables are generated by some underlying variable $ex$, they are clearly correlated. You can check this using cor(x1,x2), for instance.
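To see all pairwise correlations at once, here is a minimal sketch that regenerates the same kind of data and prints the correlation matrix. The seed and the names `n` and `cor_mat` are my own additions for illustration. With these loadings the population correlation between x1 and x2 is $-15/\sqrt{26 \cdot 10} \approx -0.93$, so the sample values should be strongly negative or positive accordingly.

```r
set.seed(1)  # arbitrary seed (an assumption), only to make the sketch reproducible
n  <- 1000
ex <- rnorm(n)             # common underlying factor
x1 <-  5 * ex + rnorm(n)
x2 <- -3 * ex + rnorm(n)
x3 <-  4 * ex + rnorm(n)

# Pairwise correlation matrix of the three regressors
cor_mat <- cor(cbind(x1, x2, x3))
print(round(cor_mat, 2))
```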

Now, let us generate the dependent variable with known parameters.

 y <- 1*x1 + 2*x2 + 3*x3 + rnorm(1000)

Here we know that $\beta_1=1, \beta_2=2, \beta_3=3$. I have picked them arbitrarily. Let us now see if Aksakal's approach can uncover these parameters:

 lm(y ~ x1+x2+x3)

If it works, the estimated parameters should be close to the ones we have picked. Here is the result:

 Call:
 lm(formula = y ~ x1 + x2 + x3)

 Coefficients:
 (Intercept)           x1           x2           x3  
    -0.01224      0.99805      1.99746      2.99670  

As you can see, all three estimates are close to the true parameter values.
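One way to see what "holding $X_1$ and $X_3$ constant" means in practice is to partial them out by hand (the Frisch-Waugh-Lovell result): regress both $Y$ and $X_2$ on the other regressors, then regress the residuals on each other. The sketch below assumes the same simulated setup as above, with an arbitrary seed and variable names (`b_full`, `b_fwl`, `b_naive`) of my own choosing. It also fits the naive regression of y on x2 alone, whose population slope here is $-31/10 = -3.1$, to show what goes wrong when the correlated regressors are left out.

```r
set.seed(1)  # arbitrary seed (an assumption), only to make the sketch reproducible
n  <- 1000
ex <- rnorm(n)
x1 <-  5 * ex + rnorm(n)
x2 <- -3 * ex + rnorm(n)
x3 <-  4 * ex + rnorm(n)
y  <- 1 * x1 + 2 * x2 + 3 * x3 + rnorm(n)

# Coefficient on x2 from the full regression
b_full <- unname(coef(lm(y ~ x1 + x2 + x3))["x2"])

# Partial out x1 and x3 from both y and x2, then regress residual on residual
ry    <- resid(lm(y  ~ x1 + x3))
rx2   <- resid(lm(x2 ~ x1 + x3))
b_fwl <- unname(coef(lm(ry ~ rx2))["rx2"])

# Naive regression that ignores x1 and x3
b_naive <- unname(coef(lm(y ~ x2))["x2"])

print(c(full = b_full, partialled_out = b_fwl, naive = b_naive))
```

The first two numbers should agree up to floating-point error and sit near the true value of 2, while the naive slope should come out negative because x2 is picking up the effects of x1 and x3.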

Having said that, there are many caveats here. Most importantly, you should not automatically interpret these coefficients causally. It would help if you explained in more detail what you are trying to estimate, so that people can judge whether this method is appropriate (or whether your research question can be answered at all). For instance, why do you think your independent variables are correlated? Could $X_1$ affect $X_2$, which in turn affects $Y$? If that is the setup you have in mind, then depending on your field you may want to look into mediator/moderator analysis or quasi-experimental methods.
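To illustrate why the causal structure matters, here is a purely hypothetical simulation (not based on your data) in which $X_1$ affects $Y$ only through $X_2$. The coefficients 0.8 and 2, the seed, and the names `b_total` and `b_direct` are all assumptions made for the sketch. Regressing y on x1 alone recovers the total effect (population value $0.8 \times 2 = 1.6$), whereas controlling for x2 drives the x1 coefficient towards zero, even though x1 does influence y.

```r
set.seed(1)  # arbitrary seed (an assumption), only to make the sketch reproducible
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)  # hypothetical: x1 influences x2
y  <- 2 * x2 + rnorm(n)    # hypothetical: y depends on x1 only through x2

# Total association of x1 with y (x2 not controlled for)
b_total  <- unname(coef(lm(y ~ x1))["x1"])

# Association of x1 with y when x2 is held constant
b_direct <- unname(coef(lm(y ~ x1 + x2))["x1"])

print(c(total = b_total, holding_x2_constant = b_direct))
```

Neither number is wrong; they answer different questions, which is why it matters what you actually want to estimate.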
