Multiple Regression – Preparing Independent Variables for Multiple Linear Regression

I have 3 independent variables A, B, C and want to run a multiple linear regression to predict Y. After studying the correlations between:

1) A, Y
2) B, Y
3) A/B, Y
4) (A-B)/(A+B), Y

It turns out that 4) has a higher correlation with Y than all the other cases, by at least 0.10. Both 3) and 4) make sense to me because variables A and B are complementary: they represent the items bought in store A versus store B, and there are only 2 stores in this problem.

Now, in a simple linear regression I have little doubt that the higher the correlation of the single independent variable, the better the fit. But in a multiple linear regression, does it make sense to select the independent variables by looking at the formulas or ratios that show the highest correlations with Y? In this case, that would mean using 4) in the regression rather than 1), 2) or 3) because it has the highest correlation.

Best Answer

Although manipulating variables to maximize correlation has an intuitive appeal, it is not usually a good approach, for several reasons:

  1. In multiple regression the individual correlations can be low between independent and dependent variables, yet the least-correlated IVs can be the most important predictors of the DVs. See this thread for an example and this one for a theoretical discussion. This suggests that looking at the individual (bivariate) correlations can be useless or misleading.

  2. You can (accidentally or arbitrarily) make the correlation arbitrarily close to $\pm 1$ by means of a transformation of the dependent values that creates a single extreme outlier. In the following example $A$ (blue) and $B$ (red) are normally distributed (but always positive) and $C=A+B$ plus normally distributed error, except for a single outlying value at $(A,B,C)=(1, 1/16, 20)$. Despite this strong relationship between $C$ and the untransformed values of $A$ and $B$, the correlation of $A$ and $C$ is $-0.2$ (notice the sign is wrong!), the correlation of $B$ and $C$ is $-0.25$ (wrong sign again), and the correlation of $(A-B)/(A+B)$ and $C$ is $+0.45$ (much stronger than either of the other correlations).

    [Figure: A and B vs. C]

    [Figure: (A-B)/(A+B) vs. C]

  3. Typically, one re-expresses independent variables in order to establish a more linear relationship with the dependent variable. You can test that visually if you like, making sure to discount high-leverage outlying values that might appear with some re-expressions.
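
Points 1 and 2 can be illustrated with a small sketch. The numbers below are made up for illustration and are not the data behind the plots in the answer; only the outlier $(1, 1/16, 20)$ is borrowed from it. The helper names `pearson` and `ols_two_predictors` are hypothetical, written here with the standard library only. The first part builds a predictor `x2` that has zero correlation with `y` yet is essential in the two-variable regression; the second part shows a ratio whose correlation with `C` is driven almost entirely by one point.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ols_two_predictors(x1, x2, y):
    """Least-squares slopes (b1, b2) and R^2 for y ~ x1 + x2 with an intercept.

    Solves the 2x2 normal equations on mean-centered data.
    """
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * t for a, t in zip(c1, cy))
    s2y = sum(b * t for b, t in zip(c2, cy))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    ss_res = sum((t - b1 * a - b2 * b) ** 2 for a, b, t in zip(c1, c2, cy))
    ss_tot = sum(t * t for t in cy)
    return b1, b2, 1 - ss_res / ss_tot

# Point 1 (illustrative data): x1 is the signal contaminated by noise,
# x2 is the noise itself, and y is the clean signal.
signal = [-1, -1, 1, 1]
noise = [-1, 1, -1, 1]          # orthogonal to the signal by construction
x1 = [s + e for s, e in zip(signal, noise)]
x2 = noise
y = signal

r_x1_y = pearson(x1, y)         # moderate correlation
r_x2_y = pearson(x2, y)         # zero correlation with y
b1, b2, r2_full = ols_two_predictors(x1, x2, y)
print("corr(x1, y) =", r_x1_y, " corr(x2, y) =", r_x2_y)
print("slopes:", b1, b2, " R^2 with both predictors:", r2_full)

# Point 2 (illustrative data): C = A + B except for one outlying row,
# which is the outlier quoted in the answer, (A, B, C) = (1, 1/16, 20).
A = [4, 5, 6, 7, 1]
B = [4, 5, 6, 7, 1 / 16]
C = [8, 10, 12, 14, 20]
ratio = [a / b for a, b in zip(A, B)]   # equals 1 everywhere except the outlier

r_a_c = pearson(A, C)           # negative: the outlier flips the sign
r_ratio_c = pearson(ratio, C)   # large and positive, due to a single point
print("corr(A, C) =", r_a_c, " corr(A/B, C) =", r_ratio_c)
```

In the first part, `x2` would be discarded by a screen on bivariate correlations, yet `y = x1 - x2` holds exactly, so dropping it cuts the fit substantially. In the second part, the ratio looks like the "best" predictor only because one high-leverage point dominates the correlation.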
