Linear relationship between explanatory variables in multiple regression

Tags: multicollinearity, multiple regression

I was reading the multiple regression chapter of Data Analysis and Graphics Using R: An Example-Based Approach and was a bit confused to find that it recommends checking for linear relationships between explanatory variables (using a scatterplot matrix) and, where there aren't any, transforming the variables so that they become more nearly linearly related. Here are some excerpts:

6.3 A strategy for fitting multiple regression models

(…)

Examine the scatterplot matrix involving all the explanatory variables. (Including the dependent variable is, at this point, optional.) Look first for evidence of non-linearity in the plots of explanatory variables against each other.

(…)

This point identifies a model search strategy – seek models in which regression relationships between explanatory variables follow a "simple" linear form. Thus, if some pairwise plots show evidence of non-linearity, consider use of transformation(s) to give more nearly linear relationships. While it may not necessarily prove possible, following this strategy, to adequately model the regression relationship, this is a good strategy, for the reasons given below, to follow in starting the search.

(…)

If relationships between explanatory variables are approximately linear, perhaps after transformation, it is then possible to interpret plots of predictor variables against the response variable with confidence.

(…)

It may not be possible to find transformations of one or more of the explanatory variables that ensure the (pairwise) relationships shown in the panels appear linear. This can create problems both for the interpretation of the diagnostic plots for any fitted regression equation and for the interpretation of the coefficients in the fitted equation. See Cook and Weisberg (1999).
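In R, I take it the first step the book describes looks roughly like the following (the data frame and variable names are made up purely for illustration):

```r
# Sketch of the scatterplot-matrix check; data are hypothetical.
set.seed(1)
d <- data.frame(x1 = rexp(100), x2 = rlnorm(100), x3 = runif(100))
pairs(d)                           # look for curvature in the pairwise plots
pairs(transform(d, x2 = log(x2)))  # a log transform may straighten the panels
```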

Shouldn't I be worried about linear relationships between explanatory variables (because of the risk of multicollinearity) instead of actively pursuing them? What are the advantages of having approximately linearly related variables?

The authors do address the issue of multicollinearity later in the chapter, but these recommendations seem to be at odds with avoiding multicollinearity.

Best Answer

There are two points here:

  1. The passage recommends transforming independent variables (IVs) toward linearity only when there is evidence of nonlinearity. Nonlinear relationships among IVs can also cause collinearity and, more centrally, may complicate other relationships. I am not sure I agree with the advice in the book, but it's not silly.

  2. Certainly very strong linear relationships can cause collinearity, but high pairwise correlations are neither necessary nor sufficient for problematic collinearity. A good method of diagnosing collinearity is the condition index (a sketch follows below).
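A minimal sketch of that diagnostic in R, assuming a fitted lm() object named `fit` (the name is hypothetical):

```r
# Condition index of the standardized predictor matrix.
X  <- scale(model.matrix(fit)[, -1])  # predictors, intercept dropped, standardized
ev <- eigen(crossprod(X))$values      # eigenvalues of X'X
sqrt(max(ev) / min(ev))               # condition index; values above ~30 are
                                      # conventionally read as a warning sign
```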

EDIT in response to comment

Condition indexes are described briefly here as the square root of the ratio of the maximum to the minimum eigenvalue (of the predictors' cross-product matrix). There are quite a few posts here on CV that discuss them and their merits. The seminal texts on them are two books by David Belsley: Conditioning Diagnostics and Regression Diagnostics (which also has a newer 2005 edition).
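Base R's kappa() gives an equivalent quick check: it returns the 2-norm condition number of a matrix (the ratio of its largest to smallest singular value, i.e. the same quantity as above). Again, `fit` is a hypothetical lm object:

```r
# Condition number of the standardized predictor matrix via base R.
kappa(scale(model.matrix(fit)[, -1]), exact = TRUE)
```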
