Linear relationship between explanatory variables in multiple regression

Tags: multicollinearity, multiple regression

I was reading the multiple regression chapter of Data Analysis and Graphics Using R: An Example-Based Approach and was a bit confused to find that it recommends checking for linear relationships between explanatory variables (using a scatterplot matrix) and, where there aren't any, transforming the variables so that they become more nearly linearly related. Here are some excerpts:

6.3 A strategy for fitting multiple regression models

(…)

Examine the scatterplot matrix involving all the explanatory variables. (Including the dependent variable is, at this point, optional.) Look first for evidence of non-linearity in the plots of explanatory variables against each other.

(…)

This point identifies a model search strategy – seek models in which regression relationships between explanatory variables follow a "simple" linear form. Thus, if some pairwise plots show evidence of non-linearity, consider use of transformation(s) to give more nearly linear relationships. While it may not necessarily prove possible, following this strategy, to adequately model the regression relationship, this is a good strategy, for the reasons given below, to follow in starting the search.

(…)

If relationships between explanatory variables are approximately linear, perhaps after transformation, it is then possible to interpret plots of predictor variables against the response variable with confidence.

(…)

It may not be possible to find transformations of one or more of the explanatory variables that ensure the (pairwise) relationships shown in the panels appear linear. This can create problems both for the interpretation of the diagnostic plots for any fitted regression equation and for the interpretation of the coefficients in the fitted equation. See Cook and Weisberg (1999).
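In R, I take it the first step the book describes looks roughly like the following (the data frame and variable names are made up purely for illustration):

```r
# Sketch of the scatterplot-matrix check; data are hypothetical.
set.seed(1)
d <- data.frame(x1 = rexp(100), x2 = rlnorm(100), x3 = runif(100))
pairs(d)                           # look for curvature in the pairwise plots
pairs(transform(d, x2 = log(x2)))  # a log transform may straighten the panels
```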

Shouldn't I be worried about linear relationships between explanatory variables (because of the risk of multicollinearity) instead of actively pursuing them? What are the advantages of having approximately linearly related variables?

The authors do address the issue of multicollinearity later in the chapter, but these recommendations seem to be at odds with avoiding multicollinearity.

Best Answer

There are two points here:

  1. The passage recommends transforming independent variables (IVs) toward linearity only when there is evidence of nonlinearity. Nonlinear relationships among IVs can also cause collinearity and, more centrally, may complicate other relationships. I am not sure I agree with the advice in the book, but it's not silly.

  2. Certainly very strong linear relationships can cause collinearity, but high pairwise correlations are neither necessary nor sufficient for problematic collinearity. A good method of diagnosing collinearity is the condition index (a sketch follows below).
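A minimal sketch of that diagnostic in R, assuming a fitted lm() object named `fit` (the name is hypothetical):

```r
# Condition index of the standardized predictor matrix.
X  <- scale(model.matrix(fit)[, -1])  # predictors, intercept dropped, standardized
ev <- eigen(crossprod(X))$values      # eigenvalues of X'X
sqrt(max(ev) / min(ev))               # condition index; values above ~30 are
                                      # conventionally read as a warning sign
```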

EDIT in response to comment

Condition indexes are described briefly here as the square root of the ratio of the maximum to the minimum eigenvalue (of the predictors' cross-product matrix). There are quite a few posts here on CV that discuss them and their merits. The seminal texts on them are two books by David Belsley: Conditioning Diagnostics and Regression Diagnostics (which also has a newer 2005 edition).
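Base R's kappa() gives an equivalent quick check: it returns the 2-norm condition number of a matrix (the ratio of its largest to smallest singular value, i.e. the same quantity as above). Again, `fit` is a hypothetical lm object:

```r
# Condition number of the standardized predictor matrix via base R.
kappa(scale(model.matrix(fit)[, -1]), exact = TRUE)
```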
