Solved – Multicollinearity between ln(x) and ln(x)^2

logarithm, multicollinearity, variance-inflation-factor

I am running a negative binomial model and one of my predictor variables is a count variable. Since this variable was heavily skewed, I decided to log-transform it.

However, the effect of this variable is hypothesized to be non-linear, so I also include the squared term of the log-transformed variable. As soon as I do, the VIFs of these two variables exceed 20, while all other predictors remain stable with VIFs between 1 and 5.

To my current understanding, the relationship between these two terms is not linear, and hence multicollinearity should not arise.

Can anyone explain the cause of the multicollinearity and suggest possible solutions to this problem?

Best Answer

Except for very small counts, $\log(x)^2$ is essentially a linear function of $\log(x)$:

[Figure: $\log(x)^2$ plotted against $\log(x)$, with least squares lines fitted over several count ranges]

The colored lines are least squares fits to $\log(x)^2$ vs $\log(x)$ for various ranges of counts $x$. They are extremely good once $x$ exceeds $10$ (and still awfully good even when $x\gt 4$ or so).
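To put numbers on this, here is a quick Python sketch (my own illustration; the count ranges are chosen only to make the point) that correlates $\log(x)$ with $\log(x)^2$ over a few ranges. Since the pairwise VIF is $1/(1-r^2)$, even a moderate range of counts pushes the VIF well past 20:

```python
import numpy as np

# Numerical check of the claim above: correlate log(x) with log(x)^2 over
# several count ranges and report the implied pairwise VIF, 1 / (1 - r^2).
for lo, hi in [(1, 10), (4, 100), (10, 1000), (100, 10000)]:
    x = np.arange(lo, hi + 1, dtype=float)
    u = np.log(x)
    r = np.corrcoef(u, u ** 2)[0, 1]
    print(f"x in [{lo}, {hi}]: corr = {r:.4f}, VIF = {1 / (1 - r**2):.1f}")
```

The correlation is already high for counts below 10 and climbs toward 1 as the range widens, which is exactly the pattern the fitted lines in the figure show.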

Introducing the square of a variable is sometimes used to test goodness of fit, but (in my experience) it is rarely a good choice as an explanatory variable. To account for a nonlinear response, consider these options:

  • Study the nature of the nonlinearity. Select appropriate variables and/or transformation to capture it.

  • Keep the count itself in the model. There will still be collinearity for larger counts, so consider creating a pair of orthogonal variables from $x$ and $\log(x)$ in order to achieve a numerically stable fit (see the orthogonalization sketch after this list).

  • Use splines of $x$ (and/or $\log(x)$) to model the nonlinearity (see the spline sketch after this list).

  • Ignore the problem altogether. If you have enough data, a large VIF may be inconsequential. Unless your purpose is to obtain precise coefficient estimates (and your willingness to transform the variable suggests it is not), collinearity scarcely matters anyway.
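For the orthogonalization option, here is one way to do it in Python: a sketch with simulated Poisson counts (the data and names are purely illustrative), using a QR decomposition of the centered columns $x$ and $\log(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=50, size=500) + 1   # hypothetical count predictor (>= 1)

# The raw columns x and log(x) are highly correlated for larger counts.
X = np.column_stack([x, np.log(x)])
print("corr(x, log x):", np.corrcoef(X.T)[0, 1])

# Orthogonalize: Q's columns span the same space as [x, log(x)] but are
# uncorrelated with each other by construction, so the fit is numerically
# stable; the coefficients simply live on a rotated basis.
Xc = X - X.mean(axis=0)                 # center before orthogonalizing
Q, _ = np.linalg.qr(Xc)
print("corr(Q1, Q2):", np.corrcoef(Q.T)[0, 1])
```

The fitted values are identical to those from the raw columns; only the parameterization changes.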
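And for the spline option, a minimal sketch using patsy and statsmodels (one possible toolchain; the df=4 basis, the simulated counts, and the fake response are illustrative only, not part of the original question):

```python
import numpy as np
import patsy
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.poisson(lam=50, size=500) + 1           # hypothetical count predictor
y = rng.poisson(np.exp(0.5 + 0.3 * np.log(x)))  # fake response so the fit runs

# Cubic B-spline basis for log(x): captures a smooth nonlinear effect
# without manufacturing the near-collinear pair log(x), log(x)^2.
design = patsy.dmatrix("bs(np.log(x), df=4)", {"x": x}, return_type="dataframe")

model = sm.GLM(y, design, family=sm.families.NegativeBinomial(alpha=1.0))
print(model.fit().summary())
```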