The key problem is not correlation but collinearity (see works by Belsley, for instance). This is best tested using condition indexes (available in R, SAS, and probably other programs as well). Correlation is neither a necessary nor a sufficient condition for collinearity. Condition indexes over 10 (per Belsley) indicate moderate collinearity, over 30 severe, but it also depends on which variables are involved in the collinearity.
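For illustration, here is a minimal sketch of how you could compute condition indexes by hand in R; the toy data and the Belsley-style column scaling (unit column length, no centering) are my own assumptions, not something from the original question.

set.seed(1)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 0.01)   # x2 nearly collinear with x1
X = cbind(1, x1, x2)              # design matrix including the intercept column

# scale each column to unit length (Belsley scaling), then take the
# singular values; the condition indexes are max(d)/d
Xs = scale(X, center = FALSE, scale = sqrt(colSums(X^2)))
d = svd(Xs)$d
d[1] / d                          # values over 30 flag severe collinearity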
If you do find high collinearity, it means that your parameter estimates are unstable. That is, small changes (sometimes in the 4th significant figure) in your data can cause big changes in your parameter estimates (sometimes even reversing their sign). This is a bad thing.
Remedies include:
- Getting more data
- Dropping one of the involved variables
- Combining the variables (e.g. with partial least squares)
- Performing ridge regression, which gives biased estimates but reduces their variance (see the sketch after this list).
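For that last remedy, here is a minimal ridge-regression sketch using MASS::lm.ridge; the toy data and the lambda grid are illustrative assumptions of mine, not part of the original answer.

library(MASS)                     # provides lm.ridge
set.seed(1)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 0.05)   # nearly collinear predictors
y = 1 + 2*x1 - x2 + rnorm(100)

fit = lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.5))
plot(fit)                         # coefficient paths stabilize as lambda grows
select(fit)                       # HKB, L-W and GCV suggestions for lambda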
Yes, this is usually the case with non-centered interactions. Here's a quick look at what happens to the correlation between two independent variables and their "interaction":
set.seed(12345)
a = rnorm(10000, 20, 2)
b = rnorm(10000, 10, 2)

> cor(a, b)
[1] 0.01564907
> cor(a, a*b)
[1] 0.4608877
And then when you center them:
c = a - 20
d = b - 10

> cor(c, d)
[1] 0.01564907
> cor(c, c*d)
[1] 0.001908758
Incidentally, the same can happen when including polynomial terms (i.e., $X,~X^2,~\ldots$) without first centering.
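The same quick check works for the polynomial case, reusing the vectors a (non-centered) and c (centered) defined above:

cor(a, a^2)   # close to 1, since a lives far from zero
cor(c, c^2)   # near zero once centered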
So you can give that a shot with your pair.
As to why centering helps, let's go back to the definition of covariance:
\begin{align}
\text{Cov}(X,XY) &= E[(X-E(X))(XY-E(XY))] \\
&= E[(X-\mu_x)(XY-\mu_{xy})] \\
&= E[X^2Y-X\mu_{xy}-XY\mu_x+\mu_x\mu_{xy}] \\
&= E[X^2Y]-E[X]\mu_{xy}-E[XY]\mu_x+\mu_x\mu_{xy}
\end{align}
Even given independence of $X$ and $Y$ (so that $E[X^2Y]=E[X^2]E[Y]$ and $\mu_{xy}=\mu_x\mu_y$), this reduces to
\begin{align}
\text{Cov}(X,XY) &= E[X^2]E[Y]-\mu_x\mu_x\mu_y-\mu_x\mu_y\mu_x+\mu_x\mu_x\mu_y \\
&= (\sigma_x^2+\mu_x^2)\mu_y-\mu_x^2\mu_y \\
&= \sigma_x^2\mu_y
\end{align}
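So the covariance is driven entirely by $\mu_y$: centering $Y$ sets $\mu_y = 0$ and the covariance vanishes. A quick numerical check, reusing the simulated a and b from above (where $\sigma_x = 2$ and $\mu_y = 10$):

cov(a, a*b)   # should come out near sigma_x^2 * mu_y = 4 * 10 = 40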
This doesn't relate directly to your regression problem, since you probably don't have completely independent $X$ and $Y$, and since correlation between two explanatory variables doesn't always result in multicollinearity issues in regression. But it does show how an interaction between two non-centered independent variables causes correlation to show up, and that correlation could cause multicollinearity issues.
Intuitively to me, having non-centered variables interact simply means that when $X$ is big, $XY$ will also be bigger on an absolute scale irrespective of $Y$, so $X$ and $XY$ will end up correlated, and similarly for $Y$.
Best Answer
I would use condition indexes rather than either VIFs or correlations; I wrote my dissertation about this, but you can also see the work of David Belsley, e.g. this book. But if I had to choose between VIFs and correlations, I'd go with VIFs. Belsley shows that fairly high correlations are not always problematic.
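If you do want VIFs in R, car::vif is one option; the toy model below is just my own illustration, not from the answer.

library(car)                      # provides vif()
set.seed(3)
x1 = rnorm(50)
x2 = x1 + rnorm(50, sd = 0.1)     # strongly correlated predictors
y = x1 + x2 + rnorm(50)
vif(lm(y ~ x1 + x2))              # rule of thumb: values over 10 are worrying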
If you are using R, another method that seems good to me is to use the perturb package to see if the collinearity is problematic.
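A minimal sketch of what that might look like; the toy model is mine, and as far as I know the package offers colldiag() for Belsley-style diagnostics and perturb() for re-fitting under small random perturbations (check its documentation for the exact arguments).

library(perturb)
set.seed(2)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 0.05)   # nearly collinear pair
y = 1 + 2*x1 - x2 + rnorm(100)
m = lm(y ~ x1 + x2)

colldiag(m)                       # condition indexes and variance-decomposition proportions
p = perturb(m, pvars = c("x1", "x2"), prange = c(0.01, 0.01))
summary(p)                        # how much the coefficients move under small noise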