The key problem is not correlation but collinearity (see works by Belsley, for instance). This is best tested using condition indexes (available in R, SAS, and probably other programs as well). Correlation is neither a necessary nor a sufficient condition for collinearity. Condition indexes over 10 (per Belsley) indicate moderate collinearity, over 30 severe, but it also depends on which variables are involved in the collinearity.
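For illustration, here is a minimal sketch of how you could compute condition indexes by hand in R; the toy data and the Belsley-style column scaling (unit column length, no centering) are my own assumptions, not something from the original question.

set.seed(1)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 0.01)   # x2 nearly collinear with x1
X = cbind(1, x1, x2)              # design matrix including the intercept column

# scale each column to unit length (Belsley scaling), then take the
# singular values; the condition indexes are max(d)/d
Xs = scale(X, center = FALSE, scale = sqrt(colSums(X^2)))
d = svd(Xs)$d
d[1] / d                          # values over 30 flag severe collinearity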
If you do find high collinearity, it means that your parameter estimates are unstable. That is, small changes (sometimes in the 4th significant figure) in your data can cause big changes in your parameter estimates (sometimes even reversing their sign). This is a bad thing.
Remedies include:
- Getting more data
- Dropping one of the involved variables
- Combining the variables (e.g. with partial least squares)
- Performing ridge regression, which gives biased estimates but reduces their variance (see the sketch after this list).
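For that last remedy, here is a minimal ridge-regression sketch using MASS::lm.ridge; the toy data and the lambda grid are illustrative assumptions of mine, not part of the original answer.

library(MASS)                     # provides lm.ridge
set.seed(1)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 0.05)   # nearly collinear predictors
y = 1 + 2*x1 - x2 + rnorm(100)

fit = lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.5))
plot(fit)                         # coefficient paths stabilize as lambda grows
select(fit)                       # HKB, L-W and GCV suggestions for lambda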
Yes, this is usually the case with non-centered interactions. Here's a quick look at what happens to the correlation between two independent variables and their "interaction":
set.seed(12345)
a = rnorm(10000, 20, 2)
b = rnorm(10000, 10, 2)

> cor(a, b)
[1] 0.01564907
> cor(a, a*b)
[1] 0.4608877
And then when you center them:
c = a - 20
d = b - 10

> cor(c, d)
[1] 0.01564907
> cor(c, c*d)
[1] 0.001908758
Incidentally, the same can happen when including polynomial terms (i.e., $X,~X^2,~\ldots$) without first centering.
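The same quick check works for the polynomial case, reusing the vectors a (non-centered) and c (centered) defined above:

cor(a, a^2)   # close to 1, since a lives far from zero
cor(c, c^2)   # near zero once centered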
So you can give that a shot with your pair.
As to why centering helps, let's go back to the definition of covariance:
\begin{align}
\text{Cov}(X,XY) &= E[(X-E(X))(XY-E(XY))] \\
&= E[(X-\mu_x)(XY-\mu_{xy})] \\
&= E[X^2Y-X\mu_{xy}-XY\mu_x+\mu_x\mu_{xy}] \\
&= E[X^2Y]-E[X]\mu_{xy}-E[XY]\mu_x+\mu_x\mu_{xy}
\end{align}
Even given independence of $X$ and $Y$ (so that $E[X^2Y]=E[X^2]E[Y]$ and $\mu_{xy}=\mu_x\mu_y$), this reduces to
\begin{align}
\text{Cov}(X,XY) &= E[X^2]E[Y]-\mu_x\mu_x\mu_y-\mu_x\mu_y\mu_x+\mu_x\mu_x\mu_y \\
&= (\sigma_x^2+\mu_x^2)\mu_y-\mu_x^2\mu_y \\
&= \sigma_x^2\mu_y
\end{align}
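So the covariance is driven entirely by $\mu_y$: centering $Y$ sets $\mu_y = 0$ and the covariance vanishes. A quick numerical check, reusing the simulated a and b from above (where $\sigma_x = 2$ and $\mu_y = 10$):

cov(a, a*b)   # should come out near sigma_x^2 * mu_y = 4 * 10 = 40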
This doesn't relate directly to your regression problem, since you probably don't have completely independent $X$ and $Y$, and since correlation between two explanatory variables doesn't always result in multicollinearity issues in regression. But it does show how an interaction between two non-centered independent variables causes correlation to show up, and that correlation could cause multicollinearity issues.
Intuitively to me, having non-centered variables interact simply means that when $X$ is big, $XY$ will also be bigger on an absolute scale irrespective of $Y$, so $X$ and $XY$ will end up correlated, and similarly for $Y$.
Best Answer
I would use condition indexes rather than either VIFs or correlations; I wrote my dissertation about this, but you can also see the work of David Belsley, e.g. this book. But if I had to choose between VIFs and correlations, I'd go with VIFs. Belsley shows that fairly high correlations are not always problematic.
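If you do want VIFs in R, car::vif is one option; the toy model below is just my own illustration, not from the answer.

library(car)                      # provides vif()
set.seed(3)
x1 = rnorm(50)
x2 = x1 + rnorm(50, sd = 0.1)     # strongly correlated predictors
y = x1 + x2 + rnorm(50)
vif(lm(y ~ x1 + x2))              # rule of thumb: values over 10 are worrying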
If you are using R, another method that seems good to me is to use the perturb package to see if the collinearity is problematic.
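A minimal sketch of what that might look like; the toy model is mine, and as far as I know the package offers colldiag() for Belsley-style diagnostics and perturb() for re-fitting under small random perturbations (check its documentation for the exact arguments).

library(perturb)
set.seed(2)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 0.05)   # nearly collinear pair
y = 1 + 2*x1 - x2 + rnorm(100)
m = lm(y ~ x1 + x2)

colldiag(m)                       # condition indexes and variance-decomposition proportions
p = perturb(m, pvars = c("x1", "x2"), prange = c(0.01, 0.01))
summary(p)                        # how much the coefficients move under small noise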