Solved – Collinearity diagnostics problematic only when the interaction term is included

multicollinearityrvariance-decompositionvariance-inflation-factor

I've run a regression on U.S. counties, and am checking for collinearity in my 'independent' variables. Belsley, Kuh, and Welsch's Regression Diagnostics suggests looking at the Condition Index and Variance Decomposition Proportions:

library(perturb)
## colldiag(, scale=TRUE) for model with interaction
Condition
Index   Variance Decomposition Proportions
           (Intercept) inc09_10k unins09 sqmi_log pop10_perSqmi_log phys_per100k nppa_per100k black10_pct hisp10_pct elderly09_pct inc09_10k:unins09
1    1.000 0.000       0.000     0.000   0.000    0.001             0.002        0.003        0.002       0.002      0.001         0.000            
2    3.130 0.000       0.000     0.000   0.000    0.002             0.053        0.011        0.148       0.231      0.000         0.000            
3    3.305 0.000       0.000     0.000   0.000    0.000             0.095        0.072        0.351       0.003      0.000         0.000            
4    3.839 0.000       0.000     0.000   0.001    0.000             0.143        0.002        0.105       0.280      0.009         0.000            
5    5.547 0.000       0.002     0.000   0.000    0.050             0.093        0.592        0.084       0.005      0.002         0.000            
6    7.981 0.000       0.005     0.006   0.001    0.150             0.560        0.256        0.002       0.040      0.026         0.001            
7   11.170 0.000       0.009     0.003   0.000    0.046             0.000        0.018        0.003       0.250      0.272         0.035            
8   12.766 0.000       0.050     0.029   0.015    0.309             0.023        0.043        0.220       0.094      0.005         0.002            
9   18.800 0.009       0.017     0.003   0.209    0.001             0.002        0.001        0.047       0.006      0.430         0.041            
10  40.827 0.134       0.159     0.163   0.555    0.283             0.015        0.001        0.035       0.008      0.186         0.238            
11  76.709 0.855       0.759     0.796   0.219    0.157             0.013        0.002        0.004       0.080      0.069         0.683            

## colldiag(, scale=TRUE) for model without interaction
Condition
Index   Variance Decomposition Proportions
           (Intercept) inc09_10k unins09 sqmi_log pop10_perSqmi_log phys_per100k nppa_per100k black10_pct hisp10_pct elderly09_pct
1    1.000 0.000       0.001     0.001   0.000    0.001             0.003        0.004        0.003       0.003      0.001        
2    2.988 0.000       0.000     0.001   0.000    0.002             0.030        0.003        0.216       0.253      0.000        
3    3.128 0.000       0.000     0.002   0.000    0.000             0.112        0.076        0.294       0.027      0.000        
4    3.630 0.000       0.002     0.001   0.001    0.000             0.160        0.003        0.105       0.248      0.009        
5    5.234 0.000       0.008     0.002   0.000    0.053             0.087        0.594        0.086       0.004      0.001        
6    7.556 0.000       0.024     0.039   0.001    0.143             0.557        0.275        0.002       0.025      0.035        
7   11.898 0.000       0.278     0.080   0.017    0.371             0.026        0.023        0.147       0.005      0.038        
8   13.242 0.000       0.001     0.343   0.006    0.000             0.000        0.017        0.129       0.328      0.553        
9   21.558 0.010       0.540     0.332   0.355    0.037             0.000        0.003        0.003       0.020      0.083        
10  50.506 0.989       0.148     0.199   0.620    0.393             0.026        0.004        0.016       0.087      0.279        

?HH::vif suggests that VIFs >5 are problematic:

library(HH)
## vif() for model with interaction
        inc09_10k           unins09          sqmi_log pop10_perSqmi_log      phys_per100k      nppa_per100k       black10_pct        hisp10_pct 
         8.378646         16.329881          1.653584          2.744314          1.885095          1.471123          1.436229          1.789454 
    elderly09_pct inc09_10k:unins09 
         1.547234         11.590162 

## vif() for model without interaction
        inc09_10k           unins09          sqmi_log pop10_perSqmi_log      phys_per100k      nppa_per100k       black10_pct        hisp10_pct 
         1.859426          2.378138          1.628817          2.716702          1.882828          1.471102          1.404482          1.772352 
    elderly09_pct 
         1.545867 

Whereas John Fox's Regression Diagnostics suggests looking at the square root of the VIF:

library(car)
## sqrt(vif) for model with interaction
        inc09_10k           unins09          sqmi_log pop10_perSqmi_log      phys_per100k      nppa_per100k       black10_pct        hisp10_pct 
         2.894589          4.041025          1.285917          1.656597          1.372987          1.212898          1.198428          1.337705 
    elderly09_pct inc09_10k:unins09 
         1.243879          3.404433 
## sqrt(vif) for model without interaction
        inc09_10k           unins09          sqmi_log pop10_perSqmi_log      phys_per100k      nppa_per100k       black10_pct        hisp10_pct 
         1.363608          1.542121          1.276251          1.648242          1.372162          1.212890          1.185108          1.331297 
    elderly09_pct 
         1.243329 

In the first two cases (where a clear cutoff is suggested), the model is problematic only when the interaction term is included.

The model with the interaction term has until this point been my preferred specification.

I have two questions given this quirk of the data:

  1. Does an interaction term always worsen the collinearity of the data?
  2. Since the two variables without the interaction term are not above the threshold, am I ok using the model with the interaction term. Specifically, the reason I think this might be ok is that I'm using the King, Tomz, and Wittenberg (2000) method to interpret the coefficients (negative binomial model), where I generally hold the other coefficients at the mean, and then interpret what happens to predictions of my dependent variable when I move inc09_10k and unins09 around independently and jointly.

Best Answer

Yes, this is usually the case with non-centered interactions. A quick look at what happens to the correlation of two independent variables and their "interaction"

set.seed(12345)
a = rnorm(10000,20,2)
b = rnorm(10000,10,2)
cor(a,b)
cor(a,a*b)

> cor(a,b)
[1] 0.01564907
> cor(a,a*b)
[1] 0.4608877

And then when you center them:

c = a - 20
d = b - 10
cor(c,d)
cor(c,c*d)

> cor(c,d)
[1] 0.01564907
> cor(c,c*d)
[1] 0.001908758

Incidentally, the same can happen with including polynomial terms (i.e., $X,~X^2,~...$) without first centering.

So you can give that a shot with your pair.


As to why centering helps - but let's go back to the definition of covariance

\begin{align} \text{Cov}(X,XY) &= E[(X-E(X))(XY-E(XY))] \\ &= E[(X-\mu_x)(XY-\mu_{xy})] \\ &= E[X^2Y-X\mu_{xy}-XY\mu_x+\mu_x\mu_{xy}] \\ &= E[X^2Y]-E[X]\mu_{xy}-E[XY]\mu_x+\mu_x\mu_{xy} \\ \end{align}

Even given independence of X and Y

\begin{align} \qquad\qquad\qquad\, &= E[X^2]E[Y]-\mu_x\mu_x\mu_y-\mu_x\mu_y\mu_x+\mu_x\mu_x\mu_y \\ &= (\sigma_x^2+\mu_x^2)\mu_y-\mu_x^2\mu_y \\ &= \sigma_x^2\mu_y \\ \end{align}

This doesn't related directly to your regression problem, since you probably don't have completely independent $X$ and $Y$, and since correlation between two explanatory variables doesn't always result in multicollinearity issues in regression. But it does show how an interaction between two non-centered independent variables causes correlation to show up, and that correlation could cause multicollinearity issues.


Intuitively to me, having non-centered variables interact simply means that when $X$ is big, then $XY$ is also going to be bigger on an absolute scale irrespective of $Y$, and so $X$ and $XY$ will end up correlated, and similarly for $Y$.