Solved – Collinearity diagnostics problematic only when the interaction term is included

multicollinearityrvariance-decompositionvariance-inflation-factor

I've run a regression on U.S. counties, and am checking for collinearity in my 'independent' variables. Belsley, Kuh, and Welsch's Regression Diagnostics suggests looking at the Condition Index and Variance Decomposition Proportions:

library(perturb)
## colldiag(, scale=TRUE) for model with interaction
Condition
Index   Variance Decomposition Proportions
           (Intercept) inc09_10k unins09 sqmi_log pop10_perSqmi_log phys_per100k nppa_per100k black10_pct hisp10_pct elderly09_pct inc09_10k:unins09
1    1.000 0.000       0.000     0.000   0.000    0.001             0.002        0.003        0.002       0.002      0.001         0.000            
2    3.130 0.000       0.000     0.000   0.000    0.002             0.053        0.011        0.148       0.231      0.000         0.000            
3    3.305 0.000       0.000     0.000   0.000    0.000             0.095        0.072        0.351       0.003      0.000         0.000            
4    3.839 0.000       0.000     0.000   0.001    0.000             0.143        0.002        0.105       0.280      0.009         0.000            
5    5.547 0.000       0.002     0.000   0.000    0.050             0.093        0.592        0.084       0.005      0.002         0.000            
6    7.981 0.000       0.005     0.006   0.001    0.150             0.560        0.256        0.002       0.040      0.026         0.001            
7   11.170 0.000       0.009     0.003   0.000    0.046             0.000        0.018        0.003       0.250      0.272         0.035            
8   12.766 0.000       0.050     0.029   0.015    0.309             0.023        0.043        0.220       0.094      0.005         0.002            
9   18.800 0.009       0.017     0.003   0.209    0.001             0.002        0.001        0.047       0.006      0.430         0.041            
10  40.827 0.134       0.159     0.163   0.555    0.283             0.015        0.001        0.035       0.008      0.186         0.238            
11  76.709 0.855       0.759     0.796   0.219    0.157             0.013        0.002        0.004       0.080      0.069         0.683            

## colldiag(, scale=TRUE) for model without interaction
Condition
Index   Variance Decomposition Proportions
           (Intercept) inc09_10k unins09 sqmi_log pop10_perSqmi_log phys_per100k nppa_per100k black10_pct hisp10_pct elderly09_pct
1    1.000 0.000       0.001     0.001   0.000    0.001             0.003        0.004        0.003       0.003      0.001        
2    2.988 0.000       0.000     0.001   0.000    0.002             0.030        0.003        0.216       0.253      0.000        
3    3.128 0.000       0.000     0.002   0.000    0.000             0.112        0.076        0.294       0.027      0.000        
4    3.630 0.000       0.002     0.001   0.001    0.000             0.160        0.003        0.105       0.248      0.009        
5    5.234 0.000       0.008     0.002   0.000    0.053             0.087        0.594        0.086       0.004      0.001        
6    7.556 0.000       0.024     0.039   0.001    0.143             0.557        0.275        0.002       0.025      0.035        
7   11.898 0.000       0.278     0.080   0.017    0.371             0.026        0.023        0.147       0.005      0.038        
8   13.242 0.000       0.001     0.343   0.006    0.000             0.000        0.017        0.129       0.328      0.553        
9   21.558 0.010       0.540     0.332   0.355    0.037             0.000        0.003        0.003       0.020      0.083        
10  50.506 0.989       0.148     0.199   0.620    0.393             0.026        0.004        0.016       0.087      0.279

?HH::vif suggests that VIFs >5 are problematic:

library(HH)
## vif() for model with interaction
        inc09_10k           unins09          sqmi_log pop10_perSqmi_log      phys_per100k      nppa_per100k       black10_pct        hisp10_pct 
         8.378646         16.329881          1.653584          2.744314          1.885095          1.471123          1.436229          1.789454 
    elderly09_pct inc09_10k:unins09 
         1.547234         11.590162 

## vif() for model without interaction
        inc09_10k           unins09          sqmi_log pop10_perSqmi_log      phys_per100k      nppa_per100k       black10_pct        hisp10_pct 
         1.859426          2.378138          1.628817          2.716702          1.882828          1.471102          1.404482          1.772352 
    elderly09_pct 
         1.545867

Whereas John Fox's Regression Diagnostics suggests looking at the square root of the VIF:

library(car)
## sqrt(vif) for model with interaction
        inc09_10k           unins09          sqmi_log pop10_perSqmi_log      phys_per100k      nppa_per100k       black10_pct        hisp10_pct 
         2.894589          4.041025          1.285917          1.656597          1.372987          1.212898          1.198428          1.337705 
    elderly09_pct inc09_10k:unins09 
         1.243879          3.404433 
## sqrt(vif) for model without interaction
        inc09_10k           unins09          sqmi_log pop10_perSqmi_log      phys_per100k      nppa_per100k       black10_pct        hisp10_pct 
         1.363608          1.542121          1.276251          1.648242          1.372162          1.212890          1.185108          1.331297 
    elderly09_pct 
         1.243329

In the first two cases (where a clear cutoff is suggested), the model is problematic only when the interaction term is included.

The model with the interaction term has until this point been my preferred specification.

I have two questions given this quirk of the data:

Does an interaction term always worsen the collinearity of the data?
Since the two variables without the interaction term are not above the threshold, am I ok using the model with the interaction term. Specifically, the reason I think this might be ok is that I'm using the King, Tomz, and Wittenberg (2000) method to interpret the coefficients (negative binomial model), where I generally hold the other coefficients at the mean, and then interpret what happens to predictions of my dependent variable when I move inc09_10k and unins09 around independently and jointly.

Best Answer

Yes, this is usually the case with non-centered interactions. A quick look at what happens to the correlation of two independent variables and their "interaction"

set.seed(12345)
a = rnorm(10000,20,2)
b = rnorm(10000,10,2)
cor(a,b)
cor(a,a*b)

> cor(a,b)
[1] 0.01564907
> cor(a,a*b)
[1] 0.4608877

And then when you center them:

c = a - 20
d = b - 10
cor(c,d)
cor(c,c*d)

> cor(c,d)
[1] 0.01564907
> cor(c,c*d)
[1] 0.001908758

Incidentally, the same can happen with including polynomial terms (i.e., $X,~X^2,~...$) without first centering.

So you can give that a shot with your pair.

As to why centering helps - but let's go back to the definition of covariance

\begin{align} \text{Cov}(X,XY) &= E[(X-E(X))(XY-E(XY))] \\ &= E[(X-\mu_x)(XY-\mu_{xy})] \\ &= E[X^2Y-X\mu_{xy}-XY\mu_x+\mu_x\mu_{xy}] \\ &= E[X^2Y]-E[X]\mu_{xy}-E[XY]\mu_x+\mu_x\mu_{xy} \\ \end{align}

Even given independence of X and Y

\begin{align} \qquad\qquad\qquad\, &= E[X^2]E[Y]-\mu_x\mu_x\mu_y-\mu_x\mu_y\mu_x+\mu_x\mu_x\mu_y \\ &= (\sigma_x^2+\mu_x^2)\mu_y-\mu_x^2\mu_y \\ &= \sigma_x^2\mu_y \\ \end{align}

This doesn't related directly to your regression problem, since you probably don't have completely independent $X$ and $Y$, and since correlation between two explanatory variables doesn't always result in multicollinearity issues in regression. But it does show how an interaction between two non-centered independent variables causes correlation to show up, and that correlation could cause multicollinearity issues.

Intuitively to me, having non-centered variables interact simply means that when $X$ is big, then $XY$ is also going to be bigger on an absolute scale irrespective of $Y$, and so $X$ and $XY$ will end up correlated, and similarly for $Y$.

Related Solutions

Solved – the fastest method for determining collinearity degree

The condition number is the statistic you want to look at. It's an exhaustive measure of colinearity of your design space and much cheaper to compute than the VIF. The formula is:

$$2\log\left(\frac{d_1(X_s)}{d_p(X_s)}\right) (1)$$

where $d_p$ ($d_1$) is the smallest (largest) singular value of $X_s$ and $X_s$ are the centred, re-scaled variables.

Reading your comment: this is what I was suspecting. To compute the condition # you don't need all 100e6 data points. Just samples, randomly, in the order of 100e3 (by experimenting with simulated datasets you can convince yourself that to get reliable results you should target about 5*k, where k is the # of non collinear variables so even 100e3 is very large already).

That should already give you a pretty good idea what variables are causing collinearity.

Also, you have specialized algorithm to compute the first and last few singular values and the last few singular vector only. There are algorithms do get these w/o computing the full SVD of $X$ (SVD-S). However, I don't know if these are implemented in R so to make the example below accessible I'll just use a small example and the classical SVD from R.

A high condition number (typically when the ratio (1) is larger than 10) tells you that $X_s'X_s$ is ill-conditioned, that there are components that can be re-written as (near) linear combination of the other variables. Below I give a brief (small scale) example of how SVD can be used to uncover such relations.

n<-100
p<-20
#non-ill conditioned part of the dataset.
x<-matrix(rnorm(n*p),nc=p)
x<-scale(x)
#introduce a variable that causes x to be 
#ill conditioned.
y<-x%*%c(rnorm(3),rep(0,p))[1:p]
y<-scale(y)
x<-cbind(x,y)
p<-ncol(x)

A<-svd(x,nu=0)
#x is ill-conditioned: this ratio is larger 
#than 10. (step 1)
2*log(A$d[1]/A$d[p])

#check what is causing it: (step 2)
round(A$v[,ncol(A$v)],2)
#you can write the last variable as (.23*x_1+.5*x_2-.45*x_3)/(-.7) [1]
#here the relation is exact because:
min(A$d)
#is 0. if min(A$d)>0 then this gives you how much there is noise 
#there is arround [1].

#so I remove the last variable. (step 3)
x<-x[,-ncol(x)]
#no more ill-condition.
2*log(A$d[1]/A$d[p-1])

This is for the linear algebra of the problem when there is a single variable causing the ill-regression. In most case you will have more than one (near) exact relationship and you will have to repeat steps 1 through 3. In practice, the computational specifics will depend on how clever is the approach you use to solve the SVD problem.

You can make yourself an idea of how many exact relationships there are in your dataset by computing

$$2\log\left(\frac{d_1(X_s)}{d_j(X_s)}\right)$$

for all $j$'s. To do this you only need the singular values which you can get at cost $O(p^2)$

Solved – what is the difference between collinearity and interaction

An interaction may arise when considering the relationship among three or more variables, and describes a situation in which the simultaneous influence of two variables on a third is not additive. Most commonly, interactions are considered in the context of regression analyses.

The presence of interactions can have important implications for the interpretation of statistical models. If two variables of interest interact, the relationship between each of the interacting variables and a third "dependent variable" depends on the value of the other interacting variable. In practice, this makes it more difficult to predict the consequences of changing the value of a variable, particularly if the variables it interacts with are hard to measure or difficult to control.

Collinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Collinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data themselves; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.

Bottom line: Interactions don't imply collinearity and collinearity does not imply there are interactions.

Best Answer

Related Solutions

Solved – the fastest method for determining collinearity degree

Solved – what is the difference between collinearity and interaction

Related Question