Solved – the fastest method for determining the degree of collinearity

large-data, multicollinearity, variance-inflation-factor

Given a large amount of data, how would I determine the degree of collinearity between all the variables, preferably without calculating a linear regression between each variable and every other variable? Specifically, I need to find the collinearity between any single variable and every other variable.

For this instance, we'll say the data has 10k variables and 100 million rows. I don't need a software package per se; a link to a relevant paper is fine.

Best Answer

The condition number is the statistic you want to look at. It's an exhaustive measure of collinearity of your design space and much cheaper to compute than the VIF. The formula is:

$$2\log\left(\frac{d_1(X_s)}{d_p(X_s)}\right)\tag{1}$$

where $d_p$ ($d_1$) is the smallest (largest) singular value of $X_s$, and $X_s$ is the matrix of centred, re-scaled variables.

Reading your comment: this is what I was suspecting. To compute the condition number you don't need all 100e6 data points; a random sample on the order of 100e3 is enough. (By experimenting with simulated datasets you can convince yourself that to get reliable results you should target about $5k$ observations, where $k$ is the number of non-collinear variables, so even 100e3 is very large already.)

That should already give you a pretty good idea of which variables are causing collinearity.
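
As a minimal sketch of that sampling step (assuming the full data sits in a matrix x; with 100e6 rows you would more likely sample while reading from disk):

# estimate the condition number from a random subsample of rows
# rather than from the full dataset.
n_sub <- 1e5                        # on the order of 100e3 rows
idx   <- sample(nrow(x), n_sub)
xs    <- scale(x[idx, ])            # centre and re-scale the sample
d     <- svd(xs, nu = 0, nv = 0)$d  # singular values only
2 * log(d[1] / d[length(d)])        # statistic (1)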

Also, there are specialized algorithms to compute only the first and last few singular values and the last few singular vectors, without computing the full SVD of $X$ (SVD-S). However, I don't know whether these are implemented in R, so to keep the example below accessible I'll just use a small example and the classical SVD from R.
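
(For what it's worth, the RSpectra package in R does provide partial SVD and eigenvalue routines. The following sketch is one possible way to use them; the shift-invert step for the smallest singular values is an illustrative assumption that needs care if the cross-product is exactly singular.)

# partial SVD sketch with RSpectra (assumed installed); xs is the
# centred, re-scaled sample from above.
library(RSpectra)

# largest few singular values of xs:
top <- svds(xs, k = 3, nu = 0, nv = 0)$d

# smallest few: shift-invert eigensolve of the p x p cross-product;
# singular values are square roots of its eigenvalues. sigma = 0 may
# need a small positive shift if crossprod(xs) is exactly singular.
XtX   <- crossprod(xs)
small <- sqrt(pmax(eigs_sym(XtX, k = 3, sigma = 0)$values, 0))

2 * log(top[1] / min(small))  # statistic (1)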

A high condition number (typically, when the ratio (1) is larger than 10) tells you that $X_s'X_s$ is ill-conditioned: there are variables that can be re-written as (near) linear combinations of the other variables. Below I give a brief (small-scale) example of how the SVD can be used to uncover such relations.

n <- 100
p <- 20
# non-ill-conditioned part of the dataset.
x <- matrix(rnorm(n * p), ncol = p)
x <- scale(x)
# introduce a variable that causes x to be ill-conditioned:
# an exact linear combination of the first 3 columns.
y <- x %*% c(rnorm(3), rep(0, p))[1:p]
y <- scale(y)
x <- cbind(x, y)
p <- ncol(x)

A <- svd(x, nu = 0)
# x is ill-conditioned: this ratio is larger than 10. (step 1)
2 * log(A$d[1] / A$d[p])

# check what is causing it: inspect the last right singular vector. (step 2)
round(A$v[, ncol(A$v)], 2)
# you can write the last variable as (.23*x_1+.5*x_2-.45*x_3)/(-.7) [1]
# (the coefficients will differ from run to run).
# here the relation is exact because:
min(A$d)
# is (numerically) 0. If min(A$d) > 0, it tells you how much noise
# there is around relation [1].

# so I remove the last variable. (step 3)
x <- x[, -ncol(x)]
# recompute the SVD: no more ill-conditioning.
A <- svd(x, nu = 0)
2 * log(A$d[1] / A$d[ncol(x)])

This takes care of the linear algebra of the problem when a single variable causes the ill-conditioning. In most cases you will have more than one (near) exact relationship, and you will have to repeat steps 1 through 3; a sketch of such a loop is given below. In practice, the computational specifics will depend on how clever the approach you use to solve the SVD problem is.
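
A minimal sketch of that iteration (the function name prune_collinear, the threshold, and the drop-the-largest-weight heuristic are all illustrative choices, not a prescribed procedure):

# illustrative loop: repeat steps 1-3 until the condition
# statistic (1) drops below the usual threshold of 10.
# assumes more rows than columns, so svd returns ncol(x) values.
prune_collinear <- function(x, threshold = 10) {
  repeat {
    A <- svd(x, nu = 0)
    if (2 * log(A$d[1] / A$d[ncol(x)]) <= threshold) break
    # the last right singular vector encodes the (near) exact linear
    # relation; drop the variable with the largest absolute weight in it.
    worst <- which.max(abs(A$v[, ncol(A$v)]))
    x <- x[, -worst, drop = FALSE]
  }
  x
}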

You can get an idea of how many (near) exact relationships there are in your dataset by computing

$$2\log\left(\frac{d_1(X_s)}{d_j(X_s)}\right)$$

for all $j$'s. To do this you only need the singular values, which you can get at cost $O(p^2)$.
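
In R that is a one-liner once the singular values are available; a small sketch, with A$d the vector of singular values of the centred, re-scaled data:

# condition ratios for every j; entries above 10 flag directions
# involved in (near) exact linear relations.
ratios <- 2 * log(A$d[1] / A$d)
sum(ratios > 10)  # rough count of relationships to investigate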