Solved – Find “nearly” linearly dependent columns of matrix

Tags: feature selection, matrix, multicollinearity

For practical reasons I need to eliminate some columns of a matrix based on how close each one is to being a linear combination of the other columns (I'm not looking for a regularizing method like LASSO here; the constraints come from hardware capabilities). I feel like there should be a standard test for this, as opposed to a custom for-loop in which I regress each column on the others and sort by some goodness-of-fit metric (L2 distance or similar); a rough sketch of that loop is below. Ideally the elimination could be run iteratively down to a preset number of columns (features). Is there some standard method for this?
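For concreteness, the kind of loop I have in mind is sketched here (my own rough R code, not a standard routine; I use the R^2 of each column regressed on the others as the stand-in for "distance to a linear combination"):

# For each column, regress it on the remaining columns and record R^2,
# i.e. how well it is explained as a linear combination of the others.
nearness <- function(X) {
  sapply(seq_len(ncol(X)), function(j) {
    summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
  })
}

# Greedy elimination down to a preset number of columns k:
# repeatedly drop the column best explained by the rest.
reduce_to <- function(X, k) {
  while (ncol(X) > k) {
    X <- X[, -which.max(nearness(X)), drop = FALSE]
  }
  X
}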

Note #1: I did try iterative elimination, at each step removing the column whose deletion caused the largest decrease in the matrix condition number, as some have recommended (I standardized all the columns first; a sketch of the loop follows the plot below). The resulting pattern was very interesting: after a logarithmic transformation the curve wasn't smooth but showed "intervals" of decaying change separated by discontinuous jumps. Is this expected behavior?

[Plot: matrix condition number (MCN) vs. elimination iteration]
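The elimination loop behind that plot looked roughly like this (a minimal sketch, assuming X is the data matrix with observations in rows; kappa() is base R's condition-number estimator):

X <- scale(X)                # standardize columns first
kappas <- numeric(0)
while (ncol(X) > 2) {
  # condition number of the matrix with each candidate column removed
  drop_kappa <- sapply(seq_len(ncol(X)),
                       function(j) kappa(X[, -j, drop = FALSE]))
  # drop the column whose removal lowers the condition number the most
  X <- X[, -which.min(drop_kappa), drop = FALSE]
  kappas <- c(kappas, min(drop_kappa))
}
plot(log10(kappas), type = "b",
     xlab = "iteration", ylab = "log10(condition number)")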

Note #2: I also took another look at PCA, examining the eigenvectors associated with the smallest eigenvalues, but a sequential elimination procedure wasn't as obvious there as with the matrix condition number.
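For reference, the PCA check looked roughly like this (a sketch; dropping the column with the largest absolute loading on the smallest component is just one heuristic I tried, not an established rule):

# A near-zero eigenvalue of the correlation matrix means its eigenvector
# defines an approximate linear dependency among the columns; columns
# with large loadings in that vector are the likely culprits.
eig <- eigen(cor(X))
v_min <- eig$vectors[, which.min(eig$values)]
culprit <- which.max(abs(v_min))  # candidate column to drop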

Best Answer

One way I can think of is using the matrix condition number. According to Wikipedia:

In the field of numerical analysis, the condition number of a function with respect to an argument measures how much the output value of the function can change for a small change in the input argument. This is used to measure how sensitive a function is to changes or errors in the input, and how much error in the output results from an error in the input.

...

In linear regression the condition number can be used as a diagnostic for multicollinearity.[1][2]

[1] Belsley, David A.; Kuh, Edwin; Welsch, Roy E. (1980). "The Condition Number". Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons. pp. 100–104. ISBN 0-471-05856-4.

[2] Pesaran, M. Hashem (2015). "The Multicollinearity Problem". Time Series and Panel Data Econometrics. New York: Oxford University Press. pp. 67–72 [p. 70]. ISBN 978-0-19-875998-0.


Here is a toy example:

> set.seed(0)
> x <- runif(100)
> A1 <- cbind(x, x + runif(100) * 1e-8)  # columns nearly dependent
> A2 <- cbind(x, 2 * x)                  # columns exactly dependent
> kappa(A1)
[1] 329716417
> kappa(A2)
[1] 5.742016e+15
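In the first case the columns are only nearly dependent, so the condition number is large but finite; in the second they are exactly dependent, and in exact arithmetic kappa(A2) would be infinite. The value around 5.7e15 (roughly the reciprocal of double-precision machine epsilon) is floating point's stand-in for that infinity. For column elimination, one can then greedily drop whichever column's removal lowers kappa() the most, as in the sketch in Note #1 above.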