What are the software limitations of all-possible-subsets selection in regression?

model selection, multivariable, regression

If I have a dependent variable and $N$ predictor variables and wanted my stats software to examine all the possible models, there would be $2^N$ possible resulting equations.
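To make the count concrete, here is a minimal Python sketch that enumerates every subset of a small predictor set and then prints how $2^N$ grows. The predictor names and the `all_subsets` helper are hypothetical and only for illustration; no regression is fitted.

```python
from itertools import combinations

def all_subsets(predictors):
    """Yield every subset of the predictors, including the empty (intercept-only) one."""
    for k in range(len(predictors) + 1):
        for subset in combinations(predictors, k):
            yield subset

# Hypothetical predictor names, for illustration only.
models = list(all_subsets(["x1", "x2", "x3"]))
print(len(models))  # 8, i.e. 2**3

# The number of candidate models doubles with each extra predictor.
for n in (10, 20, 30, 40, 50):
    print(n, 2 ** n)
```

Materializing the list is only feasible for tiny $N$; the loop at the end shows why exhaustive enumeration stops being practical well before $N = 50$.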

I am curious to find out what the limitations are with regard to $N$ for major/popular statistical software, since there is a combinatorial explosion as $N$ gets large.

I've poked around the various web pages for packages but have not been able to find this information. I would suspect a limit of somewhere between 10 and 20 for $N$?

If anyone knows (and has links) I would be grateful for this information.

Aside from R and Minitab, I can think of these packages: SAS, SPSS, Stata, Matlab, Excel(?). Are there any other packages I should consider?

Best Answer

I suspect 30 to 60 variables is about the best you'll get. The standard approach is the leaps-and-bounds algorithm, which doesn't require fitting every possible model. In R, the leaps package is one implementation.

The documentation for the regsubsets function in the leaps package states that it will handle up to 50 variables without complaining. It can be "forced" to do an exhaustive search on more than 50 by setting the appropriate boolean flag (really.big = TRUE).

You might do a bit better with some parallelization technique, but the total number of models you can consider will almost certainly scale only linearly with the number of CPU cores available to you. Each additional variable doubles the number of models, so 1000 cores (roughly $2^{10}$) buy only about 10 more variables. If 50 variables is the upper limit for a single core and you have 1000 cores at your disposal, you could bump that to about 60 variables.
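The arithmetic behind that estimate can be sketched in a few lines of Python. This assumes the work is proportional to $2^N$ and parallelizes perfectly, which is an idealization; the function name `max_predictors` and the single-core limit of 50 are illustrative, not measured values.

```python
import math

def max_predictors(single_core_limit, cores):
    """Largest N reachable if the number of models handled scales linearly with cores.

    Assumes work proportional to 2**N and perfect parallel speedup,
    so each doubling of the core count buys exactly one more variable.
    """
    return single_core_limit + math.floor(math.log2(cores))

print(max_predictors(50, 1000))  # 59
print(max_predictors(50, 1024))  # 60
```

Even a million cores would add only about 20 variables under these assumptions, which is why parallel hardware does little to move the practical ceiling.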