Solved – Variable selection with groups of predictors that are highly correlated

Tags: feature selection, forecasting, lasso, machine learning, multicollinearity

What variable selection approach should I consider if I have thousands of predictors that form clusters of extremely correlated variables?

For example, I might have a predictor set $X := \{A_1,A_2,A_3,A_4,\dots,A_{39},B_1,B_2,\dots,B_{44},C_1,C_2,\dots\}$ with cardinality $|X| > 2000$. Consider the case where all $\rho(A_i,A_j)$ are very high, and similarly within $B$, $C$, ….

The predictors aren't correlated "naturally"; the correlation is an artifact of the feature engineering process. All $A_i$ are hand-engineered from the same underlying data with small variations in methodology, e.g. I use a narrower pass band for $A_2$ than I did for $A_1$ in my denoising step, but everything else is the same.
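For concreteness, here is a small simulation of that kind of pipeline (purely illustrative: the sampling rate, band edges, and the `bandpass` helper are made up, with SciPy's Butterworth filter standing in for the real denoising step). Two features built from the same raw series with slightly different pass bands end up almost perfectly correlated:

```python
import numpy as np
from scipy.signal import butter, filtfilt

rng = np.random.default_rng(0)
fs = 100.0                          # assumed sampling rate (Hz)
raw = rng.standard_normal(10_000)   # shared underlying data

def bandpass(x, lo, hi, fs, order=4):
    """Zero-phase Butterworth band-pass filter (illustrative helper)."""
    b, a = butter(order, [lo, hi], btype="bandpass", fs=fs)
    return filtfilt(b, a, x)

A1 = bandpass(raw, 1.0, 10.0, fs)   # one hand-engineered variant
A2 = bandpass(raw, 1.0, 8.0, fs)    # narrower pass band, same pipeline

print(np.corrcoef(A1, A2)[0, 1])    # typically very close to 1
```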

My goal is to improve out-of-sample accuracy in my classification model.

One approach would just be to try everything: the non-negative garrote, ridge, the lasso, elastic nets, random subspace learning, PCA/manifold learning, least-angle regression, and pick whichever performs best on my out-of-sample dataset. But specific methods that are known to handle the structure above would be appreciated.
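A minimal sketch of that "try everything" comparison for a classifier, using scikit-learn's penalized logistic regression for the ridge/lasso/elastic-net entries (the synthetic data and grid values are placeholders for the real problem, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data; swap in your own X, y.
X, y = make_classification(n_samples=2000, n_features=200,
                           n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(solver="saga", max_iter=5000))

# One candidate grid per penalty; the "saga" solver supports all three.
grids = {
    "ridge": {"logisticregression__penalty": ["l2"],
              "logisticregression__C": np.logspace(-3, 2, 6)},
    "lasso": {"logisticregression__penalty": ["l1"],
              "logisticregression__C": np.logspace(-3, 2, 6)},
    "elastic net": {"logisticregression__penalty": ["elasticnet"],
                    "logisticregression__C": np.logspace(-3, 2, 6),
                    "logisticregression__l1_ratio": [0.1, 0.5, 0.9]},
}

for name, grid in grids.items():
    search = GridSearchCV(pipe, grid, cv=5, n_jobs=-1).fit(X_tr, y_tr)
    print(f"{name}: out-of-sample accuracy = {search.score(X_te, y_te):.3f}")
```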

Note that my out-of-sample data is extensive in terms of sample size.

Best Answer

I would do forward stepwise selection, adding predictors as long as their correlation with the residuals is significant, and then apply some regularization (ridge, lasso, or elastic net). There are two or three metaparameters: the forward stepwise termination constraint, and one or two regularization parameters. These metaparameters are determined via cross-validation.
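A minimal sketch of that forward step, assuming a Pearson-correlation significance test as the termination constraint (the `forward_select` helper, significance level, and toy data are my own illustration, not an off-the-shelf routine):

```python
import numpy as np
from scipy import stats

def forward_select(X, y, alpha=0.01, max_features=50):
    """Greedily add the predictor most correlated with the current
    residuals; stop once that correlation is no longer significant
    at level `alpha` (the termination metaparameter)."""
    n, p = X.shape
    selected = []
    resid = y - y.mean()
    for _ in range(max_features):
        remaining = [j for j in range(p) if j not in selected]
        if not remaining:
            break
        corrs = [abs(stats.pearsonr(X[:, j], resid)[0]) for j in remaining]
        j_best = remaining[int(np.argmax(corrs))]
        if stats.pearsonr(X[:, j_best], resid)[1] > alpha:
            break                       # correlation no longer significant
        selected.append(j_best)
        # Refit OLS on the selected set and recompute the residuals.
        Xs = np.column_stack([np.ones(n), X[:, selected]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
    return selected

# Toy demonstration with three truly informative predictors.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 100))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(500)
print(forward_select(X, y))             # typically recovers columns 0, 1, 2
```

The selected columns would then go into a penalized fit (ridge, lasso, or elastic net) whose regularization parameters are themselves chosen by cross-validation, as described above.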

If you want to take non-linearity into account, you could try a random forest, which tends to produce good results when there are many predictors, though it is slow.
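A baseline along those lines with scikit-learn might look like the following sketch (synthetic data stands in for the real problem, and `n_estimators` is an arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in; substitute your own training / out-of-sample split.
X, y = make_classification(n_samples=5000, n_features=2000,
                           n_informative=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X_tr, y_tr)
print("out-of-sample accuracy:", rf.score(X_te, y_te))
```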