Solved – Best way to select useful features using R software

Tags: feature-selection, r

I have a huge matrix (individuals × features, with row.names holding the individual numbers) and the corresponding segment in a separate 1-D vector (its row.names are the same as in the huge matrix, and the vector holds the segment associated with each individual).
For example:

row.names  VAR1  VAR2  VAR3  VAR4  …  VAR3000
       12     4    12     5    18          8
       58     6    13    19     3         10

for the huge matrix, and:

row.names  x
    12     4
    58     2

for the segment representation (where x represents the individual's segment).
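In R, a small version of these two objects could be built like this (a sketch using just the toy values above; the names X and segments are made up):

    # Feature matrix: individuals x features, row names = individual numbers
    X <- matrix(c(4, 12, 5, 18, 8,
                  6, 13, 19, 3, 10),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("12", "58"),
                                c("VAR1", "VAR2", "VAR3", "VAR4", "VAR3000")))

    # Segment vector, named by the same individual numbers
    segments <- c("12" = 4, "58" = 2)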

I have no a priori model, and I want to select a subset of variables (variable/feature selection) in order to predict the segment from a minimal subset of variables. I didn't use a biclustering technique to detect my classes, just a simple one. Which technique would you recommend to:

  1. select the most discriminative variables (e.g., lasso, elastic net), and why?
  2. predict the segment from these variables.
  3. predict multiple values in another, similar matrix (same individuals, only the few predictors that have been selected). Is it possible in this case to use a correlation (or covariance) matrix to infer directly the values of the predictors that are unknown in the other matrix, instead of predicting the class first and then filling the missing values with medoid or cluster-mean values?

Thanks in advance.

Best Answer

One idea would be to use the rfe function in the caret package. Use the option rfeControl = rfeControl(functions = rfFuncs) to calculate variable importance using a random forest.

The rfe algorithm is explained in detail in the caret vignette (Algorithm 2).

If a random forest performs well on your dataset, rfe is usually a good way to improve it further. Or the random forest alone may already give you sufficiently accurate predictions.
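For example, a minimal sketch of that approach (the data here are random stand-ins for your matrix and segment vector, and the subset sizes to try are arbitrary):

    library(caret)  # rfFuncs also needs the randomForest package installed

    # Stand-in data: 100 individuals x 50 features, 4 segments
    set.seed(1)
    X <- matrix(rnorm(100 * 50), nrow = 100,
                dimnames = list(NULL, paste0("VAR", 1:50)))
    y <- factor(sample(1:4, 100, replace = TRUE))

    ctrl <- rfeControl(functions = rfFuncs,  # random-forest variable importance
                       method = "cv",        # cross-validated subset comparison
                       number = 5)

    # Compare subsets of 5, 10, and 20 variables and keep the best one
    fit <- rfe(x = X, y = y, sizes = c(5, 10, 20), rfeControl = ctrl)

    predictors(fit)        # the selected variables
    head(predict(fit, X))  # predictions from the final model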

You can also use the glmnet package to fit an elastic net for regularization/selection. This will be MUCH faster, and it often performs quite well. If you've already got a glm model that you like, glmnet might improve it.
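A minimal sketch with glmnet (again with random stand-in data; alpha = 0.5 is one possible elastic-net mix, and family = "multinomial" assumes the segment is a categorical label):

    library(glmnet)

    # Stand-in data: 100 individuals x 50 features, 4 segments
    set.seed(1)
    X <- matrix(rnorm(100 * 50), nrow = 100,
                dimnames = list(NULL, paste0("VAR", 1:50)))
    y <- factor(sample(1:4, 100, replace = TRUE))

    # Cross-validated elastic net; alpha mixes lasso (1) and ridge (0) penalties
    cvfit <- cv.glmnet(X, y, family = "multinomial", alpha = 0.5)

    # Variables with non-zero coefficients at the chosen lambda are "selected"
    coef(cvfit, s = "lambda.min")

    # Predict the segment for (new) individuals
    predict(cvfit, newx = X, s = "lambda.min", type = "class")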

tl;dr: If a random forest works well on your data, try rfe with the rfFuncs. If a linear model works well, try glmnet, or rfe with lmFuncs.
