Solved – Best way to select useful features using R software

Tags: feature-selection, r

I have a huge matrix (individuals × features, with row.names holding the individual numbers) and the corresponding segment in a separate 1-D vector (its row.names are the same as in the huge matrix, and the vector holds the segment associated with each individual).
For example:

row.names  VAR1  VAR2  VAR3  VAR4  …  VAR3000
       12     4    12     5    18          8
       58     6    13    19     3         10

for the huge matrix, and:

row.names  x
    12     4
    58     2

for the segment representation (where x represents the individual's segment).
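In R, a small version of these two objects could be built like this (a sketch using just the toy values above; the names X and segments are made up):

    # Feature matrix: individuals x features, row names = individual numbers
    X <- matrix(c(4, 12, 5, 18, 8,
                  6, 13, 19, 3, 10),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("12", "58"),
                                c("VAR1", "VAR2", "VAR3", "VAR4", "VAR3000")))

    # Segment vector, named by the same individual numbers
    segments <- c("12" = 4, "58" = 2)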

I have no a priori model, and I want to select a subset of variables (variable/feature selection) in order to predict the segment from a minimal subset of variables. I didn't use a biclustering technique to detect my classes, just a simple one. Which technique would you recommend to:

  1. select the most discriminative variables (e.g., lasso, elastic net), and why?
  2. predict the segment from these variables.
  3. predict multiple values in another, similar matrix (same individuals, only the few predictors that have been selected). Is it possible in this case to use a correlation (or covariance) matrix to infer directly the values of the predictors that are unknown in the other matrix, instead of predicting the class first and then filling the missing values with medoid or cluster-mean values?

Thanks in advance.

Best Answer

One idea would be to use the rfe function in the caret package. Use the option rfeControl = rfeControl(functions = rfFuncs) to calculate variable importance using a random forest.

The rfe algorithm is explained in detail in the caret vignette (Algorithm 2).

If a random forest performs well on your dataset, rfe is usually a good way to improve it further. Or the random forest alone may already give you sufficiently accurate predictions.
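For example, a minimal sketch of that approach (the data here are random stand-ins for your matrix and segment vector, and the subset sizes to try are arbitrary):

    library(caret)  # rfFuncs also needs the randomForest package installed

    # Stand-in data: 100 individuals x 50 features, 4 segments
    set.seed(1)
    X <- matrix(rnorm(100 * 50), nrow = 100,
                dimnames = list(NULL, paste0("VAR", 1:50)))
    y <- factor(sample(1:4, 100, replace = TRUE))

    ctrl <- rfeControl(functions = rfFuncs,  # random-forest variable importance
                       method = "cv",        # cross-validated subset comparison
                       number = 5)

    # Compare subsets of 5, 10, and 20 variables and keep the best one
    fit <- rfe(x = X, y = y, sizes = c(5, 10, 20), rfeControl = ctrl)

    predictors(fit)        # the selected variables
    head(predict(fit, X))  # predictions from the final model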

You can also use the glmnet package to fit an elastic net for regularization/selection. This will be MUCH faster, and it often performs quite well. If you've already got a glm model that you like, glmnet might improve it.
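A minimal sketch with glmnet (again with random stand-in data; alpha = 0.5 is one possible elastic-net mix, and family = "multinomial" assumes the segment is a categorical label):

    library(glmnet)

    # Stand-in data: 100 individuals x 50 features, 4 segments
    set.seed(1)
    X <- matrix(rnorm(100 * 50), nrow = 100,
                dimnames = list(NULL, paste0("VAR", 1:50)))
    y <- factor(sample(1:4, 100, replace = TRUE))

    # Cross-validated elastic net; alpha mixes lasso (1) and ridge (0) penalties
    cvfit <- cv.glmnet(X, y, family = "multinomial", alpha = 0.5)

    # Variables with non-zero coefficients at the chosen lambda are "selected"
    coef(cvfit, s = "lambda.min")

    # Predict the segment for (new) individuals
    predict(cvfit, newx = X, s = "lambda.min", type = "class")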

tl;dr: If a random forest works well on your data, try rfe with the rfFuncs. If a linear model works well, try glmnet, or rfe with lmFuncs.
