Solved – Feature selection with Caret for data with more than one target

Tags: caret, feature selection, r

I am trying to do some feature selection on a dataset with around 3500 variables and about 200 samples. Each sample has two associated numerical values (the expected outcomes). I can't get caret to work with two targets, and I can't find any information on how to do it. Does anybody know how to do this?

As an example, my data is roughly in the following format:

Samples:

S1    2.1 1.2 3.1 ... 4.2 1.7 5.2
S2    3.4 1.1 4.5 ... 5.3 1.2 5.7
...
S3499 2.4 3.5 5.1 ... 2.2 1.5 5.7
S3500 4.1 1.2 5.4 ... 1.2 2.1 5.8

Targets:

S1    1.82 1.44
S2    2.44 1.22
...
S3499 1.23 1.32
S3500 1.99 1.51

Thanks,
Swatchpuppy

Best Answer

I don't think caret supports multi-task learning in any of its functions. You could try the glmnet package with family set to "mgaussian". This fits a linear regression model to all responses jointly and lets you do feature selection via lasso or elastic net regularization. Ridge regularization is also available, but it only shrinks coefficients and never sets them exactly to zero, so it does not select features on its own.

There may be other R machine learning libraries with built-in feature selection that support multi-task learning. Here's some sample code for multi-task learning, using the lasso for variable selection, adapted from ?glmnet:

#Create a dataset
set.seed(42)
library(glmnet)
x=matrix(rnorm(100*20),100,20)
cf <- sample(0:1, 20, replace=TRUE) #Select some columns
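#Parentheses matter: %*% binds tighter than *, so weight the coefficients first
response1 <- x %*% (cf * runif(20)) #Apply random coefficients to the selected columns
response2 <- x %*% (cf * runif(20))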
y=cbind(response1, response2)

#Fit a single lasso model
#alpha=0 for ridge
#alpha=1 for lasso
#0<alpha<1 for the elastic net (mix of ridge and lasso)
fit1m=glmnet(x,y,family="mgaussian",alpha=1)
plot(fit1m,type.coef="2norm")

#Select lambda through cross validation
fit1m.cv <- cv.glmnet(x,y,family="mgaussian",alpha=1) 
plot(fit1m.cv)
coef(fit1m.cv) #Show coefficients at the selected value of lambda
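
To get from the coefficient output to an actual list of selected features, something like the following might work. This is only a sketch on simulated data: the column names V1 to V20, the choice of the first five columns as the true predictors, and the noise level are all made up for illustration. It relies on coef() returning one sparse coefficient matrix per response for an mgaussian fit. Because mgaussian applies a grouped penalty across the responses, the two responses should end up with the same set of nonzero variables.

```r
library(glmnet)

set.seed(42)
n <- 100
p <- 20
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("V", seq_len(p))  # hypothetical feature names

# Simulated truth (an assumption for this sketch): only V1..V5 matter
beta <- matrix(0, p, 2)
beta[1:5, ] <- runif(10, 0.5, 1.5)
y <- x %*% beta + matrix(rnorm(n * 2, sd = 0.5), n, 2)

# Cross-validated multi-response lasso
cvfit <- cv.glmnet(x, y, family = "mgaussian", alpha = 1)

# coef() gives a list with one sparse column matrix per response
cf_list <- coef(cvfit, s = "lambda.1se")

# Keep the names of the nonzero coefficients, dropping the intercept
selected <- lapply(cf_list, function(m) {
  m <- as.matrix(m)
  nonzero <- rownames(m)[m[, 1] != 0]
  setdiff(nonzero, "(Intercept)")
})

print(selected)
```

Using s = "lambda.min" instead of "lambda.1se" typically keeps more variables. With 3500 variables and 200 samples, x would be the 200 x 3500 matrix of samples by variables, and y the 200 x 2 matrix of targets.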