Solved – Feature selection & model with glmnet on Methylation data (p>>N)

Tags: elastic-net, glmnet, r, regularization, underdetermined

I would like to use GLM and Elastic Net to select the relevant features and build a linear regression model (i.e., both prediction and understanding, so it would be better to end up with relatively few parameters). The output is continuous. There are $20000$ genes per $50$ cases. I've been reading about the glmnet package, but I'm not 100% sure about the steps to follow:
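For concreteness, this is the data layout assumed by the code below (x and y are the names used throughout):

  # x must be a numeric matrix for glmnet: one row per case, one column per gene
  dim(x)     # 50 20000
  length(y)  # 50  -- the continuous outcome, one value per case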

  1. Perform CV to choose lambda:
    cv <- cv.glmnet(x,y,alpha=0.5)
    (Q1) given the input data, would you choose a different alpha value?
    (Q2) do I need to do anything else before building the model?

  2. Fit the model:
    model=glmnet(x,y,type.gaussian="covariance",lambda=cv$lambda.min)
    (Q3) anything better than "covariance"?
    (Q4) If lambda was chosen by CV, why does this step need nlambda=?
    (Q5) is it better to use lambda.min or lambda.1se?

  3. Obtain the coefficients, to see which parameters have fallen out ("."):
    predict(model, type="coefficients")

    In the help page there are many predict methods (e.g., predict.fishnet, predict.glmnet, predict.lognet, etc.), but no "plain" predict like the one I saw in an example.
    (Q6) Should I use predict or predict.glmnet or other?

Despite what I've read about regularization methods, I'm quite new to R and to these statistical packages, so it's difficult to be sure I'm mapping my problem onto the code correctly. Any suggestions are welcome.

UPDATE
Based on "As previously noted, an object of class train contains an element called finalModel, which is the fitted model with the tuning parameter values selected by resampling. This object can be used in the traditional way to generate predictions for new samples, using that model's predict function."

Using caret to tune both alpha and lambda:

  trC <- trainControl(method = "cv", number = 10)
  fitM <- train(x, y, trControl = trC, method = "glmnet")

Does fitM replace the previous step 2? If so, how do I specify the glmnet options (type.gaussian="naive", lambda=cv$lambda.min/1se) now?
And in the following predict step, can I replace model with fitM?

If I do

  trC <- trainControl(method = "cv", number = 10)
  fitM <- train(x, y, trControl = trC, method = "glmnet")
  predict(fitM$finalModel, type = "coefficients")

does it make sense at all, or am I incorrectly mixing the two packages' vocabularies?

Best Answer

Part 1

In the elastic net, two types of constraints on the parameters are employed:

  1. Lasso constraints (i.e. on the sum of the absolute values of $\beta_j$)
  2. Ridge constraints (i.e. on the sum of the squared values of $\beta_j$)

$\alpha$ controls the relative weighting of the two types. The Lasso constraints allow for the selection/removal of variables in the model. The ridge constraints can cope with collinear variables. Which you put more weight upon will depend on the properties of the data: many correlated variables may need both constraints, while a few correlated variables might suggest more emphasis on the ridge constraints.
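Concretely, for a Gaussian response glmnet minimizes a penalized least-squares objective in which $\alpha$ interpolates between the two penalties:

$$\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^\top\beta\right)^2 + \lambda\left[\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]$$

so $\alpha = 1$ is the pure lasso, $\alpha = 0$ is pure ridge, and $\alpha = 0.5$ weights the two penalties equally.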

One way to solve this is to treat $\alpha$ as a tuning parameter alongside $\lambda$ and use the values that give the lowest CV error, in the same way that you are tuning over $\lambda$ at the moment with cv.glmnet.

The R package caret can build models using the glmnet package, and you can set it up to tune over both $\alpha$ and $\lambda$, as sketched below.
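A minimal sketch of that joint tune (the alpha/lambda grid values here are illustrative assumptions, not recommendations):

  library(caret)

  # 10-fold CV over a grid of both tuning parameters
  trC  <- trainControl(method = "cv", number = 10)
  grid <- expand.grid(alpha  = seq(0, 1, by = 0.25),
                      lambda = 10^seq(-3, 1, length.out = 20))

  fitM <- train(x, y, method = "glmnet", trControl = trC, tuneGrid = grid)
  fitM$bestTune  # the CV-selected alpha and lambda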

Part 2

Q3

Yes; in this case, where $m \gg n$ (number of variables $\gg$ number of observations), the help page for ?glmnet suggests using

type.gaussian = "naive"

Instead of storing all the inner products computed along the way, which can be inefficient with a large number of variables or when $m \gg n$, the "naive" option loops over $n$ each time it needs to compute an inner product.

If you had not specified this argument, glmnet would have chosen "naive" anyway since $m > 500$, but it is better to specify it explicitly in case the defaults and options of the package change and you are running the code at a future date.
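In code that just means passing the argument explicitly (keeping the $\alpha = 0.5$ from the question):

  # Explicitly request the "naive" inner-product updates for p >> n
  model <- glmnet(x, y, family = "gaussian", alpha = 0.5,
                  type.gaussian = "naive")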

Q4

Short answer: you don't need to specify a high value for nlambda now that you have chosen an optimal value, conditional on $\alpha = 0.5$. However, if you want to plot the coefficient paths etc., then having a modest set of values of $\lambda$ over the interval results in a much nicer set of paths. The computational burden of fitting the entire path relative to a single specific $\lambda$ is not that great; this is the result of a lot of effort put into developing algorithms that do this job efficiently. I would just leave nlambda at its default, unless it makes an appreciable difference in compute time.
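For example, fitting the full path at the default nlambda and plotting the coefficient profiles is cheap:

  # Fit the entire regularization path (default nlambda = 100)
  path <- glmnet(x, y, alpha = 0.5)
  plot(path, xvar = "lambda", label = TRUE)   # coefficients vs log(lambda)
  abline(v = log(cv$lambda.min), lty = 2)     # mark the CV-chosen lambda from step 1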

Q5

This is a question about parsimony. lambda.min is the value of $\lambda$ at the lowest CV error. The error at this value of $\lambda$ is the average of the errors over the $k$ folds, and hence this estimate of the error is uncertain. lambda.1se is the largest value of $\lambda$ (hence the simplest model) whose CV error is within one standard error of that of the best model (lambda.min). In other words, selecting lambda.1se results in a model that is slightly simpler than the best model but which cannot be distinguished from the best model in terms of error, given the uncertainty in the $k$-fold CV estimate of the error of the best model.

The choice is yours:

  1. The best model, which may be too complex or slightly overfitted: lambda.min
  2. The simplest model that has comparable error to the best model given the uncertainty: lambda.1se
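Both values can be read straight off the cv.glmnet object, so it is easy to compare the two resulting models:

  # Coefficients at the two candidate choices of lambda
  coef(cv, s = "lambda.min")  # lowest CV error
  coef(cv, s = "lambda.1se")  # simplest model within 1 SE of the best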

Part 3

This is a simple one and is something you'll come across a lot with R. You use the predict() function 99.9% of the time. R will arrange for the use of the correct function for the object supplied as the first argument.

More technically, predict is a generic function, which has methods (versions of the function) for objects of different types (technically known as classes). The object created by glmnet has a particular class (or classes) depending on what type of model is actually fitted. glmnet (the package) provides methods for the predict function for these different types of objects. R knows about these methods and will choose the appropriate one based on the class of the object supplied.
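You can see the dispatch for yourself; a Gaussian glmnet fit carries the classes shown in the comments:

  class(model)  # "elnet" "glmnet" for a Gaussian fit
  # The generic predict() dispatches to the glmnet method; coef() is equivalent here
  predict(model, type = "coefficients", s = cv$lambda.min)
  coef(model, s = cv$lambda.min)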
