Solved – How to report RMSE of Lasso using glmnet in R

Tags: glmnet, lasso, mse, prediction, predictive-models

So I'm confused about reporting RMSE (root mean squared error) as a metric of model accuracy when using glmnet.

Specifically, do I report the RMSE of the model itself (i.e., how it performs with the training data used to create it) or do I report the RMSE of the model's performance with new data (aka test data)? …Or both?

I guess I'm also confused as to whether the cross-validation performed by the cv.glmnet function (see below) is all I need for estimating model accuracy, or whether an additional evaluation on a separate test data set is even necessary? …


Context:

When I run cv.glmnet, the cross-validation version of the glmnet function in R, it produces a graph showing the MSE (mean squared error) of the models fitted at varying values of lambda (the "regularization parameter").

[Figure: glmnet lambda vs. MSE]

The MSE values are stored under $cvm.
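
For concreteness, here is a minimal sketch of what I'm running (x and y stand in for my own predictor matrix and response; the gaussian default applies, and cv.glmnet uses 10-fold CV unless told otherwise):

    library(glmnet)

    set.seed(1)
    mod <- cv.glmnet(x, y)   # cross-validation over a whole path of lambda values

    plot(mod)                # the MSE vs. log(lambda) curve shown above
    head(mod$cvm)            # cross-validated MSE, one entry per value in mod$lambda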

Now, I can take the square root of the cross-validated MSE at any of those lambda values to calculate an RMSE.

  • In my case, I follow the one-standard-error rule and choose "lambda.1se" (associated with the dotted line above), producing sqrt(mod$cvm[mod$lambda == mod$lambda.1se]).
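
In code, continuing the sketch above (mod being the cv.glmnet fit):

    # Cross-validated RMSE at the lambda picked by the one-standard-error rule
    cv_rmse_1se <- sqrt(mod$cvm[mod$lambda == mod$lambda.1se])
    cv_rmse_1se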

HOWEVER…

Is this RMSE value even interesting to me?

I assume I should instead report the RMSE of the model when used to predict new values for my test data.

  • Is this true?

  • If so, is the best way to do this simply to calculate new values using predict and then compare them to the actual values from the test data using the following equation?

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Am I thinking about all this correctly??


As a follow up:

How do I approach calculating and reporting RMSE if I lack a test data set and instead have to use cross-validation of my available data?

  • Is that cross-validation procedure separate from the one performed in the cv.glmnet function?

Best Answer

Specifically, do I report the RMSE of the model itself (i.e., how it performs with the training data used to create it) or do I report the RMSE of the model's performance with new data (aka test data)? ...Or both?

These are called training error and test error, respectively. It's useful to report both, but test error is more important, presuming your interest is in the predictive accuracy of the model. Training error is, in general, an optimistically biased estimate of true error for the entire population, because of overfitting.

If so, is the best way to do this simply to calculate new values using predict and then compare them to the actual values from the test data using the following equation?

Yes.
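
For example, a minimal sketch (assuming a fitted cv.glmnet object mod, a test predictor matrix x_test, and a test response y_test; the names are illustrative):

    # Predict on the held-out test set at the chosen penalty
    pred <- predict(mod, newx = x_test, s = "lambda.1se")

    # Test RMSE: square root of the mean squared residual
    rmse_test <- sqrt(mean((y_test - pred)^2))
    rmse_test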

How do I approach calculating and reporting RMSE if I lack a test data set and instead have to use cross-validation of my available data?

Pretty much the same way. The catch is that you also need to use cross-validation to choose the lasso penalty. The way to handle this is nested cross-validation: inside each fold of the outer cross-validation loop, run another round of cross-validation on the training portion to choose the lasso penalty, then compute the RMSE on the fold that was held out. The inner loop is exactly what cv.glmnet already does, so yes, the outer loop is separate from the one performed inside cv.glmnet.
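
A minimal sketch of that nested procedure (the fold count, the x and y names, and the use of lambda.1se are assumptions, not requirements):

    library(glmnet)

    set.seed(1)
    K <- 5                                           # number of outer folds (assumed)
    folds <- sample(rep(1:K, length.out = nrow(x)))  # random outer-fold assignment

    fold_rmse <- numeric(K)
    for (k in 1:K) {
      train <- folds != k
      test  <- folds == k

      # Inner cross-validation (cv.glmnet's own loop) chooses the penalty
      cv_inner <- cv.glmnet(x[train, ], y[train])

      # Evaluate at that penalty on the outer held-out fold
      pred <- predict(cv_inner, newx = x[test, ], s = "lambda.1se")
      fold_rmse[k] <- sqrt(mean((y[test] - pred)^2))
    }

    mean(fold_rmse)   # cross-validated estimate of test RMSE

The mean of fold_rmse (ideally along with its spread across folds) is what you would report as the cross-validated RMSE.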