Solved – How to report RMSE of Lasso using glmnet in R

Tags: glmnet, lasso, mse, prediction, predictive-models

So I'm confused about reporting RMSE (root mean squared error) as a metric of model accuracy when using glmnet.

Specifically, do I report the RMSE of the model itself (i.e., how it performs with the training data used to create it) or do I report the RMSE of the model's performance with new data (aka test data)? …Or both?

I guess I'm also confused as to whether the cross-validation performed by the cv.glmnet function (see below) is all I need for estimating model accuracy, or whether an additional evaluation on a separate test data set is even necessary? …


Context:

When I run cv.glmnet, the cross-validation version of the glmnet function in R, it produces a graph showing the MSE (mean squared error) of the models fitted at varying values of lambda (the "regularization parameter").

[Figure: glmnet lambda vs. MSE]

The MSE values are stored under $cvm.
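
For concreteness, here is a minimal sketch of what I'm running (x and y stand in for my own predictor matrix and response; the gaussian default applies, and cv.glmnet uses 10-fold CV unless told otherwise):

    library(glmnet)

    set.seed(1)
    mod <- cv.glmnet(x, y)   # cross-validation over a whole path of lambda values

    plot(mod)                # the MSE vs. log(lambda) curve shown above
    head(mod$cvm)            # cross-validated MSE, one entry per value in mod$lambda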

Now, I can take the square root of the cross-validated MSE at any of those lambda values to calculate an RMSE.

  • In my case, I follow the one-standard-error rule and choose "lambda.1se" (associated with the dotted line above), producing sqrt(mod$cvm[mod$lambda == mod$lambda.1se]).
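
In code, continuing the sketch above (mod being the cv.glmnet fit):

    # Cross-validated RMSE at the lambda picked by the one-standard-error rule
    cv_rmse_1se <- sqrt(mod$cvm[mod$lambda == mod$lambda.1se])
    cv_rmse_1se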

HOWEVER…

Is this RMSE value even interesting to me?

I assume I should instead report the RMSE of the model when used to predict new values for my test data.

  • Is this true?

  • If so, is the best way to do this simply to calculate new values using predict and then compare them to the actual values from the test data using the following equation?

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Am I thinking about all this correctly??


As a follow up:

How do I approach calculating and reporting RMSE if I lack a test data set and instead have to use cross-validation of my available data?

  • Is that cross-validation procedure separate from the one performed in the cv.glmnet function?

Best Answer

Specifically, do I report the RMSE of the model itself (i.e., how it performs with the training data used to create it) or do I report the RMSE of the model's performance with new data (aka test data)? ...Or both?

These are called training error and test error, respectively. It's useful to report both, but test error is more important, presuming your interest is in the predictive accuracy of the model. Training error is, in general, an optimistically biased estimate of true error for the entire population, because of overfitting.

If so, is the best way to do this simply to calculate new values using predict and then compare them to the actual values from the test data using the following equation?

Yes.
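
For example, a minimal sketch (assuming a fitted cv.glmnet object mod, a test predictor matrix x_test, and a test response y_test; the names are illustrative):

    # Predict on the held-out test set at the chosen penalty
    pred <- predict(mod, newx = x_test, s = "lambda.1se")

    # Test RMSE: square root of the mean squared residual
    rmse_test <- sqrt(mean((y_test - pred)^2))
    rmse_test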

How do I approach calculating and reporting RMSE if I lack a test data set and instead have to use cross-validation of my available data?

Pretty much the same way. The catch is that you also need to use cross-validation to choose the lasso penalty. The way to handle this is nested cross-validation: inside each fold of the outer cross-validation loop, run another round of cross-validation on the training portion to choose the lasso penalty, then compute the RMSE on the fold that was held out. The inner loop is exactly what cv.glmnet already does, so yes, the outer loop is separate from the one performed inside cv.glmnet.
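
A minimal sketch of that nested procedure (the fold count, the x and y names, and the use of lambda.1se are assumptions, not requirements):

    library(glmnet)

    set.seed(1)
    K <- 5                                           # number of outer folds (assumed)
    folds <- sample(rep(1:K, length.out = nrow(x)))  # random outer-fold assignment

    fold_rmse <- numeric(K)
    for (k in 1:K) {
      train <- folds != k
      test  <- folds == k

      # Inner cross-validation (cv.glmnet's own loop) chooses the penalty
      cv_inner <- cv.glmnet(x[train, ], y[train])

      # Evaluate at that penalty on the outer held-out fold
      pred <- predict(cv_inner, newx = x[test, ], s = "lambda.1se")
      fold_rmse[k] <- sqrt(mean((y[test] - pred)^2))
    }

    mean(fold_rmse)   # cross-validated estimate of test RMSE

The mean of fold_rmse (ideally along with its spread across folds) is what you would report as the cross-validated RMSE.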