Solved – Lasso logistic cross validated error

glmnetlassologisticregression

I fitted a lasso logistic regression using glmnet. I use a pretty small dataset with only 51 (28/23) observations. I want to compare the model fit of two possible variable combinations.

Only control variables
Control variables + linguistic predictors

Both models are comparable regarding explained deviance with best lambdas (1.:17% | 2.:16% dev. explained from null model).

Now I want also compare the mean cross validated error at the best lambdas. Again both models are pretty close (1.: 1.304177 | 2.: 1.324639).

My questions are:

1.) What exactly measures this score? Is it RMSE as measured in linear regression?

2.) From a predictive perspective: Is such a score either good or bad? (I would guess it is not the best predicitve model on earth)

3.) What would a good score look like?

Best Answer

1) For logistic regression use type.measure="class" or "auc" depending on whether it is a binomial or a multinomial classification.

2) Plot the two models using a ROC curve (use ROCR package) and compare the area under the curve as shown below.

3) A good score would depend on the baseline you are comparing with. If your baseline is random guessing you are comparing against the purple line.

Related Solutions

Solved – Performing Cross Validation to Compare Lasso and Other Regression Models in R

I'm not entirely sure I understand precisely where in the analysis pipeline your question is, but I think I can address it by walking through the steps you'll want to take. The software portion of your question is off-topic on CV, but the questions about CV are on-topic, so I'll answer those.

My question is: is it technically proper CV to determine the overall CV error by averaging the error on each fold given that the lambda chosen for each fold will be producing a different lasso result?

The elementary model development process is usually presented with respect to three partitions of your whole data set: train, test and validate. Training and test data are used together to tune model hyperparameters. Validation data is used to assess the performance of alternative models against data that wasn't used in model construction. The notion is that this is representative of new data that the model might encounter.

A slightly more sophisticated elaboration on this process is nested cross-validation. This is preferred because, across the whole process, all data is eventually used in testing and training the model. Instead of using one partitioning of the data, you can do CV partitioning on the whole data set (the outer partition) and then again on the data left over when you hold out one of the outer partitions (the inner set). Here, you tune model hyperparameters on the inner set and have out-of-sample performance evaluated on the outer holdout set. The final model is prepared by composing a final partition over the entire data set, using CV to select a final tuple of hyperparameters and then, at last, estimating a single model on all available data given that selected tuple. In this way, the model building process kind of telescopes on itself, collapsing CV steps as we estimate the final model.

It doesn't matter that alternative inner sets might give you different $\lambda_\text{min}$. What you're characterizing with your out-of-sample performance metrics is the model selection process itself. At the end of the day, you'll still only estimate one model, and that's the value of $\lambda_\text{min}$ that you care about. In the preceeding steps, you don't need to know the particular value of $\lambda_\text{min}$ except as a means to achieve out-of-sample estimates.

While I know that there is some discussion about using stepwise regression, I have used the stepAIC function to prune my variable set.

This is a bit of an understatement: it's not a discussion, it's a consensus that stepwise results are dubious. If you're fitting a lasso anyway, you can get statistically valid model by omitting the stepwise regression step from your analysis. Moreover, since the lasso step won't "see" the stepwise step, your results will have too-narrow error bands and cross-validation results will be irreparably biased. And lasso makes the entire stepwise step pointless anyway, because they solve the same problem! Lasso solves all of the variable selection problems that stepwise attempts to while avoiding the wealth of widely-accepted criticisms of stepwise strategies. There's no downside to using lasso on its own in this case. I'm convinced the only reason stepwise methods are included in R is for pedagogical reasons, and so that the functionality is available should someone need to demonstrate why it's hazardous.

Solved – Why doesn’t adding variables to the glmnet lasso model improve fit

First off, it looks like this is a classification problem, so make sure to have the type.measure option set to class, as such:

fit2=cv.glmnet(x[1:test,], y[1:test], type.measure = "class")

Remember that the Lasso loss function we try to minimize is the sum of the squared residuals plus lambda*(sum of the absolute value of the coefficient magnitudes, excluding the intercept). So, if you are comparing a lambda value for both models, they will keep approximately the same number of variables and similar magnitudes, because the cost for a large coefficient value is similar between the two models. However, when adding 2000 variables for which you want to include some in the model while also keeping your original significant variables, you need to adjust for a lower lambda, to be more inclusive.

If the some of the variables you are including are indeed significant, then the reason why your fit2 does not fit as well as fit1 is because the 2000 variables you are introducing may be valuable in predicting y, but not AS valuable as the variables in fit1. So, if the lambdas for both models are similar, the difference will be in sometimes including variables of the 2000 that are good but not as good as some of the originals for which they are replacing (but appear more important in the Lasso algorithm due to your training sample being slightly different than the population as a whole). With so many new variables added, the probability of randomly sampling where at least 1 of them appears more significant than it should be is high. In a shrinkage algorithm like the Lasso, this could seriously affect the results. Additionally, if some of the significant variables are highly correlated, then in a random sample some could go to zero if the correlated variable is more prevalent in the sample than in the population.

So, it is likely you want to change to the class measure if you haven't already, but besides that, it could be that the search for lambda is not including a small enough value for the fit2 model. Consider creating your own grid for lambda and and run the cv.glmnet with that grid. Here is an example version you can use:

grid = 10^seq(10, -2, length=100)
fit2=cv.glmnet(x[1:test,], y[1:test], type.measure = "class", lambda = grid)

Best Answer

Related Solutions

Solved – Performing Cross Validation to Compare Lasso and Other Regression Models in R

Solved – Why doesn’t adding variables to the glmnet lasso model improve fit

Related Question