Solved – Coefficients and significance of lasso/ridge

Tags: lasso, logistic regression, coefficients, statistical significance


I had 628 predictors after creating dummy variables for all of the categorical variables. After running many iterations of traditional logistic regression, I came across 15 variables that gave me a pretty good model, with good ROC, recall and precision (at a certain cut-off) on the test data, and all of the variables were significant (at p <= 0.05). Since that took a lot of time, I tried the lasso, which gave me 50 variables with non-zero coefficients after choosing the best lambda value from 10-fold cross-validation. However, only 5 variables were common to the 15 from the traditional method and the 50 from the lasso. Moreover, when I tried to calculate their SEs and t-statistics, I found that many of the variables were insignificant (low t-statistics and high p-values). In addition, the AUC of the ROC curve was lower than for the traditional method, and it dropped even further when I ran traditional logistic regression on the 50 variables selected by the lasso. Can someone help me understand the dynamics here, and how I can justify the coefficients of the lasso model given that they are penalized? (I normalized all the variables before using the lasso.)

Best Answer

The assumptions behind the standard single-variable t-tests and p-values don't really hold after you have fit the first model, so you should not trust p-values from stepwise-style analyses.

Also, think about how many different tests you calculated: did you adjust for multiple comparisons across all of the tests that you did? (Some argue that you should also adjust for tests you might have done had earlier tests given different results.) Combined with the point above, this suggests that your final 15 "significant" p-values are probably mostly meaningless.
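
As a rough back-of-the-envelope sketch (numbers only, nothing here comes from your data): with 628 unrelated predictors each tested at alpha = 0.05 you would expect about 31 chance "hits", and a Bonferroni-type adjustment pushes the per-test threshold far lower.

    alpha <- 0.05
    m     <- 628                 # number of candidate predictors / tests
    alpha * m                    # expected chance "significant" results: ~31
    alpha / m                    # Bonferroni-adjusted per-test threshold: ~8e-05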

Try doing what you did with simulated data and no relationship (i.e. create 628 random predictors similar to your original variables, then randomly generate your response variable without relating it to any of the predictors). Or you can simply permute your response variable so that it no longer corresponds to the predictors in any meaningful way. Now go through your same procedure and you will probably still find many variables to be "significant" (actually I am a bit surprised that you only found 15 when you started with 628).
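
A minimal sketch of the permutation version of this check, assuming your dummy-coded predictors sit in a data frame dat and your 0/1 response is in y (both names are placeholders), and using a simple single-variable screen as a stand-in for whatever selection procedure you actually ran:

    set.seed(42)
    y_perm <- sample(y)                      # permuted response: no real relationship left
    pvals  <- sapply(dat, function(xj)
      summary(glm(y_perm ~ xj, family = binomial))$coefficients[2, 4])
    sum(pvals <= 0.05)                       # typically dozens of chance "hits" out of 628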

How did you compute your ROC and the other scores? Did you use the same data for fitting and for evaluation? If so, they show the best possible result, probably due to overfitting. The cross-validation used with the lasso tries to avoid overfitting by evaluating on data separate from the data used for fitting. Unless you do the same thing with your initial analysis, the comparison is not fair.

The lasso approach and the p-values are answering very different questions (so trying to compute p-values on the lasso results is really meaningless). The p-values test whether a single variable (all or nothing) contributes above and beyond the other variables in the model. So if x1 and x2 are correlated with each other, it may be that one or the other is significant by itself, but that they do not both need to be in the model. The lasso (or another penalized approach), instead of saying all or nothing to each variable, may find that a linear combination of x1 and x2 (a weighted average) gives a better prediction than either alone; that is a very different answer to a very different question than the one the p-values address. If you then run standard p-values on x1 and x2, you are ignoring the value that the lasso found and asking the all-or-nothing question again (with some of the assumptions behind the p-values no longer holding).
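
Here is a small made-up simulation of that x1/x2 situation (all names and numbers are purely illustrative, using glmnet for the lasso): either variable is clearly significant on its own, the joint unpenalized fit inflates both standard errors because of the collinearity, and the lasso instead keeps a weighted combination of the two.

    library(glmnet)

    set.seed(7)
    n  <- 500
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n, sd = 0.3)                       # strongly correlated with x1
    y  <- rbinom(n, 1, plogis(x1 + x2))                 # both carry the signal

    summary(glm(y ~ x1, family = binomial))$coefficients       # x1 alone looks clearly significant
    summary(glm(y ~ x1 + x2, family = binomial))$coefficients  # jointly, the SEs are inflated by collinearity

    cvfit <- cv.glmnet(cbind(x1 = x1, x2 = x2), y, family = "binomial", alpha = 1)
    coef(cvfit, s = "lambda.min")                       # lasso spreads weight across x1 and x2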

The only really meaningful comparison would be to use both methods on a "training" set of data, then compare their predictions on a completely separate "test" set of data.
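
A minimal sketch of that comparison, assuming x is the full (normalized) model matrix with column names, y the 0/1 response, and vars15 the 15 columns from your earlier selection (all placeholder names), with the pROC package used only to get the AUC:

    library(glmnet)
    library(pROC)

    set.seed(1)
    train <- sample(nrow(x), floor(0.7 * nrow(x)))      # 70/30 split of the rows

    ## Fit both models on the training rows only.
    df_train  <- data.frame(y = y[train], x[train, vars15])
    fit_glm   <- glm(y ~ ., data = df_train, family = binomial)
    fit_lasso <- cv.glmnet(x[train, ], y[train], family = "binomial", alpha = 1)

    ## Score both on the untouched test rows and compare held-out AUCs.
    p_glm   <- predict(fit_glm, newdata = data.frame(x[-train, vars15]), type = "response")
    p_lasso <- predict(fit_lasso, newx = x[-train, ], s = "lambda.min", type = "response")

    auc(roc(y[-train], as.numeric(p_glm)))              # 15-variable logistic regression
    auc(roc(y[-train], as.numeric(p_lasso)))            # lasso at lambda.min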

Edit

This is in answer to the questions in the "Answer" below.

The usual tests of significance rely on assumptions of representativeness and independence. These often hold when you fit your first model and do the regular tests, but once you fit a second model inspired by the first, those assumptions no longer hold as well; the more selection you do, the more those assumptions are violated and the more biased the results will be. You saw with your own simulation that it is easy to get "significant" terms when there are none. Also see the second example in my answer here: R-code question: model selection based on individual significance in regression?

It is not clear to me what else you did, but it looks like at one stage you took the variables from the lasso fit whose coefficients were not 0 and used them in a new regression model. Don't do that! If you are going to use the lasso, then use the coefficient estimates from the lasso fit.
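
In glmnet terms that means something like the sketch below, where cvfit is the cv.glmnet object and x_new is a matrix of new rows to score (both names are placeholders): keep the penalized coefficients and predict with them directly, rather than refitting an unpenalized logistic regression on the selected variables.

    coef(cvfit, s = "lambda.min")       # the penalized coefficients, many of them exactly 0
    p_hat <- predict(cvfit, newx = x_new, s = "lambda.min", type = "response")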

Since you want to judge models based on the ROC curve, have you tried different penalties with the lasso fit to see how they affect the ROC curve?
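
As a sketch of how you might look at that, assuming a training/test split with x_train, y_train, x_test, y_test (placeholder names) and the pROC package for the AUC, you can trace the held-out AUC along the whole penalty path:

    library(glmnet)
    library(pROC)

    fit   <- glmnet(x_train, y_train, family = "binomial", alpha = 1)    # full lambda path
    preds <- predict(fit, newx = x_test, type = "response")              # one column per lambda value
    aucs  <- apply(preds, 2, function(p) as.numeric(auc(roc(y_test, p))))

    plot(log(fit$lambda), aucs, type = "l",
         xlab = "log(lambda)", ylab = "held-out AUC")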

How many times did you use the test data in your initial model fit?

Edit 2

Maybe this exercise will help you understand.

Imagine that we have several dice (say 10), all 6-sided and fair (not biased), each a different color so they are easy to tell apart.

Scenario 1:

Now we roll all the dice. What is the probability that the green die shows a 1? that the green die shows a 6? Do these probabilities change if I tell you that the red die is showing a 3?

Scenario 2:

Again roll all of the dice, but this time remove the die that shows the highest value (if there are ties, flip a coin or similar to decide which one to remove). Assuming that the green die was not removed, what is the probability that the green die shows a 1? That it shows a 6? Now continue removing, from those remaining, the die that shows the highest value (again breaking ties randomly, and without rerolling; just take the highest remaining). Do this until there are only 5 dice left. Now, assuming the green die is among the remaining dice (or just choose a different color randomly from the remaining dice), what is the probability that it shows a 6? A 1? Continue removing the die with the highest number until there is only 1 die left. What is the probability that the remaining die shows a 6? A 1?

The first time we fit a regression model is like scenario 1, but once you start doing stepwise regression you move into scenario 2, while the computations still assume that you are in scenario 1. You should be able to see (with the dice at least) that the last remaining die in scenario 2 is not representative of choosing a die in scenario 1.
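
If it helps to see the exercise in numbers, here is a quick simulation of both scenarios (ties in scenario 2 are broken by position rather than a coin flip, which does not change the point):

    set.seed(123)

    ## Scenario 1: roll 10 fair dice and look at one pre-chosen ("green") die.
    green <- replicate(10000, sample(1:6, 10, replace = TRUE)[1])
    mean(green == 6)     # close to 1/6
    mean(green == 1)     # close to 1/6

    ## Scenario 2: roll 10 dice, repeatedly discard the highest remaining die
    ## until only one is left, then look at the survivor.
    survivor <- replicate(10000, {
      rolls <- sample(1:6, 10, replace = TRUE)
      for (i in 1:9) rolls <- rolls[-which.max(rolls)]   # drop the highest, nine times
      rolls                                              # the single remaining die
    })
    mean(survivor == 6)  # essentially 0
    mean(survivor == 1)  # far above 1/6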