LASSO Regression – Interpretation of Regression Coefficients

Tags: interpretation, lasso, multiple regression, predictive-models, regression coefficients

I'm currently working on building a predictive model for a binary outcome on a dataset with ~300 variables and 800 observations. I've read much on this site about the problems associated with stepwise regression and why not to use it.

I've been reading into LASSO regression and its ability for feature selection and have been successful in implementing it with the use of the "caret" package and "glmnet".

I am able to extract the coefficient of the model with the optimal lambda and alpha from "caret"; however, I'm unfamiliar with how to interpret the coefficients.

Are the LASSO coefficients interpreted in the same method as logistic regression?
Would it be appropriate to use the features selected from LASSO in logistic regression?

EDIT

By interpretation of the coefficients I mean: can the exponentiated coefficients from the LASSO regression be read as odds ratios for a 1-unit change in the predictor while holding all other predictors constant?

https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/

Best Answer

Are the LASSO coefficients interpreted in the same method as logistic regression?

Let me rephrase: Are the LASSO coefficients interpreted in the same way as, for example, maximum likelihood coefficients in a logistic regression?

LASSO (a penalized estimation method) aims at estimating the same quantities (the model coefficients) as, say, maximum likelihood (an unpenalized method). The model is the same, and the interpretation remains the same. The numerical values from LASSO will normally differ from those from maximum likelihood: some will be closer to zero, others will be exactly zero. If a sensible amount of penalization has been applied, the LASSO estimates will tend to lie closer to the true values than the maximum likelihood estimates, which is a desirable result.
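As a minimal sketch of this interpretation, assuming the glmnet package is installed and using simulated data (the predictor names `x1`, `x2`, ... and the effect sizes are illustrative, not taken from the question): the lasso coefficients are on the log-odds scale, so exponentiating a nonzero coefficient gives an odds ratio for a one-unit increase in that predictor with the other predictors held fixed.

```r
library(glmnet)

set.seed(1)
n <- 800
p <- 30   # far fewer than the ~300 variables in the question, to keep this fast
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("x", seq_len(p))))

# Simulation assumption: only x1 and x2 affect the outcome
y <- rbinom(n, size = 1, prob = plogis(0.8 * x[, "x1"] - 0.6 * x[, "x2"]))

# alpha = 1 is the lasso penalty; lambda is chosen by cross-validation
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the selected lambda, as a named numeric vector (log-odds scale)
beta <- as.matrix(coef(cv_fit, s = "lambda.min"))[, 1]

# Keep the nonzero slopes and exponentiate them to get odds ratios
selected <- beta[beta != 0 & names(beta) != "(Intercept)"]
odds_ratios <- exp(selected)
round(odds_ratios, 3)
```

Because of the shrinkage, these odds ratios will usually sit closer to 1 than the ones an unpenalized logistic regression would report.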

Would it be appropriate to use the features selected from LASSO in logistic regression?

There is no inherent problem with that, but you could use LASSO not only for feature selection but also for coefficient estimation. As mentioned above, LASSO estimates may be more accurate than, say, maximum likelihood estimates.
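The two options can be compared side by side. The following is only a sketch on simulated data, assuming the glmnet package; the names `keep`, `lasso_est` and `refit_est` are made up for illustration.

```r
library(glmnet)

set.seed(2)
n <- 800
p <- 30
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("x", seq_len(p))))
# Simulation assumption: only x1 and x2 affect the outcome
y <- rbinom(n, size = 1, prob = plogis(0.8 * x[, "x1"] - 0.6 * x[, "x2"]))

cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
beta_lasso <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]

# Features with a nonzero lasso coefficient
keep <- setdiff(names(beta_lasso)[beta_lasso != 0], "(Intercept)")

# Option 1: use the shrunken lasso estimates directly
lasso_est <- beta_lasso[keep]

# Option 2: refit an unpenalized logistic regression on the selected features
refit <- glm(y ~ ., family = binomial,
             data = data.frame(y = y, x[, keep, drop = FALSE]))
refit_est <- coef(refit)[keep]

round(cbind(lasso = lasso_est, refit = refit_est), 3)
```

If you refit, keep in mind that the standard errors and p-values printed by `summary(refit)` ignore the selection step, because the same data were used to choose the features.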

Related Solutions

Solved – Categorical variables in LASSO regression

Factor variables in R and other software are automatically expanded into several binary indicator (dummy) variables. So for instance, if I create a variable

n <- 100
dayn <- sample(1:7, n, replace = TRUE)
dayf <- factor(dayn, levels = 1:7, labels = c('Sun', 'Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat'))

and I analyze it in a linear regression model, the model automatically creates the binary indicator variables, taking "Sun" as the reference level. Each coefficient compares one day of the week with Sunday. Sunday vs Sunday is redundant, so it is dropped.

For instance:

mm <- model.matrix(~dayf)
head(mm)

Gives me:

> head(mm)
  (Intercept) dayfMon dayfTues dayfWeds dayfThurs dayfFri dayfSat
1           1       1        0        0         0       0       0
2           1       0        1        0         0       0       0
3           1       0        0        0         0       1       0
4           1       0        0        0         0       1       0
5           1       0        0        1         0       0       0
6           1       1        0        0         0       0       0

Suppose further I had an outcome variable which is Poisson distributed, yet I analyze it with a linear regression model because I can:

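    # The labels must match the factor levels defined above ('Mon', 'Tues'),
    # otherwise %in% is always FALSE and no day has an elevated rate
    sickdays <- rpois(n, lambda = exp(1 + 2 * (dayf %in% c('Mon', 'Tues'))))
    boxplot(sickdays ~ dayf)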

Now if my hypothesis is "Does day of the week affect the number of people taking sick days?", an appropriate test may be a single 6-degree-of-freedom test of whether there is any statistically significant difference in mean sick days among the days of the week. Note that I am not concerned with exactly which day is affected. The regression model gives me 6 separate coefficients:

library(lmtest)
big.model <- lm(sickdays ~ dayf)
summary(big.model)
null.model <- lm(sickdays ~ 1)
lrtest(big.model, null.model)

Depending on your seed, the likelihood ratio test may or may not be significant, and the 6 separate Wald tests may or may not be significant. The problem with the 6 separate Wald tests is multiple testing: six tests are run to answer a single question.
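The same contrast between one joint test and six separate tests can be sketched in base R alone. The seed and the Holm adjustment below are illustrative choices, not part of the original answer.

```r
set.seed(42)   # arbitrary seed so the sketch is reproducible
n <- 100
dayf <- factor(sample(1:7, n, replace = TRUE), levels = 1:7,
               labels = c("Sun", "Mon", "Tues", "Weds", "Thurs", "Fri", "Sat"))
sickdays <- rpois(n, lambda = exp(1 + 2 * (dayf %in% c("Mon", "Tues"))))

big.model  <- lm(sickdays ~ dayf)
null.model <- lm(sickdays ~ 1)

# One joint F test of all six day-of-week coefficients at once
joint <- anova(null.model, big.model)
joint

# The six separate Wald (t) tests, one per day vs Sunday
wald_p <- summary(big.model)$coefficients[-1, "Pr(>|t|)"]

# One conventional way to account for running six tests: Holm adjustment
p.adjust(wald_p, method = "holm")
```

The joint test answers the "does day of week matter at all?" question directly, while the adjusted Wald tests address "which day differs from Sunday?".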

This relates to LASSO because with factors we do not hypothesize that separate levels may be predictive. So we either include all factor levels as a "feature" or not.

As a reminder, LASSO does feature selection. What is a feature? In a regression model, the particular comparison "Tuesday vs Sunday" or "Friday vs Sunday" is not a feature. The whole factor dayf, with its 6 indicator coefficients, is the feature. So for model selection, it should be all or nothing: either all 6 indicators are included, along with their penalization, or all are excluded.
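One caveat is worth checking in your own software. An ordinary lasso fit on a dummy-coded design matrix, as in glmnet, penalizes each column separately, so it does not by itself enforce this all-or-nothing behavior; a grouped penalty (the group lasso) is the usual way to get it. The sketch below assumes the glmnet package and simulated data, and simply prints which day contrasts survive at one penalty value, so you can see whether levels were dropped individually.

```r
library(glmnet)

set.seed(42)
n <- 200
dayf <- factor(sample(1:7, n, replace = TRUE), levels = 1:7,
               labels = c("Sun", "Mon", "Tues", "Weds", "Thurs", "Fri", "Sat"))
sickdays <- rpois(n, lambda = exp(1 + 2 * (dayf %in% c("Mon", "Tues"))))

# Six dummy columns (Sunday is the reference); drop the intercept column
# because glmnet adds its own intercept
x <- model.matrix(~ dayf)[, -1]

# Plain lasso (alpha = 1) with a Gaussian response, matching the lm() example
cv_fit <- cv.glmnet(x, sickdays, alpha = 1)
beta <- as.matrix(coef(cv_fit, s = "lambda.1se"))[-1, 1]

# TRUE means that single day-vs-Sunday contrast was kept in the model
print(beta != 0)
```

If some entries are `TRUE` and others `FALSE`, the factor has been split, which is exactly the situation the argument here warns against.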

From a theoretical perspective this makes sense. If I kept only the "Tuesday vs Sunday" indicator and dropped the others, its coefficient would no longer mean "Tuesday vs Sunday" but "Tuesday vs every other day". The interpretation changes again when the model is expanded to include (what usually is) "Wednesday vs Sunday": in that case, the two coefficients are Tuesday vs S/M/Th/F/Sa and Wednesday vs S/M/Th/F/Sa. These are practically different quantities, and you cannot compare them across models.
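The change of reference group can be verified numerically in base R. This is a small sketch on simulated data; the seed and object names are arbitrary.

```r
set.seed(42)
n <- 100
dayf <- factor(sample(1:7, n, replace = TRUE), levels = 1:7,
               labels = c("Sun", "Mon", "Tues", "Weds", "Thurs", "Fri", "Sat"))
sickdays <- rpois(n, lambda = exp(1 + 2 * (dayf %in% c("Mon", "Tues"))))

full      <- lm(sickdays ~ dayf)               # all six indicators
tues_only <- lm(sickdays ~ I(dayf == "Tues"))  # only the Tuesday indicator

means <- tapply(sickdays, dayf, mean)

# Full model: the Tuesday coefficient is mean(Tues) - mean(Sun)
tues_full <- unname(coef(full)["dayfTues"])
c(coefficient = tues_full, difference = unname(means["Tues"] - means["Sun"]))

# Reduced model: the same indicator now measures mean(Tues) - mean(all other days)
tues_reduced <- unname(coef(tues_only)[2])
c(coefficient = tues_reduced,
  difference = mean(sickdays[dayf == "Tues"]) - mean(sickdays[dayf != "Tues"]))
```

The indicator column is identical in both fits; only the comparison group behind its coefficient has changed.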

Solved – Testing for coefficients significance in Lasso logistic regression

The problem with using the usual significance tests is that they assume the set of predictors was fixed in advance, and they test the null hypothesis that a predictor has no relationship with the outcome. With the lasso, you instead start from a pool of candidate variables and select the best ones from it, and the betas are also shrunk. So you cannot use the usual tests here; the results will be biased.

As far as I know, the bootstrap is used here not to estimate the variance but to estimate the probability that a variable is selected, and those probabilities play the role of your p-values. See Hastie, Tibshirani and Wainwright's free book, Statistical Learning with Sparsity: The Lasso and Generalizations; chapter 6 covers this topic.
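A rough sketch of the bootstrap idea, assuming the glmnet package and simulated data. Fixing lambda once by cross-validation and using 100 resamples are simplifications chosen here for speed, and the resulting selection frequencies are better read as a stability measure than as formal p-values.

```r
library(glmnet)

set.seed(3)
n <- 300
p <- 20
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("x", seq_len(p))))
# Simulation assumption: only x1 and x2 affect the outcome
y <- rbinom(n, size = 1, prob = plogis(1.0 * x[, "x1"] - 0.8 * x[, "x2"]))

# Choose the penalty once on the full data, then hold it fixed (a simplification)
lambda <- cv.glmnet(x, y, family = "binomial", alpha = 1)$lambda.1se

B <- 100   # number of bootstrap resamples
selected <- matrix(FALSE, B, p, dimnames = list(NULL, colnames(x)))
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)   # resample rows with replacement
  fit <- glmnet(x[idx, ], y[idx], family = "binomial",
                alpha = 1, lambda = lambda)
  # Record which slopes are nonzero in this resample (drop the intercept row)
  selected[b, ] <- as.matrix(coef(fit))[-1, 1] != 0
}

# Share of resamples in which each variable was selected
sel_prob <- colMeans(selected)
sort(sel_prob, decreasing = TRUE)
```

Variables that are selected in nearly every resample are the ones the lasso picks consistently; variables that come and go are less trustworthy.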

Also check this paper for some other ways to get p-values from the lasso: "High-Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi". There are probably more.
