Solved – glmnet, categorical variable, group lasso

categorical-data, cox-model, glmnet, lasso, survival

I am using glmnet to fit a LASSO model. My data set contains several continuous variables and one categorical variable with four levels. Can I treat the three dummy variables for the categorical variable like the other continuous variables, or should I use a group LASSO approach for the three dummies?

Best Answer

As far as I am aware, glmnet doesn't have this feature implemented yet. @Glen_b's suggestion of type.multinomial = "grouped" groups each variable's coefficients across all response classes in a multinomial model, but glmnet offers no way of grouping several independent variables within a model. For an alternative, see the grplasso package:

https://cran.r-project.org/web/packages/grplasso/grplasso.pdf
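Group LASSO implementations generally need to be told which columns of the design matrix belong together. The following is a minimal base-R sketch of how such a grouping vector could be built for one four-level factor. The data frame `toy` and its column names are hypothetical and are not taken from the question.

```r
# Hypothetical toy data: two continuous predictors and one four-level factor.
set.seed(1)
toy <- data.frame(
  x1  = rnorm(8),
  x2  = rnorm(8),
  grp = factor(rep(c("a", "b", "c", "d"), times = 2))
)

# model.matrix() expands the factor into three treatment-coded dummies;
# level "a" is the reference category absorbed by the intercept.
X <- model.matrix(~ x1 + x2 + grp, data = toy)

# The "assign" attribute maps each column to the formula term it came from:
# 0 = intercept, 1 = x1, 2 = x2, 3 = grp (shared by all three dummies).
group_index <- attr(X, "assign")

print(colnames(X))
print(group_index)
```

A vector like `group_index` is the kind of input a group LASSO routine expects, with the three dummies sharing one group label. How the unpenalized intercept is marked differs between packages, so check the documentation of the package you use before passing it on.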

Related Solutions

Model Selection – R: Using Leaps and Glmnet with Categorical Variables

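regsubsets (a function in the leaps package that also performs exhaustive model searches) accepts categorical variables (factors) directly in the formula, so you do not have to split them into dummy variables yourself. Note, however, that it expands each factor into dummy columns internally and, as the output below shows, selects those columns individually rather than as a group that is either entirely in or entirely out of a model.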

For example, if Year has levels 2013, 2014 and Treatment has levels C,N,O I can run the following statement:

> library(leaps)
> search_output <- regsubsets(mu_ln ~ Year + Treatment, data = SS_stats_df, method = "exhaustive")

Output:

Subset selection object
Call: regsubsets.formula(mu_ln ~ Year + Treatment, data = SS_stats_df, 
    nbest = 1, method = "exhaustive")
3 Variables  (and intercept)
           Forced in Forced out
Year2014       FALSE      FALSE
TreatmentN     FALSE      FALSE
TreatmentO     FALSE      FALSE
1 subsets of each size up to 3
Selection Algorithm: exhaustive

> summary(search_output)$which
  (Intercept) Year2014 TreatmentN TreatmentO
1        TRUE    FALSE       TRUE      FALSE
2        TRUE    FALSE       TRUE       TRUE
3        TRUE     TRUE       TRUE       TRUE

When faced with this same problem I found this post very helpful (my answer here is essentially an abbreviated version of the pertinent portion): http://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html

For recoding variables, converting them to factors, or renaming factor levels, these posts are helpful: https://stackoverflow.com/questions/5372896/recoding-variables-with-r and http://www.cookbook-r.com/Manipulating_data/Recoding_data/
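As a rough illustration of the factor handling those posts cover, here is a small base-R sketch. The data frame `df` and the new level labels are hypothetical. Only the Year and Treatment level names come from the example above.

```r
# Hypothetical data frame mirroring the Year / Treatment example.
df <- data.frame(
  Year      = c(2013, 2014, 2013, 2014),
  Treatment = c("C", "N", "O", "C"),
  stringsAsFactors = FALSE
)

# Convert to factors so modelling functions treat them as categorical.
df$Year      <- factor(df$Year)
df$Treatment <- factor(df$Treatment, levels = c("C", "N", "O"))

# Rename the levels; new labels are matched to the existing level order.
# The labels themselves are made up for illustration.
levels(df$Treatment) <- c("trtC", "trtN", "trtO")

# Change the reference category used for treatment-coded dummies.
df$Treatment <- relevel(df$Treatment, ref = "trtN")

print(levels(df$Year))
print(levels(df$Treatment))
```

The first level of a factor is the reference category under R's default treatment coding, so `relevel()` changes which dummy columns appear in the model.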

LASSO Regression – How to Treat Categorical Predictors in LASSO

When dealing with categorical variables in LASSO regression, it is usual to use a grouped LASSO that keeps the dummy variables corresponding to a particular categorical variable together (i.e., you cannot exclude only some of the dummy variables from the model). A useful method is the Modified Group LASSO (MGL) described in Choi, Park and Seo (2012). In this method the penalty is proportional to the norm of the $\boldsymbol{\beta}$ vector for the set of dummy variables. You still keep a reference category in this method, so the intercept term is still included. This allows you to deal with multiple categorical variables without identifiability problems.
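For orientation, the standard group LASSO criterion can be written as follows. This is the generic form and not necessarily the exact MGL penalty of Choi, Park and Seo. The symbols $G$ (number of groups), $\mathbf{X}_g$ (columns belonging to group $g$), $w_g$ (group weight) and $\lambda$ (tuning parameter) are notation assumed here.

```latex
\hat{\boldsymbol{\beta}}
  = \arg\min_{\boldsymbol{\beta}} \;
    \frac{1}{2} \Big\| \mathbf{y} - \beta_0 \mathbf{1}
      - \sum_{g=1}^{G} \mathbf{X}_g \boldsymbol{\beta}_g \Big\|_2^2
    + \lambda \sum_{g=1}^{G} w_g \, \| \boldsymbol{\beta}_g \|_2
```

Because the Euclidean norm $\| \boldsymbol{\beta}_g \|_2$ is not squared, the penalty tends to set a whole block $\boldsymbol{\beta}_g$ to zero at once, which is what keeps the dummies of one categorical variable together. A continuous variable forms a group of size one, for which the penalty reduces to the ordinary LASSO term $\lambda w_g |\beta_g|$.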

In answer to your specific questions:

(1) LASSO is an estimation method for the coefficients, but the coefficients themselves are defined by the initial model equation for your regression. As such, the interpretation of the coefficients is the same as in a standard linear regression; they represent rates-of-change of the expected response due to changes in the explanatory variables.

(2) The above literature recommends grouping the variables, but keeping a reference category. This implicitly assumes that you are comparing the presence of a categorical variable with a model that removes it but still has an intercept term.

(3) As stated above, the estimation method does not affect the interpretation of the coefficients, which are set by the model statement.
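To make the interpretation concrete, consider a sketch of a linear model with one continuous variable $x$ and one four-level categorical variable $g$ whose first level is the reference category. The notation ($\beta_0$, $\beta_1$, $\gamma_k$) is assumed here for illustration.

```latex
\mathbb{E}[y \mid x, g]
  = \beta_0 + \beta_1 x
    + \gamma_2 \, \mathbf{1}\{g = 2\}
    + \gamma_3 \, \mathbf{1}\{g = 3\}
    + \gamma_4 \, \mathbf{1}\{g = 4\},
\qquad
\gamma_k = \mathbb{E}[y \mid x, g = k] - \mathbb{E}[y \mid x, g = 1]
```

Here $\beta_1$ is the change in the expected response per unit change in $x$, and each $\gamma_k$ is the difference in expected response between level $k$ and the reference level at the same $x$. These meanings hold whether the coefficients are estimated by least squares, LASSO, or group LASSO. Only the estimated values differ.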
