I am using glmnet for LASSO. My data set contains several continuous variables and one categorical variable with four levels. Can I treat the three resulting dummy variables just like the continuous variables, or should I use a group LASSO approach for the three dummies?
Solved – glmnet, categorical variable, group lasso
categorical data, cox-model, glmnet, lasso, survival
Related Solutions
`regsubsets` (a function in the `leaps` package that also performs exhaustive model searches) can accept categorical variables that are not split out into dummy variables; it expands them into dummy columns itself. Note, though, that it then selects among those columns individually rather than keeping each factor's dummies together as a group: in the output below, the size-1 model includes TreatmentN but not TreatmentO.
For example, if `Year` has levels 2013, 2014 and `Treatment` has levels C, N, O, I can run the following statement:
    > search_output <- regsubsets(y ~ Year + Treatment, data = stats_df, method = "exhaustive")

Output:

    Subset selection object
    Call: regsubsets.formula(mu_ln ~ Year + Treatment, data = SS_stats_df,
        nbest = 1, method = "exhaustive")
    3 Variables  (and intercept)
               Forced in Forced out
    Year2014       FALSE      FALSE
    TreatmentN     FALSE      FALSE
    TreatmentO     FALSE      FALSE
    1 subsets of each size up to 3
    Selection Algorithm: exhaustive

    > summary(search_output)$which
      (Intercept) Year2014 TreatmentN TreatmentO
    1        TRUE    FALSE       TRUE      FALSE
    2        TRUE    FALSE       TRUE       TRUE
    3        TRUE     TRUE       TRUE       TRUE
When faced with this same problem I found this post very helpful (my answer here is essentially an abbreviated version of the pertinent portion): http://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html
For recoding, converting to factors, or renaming factor levels, these posts are helpful: https://stackoverflow.com/questions/5372896/recoding-variables-with-r and http://www.cookbook-r.com/Manipulating_data/Recoding_data/
When dealing with categorical variables in LASSO regression, it is usual to use a grouped LASSO that keeps the dummy variables corresponding to a particular categorical variable together (i.e., you cannot exclude only some of the dummy variables from the model). A useful method is the Modified Group LASSO (MGL) described in Choi, Park and Seo (2012). In this method the penalty is proportional to the norm of the $\boldsymbol{\beta}$ vector for the set of dummy variables. You still keep a reference category in this method, so the intercept term is still included. This allows you to deal with multiple categorical variables without identifiability problems.
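To make the penalty concrete, here is the standard group LASSO objective for a linear model, written as a sketch. The notation ($G$ groups, $X_g$ the columns for group $g$, $p_g$ the number of columns in that group) is introduced here for illustration, and the exact weighting used by MGL may differ from the $\sqrt{p_g}$ factor shown:

$$
\hat{\boldsymbol{\beta}} = \arg\min_{\beta_0,\,\boldsymbol{\beta}} \; \frac{1}{2n}\Big\| \mathbf{y} - \beta_0 \mathbf{1} - \sum_{g=1}^{G} X_g \boldsymbol{\beta}_g \Big\|_2^2 + \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \|\boldsymbol{\beta}_g\|_2
$$

For a four-level factor with a reference category, $p_g = 3$ and $\boldsymbol{\beta}_g$ holds the three dummy coefficients, which therefore enter or leave the model together. For a single continuous variable, $p_g = 1$ and the term reduces to $\lambda |\beta_g|$, the ordinary LASSO penalty.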
In answer to your specific questions:
(1) LASSO is an estimation method for the coefficients, but the coefficients themselves are defined by the initial model equation for your regression. As such, the interpretation of the coefficients is the same as in a standard linear regression; they represent rates-of-change of the expected response due to changes in the explanatory variables.
(2) The above literature recommends grouping the variables, but keeping a reference category. This implicitly assumes that you are comparing the presence of a categorical variable with a model that removes it but still has an intercept term.
(3) As stated above, the estimation method does not affect the interpretation of the coefficients, which are set by the model statement.
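As a small illustration of what keeping a reference category means for interpretation, the sketch below fits an unpenalized linear model to hypothetical toy data (the data frame, level names, and values are made up for this example). Each dummy coefficient is the difference between that level's mean and the reference level's mean; a LASSO fit would shrink these same quantities rather than change what they represent.

```r
# Hypothetical toy data: one four-level factor, three observations per level.
df <- data.frame(
  grp = factor(rep(c("a", "b", "c", "d"), each = 3)),
  y   = c(1, 2, 3,  4, 5, 6,  7, 8, 9,  10, 11, 12)
)

# Default treatment coding: level "a" is the reference category.
fit <- lm(y ~ grp, data = df)

# Group means, for comparison with the coefficients.
group_means <- tapply(df$y, df$grp, mean)

# Intercept = mean of reference level; grpb, grpc, grpd = differences from it.
coefs <- coef(fit)
diffs <- group_means[-1] - group_means[1]
print(coefs)
print(diffs)
```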
Best Answer
As far as I am aware, glmnet doesn't have this feature implemented yet. @Glen_b's suggestion of using `type.multinomial = "grouped"` groups each variable's coefficients across all response classes in a multinomial model, but there is no way of grouping predictors (independent variables) within a model. See
https://cran.r-project.org/web/packages/grplasso/grplasso.pdf
for an alternative.
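A minimal sketch of how the grplasso route might look for several continuous variables plus one four-level factor. The data, variable names, and the choice of half the maximal penalty are assumptions made for illustration. The group index is taken from the `assign` attribute of the design matrix, so the three dummies share one group number, and the intercept is marked `NA` so that it is not penalized. A linear model is used here; this is not a Cox model fit.

```r
# Hypothetical data: two continuous predictors and one four-level factor.
set.seed(42)
n <- 60
df <- data.frame(
  x1  = rnorm(n),
  x2  = rnorm(n),
  grp = factor(rep(c("a", "b", "c", "d"), length.out = n))
)
df$y <- 1 + 2 * df$x1 + rnorm(n)

# Design matrix with an intercept and three dummies for grp ("a" is the reference).
X <- model.matrix(~ x1 + x2 + grp, data = df)

# "assign" maps each column to its model term: 0 = intercept, 1 = x1, 2 = x2, 3 = grp.
index <- attr(X, "assign")
index[index == 0] <- NA  # NA marks unpenalized columns (here, the intercept)

# Fit only if the grplasso package is installed.
if (requireNamespace("grplasso", quietly = TRUE)) {
  # Smallest penalty at which all penalized groups are zero.
  lam_max <- grplasso::lambdamax(X, y = df$y, index = index,
                                 model = grplasso::LinReg())
  fit <- grplasso::grplasso(X, y = df$y, index = index,
                            lambda = 0.5 * lam_max,
                            model = grplasso::LinReg())
  print(fit$coefficients)
}
```

With this setup, the three `grp` dummies are either all zero or all nonzero at any given penalty, while `x1` and `x2` are each their own group and so behave as under an ordinary LASSO penalty.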