LASSO Regression – How to Treat Categorical Predictors in LASSO

categorical data, categorical-encoding, intercept, lasso, regression coefficients

I am running a LASSO with some categorical predictors and some continuous ones, and I have a question about the categorical variables. The first step, as I understand it, is to break each of them into dummies, standardize them for fair penalization, and then regress. Several options arise for treating the dummy variables:

  1. Include all but one of the dummies for each factor, leaving that one as a reference level. The interpretation of a dummy coefficient is relative to the excluded "reference" category. The intercept is now the mean response for the reference category.

  2. Group the variables in each factor so they're either all excluded or all-but-one included. I believe that's what @Glen_b is suggesting here:

    Normally, yes, you keep your factors all together. There's several R packages that can do this, including glmnet

  3. Include all of the levels, as suggested by @Andrew M here:

    You may also want to change the default contrast function, which by
    default leaves out one level of each factor (treatment coding).
    But because of the lasso penalty, this is no longer necessary for
    identifiability, and in fact makes interpretation of the selected
    variables more complicated. To do this, set

    contr.Dummy <- function(contrasts, ...){
       conT <- contr.treatment(contrasts=FALSE, ...)
       conT
    }
    options(contrasts=c(ordered='contr.Dummy', unordered='contr.Dummy'))
    

    Now, whatever levels of a factor are selected, you can think of it as
    suggesting that these specific levels matter, versus all the omitted
    levels. In machine learning, I have seen this coding referred to as
    one-hot encoding.

Questions:

  1. What is the interpretation of the intercept and coefficients under each of these approaches?
  2. What are the considerations involved in selecting one of them?
  3. Do we un-scale the dummy coefficients and then interpret them as a change of going from off to on?

Best Answer

When dealing with categorical variables in LASSO regression, it is common to use a group LASSO, which keeps the dummy variables for a given categorical variable together (i.e., you cannot exclude only some of the dummy variables from the model). A useful method is the Modified Group LASSO (MGL) described in Choi, Park and Seo (2012). In this method the penalty is proportional to the norm of the $\boldsymbol{\beta}$ vector for the set of dummy variables, so each factor's dummies enter or leave the model as a block. You still keep a reference category in this method, so the intercept term is still included. This allows you to deal with multiple categorical variables without identifiability problems.
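To make the grouped penalty concrete, here is the standard group LASSO objective written out. This is a sketch of the general form, not necessarily the exact MGL variant in the cited paper; the symbols $G$ (number of groups), $\mathbf{X}_g$ and $\boldsymbol{\beta}_g$ (dummy columns and coefficients of factor $g$), $p_g$ (number of dummies in group $g$) and $\lambda$ (tuning parameter) are my own notation:

$$
\min_{\beta_0,\,\boldsymbol{\beta}_1,\dots,\boldsymbol{\beta}_G}\;
\frac{1}{2}\Big\lVert \mathbf{y} - \beta_0\mathbf{1} - \sum_{g=1}^{G}\mathbf{X}_g\boldsymbol{\beta}_g \Big\rVert_2^2
\;+\; \lambda \sum_{g=1}^{G} \sqrt{p_g}\,\lVert \boldsymbol{\beta}_g \rVert_2
$$

Because the penalty uses the unsquared norm $\lVert \boldsymbol{\beta}_g \rVert_2$, it is non-differentiable at $\boldsymbol{\beta}_g = \mathbf{0}$, which is what lets a whole group be set to zero at once. A continuous predictor is simply a group with $p_g = 1$, for which the term reduces to the ordinary LASSO penalty $\lambda\lvert\beta_g\rvert$.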

In answer to your specific questions:

(1) LASSO is an estimation method for the coefficients, but the coefficients themselves are defined by the model equation of your regression. The interpretation of the coefficients is therefore the same as in a standard linear regression: each one is the change in the expected response per unit change in its explanatory variable, holding the others fixed. For a dummy variable under reference coding, that is the difference in expected response between its level and the reference level.
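As a small illustration of how the coding fixes the meaning of the coefficients, the following base-R sketch builds both design matrices for a toy factor. The data frame `d` and its columns are made up for the example, and the fit is plain least squares (no penalty), used only to show what the coefficients refer to:

```r
# Toy data: one three-level factor (hypothetical values)
d <- data.frame(
  y = c(1, 2, 4, 5, 9, 10),
  f = factor(c("a", "a", "b", "b", "c", "c"))
)

# Reference (treatment) coding: columns (Intercept), fb, fc
X_ref <- model.matrix(~ f, data = d)

# Full dummy coding: one indicator per level, plus the intercept
X_full <- model.matrix(
  ~ f, data = d,
  contrasts.arg = list(f = contrasts(d$f, contrasts = FALSE))
)

# Under reference coding the intercept is the mean of level "a",
# and fb, fc are differences from that level.
fit <- lm(y ~ f, data = d)
print(coef(fit))       # 1.5 (mean of a), 3 (b minus a), 8 (c minus a)

# With all levels plus an intercept the columns are linearly dependent,
# so without a penalty the coefficients are not identified.
print(ncol(X_full))    # 4 columns
print(qr(X_full)$rank) # rank 3
```

With all levels included, it is the penalty that picks out one solution among the many coefficient vectors that fit equally well, so the coefficients no longer read as differences from a fixed reference level.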

(2) The literature cited above recommends grouping the dummy variables of each factor while keeping a reference category. This implicitly compares a model that includes the categorical variable against one that drops it entirely but still has an intercept term.

(3) As stated above, the estimation method does not affect the interpretation of the coefficients, which are set by the model statement.
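If the dummies were standardized before fitting, a coefficient can be mapped back to the original 0/1 scale by simple algebra. The sketch below assumes the column was scaled as `z = (x - m) / s`; the simulated data and all variable names are hypothetical, and `lm` stands in for whatever fit produced the coefficient on the scaled column, since the back-transformation does not depend on the estimator:

```r
set.seed(1)
x <- rbinom(50, 1, 0.3)        # a 0/1 dummy
y <- 2 + 1.5 * x + rnorm(50)   # simulated response

m <- mean(x)
s <- sd(x)
z <- (x - m) / s               # standardized dummy

fit_z <- lm(y ~ z)             # fit on the scaled column
b0_z <- coef(fit_z)[[1]]
b_z  <- coef(fit_z)[[2]]

# Back to the original scale: y = b0_z + b_z * (x - m) / s
b_x  <- b_z / s                # change in expected y going from off to on
b0_x <- b0_z - b_z * m / s     # intercept on the original scale

# For least squares these match a direct fit on the unscaled dummy
fit_x <- lm(y ~ x)
print(c(b0_x, b_x))
print(coef(fit_x))
```

Dividing by `s` is what turns a "per standard deviation" coefficient into an "off to on" difference. With a LASSO fit the same rescaling applies, but the rescaled value is still a shrunken estimate.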
