Solved – Principled way of collapsing categorical variables with many levels

categorical-data, faq, feature-engineering, many-categories, regression

What techniques are available for collapsing (or pooling) many categories to a few, for the purpose of using them as an input (predictor) in a statistical model?


Consider a variable like college student major (discipline chosen by an undergraduate student). It is unordered and categorical, but it can potentially have dozens of distinct levels. Let's say I want to use major as a predictor in a regression model.

Using these levels as-is in a model causes all sorts of problems simply because there are so many of them: a lot of statistical precision is thrown away, and the results are hard to interpret. We're rarely interested in specific majors; we're much more likely to be interested in broad categories (subgroups) of majors. But it isn't always clear how to divide the levels into such higher-level categories, or even how many higher-level categories to use.

For typical data I would be happy to use factor analysis, matrix factorization, or a discrete latent modeling technique. But majors are mutually exclusive categories, so I'm hesitant to exploit their covariance for anything.

Furthermore, I don't care about the major categories in their own right; I care about producing higher-level categories that are coherent with respect to my regression outcome. In the binary-outcome case, that suggests something like linear discriminant analysis (LDA) to generate higher-level categories that maximize discriminative performance. But LDA is a limited technique, and that feels like dirty data dredging to me. Moreover, any continuous solution will be hard to interpret.

Meanwhile, something covariance-based like multiple correspondence analysis (MCA) seems suspect to me in this case because of the inherent dependence among mutually exclusive dummy variables: such methods are better suited to studying several categorical variables than to studying several categories of the same variable.

Edit: to be clear, this is about collapsing categories (not selecting them), and the categories are predictors (independent variables). In hindsight, this problem seems like an appropriate time to "regularize 'em all and let God sort 'em out". Glad to see this question is interesting to so many people!

Best Answer

If I understood correctly, you imagine a linear model in which one of the predictors is categorical (e.g. college major), and you expect that the coefficients might be exactly the same for some subgroups of its levels (subgroups of categories). So perhaps the regression coefficients for Maths and Physics are the same, but different from those for Chemistry and Biology.

In the simplest case, you would have a one-way ANOVA linear model with a single categorical predictor: $$y_{ij} = \mu + \alpha_i + \epsilon_{ij},$$ where $i$ encodes the level of the categorical variable (the category). But you might prefer a solution that collapses some levels (categories) together, e.g. $$\begin{cases}\alpha_1=\alpha_2, \\ \alpha_3=\alpha_4=\alpha_5.\end{cases}$$
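To make the collapsing explicit: under those constraints the model reduces to $$y_{ij} = \mu + \beta_{g(i)} + \epsilon_{ij},$$ where $g$ maps the five original levels onto two groups, $g(1)=g(2)=1$ and $g(3)=g(4)=g(5)=2$, so only two distinct effects remain to be estimated.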

This suggests trying a regularization penalty that penalizes solutions with differing alphas. One penalty term that immediately comes to mind is $$L=\omega \sum_{i<j}|\alpha_i-\alpha_j|.$$ This resembles the lasso and should enforce sparsity of the differences $\alpha_i-\alpha_j$, which is exactly what you want: you want many of them to be zero. The regularization parameter $\omega$ should be selected with cross-validation.
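To make this concrete, here is a minimal sketch of that all-pairs fusion penalty as a convex program (my own illustration, not code from any of the papers below). It assumes numpy and cvxpy are available; the five-level factor, the simulated data, and the value of $\omega$ are all invented for demonstration:

```python
# Sketch of the all-pairs fusion penalty L = omega * sum_{i<j} |alpha_i - alpha_j|.
# Data and omega are invented for demonstration; assumes numpy and cvxpy.
import numpy as np
import cvxpy as cp
from itertools import combinations

rng = np.random.default_rng(0)

# Simulate a 5-level factor whose true effects form two groups,
# mirroring the alpha_1 = alpha_2, alpha_3 = alpha_4 = alpha_5 example.
# (The intercept mu is absorbed into the alphas here.)
true_alpha = np.array([1.0, 1.0, -0.5, -0.5, -0.5])
levels = rng.integers(0, 5, size=300)
y = true_alpha[levels] + rng.normal(scale=0.5, size=300)

X = np.eye(5)[levels]        # dummy (one-hot) encoding of the factor

# Matrix of all pairwise differences, so that
# norm1(D @ alpha) = sum_{i<j} |alpha_i - alpha_j|.
pairs = list(combinations(range(5), 2))
D = np.zeros((len(pairs), 5))
for r, (i, j) in enumerate(pairs):
    D[r, i], D[r, j] = 1.0, -1.0

alpha = cp.Variable(5)
omega = 5.0                  # in practice, choose by cross-validation

objective = cp.Minimize(cp.sum_squares(y - X @ alpha) + omega * cp.norm1(D @ alpha))
cp.Problem(objective).solve()

print(np.round(alpha.value, 2))  # fused levels show (near-)identical coefficients
```

With $\omega = 0$ this is just ordinary least squares on the dummy variables; as $\omega$ grows, more and more pairwise differences are driven to exactly zero, i.e. levels get fused.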


I have never dealt with models like that, and the above is the first thing that came to my mind. Then I decided to see whether something like it has already been implemented. Some quick Google searching revealed that this is called fusion of categories; searching for lasso fusion categorical will give you plenty of references to read. Here are a few that I briefly looked at:

Gertheiss and Tutz 2010, published in the Annals of Applied Statistics, looks like a recent and very readable paper that contains other references. Here is its abstract:

Shrinking methods in regression analysis are usually designed for metric predictors. In this article, however, shrinkage methods for categorial predictors are proposed. As an application we consider data from the Munich rent standard, where, for example, urban districts are treated as a categorial predictor. If independent variables are categorial, some modifications to usual shrinking procedures are necessary. Two $L_1$-penalty based methods for factor selection and clustering of categories are presented and investigated. The first approach is designed for nominal scale levels, the second one for ordinal predictors. Besides applying them to the Munich rent standard, methods are illustrated and compared in simulation studies.

I like their Lasso-like solution paths that show how levels of two categorical variables get merged together when regularization strength increases:

[Figure: lasso-type solution paths from Gertheiss and Tutz 2010, showing coefficients for the levels of two categorical predictors fusing as regularization strength increases]
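Continuing the sketch above (same simulated data and variables), one can trace a crude analogue of such a path by refitting over a grid of $\omega$ values; the grid here is arbitrary:

```python
# Continuing the earlier sketch: refit along a grid of omega values and
# watch the estimated coefficients merge as the penalty grows.
for omega in [0.0, 0.5, 2.0, 10.0, 50.0]:
    cp.Problem(cp.Minimize(cp.sum_squares(y - X @ alpha)
                           + omega * cp.norm1(D @ alpha))).solve()
    print(f"omega={omega:>5}: {np.round(alpha.value, 2)}")
```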
