What techniques are available for collapsing (or pooling) many categories to a few, for the purpose of using them as an input (predictor) in a statistical model?
Consider a variable like college student major (discipline chosen by an undergraduate student). It is unordered and categorical, but it can potentially have dozens of distinct levels. Let's say I want to use major as a predictor in a regression model.
Using these levels as-is for modeling leads to all sorts of issues because there are just so many. A lot of statistical precision would be thrown away to use them, and the results are hard to interpret. We're rarely interested in specific majors — we're much more likely to be interested in broad categories (subgroups) of majors. But it isn't always clear how to divide up the levels into such higher-level categories, or even how many higher-level categories to use.
For typical data I would be happy to use factor analysis, matrix factorization, or a discrete latent modeling technique. But majors are mutually exclusive categories, so I'm hesitant to exploit their covariance for anything.
Furthermore I don't care about the major categories on their own. I care about producing higher-level categories that are coherent with respect to my regression outcome. In the binary outcome case, that suggests to me something like linear discriminant analysis (LDA) to generate higher-level categories that maximize discriminative performance. But LDA is a limited technique and that feels like dirty data dredging to me. Moreover any continuous solution will be hard to interpret.
Meanwhile something based on covariances, like multiple correspondence analysis (MCA), seems suspect to me in this case because of the inherent dependence among mutually exclusive dummy variables — they're better suited for studying multiple categorical variables, rather than multiple categories of the same variable.
edit: to be clear, this is about collapsing categories (not selecting them), and the categories are predictors or independent variables. In hindsight, this problem seems like an appropriate time to "regularize 'em all and let God sort 'em out". Glad to see this question is interesting to so many people!
Best Answer
If I understood correctly, you imagine a linear model where one of the predictors is categorical (e.g. college major); and you expect that for some subgroups of its levels (subgroups of categories) the coefficients might be exactly the same. So perhaps the regression coefficients for Maths and Physics are the same, but different from those for Chemistry and Biology.
In the simplest case, you would have a "one-way ANOVA" linear model with a single categorical predictor: $$y_{ij} = \mu + \alpha_i + \epsilon_{ij},$$ where $i$ encodes the level of the categorical variable (the category). But you might prefer a solution that collapses some levels (categories) together, e.g. $$\begin{cases}\alpha_1=\alpha_2, \\ \alpha_3=\alpha_4=\alpha_5.\end{cases}$$
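As a small illustration of this model (all data here are simulated and the names like `true_alpha` are my own invention, not from any of the referenced papers): fitting the one-way ANOVA model by least squares with full dummy coding simply recovers the per-level means, which is the unpenalized baseline that the fusion idea below starts from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 5 levels ("majors"), 40 observations each.
# Levels 0-1 share one true effect, levels 2-4 share another.
true_alpha = np.array([1.0, 1.0, 3.0, 3.0, 3.0])
levels = np.repeat(np.arange(5), 40)
y = true_alpha[levels] + rng.normal(scale=0.5, size=levels.size)

# OLS fit of y_ij = mu + alpha_i + eps with full dummy coding and no
# intercept: the least-squares estimates are just the per-level means.
alpha_hat = np.array([y[levels == i].mean() for i in range(5)])
print(np.round(alpha_hat, 2))
```

Note that nothing in this unpenalized fit tells you that levels 0 and 1 (or 2 through 4) should be collapsed; the estimates are merely close, not equal.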
This suggests that one can try to use a regularization penalty that would penalize solutions with differing alphas. One penalty term that immediately comes to mind is $$L=\omega \sum_{i<j}|\alpha_i-\alpha_j|.$$ This resembles lasso and should enforce sparsity of the $\alpha_i-\alpha_j$ differences, which is exactly what you want: you want many of them to be zero. Regularization parameter $\omega$ should be selected with cross-validation.
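A crude numerical sketch of this penalized objective (simulated data; a generic derivative-free optimizer stands in for a proper fused-lasso solver, so it only shrinks the pairwise differences toward zero rather than setting them exactly to zero — the papers cited below give algorithms that fuse exactly):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical setup: 5 levels; levels 0-1 and 2-4 share true effects.
true_alpha = np.array([1.0, 1.0, 3.0, 3.0, 3.0])
levels = np.repeat(np.arange(5), 40)
y = true_alpha[levels] + rng.normal(scale=0.5, size=levels.size)

def objective(alpha, omega):
    """Least squares plus the all-pairs L1 fusion penalty L."""
    sse = np.sum((y - alpha[levels]) ** 2)
    diffs = alpha[:, None] - alpha[None, :]
    penalty = np.abs(diffs[np.triu_indices(5, k=1)]).sum()
    return sse + omega * penalty

# Start from the per-level means and minimize; Powell tolerates the
# non-smooth |.| terms on a toy problem like this one.
alpha0 = np.array([y[levels == i].mean() for i in range(5)])
fit = minimize(objective, alpha0, args=(10.0,), method="Powell")
print(np.round(fit.x, 2))
```

With a moderate $\omega$, the estimates within each true subgroup are pulled close together while the gap between subgroups survives, which is the qualitative behavior the penalty is designed for.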
I have never dealt with models like that, and the above is the first thing that came to my mind. Then I decided to see if there is something like that implemented. I did some Google searches and soon realized that this is called fusion of categories; searching for
lasso fusion categorical
will give you a lot of references to read. Here are a few that I briefly looked at: Gerhard Tutz, Regression for Categorical Data, see p. 175 in Google Books. Tutz mentions the following four papers:
Land and Friedman, 1997, Variable fusion: a new adaptive signal regression method
Bondell and Reich, 2009, Simultaneous factor selection and collapsing levels in ANOVA
Gertheiss and Tutz, 2010, Sparse modeling of categorial explanatory variables
Tibshirani et al. 2005, Sparsity and smoothness via the fused lasso is somewhat relevant even if not exactly the same (it is about ordinal variables)
Gertheiss and Tutz 2010, published in the Annals of Applied Statistics, looks like a recent and very readable paper that contains further references.
I like their lasso-like solution paths, which show how levels of the categorical variables get merged together as the regularization strength increases.
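One can trace a rough analogue of such a solution path numerically (again on simulated data, with a generic optimizer standing in for the authors' dedicated algorithms): refit the penalized objective $L$ from above over an increasing grid of $\omega$ values and watch the level estimates collapse toward each other.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical data: levels 0-1 and 2-4 share true effects.
true_alpha = np.array([1.0, 1.0, 3.0, 3.0, 3.0])
levels = np.repeat(np.arange(5), 40)
y = true_alpha[levels] + rng.normal(scale=0.5, size=levels.size)

def objective(alpha, omega):
    sse = np.sum((y - alpha[levels]) ** 2)
    diffs = alpha[:, None] - alpha[None, :]
    return sse + omega * np.abs(diffs[np.triu_indices(5, k=1)]).sum()

alpha0 = np.array([y[levels == i].mean() for i in range(5)])
path = {}
for omega in [0.0, 5.0, 50.0, 500.0]:
    path[omega] = minimize(objective, alpha0, args=(omega,),
                           method="Powell").x

# As omega grows the levels are pulled together; at very large omega
# every level collapses to (approximately) the grand mean.
for omega, a in path.items():
    print(omega, np.round(a, 2))
```

Plotting each level's estimate against $\omega$ would give a crude version of the merging paths shown in the paper, with within-subgroup levels joining first and everything fusing at large $\omega$.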