Solved – Regression with Lots of Categorical Variables

Tags: categorical-data, categorical-encoding, many-categories, regression

I'm facing a regression task with many categorical features and only a few numeric ones. I encoded the categorical features into dummy variables and dropped the first dummy column for each feature.
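For reference, the encoding step looks roughly like this (a minimal sketch; the DataFrame and column names below are placeholders, not my real data):

```python
import pandas as pd

# Toy stand-in for the real data set (all names are placeholders).
df = pd.DataFrame({
    "industry": ["tech", "retail", "tech", "energy"],
    "region":   ["east", "west", "west", "east"],
    "sales":    [120.0, 80.0, 95.0, 150.0],
    "y":        [1.2, 0.7, 0.9, 1.5],
})

# One dummy per level, dropping the first level of each categorical
# feature so the dummies are not collinear with the intercept.
X = pd.get_dummies(df[["industry", "region", "sales"]],
                   columns=["industry", "region"], drop_first=True)
y = df["y"]
print(X.columns.tolist())
```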

I am not getting a very good $R^2$ at all. I am wondering whether, aside from creating dummies, there are any special strategies for situations with so many categorical features.

For example, are any regressors better suited to dummy variables?

Best Answer

$R^2$ isn't a good measure of model quality. It is a value devoid of context. Imagine you had an $R^2$ of 0.11. Now let us assume that you are modeling human behavior. Think about how many behaviors you perform in a day. Consider how many distinct categories of behavior you engage in. Now consider the diversity of people on the planet. Add to that the differences in environment, education, and language. Consider the various incentives in place. Now think about the population size. Depending on what you are doing, an $R^2$ of 0.11 may be huge: you have accounted for 11% of observed behavior. On the other hand, for a well-understood process, such as aircraft design, if your $R^2$ is 0.11 then you are probably an undergraduate engineering student learning how to design things.

A second issue is that categories overlap and are associated with one another. Certain things are unlikely to happen together. A firm with a huge operating margin and a high volume of sales is unlikely to have a low relative income. Even with millions of observations, it would be unsurprising to find that some joint categories have no observations in them at all. Imagine a joint category with only one or two observations; how is that impacting your parameter estimates?
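Counting the joint categories directly is a cheap way to check for this. A minimal sketch in Python, assuming a pandas DataFrame with placeholder column names:

```python
import pandas as pd

# Placeholder data; substitute your own DataFrame and categorical columns.
df = pd.DataFrame({
    "industry": ["tech", "retail", "tech", "energy", "retail"],
    "region":   ["east", "west", "west", "east", "east"],
    "size":     ["small", "large", "small", "large", "small"],
})
cat_cols = ["industry", "region", "size"]

# Count the observations in every joint category that actually occurs.
cell_counts = df.groupby(cat_cols).size().sort_values()

# Joint categories with only one or two observations are the ones most
# likely to destabilize the parameter estimates.
print(cell_counts[cell_counts <= 2])
```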

Consider gender as a category alongside a variable counting the children a person has given birth to. What would the parameter for males imply, given that the regression will happily estimate one? What if there were an encoding error and some men appeared to have given birth?

Consider three possible solutions: an information criterion (though not automatically AIC or BIC; look for one appropriate to your problem), Bayes factors, or Bayesian model selection. I would suggest an information criterion over the other two. Bayesian methods do not appear to be what you are doing; Bayesian hypotheses are combinatoric, and they are also calculation intensive. The information criteria map to the supremum of a set of stylized Bayesian posteriors.
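Purely to illustrate the mechanics, here is a sketch that compares candidate specifications by an information criterion using statsmodels. AIC stands in for whichever criterion is actually appropriate to your problem, and the data are simulated placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Simulated stand-in data: two categorical features and a numeric target.
df = pd.DataFrame({
    "a": rng.choice(["a1", "a2", "a3"], size=n),
    "b": rng.choice(["b1", "b2"], size=n),
})
df["y"] = 1.0 + (df["a"] == "a2") * 0.5 + rng.normal(scale=1.0, size=n)

def fit_ic(cols):
    """Fit OLS on dummy-encoded columns and return the model's AIC."""
    X = sm.add_constant(pd.get_dummies(df[cols], drop_first=True).astype(float))
    return sm.OLS(df["y"], X).fit().aic

# Compare candidate specifications; a lower criterion value is preferred.
for cols in (["a"], ["b"], ["a", "b"]):
    print(cols, round(fit_ic(cols), 1))
```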

If the set of independent variates is very large and a combinatoric solution isn't feasible, then consider a step-wise regression. There are good reasons not to use step-wise regression, but a good reason to use it is that the search space is simply too large to do anything else.
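A greedy forward step-wise search is easy to sketch. Again, the data and the choice of AIC as the criterion are placeholders, not a recommendation:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300

# Placeholder pool of categorical candidate variables.
pool = ["f1", "f2", "f3", "f4"]
df = pd.DataFrame({c: rng.choice(list("xyz"), size=n) for c in pool})
df["y"] = (df["f1"] == "x") * 1.0 + rng.normal(size=n)

def aic_for(cols):
    """AIC of an OLS fit on dummy-encoded columns (intercept-only if empty)."""
    if cols:
        X = sm.add_constant(pd.get_dummies(df[list(cols)], drop_first=True).astype(float))
    else:
        X = np.ones((n, 1))  # intercept-only baseline
    return sm.OLS(df["y"], X).fit().aic

selected, remaining, best = [], set(pool), aic_for([])
# Greedy forward search: at each step add the variable that lowers the
# criterion the most; stop when nothing improves it.
while remaining:
    scores = {c: aic_for(selected + [c]) for c in remaining}
    cand, score = min(scores.items(), key=lambda kv: kv[1])
    if score >= best:
        break
    selected.append(cand)
    remaining.remove(cand)
    best = score

print("selected:", selected, "AIC:", round(best, 1))
```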

Finally, just create a multi-dimensional table of the combinations of variables and look at it. Are any of the counts so small that they are probably producing a poor model? Leave out one variable at a time; is there any subset that markedly improves the worst-case counts? Are any variables so logically related that they are almost the same variable? Honestly, logic would be your greatest friend here.
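A rough way to automate the "leave one variable out" check (a sketch; the column names are placeholders):

```python
import pandas as pd

# Placeholder categorical data.
df = pd.DataFrame({
    "industry": ["tech", "retail", "tech", "energy", "retail", "tech"],
    "region":   ["east", "west", "west", "east", "east", "west"],
    "size":     ["small", "large", "small", "large", "small", "small"],
})
cat_cols = ["industry", "region", "size"]

def worst_cell(cols):
    """Smallest observed joint-category count for the given columns."""
    return df.groupby(list(cols)).size().min()

print("all variables:", worst_cell(cat_cols))
# Drop one variable at a time and see whether the worst-case count improves.
for drop in cat_cols:
    kept = [c for c in cat_cols if c != drop]
    print(f"without {drop}:", worst_cell(kept))
```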

If the set is too large to inspect visually, say millions of joint category intersections, then consider measures of association to reduce it. You want as much variability in your model as possible, so you want to drop categorical or ordinal variables that are strongly associated with one another.
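One common measure of association between two categorical variables is Cramér's V, built from the chi-square statistic of the contingency table. A sketch with scipy; the 0.8 cut-off and the column names are illustrative only:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V for two categorical series, from the chi-square statistic."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Placeholder data: `region` and `market` are deliberately near-duplicates.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west", "east", "west"] * 10,
    "market": ["e", "w", "e", "w", "e", "e"] * 10,
    "size":   ["s", "l", "s", "s", "l", "l"] * 10,
})

cats = ["region", "market", "size"]
for i, a in enumerate(cats):
    for b in cats[i + 1:]:
        v = cramers_v(df[a], df[b])
        flag = "  <- strongly associated, candidate to drop" if v > 0.8 else ""
        print(f"{a} vs {b}: V = {v:.2f}{flag}")
```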

Unfortunately, there isn't an automatic, clean answer to your problem. Reduction tools such as principal components analysis or factor analysis probably will not work as intended, because overlaps in categorical variables can have the effect of forcing orthogonality, depending on the specifics of what you are doing.
