Solved – Categorical Data: Merging Categories

categorical data, data preprocessing, many-categories

I'm working with some categorical data that I want to use for prediction with a machine learning algorithm. Many of my categorical features have several categories that contain very few positive observations.

I've seen approaches that merge categories in scenarios such as this in order to create a category with more positive observations (where it makes sense conceptually).

My question is whether this is the best approach and, if so, whether there is a rule of thumb for how small a category needs to be to justify merging it with another category. I'm also interested to know whether some kind of statistical analysis should be applied to categories before any decision is made to merge them, or whether it is legitimate in most cases to rely on a priori knowledge and common sense when merging categories.

A relevant example of a feature where I'm considering this approach is as follows:

Old Category    Size of category (% of all records)
Sunny           82%
Raining         17%
Snowing         0.1%
Foggy           0.9% 

New Category    Size of category (% of all records)
Good Weather    82%
Bad Weather     17%
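
For illustration, a merge like this can be written as an explicit lookup table, which keeps the decision visible and easy to review. The following is only a minimal sketch in plain Python: the names `MERGE_MAP` and `merge_categories` are hypothetical, the records are made up, and assigning both Snowing and Foggy to "Bad Weather" is an assumption, since the table above does not say where they go.

```python
from collections import Counter

# Hypothetical mapping from old to new categories.
# Where Snowing and Foggy belong is an assumption, not something the data tells us.
MERGE_MAP = {
    "Sunny": "Good Weather",
    "Raining": "Bad Weather",
    "Snowing": "Bad Weather",
    "Foggy": "Bad Weather",
}


def merge_categories(values, mapping):
    """Replace each value by its merged category.

    Raises KeyError for a value missing from the mapping, so that
    unexpected categories are not silently dropped or mislabeled.
    """
    return [mapping[v] for v in values]


# Made-up sample records, not the real data set.
records = ["Sunny", "Sunny", "Raining", "Foggy", "Sunny", "Snowing"]

merged = merge_categories(records, MERGE_MAP)
counts = Counter(merged)
print(counts)
```

Keeping the mapping in one place also makes it straightforward to try an alternative grouping and compare model performance, rather than committing to a single merge up front.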

Best Answer

Ideally, you would want to let the learning algorithm decide which features are important. Does it make sense to merge categories as you suggest? I don't know.

Important things to consider w.r.t. your problem:

  1. What do you actually want to predict? Do you want to (also) analyze the features?

  2. How big is your data? Percentages say little on their own without the sample size: 0.1% snowy days could mean only 1 snowy day among n = 1,000 observations, or 1,000 snowy days in a sample of n = 1,000,000.

  3. What is your learning algorithm? If n is very large and prediction accuracy matters most, you might want to consider neural networks, which ideally will extract "features" by themselves from the features you present to them.

  4. Be careful with "common sense". It is often not the sense other people have. In your example, you merge sunny, rainy, foggy, and snowy into two very subjective categories, "good" and "bad" weather. I think this is not a "good" idea, as no one but you can say whether you put snowy under "good" or "bad" weather (I personally love snow, but I know others who hate it...).
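
Point 2 above can be made concrete by converting a category's share into an approximate absolute count for different sample sizes. This is just a small sketch; the helper name `category_count` is hypothetical, and the shares are the ones from the table in the question.

```python
def category_count(n_total, share_percent):
    """Approximate number of observations in a category, given its share in percent."""
    return round(n_total * share_percent / 100)


# The same 0.1% share (Snowing) for two very different sample sizes.
for n in (1_000, 1_000_000):
    print(f"n = {n}: about {category_count(n, 0.1)} snowy observations")

# n = 1000: about 1 snowy observations
# n = 1000000: about 1000 snowy observations
```

Whether 1,000 observations are enough for a category to stand on its own still depends on the model and the problem, but the absolute count is a more useful starting point than the percentage.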
