Solved – Encoding categorical features to numbers for machine learning

machine learning, many-categories

Many machine learning algorithms, for example neural networks, expect numeric input. So when you have categorical data, you need to convert it. By categorical I mean, for example:

Car brands: Audi, BMW, Chevrolet…
User IDs: 1, 25, 26, 28…

Even though user IDs are numbers, they are just labels and do not mean anything in terms of continuity, unlike age or a sum of money.

So, the basic approach seems to be to use binary vectors to encode categories:

Audi: 1, 0, 0…
BMW: 0, 1, 0…
Chevrolet: 0, 0, 1…

It's OK when there are few categories, but beyond that it looks a bit inefficient. For example, when you have 10 000 user ids to encode, it's 10 000 features.
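The binary-vector encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `one_hot_encode` and the sample list are made up for the example, and the columns are sorted alphabetically just to make the output deterministic.

```python
def one_hot_encode(values):
    """Map each distinct category to a binary indicator vector."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)   # one column per distinct category
        row[index[value]] = 1         # switch on the column for this value
        vectors.append(row)
    return categories, vectors


brands = ["Audi", "BMW", "Chevrolet", "BMW"]
categories, vectors = one_hot_encode(brands)
for brand, vec in zip(brands, vectors):
    print(brand, vec)
```

Each row has exactly one non-zero entry, and the row length equals the number of distinct categories, which is why the feature count grows so quickly with user IDs.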

The question is, is there a better way? Maybe one involving probabilities?

Best Answer

You can always treat your user IDs as a bag of words: most text classifiers can deal with hundreds of thousands of dimensions when the data is sparse (many zeros that you do not need to store explicitly in memory, for instance if you use a Compressed Sparse Row representation for your data matrix).
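To make the memory argument concrete, here is a minimal sketch of a one-hot matrix stored in Compressed Sparse Row form, written in plain Python so the three underlying arrays are visible. The function name `one_hot_csr` and the sample IDs are assumptions for illustration; in practice you would likely hand these arrays to a sparse-matrix library rather than manage them yourself.

```python
def one_hot_csr(values):
    """Encode categories as a one-hot matrix in CSR form.

    Returns (data, indices, indptr, shape): only the non-zero entries
    are stored, one per row.
    """
    index = {}                       # category -> column number
    data, indices, indptr = [], [], [0]
    for value in values:
        # Assign the next free column the first time a category is seen.
        col = index.setdefault(value, len(index))
        data.append(1)               # the single non-zero value in this row
        indices.append(col)          # its column position
        indptr.append(len(indices))  # where the next row starts
    shape = (len(values), len(index))
    return data, indices, indptr, shape


user_ids = [1, 25, 26, 28, 25]
data, indices, indptr, shape = one_hot_csr(user_ids)
print(shape, len(data))  # 5 stored values instead of 5 * 4 dense cells
```

The storage grows with the number of rows, not with rows times categories. The `(data, indices, indptr)` triple is the same layout that `scipy.sparse.csr_matrix` accepts, should you use SciPy.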

However, the question is: does it make sense w.r.t. your specific problem to treat user IDs as features? Wouldn't it make more sense to denormalize your relational data and use user features (age, location, char n-grams of the online nickname, transaction history...) instead of their IDs?
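As a rough sketch of what "user features instead of IDs" could look like, the snippet below turns a user record into a feature dictionary, including character trigrams of the nickname. The record fields (`age`, `location`, `nickname`), the feature-name prefixes, and the padding with spaces are all assumptions chosen for the example, not something prescribed above.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def user_features(user):
    """Build a sparse feature dict from a (hypothetical) user record."""
    features = {
        "age": float(user["age"]),            # genuinely numeric feature
        "location=" + user["location"]: 1.0,  # low-cardinality categorical
    }
    for gram, count in char_ngrams(user["nickname"]).items():
        features["nick3=" + gram] = float(count)
    return features


example = user_features({"age": 31, "location": "Paris", "nickname": "Bob"})
print(sorted(example))
```

Two users who have never been seen before still share features with known users (same location, similar nickname fragments), which an ID-only encoding cannot offer.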

You could also cluster your raw user vectors and use the IDs of the top N closest cluster centers as activated features instead of the user IDs.
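A minimal sketch of the last step, assuming the cluster centers have already been computed by some clustering algorithm (k-means, for instance). The centers, the user vector, and the function names here are made up for illustration, and Euclidean distance is an assumption.

```python
import math


def closest_center_ids(vector, centers, n):
    """Return the indices of the n centers nearest to vector."""
    distances = sorted((math.dist(vector, c), i) for i, c in enumerate(centers))
    return [i for _, i in distances[:n]]


def encode_user(vector, centers, n):
    """Binary vector with a 1 for each of the n closest centers."""
    active = set(closest_center_ids(vector, centers, n))
    return [1 if i in active else 0 for i in range(len(centers))]


# Hypothetical centers, as if produced by a prior clustering step.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
user = (1.0, 2.0)

top2 = closest_center_ids(user, centers, 2)
encoded = encode_user(user, centers, 2)
print(top2, encoded)
```

The encoded vector has as many columns as there are clusters rather than users, so 10 000 user IDs can collapse into, say, a few hundred features.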