Machine-Learning – Encoding Categorical Data for Binary Classification

Tags: categorical data, categorical-encoding, classification, machine learning, many-categories

ML newbie here, currently looking at a binary classification problem. I have a good amount of training data (easily over 50k), consisting of both numeric and categorical features. The categorical features include both ordinal and nominal types.

Here's the problem. I am unsure what the most appropriate way of encoding the categorical data is, and which factors I should consider when choosing an encoding method. I have come across several encoding methods, which are summarized in this article.

As additional information, I am thinking of using logistic regression and random forest as my first test classifiers. I have read that certain encoding methods are more suitable for certain types of classifiers, and I hope to get more insight on that as well.

I hope that you guys/girls can lend me a helping hand. Thank you very much in advance.

EDIT

Due to P&C, I cannot share instances of the data; however, these are examples of the categorical features, with the number of distinct values for each feature in parentheses:

  • Country (nominal) (40)
  • Job Grade (ordinal) (8)
  • Year/Quarter Joined (can be ordinal or nominal) (15)
  • Department/Business Unit (nominal) (10)

Library used: scikit-learn

Best Answer

If you have ordinal variables, you should encode them by mapping each value to a number. The numbers should be chosen so that they reflect the order or hierarchy of the values in your variable.

For example, say you have a variable called ratings which takes the values "bad", "good" and "very good". This is clearly an ordinal variable, as these values have a clear order ("bad" < "good" < "very good"). In this case you want to map them to three numbers that preserve that order (e.g. $1$, $2$, $3$).
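As a minimal sketch of the mapping described above, the snippet below encodes the ratings example with a plain dictionary. The names `RATING_ORDER` and `encode_ordinal` are hypothetical, and the codes 1, 2, 3 are just one choice that preserves the order.

```python
# Hypothetical ordered mapping for the "ratings" example; the codes 1, 2, 3
# are one possible choice that preserves bad < good < very good.
RATING_ORDER = {"bad": 1, "good": 2, "very good": 3}


def encode_ordinal(values, mapping):
    """Replace each category with its integer code.

    Raises KeyError if a value is not in the mapping, so unseen
    categories are not silently mis-encoded.
    """
    return [mapping[v] for v in values]


ratings = ["good", "bad", "very good", "good"]
encoded_ratings = encode_ordinal(ratings, RATING_ORDER)
print(encoded_ratings)  # [2, 1, 3, 2]
```

Since you are using scikit-learn, `OrdinalEncoder` with an explicit `categories` list should achieve the same thing; note that it assigns codes starting from 0 rather than 1, which preserves the order just as well.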

If you have nominal variables (categorical variables with no inherent order), you should perform a one-hot encoding. With this scheme you create $M$ new variables, where $M$ is the number of unique values in your variable (e.g. in the ratings example above, $M$ was $3$). Each of these new variables corresponds to one of the values. To encode the data this way, for each sample, you look at the value of the nominal variable and set the corresponding new variable to $1$; the other new variables are set to $0$.

For example, say you have a variable called color and it takes the values "yellow", "red" and "green". These values can't be ordered in any meaningful way, so we clearly have a nominal variable. As described above, we create $3$ new variables, each one corresponding to a value of the nominal variable (i.e. each one corresponds to a color). If a sample has a color of "red", the new red variable becomes $1$ while the other two are set to $0$.
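The color example can be sketched with scikit-learn's `OneHotEncoder`. The toy `colors` array is made up for illustration, and the encoder is left at its defaults, so treat this as a starting point rather than a recommended configuration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data for illustration only: one nominal feature, one row per sample.
colors = np.array([["yellow"], ["red"], ["green"], ["red"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense
# array here only so the result is easy to inspect.
one_hot = encoder.fit_transform(colors).toarray()

# The learned categories are sorted, so the columns are green, red, yellow.
print(encoder.categories_[0])
print(one_hot)
```

Each row contains exactly one $1$, in the column of that sample's color. For instance, the second sample ("red") becomes `[0, 1, 0]`.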

While this increases your problem's dimensionality (which usually isn't a good thing), it avoids leading you to make false assumptions about the order of the values (e.g. "red" < "yellow").