Machine-Learning – Encoding Categorical Data for Binary Classification

categorical data, categorical-encoding, classification, machine learning, many-categories

ML newbie here, currently looking at a binary classification problem. I have a fairly large amount of training data (easily over 50k), which consists of both numeric and categorical data. The categorical data includes both ordinal and nominal types.

Here's the problem. I am unsure of the most appropriate way to encode the categorical data, and of the factors I should consider when choosing an encoding method. I have come across several encoding methods, which are summarized in this article.

As additional information, I am thinking of using logistic regression and random forest as my first test classifiers. I have read that certain encoding methods are more suitable for certain types of classifiers. Hope to have more insight on that matter as well.

I hope that you guys/girls can lend me a helping hand. Thank you very much in advance.

EDIT

Due to P&C, I cannot share instances of the data; however, these are examples of the categorical features, with the number of distinct values for each feature:

Country (nominal) (40)
Job Grade (ordinal) (8)
Year/Quarter Joined (can be ordinal or nominal) (15)
Department/Business Unit (nominal) (10)

Library used: scikit-learn

Best Answer

If you have ordinal variables, you should encode them by mapping each value to a number. The numbers should be chosen so that they reflect the order or hierarchy of the values in your variable.

For example, say you have a variable called ratings which assumes the values "bad", "good", "very good". This is clearly an ordinal variable as these values have a clear order ("bad" < "good" < "very good"). In this case you want to map these to three numbers that preserve that order (e.g. "1", "2", "3").
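The ratings example above can be sketched in a few lines of plain Python. This is only a minimal illustration: the names `RATING_ORDER`, `rating_to_code` and `encode_ratings` are made up here, and the sample column is hypothetical. In scikit-learn, `OrdinalEncoder` with an explicit `categories` list does a similar job (its codes start at 0 rather than 1).

```python
# Explicit order of the ordinal variable, lowest to highest.
RATING_ORDER = ["bad", "good", "very good"]

# Map each level to an integer that preserves the order: bad=1, good=2, very good=3.
rating_to_code = {level: rank for rank, level in enumerate(RATING_ORDER, start=1)}


def encode_ratings(values):
    """Replace each rating with its ordinal code (raises KeyError on an unseen level)."""
    return [rating_to_code[v] for v in values]


ratings = ["good", "bad", "very good", "good"]  # hypothetical sample column
encoded = encode_ratings(ratings)
print(encoded)  # [2, 1, 3, 2]
```

Writing the order out by hand matters: sorting the labels alphabetically would not, in general, give the order you actually mean.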

If you have nominal (or categorical) variables you should perform a one-hot encoding. With this scheme you create $M$ new variables, where $M$ is the number of unique values in your variable (e.g. above $M$ was $3$). Each of these variables corresponds to one of the values. To encode the data this way, for each sample, you look at the value of the nominal variable and put a $1$ in the corresponding new variable; the other new variables take the value $0$.

For example, say you have a variable called color and it takes the values "yellow", "red" and "green". These values can't be ordered in any way, so we clearly have a nominal variable. As described above, we create $3$ new variables, each one corresponding to a value of the nominal variable (i.e. each one corresponds to a color). If a sample has a color of "red", the new red variable becomes $1$ while the other two are set to $0$.

While this increases your problem's dimensionality (which usually isn't a good thing), it keeps you from making false assumptions about the order of the values (e.g. "red" < "yellow").
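Here is a small sketch of the one-hot scheme for the color example, again in plain Python. The helper `one_hot_encode` and the sample column are hypothetical, and the column order (alphabetical) is just a choice made for reproducibility. scikit-learn's `OneHotEncoder` is the usual tool for this in practice.

```python
def one_hot_encode(values, categories=None):
    """Return (categories, rows); each row has a single 1 in its category's column."""
    if categories is None:
        categories = sorted(set(values))  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # raises KeyError on a category not in `categories`
        rows.append(row)
    return categories, rows


colors = ["red", "yellow", "green", "red"]  # hypothetical sample column
columns, encoded = one_hot_encode(colors)
print(columns)  # ['green', 'red', 'yellow']
print(encoded)  # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

For a linear model with an intercept, such as logistic regression, one of the $M$ columns is often dropped because the columns always sum to $1$; tree-based models such as random forests do not need this.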

Related Solutions

Solved – technical issues regarding to cluster analysis

Firstly, assess whether your continuous data needs normalizing. In practice, normalizing numeric input values often makes training more efficient, which can lead to a better predictor. You can use any of the methods below, depending on your model assumptions.
- Gaussian normalization (z-score), i.e., v' = (v - mean) / std dev
- Min-max method
- Box-Cox power transformation
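As a rough illustration, the z-score and min-max options in the list above could look like this in plain Python. The function names and the sample values are hypothetical, and the population standard deviation is an assumption (using the sample standard deviation is an equally common choice).

```python
from statistics import mean, pstdev


def z_score(values):
    """v' = (v - mean) / std dev, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # a constant column gives sigma == 0 and would divide by zero
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)  # a constant column gives hi == lo (division by zero)
    return [(v - lo) / (hi - lo) for v in values]


ages = [20.0, 30.0, 40.0, 50.0]  # hypothetical numeric feature
ages_z = z_score(ages)
ages_01 = min_max(ages)
print(ages_z)
print(ages_01)
```

After `z_score` the column has mean 0 and standard deviation 1; after `min_max` it lies in the interval [0, 1].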
You are right that dummy coding your categorical variables is required for PROC VARCLUS, as the procedure uses either "R2" or "Pearson correlation" as the distance function for clustering. Those statistics can only be applied to numeric variables. If discrete data is not handled carefully, there is a high chance that the clustering algorithm ends up discovering the discreteness of your data instead of a sensible structure. Where possible, consider rank ordering the values based on some business justification; for example, occupations can be ranked by their corresponding average salary.
If you want to specify relative weights for each observation in the input data set, place the weights in a variable in the data set and specify its name in a WEIGHT statement, i.e., WEIGHT variable;.

However, for points 2 and 3 of your question, a better approach might be to rank the variables by predictive power using their Information Value (IV) and Weight of Evidence (WOE). You can find the SAS macro and paper here. One advantage of this program is that continuous, ordinal and categorical variables can be assessed together.
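For readers without SAS, the idea behind WOE and IV can be sketched in plain Python. This is not the macro referred to above, just the commonly used definitions for a binary target: per category, WOE = ln(share of non-events / share of events), and IV sums (share of non-events - share of events) × WOE over categories. The function name `woe_iv`, the sign convention, and the sample data are assumptions made for illustration.

```python
import math
from collections import Counter


def woe_iv(categories, targets):
    """Return ({category: WOE}, IV) for a binary target (1 = event, 0 = non-event)."""
    events = Counter(c for c, t in zip(categories, targets) if t == 1)
    non_events = Counter(c for c, t in zip(categories, targets) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())

    woe, iv = {}, 0.0
    for cat in sorted(set(categories)):
        # Share of all non-events / all events that fall into this category.
        # Assumes each category contains at least one event and one non-event;
        # otherwise the ratio is 0 or undefined and some smoothing is needed.
        pct_non_event = non_events[cat] / total_non_events
        pct_event = events[cat] / total_events
        woe[cat] = math.log(pct_non_event / pct_event)
        iv += (pct_non_event - pct_event) * woe[cat]
    return woe, iv


departments = ["A", "A", "A", "A", "B", "B", "B", "B"]  # hypothetical feature
outcome = [1, 0, 0, 0, 1, 1, 1, 0]                      # hypothetical binary target
woe, iv = woe_iv(departments, outcome)
print(woe)
print(iv)
```

A variable with a larger IV separates the two classes more strongly, which is what makes IV usable for ranking variables of different types on one scale.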
Here are a few links describing cluster validation techniques

If your variables are largely categorical, you may want to consider hierarchical clustering with an appropriate distance function.

Solved – Binary Encoding vs One-hot Encoding

If you have a system with $n$ different (ordered) states, the binary encoding of a given state is simply its $\text{rank number} - 1$ in binary format (e.g. for the $k$th state, the binary representation of $k - 1$). The one-hot encoding of this $k$th state is a vector/series of length $n$ with a single high bit (1) at the $k$th place, while all the other bits are low (0).

As an example, here are the encodings for the following system (levels of education):

-----------------------------------------------
|   Level   | "Decimal  | Binary   | One hot  |
|           | encoding" | encoding | encoding |
-----------------------------------------------
| No        |     0     |    000   |  000001  |
| Primary   |     1     |    001   |  000010  |
| Secondary |     2     |    010   |  000100  |
| BSc/BA    |     3     |    011   |  001000  |
| MSc/MA    |     4     |    100   |  010000  |
| PhD       |     5     |    101   |  100000  |
-----------------------------------------------
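The table can be reproduced with a short Python sketch. The function names are hypothetical, `rank` is the zero-based "decimal encoding" from the table (i.e. $k - 1$ for the $k$th state), and bit positions are counted from the right, as in the table.

```python
LEVELS = ["No", "Primary", "Secondary", "BSc/BA", "MSc/MA", "PhD"]


def binary_encoding(rank, n_states):
    """Binary form of a zero-based rank, padded to the width needed for n_states."""
    width = max(1, (n_states - 1).bit_length())
    return format(rank, f"0{width}b")


def one_hot_encoding(rank, n_states):
    """n_states bits with a single 1; position counted from the right."""
    return format(1 << rank, f"0{n_states}b")


table = [
    (level, rank, binary_encoding(rank, len(LEVELS)), one_hot_encoding(rank, len(LEVELS)))
    for rank, level in enumerate(LEVELS)
]
for row in table:
    print(row)
# First row: ('No', 0, '000', '000001'); last row: ('PhD', 5, '101', '100000')
```

Binary encoding needs only 3 bits for these 6 states, while one-hot needs 6, one per state.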

References: One hot encoding at Wikipedia

A 2017 paper in the International Journal of Computer Applications, comparing the effects of different encodings on neural networks, could also be a good starting point: A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers
