Solved – Binary Encoding vs One-hot Encoding

Tags: categorical-encoding, classification, machine learning, neural networks

What is the difference between binary encoding and one-hot encoding for categorical input variables derived from English text, and how does each affect a neural network?
Can anyone point me to a scientific paper on this question?

Best Answer

If you have a system with $n$ different (ordered) states, the binary encoding of a given state is simply its $\text{rank number} - 1$ written in binary (e.g. the $k$th state is encoded as $k - 1$ in binary). The one-hot encoding of this $k$th state is a vector of length $n$ with a single high bit (1) at the $k$th place and all other bits low (0). In the table below, places are counted from the right.

As an example, here are the encodings for the following system (levels of education):

-----------------------------------------------
|   Level   | "Decimal  | Binary   | One hot  |
|           | encoding" | encoding | encoding |
-----------------------------------------------
| No        |     0     |    000   |  000001  |
| Primary   |     1     |    001   |  000010  |
| Secondary |     2     |    010   |  000100  |
| BSc/BA    |     3     |    011   |  001000  |
| MSc/MA    |     4     |    100   |  010000  |
| PhD       |     5     |    101   |  100000  |
-----------------------------------------------
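
The table can be reproduced with a short Python sketch. This is only an illustration: the function names `binary_encode` and `one_hot_encode` are made up for this example and do not come from any library, and the one-hot bit is placed counting from the right to match the table.

```python
# Levels in rank order; the list index plays the role of (rank number - 1).
levels = ["No", "Primary", "Secondary", "BSc/BA", "MSc/MA", "PhD"]


def binary_encode(index, n_states):
    """Return the index as a zero-padded binary string."""
    # Number of bits needed to represent the largest index (n_states - 1).
    width = max(1, (n_states - 1).bit_length())
    return format(index, f"0{width}b")


def one_hot_encode(index, n_states):
    """Return a string of n_states bits with a single 1, counted from the right."""
    bits = ["0"] * n_states
    bits[n_states - 1 - index] = "1"
    return "".join(bits)


for i, level in enumerate(levels):
    n = len(levels)
    print(f"{level:<10} {i}  {binary_encode(i, n)}  {one_hot_encode(i, n)}")
```

For these 6 states, binary encoding needs 3 bits per state while one-hot encoding needs 6, one per state.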

References: One hot encoding at Wikipedia

A 2017 paper in the International Journal of Computer Applications that compares the effects of different encodings on neural networks could be a good starting point: A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers