Encoding of categorical variables for machine learning: binary vs. one-hot followed by PCA

Tags: categorical-data, data-preprocessing, machine-learning, many-categories, pca

This post compares several methods of encoding categorical data. Binary encoding (convert categories to integers, write each integer in binary, and give each binary digit its own column) seems to provide the best combination of predictive accuracy and dimensionality control.
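For concreteness, here is a minimal sketch of binary encoding in Python; the `color` column and its categories are made up for illustration:

```python
# A minimal sketch of binary encoding; the column and categories
# are illustrative assumptions, not from the linked post.
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "yellow", "red"]})

# Step 1: map each category to an integer code.
codes = df["color"].astype("category").cat.codes.to_numpy()

# Step 2: write each code in binary and give every bit its own column.
n_bits = int(np.ceil(np.log2(codes.max() + 1)))
bits = (codes[:, None] >> np.arange(n_bits)[::-1]) & 1
binary_cols = pd.DataFrame(bits, columns=[f"color_bit{i}" for i in range(n_bits)])
print(binary_cols)
```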

However, the top answer to that post advocates applying PCA to one-hot encoded data (convert categories to integers, then give each integer value its own indicator column, e.g. with five categories, category 5 becomes 0, 0, 0, 0, 1) to isolate the most descriptive dimensions within. This would seem to be even better, dimension-wise, than binary encoding, and without the associated distortion of distances. Has anyone compared binary encoding against one-hot followed by PCA?
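A hedged sketch of that one-hot + PCA pipeline with scikit-learn follows; the `city` data and the choice of 3 components are assumptions for illustration (and `sparse_output` requires scikit-learn >= 1.2):

```python
# Illustrative sketch: one-hot encode a categorical column, then keep
# only the leading principal components of the indicator matrix.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["NY", "LA", "SF", "NY", "CHI", "SF", "LA"]})

# One column per level: k categories -> k indicator columns.
onehot = OneHotEncoder(sparse_output=False).fit_transform(df[["city"]])

# Reduce the k indicator columns to a few principal components.
reduced = PCA(n_components=3).fit_transform(onehot)
print(reduced.shape)  # (7, 3)
```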

Best Answer

Assuming that by binary encoding you mean the scheme explained here, I would advise against using it. It seems an ill-advised idea, and I will explain why, after first restating the idea briefly:

Suppose (only for simplicity) your categorical variable has $p=2^q$ levels; for the example I take $q=3$. Then code the levels with the binary numbers $0=000_2, 1=001_2, 2=010_2, 3=011_2, 4=100_2, 5=101_2, 6=110_2, 7=111_2$. Then, for each of the 3 bits, introduce one column in the design matrix, recording just that bit (so 0 or 1). What is wrong with this? The 8 possible levels are represented in a 3-dimensional linear subspace, introducing (many!) linear restrictions: a model can fit at most 3 free bit effects instead of 8 free level effects, so the effects of most levels are forced to be sums of the effects of others. Which restrictions you get depends entirely on how you happen to associate levels with binary numbers. In short: do not ever use this idea. That makes your main question here moot ...
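To make that rank argument concrete, here is a small numpy check (illustrative only): the one-hot indicator matrix for the 8 levels spans 8 dimensions, while the 3 bit columns span only 3.

```python
# Checking the rank collapse behind the "linear restrictions" point:
# 8 one-hot columns have rank 8, the 3 bit columns have rank 3,
# so binary encoding silently imposes 5 linear restrictions.
import numpy as np

levels = np.arange(8)
onehot = np.eye(8)[levels]                           # 8 x 8 indicator matrix
bits = (levels[:, None] >> np.arange(3)[::-1]) & 1   # 8 x 3 bit matrix

print(np.linalg.matrix_rank(onehot))  # 8 -> every level gets its own effect
print(np.linalg.matrix_rank(bits))    # 3 -> level effects forced to be additive in bits
```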

About applying PCA to a matrix of one-hot vectors encoding a categorical variable: there are many posts about that here already; search for them. But in that case I would look into using correspondence analysis in place of PCA; see also Doing principal component analysis or factor analysis on binary data.
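For reference, a bare-bones sketch of correspondence analysis computed directly via the SVD of the standardized residuals of a contingency table; the counts below are made up for illustration, not taken from any dataset:

```python
# Minimal correspondence analysis by hand (numpy only); the contingency
# table counts are an illustrative assumption.
import numpy as np

counts = np.array([[20, 5, 10],
                   [3, 15, 7],
                   [8, 2, 30]], dtype=float)

P = counts / counts.sum()                            # correspondence matrix
r = P.sum(axis=1)                                    # row masses
c = P.sum(axis=0)                                    # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sv) / np.sqrt(r)[:, None]          # principal row coordinates
print(row_coords[:, :2])                             # first two CA dimensions
```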