Solved – Should I convert a categorical variable with k levels to (k-1) or k binary variables

Tags: binary-data, categorical-data, multicollinearity, random-forest, svm

I'm building a predictive model with a combination of numeric, binary, and categorical variables. The outcome is binary.

For methods like SVM, I have read on Stack Exchange that categorical features must be converted to binary variables, i.e., for a variable such as "hair color" that can take on multiple levels (e.g., red, blonde, black, or brown), we create a binary variable for each level (hair red? y/n, hair blonde? y/n, etc.).
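For concreteness, here is a minimal sketch of that encoding using pandas (the column name and hair-color data are made up for the example):

```python
import pandas as pd

# Hypothetical data: a single categorical column with four levels
df = pd.DataFrame({"hair_color": ["red", "blonde", "black", "brown", "red"]})

# One indicator column per level -- the "k columns" representation
dummies = pd.get_dummies(df["hair_color"], prefix="hair")
# -> columns hair_black, hair_blonde, hair_brown, hair_red, one per level
```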

However, I'm concerned about linear dependence between the variables when I do this. In the hair color example, assuming everyone's hair color is one of red, blonde, black, or brown, and that each person has exactly one hair color, if I create 4 binary variables for hair color then their sum is always 1. In other words, $v_1 + v_2 + v_3 + v_4 = 1$, which means there is a linear dependence between the binary variables I have introduced to replace the multi-level categorical variable.

If I create only $k-1$ variables, I don't have that issue.
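A quick way to see both the dependence and the $k-1$ fix, continuing the made-up pandas example above (`drop_first` drops one level):

```python
import pandas as pd

df = pd.DataFrame({"hair_color": ["red", "blonde", "black", "brown", "red"]})

full = pd.get_dummies(df["hair_color"])                      # k columns
reduced = pd.get_dummies(df["hair_color"], drop_first=True)  # k - 1 columns

# Every row of the full encoding sums to exactly 1, which is the
# linear dependence v1 + v2 + v3 + v4 = 1 described above.
assert (full.sum(axis=1) == 1).all()

# With one level dropped, that exact dependence disappears; the dropped
# level is encoded implicitly as "all remaining indicators are 0".
assert reduced.shape[1] == full.shape[1] - 1
```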

I understand that multicollinearity is a problem when fitting a regression model with continuous data because the inverse of the covariance matrix becomes unstable. Is linear dependence/multicollinearity a concern for SVM and random forest when using binary/categorical data?

Best Answer

With random forests? No. Trees don't look at linear combinations of features, so linear dependence isn't a problem. In fact, if you use the one-hot representation but leave out, say, brown hair (because it's equivalent to not-red, not-blonde, not-black), individual trees will have a hard time asking about brown hair: they would need all three of the other indicators to be among the randomly sampled candidate features to express it. If you include a separate brown-hair indicator, a single split can ask the question directly. Many random forest packages, though definitely not all, accept "real" categorical inputs, in which case you don't need the one-hot representation at all.
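As one illustration of the "keep all $k$ columns" route, here is a hedged sketch with scikit-learn's `RandomForestClassifier` (which expects numeric inputs, so one-hot encoding is the usual approach there); the data, column names, and labels are invented for the example:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative data: one categorical feature plus a numeric one
X = pd.DataFrame({
    "hair_color": ["red", "blonde", "black", "brown", "red", "black"],
    "age": [23, 31, 45, 52, 19, 60],
})
y = [0, 1, 0, 1, 0, 1]

# Keep all k indicator columns: the redundancy is harmless to trees and
# lets a single split ask "brown hair?" directly.
X_enc = pd.concat(
    [pd.get_dummies(X["hair_color"], prefix="hair"), X[["age"]]], axis=1
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_enc, y)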

With SVMs? Generally no. A distance-based kernel, such as the RBF kernel, definitely does not have this issue. Linear and polynomial kernels do form linear combinations of the features, so without regularization you can run into the same multicollinearity problems as with OLS. But you should almost never fit an SVM without regularization; the standard $L_2$ penalty is equivalent to doing ridge regression, which handles most of the issues with multicollinearity, as noted in the thread linked by @DJohnson above.
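A minimal sketch of the regularized, distance-based case with scikit-learn's `SVC` (the toy one-hot feature matrix below is invented; in `SVC`, the parameter `C` is the inverse regularization strength, so smaller `C` means stronger regularization):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy feature matrix: four hair-color indicator columns plus age
X = np.array([
    [1, 0, 0, 0, 23],
    [0, 1, 0, 0, 31],
    [0, 0, 1, 0, 45],
    [0, 0, 0, 1, 52],
    [1, 0, 0, 0, 19],
    [0, 0, 1, 0, 60],
])
y = np.array([0, 1, 0, 1, 0, 1])

# The RBF kernel only sees pairwise distances, so using k vs. k-1 indicator
# columns does not create a collinearity problem; C controls regularization.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X, y)
```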
