Solved – two ways of predicting a categorical variable

categorical-data, machine-learning, python

I'm working on a machine learning problem where I have to predict a categorical variable that takes 12 different values, let's say A, B, C, and so on. (One thing I should mention is that one of the categories contains around 85% of the data.)

I implemented two different approaches, but I'd like your opinion on which one is better so I can focus on that particular one (I'm using the XGBClassifier algorithm in Python):

1) Keep the variable in one column, treat it as the categorical variable it is, and build a single model that predicts A, B, C, and so on; or

2) Create 12 columns in my dataset called "isA", "isB", "isC", and so on, where 11 of them have value 0 and one has value 1 depending on the category; then fit 12 models to my data, each outputting the probability of belonging to its category (option 'binary:logistic'). A sketch of both setups follows below.
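Here is a minimal sketch of what the two setups look like in code. The data (`X`, `y`) is hypothetical and randomly generated just to make the example self-contained; in practice you would use your own features and labels encoded as integers 0..11.

```python
import numpy as np
from xgboost import XGBClassifier

# Hypothetical data: X is a feature matrix, y holds the 12 category labels
# encoded as integers 0..11, with one class holding ~85% of the rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.choice(12, size=1000, p=[0.85] + [0.15 / 11] * 11)

# Approach 1: one multiclass model. 'multi:softprob' outputs a full
# probability distribution over the 12 categories for each row.
multi_model = XGBClassifier(objective="multi:softprob")
multi_model.fit(X, y)

# Approach 2: twelve independent one-vs-rest binary models, each trained
# on an "is this category k?" indicator target.
binary_models = []
for k in range(12):
    m = XGBClassifier(objective="binary:logistic")
    m.fit(X, (y == k).astype(int))
    binary_models.append(m)
```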

One thing I realized is that if I ask the first model to predict the probability of each category and sum the 12 probabilities, they sum to 1, whereas with the second approach that does not necessarily happen.
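To make that observation concrete, continuing the sketch above with the same hypothetical `multi_model` and `binary_models`:

```python
# Compare how the predicted probabilities behave on a few rows.
X_new = X[:5]

# Approach 1: each row of predict_proba is a proper distribution,
# so the 12 probabilities sum to 1 (up to floating point).
p_multi = multi_model.predict_proba(X_new)
print(p_multi.sum(axis=1))  # -> values all ~1.0

# Approach 2: stacking the 12 independent "probability of class k"
# outputs gives rows that need not sum to 1.
p_binary = np.column_stack([m.predict_proba(X_new)[:, 1]
                            for m in binary_models])
print(p_binary.sum(axis=1))  # -> values that can drift from 1.0
```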

What do you guys think? What are the advantages and disadvantages of each approach?

Best Answer

Of course you should use the 1st approach. The classical way to do it is multinomial logistic regression (note that this is a synonym for softmax regression; see http://www.kdnuggets.com/2016/07/softmax-regression-related-logistic-regression.html).
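The key property of the softmax is that it turns a vector of per-class scores into probabilities that sum to 1 by construction, which is exactly the behavior you noticed in your first model. A minimal NumPy illustration, with hypothetical raw scores:

```python
import numpy as np

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating.
    z = np.exp(scores - scores.max())
    return z / z.sum()

# Hypothetical raw scores (logits) for the 12 categories.
scores = np.array([3.1, 0.2, -1.0, 0.5, 0.0, -0.3,
                   1.2, -2.0, 0.8, 0.1, -0.5, 0.4])
probs = softmax(scores)
print(probs.sum())  # -> 1.0, by construction
```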

The 2nd approach amounts to fitting several independent binary logistic regressions, one per class. As you observed, the predicted probabilities would not sum to 1, so they do not form a proper distribution over the classes. This is not recommended; as far as I know, nobody does it.

Note: there are forms of multinomial regression other than logistic, but logistic is perfect for a start.
