Solved – In practice, why do we convert categorical class labels to integers for classification

classification, scikit-learn

This might be a naive question, but I am wondering why we (or maybe it's just me) are converting categorical class labels to integers before we feed them to a classifier in a software package such as Python's scikit-learn ML library?

Take the simple Iris dataset: why do we convert the class labels from "Setosa", "Virginica", and "Versicolor" to, e.g., 0, 1, and 2?

This question came up when I was working collaboratively on a project and one of my colleagues didn't use a label encoder to convert the class labels from strings to integers. It worked (she was using scikit-learn); I intuitively "corrected" it (inserted a label encoder) and she asked me why. Well, I really had no good answer to that other than "most machine learning algorithms work better this way" (this is something I read somewhere some time ago).

Now that I thought about it: What is the rationale behind it? Since in typical classification tasks class labels are nominal, not ordinal variables, is it computational efficiency (storing and processing less "data")?
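The scenario described above (string labels passed straight to a scikit-learn classifier versus labels run through a label encoder first) can be reproduced with a small sketch. The choice of `DecisionTreeClassifier` and the variable names are assumptions made for illustration; the question does not say which classifier was used.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
# Rebuild string class labels ('setosa', 'versicolor', 'virginica')
y_str = iris.target_names[iris.target]

# Variant 1: fit directly on the string labels
clf_str = DecisionTreeClassifier(random_state=0).fit(X, y_str)

# Variant 2: encode the labels as integers first
encoder = LabelEncoder()
y_int = encoder.fit_transform(y_str)
clf_int = DecisionTreeClassifier(random_state=0).fit(X, y_int)

pred_str = clf_str.predict(X[:5])              # predictions come back as strings
pred_int = clf_int.predict(X[:5])              # predictions come back as integers
decoded = encoder.inverse_transform(pred_int)  # map integers back to strings

print(clf_str.classes_)
print(list(pred_str) == list(decoded))
```

Both variants fit without error, because scikit-learn classifiers keep the distinct labels in `classes_` and map predictions back to them.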

Best Answer

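I believe scikit-learn estimators only handle numeric feature values. So for categorical inputs you need to do something like one-hot encoding, where n numerical dimensions are used to represent membership in n categories. If you just pass in strings as features, the estimator will try to cast them to floats, which typically fails with an error for non-numeric strings. Class labels are a different case: scikit-learn classifiers accept string labels and encode them internally, which is why your colleague's code worked.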
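As a minimal sketch of the one-hot encoding mentioned above, the following uses scikit-learn's `OneHotEncoder` on a single categorical feature. The color values are made-up illustration data, not taken from the question.

```python
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three distinct values (hypothetical data).
# The encoder expects a 2D input: one row per sample, one column per feature.
colors = [["red"], ["green"], ["blue"], ["green"]]

encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array for display
onehot = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(onehot)                  # one column per category, a single 1 per row
```

Each of the n = 3 categories gets its own column, so no artificial ordering between categories is introduced.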

There are mathematical reasons some methods (like SVMs) need floats, i.e., they are only defined in the space of real numbers. Representing 3 categories as the values 1, 2, 3 in a single dimension might work, but it may also yield suboptimal performance compared to one-hot encoding, since the split (1, 3) vs. (2) is difficult to pick up on unless the method can capture very nonlinear behavior like that.
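The (1, 3) vs. (2) problem can be checked with a small brute-force sketch in plain Python. It assumes the simplest possible linear rule, a single threshold on one feature, and the helper `threshold_accuracy` is a hypothetical name introduced only for this illustration.

```python
# Three categories coded as 1, 2, 3 in a single dimension.
# Target: category 2 belongs to one class, categories 1 and 3 to the other.
codes = [1, 2, 3]
labels = [0, 1, 0]

def threshold_accuracy(values, targets, threshold, positive_above):
    """Accuracy of the rule 'predict 1 if value is above (or not above) the threshold'."""
    correct = 0
    for value, target in zip(values, targets):
        above = value > threshold
        prediction = int(above) if positive_above else int(not above)
        correct += int(prediction == target)
    return correct / len(values)

# Try every threshold between and around the codes, in both directions
best_int = max(
    threshold_accuracy(codes, labels, t, direction)
    for t in (0.5, 1.5, 2.5, 3.5)
    for direction in (True, False)
)

# With one-hot encoding, the column for category 2 is [0, 1, 0]
onehot_column_2 = [0, 1, 0]
best_onehot = threshold_accuracy(onehot_column_2, labels, 0.5, True)

print(best_int)     # best a single threshold can do on the integer codes
print(best_onehot)  # a threshold on the one-hot column
```

No single threshold on the integer codes classifies all three categories correctly, whereas a threshold on the one-hot column for category 2 does.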

Other methods, like random forests, can be made to work directly on categorical values, i.e., during decision tree learning you can propose potential splits as different combinations of categories. For such methods it is often convenient to use ints to represent the categories, because an array of ints is much nicer to work with than an array of strings on a computational level. You can also do things like generate all possible combinations of n categories by looking at the bit values of an n-bit integer you are incrementing, which can be much faster and more memory efficient than searching for splits over n floats.
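The bit-counting idea can be sketched as follows. This is only an illustration of enumerating category subsets with an incrementing integer, not how any particular library implements its trees; `candidate_splits` is a hypothetical helper name.

```python
def candidate_splits(categories):
    """Enumerate all two-way partitions of the categories using bitmasks.

    Bit i of the mask decides whether category i goes to the left branch.
    Masks run from 1 to 2**(n-1) - 1, which keeps the last category on the
    right and so skips empty branches and mirror-image duplicates.
    """
    n = len(categories)
    splits = []
    for mask in range(1, 2 ** (n - 1)):
        left = [c for i, c in enumerate(categories) if (mask >> i) & 1]
        right = [c for i, c in enumerate(categories) if not (mask >> i) & 1]
        splits.append((left, right))
    return splits

splits = candidate_splits(["setosa", "versicolor", "virginica"])
for left, right in splits:
    print(left, "vs", right)
```

For n categories this yields 2^(n-1) - 1 candidate splits, so it is practical only when the number of categories is small.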