Machine Learning – Why Ordinal Target in Classification Problems Needs Special Attention

classification, machine-learning, multinomial-logit, ordered-logit, ordinal-data

I have been working on an ML problem in which I want to predict an amount of money, say $a$, $b$, $c$, or $d$, that might be lent to a customer given their credit file. Those amounts are represented as ordered bins, i.e. $a < b < c < d$.

At first I framed this as a multiclass classification problem, and even though I did not obtain "good" performance, it never occurred to me that the cause could be the inherent order in my target.

After googling, I found a paper that develops a method for classification with an ordinal target, but I'm still not sure what the implications are in this scenario, or even why it needs special attention.

The paper states:

Standard classification algorithms for nominal classes can be applied to ordinal prediction problems by discarding the ordering information in the class attribute. However, some information is lost when this is done, information that can potentially improve the predictive performance of a classifier.

But this does not clarify the point for me.

Could you please help me understand the implications of an ordinal target?
Could this order be responsible for poor performance on a multiclass classification task when it is ignored, e.g. when applying logistic regression, an ensemble method, or any other classification model?

Best Answer

Dave's comments are on the right track. I'll try to expand on them.

Ordinal regression is halfway between classification and real-valued regression. When you perform multiclass classification on your ordinal data, you assign the same penalty whenever your classifier predicts a wrong class, no matter which one.

For example, assume that in your problem, for some input vector $x$, the right prediction is $a$. Assume you are training two classifiers, $C_1$ and $C_2$. The first one predicts $b$, while the other predicts $d$. In the multiclass classification sense, $C_1$ and $C_2$ are equally far off: both have missed the correct class. But from the ordinal regression perspective, $C_1$ is obviously better than $C_2$, since it has missed the correct "class" by only one bin, not by three.
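A minimal sketch of this difference, using the hypothetical bins and predictions from the example above (the loss functions chosen here, 0-1 loss and absolute bin distance, are illustrative assumptions, not from the paper):

```python
# Score the same two wrong predictions under a multiclass (0-1) loss
# versus an ordinal loss (absolute distance in bin ranks).
bins = ["a", "b", "c", "d"]
rank = {label: i for i, label in enumerate(bins)}

true_label = "a"
pred_c1 = "b"  # C1 misses by one bin
pred_c2 = "d"  # C2 misses by three bins

# Multiclass view: both classifiers look equally bad.
zero_one = [int(p != true_label) for p in (pred_c1, pred_c2)]

# Ordinal view: C1 is clearly closer to the truth than C2.
ordinal = [abs(rank[p] - rank[true_label]) for p in (pred_c1, pred_c2)]

print(zero_one)  # [1, 1] -- no distinction between C1 and C2
print(ordinal)   # [1, 3] -- C1 is penalized less
```

Evaluating an "ordinal" model with plain accuracy has the same blind spot, which is one reason metrics like mean absolute error over bin ranks are often preferred for ordinal targets.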

To drive this point to the extreme, imagine performing a very-many-classes classification instead of regression. That is, you have predictors $x$ and a real-valued response variable $y$. You can treat the values of $y$ as classes: $y = 3.14159$ would be one class, $y = 1.4142$ another, and so on. If you had $N$ observations, you would likely have $N$ different classes (assuming all $y$'s differ). You could try to train a multiclass classifier, but you would be likely to fail, as there would be only one observation per class. And even if you succeeded (because you were lucky enough to have the same $y$'s repeat multiple times), you would essentially have many independent models, each predicting only its own class and not caring much about the others.
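The "one observation per class" degeneracy is easy to see numerically. A small sketch (the sample size and random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
y = rng.normal(size=N)  # a continuous, real-valued response

# Treat every distinct value of y as its own "class".
n_classes = len(np.unique(y))

# With a continuous response, ties have probability zero, so the
# number of "classes" equals the number of observations.
print(n_classes)  # 100 -- one training example per class
```

A multiclass learner given this data has no way to generalize within a class, whereas a regression model pools all $N$ observations to fit one function.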

Such an ensemble of models would also be quite complex. If each model has, say, $M$ parameters, and you had $K$ classes to predict $(K < N)$, your ensemble would have $M \cdot K$ parameters. In contrast, the complexity of a regression model is likely to be independent of the number of distinct $y$ values: you settle in advance on a linear, quadratic, or other functional form to fit through your data, and the form of the function determines the number of parameters.
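The parameter counts above can be made concrete with a back-of-the-envelope sketch (the values of $M$ and $K$ here are made up for illustration):

```python
M = 10  # assumed parameters per per-class model
K = 50  # assumed number of distinct "classes"

# One independent model per class: parameter count grows with K.
ensemble_params = M * K

# A quadratic regression y = w0 + w1*x + w2*x^2: fixed parameter
# count, regardless of how many distinct y values the data contains.
quadratic_params = 3

print(ensemble_params, quadratic_params)  # 500 vs 3
```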

In ordinal regression, e.g. proportional odds logistic regression, it is common to have one set of parameters (a vector) shared by all "classes" (i.e. ordinal values), plus a set of scalars that distinguish the individual ordinal values. The same holds for support vector ordinal regression (see e.g. http://www.gatsby.ucl.ac.uk/~chuwei/paper/svor.pdf), where you have the same model, consisting of the same $\alpha$'s (Lagrange multipliers) for all "classes", and distinguish between the classes only by the corresponding $b$'s (one per "class").
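To make the "shared vector plus per-class scalars" structure concrete, here is a minimal sketch of the proportional odds model's predicted probabilities. The coefficient values and cutpoints below are invented for illustration; in practice they would be estimated from data (e.g. with `statsmodels`' `OrderedModel`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def proportional_odds_probs(x, beta, cutpoints):
    """P(Y = k | x) under a proportional odds model: one shared
    coefficient vector beta, plus K-1 ordered cutpoints (scalars)."""
    eta = x @ beta
    # Cumulative probabilities P(Y <= k) all share beta; only the
    # per-class cutpoint differs between "classes".
    cum = sigmoid(cutpoints - eta)          # shape (K-1,)
    cum = np.concatenate([cum, [1.0]])      # P(Y <= K) = 1
    return np.diff(cum, prepend=0.0)        # per-class probabilities

beta = np.array([0.8, -0.5])            # hypothetical shared coefficients
cutpoints = np.array([-1.0, 0.5, 2.0])  # hypothetical thresholds for a < b < c < d

x = np.array([1.0, 2.0])
p = proportional_odds_probs(x, beta, cutpoints)
print(p.round(3), p.sum())  # a valid probability vector over the 4 bins
```

Note that the model for four ordered bins needs only the two shared coefficients plus three cutpoints, rather than an independent parameter vector per bin, which is exactly the complexity saving discussed above.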
