Solved – Response is an integer. Should I use classification or regression?


During a class in my master's in computer science, the professor asked us to come up with the best model for a particular data set. In it, we are given measurements of the weight and size of an abalone and need to predict the number of rings (an integer) in its shell. Here is an example of how the data looks:

[screenshot of example rows from the abalone data set]

The original paper (Sam Waugh (1995), "Extending and benchmarking Cascade-Correlation"), in which this data set was first used, takes a classification approach where each distinct number of rings is treated as a different class.

I see a couple of problems with this approach:

  • First of all, the evaluation metric the paper's author uses is classification accuracy, which does not consider how close the predicted value is to the true response. For example, a model that predicts 3 when the correct value is 4 is treated the same as a model that predicts 22 when the correct value is 4 (both got the classification wrong; see the short numeric illustration after this list).

  • Second, the data set is highly imbalanced, with few abalones having a high number of rings.
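To make the first point concrete, here is a tiny illustration (my own toy numbers, not from the data set): plain accuracy scores a near miss and a wild miss identically, while a distance-aware metric such as mean absolute error separates them.

```python
# Toy illustration: accuracy ignores how far off a wrong prediction is,
# while an error metric such as MAE does not.
import numpy as np

y_true = np.array([4, 4])          # true ring counts
y_pred_close = np.array([3, 4])    # off by one on the first example
y_pred_far = np.array([22, 4])     # wildly off on the first example

def accuracy(y, yhat):
    return np.mean(y == yhat)

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

print(accuracy(y_true, y_pred_close), accuracy(y_true, y_pred_far))  # 0.5 0.5 -- identical
print(mae(y_true, y_pred_close), mae(y_true, y_pred_far))            # 0.5 9.0 -- very different
```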

To the best of my understanding, both of these problems would disappear if we used a regression model (with, for example, root mean square error as the evaluation metric) instead of classification. However, the usual regression models give you real-valued predictions. To my non-statistician brain, this does not seem to be an issue, since you can always round the prediction to the nearest integer.
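As a rough sketch of that idea, the snippet below fits an off-the-shelf regressor, rounds its predictions to the nearest integer, and reports RMSE. The file name "abalone.csv" and the column names are assumptions (they follow the usual UCI layout with no header row); any other regressor could be substituted for LinearRegression.

```python
# Minimal sketch: regression on ring counts, with predictions rounded to integers.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumed local copy of the UCI abalone data, headerless, with these columns.
cols = ["Sex", "Length", "Diameter", "Height", "WholeWeight",
        "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings"]
df = pd.read_csv("abalone.csv", names=cols)

X = pd.get_dummies(df.drop(columns="Rings"), columns=["Sex"])  # encode the categorical predictor
y = df["Rings"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)                 # real-valued predictions
pred_int = np.rint(pred).astype(int)         # round to the nearest integer ring count

rmse = np.sqrt(np.mean((y_test - pred) ** 2))
rmse_rounded = np.sqrt(np.mean((y_test - pred_int) ** 2))
print(f"RMSE (raw): {rmse:.3f}   RMSE (rounded): {rmse_rounded:.3f}")
```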

My questions are then:

  1. Is (multiple) regression indeed the best approach for modelling this data?

  2. Is there an evaluation metric for classification that takes into account how close the predicted class is to the true one? If so, could it be used in this problem?

  3. Are there any problems with rounding the regression result to the nearest integer?

Any other comments, suggestions, or ideas that could help me tackle the problem are also very welcome.

Also, sorry if I made any incorrect assumptions or mistakes in my interpretation of the problem. Feel free to correct me.

Best Answer

I recently used the abalone dataset to illustrate some regression methods and ran into essentially the same questions. (UPDATE: link to the paper "Predictive State Smoothing (PRESS): Scalable non-parametric regression for high-dimensional data with variable selection".)

Here is my take on it:

  1. I would say regression is the most natural way to approach this problem (see the general comment at the end of the post for the domain-specific rationale). A plain multi-class classification approach is, IMHO, downright wrong -- for the reason you point out (predicting '22' for a '3' is treated as just as good/bad as predicting a '4', which is obviously not true).

  2. I think you are looking for 'ordered' or 'ordinal' classification, which takes such an ordering into account (see e.g. http://www.cs.waikato.ac.nz/~eibe/pubs/ordinal_tech_report.pdf, which also contains an example on the abalone dataset; a minimal sketch of that threshold idea follows after this list). However, even ordinal classification has the problem that you cannot predict anything other than the observed numbers of rings. Say one day there is a massive abalone shell that is 20% larger than any shell we have seen before -- a classification approach will most likely put it in the largest class, which is '29'. That makes no sense, as any biologist will tell you that the shell is most likely a rare find of, say, a 35-ring abalone.

  3. No, not a problem at all -- rounding is just part of your prediction model.
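Regarding point 2, a minimal sketch of the threshold idea from the linked Waikato tech report (one binary classifier per "rings > k" cut-off, whose probabilities are combined into per-class probabilities) might look as follows. The helper names are my own, and X/y are assumed to be a feature matrix and integer ring counts as in the earlier regression sketch.

```python
# Hedged sketch of ordinal classification via per-threshold binary classifiers.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ordinal(X, y):
    """Fit one P(y > k) classifier per internal threshold k."""
    thresholds = np.sort(np.unique(y))[:-1]          # all class values except the largest
    models = {k: LogisticRegression(max_iter=1000).fit(X, (y > k).astype(int))
              for k in thresholds}
    return thresholds, models

def predict_ordinal(X, thresholds, models, classes):
    """Combine threshold probabilities into per-class probabilities and take the argmax."""
    p_gt = np.column_stack([models[k].predict_proba(X)[:, 1] for k in thresholds])
    # Pad with P(y > -inf) = 1 on the left and P(y > max) = 0 on the right.
    p_gt = np.hstack([np.ones((X.shape[0], 1)), p_gt, np.zeros((X.shape[0], 1))])
    p_class = p_gt[:, :-1] - p_gt[:, 1:]             # P(y = k) = P(y > k-1) - P(y > k)
    return classes[np.argmax(p_class, axis=1)]

# Usage (assuming X_train, y_train, X_test from the earlier sketch):
# classes = np.sort(np.unique(y_train))
# thresholds, models = fit_ordinal(X_train.values, y_train.values)
# pred = predict_ordinal(X_test.values, thresholds, models, classes)
```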

Having said all this, in the end you should ask yourself what domain-specific problem the abalone data is actually trying to help solve.

It is predicting the age of an abalone, with the number of rings as a proxy. A biologist is not really interested in predicting the number of rings; they want to know the age. So a prediction of, say, 6.124 is no less useful than '6' or '7' -- in fact, it is probably more useful. I "blame" this on the CS/engineering habit of casting everything as a precision/recall problem: people like to frame this as an integer prediction/classification task rather than regression, not because that is actually the underlying problem, but because it fits their tools and benchmark metrics (who does not love to throw a deep net classifier at this problem and declare victory because "precision/recall or AUC is really high"? ;) )
