Regression – Why Converting a Regression Model to a Classification Model via Output Discretization Enhances Performance

Tags: categorical-data, classification, continuous-data, regression

In regression problems, if the continuous output is discretized into bins/categories/clusters and those are used as labels, the problem is reduced to a classification problem.

My question is: what is the theoretical or applied motivation behind doing this reduction? In my particular experiments on predicting location from text, I have often seen improvements when I model the problem as classification rather than regression.

In my particular case the output is 2D, but I'm looking for a more general explanation.

Update:
Assume the input is BoW text and the output is coordinates (e.g. as in geotagged Twitter data). In regression, the task is to predict lat/lon from the text using a squared-error loss. If we cluster the training lat/lon points and treat each cluster as a class, we can instead predict a class by optimizing a cross-entropy loss in a classification model.
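
For concreteness, here is a minimal sketch of that reduction. The synthetic coordinates, the use of kmeans(), and the choice of 20 clusters are stand-ins of mine, not part of the question:

set.seed(1)
nn <- 1000
train.coords <- cbind(lat = runif(nn, 30, 50), lon = runif(nn, -120, -70))   # synthetic stand-in for training lat/lon

kk <- 20                                      # number of clusters/classes (arbitrary choice)
clusters <- kmeans(train.coords, centers = kk, nstart = 10)
class.labels <- factor(clusters$cluster)      # cluster index becomes the class label

# Per-class representative point used at prediction time:
# the componentwise median of the training coordinates in that class.
class.medians <- apply(train.coords, 2, function(col) tapply(col, class.labels, median))

# A classifier p(class | BoW features) is then trained with cross-entropy loss on
# class.labels, and a predicted class is mapped back to its row of class.medians.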

Evaluation:

For regression: the mean distance between the predicted locations and the gold locations.

For classification: the mean distance between the median training point of the predicted cluster and the gold location (a sketch of both metrics follows below).
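
To pin these metrics down, here is a sketch of both. The haversine great-circle distance and the variable names (pred.coords, gold.coords, pred.class) are stand-ins of mine; the question does not specify a distance function:

# Great-circle (haversine) distance in km between two (lat, lon) matrices, row by row.
haversine.km <- function(p1, p2) {
    rad <- pi/180
    dlat <- (p2[,1]-p1[,1])*rad
    dlon <- (p2[,2]-p1[,2])*rad
    aa <- sin(dlat/2)^2 + cos(p1[,1]*rad)*cos(p2[,1]*rad)*sin(dlon/2)^2
    6371*2*asin(pmin(1,sqrt(aa)))
}

# Regression: mean distance between predicted and gold coordinates
# (pred.coords, gold.coords: n x 2 matrices of lat/lon).
# mean(haversine.km(pred.coords, gold.coords))

# Classification: map each predicted class to the median training point of its
# cluster (class.medians from the sketch above), then take the same mean distance.
# mean(haversine.km(class.medians[as.character(pred.class), , drop = FALSE], gold.coords))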

Best Answer

Let's look at the sources of error for your classification predictions, compared to those for a linear prediction. If you classify, you have two sources of error:

  1. Error from classifying into the wrong bin
  2. Error from the difference between the bin median and the target value (the "gold location")

If your data has low noise, you will usually classify into the correct bin. If you also have many bins, the second source of error will be low. If, conversely, you have high-noise data, you might often misclassify into the wrong bin, and this can dominate the overall error, even if you have many small bins so that the second source of error is small whenever you do classify correctly. Then again, if you have few bins, you will more often classify correctly, but your within-bin error will be larger.
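
One rough way to write this trade-off down (my notation, not from the original answer): with $\hat B$ the predicted bin, $B^*$ the bin the target $y$ actually falls into, $m_B$ the median of bin $B$, and $\hat y = m_{\hat B}$ the prediction,

$$\mathbb{E}\big[(y-\hat y)^2\big] = \Pr(\hat B \neq B^*)\,\mathbb{E}\big[(y - m_{\hat B})^2 \mid \hat B \neq B^*\big] + \Pr(\hat B = B^*)\,\mathbb{E}\big[(y - m_{B^*})^2 \mid \hat B = B^*\big].$$

Higher noise inflates the misclassification probability and with it the first term; smaller bins shrink the second term but make misclassification more likely.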

In the end, it probably comes down to an interplay between the noise and the bin size.

Here is a little toy example, which I ran for 200 simulations: a simple linear relationship with noise and only two bins.

[Figure: training set scatter plot, colored by bin (high noise)]

Now, let's run this with either low or high noise. (The training set above had high noise.) In each case, we record the MSEs from a linear model and from a classification model:

nn.sample <- 100   # observations per simulated training/test set
stdev <- 1         # noise standard deviation; 1 is the "high noise" setting, lower it for the low-noise run
nn.runs <- 200     # number of simulation runs
results <- matrix(NA,nrow=nn.runs,ncol=2,dimnames=list(NULL,c("MSE.OLS","MSE.Classification")))

for ( ii in 1:nn.runs ) {
    set.seed(ii)
    xx.train <- runif(nn.sample,-1,1)
    yy.train <- xx.train+rnorm(nn.sample,0,stdev)
    discrete.train <- yy.train>0    # two bins: below vs. above zero
    bin.medians <- structure(by(yy.train,discrete.train,median),.Names=c("FALSE","TRUE"))    # per-bin median of the target

    # plot(xx.train,yy.train,pch=19,col=discrete.train+1,main="Training")

    model.ols <- lm(yy.train~xx.train)                           # regression model
    model.log <- glm(discrete.train~xx.train,family="binomial")  # classification model (logistic regression)

    xx.test <- runif(nn.sample,-1,1)
    yy.test <- xx.test+rnorm(nn.sample,0,stdev)    # test data with the same noise level as the training data

    results[ii,1] <- mean((yy.test-predict(model.ols,newdata=data.frame(xx.train=xx.test)))^2)
    results[ii,2] <- mean((yy.test-bin.medians[as.character(predict(model.log,newdata=data.frame(xx.train=xx.test))>0)])^2)
}

plot(results,xlim=range(results),ylim=range(results),main=paste("Standard Deviation of Noise:",stdev))
abline(a=0,b=1)
colMeans(results)
t.test(x=results[,1],y=results[,2],paired=TRUE)

[Figures: OLS vs. classification MSEs across runs, with low noise and with high noise]

As we see, whether classification improves accuracy comes down to the noise level in this example.

You could play around a little with simulated data, or with different bin sizes.
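
If you want to experiment along those lines, here is a sketch of the same simulation with a configurable number of equal-width bins; nnet::multinom() stands in for the logistic regression, and the generalization (including the function name run.once) is mine, not part of the original code:

library(nnet)

run.once <- function(seed, nn.sample = 100, stdev = 1, nn.bins = 4) {
    set.seed(seed)
    xx.train <- runif(nn.sample, -1, 1)
    yy.train <- xx.train + rnorm(nn.sample, 0, stdev)

    # equal-width bins over the observed training range
    breaks <- seq(min(yy.train), max(yy.train), length.out = nn.bins + 1)
    bins.train <- cut(yy.train, breaks, include.lowest = TRUE)
    bin.medians <- tapply(yy.train, bins.train, median)

    model.ols <- lm(yy.train ~ xx.train)
    model.cls <- multinom(bins.train ~ xx.train, trace = FALSE)

    xx.test <- runif(nn.sample, -1, 1)
    yy.test <- xx.test + rnorm(nn.sample, 0, stdev)
    newdata <- data.frame(xx.train = xx.test)

    c(MSE.OLS = mean((yy.test - predict(model.ols, newdata))^2),
      MSE.Classification = mean((yy.test - bin.medians[as.character(predict(model.cls, newdata))])^2))
}

# average MSEs over 200 runs, e.g. with four bins and high noise
rowMeans(sapply(1:200, run.once, stdev = 1, nn.bins = 4))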

Finally, note that if you are trying different bin sizes and keeping the ones that perform best, you shouldn't be surprised that this performs better than a linear model. After all, you are essentially adding more degrees of freedom, and if you are not careful (cross-validation!), you'll end up overfitting the bins.
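
A minimal way to guard against that is to pick the number of bins on a validation split rather than on the data you report results on. The helper below (choose.nn.bins, all names mine) uses a single hold-out split for brevity; k-fold cross-validation would be the more thorough version:

# Pick the number of bins by hold-out validation (sketch, reuses nnet::multinom()).
choose.nn.bins <- function(xx, yy, candidates = 2:10, train.frac = 0.8) {
    idx <- sample(length(yy), floor(train.frac * length(yy)))
    dd.train <- data.frame(xx = xx[idx],  yy = yy[idx])
    dd.valid <- data.frame(xx = xx[-idx], yy = yy[-idx])
    errs <- sapply(candidates, function(kk) {
        breaks <- seq(min(dd.train$yy), max(dd.train$yy), length.out = kk + 1)
        dd.train$bin <- cut(dd.train$yy, breaks, include.lowest = TRUE)
        bin.medians <- tapply(dd.train$yy, dd.train$bin, median)
        model <- nnet::multinom(bin ~ xx, data = dd.train, trace = FALSE)
        preds <- bin.medians[as.character(predict(model, newdata = dd.valid))]
        mean((dd.valid$yy - preds)^2)          # validation MSE for kk bins
    })
    candidates[which.min(errs)]
}

# e.g.: choose.nn.bins(xx.train, yy.train)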
