Solved – Model Evaluation for Discrete Regression

model-evaluation, r-squared, regression

I'm building a model to predict count variables, i.e. the quantity I'm predicting is a positive integer.

I know that for regression a usual metric of model quality is the R-squared coefficient, but I'm not sure if this is a good metric for a discrete output. What's the usual metric for model evaluation for a discrete regression?

Best Answer

If you really wanted, then you could use one of multiple proposals for pseudo-$R^2$ for generalized linear models, since Poisson regression is a kind of generalized linear model. However, in general, even if $R^2$ is popular, it is not the best measure and can be misleading.
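As a rough illustration of what such a pseudo-$R^2$ looks like, here is a minimal sketch on simulated data. The data, the variable names and the choice of the deviance-based and McFadden variants are my assumptions for the example; other proposals exist and can give different values.

```r
# Simulated count data; names and coefficients are illustrative assumptions.
set.seed(1)
n <- 200
x <- runif(n, -1, 1)
y <- rpois(n, lambda = exp(0.5 + 1.0 * x))

fit  <- glm(y ~ x, family = poisson)   # Poisson regression
fit0 <- glm(y ~ 1, family = poisson)   # intercept-only (null) model

# Deviance-based pseudo-R^2: share of the null deviance removed by the model.
r2.deviance <- 1 - fit$deviance / fit$null.deviance

# McFadden's pseudo-R^2: compares log-likelihoods of the fitted and null models.
r2.mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit0))

c(deviance = r2.deviance, mcfadden = r2.mcfadden)
```

The two numbers generally disagree, which is one reason not to read too much into any single pseudo-$R^2$.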

Instead, what you could do is:

If you are comparing models, you could use information criteria such as AIC or BIC, or likelihood-ratio tests (the latter for nested models).
You could use cross-validation, and if you are going to use your model for prediction, you should seriously consider it. In its simplest form, cross-validation means splitting the data into two parts: one part is used for "training" the model, and the other is used to make predictions and compare them with the observed values (k-fold cross-validation repeats this over several splits). This tests the model on data it has not "seen" before, so we can check how it might behave on external data.
In many cases, a very simple and very revealing thing to do is to plot the distribution of your observed response and the distribution of your predictions as two overlapping histograms or density plots. This can quickly show you what exactly your model predicts.
Another thing to consider is posterior predictive checks. The idea is to simulate random data from your model and then compare the distribution of the simulated data to the real data, to check where and how they resemble each other.
Besides, I'd highly recommend looking at diagnostic plots to make sure there are no problems with your model.
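The suggestions in the list above can be sketched in base R roughly as follows. Everything here is an assumption made for illustration: the simulated data, the names `dat`, `fit.small` and `fit.large`, the use of mean absolute error as the cross-validation score, and the proportion of zeros as the check statistic. Note also that `simulate()` draws from the fitted model with the estimated coefficients plugged in, so this is a simplified stand-in for a full posterior predictive check, which would also account for parameter uncertainty.

```r
set.seed(42)
n <- 300
dat <- data.frame(x1 = runif(n, -1, 1), x2 = rnorm(n))
dat$y <- rpois(n, lambda = exp(0.3 + 0.8 * dat$x1))  # x2 is pure noise by construction

fit.small <- glm(y ~ x1,      family = poisson, data = dat)
fit.large <- glm(y ~ x1 + x2, family = poisson, data = dat)

# 1. Information criteria and a likelihood-ratio test for the nested pair.
ic <- rbind(AIC = c(AIC(fit.small), AIC(fit.large)),
            BIC = c(BIC(fit.small), BIC(fit.large)))
colnames(ic) <- c("small", "large")
lrt <- anova(fit.small, fit.large, test = "Chisq")

# 2. k-fold cross-validation: fit on k-1 folds, predict the held-out fold.
k <- 5
fold <- sample(rep(1:k, length.out = n))
cv.mae <- function(form) {
  errs <- sapply(1:k, function(f) {
    fit  <- glm(form, family = poisson, data = dat[fold != f, ])
    pred <- predict(fit, newdata = dat[fold == f, ], type = "response")
    mean(abs(dat$y[fold == f] - pred))
  })
  mean(errs)
}
cv <- c(small = cv.mae(y ~ x1), large = cv.mae(y ~ x1 + x2))

# 3. Simulation-based check: does the model reproduce the share of zeros?
sims <- simulate(fit.small, nsim = 500)   # one column per simulated data set
zero.sim <- colMeans(sims == 0)
zero.obs <- mean(dat$y == 0)

# 4. Overlapping histograms and diagnostic plots (uncomment to draw).
# hist(dat$y, freq = FALSE, col = rgb(0, 0, 1, 0.4))
# hist(fitted(fit.small), freq = FALSE, col = rgb(1, 0, 0, 0.4), add = TRUE)
# plot(fit.small)

ic; lrt; cv
quantile(zero.sim, c(0.025, 0.975)); zero.obs
```

If `zero.obs` falls far outside the range of `zero.sim`, that would suggest the model misrepresents the number of zeros, for example because of overdispersion or zero inflation.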

See also "How to calculate goodness of fit in glm (R)" and "If the model fits well, nothing can be done?"

For further reading, I'd highly recommend Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill, or Regression Modeling Strategies by Frank E. Harrell.

Related Solutions

Solved – Evaluation metric for rare event probability regression

You could define a custom loss function $L(y,\hat y)$ that quantifies the trade-off between false positives and false negatives. Then you could use the expected loss on held-out data for evaluation. If we denote by $y_i\in\{0,1\}$ the correct label for row $i$ of the test data and by $\hat p_i$ the corresponding probability predicted by your classifier, the expected loss would be \begin{equation} \frac 1 n \sum_{i=1}^n \left[ \hat p_i L(y_i,1) + (1-\hat p_i)L(y_i,0) \right] , \tag{1}\label{eq:exp_loss} \end{equation} where $n$ is the size of the test set. In the case of binary classification, the loss boils down to four numbers: $L(0,0)$, $L(0,1)$, $L(1,0)$, and $L(1,1)$. Typically, one would set $L(0,0)=L(1,1)=0$. The important part is setting $L(0,1)$, the penalty for false positives, and $L(1,0)$, the penalty for false negatives. These are highly problem-specific and you will have to judge for yourself how to set them (only their ratio matters). For example, if you are detecting cancer, then maybe a false negative is 100 times as bad as a false positive, and so $L(0,1)=1, L(1,0)=100$.
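As a minimal sketch of how the expected loss above could be computed, assuming the loss is stored as a 2x2 matrix indexed by (true label, predicted label); the function name `expected.loss`, the matrix layout and the toy data are mine, not part of the original answer:

```r
# y: true labels in {0, 1}; p.hat: predicted probabilities of class 1.
# L: 2x2 loss matrix with L[true + 1, predicted + 1] = L(true, predicted).
expected.loss <- function(y, p.hat, L) {
  loss.if.pred1 <- L[cbind(y + 1, 2)]   # L(y_i, 1) for every row
  loss.if.pred0 <- L[cbind(y + 1, 1)]   # L(y_i, 0) for every row
  mean(p.hat * loss.if.pred1 + (1 - p.hat) * loss.if.pred0)
}

# Cancer-style example: a false negative is 100 times as bad as a false positive.
# matrix() fills column by column, so the first column holds L(0,0), L(1,0).
L <- matrix(c(0, 100,    # predicted 0: L(0,0) = 0, L(1,0) = 100
              1,   0),   # predicted 1: L(0,1) = 1, L(1,1) = 0
            nrow = 2)

# Hypothetical held-out labels and predicted probabilities.
y     <- c(0, 0, 1, 0, 1)
p.hat <- c(0.1, 0.3, 0.8, 0.05, 0.4)

expected.loss(y, p.hat, L)
```

Here the two positive cases dominate the total, because any probability mass the model puts on "negative" for them is multiplied by 100.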
Alternatively, the F1 score might be suitable if you have class imbalance. Since you have probabilistic predictions, you could use $$ \mathrm{precision} = \frac{\sum_{i=1}^n y_i\hat p_i}{\sum_{i=1}^n \hat p_i}, $$ $$ \mathrm{recall} = \frac{\sum_{i=1}^n y_i\hat p_i}{\sum_{i=1}^n y_i}, $$ $$ F_1 = 2\cdot \frac{ \mathrm{recall} \cdot \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}}. $$ The above gives equal importance to precision and recall, so if that is not what you want, consider the $F_\beta$ score instead.
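The probabilistic precision, recall and F score can be sketched as below. The function name `soft.f.beta` and the toy data are assumptions for illustration; the $F_\beta$ expression is the standard one, which reduces to the $F_1$ formula above when `beta = 1`.

```r
# y: true labels in {0, 1}; p.hat: predicted probabilities of class 1.
soft.f.beta <- function(y, p.hat, beta = 1) {
  tp        <- sum(y * p.hat)        # "soft" count of true positives
  precision <- tp / sum(p.hat)
  recall    <- tp / sum(y)
  f <- (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
  c(precision = precision, recall = recall, f = f)
}

# Hypothetical held-out labels and predicted probabilities.
y     <- c(0, 0, 1, 0, 1)
p.hat <- c(0.1, 0.3, 0.8, 0.05, 0.4)

res.f1 <- soft.f.beta(y, p.hat)            # equal weight on precision and recall
res.f2 <- soft.f.beta(y, p.hat, beta = 2)  # recall weighted more heavily
res.f1; res.f2
```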

Regression – Why Converting a Regression Model to a Classification Model via Output Discretization Enhances Performance

Let's look at the sources of error for your classification predictions, compared to those for a linear prediction. If you classify, you have two sources of error:

Error from classifying into the wrong bin
Error from the difference between the bin median and the target value (the "gold location")

If your data has low noise, then you will usually classify into the correct bin. If you also have many bins, then the second source of error will be low. Conversely, if you have high-noise data, then you might often classify into the wrong bin, and this might dominate the overall error, even if you have many small bins, so that the second source of error would be small whenever you classify correctly. Then again, if you have few bins, then you will more often classify correctly, but your within-bin error will be larger.

In the end, it probably comes down to an interplay between the noise and the bin size.

Here is a little toy example, which I ran for 200 simulations. A simple linear relationship with noise and only two bins:

Now, let's run this with either low or high noise. (The training set above had high noise.) In each case, we record the MSEs from a linear model and from a classification model:

nn.sample <- 100
stdev <- 1
nn.runs <- 200
results <- matrix(NA,nrow=nn.runs,ncol=2,dimnames=list(NULL,c("MSE.OLS","MSE.Classification")))

for ( ii in 1:nn.runs ) {
    set.seed(ii)
    xx.train <- runif(nn.sample,-1,1)
    yy.train <- xx.train+rnorm(nn.sample,0,stdev)
    discrete.train <- yy.train>0
    bin.medians <- structure(by(yy.train,discrete.train,median),.Names=c("FALSE","TRUE"))

    # plot(xx.train,yy.train,pch=19,col=discrete.train+1,main="Training")

    model.ols <- lm(yy.train~xx.train)
    model.log <- glm(discrete.train~xx.train,"binomial")

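    xx.test <- runif(nn.sample,-1,1)
    yy.test <- xx.test+rnorm(nn.sample,0,0.1)

    # newdata must use the predictor name from the fit (xx.train); otherwise predict() silently reuses the training x values
    results[ii,1] <- mean((yy.test-predict(model.ols,newdata=data.frame(xx.train=xx.test)))^2)
    # predicted log-odds > 0 means predicted probability > 0.5, i.e. the upper bin
    results[ii,2] <- mean((yy.test-bin.medians[as.character(predict(model.log,newdata=data.frame(xx.train=xx.test))>0)])^2)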
}

plot(results,xlim=range(results),ylim=range(results),main=paste("Standard Deviation of Noise:",stdev))
abline(a=0,b=1)
colMeans(results)
t.test(x=results[,1],y=results[,2],paired=TRUE)

As we see, whether classification improves accuracy comes down to the noise level in this example.

You could play around a little with simulated data, or with different bin sizes.

Finally, note that if you are trying different bin sizes and keeping the ones that perform best, you shouldn't be surprised that this performs better than a linear model. After all, you are essentially adding more degrees of freedom, and if you are not careful (cross-validation!), you'll end up overfitting the bins.
