Solved – Cost function for validating Poisson regression models

Tags: generalized-linear-model, poisson-distribution, r

For count data that I have collected, I use Poisson regression to build models. I do this using the glm function in R with family = "poisson". To evaluate candidate models (I have several predictors) I use the AIC. So far so good. Now I want to perform cross-validation. I have already succeeded in doing this using the cv.glm function from the boot package. From the documentation of cv.glm I see that, e.g. for binomial data, you need to use a specific cost function to get a meaningful prediction error. However, I have no idea yet what cost function is appropriate for family = poisson, and an extensive Google search did not yield any specific results. My question is whether anybody can shed some light on which cost function is appropriate for cv.glm in the case of Poisson GLMs.

Best Answer

Assuming nothing special about your particular case, I think there is a good argument for using either the default (mean squared error), the mean squared error of the log responses, or the chi-squared error.

The purpose of the cost function is to express how "upset" you are with wrong predictions, specifically what "wrongness" bothers you most. This is particularly important for binary responses, but can matter in any situation.

Mean Square Error (of responses)

$C = \frac{1}{n}\sum_i (Y_i-\hat Y_i)^2$

Using the MSE you are equally sensitive to errors from above and below, and equally sensitive for large and small predictions. This is a pretty standard thing to do, so I don't think it would be frowned upon in most situations.
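For concreteness, here is how such a cost function might be passed to cv.glm; cv.glm calls the cost with the observed responses first and the cross-validated predictions (on the response scale) second. The data frame counts_df and the formula y ~ x are placeholders for your actual data:

```r
library(boot)

# Mean squared error of the responses -- this matches cv.glm's default cost.
cost_mse <- function(y, yhat) mean((y - yhat)^2)

# Hypothetical Poisson GLM on a data frame counts_df with response y
# and a single predictor x.
fit <- glm(y ~ x, family = poisson, data = counts_df)

# 10-fold cross-validation; the estimated prediction error is in $delta.
cv_out <- cv.glm(counts_df, fit, cost = cost_mse, K = 10)
cv_out$delta
```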

Mean Square Error (of log responses)

$C = \frac{1}{n}\sum_i (\ln Y_i-\ln \hat Y_i)^2$

Because you are working with count data, it could be argued that your costs are neither symmetric nor indifferent to size: being out by 10 counts on a prediction of 10 is very different from being out by 10 on a prediction of 1000. This is a somewhat "canonical" cost function, because you have matched the costs up to the (log) link function, which ensures that the costs match the variance structure being assumed by the model.
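A sketch of a log-scale cost for cv.glm. Note that observed counts can be zero, so some guard is needed before taking logs; the 0.5 offset below is my own ad hoc choice, not part of the model, and other conventions (e.g. +1) exist:

```r
# Mean squared error on the log scale; the 0.5 offset (an assumption,
# not canonical) keeps log() finite when a count of zero is observed.
cost_log <- function(y, yhat) mean((log(y + 0.5) - log(yhat + 0.5))^2)

# Used exactly like the MSE cost above:
# cv.glm(counts_df, fit, cost = cost_log, K = 10)
```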

Chi-Squared Error

$C = \frac{1}{n}\sum_i \frac{(Y_i-\hat Y_i)^2}{\hat Y_i}$

A third way would be to use the chi-squared error. This could be particularly appealing if you are comparing your GLM to other count-based models, especially if there are factors in your GLM. Like the log-response error, this scales with the size of the prediction, but it is symmetric around the predicted count: you are now evaluating goodness of fit based on a percentage error.
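A corresponding sketch for cv.glm; for a Poisson GLM with the default log link, the predictions on the response scale are strictly positive, so the division is safe:

```r
# Chi-squared (Pearson-style) error: squared error scaled by the
# predicted count, i.e. roughly a percentage-error criterion.
cost_chisq <- function(y, yhat) mean((y - yhat)^2 / yhat)
```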


On The Discreteness

The question cites the documentation example, which has a binary response variable and so uses a different cost function. The issue for a binary response is that the GLM will forecast a real number between 0 and 1, even though the response is always exactly 0 or 1. It is perfectly valid to say that the closer that number is to the correct response the better the forecast, but often people don't want this. The reasoning is that one often must act as though the response is either 0 or 1, and so will take anything less than 0.5 as a forecast of 0. In that case, it makes sense simply to count the number of "wrong" forecasts. The argument here is that for a True/False question you can only ever be right or wrong - there is no gradation of wrongness.

In your case you have count data. Here it is far more common to accept predictions that are not on the same support as the response: a prediction of 2.4 children per family, for example, or 9.7 deaths per year. Usually one would not try to do anything about this, because it is not about being "right" or "wrong", just about being as close as you can get. If you really must have a prediction that is an integer, though, perhaps because you have a very low count rate, then there is no reason you can't round the prediction first and count the "whole number" error. In this case, the three expressions above still apply; you simply need to round $\hat Y$ first.
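If you do want integer predictions, the rounding can simply be folded into the cost function; a sketch, shown with the squared-error cost, though the same wrapping works for the other two:

```r
# Round predictions to whole counts before scoring.
cost_mse_rounded <- function(y, yhat) mean((y - round(yhat))^2)
```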