Poisson Models – Error Metrics for Cross-Validating Poisson Distribution Models

count-data, cross-validation, deviance, poisson-distribution, scoring-rules

I'm cross-validating a model that's trying to predict a count. If this were a binary classification problem, I'd calculate out-of-fold AUC, and if this were a regression problem I'd calculate out-of-fold RMSE or MAE.

For a Poisson model, what error metrics can I use to evaluate the "accuracy" of the out-of-sample predictions? Is there a Poisson extension of AUC that looks at how well the predictions order the actual values?

It seems that a lot of Kaggle competitions for counts (e.g., the number of useful votes a Yelp review will get, or the number of days a patient will spend in the hospital) use the root mean squared logarithmic error, or RMSLE.


Edit: One thing I've been doing is calculating deciles of the predicted values and then looking at the actual counts, binned by decile. If decile 1 is low, decile 10 is high, and the deciles in between are strictly increasing, I've been calling the model "good," but I've been having trouble quantifying this process, and I'm convinced there's a better approach.
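
A minimal R sketch of that decile check, assuming predicted and actual are vectors of out-of-fold predictions and observed counts (and that the predictions have enough distinct values for the decile breaks to be unique):

# bin observations by decile of the predicted value
decile <- cut(predicted,
              breaks = quantile(predicted, probs = seq(0, 1, 0.1)),
              include.lowest = TRUE, labels = 1:10)
# mean observed count per decile; "good" here means roughly increasing
tapply(actual, decile, mean)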

Edit 2: I'm looking for a formula that takes predicted and actual values and returns some "error" or "accuracy" metric. My plan is to calculate this function on the out-of-fold data during cross-validation, and then use it to compare a wide variety of models (e.g., a Poisson regression, a random forest, and a GBM).

For example, one such function is RMSE = sqrt(mean((predicted-actual)^2)). Another such function would be AUC. Neither seems right for Poisson data.
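
For reference, a minimal R sketch of the RMSLE mentioned above, written in the same predicted/actual form (the + 1 guards against zero counts):

# root mean squared logarithmic error (RMSLE)
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}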

Best Answer

There are several proper and strictly proper scoring rules for count data you can use. Scoring rules are penalties $s(y,P)$, where $P$ is the predictive distribution and $y$ is the observed value. They have a number of desirable properties: first and foremost, a forecast that is closer to the true distribution always receives a smaller penalty, and there is a (unique) best forecast, namely the one whose predicted probabilities coincide with the true probabilities. Thus minimizing the expectation of $s(y,P)$ means reporting the true probabilities. See also Wikipedia.

Often one takes the average of these over all $n$ predicted values:

$S=\frac{1}{n}\sum_{i=1}^n s(y^{(i)},P^{(i)})$
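
In R this averaging step is just a mean of per-observation penalties; a sketch, where score_fun, y_oof, and mu_oof are hypothetical stand-ins for one of the rules below, the held-out counts, and their predictive means:

# mean penalty over the out-of-fold observations (lower is better)
S <- mean(mapply(score_fun, y_oof, mu_oof))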

Which rule to use depends on your objective, but I'll give a rough characterization of when each one works well.

In what follows I use $f(y)$ for the predictive probability mass function $\Pr(Y=y)$ and $F(y)$ for the predictive cumulative distribution function. A sum $\sum_k$ runs over the whole support of the count distribution (i.e., $k = 0, 1, 2, \dots$). $I$ denotes the indicator function. $\mu$ and $\sigma$ are the mean and standard deviation of the predictive distribution (which are usually directly estimated quantities in count data models).

Strictly proper scoring rules

  • Brier score: $s(y,P)=-2 f(y) + \sum_k f^2(k)$ (stable for size imbalance in categorical predictors)
  • Dawid-Sebastiani score: $s(y,P)=(\frac{y-\mu}{\sigma})^2+2\log\sigma$ (good for general predictive model choice; stable for size imbalance in categorical predictors)
  • Deviance score: $s(y,P)=-2\log f(y) + g_y$ ($g_y$ is a normalization term that depends only on $y$; in Poisson models it is usually taken as the saturated deviance; good for use with estimates from a maximum likelihood (ML) framework)
  • Logarithmic score: $s(y,P)=-\log f(y)$ (very easily calculated; stable for size imbalance in categorical predictors)
  • Ranked probability score: $s(y,P)=\sum_k \{F(k)-I(y\leq k)\}^2$ (good for contrasting different predictions of very high counts; susceptible to size imbalance in categorical predictors)
  • Spherical score: $s(y,P)=-\frac{f(y)}{\sqrt{\sum_k f^2(k)}}$ (stable for size imbalance in categorical predictors)

Other scoring rules (not so proper but often used)

  • Absolute error score: $s(y,P)=|y-\mu|$ (not proper)
  • Squared error score: $s(y,P)=(y-\mu)^2$ (not strictly proper; susceptible to outliers; susceptible to size imbalance in categorical predictors)
  • Pearson normalized squared error score: $s(y,P)=(\frac{y-\mu}{\sigma})^2$ (not strictly proper; susceptible to outliers; can be used for model checking by seeing whether the averaged score is very different from 1; stable for size imbalance in categorical predictors)
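
A minimal single-observation sketch of these three rules in R, using hypothetical values y and mu (for a Poisson predictive distribution, $\sigma^2 = \mu$):

y  <- 3    # hypothetical observed count
mu <- 2.5  # hypothetical predictive mean
abs(y - mu)        # absolute error score
(y - mu)^2         # squared error score
(y - mu)^2 / mu    # Pearson normalized squared error score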

Example R code for the strictly proper rules:

library(vcdExtra)
m1 <- glm(Freq ~ mental, family=poisson, data=Mental)

# scores for the first observation
mu <- predict(m1, type="response")[1]   # predictive mean (Poisson lambda)
x  <- Mental$Freq[1]                    # observed count

# logarithmic score (equivalent to the deviance score up to a constant)
-log(dpois(x, lambda=mu))

# quadratic (Brier) score; the sum over the support is truncated at 1000
-2*dpois(x, lambda=mu) + sapply(mu, function(lam){ sum(dpois(0:1000, lambda=lam)^2) })

# spherical score
-dpois(x, lambda=mu) / sqrt(sapply(mu, function(lam){ sum(dpois(0:1000, lambda=lam)^2) }))

# ranked probability score; the infinite sum is truncated at 10000
sum(ppois((-1):(x-1), mu)^2) + sum((ppois(x:10000, mu) - 1)^2)

# Dawid-Sebastiani score (for the Poisson, sigma^2 = mu)
(x-mu)^2/mu + log(mu)
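
To compare models, one would compute such a score for every held-out observation and average, as in the formula for $S$ above. A sketch with the logarithmic score, using the in-sample fitted values here for brevity (with cross-validation you would use the out-of-fold predictions instead):

# averaged logarithmic score over all observations (lower is better)
mu_all <- predict(m1, type="response")
mean(-log(dpois(Mental$Freq, lambda=mu_all)))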