Poisson Models – Error Metrics for Cross-Validating Poisson Distribution Models

count-data, cross-validation, deviance, poisson-distribution, scoring-rules

I'm cross-validating a model that's trying to predict a count. If this were a binary classification problem, I'd calculate out-of-fold AUC, and if this were a regression problem I'd calculate out-of-fold RMSE or MAE.

For a Poisson model, what error metrics can I use to evaluate the "accuracy" of the out-of-sample predictions? Is there a Poisson extension of AUC that looks at how well the predictions order the actual values?

It seems that a lot of Kaggle competitions for counts (e.g., the number of useful votes a Yelp review will get, or the number of days a patient will spend in the hospital) use the root mean squared logarithmic error, or RMSLE.


Edit: One thing I've been doing is calculating deciles of the predicted values and then looking at the actual counts, binned by decile. If decile 1 is low, decile 10 is high, and the deciles in between are strictly increasing, I've been calling the model "good," but I've been having trouble quantifying this process, and I'm convinced there's a better approach.
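
A minimal R sketch of that decile check, assuming predicted and actual are vectors of out-of-fold predictions and observed counts (and that the predictions have enough distinct values for the decile breaks to be unique):

# bin observations by decile of the predicted value
decile <- cut(predicted,
              breaks = quantile(predicted, probs = seq(0, 1, 0.1)),
              include.lowest = TRUE, labels = 1:10)
# mean observed count per decile; "good" here means roughly increasing
tapply(actual, decile, mean)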

Edit 2: I'm looking for a formula that takes predicted and actual values and returns some "error" or "accuracy" metric. My plan is to calculate this function on the out-of-fold data during cross-validation, and then use it to compare a wide variety of models (e.g., a Poisson regression, a random forest, and a GBM).

For example, one such function is RMSE = sqrt(mean((predicted-actual)^2)). Another such function would be AUC. Neither seems right for Poisson data.
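
For reference, a minimal R sketch of the RMSLE mentioned above, written in the same predicted/actual form (the + 1 guards against zero counts):

# root mean squared logarithmic error (RMSLE)
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}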

Best Answer

There are several proper and strictly proper scoring rules for count data you can use. Scoring rules are penalties $s(y,P)$, where $P$ is the predictive distribution and $y$ is the observed value. They have a number of desirable properties: first and foremost, a forecast that is closer to the true distribution always receives a smaller penalty, and there is a (unique) best forecast, namely the one whose predicted probabilities coincide with the true probabilities. Thus minimizing the expectation of $s(y,P)$ means reporting the true probabilities. See also Wikipedia.

Often one takes the average of these over all $n$ predicted values:

$S=\frac{1}{n}\sum_{i=1}^n s(y^{(i)},P^{(i)})$
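
In R this averaging step is just a mean of per-observation penalties; a sketch, where score_fun, y_oof, and mu_oof are hypothetical stand-ins for one of the rules below, the held-out counts, and their predictive means:

# mean penalty over the out-of-fold observations (lower is better)
S <- mean(mapply(score_fun, y_oof, mu_oof))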

Which rule to use depends on your objective, but I'll give a rough characterization of when each one works well.

In what follows I use $f(y)$ for the predictive probability mass function $\Pr(Y=y)$ and $F(y)$ for the predictive cumulative distribution function. A sum $\sum_k$ runs over the whole support of the count distribution (i.e., $k = 0, 1, 2, \dots$). $I$ denotes the indicator function. $\mu$ and $\sigma$ are the mean and standard deviation of the predictive distribution (which are usually directly estimated quantities in count data models).

Strictly proper scoring rules

  • Brier score: $s(y,P)=-2 f(y) + \sum_k f^2(k)$ (stable for size imbalance in categorical predictors)
  • Dawid-Sebastiani score: $s(y,P)=(\frac{y-\mu}{\sigma})^2+2\log\sigma$ (good for general predictive model choice; stable for size imbalance in categorical predictors)
  • Deviance score: $s(y,P)=-2\log f(y) + g_y$ ($g_y$ is a normalization term that depends only on $y$; in Poisson models it is usually taken as the saturated deviance; good for use with estimates from a maximum likelihood (ML) framework)
  • Logarithmic score: $s(y,P)=-\log f(y)$ (very easily calculated; stable for size imbalance in categorical predictors)
  • Ranked probability score: $s(y,P)=\sum_k \{F(k)-I(y\leq k)\}^2$ (good for contrasting different predictions of very high counts; susceptible to size imbalance in categorical predictors)
  • Spherical score: $s(y,P)=-\frac{f(y)}{\sqrt{\sum_k f^2(k)}}$ (stable for size imbalance in categorical predictors)

Other scoring rules (not so proper but often used)

  • Absolute error score: $s(y,P)=|y-\mu|$ (not proper)
  • Squared error score: $s(y,P)=(y-\mu)^2$ (not strictly proper; susceptible to outliers; susceptible to size imbalance in categorical predictors)
  • Pearson normalized squared error score: $s(y,P)=(\frac{y-\mu}{\sigma})^2$ (not strictly proper; susceptible to outliers; can be used for model checking by seeing whether the averaged score is very different from 1; stable for size imbalance in categorical predictors)
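
A minimal single-observation sketch of these three rules in R, using hypothetical values y and mu (for a Poisson predictive distribution, $\sigma^2 = \mu$):

y  <- 3    # hypothetical observed count
mu <- 2.5  # hypothetical predictive mean
abs(y - mu)        # absolute error score
(y - mu)^2         # squared error score
(y - mu)^2 / mu    # Pearson normalized squared error score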

Example R code for the strictly proper rules:

library(vcdExtra)
m1 <- glm(Freq ~ mental, family=poisson, data=Mental)

# scores for the first observation
mu <- predict(m1, type="response")[1]   # predictive mean (Poisson lambda)
x  <- Mental$Freq[1]                    # observed count

# logarithmic score (equivalent to the deviance score up to a constant)
-log(dpois(x, lambda=mu))

# quadratic (Brier) score; the sum over the support is truncated at 1000
-2*dpois(x, lambda=mu) + sapply(mu, function(lam){ sum(dpois(0:1000, lambda=lam)^2) })

# spherical score
-dpois(x, lambda=mu) / sqrt(sapply(mu, function(lam){ sum(dpois(0:1000, lambda=lam)^2) }))

# ranked probability score; the infinite sum is truncated at 10000
sum(ppois((-1):(x-1), mu)^2) + sum((ppois(x:10000, mu) - 1)^2)

# Dawid-Sebastiani score (for the Poisson, sigma^2 = mu)
(x-mu)^2/mu + log(mu)
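
To compare models, one would compute such a score for every held-out observation and average, as in the formula for $S$ above. A sketch with the logarithmic score, using the in-sample fitted values here for brevity (with cross-validation you would use the out-of-fold predictions instead):

# averaged logarithmic score over all observations (lower is better)
mu_all <- predict(m1, type="response")
mean(-log(dpois(Mental$Freq, lambda=mu_all)))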