Solved – How to test how “good” the Poisson Model is

generalized linear model, poisson distribution, scikit-learn, statsmodels

Background: coding in Python, using pandas, statsmodels and scikit-learn (statsmodels supports Poisson regression, scikit-learn can easily split the sample into training and testing sets). Very little experience in statistics.

Let's assume I have a hypothesis: for every soccer game, several figures (predictors) can be used to estimate the total goals scored (response) in that game. By total goals, I mean the number of goals scored by both teams.

Notes:

  • The predictor values are calculated using only information available before the game. So, for example, if x1 is the home team's average goals, only past game results affect it (obvious, I know).

  • Some of the predictor values are not whole numbers. As I understand, this shouldn't affect the model.

So I've split my sample to training and testing sets.

Here's the output I get after I fit the model:

                               GEE Regression Results                              
===================================================================================
Dep. Variable:                 total_goals   No. Observations:                 2458
Model:                                 GEE   No. clusters:                     2458
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   1
Family:                            Poisson   Mean cluster size:                 1.0
Dependence structure:         Independence   Num. iterations:                     7
Date:                     Mon, 16 Jan 2017   Scale:                           0.982
Covariance type:                    robust   Time:                         14:39:12
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.2474      0.097      2.552      0.011         0.057     0.437
var1                 0.1471      0.027      5.453      0.000         0.094     0.200
var2                 0.1115      0.027      4.068      0.000         0.058     0.165
var3                 0.2351      0.114      2.069      0.039         0.012     0.458
==============================================================================
Skew:                          0.5144   Kurtosis:                       0.0916
Centered skew:                 0.0000   Centered kurtosis:             -3.0000
==============================================================================

I'm not sure how to interpret all of the column values, but I know that (in this case) if P>|z| is smaller than 0.05, then the variable is considered statistically significant.

What I want to ask is whether there is a better way (than the following) to test whether my model is good at predicting whether more than two goals are scored in a match. Here's what I'm currently doing (a rough code sketch follows the list):

  1. For each game in the test set, predict total_goals.
  2. If the prediction for total_goals is above a certain threshold (cutoff), count it as a counted prediction.
  3. For each counted prediction, compare against the actual result of the game. If the actual result is above 2 goals, mark it as correct; if not, incorrect.
  4. Output the hit rate %.
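
Roughly, in code (model, X_test, y_test and the cutoff value are placeholders standing in for my actual fitted Poisson model and test split, not my real variable names):

# Sketch of steps 1-4; model, X_test, y_test and cutoff are hypothetical placeholders.
import numpy as np

cutoff = 2.5                                   # hypothetical threshold (step 2)
preds = np.asarray(model.predict(X_test))      # step 1: predicted total_goals
actual = np.asarray(y_test)

counted = preds > cutoff                       # step 2: predictions I act on
correct = actual[counted] > 2                  # step 3: did the game really have >2 goals?
hit_rate = correct.mean() if counted.any() else np.nan   # step 4: hit rate
print(counted.sum(), hit_rate)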

The issue that I don't really know how to address is this: scikit-learn (rightfully) randomizes the games in the sample when it splits them into test and train subsets. Thus, I obtain a different result (a different hit rate %) each time. Some are satisfactory, some aren't.

How can I estimate confidence in my model, then? I thought about running the whole procedure 1,000 times and taking the average hit rate %, but that doesn't sound like the right way to do it.
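
What I had in mind, sketched (df, one row per game, and predictors, a list of column names, are placeholders):

# Sketch of the "repeat many times" idea: re-split, re-fit, and collect the hit rates.
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

cutoff = 2.5          # hypothetical threshold from step 2 above
rates = []
for seed in range(1000):
    train, test = train_test_split(df, test_size=0.3, random_state=seed)
    res = sm.Poisson(train['total_goals'], sm.add_constant(train[predictors])).fit(disp=0)
    preds = np.asarray(res.predict(sm.add_constant(test[predictors])))
    actual = test['total_goals'].to_numpy()
    counted = preds > cutoff                          # predictions above the cutoff
    if counted.any():
        rates.append((actual[counted] > 2).mean())    # hit rate for this split

print(np.mean(rates), np.std(rates))   # average hit rate and its spread across splits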

Any help or guidance is much appreciated!

Update

User psarka suggested I use scikit-learn's cross-validation. The issue here is that I'm not using scikit-learn to fit my model. Moreover, when I try to run:

from sklearn import svm
from sklearn.model_selection import cross_val_score

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, train, test, cv=5)

where train is my training set (0.7 of the sample) and test is my testing set (0.3 of the sample), it throws an error saying that the data are of different lengths. I assume this is because, in the scikit-learn example that I'm trying to adapt to my code:

scores = cross_val_score(clf, iris.data, iris.target, cv=5)

iris.data and iris.target are of the same length: the former contains the predictors, while the latter contains the correct classifications. But I'm not building a classifier. How can I adjust this to fit my code, or should I?
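
For reference, the call pattern that example uses would look roughly like this with my data (df, the column names and the estimator below are made-up placeholders, not my actual model):

# cross_val_score(estimator, X, y): features and response of the same length,
# with the function doing the splitting itself; everything here is a placeholder.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = df[['var1', 'var2', 'var3']]    # predictors, one row per game
y = df['total_goals']               # response, same length as X

scores = cross_val_score(LinearRegression(), X, y, cv=5)   # placeholder estimator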

Best Answer

Some general answers:

Using the Poisson distribution to estimate the mean parameter requires relatively weak assumptions, essentially only that our model for the mean (expected value) of the response variable y given the explanatory variables x is correctly specified. With the default log link this is

$$E(y \mid x) = \exp(x b)$$

We can use this to predict the mean for a new observation given a new set of explanatory variables. We can also get standard errors and confidence intervals for the prediction of the mean.
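
For example, assuming a GLM Poisson fit and a recent statsmodels version (y, X and X_new are placeholders for the response, the design matrix and new explanatory variables), a sketch of getting the predicted mean with a confidence interval:

import statsmodels.api as sm

res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
pred = res.get_prediction(X_new)          # prediction of the mean E(y | x) = exp(x b)
print(pred.summary_frame(alpha=0.05))     # predicted mean, standard error, 95% interval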

However, when we want the entire predictive distribution, we need the assumption that we have a correctly specified distribution for the response. statsmodels does not have much built-in support for goodness-of-fit tests for distributions outside the normal case, especially not for discrete distributions like Bernoulli, Binomial or Poisson.

However, it is easy to get the predictive distribution using the scipy.stats distributions.

For example, a sequence of calls to get the results for Poisson could be:

from scipy import stats
import statsmodels.api as sm
results = sm.Poisson(y, x).fit()
mean_predicted = results.predict()   # or use a new x
prob_2more = stats.poisson.sf(2 - 1, mean_predicted).mean() # average over x in sample
freq_2more = (y > 2 - 1).mean()  # sample average

Or, similarly, the probability of observing exactly y = 2 given the x (i.e. the predicted mean) for each observation:

prob_2_obs = stats.poisson.pmf(2, mean_predicted)

Note: the 2 - 1 in sf is needed because sf(k) = Prob(y > k) is a strict inequality, so the weak inequality Prob(y >= 2) is obtained as sf(1).

The pmf could be used to create an analog of a classification table comparing predicted versus empirical counts.
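
For instance, a minimal sketch of such a comparison, continuing with mean_predicted and y from the snippet above:

# Model-implied vs. empirical frequencies for each count of total goals.
import numpy as np
from scipy import stats

counts = np.arange(int(np.max(y)) + 1)
pred_freq = np.array([stats.poisson.pmf(k, mean_predicted).mean() for k in counts])  # averaged over the sample
obs_freq = np.array([(np.asarray(y) == k).mean() for k in counts])                   # observed shares

for k, p, o in zip(counts, pred_freq, obs_freq):
    print(k, round(p, 3), round(o, 3))        # count, predicted share, observed share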

Related code: the first link essentially does what I explained above for sm.Poisson: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/discrete/discrete_model.py#L2691 ; the second is a Vuong test for comparing models, not yet available in statsmodels: https://gist.github.com/jseabold/6617976#file-zip_model-py-L5

Caveat: the above predictions ignore parameter uncertainty, i.e. they don't take into account that the parameters for the mean of the Poisson distribution are estimated. We can use simulation to get the distribution of the predicted Poisson probabilities, but I didn't manage to come up with an API for it: it has to be simulated for each x in the prediction, and it produces a large collection of numbers, i.e. a distribution over distributions for each explanatory set x.
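
One way to do this simulation (a sketch, not an existing statsmodels API), drawing coefficients from their estimated asymptotic normal distribution, where results is the fit from above and x_new is a placeholder for one row of explanatory variables in the same form as x:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(results.params, results.cov_params(), size=1000)

mean_draws = np.exp(draws @ x_new)                 # distribution of the predicted mean for x_new
prob_draws = stats.poisson.sf(2 - 1, mean_draws)   # distribution of Prob(y >= 2 | x_new)
print(np.percentile(prob_draws, [2.5, 50, 97.5]))  # spread of the predicted probability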

technical aside:

statsmodels has three versions of Poisson that all produce the same results but have different extras: sm.Poisson (from discrete_model), GLM with family Poisson, and GEE with family Poisson and independence or singleton clusters as in your case. I used Poisson above because it is easier to type, i.e. no family or extras to include.
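
For illustration, a sketch of the three equivalent fits (y and X are placeholders; for GEE each observation is its own group, i.e. singleton clusters, as in your output):

import numpy as np
import statsmodels.api as sm

res_poisson = sm.Poisson(y, X).fit()
res_glm = sm.GLM(y, X, family=sm.families.Poisson()).fit()
res_gee = sm.GEE(y, X, groups=np.arange(len(y)),
                 family=sm.families.Poisson(),
                 cov_struct=sm.cov_struct.Independence()).fit()

print(res_poisson.params, res_glm.params, res_gee.params)   # same point estimates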