Solved – How to test how “good” the Poisson Model is

generalized linear model, poisson distribution, scikit-learn, statsmodels

Background: coding in Python, using pandas, statsmodels and scikit-learn (statsmodels supports Poisson regression, scikit-learn can easily split the sample into training and testing sets). Very little experience in statistics.

Let's assume I have a hypothesis: for every soccer game, several figures (predictors) can be used to estimate the total goals scored (response) in that game. By total goals, I mean the number of goals scored by both teams.

Notes:

  • The predictor values are calculated using only information available before the game. So, for example, if x1 is the home team's average goals, only past game results affect it (obvious, I know).

  • Some of the predictor values are not whole numbers. As I understand, this shouldn't affect the model.

So I've split my sample to training and testing sets.

Here's the output I get after I fit the model:

                               GEE Regression Results                              
===================================================================================
Dep. Variable:                 total_goals   No. Observations:                 2458
Model:                                 GEE   No. clusters:                     2458
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   1
Family:                            Poisson   Mean cluster size:                 1.0
Dependence structure:         Independence   Num. iterations:                     7
Date:                     Mon, 16 Jan 2017   Scale:                           0.982
Covariance type:                    robust   Time:                         14:39:12
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.2474      0.097      2.552      0.011         0.057     0.437
var1                 0.1471      0.027      5.453      0.000         0.094     0.200
var2                 0.1115      0.027      4.068      0.000         0.058     0.165
var3                 0.2351      0.114      2.069      0.039         0.012     0.458
==============================================================================
Skew:                          0.5144   Kurtosis:                       0.0916
Centered skew:                 0.0000   Centered kurtosis:             -3.0000
==============================================================================

I'm not sure how to interpret all of the column values, but I know that (in this case) if P>|z| is smaller than 0.05, then the variable is considered statistically significant.

What I want to ask is whether there is a better way (than the following) to test whether my model is good at predicting whether more than two goals are scored in a match. Here's what I'm currently doing (a rough code sketch follows the list):

  1. For each game in the test set, predict total_goals.
  2. If the prediction for total_goals is above a certain threshold (cutoff), count it as a counted prediction.
  3. For each counted prediction, compare against the actual result of the game. If the actual result is above 2 goals, mark it as correct; if not, incorrect.
  4. Output the hit rate %.
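
Roughly, in code (model, X_test, y_test and the cutoff value are placeholders standing in for my actual fitted Poisson model and test split, not my real variable names):

# Sketch of steps 1-4; model, X_test, y_test and cutoff are hypothetical placeholders.
import numpy as np

cutoff = 2.5                                   # hypothetical threshold (step 2)
preds = np.asarray(model.predict(X_test))      # step 1: predicted total_goals
actual = np.asarray(y_test)

counted = preds > cutoff                       # step 2: predictions I act on
correct = actual[counted] > 2                  # step 3: did the game really have >2 goals?
hit_rate = correct.mean() if counted.any() else np.nan   # step 4: hit rate
print(counted.sum(), hit_rate)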

The issue that I don't really know how to address is this: scikit-learn (rightfully) randomizes the games in the sample when it splits them into test and train subsets. Thus, I obtain a different result (a different hit rate %) each time. Some are satisfactory, some aren't.

How can I estimate confidence in my model, then? I thought about running the whole procedure 1,000 times and taking the average hit rate %, but that doesn't sound like the right way to do it.
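
What I had in mind, sketched (df, one row per game, and predictors, a list of column names, are placeholders):

# Sketch of the "repeat many times" idea: re-split, re-fit, and collect the hit rates.
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

cutoff = 2.5          # hypothetical threshold from step 2 above
rates = []
for seed in range(1000):
    train, test = train_test_split(df, test_size=0.3, random_state=seed)
    res = sm.Poisson(train['total_goals'], sm.add_constant(train[predictors])).fit(disp=0)
    preds = np.asarray(res.predict(sm.add_constant(test[predictors])))
    actual = test['total_goals'].to_numpy()
    counted = preds > cutoff                          # predictions above the cutoff
    if counted.any():
        rates.append((actual[counted] > 2).mean())    # hit rate for this split

print(np.mean(rates), np.std(rates))   # average hit rate and its spread across splits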

Any help or guidance is much appreciated!

Update

User psarka suggested I use scikit-learn's cross-validation. The issue here is that I'm not using scikit-learn to fit my model. Moreover, when I try to run:

from sklearn import svm
from sklearn.model_selection import cross_val_score

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, train, test, cv=5)

where train is my training set (0.7 of the sample) and test is my testing set (0.3 of the sample), it throws an error saying that the data are of different lengths. I assume this is because, in the scikit-learn example that I'm trying to adapt to my code:

scores = cross_val_score(clf, iris.data, iris.target, cv=5)

iris.data and iris.target are of the same length: the former contains the predictors, while the latter contains the correct classifications. But I'm not building a classifier. How can I adjust this to fit my code, or should I?
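
For reference, the call pattern that example uses would look roughly like this with my data (df, the column names and the estimator below are made-up placeholders, not my actual model):

# cross_val_score(estimator, X, y): features and response of the same length,
# with the function doing the splitting itself; everything here is a placeholder.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = df[['var1', 'var2', 'var3']]    # predictors, one row per game
y = df['total_goals']               # response, same length as X

scores = cross_val_score(LinearRegression(), X, y, cv=5)   # placeholder estimator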

Best Answer

Some general answers:

Using the Poisson distribution to estimate the mean parameter requires relatively weak assumptions, essentially only that our model for the mean (expected value) of the response variable y given the explanatory variables x is correctly specified. With the default log link this is

$$E(y \mid x) = \exp(x b)$$

We can use this to predict the mean for a new observation given a new set of explanatory variables. We can also get standard errors and confidence intervals for the prediction of the mean.
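
For example, assuming a GLM Poisson fit and a recent statsmodels version (y, X and X_new are placeholders for the response, the design matrix and new explanatory variables), a sketch of getting the predicted mean with a confidence interval:

import statsmodels.api as sm

res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
pred = res.get_prediction(X_new)          # prediction of the mean E(y | x) = exp(x b)
print(pred.summary_frame(alpha=0.05))     # predicted mean, standard error, 95% interval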

However, when we want the entire predictive distribution, we need the assumption that we have a correctly specified distribution for the response. statsmodels does not have much built-in support for goodness-of-fit tests for distributions outside the normal case, especially not for discrete distributions like Bernoulli, Binomial or Poisson.

However, it is easy to get the predictive distribution using the scipy.stats distributions.

For example, a sequence of calls to get the results for Poisson could be:

from scipy import stats
import statsmodels.api as sm
results = sm.Poisson(y, x).fit()
mean_predicted = results.predict()   # or use a new x
prob_2more = stats.poisson.sf(2 - 1, mean_predicted).mean() # average over x in sample
freq_2more = (y > 2 - 1).mean()  # sample average

Or, similarly, the probability of observing exactly y = 2 given the x (i.e. the predicted mean) for each observation:

prob_2_obs = stats.poisson.pmf(2, mean_predicted)

Note: the 2 - 1 in sf is needed because sf(k) = Prob(y > k) is a strict inequality, so the weak inequality Prob(y >= 2) is obtained as sf(1).

The pmf could be used to create an analog of a classification table comparing predicted versus empirical counts.
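
For instance, a minimal sketch of such a comparison, continuing with mean_predicted and y from the snippet above:

# Model-implied vs. empirical frequencies for each count of total goals.
import numpy as np
from scipy import stats

counts = np.arange(int(np.max(y)) + 1)
pred_freq = np.array([stats.poisson.pmf(k, mean_predicted).mean() for k in counts])  # averaged over the sample
obs_freq = np.array([(np.asarray(y) == k).mean() for k in counts])                   # observed shares

for k, p, o in zip(counts, pred_freq, obs_freq):
    print(k, round(p, 3), round(o, 3))        # count, predicted share, observed share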

Related code: the first link essentially does what I explained above for sm.Poisson: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/discrete/discrete_model.py#L2691 ; the second is a Vuong test for comparing models, not yet available in statsmodels: https://gist.github.com/jseabold/6617976#file-zip_model-py-L5

Caveat: the above predictions ignore parameter uncertainty, i.e. they don't take into account that the parameters for the mean of the Poisson distribution are estimated. We can use simulation to get the distribution of the predicted Poisson probabilities, but I didn't manage to come up with an API for it: it has to be simulated for each x in the prediction, and it produces a large collection of numbers, i.e. a distribution over distributions for each explanatory set x.
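
One way to do this simulation (a sketch, not an existing statsmodels API), drawing coefficients from their estimated asymptotic normal distribution, where results is the fit from above and x_new is a placeholder for one row of explanatory variables in the same form as x:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(results.params, results.cov_params(), size=1000)

mean_draws = np.exp(draws @ x_new)                 # distribution of the predicted mean for x_new
prob_draws = stats.poisson.sf(2 - 1, mean_draws)   # distribution of Prob(y >= 2 | x_new)
print(np.percentile(prob_draws, [2.5, 50, 97.5]))  # spread of the predicted probability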

technical aside:

statsmodels has three versions of Poisson that all produce the same results but have different extras: sm.Poisson (from discrete_model), GLM with family Poisson, and GEE with family Poisson and independence or singleton clusters as in your case. I used Poisson above because it is easier to type, i.e. no family or extras to include.
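
For illustration, a sketch of the three equivalent fits (y and X are placeholders; for GEE each observation is its own group, i.e. singleton clusters, as in your output):

import numpy as np
import statsmodels.api as sm

res_poisson = sm.Poisson(y, X).fit()
res_glm = sm.GLM(y, X, family=sm.families.Poisson()).fit()
res_gee = sm.GEE(y, X, groups=np.arange(len(y)),
                 family=sm.families.Poisson(),
                 cov_struct=sm.cov_struct.Independence()).fit()

print(res_poisson.params, res_glm.params, res_gee.params)   # same point estimates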