Predicting with a GLM

binomial-distribution, gamma-distribution, generalized-linear-model, poisson-distribution, predictive-models

I wanted to check my understanding of predicting with a GLM:

A binomial/logistic regression model predicts the binomial success probability $p = P(\text{success})$. To convert that probability into a class, we have to choose a threshold or cutoff.

The same idea applies for a multinomial logistic regression model.

A Poisson regression model predicts the Poisson rate parameter. To convert the rate into counts, do I use a threshold again?

A gamma model predicts the scale and rate parameters. I do not need a threshold because the response is continuous.
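For concreteness, the thresholding step I mean could be sketched like this (the `classify` name and the 0.5 cutoff are just illustrative choices, not anything prescribed):

```python
def classify(p_hat, cutoff=0.5):
    """Convert a predicted success probability into a 0/1 class label."""
    return 1 if p_hat >= cutoff else 0

# illustrative probabilities from a hypothetical logistic model
probs = [0.12, 0.51, 0.97]
labels = [classify(p) for p in probs]
print(labels)  # [0, 1, 1]
```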

Best Answer

It matters what you mean by prediction. Unfortunately, this term can be somewhat ambiguous, especially since the linear combination of covariates in the regression model is often referred to as a linear predictor.

The typical purpose of a generalized linear model is to estimate the population mean and to perform inference on the mean. This would be the proportion in a Bernoulli model and the mean in a Poisson or gamma model.

The word prediction is best reserved for when interest surrounds a future sampled observation. Of course, our best point prediction of a future observation is the estimated mean of the population. For a gamma model one would report the estimated mean as the point prediction for a future observation. For a Bernoulli model one would report whichever of the values 0 or 1 has the larger estimated probability, since an individual observation can only take on these discrete values. For a Poisson model one could report the mean rounded to the nearest integer, since the support of the Poisson distribution is the non-negative integers. One could also apply the floor or ceiling function to the mean to produce a point prediction.
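A minimal sketch of these family-specific point predictions (the function name and the 0.5 comparison for the Bernoulli case are my own; only the argmax/rounding logic comes from the discussion above):

```python
def point_prediction(mean_hat, family):
    """Map an estimated population mean to a point prediction
    for a single future observation, respecting the support."""
    if family == "bernoulli":
        # report whichever of {0, 1} has the larger estimated probability
        return 1 if mean_hat >= 0.5 else 0
    if family == "poisson":
        # support is the non-negative integers: round the estimated mean
        return round(mean_hat)
    if family == "gamma":
        # continuous positive support: report the estimated mean itself
        return mean_hat
    raise ValueError(f"unknown family: {family}")

print(point_prediction(0.7, "bernoulli"))  # 1
print(point_prediction(3.4, "poisson"))    # 3
print(point_prediction(12.8, "gamma"))     # 12.8
```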

One might also be interested in presenting the estimated percentiles of the population. It is important that these be presented with tolerance intervals (confidence intervals for population percentiles). Alternatively, one might be interested in quantifying the uncertainty regarding the point prediction for a single future observation. This would require a prediction interval, which is not the same as the estimated percentiles. Here is a related thread that discusses prediction intervals.
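As a rough illustration for a Poisson model, one could simulate futures from the fitted distribution and take central quantiles; this is only a plug-in sketch, and because it ignores the estimation uncertainty in the fitted rate, its real coverage will fall somewhat below the nominal level (a point the addendum's validation exercise would reveal):

```python
import math
import random

def poisson_prediction_interval(lam_hat, level=0.95, draws=20_000, seed=1):
    """Naive plug-in prediction interval for a single future Poisson count:
    simulate futures from Poisson(lam_hat) and take central quantiles.
    Ignores the sampling uncertainty in lam_hat."""
    rng = random.Random(seed)

    def draw_poisson():
        # draw one Poisson(lam_hat) variate by inversion of the CDF
        # (fine for moderate lam_hat)
        u = rng.random()
        k, p = 0, math.exp(-lam_hat)
        cdf = p
        while cdf < u:
            k += 1
            p *= lam_hat / k
            cdf += p
        return k

    sample = sorted(draw_poisson() for _ in range(draws))
    lo = sample[int(draws * (1 - level) / 2)]
    hi = sample[int(draws * (1 + level) / 2)]
    return lo, hi
```

For example, `poisson_prediction_interval(4.0)` brackets a single future count, not the mean.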

Addendum: Splitting the data into training and test sets is for the purpose of validating the out-of-sample prediction ability of a model. My preferred approach is not to split the data into training and test sets. Rather, I suggest bootstrapping (sampling with replacement) $n$ observations from the data set as if it were the population, fitting the model, and constructing a point prediction or interval prediction for a particular prediction target (a single future $y$ [$m=1$ observation] or a future $\bar{y}$ based on $m$ observations). Then bootstrap a sample of size $m$ and tally (i) the discrepancy between the point prediction and the target, and (ii) whether the prediction interval covered the target. Repeat this 10,000 times, plot the histogram of point prediction errors, and calculate the coverage rate for the prediction intervals. This validates the performance of the model based on its operating characteristics.
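The loop above can be sketched for the simplest case: an intercept-only Poisson model with an $m=1$ target, where the rate MLE is just the resample mean. The function names and the plug-in interval are my own illustrative choices; only the resample/refit/tally structure comes from the addendum:

```python
import math
import random

def poisson_ppf(q, lam):
    """Smallest k with P(X <= k) >= q for X ~ Poisson(lam)."""
    k, p = 0, math.exp(-lam)
    cdf = p
    while cdf < q:
        k += 1
        p *= lam / k
        cdf += p
    return k

def validate(data, reps=10_000, level=0.95, seed=2):
    """Bootstrap validation of point/interval prediction for m = 1
    future observation, treating `data` as the population.
    Intercept-only Poisson model: the rate MLE is the resample mean;
    the interval is a naive plug-in Poisson interval."""
    rng = random.Random(seed)
    n = len(data)
    errors, covered = [], 0
    for _ in range(reps):
        boot = [rng.choice(data) for _ in range(n)]  # resample and "refit"
        lam_hat = sum(boot) / n
        point = round(lam_hat)
        lo = poisson_ppf((1 - level) / 2, lam_hat)
        hi = poisson_ppf((1 + level) / 2, lam_hat)
        target = rng.choice(data)                    # a future y (m = 1)
        errors.append(point - target)                # (i) prediction error
        covered += lo <= target <= hi                # (ii) interval coverage
    bias = sum(errors) / reps
    return bias, covered / reps
```

Plotting the histogram of `errors` and comparing the returned coverage rate to the nominal level then gives the operating characteristics described above.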

Sampling with replacement from your data set treats it as a much larger population. It is likely that the percentiles of your data set do not match the theoretical percentiles of the GLM you posit. This means there is slight model misspecification, so don't be surprised if the prediction intervals do not cover at the nominal level and if the histogram of prediction errors shows small bias (i.e., is not centered at zero). You can also perform this type of validation through simulation by randomly generating observations from the theoretical model that matches your GLM, e.g. gamma or Poisson. Here you should find that the prediction intervals perform close to the nominal level and your point prediction is asymptotically unbiased for the target.

This type of approach can also be used to validate point and interval estimation of a population parameter.