Solved – Gamma GLM – Derive prediction intervals for new x_i

gamma distributiongeneralized linear modelprediction intervalstatsmodels

In a Gamma GLM, the statistical model for each observation 𝑖 is assumed to be $Y_i \sim Gamma(shape, scale)$, where $E(Y_i) = \mu_i = f(X_i\beta)$, and $f$ is the link function.

I've used MLE to estimate $\hat{\beta}$ and $\hat{scale}$, and wish to produce a 90% prediction interval on a new point $Y'$ given $X'$.

I can produce the confidence intervals on $E(Y|X') = \mu'$ by using link function $f$ on the normally distributed confidence intervals for $X\hat{\beta}$. Let's say $\hat{\mu'} = 10$ and 90% confidence intervals are [5, 30].

However, we want the intervals from the distribution of $Y'$, not $\mu'$. Intuitively, these intervals should be much wider than the confidence intervals for $\mu'$ I think they should also be wider than the 5th and 95th percentile of a single Gamma distribution with $\mu=\hat{\mu'}$, since the uncertainty around $\hat{\mu'}$ should translate into increased uncertainty around the final distribution, sort of like an vague prior on a bayesian posterior distribution.

What is the correct way to model prediction intervals on the new point $Y'$?

The below schema shows how uncertainty on $\mu'$ translates into many possible gamma distributions and a wide prediction interval for $Y'$

schema of Y'

References:

https://www.rocscience.com/help/swedge/swedge/Gamma_Distribution.htm

https://www.statsmodels.org/stable/glm.html

Best Answer

The prediction interval for a new observation depends on both the assumed inherent randomness in this case given by the gamma distribution, and the uncertainty coming from the parameters that are estimated and not assumed to be known.

In general there is no analytical or closed form expression for the combination of the two effects. The two main options are to ignore parameter uncertainty and to use simulation methods.

Ignoring parameter uncertainty: If we ignore that the parameters are estimated with some sampling uncertainty, the the distribution of a new observation is just given by the assumed gamma distribution. We can use the mean and scale estimates to compute the relevant prediction intervals using e.g. the distribution methods in scipy.stats. The parameterization might have to be transformed from the GLM parameterization to the scipy.stats parameterization.

Simulation Methods: One possibility is to use full bootstrap on the original sample to simulate new observations. The simpler method is to assume that the asymptotic normal distribution for the parameter estimates are appropriate and simulate the parameters of the gamma distribution from the mean and covariance of the parameter estimates. For each sampled parameter we can sample a new observation and compute a confidence interval from the simulated observations. One problem with this is that GLM only estimates the mean parameters directly, the scale is estimated using deviance or pearson chi2. That is, GLM in statsmodels in other packages does not provide a joint covariance for mean and scale parameter.

Because of these problems, statsmodels currently provides prediction intervals for new observations that take parameter uncertainty into account only for the linear normal case, i.e. OLS.

Related Question