This is a fairly straightforward problem. Although there is a connection between the Poisson and negative binomial distributions, I think it is unhelpful for your specific question, as it encourages people to think in terms of negative binomial processes. Basically, you have a series of Poisson processes:
$$Y_i(t_i)|\lambda_i\sim Poisson(\lambda_i t_i)$$
where $Y_i$ is the process, $t_i$ is the time you observe it, and $i$ indexes the individuals. You are saying that these processes are "similar" by tying the rates together with a distribution:
$$\lambda_i\sim Gamma(\alpha,\beta)$$
Integrating/mixing over $\lambda_i$, you have:
$$Y_i(t_i)|\alpha,\beta\sim NegBin(\alpha,p_i)\quad\text{where}\quad p_i=\frac{t_i}{t_i+\beta}$$
This has a pmf of:
$$Pr(Y_i(t_i)=y_i|\alpha,\beta) = \frac{\Gamma(\alpha+y_i)}{\Gamma(\alpha)y_i!}p_i^{y_i}(1-p_i)^\alpha$$
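A quick simulation checks this mixture result empirically (the values of $\alpha$, $\beta$, and $t_i$ below are illustrative choices, not from the question):

```r
# Gamma-Poisson mixture vs. negative binomial: draw many rates from the
# gamma, generate Poisson counts, and compare to the NegBin pmf above.
set.seed(1)
alpha <- 3; beta <- 2; t <- 5                    # illustrative values
lambda <- rgamma(1e5, shape = alpha, rate = beta)
y      <- rpois(1e5, lambda * t)
p      <- t / (t + beta)
# R's dnbinom(y, size, prob) = C(y+size-1, y) prob^size (1-prob)^y,
# so its 'prob' corresponds to (1 - p_i) in the pmf above.
empirical   <- mean(y == 0)
theoretical <- dnbinom(0, size = alpha, prob = 1 - p)
round(c(empirical, theoretical), 4)
```

The two zero-probabilities agree to Monte Carlo accuracy, and the same holds for the other counts.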
To get the waiting time distribution we note that:
$$Pr(T_i\leq t_i|\alpha,\beta)=1-Pr(T_i> t_i|\alpha,\beta)=1-Pr(Y_i(t_i)=0|\alpha,\beta)$$
$$=1-(1-p_i)^\alpha=1-\left(1+\frac{t_i}{\beta}\right)^{-\alpha}$$
Differentiate this and you have the PDF:
$$p_{T_i}(t_i|\alpha,\beta)=\frac{\alpha}{\beta}\left(1+\frac{t_i}{\beta}\right)^{-(\alpha+1)}$$
This is a Pareto type II (Lomax) distribution, a member of the generalized Pareto family. I would use this as your waiting time distribution.
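You can verify this waiting-time distribution by simulation: the first arrival of a Poisson process with rate $\lambda_i$ is exponential with rate $\lambda_i$, so mixing over the gamma should reproduce the CDF derived above (parameter values below are illustrative):

```r
# First-arrival times under the gamma mixture follow the Lomax law
# F(t) = 1 - (1 + t/beta)^(-alpha) derived above.
set.seed(2)
alpha <- 3; beta <- 2                    # illustrative values
lambda <- rgamma(1e5, shape = alpha, rate = beta)
T1 <- rexp(1e5, rate = lambda)           # each process's waiting time
t0 <- 1.5
empirical_cdf   <- mean(T1 <= t0)
theoretical_cdf <- 1 - (1 + t0 / beta)^(-alpha)
round(c(empirical_cdf, theoretical_cdf), 4)
```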
To see the connection with the Poisson distribution, note that $\frac{\alpha}{\beta}=E(\lambda_i|\alpha,\beta)$, so that if we set $\beta=\frac{\alpha}{\lambda}$ and then take the limit $\alpha\to\infty$ we get:
$$\lim_{\alpha\to\infty}\frac{\alpha}{\beta}\left(1+\frac{t_i}{\beta}\right)^{-(\alpha+1)}=\lim_{\alpha\to\infty}\lambda\left(1+\frac{\lambda t_i}{\alpha}\right)^{-(\alpha+1)}=\lambda\exp(-\lambda t_i)$$
This means that you can interpret $\frac{1}{\alpha}$ as an over-dispersion parameter.
The usual gamma GLM contains the assumption that the shape parameter is constant, in the same way that the normal linear model assumes constant variance.
In GLM parlance, the dispersion parameter $\phi$ in $\text{Var}(Y_i)=\phi\,\text{V}(\mu_i)$ is normally constant. More generally you have $a(\phi)$, but that doesn't help here.
It might perhaps be possible to use a weighted Gamma GLM to incorporate this effect of a specified shape parameter, but I haven't investigated this possibility yet (if it works it is probably the easiest way to do it, but I am not at all sure that it will).
If you had a double GLM you could estimate that parameter as a function of covariates... and if the double GLM software let you specify an offset in the variance term, you could do this. It looks like the function dglm in the package dglm lets you specify an offset. I don't know if it will let you specify a variance model like (say) ~ offset(<something>) + 0, though.
Another alternative would be to maximize the likelihood directly.
> y <- rgamma(100,10,.1)
> summary(glm(y~1,family=Gamma))
Call:
glm(formula = y ~ 1, family = Gamma)
Deviance Residuals: 
     Min       1Q   Median       3Q      Max  
-0.93768 -0.25371 -0.05188  0.16078  0.81347  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0103660  0.0003486   29.74   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 0.1130783)
Null deviance: 11.223 on 99 degrees of freedom
Residual deviance: 11.223 on 99 degrees of freedom
AIC: 973.56
Number of Fisher Scoring iterations: 5
The line where it says:
(Dispersion parameter for Gamma family taken to be 0.1130783)
is the one you want.
That $\hat\phi$ is the reciprocal of the estimated shape parameter of the gamma: here $1/0.1130783\approx 8.8$, close to the true shape of $10$ used to simulate the data.
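Maximizing the likelihood directly, as suggested above, might look like this (the simulated data and log-scale parameterization are illustrative choices):

```r
# Direct maximum likelihood for a gamma sample, as an alternative to
# glm's moment-based dispersion estimate.
set.seed(3)
y <- rgamma(100, shape = 10, rate = 0.1)
m <- mean(y); v <- var(y)
start <- log(c(m^2 / v, m / v))          # method-of-moments starting values
negll <- function(par)
  -sum(dgamma(y, shape = exp(par[1]), rate = exp(par[2]), log = TRUE))
fit <- optim(start, negll)               # log scale keeps parameters positive
shape_hat <- exp(fit$par[1])
rate_hat  <- exp(fit$par[2])
c(shape = shape_hat, rate = rate_hat)
```

This gives the shape parameter (and hence the dispersion $1/\hat\nu$) directly, and generalizes easily if you want to model it as a function of covariates.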
Best Answer
It matters what you mean by prediction. Unfortunately, this term can be somewhat ambiguous, especially since the linear combination of covariates in the regression model is often referred to as a linear predictor.
The typical purpose of a generalized linear model is to estimate the population mean and to perform inference on the mean. This would be the proportion in a Bernoulli model and the mean in a Poisson or gamma model.
The word prediction is best reserved for when interest surrounds a future sampled observation. Of course our best point prediction of a future observation is the estimated mean of the population. For a gamma model one would report the estimated mean as the point prediction for a future observation. For a Bernoulli model one would report the value 0 or 1 that has the largest estimated proportion, since an individual observation can only take on these discrete values. For a Poisson model one could report the mean rounded to the nearest integer, since the support of the Poisson distribution is the non-negative integers. One could also use the floor or ceiling function on the mean to produce a point prediction.
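As a concrete illustration (intercept-only models on simulated data, with parameter values of my choosing):

```r
# Point predictions from fitted GLMs: the estimated mean for a gamma
# model, and the rounded estimated mean for a Poisson model.
set.seed(4)
y_gamma <- rgamma(200, shape = 5, rate = 0.5)
y_pois  <- rpois(200, lambda = 3.7)
fit_g <- glm(y_gamma ~ 1, family = Gamma(link = "log"))
fit_p <- glm(y_pois  ~ 1, family = poisson)
pred_gamma <- fitted(fit_g)[1]           # estimated mean
pred_pois  <- round(fitted(fit_p)[1])    # nearest integer for count data
c(pred_gamma, pred_pois)
```

For intercept-only models the estimated mean coincides with the sample mean; with covariates, you would use predict(..., type = "response") at the covariate values of interest.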
One might also be interested in presenting the estimated percentiles of the population. It is important that these be presented with tolerance intervals (confidence intervals for population percentiles). Alternatively one might be interested in quantifying the uncertainty regarding the point prediction for a single future observation. This would require the use of a prediction interval which is not the estimated percentiles. Here is a related thread that discusses prediction intervals.
Addendum: Splitting the data into training and test sets is for the purpose of validating the out-of-sample prediction ability of a model. My preferred approach is not to split the data into training and test sets. Rather, I suggest bootstrapping (sampling with replacement) $n$ observations from the data set as if it were the population, fitting the model, and constructing a point prediction or interval prediction for a particular prediction target (a single future $y$ [$m=1$ observation] or a future $\bar{y}$ based on $m$ observations). Then bootstrap a sample of size $m$ and tally i) the discrepancy between the point prediction and the target, and ii) whether the prediction interval covered the target. Repeat this 10,000 times, plot the histogram of point prediction errors, and calculate the coverage rate of the prediction intervals. This validates the performance of the model based on operating characteristics.
Sampling with replacement from your data set treats it as a much larger population. It is likely the percentiles of your data set do not match the theoretical percentiles of the glm model you posit. This means there is slight model misspecification, so don't be surprised if the prediction intervals do not cover at the nominal level and if the histogram of prediction errors shows small bias (not centered at zero). You can also perform this type of validation through simulation by randomly generating observations from the theoretical model that matches your glm, e.g. gamma or Poisson. Here you should find the prediction intervals perform close to the nominal level and your point prediction is asymptotically unbiased for the target.
This type of approach can also be used to validate point and interval estimation of a population parameter.
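A sketch of the bootstrap validation described above, for a single future observation ($m=1$) from an intercept-only gamma model; the sample size, replicate count, and parametric interval construction are all illustrative choices:

```r
# Bootstrap validation of a gamma-model prediction interval: resample
# the data as the "population", fit, predict, and tally coverage.
set.seed(5)
y <- rgamma(100, shape = 10, rate = 0.1)   # stands in for the real data
n <- length(y); B <- 1000
covered <- logical(B); err <- numeric(B)
for (b in 1:B) {
  boot  <- sample(y, n, replace = TRUE)    # resampled "population" draw
  fit   <- glm(boot ~ 1, family = Gamma(link = "log"))
  mu    <- fitted(fit)[1]
  shape <- 1 / summary(fit)$dispersion     # moment-based shape estimate
  pi_   <- qgamma(c(0.025, 0.975), shape = shape, rate = shape / mu)
  target <- sample(y, 1, replace = TRUE)   # future observation (m = 1)
  covered[b] <- pi_[1] <= target && target <= pi_[2]
  err[b]     <- mu - target
}
mean(covered)   # coverage rate of the nominal 95% prediction intervals
mean(err)       # average point-prediction error
```

With well-specified data the coverage should sit near 95%; with real data, some shortfall or bias in the error histogram indicates the mild misspecification discussed above.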