Let's see what Dan Ma actually says in his blog. To quote:
There is uncertainty in the parameter $\theta$, reflecting the risk characteristic of the insured. Some insureds are poor risks (with large $\theta$) and some are good risks (with small $\theta$). Thus the parameter $\theta$ should be regarded as a random variable $\Theta$. The following is the conditional distribution of $N$ (conditional on $\Theta=\theta$):
$$\displaystyle (15) \ \ \ \ \ P(N=n \lvert \Theta=\theta)=\frac{e^{-\theta} \ \theta^n}{n!} \ \ \ \ \ \ \ \ \ \ n=0,1,2,\cdots$$
Aside from some small oddness in the wording, the gist of that is fine. The parameter of the Poisson ($\theta$ in the quoted discussion) represents the underlying rate of claims per unit time; that individuals are heterogeneous -- that they have different 'riskiness' (different claim rates) -- isn't controversial.
So why does he think that the distribution of the claim-rate is distributed as gamma?
Well, actually he doesn't say that he thinks that at all.
What he says is:
Suppose that $\Theta$ has a Gamma distribution with scale parameter $\alpha$ and shape parameter $\beta$.
He's positing a circumstance -- discussing an assumption if you wish -- for which he then discusses the consequences.
He doesn't even assert anything about the plausibility of the assumption.
Here are some things that it might be reasonable to assert or suppose about the claim-rate distribution:
1) It's necessarily non-negative, and may be taken to be continuous.
2) We could expect it to tend to be right-skew.
3) We might not-too-unreasonably expect there to be a typical level (a mode) around which the bulk of the distribution lies, and that it tails off as we move further away (i.e. it might be reasonable to expect it to be unimodal, at least to a first approximation).
That's about all we could say without collecting data.
The gamma at least doesn't break any of those suppositions/expectations, and so is likely to result in a more useful distribution than assuming homogeneity of claim-rate, but any number of other distributions satisfy those conditions.
So why a gamma rather than, say, a lognormal? Likely a matter of convenience: the gamma works nicely with the Poisson. (The Poisson itself -- even conditional on the individual's underlying claim rate -- is another assumption that isn't actually true; we can make some argument that the assumptions of a Poisson process for claims may not be too badly wrong, but it's clear they can't be exactly true.)
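To illustrate that convenience: mixing a Poisson over a gamma-distributed rate gives a negative binomial in closed form. A minimal simulation sketch (the parameter values here are illustrative, not from the quoted discussion):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
shape, scale = 2.0, 1.5   # illustrative gamma parameters for the claim rate
n = 200_000

# Draw each insured's claim rate from a gamma, then claims from a Poisson
theta = rng.gamma(shape, scale, size=n)
claims = rng.poisson(theta)

# The marginal distribution of claims is negative binomial with
# r = shape and success probability p = 1 / (1 + scale)
p = 1.0 / (1.0 + scale)
nb = stats.nbinom(shape, p)

# Compare simulated frequencies with the theoretical pmf for small counts
for k in range(5):
    print(k, (claims == k).mean(), nb.pmf(k))
```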
There's no good reason to think it is gamma-distributed.
Indeed, I'll assert here and now that there's no real-world case where the claim rate is actually gamma distributed, in practice there will always be differences between the actual distribution of interest and some simple model for it; but that's true of essentially all our probability models.
They're convenient fictions, which may sometimes be not so badly inaccurate as to have some value.
Is there a way I can determine if my density is gamma distributed?
Nothing will tell you that it is; indeed, you can be quite sure -- even when the gamma looks like an excellent description of the distribution -- that it is at best an approximation. What you can do is use diagnostic displays (perhaps something like a Q-Q plot) to check that the distribution isn't too far from gamma.
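For example, a rough sketch of such a check in Python using scipy (the data here are simulated as a stand-in for whatever rates you actually observe):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.gamma(2.0, 3.0, size=500)   # stand-in for observed claim rates

# Fit a gamma by maximum likelihood, pinning the location at zero
shape, loc, scale = stats.gamma.fit(data, floc=0)

# Q-Q comparison: theoretical gamma quantiles vs the ordered data.
# If the gamma is a reasonable approximation, r should be close to 1.
(osm, osr), (slope, intercept, r) = stats.probplot(
    data, sparams=(shape,), dist=stats.gamma)
print(f"fitted shape={shape:.2f}, scale={scale:.2f}, Q-Q correlation r={r:.3f}")
```

Passing `plot=plt` (with matplotlib) would draw the Q-Q plot instead of just returning the correlation.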
Imagine you're the newly appointed manager of a flower shop. You've got a record of last year's customers – the frequency with which they shop and how long since their last visit. You want to know how much business the listed customers are likely to bring in this year. There are a few things to consider:
[assumption (ii)] Customers have different shopping habits.
Some people like having fresh flowers all the time, while others only buy them on special occasions. It makes more sense to have a distribution for the transaction rate $\lambda$ than to assume that a single $\lambda$ explains everyone’s behaviour.
The distribution needs to have few parameters (you don’t necessarily have a lot of data), to be fairly flexible (you’re presumably not a mind-reading entrepreneurial guru and don’t know all about shopping habits), and to take values in the positive real numbers. The Gamma distribution ticks all of those boxes, and is well-studied and relatively easy to work with. It’s often used as a prior for positive parameters in different settings.
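As a small illustration of that flexibility, here are a few gamma shapes sharing the same mean (the parameter values are arbitrary):

```python
from scipy import stats

# Three gammas with the same mean transaction rate but different spread and
# shape, showing the flexibility of the two-parameter family (values illustrative)
for shape in (0.5, 2.0, 8.0):
    scale = 2.0 / shape                       # keep the mean fixed at 2.0
    dist = stats.gamma(shape, scale=scale)
    mode = 0.0 if shape <= 1 else (shape - 1) * scale
    print(f"shape={shape}: mean={dist.mean():.2f}, "
          f"sd={dist.std():.2f}, mode={mode:.2f}")
```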
[assumption (iii)] You might have already lost some of the customers on the list.
If Andrea has bought flowers about once a month every month in the last year, it’s a fairly safe bet she’ll be returning this year. If Ben used to buy flowers weekly, but he hasn’t been around for months, then maybe he’s found a different flower shop. In making future business plans, you might want to count on Andrea but not on Ben.
Customers won’t tell you when they’ve moved on, which is where the “unobserved lifetime” assumption kicks in for both models. Imagine a third customer, Cary. The Pareto/NBD and BG/NBD models give you two different ways to think about Cary dropping out of the shop for good.
For the Pareto/NBD case, imagine that at any point in time, there is a small chance that Cary might come across a better shop than yours. This constant infinitesimal risk gives you the exponential lifetime – and the longer it’s been since Cary’s last visit, the longer he’s been exposed to other (potentially better) flower shops.
The BG/NBD case is a little more contrived. Every time Cary arrives in your shop, he’s committed to buying some flowers. While browsing, he’ll consider the changes in price, quality and variety since his last visit, and that will ultimately make him decide whether to come back again next time, or look for another shop. So rather than being constantly at risk, Cary has some probability p of just deciding to leave after each purchase.
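The two dropout stories can be contrasted in a small simulation sketch (all parameter values below are illustrative, not taken from either paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Pareto/NBD story: a constant infinitesimal dropout risk at every moment,
# giving an exponential lifetime independent of the purchase process
mu = 0.02                                       # dropout hazard per week
lifetime_pareto = rng.exponential(1.0 / mu, size=100_000)

# BG/NBD story: after each purchase, drop out with probability p_drop,
# giving a geometric number of purchases before quitting
p_drop = 0.05
purchases_until_quit = rng.geometric(p_drop, size=100_000)

print("mean Pareto/NBD lifetime (weeks):", lifetime_pareto.mean())
print("mean purchases before dropout (BG/NBD):", purchases_until_quit.mean())
```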
[assumption (iv)] Not all customers are equally committed to your shop.
Some customers are regulars, and only death – or a sharp price increase – will force them to leave. Others might like to explore, and would happily leave you for the sake of the new hipster flower shop across the street. Rather than a single drop-out rate for all customers, it makes more sense to have a distribution of drop-out rates (or probabilities in the BG/NBD case).
This works very much in the same vein as the shopping habits. We’re after a flexible, well-established distribution with few parameters. In the Pareto/NBD case we use a Gamma, since the rate $\mu$ is in the positive real numbers. In the BG/NBD case we use a Beta, which is the standard prior for parameters in $(0,1)$.
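A minimal sketch of that heterogeneity for the BG/NBD case, with an illustrative Beta prior on the dropout probability:

```python
import numpy as np

rng = np.random.default_rng(4)

# Each customer gets their own dropout probability, drawn from a Beta
# (a and b are chosen for illustration only)
a, b = 1.0, 9.0
p_drop = rng.beta(a, b, size=100_000)

# The support is (0, 1), as required for a probability,
# and the population mean dropout probability is a / (a + b)
print("min:", p_drop.min(), "max:", p_drop.max(), "mean:", p_drop.mean())
```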
I hope this helps. Have a look at the original paper (Schmittlein et al., 1987) if you haven't already -- they go through some of the intuition there.
Best Answer
This is a (super) late answer, but I was myself looking for information on gamma-gamma models for monetary value and came across this. The short answer is yes: negative expected transaction values expose issues with the underlying dataset used to fit the model.
In case it is helpful for you or others with similar questions, I'll try to illustrate why having $q<1$ is concerning. The purpose of these spend models is to understand observed spend per transaction, with the goal of predicting future spend per transaction at the individual level. The use of a gamma distribution was first proposed by Colombo and Jiang (1999), motivated by two shortcomings of a normal model for transaction values: 1) the normal is not bounded below by $0$ for any choice of mean and variance parameters, and 2) it produces symmetric spend distributions, whereas observed data consistently appear to be right-skewed.
Following the paper you refer to, a customer with $x$ transactions values $z_1,\dots,z_x$ is modeled such that $z_i \sim \text{Gamma}(p,\nu),$ and we allow for heterogeneity across customers by also having that $\nu \sim \text{Gamma}(q,\gamma)$. A key observation is that conditional on $p$ and $\nu$, a customer's mean transaction value $\delta$ is $\delta = p/\nu$. Now $\nu$ varies across customers, so you may want to know what the mean transaction value $\delta$ is across all individuals. Denote this random variable $D$. It can be shown that $$E[D|p,q,\gamma] = \frac{p\gamma}{q-1}$$
which says that the mean transaction value across customers is $\frac{p\gamma}{q-1}$. (Showing this is a bit involved; the way to do it is to derive the distribution of $D$, show it is an inverse-gamma with specific parameters, and take the expected value of that.) In any gamma distribution the parameters are strictly positive, so $p>0$ and $\gamma>0$; hence if you have $q<1$, the formula gives a negative expected transaction value across individuals. (Strictly speaking, the inverse-gamma mean does not exist for $q \le 1$; the negative value the formula produces is a symptom of that.)
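A quick Monte Carlo sanity check of that formula, with illustrative parameters satisfying $q>1$:

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, gamma_ = 6.0, 4.0, 15.0   # illustrative parameters with q > 1

# Heterogeneity: each customer's rate nu ~ Gamma(shape=q, rate=gamma_),
# i.e. numpy's gamma with scale = 1 / gamma_
nu = rng.gamma(q, 1.0 / gamma_, size=1_000_000)

# Each customer's mean transaction value is delta = p / nu
delta = p / nu

print("simulated E[D]:", delta.mean())
print("closed form p*gamma/(q-1):", p * gamma_ / (q - 1))
```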
This should give pause: why would the expected transaction value be negative? You could try to rationalise it by imagining that you compensate individuals for each transaction, but that is quite odd, and there are other models for that kind of situation; the fact that your model finds $q<1$ should immediately raise serious concerns for this reason alone.
As a final point, I think it's nice to better understand
$$ \begin{align} \mathbb{E}(M\mid p, q, \gamma, m_x, x) & = \frac{(\gamma + m_xx)p}{px+q-1}\\ & = \bigg(\frac{q-1}{px+q-1}\bigg)\frac{\gamma p}{q-1}+\bigg(\frac{px}{px+q-1}\bigg)m_x\\ \end{align} $$
by noting that it is simply a weighted average of the population mean transaction value $E[D|p,q,\gamma] = \frac{p\gamma}{q-1}$ and a given customer's observed average transaction value $m_x = \frac{1}{x}\sum_{i=1}^x z_i$. The weights have a natural Bayesian interpretation: the population mean acts as a prior, and the weight placed on it goes down as you observe more transactions $x$ for a given individual!
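A small numeric sketch of that weighted-average identity (parameter values are illustrative):

```python
# Check that (gamma + m_x * x) * p / (p*x + q - 1) equals the weighted
# average of the population mean and the customer's observed mean
p, q, gamma_ = 6.0, 4.0, 15.0
pop_mean = p * gamma_ / (q - 1)           # population mean transaction value

def expected_m(m_x, x):
    return (gamma_ + m_x * x) * p / (p * x + q - 1)

m_x = 40.0                                # a customer's observed average spend
for x in (1, 5, 50):
    w = (q - 1) / (p * x + q - 1)         # weight on the population mean
    blended = w * pop_mean + (1 - w) * m_x
    print(x, expected_m(m_x, x), blended)
```

As `x` grows, the estimate moves away from the prior mean (30 here) toward the observed average `m_x`.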