In one of the exercises for my course, we're using a Kaggle medical dataset.
The exercise says:
we want to model the distribution of individual charges and we also really want to be able to capture our uncertainty about that distribution so we can better capture the range of values we might see. Loading the data and performing an initial view:
We may suspect from the above that there is some sort of exponential-like distribution at play here. …The insurance claim charges may possibly be multimodal. The gamma distribution may be applicable and we could test this for the distribution of charges that weren't insurance claims first.
I looked up "Gamma distribution" and found "a continuous, positive-only, unimodal distribution that encodes the time required for «alpha» events to occur in a Poisson process with mean arrival time of «beta»"
There's no time involved here, just unrelated charges, either insured or not.
Why would they choose a gamma distribution?
Best Answer
When you're considering simple parametric models for the conditional distribution of data (i.e. the distribution of each group, or the expected distribution for each combination of predictor variables), and you are dealing with a positive continuous distribution, the two common choices are Gamma and log-Normal. Besides satisfying the specification of the domain of the distribution (real numbers greater than zero), these distributions are computationally convenient and often make mechanistic sense.
There are other reasons one might pick one or the other - for example, the "heaviness" of the tail of the distribution, which might be important in predicting the frequency of extreme events. There are plenty of other positive, continuous distributions (e.g see this list), but they tend to be used in more specialized applications.
Very few of these distributions will capture the multi-modality you see in the marginal distributions above, but multi-modality may be explained by the data being grouped into categories described by observed categorical predictors. If there are no observable predictors that explain the multimodality, one might choose to fit a finite mixture model based on a mixture of a (small, discrete) number of positive continuous distributions.