Solved – Fit distributions with glm

distributionsfittinggeneralized linear modelr

I'm trying to fit different statistical distributions (Gamma, Poisson, normal, inverse Gaussian) to my data with a glm. An example could be like this:

data   <- rexp(500, 3)
model  <- glm(data~1, family=Gamma())
shape  <- 1 / gamma.dispersion(model)
rate   <- shape*model$coef[1]

model2 <- glm(data~1, family=poisson())
lambda <- unique(model2$fitted.values)

So, here are the questions:

  1. If I want to select between all of them the best distribution fit with a glm, can I use the models AIC? Or should I use another method, like MSE? It's because I have really different results from one distribution to another.
  2. I already know that packages fitdistrplus or MASS include specific functions to do this, but it's too slow for my data. If I use those functions to fit the data, mostly of the time I get as a result an exponential distribution. Is there a way to fit it with the glm families?

Update: The data comes from sales orders, but it is always grater than 0, that's why I can use the exponential or gamma distributions. (Only include the normal for the cases where there are so many orders that the negative part under the curve is almost 0).

Also, I need the distribution for simulate from it, a new series.
With the result showed by the AIC i can't make them compete to get the best fitting.
I would like a formal test, but i also need it to be really fast, so I'll settle for an approximation or something exploratory.

Best Answer

Since your data are counts, they can only come in whole numbers. That is, you could get a count of $5$ orders or $6$ orders, but you could never get $5.5$ orders. Thus, your data cannot be distributed as Gamma, normal or inverse Gaussian, as these are continuous distributions. (It is always possible in some case that the fits will be good enough for your purposes, but your goal is to use the distribution to simulate new data, so we can just rule them out.)

For your case, the default first choice would be the Poisson distribution. However, the Poisson is very restrictive, in the sense that the variance must equal the mean. This can happen in the real world, but rarely does in practice and I highly doubt it would hold for sales. A distribution that relaxes this restriction / generalizes the Poisson is the negative binomial. This will allow you to estimate the mean and variance separately.

Thus, your first strategy for fitting these distributions and doing so quickly, is to only fit the negative binomial. The primary function for getting these parameters is ?fitdistr in the MASS package. Subsequently, you can use ?rnbinom for simulation.

If speed and size remain a problem, another possibility is to extract random subsamples of manageable size from your data and fit those. Even if these were a small proportion of the total, the fitted parameters should be very close to the true parameters if the sample is large enough.

Related Question