Solved – R: How to estimate the parameters of a Gamma distribution, knowing the empirical cumulative distribution

fittingr

I have been given an empirical cumulative distribution, and I need to fit a Gamma distribution to it using R.

By trial-and-error (and using Excel's Solver), I know that's possible, but I need to automate this process (I need to update this distribution each week, so automation is needed); I'd like to do it in R, so I can use the results in some other reports (using sweave for example).

Data sample:

           F
0  0.0000000
1  0.4761905
2  0.5665025
3  0.6284307
4  0.6851982
5  0.7404266
6  0.7994206
7  0.7994206
8  0.8595944
9  0.8595944
10 0.8595944

Best estimate (so far): Gamma distribution with the following parameters: alpha = 0.357779 and theta = 12.019887

Best Answer

If the observations only take on integer values, then while a gamma distribution might be a reasonable summary function, you don't really have an underlying gamma distribution.

However, if the integers were just used to bin the data, then you can determine the maximum likelihood estimates of the two gamma distribution parameters in the following manner. I've made the assumption that you have the sample size available and for this example I used $n=235$.

# Available sample cumulative distribution function
  x <- c(0:10,Inf)
  f <- c(0.0000000,0.4761905,0.5665025,0.6284307,0.6851982,0.7404266,0.7994206, 
    0.7994206,0.8595944,0.8595944,0.8595944,1)

# Determine counts in each bin assuming that the total sample size is 235
  counts <- diff(floor(235*f+0.5))
  counts

# Define log likelihood function for the binned data
logL <- function(p) {
  s <- 0
  for (i in 1:(length(counts)-1)) {
      s <- s + counts[i]*log(pgamma(x[i+1],p[1],1/p[2])-pgamma(x[i],p[1],1/p[2]))
  }
  s <- s + counts[length(counts)]*log(1-pgamma(x[length(counts)-1],p[1],1/p[2]))
  s
}

# Find maximum likelihood solution (with starting values from other method)
  solution <- optim(c(0.357779,12.019887), logL, method="L-BFGS-B",
    lower=c(0.001,0.001), upper=c(Inf,Inf), control=list(fnscale=-1))
  solution
#  $par
#[1]  0.3370922 12.3350379
#
#$value
#[1] -412.9522
# 
#$counts
#function gradient 
#     17       17 
#
#$convergence
#[1] 0
#
#$message
#[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

# Plot results
  plot(c(0:10), f[1:11], type="s", xlab="x", ylab="Probability")
  lines(c(0:100)/10, pgamma(c(0:100)/10,solution$par[1],1/solution$par[2]), col="red")

Related Solutions

Beta Distribution – Finding Estimates for Data Using the Cumulative Beta Distribution in R

First remark : your data is nowhere near a distribution, and definitely not the beta function. As I see it, you see your boot.mean as 'density' and your x-axis (the index?) as value. The beta function is limited between 0 and 1, and as the area under the curve of any density function should equal 1, your data doesn't come close. Good point of @whuber: fit a scaled version. Alternatively: Scale to the sum of the data, as @iterator said. Now as the beta function requires scaling twice (both on the X-axis, so the indices and on the Y-axis, being the actual data)

Now you talk about the beta function and you talked about the inverse of the cumulative normal distribution somewhere else. I suppose you mean 'when that distribution is looking in the mirror, it sees what I want to see...' ;-)

So an ad-hoc way of doing this (without any theoretical background, as that background is not the one you need here), is given below. Apart from what everybody else said here, I just want to point out the optim() function, which is doing basically what you're looking for. Regardless whether you fit a scaled and mirrored beta distribution or something that looks close to an inverse normal cumulative distribution for some value of close...

customFit <- function(x, data) {
    d.data <- rev(cumsum(dnorm(1:length(data), x[1], x[2]))) * max(data)
    SS <- sum((d.data - data)^2)
    return(SS)
}

fit.optim <- optim(c(5, 8), customFit, data = boot.mean)

plot(boot.mean)
lines(rev(cumsum(dnorm(1:length(boot.mean), 
         fit.optim$par[1], fit.optim$par[2]))) * max(boot.mean), 
         col = "red")

Note of warning: apart from having a defined function that fits your data, there's little you can do with this result...

Solved – Quantiles of a compound gamma/negative binomial distribution

As a practical answer to the real questions you're addressing, such high quantiles will generally be quite sensitive to issues with model choice (especially such things as whether you model the right censoring and how heavy the tails are in the components).

But in any case - especially when dealing with high quantiles where ordinary simulation becomes impractical - that has great value in its own right; it's an interesting question from both theoretical and practical standpoints.

A couple of other approaches to this problem are (1) using the Fast Fourier Transform and (2) direct numerical integration.

One useful reference on this topic is Luo and Shevchenko (2009)$^{[1]}$.

In it they develop an adaptive direct numerical integration approach that's faster than simulation and competitive with FFT.

The more traditional approach in actuarial work was been (3) Panjer recursion, which can be found in numerous texts. Embrechts and Frei (2009)$^{[2]}$ discuss and compare Panjer recursion and FFT. (Note that both of these techniques involve discretization of the continuous distribution.)

On the other hand, doing a very unsophisticated version of simulation, and with no effort whatever made to be efficient, generating from a compound gamma-negative binomial isn't particularly onerous. This is timing on my kids' little laptop:

system.time(replicate(100000,sum(rgamma(MASS:::rnegbin(1,4,2),5,.1))))
   user  system elapsed 
   2.82    0.00    2.84

I think 2.8 seconds to generate 100$^{\,}$000 simulations of a compound distribution on a slow little laptop really isn't bad. With some effort to be efficient (of which one might suggest many possibilities), I imagine that could be made a good deal faster.

Here's the ecdf for $10^6$ simulations (which took about 29 seconds):

$\hspace{1cm}$ enter image description here

We see the characteristic discrete jump at zero you expect to see with a compound distribution.

[While it should be easy to make simulation a lot faster, all three alternatives mentioned here - if carried out sensibly - should be a lot faster still.]

You should note that the actuar package supports computation with compound distributions, and offers several methods for calculation with them.

See, for example, this vignette which discusses this facility.

[Of possibly some further passing interest, note that there is an R package for the Poisson-lognormal distribution -- poilog; if you need that distribution at some point it may be useful.]

Added in edit:

A potential quick approximation where the gamma shape parameter isn't changing -

In the gamma case, because a convolution of gammas with constant shape parameter is another gamma, you could write down the distribution of $Y|N=n$, and then evaluate the cdf and the density at a large number of grid-values at each $n$, then simply accumulate the sum directly (rather as one would for a KDE). The direct calculation only yields a lower bound to the true quantile, but if the negative binomial is not heavy tailed it should be quite rapid.

References:

[1]: Luo, X. and Shevchenko, P.V. (2009),
"Computing Tails of Compound Distributions Using Direct Numerical Integration,"
Journal of Computational Finance, 13 (2), 73-111.
[arXiv preprint available here]

[2]: Embrechts, P., and Frei, M. (2009),
"Panjer recursion versus FFT for compound distributions,"
Mathematical Methods of Operations Research, 69:3 (July) pp 497-508.
[seems to be a pre-publication version here]

Best Answer

Related Solutions

Beta Distribution – Finding Estimates for Data Using the Cumulative Beta Distribution in R

Solved – Quantiles of a compound gamma/negative binomial distribution

Related Question