Solved – How to find the distribution of data with data fitting

Tags: data mining, distributions, gamma distribution, mathematical-statistics

I need to do data fitting to find the distribution of given data.
I need to find the PDF function of the distribution.
I can use data fitting functions in MATLAB and Python.

It looks like a truncated gamma.
What if the data cannot fit the truncated gamma well?
The QQ-plot (quantile-quantile) shows that truncated gamma is not a good fit.

How can I find the parameters of the truncated gamma, such as alpha (shape) and beta (scale)?
If data fitting does not work here, what other methods can I use?
Any help would be appreciated.

Best Answer

Showing us the distribution may help with concrete suggestions or comments.

"The QQ-plot (quantile-quantile) shows that it is not a good fit for truncated gamma."

How do you generate the expected quantiles for the truncated gamma?
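For reference, one common way to obtain expected quantiles for a truncated distribution is to rescale the probabilities onto the retained part of the untruncated CDF and then invert it. The following is a minimal Python sketch under that assumption; the function names, the truncation bounds `lower` and `upper`, and the plotting positions (i - 0.5)/n are illustrative choices, not something taken from the question.

```python
import numpy as np
from scipy import stats

def truncated_gamma_ppf(p, shape, scale, lower=0.0, upper=np.inf):
    """Quantile function of a gamma(shape, scale) truncated to [lower, upper]."""
    base = stats.gamma(a=shape, scale=scale)
    f_lo, f_hi = base.cdf(lower), base.cdf(upper)
    # Map p in (0, 1) onto the retained slice of the untruncated CDF, then invert.
    return base.ppf(f_lo + np.asarray(p) * (f_hi - f_lo))

def qq_points(data, shape, scale, lower=0.0, upper=np.inf):
    """Return (theoretical, sample) quantile pairs for a QQ plot."""
    sample = np.sort(np.asarray(data, dtype=float))
    n = sample.size
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions (an assumed convention)
    return truncated_gamma_ppf(probs, shape, scale, lower, upper), sample
```

Plotting the two returned arrays against each other gives the QQ plot; if the expected quantiles were instead taken from the untruncated gamma, the plot would look poor even when the truncated model is adequate.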

"How to find the distribution parameters such as alpha (shape), beta (scale) for the truncated gamma?"

If you want to try to fit a truncated gamma, there are certainly techniques for identifying the parameters (and even the truncation point, if it's unknown).

The usual approach for doing this is via maximum likelihood; one can write down the density for the truncated distribution and then estimate the parameters via some iterative optimization scheme. Many packages provide functions which will do this optimization for you. Some even have purpose-built functions for fitting common truncated densities.

(If you have the middle of the distribution it's often reasonably easy to generate good starting estimates of the parameters for such ML optimization.)
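As a rough illustration of the maximum likelihood approach, here is a Python sketch that assumes the truncation points are known. The truncated density is the gamma density divided by the probability mass between the bounds, so the log-likelihood is the sum of the gamma log-densities minus n times the log of that mass. The function names, the Nelder-Mead optimizer, and the method-of-moments starting values (which ignore the truncation) are assumptions made for this example.

```python
import numpy as np
from scipy import optimize, stats

def neg_log_likelihood(log_params, x, lower, upper):
    """Negative log-likelihood of gamma(shape, scale) truncated to [lower, upper]."""
    shape, scale = np.exp(log_params)  # optimize on the log scale to keep both positive
    base = stats.gamma(a=shape, scale=scale)
    mass = base.cdf(upper) - base.cdf(lower)  # probability retained after truncation
    if not np.isfinite(mass) or mass <= 0:
        return np.inf
    return -(np.sum(base.logpdf(x)) - x.size * np.log(mass))

def fit_truncated_gamma(x, lower=0.0, upper=np.inf):
    """Return (shape, scale, optimizer result) for known truncation bounds."""
    x = np.asarray(x, dtype=float)
    # Crude starting values: method-of-moments estimates that ignore the truncation.
    m, v = x.mean(), x.var()
    start = np.log([m * m / v, v / m])
    res = optimize.minimize(neg_log_likelihood, start,
                            args=(x, lower, upper), method="Nelder-Mead")
    shape, scale = np.exp(res.x)
    return shape, scale, res
```

If the truncation point is unknown it would have to be added as a further parameter, which this sketch does not attempt.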

[The R package truncdist has functions for evaluating pdfs, producing QQ plots and so on for truncated distributions, and it works with the gamma. Besides making it easy to generate the plots, its functions can be passed to an optimizer to find ML estimates of the parameters. The package distr also has some useful functions, including the very handy Truncate, which may likewise be useful for supplying functions suitable for optimization.]

"I need to find the probability density function of the distribution."

Generally speaking, you simply won't find some functional form and know "that's what it is". You may find one or two nice reasonably simple distributions that give a reasonable fit, but an infinite number of alternatives will exist. With most real data, what you actually have is lumpy and bumpy and not really any particular simple functional form.

More generally, there are numerous posts about attempting to identify which distribution data might be from (the original answer linked four of them), and their comments may be relevant.

Is there a reason you can't use the empirical distribution of the data itself for whatever you need the distribution for?
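To make the empirical-distribution option concrete, here is a small Python sketch. The data values are made up purely for illustration, and `ecdf` is a hypothetical helper name.

```python
import numpy as np

def ecdf(data):
    """Return a function giving the fraction of observations <= t."""
    sample = np.sort(np.asarray(data, dtype=float))
    def cdf(t):
        return np.searchsorted(sample, t, side="right") / sample.size
    return cdf

data = [2.1, 0.7, 3.4, 1.9, 5.0]   # illustrative values only
F = ecdf(data)
prob_at_most_2 = F(2.0)            # empirical P(X <= 2)
median = np.quantile(data, 0.5)    # empirical quantiles need no model either
```

Probabilities and quantiles computed this way need no parametric form at all, which may be enough depending on what the distribution is needed for.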

In any case, more information is likely to aid in making the advice more specific.

Related Solutions

Solved – Manually fitting a mixture distribution in matlab

You are right that it's not meaningful to evaluate the pdf and think of it as predicted values comparable to the data. You could try something like this:

xx = linspace(1,100);                               % grid of values
ksdensity(x,'support','positive')                   % empirical density plot
line(xx,mypdf(xx,p(1),p(2),p(3),p(4)),'color','r')  % fitted density overlay

Here I used the ksdensity function from the Statistics Toolbox, but you could use hist instead if you normalize the histogram to unit area. The mypdf function is the pdf that you define. Its first input, xx, is a grid of values spanning the data, and p(1),...,p(4) are the parameter estimates that you found.
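The "normalize to unit area" step can be sketched as follows. This is a Python analogue rather than MATLAB, and `density_histogram` is a hypothetical helper name: each bar height is the bin count divided by the total count times the bin width, so the bars integrate to 1 and sit on the same scale as a pdf.

```python
import numpy as np

def density_histogram(x, bins=30):
    """Histogram heights rescaled so that the bars integrate to 1."""
    counts, edges = np.histogram(x, bins=bins)
    widths = np.diff(edges)
    heights = counts / (counts.sum() * widths)  # count / (n * bin width)
    centers = edges[:-1] + widths / 2
    return centers, heights, widths
```

With matplotlib, the result could be drawn with `plt.bar(centers, heights, width=widths)` and the fitted density overlaid with `plt.plot` on the same axes.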

Solved – P.d.f for Gamma posterior with Exponential data

The Gamma distribution $\text{Gamma}(\alpha,\beta)$, with shape $\alpha$ and rate $\beta$, has a mode at $\dfrac{\alpha-1}{\beta}$ and a mean of $\dfrac{\alpha}{\beta}$,

so if $N \gg \alpha$, $\sum_{i=1}^N x_i \gg \beta$ and $\frac{1}{N} \sum_{i=1}^N x_i \approx 10$,

then $\text{Gamma}(\alpha+N,\beta+\sum_{i=1}^N x_i)$ will have a mode and mean near $\dfrac{N}{\sum_{i=1}^N x_i} \approx 0.1$. If your code is not producing this, you probably have a coding error. Greenparker has pointed out a possibility.
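A quick numerical check of this claim, as a Python sketch. The prior values alpha = 2 and beta = 1, the sample size, and the seed are arbitrary assumptions chosen for illustration. Note that SciPy parameterizes the gamma by scale, so the rate has to be inverted.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, beta = 2.0, 1.0                       # assumed prior hyperparameters (shape, rate)
x = rng.exponential(scale=10.0, size=1000)   # exponential data with mean 10, i.e. rate 0.1

post_shape = alpha + x.size                  # alpha + N
post_rate = beta + x.sum()                   # beta + sum of x_i
posterior = stats.gamma(a=post_shape, scale=1.0 / post_rate)  # SciPy uses scale = 1/rate

post_mean = posterior.mean()                 # equals (alpha + N) / (beta + sum of x_i)
post_mode = (post_shape - 1) / post_rate
```

Passing the rate where a scale is expected (or the reverse) is one easy way to end up with a posterior centred near 10 instead of 0.1.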
