Solved – Fitting data to gamma distribution to find score which corresponds to pvalue < 0.05

density function, gamma distribution, modeling, p-value, r

I have a data set of 116,667 rows, structured as:

iD  Signal
chr17.3620  0.5741552
chr1.7341   0.5680284
chr7.3937   0.5479430
chr17.3890  0.5402434
chr12.3200  0.5298978
chr17.7227  0.5298536

Because it is difficult to show results for all the data, and I do not have enough reputation to upload images, I based this post on the first 200 rows.

Summary:

Min. = 0.3693
1st Quartile = 0.3847
Median = 0.4039
Mean = 0.4199
3rd Quartile = 0.4413
Max. = 0.5742

I also tried some exploratory analysis to see which kind of distribution the data follow (i.e. plot(density(t)) and qqnorm(t); qqline(t, col = 2)). Based on these preliminary results, I would say that the data apparently follow a gamma distribution.

Here are the first 200 signal values:

A <- structure(list(V1 = c(0.5741552,0.5680284,0.5479430,0.5402434,0.5298978,0.5298536,0.5282417,
0.5165426,0.5131503,0.5129329,0.5105448,0.5104201,0.5095860,0.5090263,
0.5061467,0.4972821,0.4959428,0.4953381,0.4920510,0.4915160,0.4868505,
0.4843749,0.4825519,0.4823313,0.4809742,0.4788553,0.4775991,0.4770962,
0.4745947,0.4743952,0.4727112,0.4718017,0.4714738,0.4674141,0.4670385,
0.4648104,0.4633502,0.4616054,0.4615068,0.4614247,0.4613338,0.4597812,
0.4551755,0.4535067,0.4528133,0.4508228,0.4494993,0.4494936,0.4442789,
0.4413460,0.4412279,0.4402557,0.4392294,0.4385639,0.4385187,0.4361337,
0.4344499,0.4342413,0.4342331,0.4338879,0.4337806,0.4336820,0.4329372,
0.4325534,0.4323201,0.4312287,0.4292037,0.4281761,0.4279843,0.4279774,
0.4252035,0.4243487,0.4228516,0.4226953,0.4218263,0.4214821,0.4212546,
0.4210894,0.4206089,0.4204235,0.4193896,0.4168915,0.4164699,0.4152126,
0.4127455,0.4126053,0.4113571,0.4105654,0.4099753,0.4088188,0.4085093,
0.4075957,0.4074018,0.4072499,0.4072114,0.4067329,0.4065400,0.4052757,
0.4044982,0.4040699,0.4036509,0.4033471,0.4031712,0.4026698,0.4017872,
0.4011538,0.4011325,0.4011320,0.4008897,0.4006470,0.4003469,0.3996736,
0.3992583,0.3991979,0.3990366,0.3989118,0.3983172,0.3980860,0.3978592,
0.3977522,0.3965371,0.3963045,0.3957640,0.3954328,0.3950159,0.3935825,
0.3934975,0.3932916,0.3931091,0.3929565,0.3922829,0.3919779,0.3919713,
0.3914740,0.3910446,0.3909540,0.3890607,0.3890550,0.3876478,0.3875172,
0.3873815,0.3872299,0.3870533,0.3858995,0.3858361,0.3855984,0.3854444,
0.3852595,0.3849558,0.3847531,0.3844442,0.3842814,0.3831377,0.3822418,
0.3817666,0.3805661,0.3803090,0.3802035,0.3800845,0.3800580,0.3799694,
0.3795814,0.3794039,0.3792874,0.3788970,0.3787295,0.3785160,0.3782523,
0.3782439,0.3779547,0.3778596,0.3777452,0.3770986,0.3767652,0.3767104,
0.3765786,0.3760886,0.3760124,0.3753271,0.3750943,0.3749116,0.3744146,
0.3743998,0.3730250,0.3729932,0.3727007,0.3726170,0.3722539,0.3721743,
0.3721055,0.3720965,0.3718959,0.3714824,0.3711862,0.3709115,0.3708262,
0.3706647,0.3701728,0.3697490,0.3692868)))

t<-as.matrix(A$V1)

My questions are:

1.- How can I fit my data to a gamma distribution?

2.- Is there any way to determine, according to the model fitting, which signal value corresponds to p-value < 0.05?

Thanks for your help!

PS 1: I welcome suggestions for editing, as this is my first post in the community!

PS 2: I have been doing all of this in R.

Best Answer

One simple way to fit a gamma distribution to the data is the method of moments: the gamma distribution with shape $\alpha$ and rate $\beta$ has mean $\frac\alpha\beta$ and variance $\frac\alpha{\beta^2}$. You can plug in the sample mean $\bar x$ and sample variance $s^2$ and solve for the parameters, which gives $\hat\beta = \frac{\bar x}{s^2}$ and $\hat\alpha = \frac{\bar x^2}{s^2}$. Naturally, more advanced methods exist. But with the large number of observations, this should be a good starting point.
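As a minimal sketch of the method of moments described above, the following R code estimates the shape and rate from the sample mean and variance. The data are simulated with `rgamma` purely for illustration (an assumption; the real signal values would take their place), and the variable names are hypothetical.

```r
# Method-of-moments fit for a gamma distribution with shape alpha and rate beta.
# Simulated data stand in for the real signal values (assumption).
set.seed(1)
x <- rgamma(10000, shape = 4, rate = 10)

m <- mean(x)  # sample mean, estimates alpha / beta
v <- var(x)   # sample variance, estimates alpha / beta^2

beta_hat  <- m / v    # rate estimate:  mean / variance
alpha_hat <- m^2 / v  # shape estimate: mean^2 / variance

print(c(shape = alpha_hat, rate = beta_hat))
```

If a maximum-likelihood fit is preferred, `MASS::fitdistr(x, "gamma")` is one option, and the moment estimates can serve as starting values.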
I'm uncertain what your second question means. Are you looking for the score which divides the smallest 95% of your data from the largest 5%?
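If that reading of the question is right, the cut-off can be sketched in R as the 95th percentile of the fitted gamma, with the empirical 95th percentile of the data as a check. The simulated data and variable names are assumptions for illustration only.

```r
# Score separating the lowest 95% from the highest 5%, under a fitted gamma.
# Simulated data stand in for the real signal values (assumption).
set.seed(2)
x <- rgamma(10000, shape = 4, rate = 10)

# Method-of-moments estimates of shape and rate
rate_hat  <- mean(x) / var(x)
shape_hat <- mean(x)^2 / var(x)

# Cut-off from the fitted model: 95th percentile of the gamma
cutoff_model <- qgamma(0.95, shape = shape_hat, rate = rate_hat)

# Cut-off from the data alone: empirical 95th percentile
cutoff_empirical <- quantile(x, 0.95, names = FALSE)

# Upper-tail probability of a given score under the fitted model
p_upper <- pgamma(cutoff_model, shape = shape_hat, rate = rate_hat,
                  lower.tail = FALSE)

print(c(model = cutoff_model, empirical = cutoff_empirical, p = p_upper))
```

Scores above `cutoff_model` have an upper-tail probability below 0.05 under the fitted distribution; a large gap between the two cut-offs would suggest the gamma fits the tail poorly.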

Related Solutions

Solved – How to find the distribution of data with data fitting

Showing us the distribution may help with concrete suggestions or comments.

The QQ-plot (quantile-quantile) shows that it is not a good fit for truncated gamma.

How do you generate the expected quantiles for the truncated gamma?

How to find the distribution parameters such as alpha (shape), beta (scale) for the truncated gamma ?

If you want to try to fit a truncated gamma, there are certainly techniques for identifying the parameters (and even the truncation point, if it's unknown).

The usual approach for doing this is via maximum likelihood; one can write down the density for the truncated distribution and then estimate the parameters via some iterative optimization scheme. Many packages provide functions which will do this optimization for you. Some even have purpose-built functions for fitting common truncated densities.

(If you have the middle of the distribution it's often reasonably easy to generate good starting estimates of the parameters for such ML optimization.)

[The R package truncdist has suitable functions for evaluating pdfs and QQ plots (and so on) for truncated distributions (it works with the gamma). Besides making it easy to generate the plots, this would make it possible to use its functions to supply something for the optimizer functions to find ML estimates of parameters. The package distr has some useful functions, including the very handy Truncate, which may also be useful for supplying functions suitable for optimization.]
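The maximum-likelihood approach described above can be sketched with base R alone (it does not use truncdist or distr). The sketch assumes a gamma that is left-truncated at a known point `a`, so the density is the gamma density divided by the probability of exceeding `a`. The data are simulated, and the function and variable names are hypothetical.

```r
# ML fit of a left-truncated gamma with known truncation point a (assumption).
set.seed(3)
a    <- 0.2
full <- rgamma(20000, shape = 4, rate = 10)
x    <- full[full > a]  # only values above the truncation point are observed

# Negative log-likelihood of the truncated gamma.
# Parameters are optimized on the log scale so they stay positive.
negloglik <- function(par, x, a) {
  shape <- exp(par[1])
  rate  <- exp(par[2])
  # log P(X > a): normalizing constant of the truncated density
  log_tail <- pgamma(a, shape = shape, rate = rate,
                     lower.tail = FALSE, log.p = TRUE)
  -sum(dgamma(x, shape = shape, rate = rate, log = TRUE) - log_tail)
}

# Crude starting values: moment estimates that ignore the truncation
start <- log(c(mean(x)^2 / var(x), mean(x) / var(x)))

fit <- optim(start, negloglik, x = x, a = a)
est <- exp(fit$par)
names(est) <- c("shape", "rate")
print(est)
```

If the truncation point is unknown it would need to be estimated as well, which this sketch does not attempt.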

I need to find the probability density function of the distribution.

Generally speaking, you simply won't find some functional form and know "that's what it is". You may find one or two nice reasonably simple distributions that give a reasonable fit, but an infinite number of alternatives will exist. With most real data, what you actually have is lumpy and bumpy and not really any particular simple functional form.

More generally, there are numerous posts about attempting to identify which distribution data might be from, and their comments may be relevant.

Is there a reason you can't use the empirical distribution of the data itself for whatever you say you need to know the distribution for?

In any case, more information is likely to aid in making the advice more specific.

Solved – R: Which distribution to use with gbm for gamma distributed data

The gamma distribution is available in both the gbm package (only in the GitHub version, https://github.com/gbm-developers/gbm, not in the CRAN version) and the mboost package.

For gbm, simply specify distribution = 'gamma' in the call to the gbm function.

For mboost, use the gamma distribution by specifying family = GammaReg() in the call to mboost, as shown in the toy example below:

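library(mboost)
n.obs  <- 1000   # number of simulated observations
n.iter <- 100    # number of boosting iterations
# Two gamma-distributed predictors
x1     <- rgamma(n.obs, shape = 1, scale = 1)
x2     <- rgamma(n.obs, shape = 2, scale = 1)
# Positive response, as required for a gamma model
y      <- x1 + x2
# Boosted tree model with a gamma loss
model  <- mboost(formula = y ~ x1 + x2, data = data.frame(y, x1, x2),
                 baselearner = "btree", family = GammaReg(),
                 control = boost_control(mstop = n.iter))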
