Solved – Fitting Distribution for data in R

distributions, fitting, r

Finding the distribution of the data is a crucial part of my thesis. I have to do this step in R, even though there are other tools that provide this information quickly.
I searched for how to determine which distribution best fits a given variable, and these instructions guided me a bit.

For instructions: via stackoverflow: how-to-determine-which-distribution-fits-my-data-best

However, I am lost as to how to obtain the distributions of the variables, since I have about 18 of them.

For example:

http://www.filedropper.com/samplest

library(fitdistrplus)   

importeddata <- read.csv(file.choose(), sep=";",na.strings = "", stringsAsFactors=FALSE, header = TRUE)

# Replace decimal commas with dots in every column, then convert to numeric.
for(i in seq_len(ncol(importeddata))){
  importeddata[,i] <- gsub(",", ".", importeddata[ , i], fixed = TRUE)}
xx <- as.matrix(as.data.frame(lapply(importeddata, as.numeric)))

descdist(xx[,1])

I can say that this variable may fit uniform, beta or normal distributions. Let's see.

    fit.norm <- fitdist(xx[,1], "norm")
    fit.norm
         Fitting of the distribution ' norm ' by maximum likelihood 
         Parameters:
              estimate Std. Error
         mean 13.428316  0.3652664
         sd    7.120353  0.2582823

    plot(fit.norm)

However, beta causes an error, because the beta distribution is a family of continuous probability distributions defined on the interval [0, 1], parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution.

   fitdist(xx[,1], "beta")

Error in start.arg.default(data10, distr = distname) :
values must be in [0-1] to fit a beta distribution

  fit.uni <- fitdist(xx[,1], "unif")

       Fitting of the distribution ' unif ' by maximum likelihood 
       Parameters:
        estimate Std. Error
             min     3.12         NA
             max    29.64         NA

   plot(fit.uni)


  fit.uni$aic
  [1] NA

  fit.norm$aic
  [1] 2574.241

I have two questions:

Can I directly say that the xx variable is normally distributed, N(13.42, 7.12)? How can I compare the candidate distributions to decide which fits better?
Is there an alternative way to get this information? The procedure has to be repeated 18 times.

Best Answer

There are important things to say that are much too long for comments but you'll need to answer some questions (which I will post in comments) for a proper answer to be offered.

Note that the distributions in the $(\beta_1,\beta_2)$ plot$^\dagger$ are all actually location-scale families of distributions (you can shift or stretch the distributions without changing the skewness and kurtosis). When dealing with skewness-squared (as is the case for both our plots), along with the scale factor it also includes a term for a sign-flip.
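The invariance described above can be checked numerically. The following is a minimal sketch in base R, using simulated data and two small helper functions (`skew` and `kurt`, names chosen here for illustration) that compute moment-based sample skewness and kurtosis:

```r
# Sample skewness and kurtosis from central moments (base R only).
skew <- function(x) { m <- mean(x); mean((x - m)^3) / mean((x - m)^2)^1.5 }
kurt <- function(x) { m <- mean(x); mean((x - m)^4) / mean((x - m)^2)^2 }

set.seed(1)
x <- rgamma(500, shape = 2)   # simulated right-skewed data (illustrative only)
y <- 10 + 3 * x               # shifted and stretched copy
z <- -x                       # sign-flipped copy

c(skew(x), skew(y), skew(z))        # first two agree; third has the opposite sign
c(skew(x)^2, skew(y)^2, skew(z)^2)  # squared skewness agrees for all three
c(kurt(x), kurt(y), kurt(z))        # kurtosis agrees for all three
```

So a point on the squared-skewness/kurtosis plot stays put under any shift, rescaling or reflection of the data, which is why each location-scale family occupies a fixed point, curve or region there.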

[In reality, in that diagram we're dealing with the Pearson distributions plus the lognormal and logistic; if you're going to show distributions beyond the Pearson family, it's not clear to me why you'd add those but not some others; adding new distributions to such plots is discussed here]

The grey region in your plot (pink in the plot below) is the region for the Pearson type I distribution:

(plot taken from my answer at the link above)

This is a location-scale family which corresponds (with a different parameterization) to a four-parameter beta, not the two-parameter beta you tried to fit:

$$f_Y(y) =\frac{1}{B(\alpha,\beta)} \frac{(y-a)^{\alpha-1} (c-y)^{\beta-1}}{(c-a)^{\alpha+\beta-1}},\: a < y <c$$

This is why your beta fit failed!
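One way to see the connection: if the bounds $a$ and $c$ are treated as known, then $u = (y-a)/(c-a)$ lies in $(0,1)$ and follows an ordinary two-parameter beta. The sketch below assumes exactly that. The bounds 3 and 30 are hypothetical values picked to sit just outside the observed range reported in the question, and the data are simulated because the original file is not available here:

```r
library(fitdistrplus)

set.seed(42)
# Assumed known bounds (hypothetical values, for illustration only).
a <- 3
c_up <- 30

# Simulated stand-in for xx[, 1]: a beta variable stretched onto (a, c_up).
y <- a + (c_up - a) * rbeta(400, 2, 3)

u <- (y - a) / (c_up - a)        # rescale onto (0, 1)
fit.beta <- fitdist(u, "beta")   # data now lie in the support of the 2-parameter beta
fit.beta$estimate                # shape1 plays the role of alpha, shape2 of beta
```

Two caveats if this fit is compared with fits made on the original scale. The log-likelihood of `u` differs from that of `y` by the Jacobian term $n\log(c-a)$, so the AIC needs `2 * length(y) * log(c_up - a)` added to it to be comparable. And if $a$ and $c$ are estimated from the data rather than known, they count as extra parameters.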

May I directly say that the xx variable is normally distributed N(13.42,7.12)

It surely isn't, so you had better not claim that it is. It very likely won't be from any of the distributions you consider (nor any other simple distribution). Those are models -- convenient but hopefully useful approximations.
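If the candidates are treated as competing approximations rather than as the truth, the practical question becomes which one approximates each variable best. A rough sketch of how that comparison could be repeated over many columns with fitdistrplus is given below. The candidate list, the use of AIC as the criterion, and the simulated matrix standing in for `xx` are all assumptions made for illustration:

```r
library(fitdistrplus)

set.seed(7)
# Simulated stand-in for the numeric matrix xx (three positive-valued columns).
xx <- cbind(v1 = rgamma(300, shape = 4, rate = 0.3),
            v2 = rlnorm(300, meanlog = 2.5, sdlog = 0.4),
            v3 = rweibull(300, shape = 2, scale = 15))

candidates <- c("norm", "lnorm", "gamma", "weibull")

# One row per variable, one column per candidate distribution; entries are AIC.
aic_table <- t(sapply(colnames(xx), function(v) {
  sapply(candidates, function(d) {
    # A failed fit is recorded as NA instead of stopping the whole loop.
    fit <- tryCatch(fitdist(xx[, v], d), error = function(e) NULL)
    if (is.null(fit)) NA_real_ else fit$aic
  })
}))
round(aic_table, 1)
```

A lower AIC only says that one candidate approximates the data better than the others in the list; it does not show that the variable follows that distribution.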

$\dagger$ such charts - plotting sample $\beta_1,\beta_2$ (or sometimes skewness and kurtosis rather than squared-skewness and kurtosis) to identify plausible distributions - long predate Cullen and Frey (1999), by the way; I was making such plots in the 80s (several times, including in an unpublished thesis, though my plot also included the Laplace in addition to the lognormal and logistic that the above plot adds to the Pearson family); but Bowman and Shenton were effectively making them in the 70s, when they investigated the sampling distribution of skewness and kurtosis under normality -- and I am pretty confident that Bowman and Shenton didn't come up with the idea of looking at the sample values on a plot like that either; I think it may go back decades earlier. Indeed it turns out Cullen and Frey themselves say "many texts provide such charts" and give the example of Hahn and Shapiro, 1967 (so this oddness is not Cullen and Frey's fault). Some other programs call it a Pearson plot, a much better choice I think.

Related Solutions

Goodness of Fit – Evaluating Distribution Fit with Estimated Parameters Using KS Tests in R

The paper by Clauset et al. warns (Section 4.2) against small sample sizes (< 100), which are much easier to fit. You may want to consider comparing the models directly instead.

While the p-value of the KS statistic with estimated parameters is an overestimate, the bootstrapping procedure you described is able to tackle this and provides a correct p-value given enough simulations.
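The idea behind the bootstrap correction can be sketched in base R. This example uses a normal model with simulated data, not the power-law setting of the paper, and it estimates the standard deviation with `sd()` (the n-1 version) instead of the maximum-likelihood estimate; both are simplifications made for illustration:

```r
set.seed(123)
x <- rnorm(80, mean = 13.4, sd = 7.1)   # simulated data (illustrative only)

# KS distance between a sample and a normal fitted to that same sample.
ks_stat <- function(x) {
  unname(suppressWarnings(ks.test(x, "pnorm", mean(x), sd(x)))$statistic)
}

d_obs <- ks_stat(x)

# Parametric bootstrap: simulate from the fitted model, refit, recompute the distance.
n_sims <- 1000
d_boot <- replicate(n_sims, ks_stat(rnorm(length(x), mean(x), sd(x))))

p_boot  <- mean(d_boot >= d_obs)                          # accounts for estimation
p_naive <- ks.test(x, "pnorm", mean(x), sd(x))$p.value    # ignores estimation
c(bootstrap = p_boot, naive = p_naive)
```

The key step is that the parameters are re-estimated on every synthetic sample, so the reference distribution of the KS distance reflects the fact that the model was fitted to the data being tested.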

However, the way the goodness of fit is computed in your code is not correct as it does not strictly follow the procedure described in the paper, and implemented in the poweRlaw package.

Specifically: the synthetic data generation procedure is only half implemented; it does not search for the best xmin, as the estimate_xmin function of the poweRlaw package does; and finally ks.test discards all the ties, which the package's built-in KS test does not.

Code that takes these issues into account using poweRlaw is provided on this page; as a consequence it is significantly slower than the code you suggested: http://notesnico.blogspot.com/2014/07/goodness-of-fit-test-for-log-normal-and.html
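For orientation, the general shape of the poweRlaw workflow is sketched below. This is not the code from the linked page: it fits a discrete power law to simulated data, and the number of simulations is kept small only so that it runs quickly. For a log-normal model, `dislnorm` could be used in place of `displ`.

```r
library(poweRlaw)

set.seed(1)
x <- rpldis(500, xmin = 2, alpha = 2.5)   # simulated discrete power-law data

m <- displ$new(x)          # discrete power-law model object
est <- estimate_xmin(m)    # searches over xmin, minimising the KS distance
m$setXmin(est)             # store the estimated xmin and exponent in the model

# Bootstrap goodness of fit; xmin is re-estimated for every synthetic data set,
# which is what makes the procedure slow.
bs <- bootstrap_p(m, no_of_sims = 50, threads = 1)
bs$p
```

In practice far more than 50 simulations would be needed for a stable p-value.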

Solved – Fitting custom distributions by MLE

This answer assumes $\mu$ is known.

One very flexible way to get MLEs in R is to use STAN via rstan. STAN has a reputation for being an MCMC tool, but it can also estimate parameters by variational inference or MAP. And you're free not to specify the priors, in which case the MAP estimate is the MLE.

In this case, what you're doing is very similar to their hurdle-model example. Here is the STAN code for that example.

data {
  int<lower=0> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> lambda;
}
model {
  for (n in 1:N) {
    if (y[n] == 0)
      target += bernoulli_lpmf(1 | theta);
    else
      target += bernoulli_lpmf(0 | theta)
                  + poisson_lpmf(y[n] | lambda);
  }
}

To adapt this for your own use, you could:

Replace poisson_lpmf with the log-density for your $f_A$.
Add a third case to the if-else so that it checks for exceeding $\mu$, not just 0. As the meat of that third case, use the log pmf for your extreme value distribution of choice.
Replace bernoulli_lpmf with categorical_lpmf and make the mixture probability parameter into a vector.
To incorporate covariates, you can add regression parameters, and make all your other parameters functions of them. It may help to use categorical_logit_lpmf in place of categorical_lpmf.
Truncate one mixture component at $\mu$ from above and the other at $\mu$ from below, depending on your perspective on the dilemma raised by Jarle Tufto in the comments. It seems like you could get VERY different estimates depending on how exactly you decide to handle this. A nice sanity check: generate a fake dataset from the fitted parameters and make sure it has the right amount at 0, amount above $\mu$, etc.
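The sanity check mentioned in the last point could look roughly like the following base R sketch. It covers only the two-part hurdle example (zero versus Poisson), not the three-part model with a component above $\mu$. The "fitted" values are hypothetical placeholders for optimiser output, and drawing the positive part from a zero-truncated Poisson is an assumption of this sketch:

```r
set.seed(2024)

# Hypothetical "fitted" values, standing in for the optimiser's output.
theta_hat  <- 0.3   # probability of a zero
lambda_hat <- 4     # Poisson rate for the non-zero part

# Draw n values from a hurdle model: zero with probability theta, otherwise a
# Poisson draw conditioned on being positive (inverse-CDF method).
r_hurdle <- function(n, theta, lambda) {
  is_zero <- runif(n) < theta
  u <- runif(n, min = dpois(0, lambda), max = 1)
  pos <- qpois(u, lambda)            # always >= 1 because u > P(Y = 0)
  ifelse(is_zero, 0, pos)
}

y_obs  <- r_hurdle(1000, 0.3, 4)                  # stand-in for the real data
y_fake <- r_hurdle(1000, theta_hat, lambda_hat)   # data simulated from the fit

# Compare the features the model is supposed to reproduce.
rbind(observed  = c(share_zero = mean(y_obs == 0),
                    mean_pos   = mean(y_obs[y_obs > 0])),
      simulated = c(share_zero = mean(y_fake == 0),
                    mean_pos   = mean(y_fake[y_fake > 0])))
```

A large gap between the two rows would suggest that the likelihood, the truncation choice or the optimisation needs another look.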

Once you have a file with the right STAN code, you can use STAN with lots of different toolchains. To use it with R, check out these examples. I simplified one to get an MLE, using rstan::optimizing instead of sampling:

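install.packages("rstan")
library("rstan")
model = stan_model("Example1.stan")
# y is assumed to be your vector of non-negative integer observations;
# the data block of the model above expects N and y.
fit = optimizing(model, data = list(N = length(y), y = y))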

There are also some tricks for faster/better optimization that could help in practice.
