Finding the distribution of the data is a crucial part of my thesis. I have to do this step in R, even though other tools could give me this information faster.
I searched for how to determine which distribution best fits a given variable, and these instructions guided me a bit.
Instructions (via Stack Overflow): how-to-determine-which-distribution-fits-my-data-best
However, I am lost as to how to find the distributions of my variables, since I have about 18 of them.
For example:
http://www.filedropper.com/samplest
library(fitdistrplus)
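# Read a semicolon-separated file; decimal commas mean the columns arrive as text
importeddata <- read.csv(file.choose(), sep = ";", na.strings = "", stringsAsFactors = FALSE, header = TRUE)
# Replace decimal commas with decimal points in every column
for (i in seq_len(ncol(importeddata))) {
  importeddata[, i] <- gsub(",", ".", importeddata[, i])
}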
xx<- as.matrix(as.data.frame(lapply(importeddata, as.numeric)))
descdist(xx[,1])
From this plot, the variable might follow a uniform, beta or normal distribution. Let's see.
fit.norm <- fitdist(xx[,1], "norm")
fit.norm
Fitting of the distribution ' norm ' by maximum likelihood
Parameters:
estimate Std. Error
mean 13.428316 0.3652664
sd 7.120353 0.2582823
plot(fit.norm)
However, fitting a beta causes an error, because the beta distribution is a family of continuous probability distributions defined on the interval [0, 1], parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution.
fitdist(xx[,1], "beta")
Error in start.arg.default(data10, distr = distname) :
values must be in [0-1] to fit a beta distribution
fit.uni <- fitdist(xx[,1], "unif")
fit.uni
Fitting of the distribution ' unif ' by maximum likelihood
Parameters:
estimate Std. Error
min 3.12 NA
max 29.64 NA
plot(fit.uni)
fit.uni$aic
[1] NA
fit.norm$aic
[1] 2574.241
There are two questions to be asked:
- Can I directly say that the variable xx is normally distributed, N(13.42, 7.12)? How can I tell whether one distribution fits better than another?
- Is there an alternative way to get this information, given that it has to be repeated 18 times?
Best Answer
There are important things to say that are much too long for comments but you'll need to answer some questions (which I will post in comments) for a proper answer to be offered.
Note that the distributions in the $(\beta_1,\beta_2)$ plot$^\dagger$ are all actually location-scale families of distributions (you can shift or stretch the distributions without changing the skewness and kurtosis). Because both our plots use squared skewness, the scale factor may also be negative, so a sign-flip (reflection) leaves a distribution's position on the plot unchanged as well.
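The invariance can be checked numerically. The sketch below uses simulated gamma data (not your variable) and plain moment-based estimates in a helper I've called `sample_shape`, which is a hypothetical name; `descdist` may apply a bias correction, so its numbers can differ slightly from these.

```r
# Moment-based sample skewness and kurtosis (kurtosis, not excess kurtosis).
sample_shape <- function(x) {
  z  <- x - mean(x)
  m2 <- mean(z^2)
  c(skewness = mean(z^3) / m2^1.5,
    kurtosis = mean(z^4) / m2^2)
}

set.seed(1)                          # arbitrary seed, for reproducibility only
x <- rgamma(500, shape = 2)          # simulated right-skewed data

orig    <- sample_shape(x)
shifted <- sample_shape(10 + 3 * x)  # shift and stretch
flipped <- sample_shape(-x)          # sign-flip

# Shifting and stretching leave both values unchanged; a flip only changes
# the sign of the skewness, so the squared skewness is unchanged too.
all.equal(orig, shifted)
all.equal(orig["skewness"]^2, flipped["skewness"]^2)
all.equal(orig["kurtosis"], flipped["kurtosis"])
```

So a point on the plot identifies a whole location-scale family (and its mirror image), not a single distribution.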
[In reality, in that diagram we're dealing with the Pearson distributions plus lognormal and logistic; if you're going to show distributions beyond the Pearson family, it's not clear to me why you'd add those but not some others; adding new distributions to such plots is discussed here]
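The grey region in your plot (pink in the plot below) is that for the Pearson type I distribution.
(plot taken from my answer at the link above)
The Pearson type I is the beta family in its general, four-parameter form, whose support is an arbitrary interval $(a, c)$ rather than only $(0, 1)$:
$$f_Y(y) =\frac{1}{B(\alpha,\beta)} \frac{(y-a)^{\alpha-1} (c-y)^{\beta-1}}{(c-a)^{\alpha+\beta-1}},\: a < y <c$$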
This is why your beta fit failed!
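One way around the error, in line with the density above on $(a, c)$, is to rescale the data into $(0, 1)$ before calling `fitdist`. The following is only a sketch: the data are simulated rather than your `xx[,1]`, and the bounds `a_hat` and `c_hat` are my own assumption (the sample range padded by 1%), not maximum likelihood estimates of $a$ and $c$.

```r
library(fitdistrplus)

set.seed(42)                                 # arbitrary seed
# Simulated stand-in for xx[, 1]: values spread over roughly (3, 30).
x <- 3 + 27 * rbeta(300, shape1 = 1.5, shape2 = 2)

# Assumed bounds: pad the sample range slightly so that no rescaled value
# is exactly 0 or 1, where the beta likelihood can degenerate.
pad   <- 0.01 * diff(range(x))
a_hat <- min(x) - pad
c_hat <- max(x) + pad

y <- (x - a_hat) / (c_hat - a_hat)           # strictly inside (0, 1)
fit.beta <- fitdist(y, "beta")
fit.beta$estimate                            # shape1 and shape2 for the rescaled data
```

Note that `fit.beta$aic` refers to the rescaled data and treats the bounds as known, so it is not directly comparable with `fit.norm$aic`. With the bounds held fixed, the log-likelihood on the original scale would be `fit.beta$loglik - length(x) * log(c_hat - a_hat)`, and a fair comparison should also account for the two extra parameters.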
As for whether you may say the variable is normally distributed: it surely isn't, so you had better not claim that it is. It very likely won't be any of the distributions you consider (nor any other simple distribution). Those are models: convenient but hopefully useful approximations.
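If the aim is simply to pick a convenient approximating model for each of your roughly 18 variables, the fitting can be looped over columns and the candidates compared by AIC. This is a rough sketch under assumptions of mine: the matrix `xx_demo` is simulated (three columns rather than 18), and the candidate set `norm`, `gamma`, `lnorm` is only illustrative.

```r
library(fitdistrplus)

set.seed(7)                                  # arbitrary seed
# Simulated stand-in for the numeric matrix xx.
xx_demo <- cbind(v1 = rnorm(200, 13, 7),
                 v2 = rgamma(200, shape = 2, rate = 0.2),
                 v3 = rlnorm(200, meanlog = 2, sdlog = 0.4))

candidates <- c("norm", "gamma", "lnorm")    # assumed candidate set

# One row per variable, one column per candidate distribution.
aic_table <- t(sapply(colnames(xx_demo), function(v) {
  x <- xx_demo[, v]
  x <- x[!is.na(x)]
  sapply(candidates, function(d) {
    # A fit that fails (e.g. gamma on non-positive data) is recorded as NA.
    tryCatch(fitdist(x, d)$aic, error = function(e) NA_real_)
  })
}))

aic_table
candidates[apply(aic_table, 1, which.min)]   # lowest-AIC candidate per variable
```

A lower AIC only says that one candidate approximates the data better than the others tried; it does not show that the variable has that distribution, so the diagnostic plots from `plot()` on each fit are still worth looking at.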
$\dagger$ Such charts, which plot sample $\beta_1,\beta_2$ (or sometimes skewness and kurtosis rather than squared skewness and kurtosis) to identify plausible distributions, long predate Cullen and Frey (1999), by the way. I was making such plots in the 80s (several times, including in an unpublished thesis, though my plot also included the Laplace in addition to the lognormal and logistic that the above plot adds to the Pearson family). Bowman and Shenton were effectively making them in the 70s, when they investigated the sampling distribution of skewness and kurtosis under normality, and I am pretty confident that Bowman and Shenton didn't come up with the idea of looking at the sample values on a plot like that either; I think it may go back decades earlier. Indeed it turns out Cullen and Frey themselves say "many texts provide such charts" and give the example of Hahn and Shapiro, 1967 (so the oddness of naming the chart after them is not Cullen and Frey's fault). Some other programs call it a Pearson plot, a much better choice I think.