I have a dataset and would like to figure out which distribution fits my data best.
I used the fitdistr()
function to estimate the necessary parameters to describe the assumed distribution (i.e. Weibull, Cauchy, Normal). Using those parameters I can conduct a Kolmogorov-Smirnov Test to estimate whether my sample data is from the same distribution as my assumed distribution.
If the p-value is > 0.05 I can assume that the sample data is drawn from the same distribution. But the p-value doesn't provide any information about the godness of fit, isn't it?
So in case the p-value of my sample data is > 0.05 for a normal distribution as well as a weibull distribution, how can I know which distribution fits my data better?
This is basically the what I have done:
> mydata
[1] 37.50 46.79 48.30 46.04 43.40 39.25 38.49 49.51 40.38 36.98 40.00
[12] 38.49 37.74 47.92 44.53 44.91 44.91 40.00 41.51 47.92 36.98 43.40
[23] 42.26 41.89 38.87 43.02 39.25 40.38 42.64 36.98 44.15 44.91 43.40
[34] 49.81 38.87 40.00 52.45 53.13 47.92 52.45 44.91 29.54 27.13 35.60
[45] 45.34 43.37 54.15 42.77 42.88 44.26 27.14 39.31 24.80 16.62 30.30
[56] 36.39 28.60 28.53 35.84 31.10 34.55 52.65 48.81 43.42 52.49 38.00
[67] 38.65 34.54 37.70 38.11 43.05 29.95 32.48 24.63 35.33 41.34
# estimate shape and scale to perform KS-test for weibull distribution
> fitdistr(mydata, "weibull")
shape scale
6.4632971 43.2474500
( 0.5800149) ( 0.8073102)
# KS-test for weibull distribution
> ks.test(mydata, "pweibull", scale=43.2474500, shape=6.4632971)
One-sample Kolmogorov-Smirnov test
data: mydata
D = 0.0686, p-value = 0.8669
alternative hypothesis: two-sided
# KS-test for normal distribution
> ks.test(mydata, "pnorm", mean=mean(mydata), sd=sd(mydata))
One-sample Kolmogorov-Smirnov test
data: mydata
D = 0.0912, p-value = 0.5522
alternative hypothesis: two-sided
The p-values are 0.8669 for the Weibull distribution, and 0.5522 for the normal distribution. Thus I can assume that my data follows a Weibull as well as a normal distribution. But which distribution function describes my data better?
Referring to elevendollar I found the following code, but don't know how to interpret the results:
fits <- list(no = fitdistr(mydata, "normal"),
we = fitdistr(mydata, "weibull"))
sapply(fits, function(i) i$loglik)
no we
-259.6540 -257.9268
Best Answer
First, here are some quick comments:
But let's do some exploration. I will use the excellent
fitdistrplus
package which offers some nice functions for distribution fitting. We will use the functiondescdist
to gain some ideas about possible candidate distributions.Now lets use
descdist
:The kurtosis and squared skewness of your sample is plottet as a blue point named "Observation". It seems that possible distributions include the Weibull, Lognormal and possibly the Gamma distribution.
Let's fit a Weibull distribution and a normal distribution:
Now inspect the fit for the normal:
And for the Weibull fit:
Both look good but judged by the QQ-Plot, the Weibull maybe looks a bit better, especially at the tails. Correspondingly, the AIC of the Weibull fit is lower compared to the normal fit:
Kolmogorov-Smirnov test simulation
I will use @Aksakal's procedure explained here to simulate the KS-statistic under the null.
The ECDF of the simulated KS-statistics looks like follows:
Finally, our $p$-value using the simulated null distribution of the KS-statistics is:
This confirms our graphical conclusion that the sample is compatible with a Weibull distribution.
As explained here, we can use bootstrapping to add pointwise confidence intervals to the estimated Weibull PDF or CDF:
Automatic distribution fitting with GAMLSS
The
gamlss
package forR
offers the ability to try many different distributions and select the "best" according to the GAIC (the generalized Akaike information criterion). The main function isfitDist
. An important option in this function is the type of the distributions that are tried. For example, settingtype = "realline"
will try all implemented distributions defined on the whole real line whereastype = "realsplus"
will only try distributions defined on the real positive line. Another important option is the parameter $k$, which is the penalty for the GAIC. In the example below, I set the parameter $k = 2$ which means that the "best" distribution is selected according to the classic AIC. You can set $k$ to anything you like, such as $\log(n)$ for the BIC.According to the AIC, the Weibull distribution (more specifically
WEI2
, a special parametrization of it) fits the data best. The exact parametrization of the distributionWEI2
is detailled in this document on page 279. Let's inspect the fit by looking at the residuals in a worm plot (basically a de-trended Q-Q-plot):We expect the residuals to be close to the middle horizontal line and 95% of them to lie between the upper and lower dotted curves, which act as 95% pointwise confidence intervals. In this case, the worm plot looks fine to me indicating that the Weibull distribution is an adequate fit.