Solved – Fitting data, lognormal and beta distribution, interpretations

Tags: beta distribution, interpretation, lognormal distribution, modeling

I've been given some data that should come from a lognormal distribution.
I have some issues with the fitting; here is what I did.

library(fitdistrplus)

x <- c(3.36,0.31,9.24,0.29,0.01,1.18,0.35,0.89,3.23,12.24,0.19,5.35,23.05,32.39,
   0.79,0.7,14.64,8.81,4.12,17.92,7.80,11.96,18.39,11.29,6.46,13.22,9.01,9.4,
   8.43,25.82,4.69,6.28,8.70,7.45,9.48,5.07,11.93,9.52,18.41,11.38,10.80,23.21,
   15.18,17.6,20.35,49.61,34.69,12.25,38.82,25.66,25.01,16.89,19.58,22.72,
   10.01,4.30,20.06,5.93,4.55,18.11,0.54,5.75,16.79,8.77,0.11,5,3.77,9.06)

descdist(x,discrete=FALSE)

I cannot post the graph (too many links for my reputation), but the data seem to come from a beta distribution rather than a lognormal one. I tried the lognormal fit anyway.

fit.dist <- fitdist(x,"lnorm")
plot(fit.dist)

What can I conclude from the graphs? The Q-Q plot suggests that the data have a longer right tail than the theoretical distribution. Does it also suggest that the data are right-skewed?

At this point I rescale the data to the interval [0,1] to see whether the beta distribution fits better.

y <- x/100
fit.dist <- fitdist(y,"beta")

The fit seems, as expected, much better.

> summary(fit.dist)
Fitting of the distribution ' beta ' by maximum likelihood 
Parameters : 
        estimate Std. Error
shape1 0.8660958  0.1292606
shape2 6.6110806  1.2517790
Loglikelihood:  79.03786   AIC:  -154.0757   BIC:  -149.6367

Can I, at this point, infer that the data come from a Beta(0.8660958,6.6110806)?
What is the relation between the beta and the lognormal in this particular case?
What other analysis can I perform to strengthen this hypothesis?

Best Answer

The plots indicate that the data have a shorter right tail than a two parameter lognormal, not a longer tail.
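The tail comparison can also be checked numerically, without the plot. The sketch below is in Python (standard library only) rather than R. It assumes the lognormal is fitted by maximum likelihood (mean and population standard deviation of log(x)) and uses plotting positions (i - 0.5)/n, which is one common convention and not necessarily the one fitdistrplus uses.

```python
import math
from statistics import NormalDist, fmean, pstdev

x = [3.36, 0.31, 9.24, 0.29, 0.01, 1.18, 0.35, 0.89, 3.23, 12.24, 0.19, 5.35, 23.05, 32.39,
     0.79, 0.7, 14.64, 8.81, 4.12, 17.92, 7.80, 11.96, 18.39, 11.29, 6.46, 13.22, 9.01, 9.4,
     8.43, 25.82, 4.69, 6.28, 8.70, 7.45, 9.48, 5.07, 11.93, 9.52, 18.41, 11.38, 10.80, 23.21,
     15.18, 17.6, 20.35, 49.61, 34.69, 12.25, 38.82, 25.66, 25.01, 16.89, 19.58, 22.72,
     10.01, 4.30, 20.06, 5.93, 4.55, 18.11, 0.54, 5.75, 16.79, 8.77, 0.11, 5, 3.77, 9.06]

logs = [math.log(v) for v in x]
mu = fmean(logs)      # maximum-likelihood estimate of meanlog
sigma = pstdev(logs)  # maximum-likelihood estimate of sdlog (divides by n)

n = len(x)
obs = sorted(x)
# Fitted lognormal quantiles at plotting positions (i - 0.5) / n (an assumed convention).
theo = [math.exp(mu + sigma * NormalDist().inv_cdf((i - 0.5) / n))
        for i in range(1, n + 1)]

# Compare the upper tail: fitted quantile against observed value.
for t, o in zip(theo[-5:], obs[-5:]):
    print(f"fitted {t:8.2f}   observed {o:6.2f}")
```

Fitted quantiles that sit well above the observed values in the upper tail mean the data have a shorter right tail than the fitted lognormal.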

Certainly the data are right-skewed, but their logs are distinctly left-skewed, so a lognormal is unlikely to be useful.

I suggest looking at a normal Q-Q plot of log(x) (qqnorm(log(x)) in R), or indeed a histogram of log(x) if you use plenty of bins.
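A numerical companion to those plots is the sample skewness of x and of log(x); for lognormal data the logs would be roughly symmetric. This is a minimal Python sketch using the moment-based skewness m3 / m2^1.5 (one of several definitions in use), not a replacement for looking at the plots.

```python
import math
from statistics import fmean

x = [3.36, 0.31, 9.24, 0.29, 0.01, 1.18, 0.35, 0.89, 3.23, 12.24, 0.19, 5.35, 23.05, 32.39,
     0.79, 0.7, 14.64, 8.81, 4.12, 17.92, 7.80, 11.96, 18.39, 11.29, 6.46, 13.22, 9.01, 9.4,
     8.43, 25.82, 4.69, 6.28, 8.70, 7.45, 9.48, 5.07, 11.93, 9.52, 18.41, 11.38, 10.80, 23.21,
     15.18, 17.6, 20.35, 49.61, 34.69, 12.25, 38.82, 25.66, 25.01, 16.89, 19.58, 22.72,
     10.01, 4.30, 20.06, 5.93, 4.55, 18.11, 0.54, 5.75, 16.79, 8.77, 0.11, 5, 3.77, 9.06]

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

skew_raw = skewness(x)
skew_log = skewness([math.log(v) for v in x])
print(f"skewness of x:      {skew_raw:.2f}")
print(f"skewness of log(x): {skew_log:.2f}")
```

A positive value for x together with a clearly negative value for log(x) is the pattern described above.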

If anything, these plots hint at something nearer to a gamma, or perhaps something a little lighter-tailed still. (If they're real data they won't be from a gamma distribution either, but it's considerably more plausible.) A normal Q-Q plot of the cube root of the data looks reasonably close to straight, but there's a big gap (1.057 to 1.478) in the cube-root values that would make a gamma not ideal (while being considerably better than lognormal). Aside from that gap, an exponential looks feasible.
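The gap in the cube-root values can be located directly from the data. A short Python sketch that sorts the cube roots and reports the widest interval between neighbouring values:

```python
x = [3.36, 0.31, 9.24, 0.29, 0.01, 1.18, 0.35, 0.89, 3.23, 12.24, 0.19, 5.35, 23.05, 32.39,
     0.79, 0.7, 14.64, 8.81, 4.12, 17.92, 7.80, 11.96, 18.39, 11.29, 6.46, 13.22, 9.01, 9.4,
     8.43, 25.82, 4.69, 6.28, 8.70, 7.45, 9.48, 5.07, 11.93, 9.52, 18.41, 11.38, 10.80, 23.21,
     15.18, 17.6, 20.35, 49.61, 34.69, 12.25, 38.82, 25.66, 25.01, 16.89, 19.58, 22.72,
     10.01, 4.30, 20.06, 5.93, 4.55, 18.11, 0.54, 5.75, 16.79, 8.77, 0.11, 5, 3.77, 9.06]

# Cube roots in increasing order.
roots = sorted(v ** (1 / 3) for v in x)

# Width and endpoints of every interval between neighbouring cube roots.
gaps = [(hi - lo, lo, hi) for lo, hi in zip(roots, roots[1:])]
width, lo, hi = max(gaps)
print(f"largest gap: {lo:.3f} to {hi:.3f} (width {width:.3f})")
```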

Can I, at this point, infer that the data come from a Beta(0.8660958,6.6110806)?

Certainly not. It's a reasonable fit. That doesn't mean any more than just that. Don't imagine your model is anything other than a plausible fiction (possibly a useful one).

(Note also that you appear to have actually fitted a three parameter model, not a two parameter model. How did you figure to divide by 100 except by reference to the data?)
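To see why the divisor acts as an extra parameter, one can refit with different upper bounds and watch the shape estimates move. The Python sketch below swaps in method-of-moments estimates for the maximum-likelihood fit used in the question (so the numbers will not match the fitdist output), and the candidate bounds 50, 100 and 200 are arbitrary choices for illustration.

```python
from statistics import fmean, pvariance

x = [3.36, 0.31, 9.24, 0.29, 0.01, 1.18, 0.35, 0.89, 3.23, 12.24, 0.19, 5.35, 23.05, 32.39,
     0.79, 0.7, 14.64, 8.81, 4.12, 17.92, 7.80, 11.96, 18.39, 11.29, 6.46, 13.22, 9.01, 9.4,
     8.43, 25.82, 4.69, 6.28, 8.70, 7.45, 9.48, 5.07, 11.93, 9.52, 18.41, 11.38, 10.80, 23.21,
     15.18, 17.6, 20.35, 49.61, 34.69, 12.25, 38.82, 25.66, 25.01, 16.89, 19.58, 22.72,
     10.01, 4.30, 20.06, 5.93, 4.55, 18.11, 0.54, 5.75, 16.79, 8.77, 0.11, 5, 3.77, 9.06]

def beta_moments(values, upper):
    """Method-of-moments beta shapes for values rescaled to [0, 1] by `upper`."""
    y = [v / upper for v in values]
    m = fmean(y)
    v = pvariance(y)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

# Arbitrary candidate upper bounds, all above max(x).
for upper in (50, 100, 200):
    a, b = beta_moments(x, upper)
    print(f"upper bound {upper:3d}: shape1 = {a:.3f}, shape2 = {b:.3f}")
```

The second shape parameter in particular depends strongly on the assumed upper bound, which is why the bound counts as a fitted quantity.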

What other analysis can I perform to strengthen this hypothesis?

"Oh, it looks a bit like such-and-such" isn't really what I'd call a hypothesis, but consider the issues with hypotheses suggested by the data (whether formal statistical hypotheses or not, the issues are similar).

Those issues aside, you should not seek to confirm the results of your search for a model but to disconfirm them. If we seek to confirm what we already like the idea of, we'll be ignoring (generally subconsciously) evidence to the contrary. But with models, it is better still to accept that they're not true and focus instead on getting an idea of how much that might matter (or indeed, to consider whether we really require an explicit distributional model at all).

Related Solutions

Solved – Lognormal distribution from world bank quintiles PPP data

Here is an example of quick and dirty R code to illustrate what Michael suggested:

Define quantiles available:

q<-c(0.1,0.2,0.4,0.6,0.8,0.9)

Create artificial data and add some noise:

data <- jitter(qlnorm(q))

Create the function to minimise (the sum of absolute differences between the observed and the lognormal quantiles):

fitfun <- function(p) sum(abs(data - qlnorm(q, p[1], p[2])))

Run the optimiser with the initial guess of parameters of log-normal distribution:

opt <- optim(c(0.1,1.1), fitfun)

The fitted parameters:

opt$par

Display the fit visually:

aa<-seq(0,0.95,by=0.01)
plot(aa,qlnorm(aa,opt$par[1],opt$par[2]),type="l")
points(q,data)


Note that I intentionally plotted only up to the 95% quantile, since the log-normal distribution is unbounded, i.e. the 100% quantile is infinite.

The usual caveats apply: a real-life example might look much uglier than this one, i.e. the fit might be much worse. Also try the Singh-Maddala distribution instead of the log-normal; it works better for income distributions.
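As a cross-check on the optimiser, the lognormal parameters can also be read off a straight-line fit, because log(quantile) = meanlog + sdlog * z_p, where z_p is the standard normal quantile. The Python sketch below uses ordinary least squares on the log scale, which is a different criterion from the sum of absolute deviations minimised above, and it uses noise-free standard lognormal quantiles as stand-in data.

```python
import math
from statistics import NormalDist, fmean

probs = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
# Noise-free quantiles of a standard lognormal, standing in for real data.
quantiles = [math.exp(NormalDist().inv_cdf(p)) for p in probs]

# Regress log(quantile) on the standard normal quantile z_p:
# the slope estimates sdlog and the intercept estimates meanlog.
z = [NormalDist().inv_cdf(p) for p in probs]
logq = [math.log(v) for v in quantiles]
z_bar, l_bar = fmean(z), fmean(logq)
sdlog = (sum((a - z_bar) * (b - l_bar) for a, b in zip(z, logq))
         / sum((a - z_bar) ** 2 for a in z))
meanlog = l_bar - sdlog * z_bar
print(meanlog, sdlog)
```

With noisy or real quantiles the two approaches will generally give somewhat different estimates; a large disagreement is itself a hint that the lognormal shape is a poor description.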

Solved – Fitting Beta Distributions to Data

To generate an accurate curve fit, you can sample from the histogram via the scipy.stats.rv_histogram class and then average the fitted parameters over multiple iterations:

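import numpy as np
import scipy.stats

# y2 (the histogram heights) and k (the number of bins) are assumed to be
# defined already; they are not shown here.
n_iter = 1000
n_samples = 1000
params = np.empty((n_iter, 2))

# Distribution defined by the histogram, with bin edges 0, 1, ..., k.
hist_dist = scipy.stats.rv_histogram((y2, np.append(np.arange(k), k)))

for i in range(n_iter):
    # Draw a sample from the histogram and fit a beta with location 0 and scale k fixed.
    data = hist_dist.rvs(size=n_samples)
    a, b, _, _ = scipy.stats.beta.fit(data, floc=0, fscale=k)

    params[i, 0] = a
    params[i, 1] = b

# Average the shape parameters over all iterations.
a_mean = np.round(np.mean(params[:, 0]), 2)
b_mean = np.round(np.mean(params[:, 1]), 2)
print(a_mean, b_mean)

# Least-squares factor that rescales the fitted density to the histogram heights.
y2_normalized_fit = scipy.stats.beta.pdf(np.arange(k), a_mean, b_mean, loc=0, scale=k)
scaling_factor, _, _, _ = np.linalg.lstsq(y2_normalized_fit.reshape(-1, 1), y2, rcond=None)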

Then multiply the fitted curve by scaling_factor to put it on the same scale as the histogram:

y2_normalized_fit * scaling_factor
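With a single column, the least-squares problem solved by np.linalg.lstsq has a closed form: the factor s that minimises sum((s * fit - target)^2) is sum(fit * target) / sum(fit^2). The Python sketch below shows this with made-up numbers; scale_factor, fit and target are hypothetical names, not part of the code above.

```python
def scale_factor(fit, target):
    """Least-squares s minimising sum((s * fit - target) ** 2)."""
    return sum(f * t for f, t in zip(fit, target)) / sum(f * f for f in fit)

# Made-up values purely for illustration: target is exactly 2 * fit.
fit = [0.1, 0.3, 0.4, 0.2]
target = [0.2, 0.6, 0.8, 0.4]
s = scale_factor(fit, target)
print(s)
```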
