Look how big your values are. Look at the Theoretical Quantiles on your second (bad) set of plots. Notice anything alarming?
statsmodels' ProbPlot is experiencing some, ahem, extreme numerical errors when you feed it such large values.
We can compare the behavior of scipy.stats.probplot with statsmodels' ProbPlot as a check.
Run the code below first with s = 0.954, then run it with s = 5.23.
import matplotlib
import numpy as np
import statsmodels.api as sm
from scipy import stats
from scipy.stats import lognorm
import matplotlib.pyplot as plt
# generate some lognormal values
s = 0.954 # shape parameter
#s = 5.23
mean, var, skew, kurt = lognorm.stats(s, moments='mvsk')
print(mean,var,skew,kurt)
x = stats.lognorm.rvs(s, size=1000)
print("max(x) = ",np.max(x)," min(x) = ",np.min(x))
# make Scipy probplot
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
res1 = stats.probplot(x, fit=True, dist=stats.lognorm,sparams=(s,), plot=ax1)
#res1 = stats.probplot(x, fit=True, sparams=(s,), plot=ax1)
ax1.set_title("Scipy Probplot for lognormal dist")
plt.show()
# make statsmodel probplot
fig, ax = plt.subplots(1, 2, figsize=(14, 4))
#shape, loc, scale = lognorm.fit(x)
#print(shape,loc,scale)  # only valid if the fit above is uncommented; shape, loc, scale are undefined otherwise
#probplot = sm.ProbPlot(x, dist=lognorm, fit=True, loc=loc, scale=scale, distargs=(shape,))
print("max(x) = ",np.max(x)," min(x) = ",np.min(x))
#probplot = sm.ProbPlot(np.log(x), fit=True)
probplot = sm.ProbPlot(x, dist=lognorm, fit=True) # bad line: no shape, loc, or scale supplied
#probplot = sm.ProbPlot(x, dist=lognorm, loc=loc, scale=scale, fit=True)
probplot.ppplot(line='45', ax=ax[0])
probplot.qqplot(line='45', ax=ax[1])
ax[0].set_title('P-P Plot')
ax[1].set_title('Q-Q Plot')
plt.show()
SciPy also uses an unfamiliar parameterization of the lognormal distribution; see fitting log normal distribution in R vs Scipy.
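To make that parameterization concrete, here is a small sketch of my own (not from the original post): scipy.stats.lognorm uses s = sigma and scale = exp(mu), with loc as an extra shift, so the log of a lognorm sample is normal with mean mu and standard deviation sigma.

```python
import numpy as np
from scipy.stats import lognorm, norm

# SciPy's lognorm: in the usual "lognormal(mu, sigma)" notation,
# s = sigma, scale = exp(mu), loc = 0.
mu, sigma = 1.5, 0.954
rng = np.random.default_rng(0)
x = lognorm.rvs(sigma, scale=np.exp(mu), size=100_000, random_state=rng)

log_x = np.log(x)
print("mean(log x) ~ mu:   ", log_x.mean())   # should be close to 1.5
print("std(log x)  ~ sigma:", log_x.std())    # should be close to 0.954

# The densities agree as well: the lognorm pdf at a point equals the
# change-of-variables transform of the normal pdf of its log.
pt = 3.0
lhs = lognorm.pdf(pt, sigma, scale=np.exp(mu))
rhs = norm.pdf(np.log(pt), loc=mu, scale=sigma) / pt
print(lhs, rhs)  # identical up to floating point
```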
Addendum:
Look at the size of your Theoretical Quantiles, with values like 120,000 and 500,000. I postulate that within sm.ProbPlot, somewhere it is doing exp(large number) and getting a lot of inf's and large numbers like 500,000.
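A quick check of that postulate (my own sketch, not part of the original code): the lognormal quantile is exp(s * z) for the corresponding normal quantile z when loc = 0 and scale = 1, so with s = 3.74 even the 99.9% theoretical quantile is already in the hundreds of thousands, while with s = 0.954 it stays small.

```python
import numpy as np
from scipy.stats import lognorm, norm

for s in (0.954, 3.74):
    q999 = lognorm.ppf(0.999, s)  # equals exp(s * norm.ppf(0.999)) for loc=0, scale=1
    print(f"s = {s}: 99.9% quantile = {q999:,.0f}")
    print("  closed form:", np.exp(s * norm.ppf(0.999)))
```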
Now run the code below. Initially, s is set to 0.954. Note how the max of the data is around 25 or so, and the quantiles and histogram all have maximum values close to the maximum of the data.
Then set s to 3.74176496393, which is the fit from your data. Now things start to go awry. I just ran it and got max(x) = 485,096, and the plots are scaled out to 500,000 - 700,000. This is aggravated by your original code line:
probplot = sm.ProbPlot(x, dist=lognorm, fit=True) # original code line
... which doesn't give location, scale, and shape to the routine.
Yet your data don't have a max of 200,000 - 500,000, which is what a shape of 3.74 will give you. The max of your data is 1913 -- yet when you fit it, you got a shape parameter of s = 3.74, and your 75th percentile is only 69. Could some outliers be excessively influencing the fit?
I postulate you are getting a poor fit of the lognormal distribution to your data.
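As a hedged sketch of that hypothesis (synthetic data, not your actual dataset): a handful of extreme points can inflate the fitted shape parameter noticeably. Fixing loc = 0 keeps the fit stable, which makes the effect easy to see.

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(42)
clean = lognorm.rvs(0.954, size=1000, random_state=rng)

# contaminate with a few huge outliers, like the 1913 max in the real data
contaminated = np.concatenate([clean, [1913.0, 1500.0, 900.0, 700.0, 500.0]])

# with floc=0, the fitted shape is essentially the std dev of log(x)
s_clean, _, _ = lognorm.fit(clean, floc=0)
s_dirty, _, _ = lognorm.fit(contaminated, floc=0)
print("fitted shape, clean data:      ", s_clean)  # near the true 0.954
print("fitted shape, with 5 outliers: ", s_dirty)  # noticeably larger
```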
import matplotlib
import numpy as np
import statsmodels.api as sm
from scipy import stats
from scipy.stats import lognorm
import matplotlib.pyplot as plt
# generate some lognormal values
s = 0.954 # shape parameter
#s = 3.74176496393
mean, var, skew, kurt = lognorm.stats(s, moments='mvsk')
print("Generated random lognormal variable x")
print("shape = ",s)
print("Distribution parameters")
print("mean = ",mean," var = ",var," skew = ",skew," kurtosis = ",kurt)
x = stats.lognorm.rvs(s, size=1000)
print("max(x) = ",np.max(x)," min(x) = ",np.min(x))
# make Scipy probplot
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
res1 = stats.probplot(x, fit=True, dist=stats.lognorm,sparams=(s,), plot=ax1)
#res1 = stats.probplot(x, fit=True, sparams=(s,), plot=ax1)
ax1.set_title("Scipy Probplot for lognormal dist")
plt.show()
# make statsmodel probplot
fig, ax = plt.subplots(1, 2, figsize=(14, 4))
shape, loc, scale = lognorm.fit(x)
#paramters from actual data
#shape=3.74176496393
#loc=1.48645695058
#scale=1.95280308726
print("shape (sigma) =",shape," loc (shift) = ",loc," scale (exp(mu)) = ",scale)
print("max(x) = ",np.max(x)," min(x) = ",np.min(x))
probplot = sm.ProbPlot(x, dist=lognorm, fit=True, loc=loc, scale=scale, distargs=(shape,))
#probplot = sm.ProbPlot(np.log(x), fit=True)
#probplot = sm.ProbPlot(x, dist=lognorm, fit=True) # original code line
#probplot = sm.ProbPlot(x, dist=lognorm, loc=loc, scale=scale, fit=True)
probplot.ppplot(line='45', ax=ax[0])
probplot.qqplot(line='45', ax=ax[1])
ax[0].set_title('Statsmodels P-P Plot')
ax[1].set_title('Statsmodels Q-Q Plot')
plt.show()
# histograms and fit pdf
fig3 = plt.figure()
ax3 = fig3.add_subplot(111)
#xp = np.linspace(lognorm.ppf(0.01, s),
# lognorm.ppf(0.99, s), 100)
print("s =",s)
shape, loc, scale = lognorm.fit(x)
#actual values from dataset:
#shape=3.74176496393
#loc=1.48645695058
#scale=1.95280308726
print("Shape = %.3f loc = %.3f scale = %.3f" % (shape, loc, scale))
xp = np.linspace(lognorm.ppf(0.01, shape, loc=loc, scale=scale),
                 lognorm.ppf(0.99, shape, loc=loc, scale=scale), 100)
ax3.plot(xp, lognorm.pdf(xp, shape, loc=loc, scale=scale),'r-', lw=5, alpha=0.6, label='lognorm pdf',figure=fig3)
ax3.hist(x, density=True, log=False, histtype='stepfilled', alpha=0.2,figure=fig3)  # normed= was removed from matplotlib; use density=
ax3.legend(loc='best', frameon=False)
plt.show()
Best Answer
I'll turn my comments into an answer; I can delete this or add more if necessary.
Based on your original qq-plot, it appears to me that the tails of your distribution may be too short--at least relative to the normal distribution. (This is based on my interpretation that the data values are on the Y axis "Ordered Values" and the theoretical quantiles are on the X axis.) As a result of this, the evident symmetry, and the slight bowing in the middle, I wondered if it might be a uniform distribution or something similar. I discussed the interpretation of qq-plots here: qq-plot does not match histogram.
Edit 2 noted that the kurtosis was given as $-1$. I like this resource for thinking about kurtosis, which notes that (ordinary) kurtosis cannot be lower than $1$; thus SciPy has given you excess kurtosis (which is kurtosis minus 3). The Wikipedia page for kurtosis lists the excess kurtosis of the uniform distribution as $-6/5 = -1.2$, close to the $-1$ reported, which is consistent with my guess about the qq-plot.
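A quick check of those numbers (my addition, not from the edits): SciPy's stats method reports Fisher (excess) kurtosis, which is $-6/5$ for the uniform and $0$ for the normal.

```python
from scipy.stats import uniform, norm

# scipy reports Fisher (excess) kurtosis: ordinary kurtosis minus 3
k_uniform = float(uniform.stats(moments='k'))
k_normal = float(norm.stats(moments='k'))
print("uniform excess kurtosis:", k_uniform)  # -1.2, i.e. -6/5
print("normal  excess kurtosis:", k_normal)   # 0.0
```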
Edit 3 posts a qq-plot against the uniform, which fits rather well, but the tails now seem slightly too heavy. It's worth noting that the uniform distribution is actually a special case of the beta distribution with parameters $(1, 1)$. Thus, it's possible you have a beta that is very close to $(1, 1)$, but not quite $(1, 1)$ (i.e., not quite uniform); something like $(0.9, 0.9)$ might serve as an initial guess. Of course, whether that slight divergence is reliable depends on how much data you have. You can read more about the beta distribution in this excellent thread: what-is-intuition-behind-beta-distribution.
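If you want to probe that hunch, here is a minimal sketch (illustrative only: the sample is synthetic and Beta(0.9, 0.9) is just the guessed starting point, not your data). Fix loc and scale to the unit interval so only the two shape parameters are estimated, then see whether they land near $(1, 1)$.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
# synthetic stand-in for the data: a beta slightly off uniform
data = beta.rvs(0.9, 0.9, size=5000, random_state=rng)

# fix loc=0 and scale=1 so only the shape parameters a, b are fitted
a_hat, b_hat, loc_, scale_ = beta.fit(data, floc=0, fscale=1)
print("a =", a_hat, " b =", b_hat)  # both near 0.9, i.e. close to but below uniform's (1, 1)
```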