Solved – Interpreting QQ plot (Normal vs Heavy-tailed)

heavy-tailed, normal distribution, qq-plot

I'm having some trouble interpreting the shape of this distribution. It is a distribution of price differences between an estimated price and the actual price. There are 219 points. I'm not sure whether I can call it Gaussian, or whether it is heavy-tailed or light-tailed. The Shapiro-Wilk test for normality gave me a significant p-value of 1.96e-5, while the K-S test (against a normal distribution with the same mean and SD as the sample) gave me a non-significant p-value of 0.14. How can I go about finding an appropriate distribution to describe these data?
[Image: normal Q-Q plot of the 219 price differences]

Best Answer

The null hypothesis for a Shapiro-Wilk test is that the population from which the sample was randomly drawn has some normal distribution (parameters unspecified). By contrast, $H_0$ for a Kolmogorov-Smirnov test is that the population is normal with specified $\mu$ and $\sigma.$ (If you estimate $\mu$ by $\bar X$ and $\sigma$ by $S,$ the P-value needs to be adjusted, because the unadjusted value is too large.)
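The effect of plugging in estimated parameters can be seen in a small simulation. The sketch below is an illustration rather than part of the analysis that follows; the seed, the number of replications `m`, and the names `p.plugin` and `p.known` are assumptions made for the example, and the sample size is the 219 mentioned in the question. It draws genuinely normal samples and runs K-S twice: once with $\bar X$ and $S$ plugged in, and once with the true $\mu = 0$ and $\sigma = 1$ specified in advance.

```r
set.seed(2024)   # arbitrary seed, for reproducibility only
m <- 2000        # number of simulated samples (assumed)
n <- 219         # sample size mentioned in the question

# K-S P-values when mu and sigma are estimated from the same sample
p.plugin <- replicate(m, {
  x <- rnorm(n)
  ks.test(x, "pnorm", mean(x), sd(x))$p.value
})

# K-S P-values when the true mu = 0 and sigma = 1 are specified in advance
p.known <- replicate(m, ks.test(rnorm(n), "pnorm", 0, 1)$p.value)

# Rejection rates at the nominal 5% level
mean(p.plugin < 0.05)   # plug-in version: should fall well below 0.05
mean(p.known  < 0.05)   # correctly specified version: should be near 0.05
```

If the plug-in P-values were valid, both rates would be close to 5%. A plug-in rate far below that indicates that a non-significant unadjusted K-S result, such as the 0.14 in the question, is weak evidence of normality. The adjusted version of the test is usually called the Lilliefors test.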

Here is an example of normal Q-Q plots and tests for samples of size $n=250$ from normal and heavy-tailed $\mathsf{T}(\nu=2)$ distributions. Because you show a Q-Q plot with Sample Quantiles on the vertical axis (the default in R), that is the type of Q-Q plot I show.

Moderate sample size. We use $n=250$ here because formal tests for various distributions may be at their best for such moderate sample sizes.

  • The S-W, and especially the K-S test, may have very poor power for small sample sizes.

  • Also, in practice with huge samples, these tests may too 'readily' reject a (nearly) normal sample as being non-normal because of some small quirk that is not of practical importance.
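The point about small samples can be checked by simulation. The following is a rough sketch under stated assumptions: the seed, the number of replications `m`, the comparison sizes $n=10$ and $n=250$, and the helper name `reject.rate` are all chosen here for illustration. It estimates how often each test rejects at the 5% level when the data really come from $\mathsf{T}(2)$.

```r
set.seed(2025)          # arbitrary seed, for reproducibility only
m <- 1000               # replications per sample size (assumed)

# Proportion of T(2) samples of size n rejected at the 5% level by each test
reject.rate <- function(n) {
  p <- replicate(m, {
    x <- rt(n, 2)
    c(SW = shapiro.test(x)$p.value,
      KS = ks.test(x, "pnorm", 0, 1)$p.value)
  })
  rowMeans(p < 0.05)    # one rejection rate per test
}

# Rows: tests (SW, KS); columns: sample sizes
power <- sapply(c(n10 = 10, n250 = 250), reject.rate)
power
```

One would expect both rows to show much lower rejection rates at $n=10$ than at $n=250,$ with K-S trailing S-W at both sizes.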

Normal data. The sample is of moderate size, so the tests work well. Neither S-W nor K-S for $\mathsf{Norm}(0,1)$ rejects.

set.seed(1234)
z = rnorm(250)  # standard normal
summary(z);  sd(z)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.233152 -0.657095 -0.043433 -0.004079  0.623527  3.043766 
## [1] 1.017413          # sample SD
shapiro.test(z)$p.value
## [1] 0.1382135         # Not Rejected (correct)
ks.test(z, pnorm, 0, 1)$p.value
## [1] 0.7156302         # Not Rejected (correct)

Heavy-tailed $\mathsf{T}(\nu=2)$ population. This distribution has such heavy tails that it has no finite variance (or standard deviation), so we do not show the sample standard deviation in the summary. Notice that the maximum and minimum are both far from the center $\mu=0.$

u = rt(250, 2)
summary(u)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -12.75136  -0.70875   0.05952   0.19331   0.93226  20.32579 

The S-W test strongly rejects normality for this sample, and the K-S test barely rejects (at the 5% level) the hypothesis that the sample is from $\mathsf{Norm}(0,1).$ The K-S test based on the CDF of $\mathsf{T}(2)$ correctly fails to reject the hypothesis that the sample came from this heavy-tailed distribution.

shapiro.test(u)$p.value
## [1] 3.118322e-19      # Strongly Rejected (correct)

ks.test(u, pnorm, 0, 1)$p.value
## [1] 0.02851291        # Barely Rejected (correct)

ks.test(u, pt, 2)$p.value
## [1] 0.1142186         # Not Rejected (correct)

Normal probability plots of the two samples. Many statisticians prefer to judge normality "by eye," using Q-Q plots, rather than by using formal tests.

One expects normal data to yield a "nearly" linear pattern of points, perhaps staying near a reference line based on the upper and lower quartiles. However, in the tails, where data are sparse, one does not expect the points to follow the reference line closely. There is no question that the sample from the heavy-tailed distribution fails to yield a "linear" plot.
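The reference line drawn by R's `qqline` (with its default settings) passes through the two points given by the first and third quartiles of the sample and of the standard normal distribution. A minimal sketch of that calculation follows, using a made-up sample `y` that is an assumption for illustration only and is not the data discussed here:

```r
set.seed(99)                        # arbitrary seed, for reproducibility only
y <- rnorm(250, mean = 5, sd = 2)   # made-up sample, for illustration only

probs <- c(0.25, 0.75)
yq <- quantile(y, probs, names = FALSE)  # sample quartiles (vertical axis)
xq <- qnorm(probs)                       # standard normal quartiles (horizontal axis)

slope     <- diff(yq) / diff(xq)         # IQR-based estimate of the scale
intercept <- yq[1] - slope * xq[1]       # line passes through (xq[1], yq[1])

qqnorm(y)
abline(intercept, slope, col = "blue")   # same line as qqline(y, col = "blue")
c(intercept = intercept, slope = slope)
```

Because the line depends only on the quartiles, extreme observations in the tails do not move it, which is why departures in the tails show up so clearly against it.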

[Image: normal Q-Q plots of the two samples, in side-by-side panels titled "Normal" and "T(2)"]

R code for plots:

par(mfrow=c(1,2))
 qqnorm(z, main="Normal")
  qqline(z, col="blue")
 qqnorm(u, main="T(2)")
  qqline(u, col="blue")
par(mfrow=c(1,1))

Finally, we show normal probability plots for two additional samples of size 250 from these same distributions.

set.seed(1122)
z = rnorm(250);  u = rt(250, 2)

[Image: normal Q-Q plots of the two additional samples]