Normality Test – Are Two Asymptotic Values Enough to Fail the Test of Normality?

heavy-tailed, normality-assumption, qq-plot

Looking at this post I started to wonder about the gestalt interpretation of the QQ plots generated by qqnorm in R. Here's the plot to avoid having to go to the linked post:

First off, the $y$ axis reads "quantiles", but these are not quantiles; they are equally spaced values with very few data points towards the ends. The plot also struck me as too horizontal to be read, in a gestalt fashion, as a "heavy-tailed" distribution.

So I ran a quick "convincing-myself" test in R: to a sample of $\small 500$ draws from a standard normal distribution, I added just a couple of extreme values, $\small -10$ and $\small 10$. Just $\small 2$ data points: x <- c(rnorm(500), -10, 10). My thinking was that these two values couldn't possibly change our opinion – clearly the bulk of the data were normal by design. This was the obvious plot before introducing the -10, 10:

Two loose, possibly mistaken, values wouldn't make a dent. However, the plot was again "heavy-tailed" (not so unexpected now, given the observation on the $y$ axis construction):

… a ridiculously caricatured version of the first plot. My thought was that this could represent a well-known anomaly in some cases, warranting correction. Surely statisticians must be in the habit of getting rid of these outliers… The most surprising part came when a Shapiro-Wilk test also seemed to comfortably rule out normality just because of these two values:

    Shapiro-Wilk normality test

data:  x
W = 0.8779, p-value < 2.2e-16
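For anyone wanting to reproduce the experiment, here is a minimal sketch. The seed is my own assumption (the original post did not set one), so the exact W will differ somewhat from the output above, but the qualitative result should be the same.

```r
set.seed(42)                      # assumed seed, for reproducibility only
z <- rnorm(500)                   # the "clean" standard normal sample
x <- c(z, -10, 10)                # same sample plus the two extreme values

# Side-by-side normal Q-Q plots, before and after adding the two points
par(mfrow = c(1, 2))
qqnorm(z, main = "rnorm(500)")
qqline(z, col = "blue")
qqnorm(x, main = "rnorm(500) plus -10, 10")
qqline(x, col = "blue")
par(mfrow = c(1, 1))

# Shapiro-Wilk on the contaminated sample
sw_contaminated <- shapiro.test(x)
print(sw_contaminated)
```

Running shapiro.test(z) on the clean sample as well gives a direct before/after comparison.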

I guess this is consistent with the interpretation of the boxplot:

"If the box plot for the given data has outliers on both sides and has tails longer than the length of the box, then the data is said to have heavy-tailed distribution." An Insight Into Heavy-Tailed Distribution. Annapurna Ravi, Journal of Mathematical Sciences & Mathematics Education, Vol.5, No.1.

The question(s):

Are these essentially normal data no longer consistent with a normal distribution simply because of $2$ outliers? Can the new data be considered "heavy-tailed" because of these two dots sticking out at either end of the otherwise flat and straight QQ plot? And is this consistent with the mathematical definition of heavy tails?

Best Answer

The first plot looks to me quite similar to a mixture of a normal (most of the points) and something with larger variance and somewhat heavier tail (perhaps a 90-10 mixture but it's hard to judge) -- and possibly a slightly lower center. Of course, that doesn't mean the original process is physically a mixture of two original processes; you can get exactly the same appearance with something that has a distribution close to the distribution function of that mixture.

"Surely statisticians must be in the habit of getting rid of these outliers"

Actually, there are many dozens of questions (possibly hundreds) here where people who aren't statisticians are asking about trying to get rid of outliers. You don't quite so often see the statisticians who respond saying "get rid of outliers"; you may want to do something* but not necessarily "get rid of them".

* (such as choosing a better model or a more robust methodology; if I were using a Wilcoxon-Mann-Whitney test on a pair of samples like that I might not care at all)

"The most surprising part came when a Shapiro-Wilk test also seemed to comfortably rule out normality just because of these two values:"

That wasn't remotely surprising to me, since they will dramatically affect the correlation in the plot, which directly impacts the closely-related Shapiro-Francia statistic (if my recollection is correct, you can regard it as a function of $R^2$ in the QQ-plot -- or possibly it is $R^2$, and it's also typically very close to the Shapiro-Wilk). So in fact I'd be surprised if two such large outliers didn't cause the Shapiro-Wilk to be significant given how big the sample is.
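A rough way to check the connection between the Q-Q plot correlation and the test statistic is sketched below. It assumes the same contaminated sample as in the question (with an arbitrary seed), and it uses the plotting positions that qqnorm happens to use, so the squared correlation is only an approximation to the Shapiro-Francia statistic rather than an exact computation of it.

```r
set.seed(42)                          # assumed seed
x <- c(rnorm(500), -10, 10)           # contaminated sample from the question

# qq$x holds the theoretical normal quantiles, qq$y the data values
qq <- qqnorm(x, plot.it = FALSE)
r2 <- cor(qq$x, qq$y)^2               # squared correlation of the Q-Q points

# Shapiro-Wilk W for the same sample
w <- unname(shapiro.test(x)$statistic)

c(R2 = r2, W = w)
```

The two numbers should come out close to each other, and both well below the values near 1 that a clean normal sample of this size would give.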

In fact I was playing with much larger samples just recently (yesterday? I think it was) with only a single outlier where the squared-correlation in the corresponding plot was really quite close to 0.

"then the data is said to have heavy-tailed distribution"

Said by whom? Presumably the author of that sentence. (He either needs to say who is saying it, or own it as his own definition.)

It seems a somewhat odd definition, but okay, he can define it that way if he wants.

(By the look of it, that journal badly needs a decent editor)

"Are these essentially normal data no longer consistent with a normal distribution simply because of 2 outliers?"

A contaminated distribution in which a large fraction of the distribution (yielding 99.6% of the sample values) is standard normal, and a very small fraction (yielding 0.4% of the sample values) comes from some distribution likely to produce values 10 SDs away from the mean (10 of the uncontaminated distribution's SDs), is clearly not normal; we just said as much in describing it.

You may well ask how much impact this has on whatever you're interested in. For some things the impact may be great; for others it may be small. But normal populations have almost no chance of producing a sample like that.
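To put a rough number on "almost no chance", here is a back-of-the-envelope sketch. It assumes 502 independent standard normal draws and asks how likely it is that at least one of them lands more than 10 SDs from the mean; the variable names are just for illustration.

```r
n <- 502                                # sample size in the question

# P(|Z| > 10) for a single standard normal draw
p_one <- 2 * pnorm(-10)

# P(at least one of n draws beyond 10 SDs); expm1/log1p keep this
# numerically stable, since 1 - (1 - p_one)^n would round to exactly 0
p_any <- -expm1(n * log1p(-p_one))

c(p_one = p_one, p_any = p_any)
```

Getting two such values, one in each tail, is less likely still.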

"Can the new data be considered 'heavy-tailed' because of these two dots sticking out at either end of the otherwise flat and straight QQ plot?"

A distribution that is likely to produce a sample like that would be called heavy tailed. We might also reasonably refer to the ecdf of that sample as 'heavy tailed'.

"And is this consistent with the mathematical definition of heavy tails?"

Which definition are we talking about?

Note that the ecdf doesn't have any values at all beyond the largest and smallest sample values; if you have a definition of heavy tailed that talks about limiting behavior of the tail of (say) the survivor function (on the right, and perhaps the cdf on the left), it might not be heavy tailed at all ... but the distribution from which the sample was drawn might -- or might not -- be heavy tailed by the same definition, depending on what the actual distribution was and what the definition of heavy-tailed was.
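The point about the ecdf can be seen directly. This is a small sketch with an arbitrary seed, where surv is a helper name I've made up for the empirical survivor function:

```r
set.seed(42)                       # assumed seed
x <- c(rnorm(500), -10, 10)

Fn <- ecdf(x)                      # empirical cdf of the sample
surv <- function(t) 1 - Fn(t)      # empirical survivor function (helper)

surv(9.99)                         # still positive: one point lies above
surv(max(x))                       # exactly 0 from the largest value onward
Fn(min(x) - 1)                     # exactly 0 below the smallest value
```

Because the empirical survivor function is identically zero past the sample maximum, any definition based on limiting tail behavior says nothing useful about the ecdf itself.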

Related Solutions

Solved – Interpreting QQ plot (Normal vs Heavy-tailed)

The null hypothesis for a Shapiro-Wilk test is that the population from which a sample was randomly sampled has some normal distribution (parameters unspecified). By contrast, $H_0$ for our Kolmogorov-Smirnov test is that the population is normal with specified $\mu$ and $\sigma.$ (If you estimate $\mu$ by $\bar X$ and $\sigma$ by $S,$ the P-value needs to be adjusted.)
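The need for an adjustment can be illustrated by simulation. The sketch below (seed, number of replications, and sample size are all my own arbitrary choices) draws truly normal samples, plugs $\bar X$ and $S$ into the K-S test, and records how often the unadjusted P-value falls below 0.05.

```r
set.seed(42)                              # assumed seed
n_rep <- 1000                             # assumed number of simulated samples
n <- 50                                   # assumed sample size

# Unadjusted K-S p-values when mean and SD are estimated from the same data
p_naive <- replicate(n_rep, {
  x <- rnorm(n)
  ks.test(x, "pnorm", mean(x), sd(x))$p.value
})

# A valid 5% test would reject about 5% of the time under H0
reject_rate <- mean(p_naive < 0.05)
reject_rate
```

The rejection rate should come out far below 5%, which is why the unadjusted P-value cannot be taken at face value (the Lilliefors test is the usual correction).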

Here is an example of normal Q-Q plots and tests for samples of size $n=250$ from normal and heavy-tailed $\mathsf{T}(\nu=2)$ distributions. Because you show a Q-Q plot with Sample Quantiles on the vertical axis (the default in R), that is the type of Q-Q plot I show.

Moderate sample size. We use $n=250$ here because formal tests for various distributions may be at their best for such moderate sample sizes.

The S-W, and especially the K-S test, may have very poor power for small sample sizes.
Also, in practice with huge samples, these tests may too 'readily' reject a (nearly) normal sample as being non-normal because of some small quirk that is not of practical importance.
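A quick simulation gives a feel for how power depends on sample size. This is only a sketch under assumptions of my own: the alternative is $\mathsf{T}(2)$, the level is 5%, and the seed, sample sizes, and number of replications are arbitrary. The helper names power_sw and power_ks are made up for the illustration.

```r
set.seed(42)                                  # assumed seed

# Estimated power of Shapiro-Wilk against T(2) data
power_sw <- function(n, n_rep = 500) {
  mean(replicate(n_rep, shapiro.test(rt(n, 2))$p.value < 0.05))
}

# Estimated power of K-S (null: Norm(0,1)) against T(2) data
power_ks <- function(n, n_rep = 500) {
  mean(replicate(n_rep, ks.test(rt(n, 2), "pnorm", 0, 1)$p.value < 0.05))
}

sizes <- c(10, 50, 250)
power_tab <- data.frame(n  = sizes,
                        SW = sapply(sizes, power_sw),
                        KS = sapply(sizes, power_ks))
power_tab
```

Each entry is the proportion of simulated samples for which the test rejects, so larger values mean the test detects the heavy tails more reliably at that sample size.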

Normal data. The sample is of moderate size, so the tests work well. Neither S-W nor K-S for $\mathsf{Norm}(0,1)$ rejects.

set.seed(1234)
z = rnorm(250)  # standard normal
summary(z);  sd(z)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-3.233152 -0.657095 -0.043433 -0.004079  0.623527  3.043766 
[1] 1.017413  # sample SD
shapiro.test(z)$p.val
[1] 0.1382135         # Not Rejected (correct)
ks.test(z, pnorm, 0,1)$p.val
[1] 0.7156302         # Not Rejected (correct)

Heavy-tailed $\mathsf{T}(\nu=2)$ population. This distribution has such heavy tails that it has no finite variance (or standard deviation), so we do not show its sample standard deviation in the summary. Notice that the max and min are both far from $\mu=0.$

u = rt(250, 2)
summary(u)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-12.75136  -0.70875   0.05952   0.19331   0.93226  20.32579

The S-W test strongly rejects normality for this sample; the K-S test barely rejects (at the 5% level) the hypothesis that the sample came from $\mathsf{Norm}(0,1).$ The K-S test based on the CDF of $\mathsf{T}(2)$ correctly fails to reject the hypothesis that the sample came from this heavy-tailed distribution.

shapiro.test(u)$p.val
[1] 3.118322e-19       # Strongly Rejected (correct)

ks.test(u, pnorm, 0,1)$p.val
[1] 0.02851291         # Barely Rejected (correct)

ks.test(u, pt, 2)$p.val
[1] 0.1142186          # Not Rejected (correct)

Normal probability plots of the two samples. Many statisticians prefer to judge normality "by eye," using Q-Q plots, rather than by using formal tests.

One expects normal data to yield a "nearly" linear pattern of points, perhaps staying near a reference line based on the upper and lower quartiles. However, in the tails, where data are sparse, one does not expect the data points to follow the reference line closely. There is no question that the sample from the heavy-tailed distribution fails to yield a "linear" plot.

R code for plots:

par(mfrow=c(1,2))
 qqnorm(z, main="Normal")
  qqline(z, col="blue")
 qqnorm(u, main="T(2)")
  qqline(u, col="blue")
par(mfrow=c(1,1))
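For reference, the quartile-based line that qqline draws by default can be computed by hand. The sketch below assumes the default settings of qqline (quartile probabilities 0.25 and 0.75, normal reference distribution) and regenerates the normal sample with the same seed used earlier.

```r
set.seed(1234)
z <- rnorm(250)                             # same normal sample as above

probs <- c(0.25, 0.75)
y_q <- quantile(z, probs, names = FALSE)    # sample quartiles
x_q <- qnorm(probs)                         # standard normal quartiles

# Line through the two (theoretical, sample) quartile points
slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

c(intercept = intercept, slope = slope)
```

Calling qqnorm(z) followed by abline(intercept, slope, col = "red") should overlay the line that qqline(z) draws. For normal data the slope estimates $\sigma$ and the intercept estimates $\mu$, using only the middle half of the sample, which is why extreme points in the tails do not pull the reference line toward them.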

Finally, we show normal probability plots for two additional samples of size 250 from these same distributions.

set.seed(1122)
z = rnorm(250);  u = rt(250, 2)
