Normality Assumption – Impact of Outliers on QQ Plot

Tags: diagnostic, generalized linear model, normality-assumption, outliers, qq-plot

I'm trying to build a GLM regression (10k samples and 50 dimensions). I ran an analysis of the dependent variable, since the regression has a normality assumption for the dependent variable.

The QQ plot (mid fig) shows that the distribution of y is far from a normal distribution (does it imply a gap in y? I did not find a gap in the histogram (top fig)). After I removed the top 3% and bottom 3% of y, the QQ plot (bottom fig) becomes a straight line, implying heavy tails.

My questions are: 1. Why is the QQ plot so sensitive to extreme values? 2. Since the QQ plot is too sensitive to extreme values or outliers, does it make sense to draw a QQ plot after removing certain data?

A previous post does not help much.

Update 2023.12.20

It turns out that I confused the distribution of the response with the distribution of the residuals. The original purpose of this post was to see if a GLM is appropriate in my application. I know now that I should have used the residuals rather than the y values for the normality check.

Best Answer

First, regression does NOT assume the dependent variable is normally distributed. It makes assumptions about the errors, which we look at by examining residuals.
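To make the distinction concrete, here is a minimal sketch in plain Python (not the poster's data or model). It assumes a hypothetical single-predictor linear model with a skewed predictor and normal errors, and it uses the squared correlation of the QQ-plot points as a rough straightness measure. The names `qq_r2`, `x`, `y` and `resid` are illustrative only.

```python
import random
from statistics import NormalDist, mean

def qq_r2(values):
    """Squared correlation between sorted values and standard normal quantiles."""
    n = len(values)
    ys = sorted(values)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n: one common choice among several
    qs = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    my, mq = mean(ys), mean(qs)
    sxy = sum((y - my) * (q - mq) for y, q in zip(ys, qs))
    syy = sum((y - my) ** 2 for y in ys)
    sqq = sum((q - mq) ** 2 for q in qs)
    return sxy * sxy / (syy * sqq)

rng = random.Random(0)
n = 2000
x = [rng.expovariate(1.0) for _ in range(n)]             # strongly skewed predictor
y = [1.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]   # errors are exactly normal

# Closed-form least-squares fit of y on x
mx, my = mean(x), mean(y)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

r2_y = qq_r2(y)          # response inherits the skew of x, so its QQ plot bends
r2_resid = qq_r2(resid)  # residuals look normal, so their QQ plot is nearly straight
print(f"QQ R^2 for y: {r2_y:.3f}, for residuals: {r2_resid:.3f}")
```

In this setup the response is clearly non-normal while the model assumptions hold perfectly, which is why the check belongs on the residuals.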

Second, the QQ plot is sensitive to outliers because it is supposed to be. It is not "too sensitive" to outliers; it is appropriately sensitive to them. You have five (I think it's five) points that are very far from where a normal distribution would put them.

Third, histograms aren't great graphs. Yeah, I know, they are very, very common, but see this thread. There is a quote from Cleveland, something like "ubiquity and longevity are not signs of utility and histograms will not be seen in this book".

Finally, while I won't say it's never sensible to use a quantile normal plot after removing a bunch of points, it's not a good general policy.

Related Solutions

Solved – Normality of residuals – contradiction between ‘symplot’ and ‘qnorm’

Note that this has nothing at all to do with residuals as such. It applies generally to looking at any distribution.

The two graphs do not have exactly the same purpose. Be clear that a symmetry plot checks for symmetry or asymmetry and would look simple for many symmetric distributions that were not Gaussian, e.g. t distributions with finite degrees of freedom. But there is still a question of whether the graphs contradict each other.

I here assume familiarity with normal probability plots (historically often so named, although Gaussian quantile-quantile plots is a minority preferred name). See for example this explanation.

However, symmetry plots seem less used and bear some explanation.

Stata's symplot, as the axis titles imply, pairs values above and below the median and plots (largest $-$ median) vs (median $-$ smallest), (second largest $-$ median) vs (median $-$ second smallest), etc. and the reference line is thus (value in upper half $-$ median) $=$ (median $-$ value in lower half), implying symmetry of distribution.
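The pairing described above can be sketched in a few lines of Python. This is only an illustration of the construction, not Stata's implementation; the function name `symplot_coords` and the toy data are hypothetical, and I am assuming the convention of distance below the median on the horizontal axis and distance above on the vertical axis.

```python
from statistics import median

def symplot_coords(values):
    """Return (median - i-th smallest, i-th largest - median) pairs.

    Points on the line y = x indicate symmetry about the median.
    """
    ys = sorted(values)
    med = median(ys)
    n = len(ys)
    half = n // 2  # the middle value of an odd-sized sample pairs with itself and is skipped
    below = [med - ys[i] for i in range(half)]           # distance below the median
    above = [ys[n - 1 - i] - med for i in range(half)]   # distance above the median
    return list(zip(below, above))

symmetric = symplot_coords([1, 2, 3, 4, 5])     # [(2, 2), (1, 1)]: on the reference line
right_skew = symplot_coords([1, 2, 3, 4, 10])   # [(2, 7), (1, 1)]: first point far above it
```

Only the most extreme pair leaves the reference line in the skewed example, which is the pattern discussed next: the bad news collects at one end of the plot.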

What you can't tell easily from symplot in cases like this is how many values are in the middle, often approximately symmetric, part of the distribution and how many are in the rest.

It is easy for symplot therefore to impart a pessimistic message because points may be heavily overplotted near the middle of the distribution.

Here is another example. I simulate 95% of values from a Gaussian and 5% of values from a gamma with the same variance (but evidently different skew).

This is the Stata recipe used:

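// Simulate 10,000 values: 95% Gaussian, 5% gamma
clear
set obs 10000
set seed 2803
// rnormal(mean, sd); rgamma(shape, scale): both components have variance 100
gen y = cond(_n <= 9500, rnormal(6,10), rgamma(1,10))
// Symmetry plot, then normal quantile plot, of the same variable
symplot y
qnorm y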

Loosely, the symplot seems to flag lack of symmetry (and thus lack of normality) more prominently than the normal probability plot (Gaussian quantile-quantile plot) flags lack of Gaussianity.

[Figure: symmetry plot (symplot) and normal quantile plot (qnorm) of y]

It's manifestly the same data, but the tail is inevitably more prominent in one graph than another. In addition to the question of overplotting, in a symmetry plot all the bad news is usually lumped together at one end; in a normal probability plot there is often bad news in both tails.

Normality Test – Are Two Asymptotic Values Enough to Fail the Test of Normality?

The first plot looks to me quite similar to a mixture of a normal (most of the points) and something with larger variance and somewhat heavier tail (perhaps a 90-10 mixture but it's hard to judge) -- and possibly a slightly lower center. Of course, that doesn't mean the original process is physically a mixture of two original processes; you can get exactly the same appearance with something that has a distribution close to the distribution function of that mixture.

"Surely statisticians must be in the habit of getting rid of these outliers"

Actually, there are many dozens of questions (possibly hundreds) here where people who aren't statisticians are asking about trying to get rid of outliers. You don't quite so often see the statisticians who respond saying "get rid of outliers"; you may want to do something* but not necessarily "get rid of them".

* (such as choose a better model, or a more robust methodology; if I were using a Wilcoxon-Mann-Whitney test on a pair of samples like that I might not care at all).

"The most surprising part came when a Shapiro-Wilk test also seemed to comfortably rule out normality just because of these two values:"

That wasn't remotely surprising to me, since they will dramatically affect the correlation in the plot, which directly impacts the closely-related Shapiro-Francia statistic (if my recollection is correct, you can regard it as a function of $R^2$ in the QQ-plot -- or possibly it is $R^2$, and it's also typically very close to the Shapiro-Wilk). So in fact I'd be surprised if two such large outliers didn't cause the Shapiro-Wilk to be significant given how big the sample is.

In fact I was playing with much larger samples just recently (yesterday? I think it was) with only a single outlier where the squared-correlation in the corresponding plot was really quite close to 0.
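A rough way to see this effect is to compute the squared correlation of the QQ-plot points with and without two extreme values. The sketch below is a simulation under my own assumptions (498 standard normal draws plus two values at ±10, plotting positions $(i - 0.5)/n$), and the squared correlation is used only as a stand-in for the Shapiro-Francia statistic, not as the exact test statistic.

```python
import random
from statistics import NormalDist, mean

def qq_points(values):
    """Theoretical standard normal quantiles paired with the sorted sample."""
    n = len(values)
    std_normal = NormalDist()
    theo = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theo, sorted(values)

def squared_corr(xs, ys):
    """Squared Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

rng = random.Random(1)
clean = [rng.gauss(0.0, 1.0) for _ in range(498)]
contaminated = clean + [-10.0, 10.0]   # two points, i.e. 0.4% of 500 values

r2_clean = squared_corr(*qq_points(clean))
r2_contaminated = squared_corr(*qq_points(contaminated))
print(f"clean: {r2_clean:.3f}, with two outliers: {r2_contaminated:.3f}")
```

Two points out of 500 are enough to pull the squared correlation well below what an uncontaminated normal sample of that size would give, so a test built on that correlation has every reason to reject.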

"then the data is said to have heavy-tailed distribution"

Said by whom? Presumably the author of that sentence. (He either needs to say who is saying it, or own it as his own definition.)

It seems a somewhat odd definition, but okay, he can define it that way if he wants.

(By the look of it, that journal badly needs a decent editor)

"Are these essentially normal data no longer consistent with a normal distribution simply because of 2 outliers?"

A contaminated distribution where a large fraction of the distribution (leading to 99.6% of the values in the sample) are from a standard normal and a very small fraction of the distribution (leading to 0.4% of the sample values) are from some distribution likely to produce values 10 sd's (10 of the uncontaminated distribution's sd's) away from the mean is - clearly - not normal; we just said words to that effect.

You may well ask how much impact it has on whatever thing you're interested in. For some things the impact may be great; for other things it may be small. But normal populations have almost no chance of producing a sample like that.
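To put a number on "almost no chance", here is a back-of-the-envelope calculation using only the Python standard library. The sample size of 500 is my inference from two values making up 0.4% of the sample, and the 10 sd threshold is taken from the description above; treat both as assumptions.

```python
import math

n = 500  # assumed sample size: 2 outliers at 0.4% of the sample

# Two-sided tail probability P(|Z| >= 10) for a standard normal Z
p_single = math.erfc(10 / math.sqrt(2))

# P(at least one of n independent draws lands 10 or more sd's from the mean).
# log1p/expm1 avoid the rounding of 1 - p_single to exactly 1.0.
p_any = -math.expm1(n * math.log1p(-p_single))

print(f"P(|Z| >= 10) = {p_single:.3e}")
print(f"P(at least one such value in {n} draws) = {p_any:.3e}")
```

Both probabilities come out many orders of magnitude below anything you would entertain in practice, and seeing two such values in one sample is less likely still.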

"Can the new data be considered "heavy-tailed" because of these two dots sticking out at either end of the otherwise flat and straight QQ plot?"

A distribution that is likely to produce a sample like that would be called heavy tailed. We might also reasonably refer to the ecdf of that sample as 'heavy tailed'.

"And is this consistent with the mathematical definition of heavy tails?"

Which definition are we talking about?

Note that the ecdf doesn't have any values at all beyond the largest and smallest sample values; if you have a definition of heavy tailed that talks about limiting behavior of the tail of (say) the survivor function (on the right, and perhaps the cdf on the left), it might not be heavy tailed at all ... but the distribution from which the sample was drawn might -- or might not -- be heavy tailed by the same definition, depending on what the actual distribution was and what the definition of heavy-tailed was.
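The point about the ecdf having no mass beyond the sample extremes is easy to check directly. The following is a small sketch with made-up sample values (hypothetical, chosen to mimic two dots sticking out at either end); `make_ecdf` is an illustrative helper, not a library function.

```python
import math
from bisect import bisect_right

def make_ecdf(sample):
    """Return the empirical cdf t -> (number of values <= t) / n."""
    xs = sorted(sample)
    n = len(xs)
    def ecdf(t):
        return bisect_right(xs, t) / n
    return ecdf

# Hypothetical sample with one extreme value at each end
sample = [-10.0, -1.2, -0.3, 0.1, 0.8, 1.5, 10.0]
ecdf = make_ecdf(sample)

# Beyond the largest observation the empirical survivor function is exactly zero
surv_ecdf_beyond_max = 1.0 - ecdf(12.0)

# A standard normal, by contrast, keeps positive (if tiny) mass out there
surv_normal_beyond_max = 0.5 * math.erfc(12.0 / math.sqrt(2))
```

So any definition phrased in terms of the limiting behavior of the tail says nothing useful about the ecdf itself; it can only be applied to the distribution you think generated the sample.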
