Normality Assumption – Impact of Outliers on QQ Plot

diagnostic, generalized linear model, normality-assumption, outliers, qq-plot

I'm trying to build a GLM regression (10k samples and 50 dimensions). I first analyzed the dependent variable, since the regression has a normality assumption for the dependent variable.

The QQ plot (middle fig) shows that the distribution of y is far from a normal distribution (does it imply a gap in y? I did not find a gap in the histogram (top fig)). After I removed the top 3% and bottom 3% of y, the QQ plot (bottom fig) becomes a straight line, implying heavy tails.

My questions are: 1. Why is the QQ plot so sensitive to extreme values? 2. Since the QQ plot is so sensitive to extreme values or outliers, does it make sense to draw a QQ plot after removing some of the data?

[Figure: histogram of y (top), QQ plot of y (middle), QQ plot of y after removing the top and bottom 3% (bottom)]

A previous post does not help much.

Update 2023.12.20

It turns out that I confused the distribution of the response with the distribution of the residuals. The original purpose of this post was to see whether a GLM is appropriate in my application. I now know that I should have used the residuals rather than the y values for the normality check.

Best Answer

First, regression does NOT assume the dependent variable is normally distributed. It makes assumptions about the errors, which we look at by examining residuals.
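To make the response-versus-residuals point concrete, here is a minimal sketch on simulated data (not the asker's). Everything in it is an assumption chosen for illustration: a single skewed predictor, normal errors, an ordinary least squares fit, and a hypothetical helper `qq_correlation` that scores how straight a normal QQ plot is (closer to 1 means straighter).

```python
import random
from statistics import NormalDist, mean


def qq_correlation(values):
    """Correlation between sorted data and standard normal quantiles.

    Values close to 1 mean the normal QQ plot is close to a straight line.
    """
    n = len(values)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention (assumed here).
    theo = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    samp = sorted(values)
    mt, ms = mean(theo), mean(samp)
    cov = sum((t - mt) * (s - ms) for t, s in zip(theo, samp))
    var_t = sum((t - mt) ** 2 for t in theo)
    var_s = sum((s - ms) ** 2 for s in samp)
    return cov / (var_t * var_s) ** 0.5


rng = random.Random(0)
n = 2000

# Skewed predictor, normal errors: y itself is far from normal.
x = [rng.expovariate(1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Simple least squares fit of y on x.
mx, my = mean(x), mean(y)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
    (xi - mx) ** 2 for xi in x
)
intercept = my - slope * mx
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

r_y = qq_correlation(y)
r_resid = qq_correlation(residuals)
print(f"QQ straightness of y:         {r_y:.4f}")
print(f"QQ straightness of residuals: {r_resid:.4f}")
```

In this setup the QQ plot of y bends because x is skewed, while the residuals should score much closer to 1. A non-normal response therefore says little about whether the error assumption holds.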

Second, QQ plots are sensitive to outliers because they are supposed to be. They are not "too sensitive" to outliers; they are appropriately sensitive to them. You have five (I think it's five) points that are very far from where the normal distribution would put them.
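A rough sketch of why a handful of points can dominate the picture, again on made-up data: 9,995 standard normal draws plus 5 planted extreme values (the values 8 to 12 are arbitrary assumptions). The reference line through the quartile points is one common convention, not necessarily the one used in the plots in the question.

```python
import random
from statistics import NormalDist

rng = random.Random(1)
n = 10_000

# Hypothetical data: mostly standard normal, plus 5 planted extreme values.
outliers = [8.0, 9.0, 10.0, 11.0, 12.0]
sample = sorted(
    [rng.gauss(0.0, 1.0) for _ in range(n - len(outliers))] + outliers
)

# Theoretical normal quantiles at plotting positions (i - 0.5) / n.
std_normal = NormalDist()
theo = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Reference line through the first and third quartile points of the QQ plot.
i1, i3 = n // 4, 3 * n // 4
slope = (sample[i3] - sample[i1]) / (theo[i3] - theo[i1])
intercept = sample[i1] - slope * theo[i1]

# Vertical distance of every QQ point from the reference line.
deviation = [s - (intercept + slope * t) for t, s in zip(theo, sample)]

# Largest deviation within the central 90% of points versus the top 5 points.
central = max(abs(d) for d in deviation[n // 20 : n - n // 20])
top5 = deviation[-5:]

print(f"max deviation in central 90%: {central:.3f}")
print("deviations of the 5 largest points:", [round(d, 2) for d in top5])
```

The bulk of the points hug the line, and the five planted values sit far above it. The plot is reporting exactly what is in the data: those values are much larger than a normal sample of this size would produce.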

Third, histograms aren't great graphs. Yeah, I know, they are very, very common, but see this thread. There is a quote from Cleveland, something like "ubiquity and longevity are not signs of utility and histograms will not be seen in this book".

Finally, while I won't say it's never sensible to use a quantile normal plot after removing a bunch of points, it's not a good general policy.
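One way to see the risk, as a sketch under stated assumptions: draw a genuinely heavy-tailed sample (Student t with 3 degrees of freedom, an arbitrary choice), trim the top and bottom 3% as in the question, and compare how straight the two normal QQ plots are. The helper `qq_correlation` is hypothetical and only measures straightness (closer to 1 means straighter).

```python
import random
from statistics import NormalDist, mean


def qq_correlation(values):
    """Correlation between sorted data and standard normal quantiles."""
    n = len(values)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention (assumed here).
    theo = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    samp = sorted(values)
    mt, ms = mean(theo), mean(samp)
    cov = sum((t - mt) * (s - ms) for t, s in zip(theo, samp))
    var_t = sum((t - mt) ** 2 for t in theo)
    var_s = sum((s - ms) ** 2 for s in samp)
    return cov / (var_t * var_s) ** 0.5


rng = random.Random(2)


def draw_t3():
    """One draw from a Student t distribution with 3 degrees of freedom."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
    return z / (chi2 / 3) ** 0.5


n = 10_000
y = sorted(draw_t3() for _ in range(n))

# Drop the lowest 3% and highest 3% of observations.
k = n * 3 // 100
trimmed = y[k : n - k]

r_full = qq_correlation(y)
r_trimmed = qq_correlation(trimmed)
print(f"QQ straightness, full sample:    {r_full:.4f}")
print(f"QQ straightness, trimmed sample: {r_trimmed:.4f}")
```

The trimmed sample should look noticeably more normal than the full one, even though the underlying distribution is still heavy-tailed. Trimming hides the very feature the plot was meant to reveal.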
