Solved – QQ plot does not match histogram

data visualizationhistogramnormality-assumptionqq-plot

I have a histogram, kernel density and a fitted normal distribution of financial log returns, which are transformed into losses (signs are changed), and a normal QQ plot of these data:

http://tinypic.com/r/34ocwvr/6

The QQ plot shows clearly that the tails are not fitted correctly. But if I have a look at the histogram and the fitted normal distribution (blue), even the values around 0.0 are not fitted correctly. So the QQ plot shows that only the tails are not fitted appropriately, but clearly the whole distribution is not fitted correctly. Why does this not show up in the QQ plot?

Best Answer

+1 to @NickSabbe, for 'the plot just tells you that "something is wrong"', which is often the best way to use a qq-plot (as it can be difficult to understand how to interpret them). It is possible to learn how to interpret a qq-plot by thinking about how to make one, however.

You would start by sorting your data, then you would count your way up from the minimum value taking each as an equal percentage. For example, if you had 20 data points, when you counted the first one (the minimum), you would say to yourself, 'I counted 5% of my data'. You would follow this procedure until you got to the end, at which point you would have passed through 100% of your data. These percentage values can then be compared to the same percentage values from the corresponding theoretical normal (i.e., the normal with the same mean and SD).

When you go to plot these, you will discover that you have trouble with the last value, which is 100%, because when you've passed through 100% of a theoretical normal you are 'at' infinity. This problem is dealt with by adding a small constant to the denominator at each point in your data before calculating the percentages. A typical value would be to add 1 to the denominator; for example, you would call your 1st (of 20) data point 1/(20+1)=5%, and your last would be 20/(20+1)=95%. Now if you plot these points against a corresponding theoretical normal, you will have a pp-plot (for plotting probabilities against probabilities). Such a plot would most likely show the deviations between your distribution and a normal in the center of the distribution. This is because 68% of a normal distribution lies within +/- 1 SD, so pp-plots have excellent resolution there, and poor resolution elsewhere. (For more on this point, it may help to read my answer here: PP-plots vs. QQ-plots.)

Often, we are most concerned about what is happening in the tails of our distribution. To get better resolution there (and thus worse resolution in the middle), we can construct a qq-plot instead. We do this by taking our sets of probabilities and passing them through the inverse of the normal distribution's CDF (this is like reading the z-table in the back of a stats book backwards--you read in a probability and read out a z-score). The result of this operation is two sets of quantiles, which can be plotted against each other similarly.

@whuber is right that the reference line is plotted afterwards (typically) by finding the best fitting line through the middle 50% of the points (i.e., from the first quartile to the third). This is done to make the plot easier to read. Using this line, you can interpret the plot as showing you whether the quantiles of your distribution progressively diverge from a true normal as you move into the tails. (Note that the position of points further out from the center are not really independent of those closer in; so the fact that, in your specific histogram, the tails seem to come together after having the 'shoulders' differ does not mean that the quantiles are now the same again.)

You can interpret a qq-plot analytically by considering the values read from the axes compare for a given plotted point. If the data were well described by a normal distribution, the values should be about the same. For example, take the extreme point at the very far left bottom corner: its $x$ value is somewhere past $-3$, but its $y$ value is only a little past $-.2$, so it is much further out than it 'should' be. In general, a simple rubric to interpret a qq-plot is that if a given tail twists off counterclockwise from the reference line, there is more data in that tail of your distribution than in a theoretical normal, and if a tail twists off clockwise there is less data in that tail of your distribution than in a theoretical normal. In other words:

  • if both tails twist counterclockwise you have heavy tails (leptokurtosis),
  • if both tails twist clockwise, you have light tails (platykurtosis),
  • if your right tail twists counterclockwise and your left tail twists clockwise, you have right skew
  • if your left tail twists counterclockwise and your right tail twists clockwise, you have left skew