QQ-Plots vs. Histograms – Benefits and Applications Explained

binning, histogram, qq-plot, references

In this comment, Nick Cox wrote:

Binning into classes is an ancient method. While histograms can be useful, modern statistical software makes it easy as well as advisable to fit distributions to the raw data. Binning just throws away detail that is crucial in determining which distributions are plausible.

The context of this comment suggests using QQ-plots as an alternative means to evaluate the fit. The statement sounds very plausible, but I'd like to know of a reliable reference supporting it. Is there a paper that investigates this more thoroughly, beyond a simple “well, this sounds obvious”? Any actual systematic comparisons of results or the like?

I'd also like to see how far this benefit of QQ-plots over histograms can be stretched, to applications other than model fitting. Answers on this question agree that “a QQ-plot […] just tells you that "something is wrong"”. I am thinking about using them as a tool to identify structure in observed data as compared to a null model and wonder whether there exist any established procedures to use QQ-plots (or their underlying data) to not only detect but also describe non-random structure in the observed data. References which include this direction would therefore be particularly useful.

Best Answer

The canonical paper here was:

Wilk, M.B. and R. Gnanadesikan. 1968. Probability plotting methods for the analysis of data. Biometrika 55: 1-17

and it still repays close and repeated reading. A lucid treatment with many good examples was given by:

Cleveland, W.S. 1993. Visualizing Data. Summit, NJ: Hobart Press.

and it is worth mentioning the more introductory:

Cleveland, W.S. 1994. The Elements of Graphing Data. Summit, NJ: Hobart Press.

Other texts containing reasonable exposure to this approach include:

Davison, A.C. 2003. Statistical Models. Cambridge: Cambridge University Press.
Rice, J.A. 2007. Mathematical Statistics and Data Analysis. Belmont, CA: Duxbury.

That aside, I don't know of anything that is quite what you ask. Once you have seen the point of quantile-quantile plots, showing in detail that histograms are a second-rate alternative seems neither interesting nor useful, too much like shooting fish in a barrel.

But I would summarize like this:

1. Binning suppresses details, and the details are often important. This can apply not only to exactly what is going on in the tails but also to what is going on in the middle. For example, granularity or multimodality may be important as well as skewness or tail weight.
2. Binning requires decisions about bin origin and bin width, which can affect the appearance of histograms mightily, so it is hard to see what is real and what is a side-effect of choices. If your software makes these decisions for you, the problems remain. (For example, default bin choices are often designed so that you do not use "too many bins", i.e. with the motive of smoothing a little.)
3. The graphical and psychological problem of comparing two histograms is trickier than that of judging the fit of a set of points to a straight line.

[Added 27 Sept 2017] 4. Quantile plots can be varied very easily when considering one or more transformed scales. By transformation here I mean a nonlinear transformation, not e.g. scaling by a maximum or standardisation by (value $-$ mean) / SD. If the quantiles are just the order statistics, then all you need to do is to apply the transformation, as e.g. the logarithm of the maximum is identically the maximum of the logarithms, and so forth. (Trivially, reciprocation reverses order.) Even if you plot selected quantiles that are based on two order statistics, usually they are just interpolated between two original data values and the effect of the interpolation is usually minor. In contrast, histograms on log or other transformed scales require a fresh decision on bin origin and width that isn't especially difficult, but it can be awkward. Much the same can be said of density estimation as a way to summarize the distribution. Naturally, whatever transformation you apply must make sense for the data, so that logarithms can only usefully be applied for a positive variable.
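
The points about bin origin and about transformed scales can be checked directly in R. The following is only a minimal sketch on simulated data: the lognormal sample, the seed, and the bin width of 0.5 are assumptions chosen for illustration and do not come from the answer.

```r
set.seed(1)
x <- rlnorm(200)   # simulated positive, right-skewed sample (an assumption)

# Same data, same bin width, different bin origin: the counts change
top <- ceiling(max(x)) + 1
h1 <- hist(x, breaks = seq(0,     top, by = 0.5), plot = FALSE)
h2 <- hist(x, breaks = seq(-0.25, top, by = 0.5), plot = FALSE)
head(h1$counts)
head(h2$counts)

# Order statistics commute with a monotone increasing transformation,
# so a quantile plot on the log scale needs no new decisions
same_quantiles <- isTRUE(all.equal(sort(log(x)), log(sort(x))))

# Normal quantile plots on the raw and on the log scale
p  <- ppoints(length(x))
op <- par(mfrow = c(1, 2))
plot(qnorm(p), sort(x),      xlab = "Normal quantiles", ylab = "x")
plot(qnorm(p), log(sort(x)), xlab = "Normal quantiles", ylab = "log(x)")
par(op)
```

The second plot reuses the sorted values unchanged apart from taking logs, whereas a histogram of log(x) would need a fresh choice of origin and width.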

Related Solutions

Goodness of Fit – How to Evaluate for 2D Histograms

OK, I've extensively revised this answer. Rather than binning your data and comparing the counts in each bin, I now think a suggestion I'd buried in my original answer is a much better idea: fit a 2d kernel density estimate to each data set and compare the estimates. Even better, there is a function kde.test() in Tarn Duong's ks package for R that does this easy as pie.

Check the documentation for kde.test for more details and the arguments you can tweak. But basically it does pretty much exactly what you want. The p-value it returns is the probability, under the null hypothesis that the two sets of data were generated from the same distribution, of seeing a discrepancy between their density estimates at least as large as the one observed. So a small p-value is evidence that A and B differ, while a large p-value means the data are consistent with a common distribution. See my example below, where this easily picks up that B1 and A are different, but that B2 and A are plausibly the same (which is how they were generated).

library(ggplot2)

# generate some data that at least looks a bit similar
generate <- function(n, displ = 1, perturb = 1) {
    BV <- rnorm(n, 1 * displ, 0.4 * perturb)
    UB <- -2 * displ + BV + exp(rnorm(n, 0, 0.3 * perturb))
    data.frame(BV, UB)
}
set.seed(100)
A  <- generate(300)
B1 <- generate(500, 0.9, 1.2)   # slightly different process from A
B2 <- generate(100, 1, 1)       # same process as A
AandB <- rbind(A, B1, B2)
AandB$type <- rep(c("A", "B1", "B2"), c(300, 500, 100))

# plot one panel per data set, with a smoother and a reversed y axis
p <- ggplot(AandB, aes(x = BV, y = UB)) + facet_grid(~type) +
    geom_smooth() + scale_y_reverse() + theme_grey(9)
dev.new(width = 7, height = 3)  # open a 7 x 3 inch graphics device
p + geom_point(size = .7)


> library(ks)
> kde.test(x1=as.matrix(A), x2=as.matrix(B1))$pvalue
[1] 2.213532e-05
> kde.test(x1=as.matrix(A), x2=as.matrix(B2))$pvalue
[1] 0.5769637

MY ORIGINAL ANSWER BELOW, IS KEPT ONLY BECAUSE THERE ARE NOW LINKS TO IT FROM ELSEWHERE WHICH WON'T MAKE SENSE

First, there may be other ways of going about this.

Justel et al. have put forward a multivariate extension of the Kolmogorov-Smirnov test of goodness of fit which I think could be used in your case, to test how well each set of modelled data fits the original. I couldn't find an implementation of this (e.g. in R) but maybe I didn't look hard enough.

Alternatively, there may be a way to do this by fitting a copula to both the original data and to each set of modelled data, and then comparing those models. There are implementations of this approach in R and other places but I'm not especially familiar with them so have not tried.

But to address your question directly, the approach you have taken is a reasonable one. Several points suggest themselves:

1. Unless your data set is bigger than it looks, I think a 100 x 100 grid is too many bins. Intuitively, I can imagine you concluding that the various sets of data are more dissimilar than they are, just because the fineness of your bins leaves you with lots of bins containing few points, even where the data density is high. However, this is in the end a matter of judgement. I would certainly check your results with different approaches to binning.
2. Once you have done your binning and converted your data into (in effect) a contingency table of counts with two columns and as many rows as there are bins (10,000 in your case), you have a standard problem of comparing two columns of counts. Either a chi-square test or fitting some kind of Poisson model would work, but as you say there is awkwardness because of the large number of zero counts. Either of those models is normally fit by minimising the sum of squares of the differences, weighted by the inverse of the expected counts; when an expected count approaches zero, this can cause problems.
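
To make the two points above concrete, here is a rough sketch of how sparse a 100 x 100 grid becomes for data sets of a few hundred points. The two bivariate normal samples and the helper `bin_counts()` are hypothetical, introduced only for illustration; they are not the asker's data or code.

```r
set.seed(42)
# Two simulated 2-D samples (assumptions for illustration)
A <- cbind(rnorm(300), rnorm(300))
B <- cbind(rnorm(500), rnorm(500))

# Cross-tabulate both samples on a common k x k grid:
# one row per bin, one column per sample
bin_counts <- function(A, B, k) {
  all_pts <- rbind(A, B)
  xb <- seq(min(all_pts[, 1]), max(all_pts[, 1]), length.out = k + 1)
  yb <- seq(min(all_pts[, 2]), max(all_pts[, 2]), length.out = k + 1)
  cell <- function(m) {
    ix <- cut(m[, 1], xb, include.lowest = TRUE, labels = FALSE)
    iy <- cut(m[, 2], yb, include.lowest = TRUE, labels = FALSE)
    factor((ix - 1) * k + iy, levels = seq_len(k * k))
  }
  cbind(A = as.vector(table(cell(A))), B = as.vector(table(cell(B))))
}

fine   <- bin_counts(A, B, 100)   # 10,000 bins
coarse <- bin_counts(A, B, 10)    # 100 bins

# Share of bins holding fewer than 5 points in total
sparse_fine   <- mean(rowSums(fine) < 5)
sparse_coarse <- mean(rowSums(coarse) < 5)
c(fine = sparse_fine, coarse = sparse_coarse)
```

Rows with very small expected counts are exactly where the chi-square weighting becomes unstable, which is why trying a coarser grid is a sensible check.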

Edit - the rest of this answer I now no longer believe to be an appropriate approach.

I think Fisher's exact test may not be useful or appropriate in this situation, where the marginal totals of rows in the cross-tab are not fixed. It will give a plausible answer but I find it difficult to reconcile its use with its original derivation from experimental design. I'm leaving the original answer here so the comments and follow up question make sense. Plus, there might still be a way of answering the OP's desired approach of binning the data and comparing the bins by some test based on the mean absolute or squared differences. Such an approach would still use the $n_g\times 2$ cross-tab referred to below and test for independence ie looking for a result where column A had the same proportions as column B.

I suspect that a solution to the above problem would be to use Fisher's exact test, applying it to your $n_g \times 2$ cross tab, where $n_g$ is the total number of bins. Although a complete calculation is not likely to be practical because of the number of rows in your table, you can get good estimates of the p-value using Monte Carlo simulation (the R implementation of Fisher's test gives this as an option for tables that are bigger than 2 x 2 and I suspect so do other packages). These p-values measure how surprising the observed differences in bin counts would be if the second set of data (from one of your models) had the same distribution across your bins as the original. Hence the higher the p-value, the more consistent the two sets of data are.
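
For readers who arrive at this older approach via a link, the Monte Carlo version of the test might look like the sketch below. The simulated samples, the 8 x 8 grid, and the number of replicates are assumptions for illustration; `fisher.test()` with `simulate.p.value = TRUE` is the option in base R referred to above.

```r
set.seed(7)
# Simulated stand-ins for the original data (A) and one modelled set (B)
A <- data.frame(x = rnorm(300),      y = rnorm(300))
B <- data.frame(x = rnorm(500, 0.3), y = rnorm(500))   # shifted in x

pts <- rbind(A, B)
src <- rep(c("A", "B"), c(nrow(A), nrow(B)))

# Assign every point to a cell of a common 8 x 8 grid; drop empty cells
bin <- interaction(cut(pts$x, 8), cut(pts$y, 8), drop = TRUE)
tab <- table(bin, src)          # n_g x 2 table of counts

# Exact enumeration is impractical for a table this size,
# so estimate the p-value from 2000 simulated tables
res <- fisher.test(tab, simulate.p.value = TRUE, B = 2000)
res$p.value
```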

I simulated some data to look a bit like yours and found that this approach was quite effective at identifying which of my "B" sets of data were generated from the same process as "A" and which were slightly different. Certainly more effective than the naked eye.

With this approach testing the independence of variables in a $n_g \times 2$ contingency table, it doesn't matter that the number of points in A is different from the number in B (although note that it is a problem if you use just the sum of absolute differences or squared differences, as you originally propose). However, it does matter that each of your versions of B has a different number of points. Basically, larger B data sets will tend to return lower p-values. I can think of several possible solutions to this problem:

1. You could reduce all of your B sets of data to the same size (the size of the smallest of your B sets), by taking a random sample of that size from all the B sets that are bigger than that size.
2. You could first fit a two dimensional kernel density estimate to each of your B sets, and then simulate data sets of equal size from those estimates.
3. You could use some kind of simulation to work out the relationship of p-values to size and use that to "correct" the p-values you get from the above procedure so they are comparable.

There are probably other alternatives too. Which one you do will depend on how the B data were generated, how different the sizes are, etc.
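
The first of these options, subsampling every B set down to the size of the smallest one, is simple to sketch. The list `B_sets` and its contents are hypothetical placeholders, not the asker's data.

```r
set.seed(3)
# Hypothetical B sets of different sizes
B_sets <- list(
  B1 = data.frame(BV = rnorm(500), UB = rnorm(500)),
  B2 = data.frame(BV = rnorm(100), UB = rnorm(100))
)

# Size of the smallest set
n_min <- min(vapply(B_sets, nrow, integer(1)))

# Draw n_min rows at random, without replacement, from every set
B_equal <- lapply(B_sets, function(d) d[sample(nrow(d), n_min), , drop = FALSE])
sizes <- vapply(B_equal, nrow, integer(1))
sizes
```

Because the subsample is random, the resulting p-values will vary from draw to draw, so repeating the procedure a few times would give a sense of that variability.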

Hope that helps.

Solved – Linear mixed effects models: what to do when the residual QQ-plot looks non-normal

It would seem that the response follows a Poisson distribution. I would say the model should use a log link, though I am unsure whether the lmer function allows for non-normal residuals. However, I believe the correct residuals to extract for count data are the 'Pearson' residuals. One other thing to consider is that your abline() call draws a line through zero with a slope of one, so the reference line will follow the theoretical quantiles only if the residuals happen to have mean zero and unit standard deviation. Use the function qqline() to rectify that.
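
The difference between the two reference lines can be seen in a small sketch. The vector `res` below is a simulated stand-in for the model residuals (an assumption, since the asker's model is not shown); the last lines reproduce by hand what `qqline()` does with its default settings, namely a line through the first and third quartiles.

```r
set.seed(11)
res <- rnorm(200, mean = 0, sd = 3)   # stand-in for model residuals (an assumption)

qqnorm(res)
abline(0, 1, lty = 2)      # y = x: only appropriate for mean 0 and SD 1
qqline(res, col = "red")   # line through the first and third quartiles

# Slope and intercept of the quartile-based line, computed by hand
qy <- quantile(res, c(0.25, 0.75))
qx <- qnorm(c(0.25, 0.75))
slope     <- unname(diff(qy) / diff(qx))
intercept <- unname(qy[1] - slope * qx[1])
c(slope = slope, intercept = intercept)
```

With residuals on a scale other than the standard normal, the dashed y = x line misses the points even when they are perfectly normal, while the quartile-based line tracks them.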
