Solved – Best way to put two histograms on same scale

Tags: binning, data visualization, density function, histogram

Let's say I have two distributions I want to compare in detail, i.e. in a way that makes shape, scale and shift easily visible. One good way to do this is to plot a histogram for each distribution, put them on the same X scale, and stack one underneath the other.

When doing this, how should binning be done? Should both histograms use the same bin boundaries even if one distribution is much more dispersed than the other, as in Image 1 below? Should binning be done independently for each histogram before zooming, as in Image 2 below? Is there even a good rule of thumb on this?

Best Answer

I think you need to use the same bins; otherwise the mind plays tricks on you. Normal(0,2) looks more dispersed relative to Normal(0,1) in Image 2 than it does in Image 1. That has nothing to do with statistics: in Image 2 it just looks like Normal(0,1) went on a "diet".

-Ralph Winters

Midpoints and histogram end points can also alter the perceived dispersion. Notice that in this applet the maximum bin setting implies a range of roughly >1.5 to ~5, while the minimum bin setting implies a range of <1 to >5.5.

http://www.stat.sc.edu/~west/javahtml/Histogram.html
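The "same bins" advice can be sketched in R as follows. This is a minimal illustration rather than the original poster's code: the simulated samples, the seed, the bin count `n_bins`, and the reading of Normal(0,2) as standard deviation 2 are all assumptions made for the example.

```r
# Simulated stand-ins for the two distributions (sizes and seed are arbitrary).
set.seed(1)
x1 <- rnorm(1000, mean = 0, sd = 1)
x2 <- rnorm(1000, mean = 0, sd = 2)  # assumes Normal(0,2) means sd = 2

# One set of equal-width breaks spanning both samples.
n_bins <- 30  # hypothetical choice, not a recommendation
rng <- range(c(x1, x2))
shared_breaks <- seq(rng[1], rng[2], length.out = n_bins + 1)

# Bin both samples with the same breaks, without plotting yet.
h1 <- hist(x1, breaks = shared_breaks, plot = FALSE)
h2 <- hist(x2, breaks = shared_breaks, plot = FALSE)

# Stack the two panels on a common x axis and a common density axis.
ymax <- max(h1$density, h2$density)
op <- par(mfrow = c(2, 1))
plot(h1, freq = FALSE, xlim = rng, ylim = c(0, ymax), main = "Normal(0,1)", xlab = "")
plot(h2, freq = FALSE, xlim = rng, ylim = c(0, ymax), main = "Normal(0,2)", xlab = "")
par(op)
```

Because the breaks are built once from the pooled range, bin edges and midpoints coincide in both panels, so differences in apparent spread come from the data rather than from the binning.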

Related Solutions

Solved – Histogram with uniform vs non-uniform Bins

When is a uniform-bin histogram better than a non-uniform bin one?

This requires identifying what we seek to optimize. Many people try to optimize average integrated mean square error (AIMSE), but in many cases I think that somewhat misses the point of doing a histogram: it often (to my eye) 'oversmooths'. For an exploratory tool like a histogram I can tolerate a good deal more roughness, since the roughness itself gives me a sense of the extent to which I should "smooth" by eye. I tend to at least double the usual number of bins from such rules, sometimes a good deal more. I tend to agree with Andrew Gelman on this; indeed, if my interest were really in getting a good AIMSE, I probably shouldn't be considering a histogram anyway.
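As a rough illustration of "doubling the usual number of bins", here is a sketch in R. The choice of the Freedman-Diaconis rule (`nclass.FD`) and of R's built-in `islands` data are assumptions made for the example; the text does not say which rule or data to use.

```r
x <- islands^(1/3)  # built-in R data, cube-root transformed

# Bin count suggested by a standard rule (Freedman-Diaconis; an arbitrary choice here).
n_rule <- nclass.FD(x)

# Double it, and build explicit equal-width breaks so hist() uses exactly that many bins.
n_rough <- 2 * n_rule
breaks_rough <- seq(min(x), max(x), length.out = n_rough + 1)

op <- par(mfrow = c(2, 1))
# A single number passed as breaks is only a suggestion to hist().
hist(x, breaks = n_rule, main = "Rule-based bin count")
# An explicit vector of breaks is used exactly as given.
hist(x, breaks = breaks_rough, main = "Doubled bin count")
par(op)
```

Passing an explicit vector of breaks matters here because `hist()` treats a single number only as a hint and rounds to "pretty" break points.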

So we need a criterion.

Let me start by discussing some of the options for non-equal-width histograms:

There are some approaches that do more smoothing (fewer, wider bins) in areas of lower density and have narrower bins where the density is higher - such as "equal-area" or "equal count" histograms. Your edited question seems to consider the equal count possibility.

The histogram function in R's lattice package can produce approximately equal-area bars:

library("lattice")
histogram(islands^(1/3))  # equal width
histogram(islands^(1/3),breaks=NULL,equal.widths=FALSE)  # approx. equal area

[Figure: comparison of equal-width and equal-area histograms]

That dip just to the right of the leftmost bin is even clearer if you take fourth roots; with equal-width bins you can't see it unless you use 15 to 20 times as many bins, and then the right tail looks terrible.

There's an equal-count histogram here, with R-code, which uses sample-quantiles to find the breaks.

For example, on the same data as above, here's 6 bins with (hopefully) 8 observations each:

[Figure: equal-count histogram]

ibr=quantile(islands^(1/3),0:6/6)
hist(islands^(1/3),breaks=ibr,col=5,main="")
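To see how close the bins actually come to equal counts, the same breaks can be passed to `hist()` with `plot = FALSE` and the counts inspected. This is a small check added for illustration. Tied values that fall on a quantile boundary can move an observation from one bin to its neighbour, which is why the counts are only "hopefully" equal.

```r
x <- islands^(1/3)
ibr <- quantile(x, 0:6/6)  # sample quantiles used as bin boundaries

# Bin without plotting and look at what landed in each bin.
h <- hist(x, breaks = ibr, plot = FALSE)
print(h$counts)    # observations per bin
print(diff(ibr))   # width of each bin
```

Since the breaks are unequal, `hist()` plots densities rather than counts by default, so it is the bar areas, not the bar heights, that are proportional to the counts.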

This CV question points to a paper by Denby and Mallows, a version of which is downloadable from here, which describes a compromise between equal-width bins and equal-area bins.

It also addresses the questions you had to some extent.

You could perhaps consider the problem as one of identifying the breaks in a piecewise-constant Poisson process. That would lead to work like this. There's also the related possibility of looking at clustering/classification type algorithms on (say) Poisson counts, some of which algorithms would yield a number of bins. Clustering has been used on 2D histograms (images, in effect) to identify regions that are relatively homogenous.

If we had an equal-count histogram and some criterion to optimize, we could then try a range of counts per bin and evaluate the criterion in some way. The Wand paper mentioned here [paper, or working paper pdf] and some of its references (e.g. the Sheather et al. papers) outline "plug in" bin width estimation based on kernel smoothing ideas to optimize AIMSE; broadly speaking, that kind of approach should be adaptable to this situation, though I don't recall seeing it done.

Solved – Log scale in histogram + Stata

Resources for Stata-specific help can be found under Internet Support for Statistics Software.

Of course, a general technique is to take the logs of your data values and then make a regular histogram. Here's the same data in raw form and in log form.
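The thread itself concerns Stata. As a language-neutral illustration of the general technique, here is the same idea sketched in R, not in Stata syntax. The built-in `islands` data and the base-10 logarithm are assumptions chosen for the example.

```r
x <- as.numeric(islands)  # built-in R data, used only as a skewed, positive example
log_x <- log10(x)         # requires strictly positive values

# Raw values and logged values side by side.
op <- par(mfrow = c(1, 2))
hist(x, main = "Raw", xlab = "area")
hist(log_x, main = "Logged", xlab = "log10(area)")
par(op)
```

On the raw scale nearly all observations fall in the first bar, while the logged values spread across the axis, which is the effect a log-scale histogram is meant to achieve.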
