Solved – Approaches for comparing visual representation of two distributions with unequal sample sizes

data visualization, distributions, histogram

I have two distributions of continuous, unpaired measurements. I would like to visualize the two distributions with a pair of histograms, counting the measurements that fall within each bin's interval.

Are there ways to rescale or process the smaller of the two sets so that, when I make two histograms (or other visualizations: violin, box, etc.) of their data, the visualization does not lead the viewer to favor a bin interval containing under- or over-represented measurements from one set relative to the other?

Best Answer

If you really need to compare histograms at different sample sizes, scale them both to area 1 (i.e. to be density estimates).
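As a rough illustration of the area-1 scaling, here is a minimal R sketch. The two samples (`x_large`, `x_small`), their sizes, and the bin width are assumptions made up for the example, not data from the question.

```r
# Hypothetical data: two samples of very different sizes (assumed for illustration)
set.seed(1)
x_large <- rnorm(5000, mean = 0, sd = 1)
x_small <- rnorm(80, mean = 0.5, sd = 1.2)

# Use the same bin edges for both samples so the bars are directly comparable
breaks <- seq(floor(min(x_large, x_small)),
              ceiling(max(x_large, x_small)), by = 0.5)

h_large <- hist(x_large, breaks = breaks, plot = FALSE)
h_small <- hist(x_small, breaks = breaks, plot = FALSE)

# With freq = FALSE the bars show densities, so each histogram has total area 1
col_large <- rgb(0, 0, 1, 0.4)
col_small <- rgb(1, 0, 0, 0.4)
plot(h_large, freq = FALSE, col = col_large, border = "white",
     ylim = c(0, max(h_large$density, h_small$density)),
     main = "Density-scaled histograms", xlab = "Measurement")
plot(h_small, freq = FALSE, col = col_small, border = "white", add = TRUE)
legend("topright", legend = c("n = 5000", "n = 80"),
       fill = c(col_large, col_small))
```

Sharing the bin edges matters as much as the scaling: with different breaks, bar heights would not be comparable even on the density scale.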

However, as Nick suggested in comments, there are other ways of comparing the distributions that don't require binning.

You could plot ECDFs (empirical cumulative distribution functions), or a pair of theoretical QQ plots on the same axes (the theoretical distribution doesn't need to be perfect, though a reasonable approximation will help with detailed comparisons), or perhaps kernel density estimates, for example.
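The three binning-free options could be sketched in base R roughly as follows. The samples `a` and `b` are hypothetical, and the normal distribution is only an assumed reference for the QQ plots.

```r
# Hypothetical data: unequal sample sizes (assumed for illustration)
set.seed(2)
a <- rnorm(3000, mean = 10, sd = 2)
b <- rnorm(60, mean = 11, sd = 3)

op <- par(mfrow = c(1, 3))

# 1. Empirical CDFs: both curves run from 0 to 1 regardless of sample size
plot(ecdf(a), do.points = FALSE, verticals = TRUE, col = "blue",
     xlim = range(a, b), main = "ECDFs", xlab = "Measurement")
plot(ecdf(b), do.points = FALSE, verticals = TRUE, col = "red", add = TRUE)

# 2. Theoretical QQ plots against the same (assumed) standard normal reference
qa <- qnorm(ppoints(length(a)))
qb <- qnorm(ppoints(length(b)))
plot(qa, sort(a), col = "blue", pch = 16, cex = 0.4,
     ylim = range(a, b), main = "Normal QQ plots",
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
points(qb, sort(b), col = "red", pch = 16, cex = 0.6)

# 3. Kernel density estimates: each curve integrates to approximately 1
da <- density(a)
db <- density(b)
plot(da, col = "blue", xlim = range(da$x, db$x),
     ylim = c(0, max(da$y, db$y)), main = "Kernel density estimates",
     xlab = "Measurement")
lines(db, col = "red")

par(op)
```

In the QQ panel, a difference in location shows up as a vertical shift between the two point clouds and a difference in spread as a difference in slope.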

Related Solutions

Solved – Visualizing multiple “histograms” (bar-charts)

As you have found out, there are no easy answers to your question!

I presume that you are interested in finding strange or different book stores? If this is the case, then you could try things like PCA (see the Wikipedia cluster analysis page for more details).

To give you an idea, consider this example. You have 26 bookshops (with names A, B, ..., Z). All bookshops are similar, except:

Shop Z sells only a few History books.
Shops O-Y sell more romance books than average.

A principal components plot highlights these shops for further investigation.

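Here's some sample R code:

# Simulate 26 bookshops with similar sales across four genres
d <- data.frame(Romance = rpois(26, 50), Horror = rpois(26, 100),
                Science = rpois(26, 75), History = rpois(26, 125))
rownames(d) <- LETTERS

# Alter a few shops: O-Y sell more romance, Z sells few history books
d[15:25, "Romance"] <- rpois(11, 150)
d[26, "History"] <- rpois(1, 10)

# Look at the data (values vary between runs because they are random)
head(d, 2)
#   Romance Horror Science History
# A      36    107      62     139
# B      47     93      64     118

# Principal components on the covariance matrix (prcomp does not scale by default)
books.PC.cov <- prcomp(d)
books.scores.cov <- predict(books.PC.cov)

# Plot of PC1 vs PC2, labelling each shop by its letter
plot(books.scores.cov[, 1], books.scores.cov[, 2],
     xlab = "PC 1", ylab = "PC 2", pch = NA)
text(books.scores.cov[, 1], books.scores.cov[, 2], labels = LETTERS)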

This gives the following plot:

PCA plot http://img265.imageshack.us/img265/7263/tmplx.jpg

Notice that:

Shop Z is an outlying point.
The other shops form two distinct groups.

Other possibilities

You could also look at GGobi; I've never used it, but it looks interesting.

Solved – Is it possible to visualize the data set in Excel

Using Excel, a quick way to visualize your data set is a small-multiple dot plot. Take each University in a separate chart and plot their housing counts per zip code. Your result could look something like this:

Small Multiple Dot Plot

Obviously, sorting and layout (columns v rows) will significantly change what is emphasized. As an example, this chart is sorted by College A's difference between on and off campus housing.

Also, as @whuber mentioned, mapping is another good option (even in Excel). You could also build a small multiple choropleth set of zip codes with the difference between on and off campus counts expressed through a divergent color scheme. My question would be whether their location is really an important analytical component, or whether their zip code is really more of a proxy for another characteristic (e.g. income, race, age, etc.).

For a quick, easy way to map in Excel, check out the tutorials at tushar-mehta and Clearly and Simply.

And you can't go wrong checking out Jon Peltier's website.

EDIT: How to create a small-multiples dot plot in Excel (that doesn't look like Excel).

Start with your data. I used the same format you described in your question, with two columns per university (on and off campus) and zip codes in rows. To make things easier I formatted them as a table. Then I added an additional column for each university (which I used to sort the results) that calculated the difference between on and off campus housing. Now's as good a time as any to choose your focus: which university will be first, and how will the data be sorted? For the example, I chose university A, sorted by the largest difference between on and off campus housing.
Next, create your first chart (there's a total of three, one per university). The chart is a simple line chart with two series: on-campus and off-campus counts. To make things simple, format this chart the way you want the others to look. In the case of the example, I did the following:
- formatted each series with no line
- formatted each series with the same size/shape marker, but had one with white fill, and the other with gray fill
- added high-low lines and formatted them with a dash
- removed the x-axis (markers, labels and line)
- formatted the y-axis to have a max that was greater than all values (since this chart will be copied repeatedly, once for each university)
- formatted the chart and plot areas to no fill and no border
- removed legend
Here's an example of the before/after of the chart (pay no attention to the chart junk shadows I have on the data points). Once you find a style you like, it's worth saving as a chart template so you can quickly apply it in the future.
Now that your chart looks like you want them all to look, copy the chart for as many universities as you want to compare.
Now modify each series for the appropriate university's data. It's easiest to do this by selecting the series in the chart and modifying the formula directly in the formula bar. In this case, it's simply a matter of changing the column reference (e.g. B1:B10 to C1:C10). If it's more complex, or if you have a lot of changes to make, I would suggest named ranges or VBA.
Using Excel's grid, line the charts up so that the y-axis and the x-axis categories are aligned. To make it easy, use Excel's Snap-to-Grid feature (Page Layout > Arrange > Align) and align both the chart and plot areas, both vertically and horizontally. If you size appropriately before copy/pasting, all you need to do is stack.
Add a data legend at the bottom of the bottom chart.
If your Excel columns are equal width (which they should be unless you modified them, in which case, make them equal again), you can put your column labels in the cells directly above each dot-plot column.
Add a title in the cells above your column labels and center across your selection.
Finally, on the View tab, turn off your gridlines and you'll have a chart that doesn't look like Excel.

Here's what it looks like when complete. The red lines show how the charts match up with the Excel grid (which I've turned back on for the image). All the text in red is directly in Excel (I used formulas to pull the appropriate zip code from the table, based upon the sort order). All other text, lines and markers are in the 3 charts, but instead of default black, I've changed them all to dark gray.


Hope this helps.
