Solved – Approaches for comparing visual representation of two distributions with unequal sample sizes

data visualization, distributions, histogram

I have two distributions of continuous, unpaired measurements. I would like to visualize the two distributions with a pair of histograms, counting the measurements that fall within each bin's interval.

Are there ways to rescale or process the smaller of the two sets so that, when I make two histograms (or other visualizations such as violin or box plots) of the data, the viewer is not led to favor a bin interval merely because one set is under- or over-represented there relative to the other?

Best Answer

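If you really need to compare histograms at different sample sizes, scale them both to have total area 1 (i.e., so that each is a density estimate). Bar heights then show the proportion of each sample per unit of x rather than raw counts, so the larger sample no longer looks taller simply because it has more observations.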

[Figure: two comparable histograms, scaled to be densities, even though n differs by a factor of 4]
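As a rough illustration of scaling both histograms to area 1, here is a minimal sketch using only the Python standard library. The sample data, the choice of 12 bins, the shared bin edges, and the helper names `density_histogram` and `total_area` are all assumptions made for this example; they are not part of the original answer.

```python
import bisect
import random


def density_histogram(values, edges):
    """Return bar heights scaled so that the bars' total area is 1.

    Bins are half-open [left, right), except the last, which also
    includes its right edge. Values outside the edges are ignored.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        if v == edges[-1]:
            i = n_bins - 1  # put the maximum into the last bin
        if 0 <= i < n_bins:
            counts[i] += 1
    total = sum(counts)
    # height = count / (number of observations * bin width)
    return [c / (total * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]


def total_area(heights, edges):
    """Sum of height * width over all bars."""
    return sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))


# Hypothetical data: sample sizes differ by a factor of 4.
rng = random.Random(0)
large = [rng.gauss(0.0, 1.0) for _ in range(400)]
small = [rng.gauss(0.5, 1.0) for _ in range(100)]

# Use the same bin edges for both samples so bars line up (an assumption
# of this sketch; it makes bin-by-bin comparison easier).
lo, hi = min(large + small), max(large + small)
n_bins = 12
width = (hi - lo) / n_bins
edges = [lo + k * width for k in range(n_bins)] + [hi]

dens_large = density_histogram(large, edges)
dens_small = density_histogram(small, edges)
area_large = total_area(dens_large, edges)
area_small = total_area(dens_small, edges)
```

Both `area_large` and `area_small` come out as 1 up to floating-point error, so the two sets of bar heights are on the same scale regardless of sample size. If NumPy or Matplotlib is available, `numpy.histogram(..., density=True)` and `matplotlib.pyplot.hist(..., density=True)` apply the same normalization.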

However, as Nick suggested in the comments, there are other ways of comparing the distributions that don't require binning.

For example, you could plot ECDFs (empirical cumulative distribution functions), a pair of theoretical QQ plots on the same axes (the theoretical distribution doesn't need to fit perfectly, though a reasonable approximation will help with detailed comparisons), or perhaps kernel density estimates.
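Of the binning-free options, the ECDF is the simplest to compute by hand. The following is a minimal sketch, again with only the standard library; the sample data, the 50-point grid, and the helper name `ecdf` are assumptions made for illustration.

```python
import bisect
import random


def ecdf(sample):
    """Return a function F where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        return bisect.bisect_right(ordered, x) / n

    return F


# Hypothetical data: sample sizes differ by a factor of 4.
rng = random.Random(1)
large = [rng.gauss(0.0, 1.0) for _ in range(400)]
small = [rng.gauss(0.5, 1.0) for _ in range(100)]

F_large = ecdf(large)
F_small = ecdf(small)

# Evaluate both ECDFs on a shared grid so they can be drawn on the same axes.
lo, hi = min(large + small), max(large + small)
grid = [lo + k * (hi - lo) / 49 for k in range(50)]
curve_large = [F_large(x) for x in grid]
curve_small = [F_small(x) for x in grid]
```

Each curve runs from 0 to 1 whatever the sample size, so no rescaling is needed. The smaller sample simply produces a curve with larger steps.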
