Solved – Comparing histograms

frequency histogram

The data I have are the ages at which men and women won an award (79 data points, one age each, per gender). When constructing a frequency table, I used the following to find the number of classes:
$$ n = 79 $$
$$ \frac{\log(n)}{\log(2)} +1 \approx 8 $$

and the following to find the class width for each gender:

$$ \frac{\max - \min}{8} $$
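As a rough check of the two formulas above, here is a minimal Python sketch. It assumes the results are rounded up to whole numbers (the question only says "approximately 8"). The lowest ages, 21 and 29, come from the tables in the question; the highest ages, 80 and 76, are assumed for illustration because the raw data are not shown. The function names are made up for this example.

```python
import math

def number_of_classes(n):
    """log(n)/log(2) + 1 from the question, rounded up (rounding is assumed)."""
    return math.ceil(math.log(n) / math.log(2) + 1)

def class_width(lowest, highest, classes):
    """(max - min) / classes, rounded up to a whole year (rounding is assumed)."""
    return math.ceil((highest - lowest) / classes)

n = 79
k = number_of_classes(n)  # log(79)/log(2) + 1 is about 7.3

# 21 and 29 are the lowest class limits in the tables.
# 80 and 76 are ASSUMED maxima, used only for illustration.
female_width = class_width(21, 80, k)
male_width = class_width(29, 76, k)

print(k, female_width, male_width)
```

Under those assumptions this gives 8 classes, with widths of 8 years for the women and 6 years for the men, which matches the class limits in the tables.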

The frequency tables look like this:

Female
Values      Frequency   Rel Freq.
1. 21 – 28   20          25%
2. 29 – 36   31          39%
3. 37 – 44   17          22%
4. 45 – 52   4           5%
5. 53 – 60   2           3%
6. 61 – 68   3           4%
7. 69 – 76   1           1%
8. 77 – 84   1           1%

Male
Values      Frequency   Rel Freq.
1. 29 – 34   10          13%
2. 35 – 40   20          25%
3. 41 – 46   24          30%
4. 47 – 52   13          16%
5. 53 – 58   6           8%
6. 59 – 64   5           6%
7. 65 – 70   0           0%
8. 71 – 76   1           1%

The histograms, made in STATDISK, end up looking like this:
Histograms: http://www.xdcclan.com/images/histograms.jpg

Because each class start is the lowest data point in its own data set, the histograms
look mismatched: one x-axis runs from 0 to 100 and the other from 20 to 80.

The histograms need the same number of classes, but wouldn't it be better if I used the same class limits for both, such as

Values      Freq
20 - 24     #
25 - 29     #
30 - 34     #
etc

to get histograms that are easier to compare? Or does this not matter?

Do the # of classes only have to be the same? Do the histograms both have to start at the same point (0 or 20)? Are these histograms sufficient/correct? What if I said it is mandatory to use the formulas above to find the # of classes and class width? Should I use the same class start for each in the histograms?

Best Answer

Welcome to CV!

Do the # of classes only have to be the same?

Not necessarily.

Do the histograms both have to start at the same point (0 or 20)?

Highly recommended. The axes should also end at the same number, and it is better still if they are drawn at the same length.

More babbling: It depends on what you mean by "comparing." From just the two histograms I can definitely compare and contrast the skewness of the distributions. But beyond that, it's not easy to say anything visually about the frequencies at different ages, because the class widths are different and, more crucially, the x-axes are different.
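To see why different class widths get in the way, here is a small Python sketch using the class frequencies from the two tables in the question. It converts each count into a share of the sample per year of age, which is one common way (my choice here, not something the question asked for) to put bars of different widths on the same footing. The function name is made up for this example.

```python
# Class frequencies copied from the two tables in the question.
female_counts = [20, 31, 17, 4, 2, 3, 1, 1]  # classes 21-28, 29-36, ...: 8 years wide
male_counts = [10, 20, 24, 13, 6, 5, 0, 1]   # classes 29-34, 35-40, ...: 6 years wide

def share_per_year(counts, width):
    """Share of the sample per year of age in each class: count / (n * width)."""
    n = sum(counts)
    return [c / (n * width) for c in counts]

female_density = share_per_year(female_counts, 8)
male_density = share_per_year(male_counts, 6)

# Tallest bar in each table: 31 women aged 29-36 versus 24 men aged 41-46.
print(round(female_density[1], 4), round(male_density[2], 4))
```

The tallest female bar holds more people (31 versus 24), but it also covers two more years of age. Per year of age, the male peak comes out slightly higher (about 0.0506 versus 0.0491), so bar heights from the two tables cannot be compared directly.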

A panelled histogram that can successfully enhance visual comparison should have common y and x axes like this example:

[Figure: example of a panelled histogram with common x- and y-axes]

If you allow the x- and y-axes to extend to wherever they want, it's difficult to compare across graphs.
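A minimal sketch of the common-axes idea in plain Python: build one set of class edges that covers both groups, then count each group against those same edges. The ages below are made-up placeholders, not the award data, and the 10-year width and the function names are arbitrary choices for illustration.

```python
def shared_edges(groups, width):
    """Class edges covering every group, starting at a multiple of `width`."""
    lowest = min(min(g) for g in groups)
    highest = max(max(g) for g in groups)
    edges = [(lowest // width) * width]
    while edges[-1] <= highest:
        edges.append(edges[-1] + width)
    return edges

def count_in_classes(ages, edges):
    """Counts per class, where class i covers edges[i] <= age < edges[i + 1]."""
    width = edges[1] - edges[0]
    counts = [0] * (len(edges) - 1)
    for age in ages:
        counts[(age - edges[0]) // width] += 1
    return counts

# HYPOTHETICAL integer ages, used only to show the mechanics.
female_ages = [21, 26, 29, 33, 35, 41, 61, 80]
male_ages = [29, 36, 38, 42, 45, 47, 60, 76]

width = 10
edges = shared_edges([female_ages, male_ages], width)
female_counts = count_in_classes(female_ages, edges)
male_counts = count_in_classes(male_ages, edges)

# Both groups are printed against the same classes and the same bar scale.
for low, f, m in zip(edges, female_counts, male_counts):
    print(f"{low}-{low + width - 1}  F {'#' * f:<4} M {'#' * m}")
```

Because both groups share the same edges, every row compares like with like. In a plotting package the equivalent is passing the same bin edges to both histograms and sharing the x- and y-axes across the panels.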

Are these histograms sufficient/correct?

By themselves they are correct, but for a cross-gender comparison they are insufficient.

What if I said it is mandatory to use the formulas above to find the # of classes and class width?

I'd have a hard time not laughing. For many reasons:

  1. The formula was probably meant for making one histogram, not for making two or more that are then put together for comparison. It sounds to me like a shoemaker who always wants to use up the whole piece of leather on each shoe, and who can never sell a pair because no two of his shoes actually match.

  2. A histogram depends on more than just the sample size: the min and the max, the distribution of the people, the nuances one would like to highlight, etc. It's unfair to overlook all that other important information.

  3. When it comes to visualizing data, it's important to realize that we are there to communicate the data, not to package it and force it down the viewers' throats. I'd probably stay away from all these dogmatic, bossy "you have to do this and do that..." instructions. (Though I give those instructions as well, guilty as charged.)

But(!) if your professor or supervisor, who can decide whether you pass or fail the course, said you have to use this formula, then I'd suggest picking your battles wisely.

Should I use the same class start for each in the histograms?

Not necessarily, but it is recommended because this form is the easiest to perceive. There are histograms that use unequal bin sizes. If for some reason a certain age categorization is more important for females than for males, you may make the bins unequal.

For instance, if one country's legal drinking age is 18 and another's is 21, you may group everyone <18 into one column for the first country and everyone <21 into one column for the second country. The question is: would this be more meaningful? Or would you see the same pattern if you binned them by single year of age?
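One way to explore that question is to bin the same sample three ways and look at what each version shows. The sketch below is only an illustration: the ages are made up, and the helper `count_by_cut_points` is a hypothetical name, not part of any package.

```python
from bisect import bisect_right
from collections import Counter

def count_by_cut_points(ages, cut_points):
    """Count ages in classes split at sorted cut points.

    Class 0 is everything below cut_points[0]; the last class is
    everything at or above cut_points[-1].
    """
    counts = [0] * (len(cut_points) + 1)
    for age in ages:
        counts[bisect_right(cut_points, age)] += 1
    return counts

# HYPOTHETICAL ages, used only to show the mechanics.
ages = [15, 16, 17, 17, 18, 19, 20, 20, 21, 22, 25, 30]

split_at_18 = count_by_cut_points(ages, [18])  # [under 18, 18 and over]
split_at_21 = count_by_cut_points(ages, [21])  # [under 21, 21 and over]
single_year = sorted(Counter(ages).items())    # one class per year of age

print(split_at_18, split_at_21)
print(single_year)
```

With these made-up ages, moving the cut point from 18 to 21 flips which column is taller, while the single-year counts show no jump at either age. Whether the legal cut point or the finer bins tell the more useful story depends on what you want the viewer to take away.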

In a nutshell, you'll need to know what you want the viewers to know, and work backward from there. Avoid starting with recipes.