Solved – Label on the y-axis in a normalised histogram

Tags: data visualization, histogram, matplotlib, normalization, python

If you have a histogram with frequency on the y-axis and bins for different ranges of values on the x-axis, then it is reasonable that the label on the y-axis should be frequency. But if these frequencies are normalised, what is the correct y-label? Normalised frequency? Quota?

I use Python's matplotlib:

import matplotlib.pyplot as plt
l = [3, 3, 3, 2, 1, 4, 4, 5, 5, 5, 5, 5]
plt.hist(l, density=True)  # density=True replaces normed=True, which later matplotlib releases no longer accept
plt.show()

It seems that a bar can rise above 1.0, which I don't really want. I want the y-axis to show something more like a proportion or fraction: the heights of the bars should sum to one.

Best Answer

It depends what you mean by "normalised"; it also depends on your software's choices.

"Normalised" is a word that statistical people often avoid, because it is ambiguous between (a) scaled or standardized (e.g. to total 1, or to mean 0 and SD 1) and (b) transformed (approximately) to normality, meaning a normal or Gaussian distribution. Naturally, your language may well reflect your community and differ from this, but watch out, because usages are not universal across the statistical sciences.

On histograms I have variously seen

  • frequencies, or bin counts

  • frequency density, or bin counts/bin width

  • proportions, or bin counts/total count

  • percent[age]s, or proportions multiplied by 100

(the last two are really the same, just that vulgar prejudice often regards them as different)

  • probability density, i.e. frequency density/total count, integrating to 1 over the whole histogram

There could be a good case for all of these, although "frequency density" I think is the least common and could be widely puzzling without explanation.
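To make the definitions above concrete, here is a minimal pure-Python sketch that computes each quantity for the data in the question. The helper name `histogram_scalings` and the choice of ten equal bins on [1, 5] are assumptions made for illustration; they are not part of the question or of any library.

```python
def histogram_scalings(values, edges):
    """Return the y-values under each labelling, one entry per bin."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            last = i == n_bins - 1
            # Bins are half-open [lo, hi), except the last, which includes hi.
            if edges[i] <= v < edges[i + 1] or (last and v == edges[i + 1]):
                counts[i] += 1
                break
    total = sum(counts)
    widths = [edges[i + 1] - edges[i] for i in range(n_bins)]
    proportions = [c / total for c in counts]
    return {
        "frequency": counts,
        "frequency density": [c / w for c, w in zip(counts, widths)],
        "proportion": proportions,
        "percent": [100 * p for p in proportions],
        "probability density": [p / w for p, w in zip(proportions, widths)],
    }


data = [3, 3, 3, 2, 1, 4, 4, 5, 5, 5, 5, 5]
edges = [1 + 0.4 * i for i in range(11)]  # assumption: ten equal bins on [1, 5]
scalings = histogram_scalings(data, edges)

# Area under the probability-density histogram: sum of height * width.
area = sum(
    d * (edges[i + 1] - edges[i])
    for i, d in enumerate(scalings["probability density"])
)
```

With these choices the proportions sum to 1 across bins, whereas for probability density it is `area`, the sum of height times bin width, that equals 1.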

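For completeness I note that probability density can easily exceed 1, a point that causes frequent puzzlement. It is the area of each bar (height times bin width), not its height, that is a probability, so bins narrower than 1 can carry densities above 1.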
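In the question's example, matplotlib's default of ten bins over [1, 5] gives a bin width of 0.4, so the bar holding the five 5s has density (5/12)/0.4 ≈ 1.04. If proportions are what is wanted, one possible approach (a sketch, not the only way) is to give each observation a weight of 1/n through the `weights` argument of `hist`, and to label the axis accordingly. The axis labels here are my own choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

data = [3, 3, 3, 2, 1, 4, 4, 5, 5, 5, 5, 5]

# Each observation contributes 1/n, so every bar height is a proportion.
weights = [1 / len(data)] * len(data)

fig, ax = plt.subplots()
heights, bin_edges, _ = ax.hist(data, weights=weights)
ax.set_xlabel("Value")
ax.set_ylabel("Proportion")
```

The returned `heights` then sum to 1 and no bar can exceed 1. To view the figure interactively, drop the `matplotlib.use("Agg")` line and call `plt.show()`.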
