Solved – How to describe a bin in a histogram

histogram, terminology

What is common in the literature to refer to a certain bin in a histogram?

For example, say I have a histogram with 4 bins. The first bin holds all values from 1 to 2, the second bin those from 2 to 3, and so on. It seems strange to me, and too long, to call a bin "the bin between 1 to 2". Would it be acceptable just to write or say "bin 1" or "bin 35", meaning that the values of the bin start at 1 or at 35? It should be obvious in the histogram which bin is meant by 35, as all bin borders are marked.

Best Answer

This is partly a question of statistics terminology and partly one of English usage. (Clearly, some points may be irrelevant or need changing for anyone interested in this question but for some language other than English.)

Let's focus first on measured data.

To be completely clear in describing your bin to me you have to tell me somehow (1) where it starts, (2) how wide it is, and (3) what happens at bin boundaries. That's a matter of statistics. Sometimes (2) or even (3) is obvious from context; for example, (2) may be obvious from looking at the graph.

In English, "between" is best paired with "and" and "from" with "to", but a problem with both usages is that they leave ambiguous what happens at the boundaries. So, "between 2 and 3", "between 3 and 4", etc. or "from 2 to 3", "from 3 to 4", etc., raise the question of what happens if data are exactly 3.

For completeness, I will stress that units of measurement when used (kg, m, USD/year, etc.) should always be mentioned prominently at least once.

While I am focusing on English usage, I'll note that usages such as "between 2-3" and "from 2-3", although very common, are widely disapproved of by usage pundits and recommended against by many style guides. You will also encounter views that such an attitude is anywhere between conservative and reactionary. (On this issue, I line up with the conservatives.) That is, it is considered poor style to use a hyphen or dash as a replacement for the second word, namely "and" or "to". The argument appears to be one of symmetry: words that deserve to be paired should indeed be paired.

If you tell me that a bin is for values $2 \le x < 3$ or for $[2, 3)$ you have told me everything I need to know. So, if you need to refer to a particular bin, using a little mathematics can be simpler and better than ambiguous wording. Naturally, for $x$ feel free to substitute a word description of the variable. Or use that word description elsewhere and give an example-based explanation such as this:

Bin width is 1 and lower limits are inclusive, so (e.g.) the bin for 2-3 includes values reported as 2.0.
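To make that convention concrete, here is a minimal Python sketch that works out which bin a value falls in and labels it in interval notation. It assumes bins of equal width with inclusive lower limits, as in the example sentence above. The function names `bin_interval` and `bin_label` and the default start of 1 are illustrative choices, not taken from any library.

```python
import math

def bin_interval(x, start=1.0, width=1.0):
    """Return (lower, upper) for the half-open bin [lower, upper) containing x.

    Assumes equal-width bins whose lower limits are inclusive.
    """
    index = math.floor((x - start) / width)  # how many whole bins x lies above start
    lower = start + index * width
    return lower, lower + width

def bin_label(x, start=1.0, width=1.0):
    """Describe the bin containing x in interval notation, e.g. '[2, 3)'."""
    lower, upper = bin_interval(x, start, width)
    return f"[{lower:g}, {upper:g})"

print(bin_label(2.0))    # [2, 3)
print(bin_label(2.999))  # [2, 3)
print(bin_label(3.0))    # [3, 4)
```

A value of exactly 3 lands in $[3, 4)$, not $[2, 3)$, which is precisely the boundary detail that "from 2 to 3" leaves unstated.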

Things are usually and naturally simpler with discrete (e.g. counted) data. It is still best to report bins as (e.g.) 0-3, 4-7, 8-11, etc. and never as 0-4, 4-8, 8-12, etc., because the latter labels leave it unclear which bin holds a count of exactly 4 or 8. (It may surprise you how common the latter practice is.)
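As a small illustration for counted data, the following Python sketch generates non-overlapping labels of the recommended kind. It assumes integer-valued data and bins that each cover the same number of integers; `integer_bin_labels` is a hypothetical helper name used only for this example.

```python
def integer_bin_labels(first, width, count):
    """Build labels like '0-3', '4-7' for bins of integer values.

    Each bin covers `width` consecutive integers, so the upper label is
    low + width - 1 and adjacent labels never share a value.
    """
    labels = []
    for i in range(count):
        low = first + i * width
        high = low + width - 1
        labels.append(f"{low}-{high}")
    return labels

print(integer_bin_labels(0, 4, 3))  # ['0-3', '4-7', '8-11']
```

Because each upper label is one less than the next lower label, every count belongs to exactly one labelled bin.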

However, much depends on your readership. Perhaps your readership are not comfortable with notation for inequalities, in which case you still have the problem of explaining what happens at bin boundaries, although only context and audience can determine how far that matters. I've found that you can't presume familiarity with use of $[, )$ notation unless you are addressing people with good mathematical backgrounds. Even statistics users forget much of what school or college mathematics they once knew if they don't use it routinely.

I wouldn't presume that all bins are labelled with their numeric limits on the histogram. If there are tens or even hundreds of bins that would usually be busy, impracticable or both. Conversely, it is difficult to imagine discussing an individual bin unless it is identifiable.

EDIT: Thanks to other contributors for reminding me of interval notation.
