Solved – Normalize histogram with different bin width

histogram, normalization

I know how to normalize a histogram (so that the area = 1) when all bins have the same width, but how do I do it when the bins have different widths? Any ideas?

Best Answer

The principle of a histogram is that for bins that touch (i.e. no overlaps or gaps), bar height should be proportional to bin frequency/bin width. If bin width is constant, then it can be left out of the calculation.

It's conventional to present histograms as showing frequencies, percents, or proportions, and that's all possible and legitimate with constant bin width.

With varying bin width, however, the choices include (a) probability density, (b) frequency density, and (c) frequencies scaled to some standard bin width (really (b) with possibly differing magnitudes).
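As a minimal sketch of the three scalings, assuming made-up bin edges and counts (the names `edges`, `counts`, and `standard_width` are illustrative, not from the answer), each bar height is the bin frequency divided by the bin width, up to a constant:

```python
# Hypothetical example data: unequal bin edges and the count in each bin.
edges = [0.0, 1.0, 2.0, 4.0, 8.0]
counts = [10, 20, 40, 30]

n = sum(counts)
widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]

# (a) probability density: bar areas sum to 1
prob_density = [c / (n * w) for c, w in zip(counts, widths)]

# (b) frequency density: bar areas sum to the total frequency n
freq_density = [c / w for c, w in zip(counts, widths)]

# (c) frequency per standard bin width (assumed here to be 2 units),
#     which is just (b) multiplied by a constant
standard_width = 2.0
freq_per_standard = [standard_width * d for d in freq_density]

# Total bar area under scalings (a) and (b)
area_a = sum(h * w for h, w in zip(prob_density, widths))
area_b = sum(h * w for h, w in zip(freq_density, widths))
print(prob_density, area_a)
print(freq_density, area_b)
print(freq_per_standard)
```

All three give bars of the same shape; only the labelling of the vertical axis differs.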

The principles transcend software choice, but more discussion, and some references, can be found in this Stata FAQ.

If you have varying bin widths, that has no impact on the principle that the total area of the bars represents total frequency or total probability of 1. If you draw the individual bars correctly, that will be satisfied.
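To illustrate that the total area does not depend on how the bins are cut, here is a small sketch that bins the same raw data with two different sets of edges. The function name `density_histogram`, the data, and the edges are all hypothetical, and bins are assumed to be closed on the left and open on the right:

```python
from bisect import bisect_right

def density_histogram(data, edges):
    """Bar heights (probability densities) for bins [edges[i], edges[i+1]).

    Values outside [edges[0], edges[-1]) are ignored in this sketch.
    """
    counts = [0] * (len(edges) - 1)
    for x in data:
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
    n = sum(counts)
    return [c / (n * (hi - lo))
            for c, lo, hi in zip(counts, edges[:-1], edges[1:])]

def total_area(heights, edges):
    """Sum of height times width over all bars."""
    return sum(h * (hi - lo)
               for h, lo, hi in zip(heights, edges[:-1], edges[1:]))

data = [0.2, 0.5, 0.9, 1.1, 1.5, 2.5, 3.0, 3.9, 5.0, 7.5]
for edges in ([0.0, 1.0, 2.0, 4.0, 8.0], [0.0, 0.5, 3.0, 8.0]):
    heights = density_histogram(data, edges)
    print(edges, heights, total_area(heights, edges))
```

The bar heights change with the binning, but the area stays at 1 (up to floating-point rounding) in both cases.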

Related Solutions

Solved – Optimal bin width for two dimensional histogram

My advice would generally be that it's even more critical than in 1-D to smooth where possible, i.e. to do something like kernel density estimation (or some other such method, like log-spline estimation), which tends to be substantially more efficient than using histograms. As whuber points out, it's quite possible to be fooled by the appearance of a histogram, especially with few bins and small to moderate sample sizes.

If you're trying to optimize mean integrated squared error (MISE), say, there are rules that apply in higher dimensions (the number of bins depends on the number of observations, the variance, the dimension, and the "shape"), for both kernel density estimation and histograms.

[Indeed, many of the issues for one are also issues for the other, so some of the information in this Wikipedia article will be relevant.]

This dependence on shape seems to imply that to choose optimally, you already need to know what you're plotting. However, if you're prepared to make some reasonable assumptions, you can use those (so for example, some people might say "approximately Gaussian"), or alternatively, you can use some form of "plug-in" estimator of the appropriate functional.

Wand, 1997$^{[1]}$ covers the 1-D case. If you're able to get that article, take a look, as much of what's there is also relevant to the situation in higher dimensions (in so far as the kinds of analysis that are done). (It exists in working paper form on the internet if you don't have access to the journal.)

Analysis in higher dimensions is somewhat more complicated (in pretty much the same way it proceeds from 1-D to r-dimensions for kernel density estimation), but there's a term in the dimension that comes into the power of n.

Sec 3.4 Eqn 3.61 (p83) of Scott, 1992$^{[2]}$ gives the asymptotically optimal binwidth:

$h_k^*=R(f_k)^{-1/2}\,\left(6\prod_{i=1}^d R(f_i)^{1/2}\right)^{1/(2+d)} n^{-1/(2+d)}$

where $R(f)=\int_{\mathbb{R}^d} f(x)^2 dx$ is a roughness term (not the only one possible), and I believe $f_i$ is the derivative of $f$ with respect to the $i^\text{th}$ component of $x$.

So for 2D that suggests binwidths that shrink as $n^{-1/4}$.

In the case of independent normal variables, the approximate rule is $h_k^*\approx 3.5\sigma_k n^{-1/(2+d)}$, where $h_k$ is the binwidth in dimension $k$, the $*$ indicates the asymptotically optimal value, and $\sigma_k$ is the population standard deviation in dimension $k$.
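A rough sketch of that rule follows. It assumes the unknown population $\sigma_k$ is replaced by the sample standard deviation, which is a plug-in choice of mine rather than something stated above; the function name and the simulated data are likewise hypothetical:

```python
import random
import statistics

def normal_reference_binwidths(columns):
    """Approximate binwidths h_k = 3.5 * s_k * n**(-1/(2+d)).

    `columns` is a list of d equal-length lists, one per dimension.
    The sample standard deviation s_k stands in for sigma_k (an assumption).
    """
    d = len(columns)
    n = len(columns[0])
    return [3.5 * statistics.stdev(col) * n ** (-1.0 / (2 + d))
            for col in columns]

# Simulated independent normal data in two dimensions
random.seed(0)
n = 500
sample = [[random.gauss(0.0, 1.0) for _ in range(n)],
          [random.gauss(0.0, 3.0) for _ in range(n)]]
h = normal_reference_binwidths(sample)
print(h)
```

With $d = 2$ the exponent is $-1/4$, so quadrupling the sample size shrinks each binwidth by a factor of about 0.71.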

For bivariate normal with correlation $\rho$, the binwidth is

$h_i^* = 3.504 \sigma_i(1-\rho^2)^{3/8}n^{-1/4}$
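The bivariate formula can be sketched in the same way. Here both $\sigma_i$ and $\rho$ are replaced by sample estimates, which is again my assumption, and the function name and simulated data are illustrative only:

```python
import random
import statistics

def bivariate_normal_binwidths(xs, ys):
    """Binwidths h_i = 3.504 * s_i * (1 - r**2)**(3/8) * n**(-1/4).

    s_i and r are the sample standard deviations and sample correlation,
    used as plug-in estimates of sigma_i and rho (an assumption).
    """
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / (sx * sy)
    # Guard against 1 - r**2 dipping below 0 through rounding error
    shrink = max(0.0, 1.0 - r * r) ** (3.0 / 8.0) * n ** (-0.25)
    return 3.504 * sx * shrink, 3.504 * sy * shrink

# Simulated correlated pair (population correlation 0.8)
random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(400)]
ys = [0.8 * x + 0.6 * random.gauss(0.0, 1.0) for x in xs]
hx, hy = bivariate_normal_binwidths(xs, ys)
print(hx, hy)
```

The factor $(1-\rho^2)^{3/8}$ means that stronger correlation calls for narrower bins in both dimensions.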

When the distribution is skewed, or heavy tailed, or multimodal, generally much smaller binwidths result; consequently the normal results would often be at best upper bounds on binwidth.

Of course, it's entirely possible you're not interested in mean integrated squared error, but in some other criterion.

[1]: Wand, M.P. (1997),
"Data-based choice of histogram bin width",
American Statistician 51, 59-64

[2]: Scott, D.W. (1992),
Multivariate Density Estimation: Theory, Practice, and Visualization,
John Wiley & Sons, Inc., Hoboken, NJ, USA.

Solved – How to describe a bin in a histogram

This is partly a question of statistics terminology and partly one of English usage. (Clearly, some points may be irrelevant or need changing for anyone interested in this question but for some language other than English.)

Let's focus first on measured data.

To be completely clear in describing your bin to me you have to tell me somehow (1) where it starts, (2) how wide it is, and (3) what happens at bin boundaries. That's a matter of statistics. Sometimes (2) or even (3) are obvious from context, e.g. (2) may be obvious by looking at the graph.

In English, "between" is best paired with "and" and "from" with "to", but a problem with both usages is that they leave ambiguous what happens at the boundaries. So, "between 2 and 3", "between 3 and 4", etc. or "from 2 to 3", "from 3 to 4", etc., raise the question of what happens if data are exactly 3.

For completeness, I will stress that units of measurement when used (kg, m, USD/year, etc.) should always be mentioned prominently at least once.

While I am focusing on English usage, I'll note that usages such as "between 2-3" and "from 2-3", although very common, are widely disapproved of by usage pundits and recommended against by many style guides, but you will also encounter views that such an attitude is anywhere between conservative and reactionary. (On this issue, I line up with the conservatives.) That is, it is considered poor style to use a hyphen or dash as a replacement for the second word, namely "and" or "to". The argument appears to be one of symmetry: words that deserve to be paired should indeed be paired.

If you tell me that a bin is for values $2 \le x < 3$ or for $[2, 3)$, you have told me everything I need to know. So, if you need to refer to a particular bin, using a little mathematics can be simpler and better than ambiguous wording. Naturally, for $x$ feel free to substitute a word description of the variable. Or use that word description elsewhere and add an example-based explanation such as this:

"Bin width is 1 and lower limits are inclusive, so (e.g.) the bin for 2-3 includes values reported as 2.0."
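A convention with inclusive lower limits can be made concrete in a few lines. This is only a sketch: the function name `bin_label` and its `start` and `width` parameters are hypothetical, and widths that are not exactly representable in floating point may need extra care at the boundaries:

```python
import math

def bin_label(x, start=0.0, width=1.0):
    """Return the half-open interval [lo, hi) containing x, as a string."""
    k = math.floor((x - start) / width)  # index of the bin holding x
    lo = start + k * width
    return f"[{lo:g}, {lo + width:g})"

print(bin_label(2.0))    # [2, 3)
print(bin_label(2.999))  # [2, 3)
print(bin_label(3.0))    # [3, 4)
```

A value of exactly 3 goes into the bin that starts at 3, never the one that ends there, which is precisely what "between 2 and 3" fails to say.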

Things are usually and naturally simpler with discrete (e.g. counted) data. It is still best to report that bins are (e.g.) 0-3, 4-7, 8-11, etc. and never as 0-4, 4-8, 8-12, etc. (It may surprise you how common the latter practice is.)
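For counted data, non-overlapping labels of this kind can be generated directly. The sketch below assumes non-negative integer counts and a bin size of 4; the function name is made up for illustration:

```python
def discrete_bin_label(count, size=4):
    """Label such as '4-7' for the bin of `size` integers containing `count`."""
    lo = (count // size) * size
    return f"{lo}-{lo + size - 1}"

for c in (0, 3, 4, 7, 8, 11):
    print(c, discrete_bin_label(c))
```

Each integer then appears in exactly one label, so nobody has to guess whether a count of 4 belongs to the first bin or the second.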

However, much depends on your readership. Perhaps your readership are not comfortable with notation for inequalities, in which case you still have the problem of explaining what happens at bin boundaries, although only context and audience can determine how far that matters. I've found that you can't presume familiarity with use of $[, )$ notation unless you are addressing people with good mathematical backgrounds. Even statistics users forget much of what school or college mathematics they once knew if they don't use it routinely.

I wouldn't presume that all bins are labelled with their numeric limits on the histogram. If there are tens or even hundreds of bins that would usually be busy, impracticable or both. Conversely, it is difficult to imagine discussing an individual bin unless it is identifiable.

EDIT: Thanks to other contributors for reminding me of interval notation.
