Solved – Doane’s formula for histogram binning

binninghistogram

I'm implementing various algorithms to estimated the best number of bins to use for histograms. Most of the ones I am implementing are described on the Wikipedia "Histogram" page in the section "Number of bins and width"*.

I am stuck on a problem with Doane's formula:

1 + log(n) + log(1 + kurtosis(data) * sqrt(n / 6.))

where n is the data size.

The problem is when the kurtosis is negative and n >> 1 because the argument of the log becomes negative.

* (that page has changed since this was posted, link edited to point to the page as it was at the time of posting)

Best Answer

This answer has undergone significant changes as I investigate the Wikipedia page. I've left the answers largely as they were but added to them, so at present this forms a progression of understanding; the last parts are where the best information is.

Short answer: the Wikipedia page (at the time the question was posted) - and the OP's formula, which seems to have been the same - are simply wrong, for at least three different reasons. I'll leave my original discussion (which assumed that the OP and Wikipedia had it right) since that explains some issues. Better discussion follows later. The short advice: simply forget Doane. If you must use it, use what Wikipedia says now (I fixed it).

I believe that formula must refer to excess kurtosis; my reason for that is that it modifies a formula for normal data to account for non-normal data so you'd expect it to reproduce the unmodified one at the normal. It does that if you use excess kurtosis.

That does, however, raise the problem that the term in the log can go negative with large samples (indeed, it's possible to be be $\leq 0$ at quite small $n$). I'd suggest not using it with negative excess kurtosis (I'd never use it beyond unimodality anyway; once things get multimodal you want to apply the excess kurtosis idea to each mode, not smooth over them!), though with mild cases (excess kurtosis just less than 0) and modest sample sizes it won't be a big issue.

I'd also suggest that in any case it's going to give far too few bins at large sample sizes, even when it works as intended.

You may find this paper (by regular CVer Rob Hyndman):

http://www.robjhyndman.com/papers/sturges.pdf

of some interest. If Sturges' argument is wrong, Doane's formula has the same problem... as Rob clearly notes in the paper.

In that paper (and in this answer) he gives a nod to the Freedman-Diaconis rule. In the paper he also points to the approach mentioned by Matt Wand (he refers to the working paper which doesn't seem to be online, but the subsequent paper is available if you have access):

http://www.jstor.org/discover/10.2307/2684697

[Edit: actually a link to the working paper is on the citeseer page]

That approach involves approximately estimating particular functionals in order to get approximately optimal (in terms of mean integrated squared error, MISE) bin widths for estimating the underlying density. While these work well and give many more bins than Sturges or Doane in general, sometimes I still prefer to use more bins still, though it's usually a very good first attempt.

Frankly I don't know why Wand's approach (or at the very least the Freedman-Diaconis rule) isn't a default pretty much everywhere.

R does at least offer the Freedman-Diaconis calculation of the number of bins:

 nclass.FD(rnorm(100))
[1] 11
 nclass.FD(runif(100))
[1] 6
 nclass.FD(rt(100,1))
[1] 71

See ?nclass.FD

Personally, for me that's too few bins in the first two cases at least; I'd double both of those in spite of the fact that it might be a bit noisier than optimal. As $n$ becomes large, I think it does very well in most cases.

Edit 2:

I decided to investigate the skewness vs kurtosis issue that @PeterFlom rightly expressed puzzlement at.

I just had a look at the Doane paper wiso linked to (I'd read it before .... but that was almost 30 years ago) - it makes no reference to kurtosis at all, only to skewness.

Doane's actual formula is: $K_e = \log_2(1+\frac{g_1}{\sigma_{g_1}})$

where $K_e$ is the number of added bins, $g_1$ is the 3rd moment skewness. [Well actually, Doane, following fairly common usage of the time, uses $\sqrt{b_1}$ for the signed (!) 3rd moment skewness (the origin of this particularly unedifying abuse of notation is quite old and I'm not going to pursue it, except to say that it's fortunately appearing much less often now).]

Now at the normal, $\sigma_{g_1} = \sqrt{\frac{6(n-2)}{(n+1)(n+3)}} \approx \sqrt{\frac{6}{n}}$
(although that approximation is pretty poor until $n$ is well past 100; Doane uses the first form).

However, it seems that along the way someone has tried to adapt it to kurtosis (at the time I write this Wikipedia has it in terms of kurtosis, for example, and I don't think they made it up) - but there's clear reason to believe that the formula is simply wrong (note that the standard error used is that final approximation for the s.e. of the skewness I gave above). I think I've seen this use of kurtosis in several places other than Wikipedia, but besides not being in Doane's paper, it isn't present in Scott's paper, nor the Hyndman paper I point to, nor in Wand's paper. It does seem to have come from somewhere, however (i.e. I am sure it's not original to Wikipedia), because Doane doesn't have the approximation to $\sigma_{g_1}$. It looks like it's been played with several times before it ended up there; I'd be interested if anyone tracked it down.

It does look to me like Doane's argument should happily extend to kurtosis, but the correct standard error would have to be used.

However, since Doane relies on Sturges and Sturges' argument seems to be flawed, perhaps the entire enterprise is doomed. In any case I have edited the Histogram talk page on wikipedia noting the error.

---

Edit 3: I have corrected the wikipedia page (but I took the liberty of taking the absolute value of the skewness, otherwise Doane's original formula can't be used for left-skewed distributions as it stood - clearly for number of bins the sign of the skewness is immaterial). Strictly speaking I should have presented the formula in its original (wrong) form, and then explained why it doesn't make sense but I think that's problematic for several reasons - not least that people will be tempted to just copy the formula and ignore an explanation. I believe it actually covers Doane's original intent. In any case it's a vast improvement over the nonsense that was in the original. (Please, anyone who can access the original paper, take a look at it and how $\sqrt{b_1}$ is defined and check my changes on Wikipedia to make sure it's reasonable - there were at least three things wrong - the kurtosis, the standard error, and the wrong base of logs, plus Doane's own small error.)

Related Solutions

Solved – How the Pearson’s Chi Squared Test works

A Chi-square test is designed to analyze categorical data. That means that the data has been counted and divided into categories. It will not work with parametric or continuous data. So it doesn't work to determine resulting fit in every instance.

Source: http://www.ling.upenn.edu/~clight/chisquared.htm

Solved – Optimal number of bins in histogram by the Freedman–Diaconis rule: difference between theoretical rate and actual number

The reason comes from the fact that the histogram function is expected to include all the data, so it must span the range of the data.

The Freedman-Diaconis rule gives a formula for the width of the bins.

The function gives a formula for the number of bins.

The relationship between number of bins and the width of bins will be impacted by the range of the data.

With Gaussian data, the expected range increases with $n$.

Here's the function:

> nclass.FD
function (x) 
{
    h <- stats::IQR(x)
    if (h == 0) 
        h <- stats::mad(x, constant = 2)
    if (h > 0) 
        ceiling(diff(range(x))/(2 * h * length(x)^(-1/3)))
    else 1L
}
<bytecode: 0x086e6938>
<environment: namespace:grDevices>

diff(range(x)) is the range of the data.

So as we see, it divides the range of the data by the FD formula for bin width (and rounds up) to get the number of bins.

It seems I could have been clearer, so here's a more detailed explanation:
The actual Freedman-Diaconis rule is not a rule for the number of bins, but for the bin-width. By their analysis, the bin width should be proportional to $n^{−1/3}$. Since the total width of the histogram must be closely related to the sample range (it may be a bit wider, because of rounding up to nice numbers), and the expected range changes with $n$, the number of bins is not quite inversely proportional to bin-width, but must increase faster than that. So the number of bins should not grow as $n^{1/3}$ - close to it, but a little faster, because of the way the range comes into it.

Looking at data from Tippett's 1925 tables[1], the expected range in standard normal samples seems to grow quite slowly with $n$, though -- slower even than $\log(n)$:

enter image description here

(indeed, amoeba points out in comments below that it should be proportional - or nearly so - to $\sqrt{\log(n)}$, which grows more slowly than your analysis in the question seem to suggest. This makes me wonder whether there's some other issue coming in, but I haven't investigated whether this range effect fully explains your data.)

A quick look at Tippett's numbers (which go up to n=1000) suggest that the expected range in a Gaussian is very close to linear in $\sqrt{\log(n)}$ over $10\leq n\leq 1000$, but it seems to be not actually proportional for values in this range.

enter image description here

[1]: L. H. C. Tippett (1925). "On the Extreme Individuals and the Range of Samples Taken from a Normal Population". Biometrika 17 (3/4): 364–387

Best Answer

Related Solutions

Solved – How the Pearson’s Chi Squared Test works

Solved – Optimal number of bins in histogram by the Freedman–Diaconis rule: difference between theoretical rate and actual number

Related Question