Solved – Doane’s formula for histogram binning

binninghistogram

I'm implementing various algorithms to estimated the best number of bins to use for histograms. Most of the ones I am implementing are described on the Wikipedia "Histogram" page in the section "Number of bins and width"*.

I am stuck on a problem with Doane's formula:

1 + log(n) + log(1 + kurtosis(data) * sqrt(n / 6.))

where n is the data size.

The problem is when the kurtosis is negative and n >> 1 because the argument of the log becomes negative.

* (that page has changed since this was posted, link edited to point to the page as it was at the time of posting)

Best Answer

This answer has undergone significant changes as I investigate the Wikipedia page. I've left the answers largely as they were but added to them, so at present this forms a progression of understanding; the last parts are where the best information is.

Short answer: the Wikipedia page (at the time the question was posted) - and the OP's formula, which seems to have been the same - are simply wrong, for at least three different reasons. I'll leave my original discussion (which assumed that the OP and Wikipedia had it right) since that explains some issues. Better discussion follows later. The short advice: simply forget Doane. If you must use it, use what Wikipedia says now (I fixed it).


I believe that formula must refer to excess kurtosis; my reason for that is that it modifies a formula for normal data to account for non-normal data so you'd expect it to reproduce the unmodified one at the normal. It does that if you use excess kurtosis.

That does, however, raise the problem that the term in the log can go negative with large samples (indeed, it's possible to be be $\leq 0$ at quite small $n$). I'd suggest not using it with negative excess kurtosis (I'd never use it beyond unimodality anyway; once things get multimodal you want to apply the excess kurtosis idea to each mode, not smooth over them!), though with mild cases (excess kurtosis just less than 0) and modest sample sizes it won't be a big issue.

I'd also suggest that in any case it's going to give far too few bins at large sample sizes, even when it works as intended.

You may find this paper (by regular CVer Rob Hyndman):

http://www.robjhyndman.com/papers/sturges.pdf

of some interest. If Sturges' argument is wrong, Doane's formula has the same problem... as Rob clearly notes in the paper.

In that paper (and in this answer) he gives a nod to the Freedman-Diaconis rule. In the paper he also points to the approach mentioned by Matt Wand (he refers to the working paper which doesn't seem to be online, but the subsequent paper is available if you have access):

http://www.jstor.org/discover/10.2307/2684697

[Edit: actually a link to the working paper is on the citeseer page]

That approach involves approximately estimating particular functionals in order to get approximately optimal (in terms of mean integrated squared error, MISE) bin widths for estimating the underlying density. While these work well and give many more bins than Sturges or Doane in general, sometimes I still prefer to use more bins still, though it's usually a very good first attempt.

Frankly I don't know why Wand's approach (or at the very least the Freedman-Diaconis rule) isn't a default pretty much everywhere.

R does at least offer the Freedman-Diaconis calculation of the number of bins:

 nclass.FD(rnorm(100))
[1] 11
 nclass.FD(runif(100))
[1] 6
 nclass.FD(rt(100,1))
[1] 71

See ?nclass.FD

Personally, for me that's too few bins in the first two cases at least; I'd double both of those in spite of the fact that it might be a bit noisier than optimal. As $n$ becomes large, I think it does very well in most cases.


Edit 2:

I decided to investigate the skewness vs kurtosis issue that @PeterFlom rightly expressed puzzlement at.

I just had a look at the Doane paper wiso linked to (I'd read it before .... but that was almost 30 years ago) - it makes no reference to kurtosis at all, only to skewness.

Doane's actual formula is: $K_e = \log_2(1+\frac{g_1}{\sigma_{g_1}})$

where $K_e$ is the number of added bins, $g_1$ is the 3rd moment skewness. [Well actually, Doane, following fairly common usage of the time, uses $\sqrt{b_1}$ for the signed (!) 3rd moment skewness (the origin of this particularly unedifying abuse of notation is quite old and I'm not going to pursue it, except to say that it's fortunately appearing much less often now).]

Now at the normal, $\sigma_{g_1} = \sqrt{\frac{6(n-2)}{(n+1)(n+3)}} \approx \sqrt{\frac{6}{n}}$
(although that approximation is pretty poor until $n$ is well past 100; Doane uses the first form).

However, it seems that along the way someone has tried to adapt it to kurtosis (at the time I write this Wikipedia has it in terms of kurtosis, for example, and I don't think they made it up) - but there's clear reason to believe that the formula is simply wrong (note that the standard error used is that final approximation for the s.e. of the skewness I gave above). I think I've seen this use of kurtosis in several places other than Wikipedia, but besides not being in Doane's paper, it isn't present in Scott's paper, nor the Hyndman paper I point to, nor in Wand's paper. It does seem to have come from somewhere, however (i.e. I am sure it's not original to Wikipedia), because Doane doesn't have the approximation to $\sigma_{g_1}$. It looks like it's been played with several times before it ended up there; I'd be interested if anyone tracked it down.

It does look to me like Doane's argument should happily extend to kurtosis, but the correct standard error would have to be used.

However, since Doane relies on Sturges and Sturges' argument seems to be flawed, perhaps the entire enterprise is doomed. In any case I have edited the Histogram talk page on wikipedia noting the error.

---

Edit 3: I have corrected the wikipedia page (but I took the liberty of taking the absolute value of the skewness, otherwise Doane's original formula can't be used for left-skewed distributions as it stood - clearly for number of bins the sign of the skewness is immaterial). Strictly speaking I should have presented the formula in its original (wrong) form, and then explained why it doesn't make sense but I think that's problematic for several reasons - not least that people will be tempted to just copy the formula and ignore an explanation. I believe it actually covers Doane's original intent. In any case it's a vast improvement over the nonsense that was in the original. (Please, anyone who can access the original paper, take a look at it and how $\sqrt{b_1}$ is defined and check my changes on Wikipedia to make sure it's reasonable - there were at least three things wrong - the kurtosis, the standard error, and the wrong base of logs, plus Doane's own small error.)