Solved – general/golden rule for appropriate binning in a histogram

binninghistogram

I was wondering, is there a general rule or a "golden rule" that sets the appropriate bin size as a function of statistical parameters such as sample size, mean, median, mode, standard deviation, etc. when the data is not known to follow a certain distribution?

The reason why I ask is because I have a relatively large data set (about 1.5 million values). However, the range of these values is from 0 to about 0.6. Nothing is known about the distribution of this data (i.e, is it "normal") because the data itself is derived from an experimental process of which its mechanism is not categorized fully as of yet. I am worried that certain bin sizes will encompass values that would distort the shape of my histogram. I understand that eventually you have to create a hardline boundary for your bins as too fine of a bin control can give you a very awkward shape, a shape that may explain too much diversity even though there might not be. However, if the lines are too ambiguous, you could have a very general shape and you could lose information on data stratification.

Here are some statistical parameters for my data set:

Range = ~0.6

Min = 0

Max = ~0.6

St. Dev = 0.063

n = $1.5*10^6$

Best Answer

With 1.5 million observations, the choice of bin size should be irrelevant. In fact, one could use density smoothing estimates to have something like a continuous histogram to represent their data. Regardless, the number of total overall bins should simply be a function of how finely you wish to present these data. 10 bins, visually, can be a lot to take in but can present complicated distributions that are either skewed or multimodal. 6 bins is good for presenting a global mode and ranges.

Related Solutions

Solved – Doane’s formula for histogram binning

This answer has undergone significant changes as I investigate the Wikipedia page. I've left the answers largely as they were but added to them, so at present this forms a progression of understanding; the last parts are where the best information is.

Short answer: the Wikipedia page (at the time the question was posted) - and the OP's formula, which seems to have been the same - are simply wrong, for at least three different reasons. I'll leave my original discussion (which assumed that the OP and Wikipedia had it right) since that explains some issues. Better discussion follows later. The short advice: simply forget Doane. If you must use it, use what Wikipedia says now (I fixed it).

I believe that formula must refer to excess kurtosis; my reason for that is that it modifies a formula for normal data to account for non-normal data so you'd expect it to reproduce the unmodified one at the normal. It does that if you use excess kurtosis.

That does, however, raise the problem that the term in the log can go negative with large samples (indeed, it's possible to be be $\leq 0$ at quite small $n$). I'd suggest not using it with negative excess kurtosis (I'd never use it beyond unimodality anyway; once things get multimodal you want to apply the excess kurtosis idea to each mode, not smooth over them!), though with mild cases (excess kurtosis just less than 0) and modest sample sizes it won't be a big issue.

I'd also suggest that in any case it's going to give far too few bins at large sample sizes, even when it works as intended.

You may find this paper (by regular CVer Rob Hyndman):

http://www.robjhyndman.com/papers/sturges.pdf

of some interest. If Sturges' argument is wrong, Doane's formula has the same problem... as Rob clearly notes in the paper.

In that paper (and in this answer) he gives a nod to the Freedman-Diaconis rule. In the paper he also points to the approach mentioned by Matt Wand (he refers to the working paper which doesn't seem to be online, but the subsequent paper is available if you have access):

http://www.jstor.org/discover/10.2307/2684697

[Edit: actually a link to the working paper is on the citeseer page]

That approach involves approximately estimating particular functionals in order to get approximately optimal (in terms of mean integrated squared error, MISE) bin widths for estimating the underlying density. While these work well and give many more bins than Sturges or Doane in general, sometimes I still prefer to use more bins still, though it's usually a very good first attempt.

Frankly I don't know why Wand's approach (or at the very least the Freedman-Diaconis rule) isn't a default pretty much everywhere.

R does at least offer the Freedman-Diaconis calculation of the number of bins:

 nclass.FD(rnorm(100))
[1] 11
 nclass.FD(runif(100))
[1] 6
 nclass.FD(rt(100,1))
[1] 71

See ?nclass.FD

Personally, for me that's too few bins in the first two cases at least; I'd double both of those in spite of the fact that it might be a bit noisier than optimal. As $n$ becomes large, I think it does very well in most cases.

Edit 2:

I decided to investigate the skewness vs kurtosis issue that @PeterFlom rightly expressed puzzlement at.

I just had a look at the Doane paper wiso linked to (I'd read it before .... but that was almost 30 years ago) - it makes no reference to kurtosis at all, only to skewness.

Doane's actual formula is: $K_e = \log_2(1+\frac{g_1}{\sigma_{g_1}})$

where $K_e$ is the number of added bins, $g_1$ is the 3rd moment skewness. [Well actually, Doane, following fairly common usage of the time, uses $\sqrt{b_1}$ for the signed (!) 3rd moment skewness (the origin of this particularly unedifying abuse of notation is quite old and I'm not going to pursue it, except to say that it's fortunately appearing much less often now).]

Now at the normal, $\sigma_{g_1} = \sqrt{\frac{6(n-2)}{(n+1)(n+3)}} \approx \sqrt{\frac{6}{n}}$
(although that approximation is pretty poor until $n$ is well past 100; Doane uses the first form).

However, it seems that along the way someone has tried to adapt it to kurtosis (at the time I write this Wikipedia has it in terms of kurtosis, for example, and I don't think they made it up) - but there's clear reason to believe that the formula is simply wrong (note that the standard error used is that final approximation for the s.e. of the skewness I gave above). I think I've seen this use of kurtosis in several places other than Wikipedia, but besides not being in Doane's paper, it isn't present in Scott's paper, nor the Hyndman paper I point to, nor in Wand's paper. It does seem to have come from somewhere, however (i.e. I am sure it's not original to Wikipedia), because Doane doesn't have the approximation to $\sigma_{g_1}$. It looks like it's been played with several times before it ended up there; I'd be interested if anyone tracked it down.

It does look to me like Doane's argument should happily extend to kurtosis, but the correct standard error would have to be used.

However, since Doane relies on Sturges and Sturges' argument seems to be flawed, perhaps the entire enterprise is doomed. In any case I have edited the Histogram talk page on wikipedia noting the error.

---

Edit 3: I have corrected the wikipedia page (but I took the liberty of taking the absolute value of the skewness, otherwise Doane's original formula can't be used for left-skewed distributions as it stood - clearly for number of bins the sign of the skewness is immaterial). Strictly speaking I should have presented the formula in its original (wrong) form, and then explained why it doesn't make sense but I think that's problematic for several reasons - not least that people will be tempted to just copy the formula and ignore an explanation. I believe it actually covers Doane's original intent. In any case it's a vast improvement over the nonsense that was in the original. (Please, anyone who can access the original paper, take a look at it and how $\sqrt{b_1}$ is defined and check my changes on Wikipedia to make sure it's reasonable - there were at least three things wrong - the kurtosis, the standard error, and the wrong base of logs, plus Doane's own small error.)

Solved – Appropriate test for detecting a signal in normally distributed noise

The main problem I had to solve here was choosing the appropriate bin size.

I found this paper by David Scott which gave me a useful formula.

$h = 3.49 s n^{-1/3}$

where s is standard deviation and n is the number of samples.

This one worked quite well for me and was derived assuming normally distributed data. I also tried deriving an equation based on a Laplace distribution but the results were not as good in my case.

Best Answer

Related Solutions

Solved – Doane’s formula for histogram binning

Solved – Appropriate test for detecting a signal in normally distributed noise

Related Question