Histogram Bins – Calculating Optimal Number of Bins

histogramrule-of-thumb

I'm interested in finding as optimal of a method as I can for determining how many bins I should use in a histogram. My data should range from 30 to 350 objects at most, and in particular I'm trying to apply thresholding (like Otsu's method) where "good" objects, which I should have fewer of and should be more spread out, are separated from "bad" objects, which should be more dense in value. A concrete value would have a score of 1-10 for each object. I'd had 5-10 objects with scores 6-10, and 20-25 objects with scores 1-4. I'd like to find a histogram binning pattern that generally allows something like Otsu's method to threshold off the low scoring objects. However, in the implementation of Otsu's I've seen, the bin size was 256, and often I have many fewer data points that 256, which to me suggests that 256 is not a good bin number. With so few data, what approaches should I take to calculating the number of bins to use?

Best Answer

The Freedman-Diaconis rule is very robust and works well in practice. The bin-width is set to $h=2\times\text{IQR}\times n^{-1/3}$. So the number of bins is $(\max-\min)/h$, where $n$ is the number of observations, max is the maximum value and min is the minimum value.

In base R, you can use:

hist(x, breaks="FD")

For other plotting libraries without this option (e.g., ggplot2), you can calculate binwidth as:

bw <- 2 * IQR(x) / length(x)^(1/3)

### for example #####
ggplot() + geom_histogram(aes(x), binwidth = bw)