I'm interested in finding as optimal a method as I can for determining how many bins to use in a histogram. My data will range from 30 to 350 objects at most, and in particular I'm trying to apply thresholding (like Otsu's method), where "good" objects, which I should have fewer of and which should be more spread out, are separated from "bad" objects, which should be denser in value. A concrete example: each object has a score of 1-10; I'd have 5-10 objects with scores 6-10 and 20-25 objects with scores 1-4. I'd like to find a histogram binning pattern that generally allows something like Otsu's method to threshold off the low-scoring objects. However, in the implementations of Otsu's method I've seen, the number of bins was 256, and I often have far fewer data points than that, which suggests to me that 256 is not a good bin count. With so few data points, what approaches should I take to calculating the number of bins to use?
Histogram Bins – Calculating Optimal Number of Bins
Tags: histogram, rule-of-thumb
Related Solutions
There are lots of possible data sets that could generate these summary bins, so it's impossible to be exact, but you can make reasonable guesses.
One way to get subinterval estimates is to create a function that gives the number of people at each income level. The easiest approach, and perhaps the best (it makes the simplest assumptions), is to connect known points and interpolate between them. You don't really have known points, but I used the points (x = median, y = intervalCount/intervalWidth). There's not much difference between the mean and median in this data set, which suggests the data values are pretty well behaved in each interval.
Once you have such a function, you can integrate it between any two points to estimate the count in any subinterval.
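For instance, here is a minimal R sketch of that idea; the interval endpoints, counts, and medians below are made-up placeholders, since the actual survey data isn't reproduced here:

intervals <- data.frame(
  lo     = c(1, 200, 400, 600, 800),     # hypothetical income intervals
  hi     = c(199, 399, 599, 799, 999),
  count  = c(120, 310, 250, 130, 60),    # people in each interval
  median = c(110, 295, 490, 690, 880)    # reported interval medians
)
# Piecewise-linear density through (median, count/width); flat beyond the ends
dens <- approxfun(intervals$median,
                  intervals$count / (intervals$hi - intervals$lo + 1),
                  rule = 2)
# Estimated number of people with incomes between a and b
count_between <- function(a, b) integrate(dens, a, b)$value
count_between(250, 450)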
I left out the 0-0 interval, because its value is literally off the chart, and the 1000+ interval, because it has no real width.
Since the data clearly don't follow any traditional distribution, a local smoother is a decent way to smooth them out. Here's a spline smoother:
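As a rough R sketch, reusing the hypothetical intervals data frame from above (df = 4 is an arbitrary smoothness choice, not a value from the original analysis):

fit <- smooth.spline(intervals$median,
                     intervals$count / (intervals$hi - intervals$lo + 1),
                     df = 4)
curve(predict(fit, x)$y, from = 1, to = 999,
      xlab = "income", ylab = "people per unit income")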
It does better at the tail, but is perhaps too smooth at the beginning.
The 100-119 interval looks high in both populations. It could be due to a propensity for people to round up to 100 when answering the survey.
As far as truth in graphics goes, it's best to just plot the data you actually have, which is the intervals. It might be useful to show the means/medians, but they only depart from the middle for the high ranges, which might be worth separate study.
We can try to double our bin count by considering the medians. Theoretically, the median divides each interval into two intervals of equal population (two bars of equal area but possibly different heights). However, the breakdown is not so clean because of possible ties and fractional medians. Here it is with interval widths of (median - lo) and (hi - median + 1) (each full interval width is (hi - lo + 1)).
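A minimal R sketch of that split, again using the hypothetical intervals data: it simply puts half of each interval's count on either side of the median, ignoring ties and fractional medians.

halves <- with(intervals, data.frame(
  lo    = c(lo, median),      # first halves [lo, median-1], second halves [median, hi]
  hi    = c(median - 1, hi),  # widths (median - lo) and (hi - median + 1)
  count = rep(count / 2, 2)   # half of each interval's population per side
))
halves$height <- halves$count / (halves$hi - halves$lo + 1)  # bar heights (count per unit width)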
Simulation (Sheldon Ross) gives an algorithm in section 4.6 by noting that an outcome $X_1,\ldots,X_K$ can be generated in sequence because $X_1$ has a binomial distribution and each conditional distribution $X_i | X_{i-1}, \ldots, X_1$ is also binomial.
Specifically, if the $K$ bin probabilities are $p_1, p_2, \ldots, p_K$, draw $x_1$ from a binomial$(N,p_1)$ distribution, then draw $x_2$ from a binomial$(N-x_1, p_2/(1-p_1))$ distribution, ..., $x_{i+1}$ from a binomial$(N-x_1-\cdots-x_i, p_{i+1}/(1-p_1-\cdots-p_i))$ distribution, etc.
Here is a (recursive) Mathematica implementation.
ClearAll[f];
(* Allocate n items among bins with probabilities p: draw the first bin's
   count from a binomial, then recurse on the remaining bins with
   renormalized probabilities. *)
f[n_, p_] /; Length[p] >= 2 :=
  With[{x = RandomInteger[BinomialDistribution[n, First[p]]]},
    {x, f[n - x, Rest[p]/(Plus @@ Rest[p])]}];
f[0, p_] := 0 p;                  (* nothing left to allocate: all zeros *)
f[n_, p_] /; Length[p] == 1 := n; (* one bin left: it receives everything *)
Example of use:
x = Block[{p = 1/Range[500], $RecursionLimit = 2000},
    f[10^30, p/(Plus @@ p)] // Flatten (* Flatten undoes the nested recursive output *)
    ];
(This takes about 4.7 seconds.)
When the expectations are all greater than 10 or so you can do this much faster by approximating the multinomial as a (degenerate) multivariate normal distribution.
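Here is a hedged R sketch of that shortcut (the function rmultinom_normal is my name for it; it drops the last cell so the covariance matrix is nonsingular and lets that cell absorb the rounding error):

library(MASS)  # for mvrnorm

rmultinom_normal <- function(N, p) {
  k     <- length(p)
  mu    <- N * p[-k]
  Sigma <- N * (diag(p[-k], k - 1, k - 1) - tcrossprod(p[-k]))  # multinomial covariance
  z <- pmax(0, round(mvrnorm(1, mu, Sigma)))  # sketch: ignores the small chance of overshooting N
  c(z, N - sum(z))                            # last cell makes the counts total N
}

# 500 cells with p proportional to 1/i, as in the Mathematica example;
# N = 1e6 keeps every expected count comfortably above 10
p <- 1 / seq_len(500); p <- p / sum(p)
x <- rmultinom_normal(1e6, p)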
Best Answer
The Freedman-Diaconis rule is very robust and works well in practice. The bin width is set to $h = 2\times\text{IQR}\times n^{-1/3}$, where $n$ is the number of observations. The number of bins is then $\lceil(\max-\min)/h\rceil$, where max and min are the largest and smallest data values.
In base R, you can use:
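hist(x, breaks = "FD")  # "FD" selects the Freedman-Diaconis rule (grDevices::nclass.FD)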
For other plotting libraries without this option (e.g., ggplot2), you can calculate the binwidth as:
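bw <- 2 * IQR(x) / length(x)^(1/3)  # x is your data vector

For example, passing it to ggplot2 (df with a column x stands in for your data frame):

library(ggplot2)
ggplot(df, aes(x)) + geom_histogram(binwidth = bw)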