Solved – general/golden rule for appropriate binning in a histogram

binning, histogram

Is there a general rule or "golden rule" that sets the appropriate bin size as a function of statistical parameters such as sample size, mean, median, mode, standard deviation, etc., when the data is not known to follow a particular distribution?

The reason I ask is that I have a relatively large data set (about 1.5 million values) whose values range from 0 to about 0.6. Nothing is known about the distribution of the data (e.g., whether it is normal), because it comes from an experimental process whose mechanism has not yet been fully characterized. I am worried that a poorly chosen bin size will distort the shape of my histogram. I understand that bin boundaries have to be drawn somewhere: bins that are too fine give a noisy, awkward shape that suggests more structure than is really there, while bins that are too coarse give a very general shape and can hide stratification in the data.

Here are some statistical parameters for my data set:

Range = ~0.6

Min = 0

Max = ~0.6

St. Dev = 0.063

n = $1.5 \times 10^6$

Best Answer

With 1.5 million observations, the choice of bin size should be largely irrelevant: at any reasonable bin width, most bins hold enough observations that the overall shape of the histogram is stable. In fact, one could use a smoothed density estimate (such as a kernel density estimate) to get something like a continuous histogram of the data. Beyond that, the total number of bins is simply a function of how finely you wish to present the data. Ten bins can be a lot to take in visually, but they can represent complicated distributions that are skewed or multimodal. Six bins are good for presenting a global mode and ranges.
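For readers who still want a rule that depends on the sample size and standard deviation, here is a minimal sketch. It is not part of the answer above: it uses Scott's normal-reference rule, $h \approx 3.49\,\sigma\,n^{-1/3}$, which is one common choice among several and formally assumes roughly normal data. The summary statistics are the ones quoted in the question. The Beta-distributed `sample` is a purely hypothetical stand-in for the real measurements, and the helper names `scott_bin_width` and `bin_counts` are made up for illustration.

```python
import math
import numpy as np


def scott_bin_width(std, n):
    """Scott's normal-reference rule: h = 3.49 * sigma * n**(-1/3)."""
    return 3.49 * std * n ** (-1.0 / 3.0)


def bin_counts(values, bins, upper):
    """Histogram counts over the fixed range [0, upper]."""
    counts, _edges = np.histogram(values, bins=bins, range=(0.0, upper))
    return counts


# Summary statistics quoted in the question.
n, std, data_range = 1_500_000, 0.063, 0.6

h = scott_bin_width(std, n)
n_bins_scott = math.ceil(data_range / h)
print(f"Scott bin width: {h:.5f}, number of bins: {n_bins_scott}")

# Hypothetical stand-in data on [0, 0.6]; the real distribution is unknown.
rng = np.random.default_rng(0)
sample = rng.beta(2.0, 18.0, size=n) * data_range

# Compare the coarse bin counts from the answer with the rule-based count.
for bins in (6, 10, n_bins_scott):
    counts = bin_counts(sample, bins, data_range)
    print(f"{bins:4d} bins: fullest bin holds {counts.max()} values")

# NumPy can also pick the edges itself from the data ("scott", "fd", "sturges", ...).
edges_fd = np.histogram_bin_edges(sample, bins="fd", range=(0.0, data_range))
print(f"Freedman-Diaconis suggests {len(edges_fd) - 1} bins for the stand-in data")
```

With the quoted statistics the rule gives a bin width near 0.002, i.e. a few hundred bins over the range. That is consistent with the point above: with this much data, even fairly fine bins remain well populated over the bulk of the distribution, so the bin count is mostly a presentation choice.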
