Solved – good number of bins for logarithmic bin width

binninghistogram

I was wondering how to estimate a good number of bins for my histogram.

I know quite certainly that my data is well approximated by a LogNormal distribution. Previous studies have used logarithmic bins, and I was planning to do the same.

First question:

Intuitively it seems to make sense to use logarithmic bins for a LogNormal distribution is there some quantitative reason why people use log bins?

Second question:

What is a method I can use to find a good number of bins? I'd usually fall back to Freedman Diakonis, but since it applies to constant bin widths I dont know if k~n^(1/3) is applicable in this case?

Also, I'm a humpy-dumpy astronomer that just tries to do things not too wrong, please don't give a too technical answer :). I would also appreciate cite-able reading links!

Best Answer

Number of bins. Section 3.2 of this Wikipedia page (as of 2 Aug 2019) lists several formulas for the number of bins in a histogram based on sample size. Statistical software programs use a variety of rules--especially Freedman-Diaconis, Sturges, and square-root. (Wikipedia gives brief rationales for some of the formulas.) Some programs make it easy for the user to specify the approximate number of bins.

In practice, I think it is impossible for any general algorithm to pick the 'best' number of bins to give an 'optimum' description of a particular dataset. Regardless of the formula used, most programs alter the number of bins actually used for a particular dataset, so that bin endpoints are 'round' or 'convenient' numbers. When you are making a graphic for a report or publication, it is a good idea to try different numbers of bins, perhaps half and double the number chosen by software. You want to have enough bins to show unexpected "dips or bumps" in the parent distribution, but not so many bins that a lot of them are empty.

All bins the same width? Generally, it is best to make all bins of the same width. This makes it easier to interpret the vertical scale of a histogram. However, there are important exceptions to this general preference. Histograms of (right skewed) lognormal data may present one of the exceptions.

A lognormal distribution becomes normal upon taking logs. (All logarithms on this page are natural logs, base $e.)$ If your histogram is based on logged data, then it probably makes sense to use bins of equal widths. However, if you are plotting your histogram on the original lognormal scale, it may be best to have narrower bins toward the left and wider ones toward the right.

Here some data with $n = 1000$ observations $Y_i$ from the lognormal distribution with parameters $\mu = 4, \sigma = 1,$ with $X_i$ from the corresponding normal distribution. (It is customary to use the normal mean and variance as parameters for the lognormal distribution.)

set.seed(1234)  # for reproducibility
x = rnorm(1000, 4, 1); y = exp(x)
summary(x)  # normal
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.6039  3.3267  3.9602  3.9734  4.6158  7.1959 
summary(y)  # lognormal
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   1.829   27.848   52.468   87.719  101.071 1333.952 

In the figure below, the first row shows the $X_i$ plotted on a frequency scale (The height of each bar is the number of the observations in the interval represented by that bar.) Then the $Y_i$ are plotted similarly. Such frequency histograms make sense only when bin widths are equal.

The last histogram in the first row shows the lognormal $Y_i$ observations plotted with unequal interval lengths. For that histogram, R switches automatically to 'density' mode, where the area of each bar represents the proportion of the observation in its associated interval. I chose the bin boundaries at the quintiles of the observed $Y_i.$ This gives five intervals, with enough observations in the last one to show at the resolution of the plot. (I don't claim this is a general rule for selecting irregular bin boundaries that will give a useful graphical description for every lognormal sample.)

enter image description here

All of the histogram in the second row use the density scale on the vertical axis. Because density histograms use areas to represent proportions, it is possible to compare a density histogram with population density (when known) or the best-fitting density function when parameters are estimated. Density functions are shown as solid red curves. (If bin widths are equal, the parameter prob=T is used in the code hist in order to make a density histogram.) In some applications, it may be argued that a histogram with unequal bin widths gives the appearance of a better fit to the lognormal population distribution.


Note: The R code for the 6-panel plot above is shown below, in case it is of interest.

par(mfrow=c(2,3))  # enable 6-panel plot

 hist(x, col="skyblue2")              # R's default bin boundaries used
 hist(y, col="skyblue2")
 cutp = cutp = c(0,20,40,70,120,1500) # selected bin boundaries
  hist(y, br=cutp, ylim=c(0,.01), col="skyblue2")

 hist(x, prob=T, col="skyblue2")
  curve(dnorm(x,4,1), add=T, col="red", lwd=2)
 hist(y, prob=T, ylim=c(0,.01), col="skyblue2")
  curve(dlnorm(x,4,1), add=T, col="red", lwd=2, n=10001)
 cutp = c(0,30,100, 200,1500)
  hist(y, br=cutp, ylim=c(0,.01), col="skyblue2")
   curve(dlnorm(x,4,1), add=T, col="red", lwd=2, n=10001)

par(mfrow=c(1,1))  # return to default single-panel plotting

Addendum: Here are some details for making a plot when the parameters of the lognormal distribution are not known and must be estimated from data. The estimated density may be grossly inaccurate of $n$ is too small.

set.seed(1234)  # same sample as above
x = rnorm(1000, 4, 1); y = exp(x)
# estimate parameters
mu.hat = mean(log(y));  sg.hat = sd(log(y))
# find mode of estimated density, and height at mode
md = exp(mu.hat-sg.hat^2);  mx = dlnorm(md, mu.hat, sg.hat)
# use quintiles as endpoints of bins
cutp = quantile(y, c(0:5)/5)

hist(y, br=cutp, ylim=c(0,mx), col="skyblue2")
 curve(dlnorm(x, mu.hat, sg.hat), add=T, lwd=2, col="red", n=10001)

enter image description here

# use deciles as endpoints of bins
cutp = quantile(y, c(0:10)/10)
hist(y, br=cutp, ylim=c(0,mx), col="skyblue2", 
     main="10 Bins Based on Deciles")
curve(dlnorm(x, mu.hat, sg.hat), add=T, lwd=2, col="red", n=10001)

enter image description here