Rules of thumb to choose an initial number of class intervals and refine that choice (potentially automatically)

Tags: descriptive-statistics, histogram, rule-of-thumb

I was wondering if there are established rules of thumb (or algorithms) that, given a set of observations, can help:

choose an initial number of class intervals.
refine that choice to a better number.

I have found mention of using $\sqrt{N}$, where $N$ is the number of observations, as an initial guess for the number of class intervals.

Thanks in advance.

Best Answer

The R help page for the bin-count functions used by hist (http://stat.ethz.ch/R-manual/R-patched/library/grDevices/html/nclass.html) gives references to algorithms for computing the number of bins:

Sturges, H. A. (1926) The choice of a class interval. Journal of the American Statistical Association 21, 65–66.

Scott, D. W. (1979) On optimal and data-based histograms. Biometrika 66, 605–610.

Freedman, D. and Diaconis, P. (1981) On the histogram as a density estimator: L_2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57, 453–476.
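
To get a feel for how these rules differ, here is a minimal sketch that compares them on simulated data. The sample `x`, its size and the seed are my own assumptions for illustration; the `nclass.*` functions are the ones documented on the help page above, and the square-root rule is the one mentioned in the question.

```r
# Compare the bin counts suggested by several rules of thumb.
set.seed(1)            # arbitrary seed (assumption, for reproducibility only)
x <- rnorm(200)        # hypothetical sample; substitute your own data
n <- length(x)

bins <- c(
  sqrt    = ceiling(sqrt(n)),              # square-root rule: 15 for n = 200
  sturges = grDevices::nclass.Sturges(x),  # ceiling(log2(n) + 1): 9 for n = 200
  scott   = grDevices::nclass.scott(x),    # bin width based on 3.5 * sd * n^(-1/3)
  fd      = grDevices::nclass.FD(x)        # bin width based on 2 * IQR * n^(-1/3)
)
print(bins)
```

Sturges and the square-root rule depend only on the sample size, whereas Scott and Freedman-Diaconis also use the spread of the data, so they are the natural candidates for refining an initial guess automatically.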

Related Solutions

“When to use boxplot and when barplot” rules (of thumb?)

Specifically for graphical illustration of ANOVA:

A box plot or bar chart is much better than nothing graphically for ANOVA, but as commonly plotted, both are indirect or incomplete as a graphical summary.
ANOVA is about comparisons of means in a context of variations of one or more kinds, so the most appropriate graphic would show, minimally, means as well as the raw data. Group standard deviations (SDs) or related quantities would do no harm.
Although some varieties of box plots show means as well as medians, the standard kind shows medians, quartiles and some information in the tails of the distribution. The most common variant seems to be that in which individual data points are shown if and only if they lie more than 1.5 IQR away from the nearer quartile. That is: interquartile range IQR $=$ upper quartile $-$ lower quartile, so plot as points values greater than upper quartile $+$ 1.5 IQR or less than lower quartile $-$ 1.5 IQR. Such a convention can be helpful for showing gross outliers which may be problematic for ANOVA, but neither medians nor quartiles play any part in ANOVA, and whether medians approximate means is a point to be checked, not assumed. Commonly, experienced data analysts take, for example, pronounced outliers and/or marked asymmetry of distribution as a sign of a problem that needs action, such as transformation of the data or a generalized linear model with a non-identity link function. Nevertheless, it is surprising how many textbook and other accounts show box plots when an ANOVA is being presented but don't mention the elephants not in the room, the means that are not plotted.
Conversely, the most common kind of bar chart in this context summarizes data by means and SDs or standard errors, but omits any display of individual data points otherwise. So, for example, outliers or marked asymmetry can only be inferred from out-of-line means or inflated variability within individual groups.
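
The 1.5 IQR convention described above can be sketched in a few lines. The data `y` (including the deliberately planted value 25) and the seed are assumptions for illustration. Note also that `quantile()` with its default settings may give slightly different quartiles from the hinges that `boxplot()` uses.

```r
# Flag points lying more than 1.5 IQR beyond the nearer quartile.
set.seed(2)                                  # arbitrary seed (assumption)
y <- c(rnorm(30, mean = 10, sd = 1), 25)     # hypothetical data plus one gross outlier

q     <- quantile(y, c(0.25, 0.75))          # lower and upper quartiles
iqr   <- unname(q[2] - q[1])                 # interquartile range
lower <- unname(q[1]) - 1.5 * iqr            # lower fence
upper <- unname(q[2]) + 1.5 * iqr            # upper fence

flagged <- y[y < lower | y > upper]          # points a box plot would show individually
print(flagged)

# The mean is what ANOVA works with; compare it with the median.
print(c(mean = mean(y), median = median(y)))
```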

Generally, there are many suggestions of which kinds of graphs are useful but little consensus about which are best. I'd suggest as criteria that a good graph shows

The complete pattern of variation in the data, at least as backdrop or context
Relevant summaries of the data, specifically those relevant to the model being entertained or the descriptors being considered
Indications of possible problems with the data that cast doubt on assumptions being made.

There are several designs that help with ANOVA, such as dot or strip plots with added means and SEs.
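
As one possible version of such a design, here is a hedged sketch of a strip plot with group means and standard errors overlaid, using base R graphics. The three groups, their means and the seed are invented for illustration.

```r
# Strip plot of the raw data with group means and +/- 1 SE bars added.
set.seed(3)                                        # arbitrary seed (assumption)
g <- factor(rep(c("A", "B", "C"), each = 20))      # hypothetical grouping factor
y <- rnorm(60, mean = c(10, 12, 15)[g], sd = 2)    # hypothetical response

m  <- tapply(y, g, mean)                                  # group means
se <- tapply(y, g, function(v) sd(v) / sqrt(length(v)))   # group standard errors

stripchart(y ~ g, vertical = TRUE, method = "jitter",
           pch = 1, col = "grey40", ylab = "y")    # all data points, jittered
pos <- seq_along(m)
points(pos, m, pch = 19, col = "red")              # means
arrows(pos, m - se, pos, m + se,
       angle = 90, code = 3, length = 0.05, col = "red")  # SE bars
```

This shows the complete data as context, the summaries that ANOVA actually compares, and any outliers or asymmetry within groups.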

This paper by John Tukey explains the difference between propaganda graphs and analytical graphs that is pertinent here. Too many graphical illustrations of ANOVA are propaganda graphs (look! the groups are very different) without much analysis (and what else can we learn about the data or the limitations of the technique in this application?).

Scott’s and Freedman–Diaconis rules of the thumb for selecting bin width – disadvantages

Comment continued. Here is a mixture of three normal samples (each of size 50) with means sufficiently far apart, relative to their standard deviations, to show separate modes. The default binning in R provides a histogram that does find the modes. The default KDE in R (with the default bandwidth) roughly matches the three modes (at 12, 18, and 25).

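# Mixture of three normal samples (50 each) with means 12, 18 and 25
set.seed(930)
x <- c(rnorm(50, 12, 2), rnorm(50, 18, 2), rnorm(50, 25, 2))
hist(x, prob = TRUE, col = "skyblue2"); rug(x)   # default binning, with data ticks
lines(density(x), col = "red", lwd = 2)          # default KDE overlaid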
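
To see how the choice of rule plays out on this mixture, a rough sketch like the following counts the bins that `hist` actually uses under each named rule. The data are the same as above; the helper vector `rules` is my own name. Keep in mind that `hist` treats the computed number only as a suggestion and moves the breakpoints to "pretty" values, so the counts need not equal what the `nclass.*` functions return.

```r
# Number of bins hist() ends up using under each rule, for the trimodal mixture.
set.seed(930)
x <- c(rnorm(50, 12, 2), rnorm(50, 18, 2), rnorm(50, 25, 2))

rules  <- c("Sturges", "Scott", "FD")       # rule names accepted by hist(breaks = )
counts <- sapply(rules, function(r) {
  h <- hist(x, breaks = r, plot = FALSE)    # compute the histogram without drawing it
  length(h$counts)                          # number of bins actually used
})
print(counts)
```

Plotting the three histograms side by side, rather than only counting bins, is the better check of whether each rule still separates the three modes.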
