[Math] Histogram, box plot and probability plot – which is better for assessing normality

descriptive-statistics, probability-distributions, statistics

Which of the three methods (histogram, box plot, or probability plot) is best for judging whether a distribution is approximately normal? Why?

Best Answer

Normal probability plots: The main purpose of a normal probability plot (normal Q-Q plot) is to assess normality. Here are plots, each of $n = 500$ observations, from uniform, normal, and Laplace (double-exponential) families, respectively. Only the normal sample shows points along a reasonably straight line in its normal probability plot. Of the three kinds of graphs, a normal probability plot is the one most directly relevant to assessing normality.

[Figure: normal probability plots of the uniform, normal, and Laplace samples]
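As a rough sketch of what a normal probability plot computes, the Python below pairs each ordered observation with a standard normal quantile and summarizes straightness with the correlation of those pairs. It is not the code behind the figures above: the function names, the seed, the plotting positions $(i - 0.5)/n$, and the choice of Uniform(0, 1), standard normal, and unit-scale Laplace samples are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean


def draw_samples(n=500, seed=0):
    """Simulate one sample from each family (parameters are illustrative)."""
    rng = random.Random(seed)
    return {
        "uniform": [rng.random() for _ in range(n)],
        "normal": [rng.gauss(0.0, 1.0) for _ in range(n)],
        # The difference of two independent Exp(1) draws is Laplace with scale 1.
        "laplace": [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)],
    }


def normal_qq_points(sample):
    """Return (theoretical normal quantile, ordered observation) pairs."""
    n = len(sample)
    std_normal = NormalDist()
    # (i - 0.5) / n is one common plotting-position convention among several.
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(sorted(sample), start=1)
    ]


def qq_correlation(sample):
    """Pearson correlation of the Q-Q points; values near 1 mean a nearly straight plot."""
    points = normal_qq_points(sample)
    qs = [q for q, _ in points]
    xs = [x for _, x in points]
    mq, mx = mean(qs), mean(xs)
    sxy = sum((q - mq) * (x - mx) for q, x in points)
    sqq = sum((q - mq) ** 2 for q in qs)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / (sqq * sxx) ** 0.5


if __name__ == "__main__":
    for name, data in draw_samples().items():
        print(name, round(qq_correlation(data), 4))
```

The correlation is only a one-number summary. The plot itself also shows how the points depart from a line: an S-shape for the uniform sample, and tails that bend away from the line for the Laplace sample.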

Boxplots: The major purposes of boxplots are to show quartiles and, if any are present, outliers. The boxplots below are for the same three datasets as above. All three distributions are symmetrical, and their respective boxplots are almost symmetrical. First and third quartiles (ends of boxes) become closer together as we scan from left to right.

In a boxplot, outliers are plotted individually as dots. A uniform distribution has no 'tails', and outliers are rare. A normal distribution has long thin tails, and a boxplot of a moderately large sample will typically show a few outliers (in each tail). A Laplace distribution has heavy tails, and it is rare for a boxplot not to show many outliers.

If a boxplot shows many far outliers or if the whiskers are greatly different in length, then the population from which the sample came is unlikely to be normal. However, boxplots may be the weakest of the three kinds of plots in assessing normality. (They are better at showing that a sample is not normal than at confirming that it is.)

[Figure: boxplots of the uniform, normal, and Laplace samples]
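The outlier behaviour described for boxplots can be checked numerically. The sketch below assumes the common 1.5 × IQR fence rule for flagging outliers (boxplot software may compute quartiles slightly differently) and uses simulated samples with an arbitrary seed. The function names are hypothetical, and these are not the datasets shown in the figures.

```python
import random
from statistics import quantiles


def draw_samples(n=500, seed=1):
    """Simulate one sample from each family (parameters are illustrative)."""
    rng = random.Random(seed)
    return {
        "uniform": [rng.random() for _ in range(n)],
        "normal": [rng.gauss(0.0, 1.0) for _ in range(n)],
        # The difference of two independent Exp(1) draws is Laplace with scale 1.
        "laplace": [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)],
    }


def boxplot_outliers(sample):
    """Return the observations outside the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(sample, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in sample if x < lower or x > upper]


if __name__ == "__main__":
    for name, data in draw_samples().items():
        print(name, len(boxplot_outliers(data)), "outliers")
```

For a Uniform(0, 1) sample of this size the fences fall well outside the interval $[0, 1]$, so no points can be flagged, while the Laplace sample should show many more flagged points than the normal one.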

Histograms: Below we show histograms of the three samples along with the respective density functions of their populations. Especially for small samples, important information can be lost when data are sorted into histogram bins. Even with our moderately large samples, the shape of the histogram is not necessarily a close match with the shape of the population distribution. Nevertheless, of the three kinds of graphical descriptions, histograms may be second-best (to normal probability plots) for assessing normality.

[Figure: histograms of the uniform, normal, and Laplace samples with their population density functions]
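To see how binning affects the comparison between a histogram and a density curve, the sketch below builds a density-scaled histogram by hand and measures its largest gap from a normal density at the bin midpoints. Two assumptions to note: the bins are equal-width over the sample range, and the reference curve is a normal density fitted from the sample mean and standard deviation rather than the known population density used in the figures. The function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def density_histogram(sample, bins):
    """Return ([(bin midpoint, density height), ...], bin width) for equal-width bins."""
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        # The maximum lands exactly on the right edge; place it in the last bin.
        k = min(int((x - lo) / width), bins - 1)
        counts[k] += 1
    n = len(sample)
    bars = [(lo + (k + 0.5) * width, c / (n * width)) for k, c in enumerate(counts)]
    return bars, width


def max_gap_from_fitted_normal(sample, bins):
    """Largest |bar height - fitted normal density| over the bin midpoints."""
    fitted = NormalDist(mean(sample), stdev(sample))
    bars, _ = density_histogram(sample, bins)
    return max(abs(height - fitted.pdf(mid)) for mid, height in bars)


if __name__ == "__main__":
    rng = random.Random(2)
    normal_sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
    # The same data can look closer to or further from the curve as bins change.
    for bins in (5, 10, 20, 40):
        print(bins, round(max_gap_from_fitted_normal(normal_sample, bins), 3))
```

Running it with several bin counts on the same data gives a feel for how much the apparent fit depends on the binning choice rather than on the data alone.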
