Data Visualization – Is Plotting the Mean in a Histogram Appropriate?

Tags: data-visualization, histogram, mean, self-study

Is it "okay" to add a vertical line to a histogram to visualize the mean value?

It seems okay to me, but I've never seen this in textbooks and the like, so I'm wondering if there's some sort of convention not to do that.

The graph is for a term paper; I just want to make sure I don't accidentally break some super important unspoken stats rule. 🙂

Best Answer

Of course, why not?

[Image: histogram with mean]

Here's an example (one of dozens I found with a simple Google search):

[Image: histogram with mean and median]

(Image source is the Measuring Usability blog, here.)

I've seen means, means plus or minus a standard deviation, various quantiles (like median, quartiles, 10th and 90th percentiles) all displayed in various ways.
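In R, for instance, these summaries can be overlaid with a few calls to abline. This is only a minimal sketch: the built-in faithful data is an assumption standing in for your own variable, and the colours, line types and choice of percentiles are arbitrary.

```r
# Assumption: built-in 'faithful' data used purely for illustration.
x <- faithful$waiting
m <- mean(x)
s <- sd(x)
q <- quantile(x, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))

hist(x, breaks = 30, main = "Waiting time between eruptions",
     xlab = "Waiting time (minutes)")
abline(v = m, col = "red", lwd = 2)                 # mean
abline(v = c(m - s, m + s), col = "red", lty = 2)   # mean +/- 1 SD
abline(v = q, col = "blue", lty = 3)                # selected percentiles
legend("topleft", bty = "n", lty = c(1, 2, 3),
       col = c("red", "red", "blue"),
       legend = c("mean", "mean +/- 1 SD", "percentiles (10, 25, 50, 75, 90)"))
```

Whichever summaries you draw, a legend or caption saying which line is which keeps the plot readable.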

Instead of drawing a line right across the plot, you might mark information along the bottom of it - like so:

[Image: histogram with marginal boxplot]

There's an example (one of many to be found) with a boxplot across the top instead of at the bottom, here.

Sometimes people mark in the data:

[Image: histogram with jittered rug plot]
(I have jittered the data locations slightly because the values were rounded to integers and you couldn't see the relative density well.)

There's an example of this kind, done in Stata, on this page (see the third one here).

Histograms are better with a little extra information - they can be misleading on their own.

You just need to take care to explain what your plot consists of! (You'd want a better title and x-axis label than I used here, for starters, plus a figure caption explaining what you have marked on it.)

--

One last plot:

[Image: histogram with stripchart]

--

My plots are generated in R.

Edit:

As @gung surmised, abline(v=mean... was used to draw the mean-line across the plot and rug was used to draw the data values (though I actually used rug(jitter(... because the data was rounded to integers).
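A minimal sketch of those two calls together, in case it helps. The built-in faithful data is an assumption standing in for the data used in the plots above (its waiting times are also recorded as whole numbers, so the jitter serves the same purpose), and the titles and colour are arbitrary.

```r
# Assumption: built-in 'faithful' data stands in for the answer's own data.
x <- faithful$waiting            # waiting times, recorded as whole minutes

hist(x, breaks = 30, main = "Waiting time between eruptions",
     xlab = "Waiting time (minutes)")
abline(v = mean(x), col = "red", lwd = 2)   # vertical line at the mean
rug(jitter(x))    # data values along the axis, jittered to separate ties
```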

Here's a way to do the boxplot in between the histogram and the axis:

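# Davis2 is created by the data-construction code further down
hist(Davis2[,2],breaks=30)   # column 2 is weight
# add=TRUE draws on the existing histogram; at places the box just below y = 0
boxplot(Davis2[,2],
  add=TRUE,horizontal=TRUE,at=-0.75,border="darkred",boxwex=1.5,outline=FALSE)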

I'm not going to explain every argument, but you can look them up in the help (?boxplot) and play with them yourself.

However, it's not a general solution - I don't guarantee it will always work as well as it does here (note I already changed the at and boxwex options*). If you don't write an intelligent function to take care of everything, it's necessary to pay attention to what everything does to make sure it's doing what you want.

Here's how to create the data I used (I was trying to show how Theil regression was really able to handle several influential outliers). It just happened to be data I was playing with when I first answered this question.

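 library("car")   # attaches carData, which provides the Davis data set
 # two extra rows with extreme weight values and missing heights
 add <- data.frame(sex=c("F","F"),
       weight=c(150,130),height=c(NA,NA),repwt=c(55,50),repht=c(NA,NA))
 Davis2 <- rbind(Davis,add)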

* An appropriate value for at is around -0.5 times the value of boxwex; that would be a good default if you write a function to do it. boxwex would need to be scaled in a way that relates to the y-scale (height) of the plot; I'd suggest 0.04 to 0.05 times the upper y-limit might often be okay.
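As a rough sketch of what such a function might look like: hist_with_boxplot and its scale argument are hypothetical names, not from any package, and the tallest bar count is used as a stand-in for the upper y-limit. The built-in faithful data is again only an assumption for illustration. It simply applies the defaults suggested in the footnote, so the same caveat holds: it won't suit every data set.

```r
# Hypothetical helper (not from any package): histogram with a horizontal
# boxplot drawn between the bars and the x-axis.
hist_with_boxplot <- function(x, breaks = 30, scale = 0.05, ...) {
  x <- x[!is.na(x)]
  h <- hist(x, breaks = breaks, ...)
  # Assumption: the tallest bar approximates the upper y-limit of the plot.
  boxwex <- scale * max(h$counts)   # box width tied to the plot height
  at <- -0.5 * boxwex               # centre the box just below y = 0
  boxplot(x, add = TRUE, horizontal = TRUE, at = at, boxwex = boxwex,
          border = "darkred", outline = FALSE)
  invisible(list(hist = h, boxwex = boxwex, at = at))
}

# Usage, with built-in data standing in for your own variable
res <- hist_with_boxplot(faithful$waiting, main = "Waiting time",
                         xlab = "Waiting time (minutes)")
```

The function returns the values it chose invisibly, so you can inspect res$at and res$boxwex and adjust scale if the box looks too thick or gets clipped.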

Code for the marginal stripchart:

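 hist(Davis2[,2],breaks=30)   # column 2 is weight
 # jitter() spreads the integer-rounded values horizontally;
 # method="jitter" spreads the points vertically around at
 stripchart(jitter(Davis2[,2],amount=.5),
       method="jitter",jitter=.5,pch=16,cex=.05,add=TRUE,at=-.75,col='purple3')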