How to determine whether the y-axis of a graph should start at zero

data visualization

One common way to "lie with data" is to use a y-axis scale that makes it seem as if changes are more significant than they really are.

When I review scientific publications, or students' lab reports, I am often frustrated by this "data visualization sin" (which I believe the authors commit unintentionally, but which still results in a misleading presentation).

However, "always start the y-axis at zero" is not a hard-and-fast rule. For example, Edward Tufte points out that in a time series, the baseline is not necessarily zero:

In general, in a time-series, use a baseline that shows the data not the zero point. If the zero point reasonably occurs in plotting the data, fine. But don't spend a lot of empty vertical space trying to reach down to the zero point at the cost of hiding what is going on in the data line itself. (The book, How to Lie With Statistics, is wrong on this point.)

For examples, all over the place, of absent zero points in time-series, take a look at any major scientific research publication. The scientists want to show their data, not zero.

The urge to contextualize the data is a good one, but context does not come from empty vertical space reaching down to zero, a number which does not even occur in a good many data sets. Instead, for context, show more data horizontally!

I want to point out misleading presentations in the papers I review, but I don't want to be a zero-y-axis purist.

Are there any guidelines that address when to start the y-axis at zero, and when this is unnecessary and/or inappropriate? (Especially in the context of academic work.)

Best Answer

  • Don't use space in a graph in any way that doesn't help understanding. Space is needed to show the data!

  • Use your scientific (engineering, medical, social, business, ...) judgement as well as your statistical judgement. (If you are not the client or customer, talk to someone in the field to get an idea of what is interesting or important, preferably those commissioning the analysis.)

  • Show zero on the $y$ axis if comparisons with zero are central to the problem, or even of some interest.

Those are three simple rules. (Nothing rules out some tension between them on occasion.)

Here is a simple example in which all three points arise: you measure the body temperature of a patient in Celsius, in Fahrenheit, or even in kelvin: take your pick. In what sense is it helpful, or even logical, to insist on showing zero temperature? Doing so will only obscure important, even medically or physiologically crucial, information.

Here is a true story from a presentation. A researcher was showing data on sex ratios for various states and union territories in India. The graphic was a bar chart with all bars starting at zero. All bars were close to the same length despite some considerable variation. That was correct, but the interesting story was that areas were different despite similarities, not that they were similar despite differences. I suggested that parity between males and females (1 or 100 females/100 males) was a much more natural reference level. (I would also be open to using some overall level, such as the national mean, as a reference.) Even some statistical people who have heard this little story have sometimes replied, "No; bars should always start at zero." To me that is no better than irrelevant dogma in such a case. (I would also argue that dot charts make as much or more sense for such data.)
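To make that suggestion concrete, here is a minimal sketch in Python with matplotlib of a dot chart with a reference line at parity instead of bars anchored at zero. The area names and ratios are invented for illustration, not the actual Indian data from the presentation.

```python
import matplotlib.pyplot as plt

# Hypothetical sex ratios (females per 100 males) for a few areas;
# invented values, not the data from the presentation described above.
areas = ["Area A", "Area B", "Area C", "Area D", "Area E"]
ratios = [94, 103, 89, 97, 108]

fig, ax = plt.subplots()
ax.plot(ratios, range(len(areas)), "o")        # dot chart: one point per area
ax.axvline(100, linestyle="--", color="grey")  # reference line at parity
ax.set_yticks(range(len(areas)))
ax.set_yticklabels(areas)
ax.set_xlabel("Females per 100 males")
ax.set_title("Departures from parity, not distance from zero")
plt.show()
```

The dashed line at 100 carries the comparison that matters, so the axis range can be chosen to show the variation in the data rather than the distance to zero.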

Mentioning bar charts points up that the kind of graph used is important too. Suppose for body temperatures a $y$ axis range from 35 to 40$^\circ$C is chosen for convenience as including all the data, so that the $y$ axis "starts" at 35. Clearly bars all starting at 35 would be a poor encoding of the data. But here the problem would be inappropriate choice of graph element, not poorly chosen axis range.
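As a concrete illustration of the temperature example, here is a minimal matplotlib sketch (with invented readings) that plots the observations as points over a 35 to 40°C axis rather than forcing the axis down to zero:

```python
import matplotlib.pyplot as plt

# Invented body-temperature readings over a day, purely for illustration.
hours = [0, 4, 8, 12, 16, 20, 24]
temps = [36.5, 36.8, 37.9, 38.6, 38.2, 37.4, 36.9]

fig, ax = plt.subplots()
ax.plot(hours, temps, marker="o")   # points joined by a line, not bars
ax.set_ylim(35, 40)                 # axis chosen to cover the data, not zero
ax.set_xlabel("Hour of observation")
ax.set_ylabel("Body temperature (°C)")
plt.show()
```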

A common kind of plot, especially it seems in some biological and medical sciences, shows means or other summaries by thick bars starting at zero and standard error or standard deviation-based intervals indicating uncertainty by thin bars. Such detonator or dynamite plots, as they have been called by those who disapprove, may be popular partly because of a dictum that zero should always be shown. The net effect is to emphasise comparisons with zero that are often lacking in interest or utility.
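The contrast is easy to see side by side. The sketch below, with simulated group means and standard errors (the groups and numbers are invented), draws the same summaries once in the bar-from-zero "dynamite" style and once as points with intervals on an axis that covers only the data:

```python
import matplotlib.pyplot as plt

# Simulated group means and standard errors, purely for illustration.
groups = ["Control", "Treatment A", "Treatment B"]
means = [36.9, 37.6, 38.3]
ses = [0.2, 0.25, 0.3]
x = range(len(groups))

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# "Dynamite" style: thick bars from zero plus thin error bars.
left.bar(x, means, yerr=ses, capsize=4)
left.set_xticks(x)
left.set_xticklabels(groups)
left.set_title("Bars from zero")

# Same summaries as points with intervals, axis covering only the data.
right.errorbar(x, means, yerr=ses, fmt="o", capsize=4)
right.set_xticks(x)
right.set_xticklabels(groups)
right.set_ylim(36, 39)
right.set_title("Points and intervals")

plt.tight_layout()
plt.show()
```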

Some people would want to show zero, but also to add a scale break to show that the scale is interrupted. Fashions change and technology changes. Decades ago, when researchers drew their own graphs or delegated the task to technicians, it was easy enough to ask that this be done by hand. Now graphics programs often don't support scale breaks, which I think is no loss. Even where they do, a scale break is a fussy addition that can waste a moderate fraction of the graphic's area.

Note that no-one insists on the same rule for the $x$ axis. Why not? If you show climatic or economic fluctuations for the last century or so, it would be bizarre to be told that the scale should start at the BC/CE boundary or any other origin.

There is naturally a zeroth rule that applies in addition to the three mentioned.

  • Whatever you do, be very clear. Label your axes consistently and informatively. Then trust that careful readers will look to see what you have done.

Thus on this point I agree strongly with Edward Tufte, and I disagree with Darrell Huff.

EDIT 9 May 2016:

rather than trying to invariably include a 0-baseline in all your charts, use logical and meaningful baselines instead

Cairo, A. 2016. The Truthful Art: Data, Charts, and Maps for Communication. San Francisco, CA: New Riders, p.136.
