Error Bars – Issues with Standard Deviation Error Bars Spanning Negative Scale in Non-Negative Variables

confidence interval, descriptive statistics, error, normal distribution, standard deviation

I have a question regarding error bars. I understand that error bars (EBs) constructed with 1 standard deviation (SD) present different things about the population than EBs constructed with 95% confidence intervals (CI). Namely, EBs with SD show the spread (or dispersion) of the variable's actual values, while EBs with CI show the range that the actual mean should most likely fall within.

My data include a variable, the number (count) of times a person visits the doctor per year. The mean visit number is 3 and the SD is 5, while the confidence interval is 2.5 to 3.5. Is it inherently wrong to show the EBs based on SD since it would extend to negative values (i.e., 3-5 = -2)? Does it violate any assumption?

If I draw a bar graph showing the mean of 3 with EBs based on 1 SD, truncated at zero, the EBs will range from 0 to 8. Can I still claim that ~68% of values fall within 0 to 8? Or does this no longer hold, because the data are right-skewed and much of the lower EB would extend into negative values? If so, how should I interpret the 0-to-8 range once the negative part is truncated?

Best Answer

No, in this case, it does not make sense to draw error bars using SDs.

Take a step back. Why do we draw error bars with SDs? As you write, it's to show where "much" of the data lies. This makes sense if your data come from a normal distribution: 68% of your data will lie within $\pm 1$ SD from the mean, so showing the mean with an error bar of $\pm 1$ SD will give you an interval that contains 68% of your data.
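The 68% figure can be checked directly from the standard normal CDF. A minimal sketch in base R, where the variable name `coverage` is only illustrative:

```r
# Probability mass of a normal distribution within 1 SD of its mean
coverage <- diff(pnorm(c(-1, 1)))
round(coverage, 4)  # 0.6827

# The two tail probabilities that bracket this interval (about 16% and 84%)
round(pnorm(c(-1, 1)), 4)  # 0.1587 0.8413
```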

However, the number of visits to a doctor is a count, so it is discrete. And it can't be negative. Thus, it can't be normal. For high counts, you can often treat counts as normal, but not for a mean of 3 and an SD of 5. Using SD-based error bars is the wrong way of answering the original question, i.e., showing where "much" of the data falls.
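To get a feel for how poorly a normal distribution fits here, one can ask how much probability a normal with mean 3 and SD 5 would place below zero. A rough sketch in base R (the name `p_negative` is only for illustration):

```r
# Share of a Normal(mean = 3, sd = 5) distribution that lies below zero,
# i.e. on values a count can never take
p_negative <- pnorm(0, mean = 3, sd = 5)
round(p_negative, 3)  # 0.274
```

Under that assumed normal model, about 27% of the probability would sit on impossible negative visit counts.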

Better: calculate the lower and upper ends of your interval directly, e.g., as the 16% and 84% quantiles of your observations. The range between them will again contain about 68% of your data, just as the interval of mean $\pm 1$ SD does in the normal case.
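As a sketch of the quantile approach, suppose the observed counts are stored in a vector. The `visits` values below are made up purely for illustration (they are not the asker's data), and `type = 1` is just one of R's quantile definitions, chosen here so that the endpoints are actual observed counts:

```r
# Hypothetical visit counts, invented for illustration only
visits <- c(0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 8, 15)

mean(visits)  # 3
sd(visits)    # 4, so mean - 1 SD is already negative

# Empirical 16% and 84% quantiles; type = 1 returns observed values
q <- quantile(visits, probs = c(0.16, 0.84), type = 1)
q

# Share of observations inside the quantile-based interval
inside <- mean(visits >= q[1] & visits <= q[2])
inside
```

With discrete data and many ties, an interval built this way covers at least 68% of the observations rather than exactly 68%, so it is worth reporting the actual share alongside the endpoints.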

Alternatively, you can fit a distribution. For instance, a mean of 3 and an SD of 5 are consistent with a negative binomial distribution with a mean of 3 and a size parameter of $\frac{3^2}{5^2-3}$ (see R's help page ?qnbinom - there are many different parameterizations of the negative binomial). For such a distribution, we can again calculate the parametric 16%/84% quantiles, which turn out to give us the interval $[0,6]$:

> qnbinom(pnorm(c(-1,1)),mu=3,size=3^2/(5^2-3))
[1] 0 6
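The size parameter comes from matching moments. Assuming the `mu`/`size` parameterization described in ?qnbinom, where the variance is $\mu + \mu^2/\text{size}$, setting the variance to $5^2$ and solving gives $\text{size} = \frac{\mu^2}{\sigma^2 - \mu} = \frac{9}{22}$. The following sketch checks numerically that this choice reproduces the stated mean and SD; the names `fit_mean` and `fit_sd` and the truncation point of 10000 are illustrative choices, not part of the original answer:

```r
mu <- 3
sigma <- 5

# Method-of-moments size parameter: variance = mu + mu^2 / size
size <- mu^2 / (sigma^2 - mu)  # 9/22, about 0.41

# Recover mean and SD numerically from the probability mass function;
# the support is truncated at 10000, where the remaining mass is negligible
k <- 0:10000
p <- dnbinom(k, mu = mu, size = size)
fit_mean <- sum(k * p)
fit_sd <- sqrt(sum((k - fit_mean)^2 * p))
c(mean = fit_mean, sd = fit_sd)  # approximately 3 and 5
```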