Error Bars – Issues with Standard Deviation Error Bars Spanning Negative Scale in Non-Negative Variables

confidence interval, descriptive statistics, error, normal distribution, standard deviation

I have a question regarding error bars. I understand that error bars (EBs) constructed with 1 standard deviation (SD) present different things about the population than EBs constructed with 95% confidence intervals (CI). Namely, EBs with SD show the spread (or dispersion) of the variable's actual values, while EBs with CI show the range that the actual mean should most likely fall within.

My data include a variable, the number (count) of times a person visits the doctor per year. The mean visit number is 3 and the SD is 5, while the confidence interval is 2.5 to 3.5. Is it inherently wrong to show the EBs based on SD since it would extend to negative values (i.e., 3-5 = -2)? Does it violate any assumption?

If I draw a bar graph showing the mean of 3 with EBs based on 1 SD, and truncate the lower bar at zero, the EBs will range from 0 to 8. Can I still claim that ~68% of values fall within 0 to 8? Or does this no longer hold, because the distribution is right-skewed and much of the lower EB would fall in the negative range? If so, how should I interpret the 0-to-8 interval, which truncates the negative part?

Best Answer

No, in this case, it does not make sense to draw error bars using SDs.

Take a step back. Why do we draw error bars with SDs? As you write, it's to show where "much" of the data lies. This makes sense if your data come from a normal distribution: 68% of your data will lie within $\pm 1$ SD from the mean, so showing the mean with an error bar of $\pm 1$ SD will give you an interval that contains 68% of your data.

However, the number of visits to a doctor is a count, so it is discrete. And it can't be negative. Thus, it can't be normal. For high counts, you can often treat counts as normal, but not for a mean of 3 and an SD of 5. Using SD-based error bars is the wrong way of answering the original question, i.e., showing where "much" of the data falls.
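To make the mismatch concrete, one can ask what a normal distribution with the question's mean and SD would imply. The following is a minimal Python sketch using only the standard library; the variable names are illustrative and not from the original post.

```python
from statistics import NormalDist

mean, sd = 3.0, 5.0  # summary statistics quoted in the question
normal = NormalDist(mu=mean, sigma=sd)

# Share of a normal distribution lying within one SD of its mean.
within_one_sd = normal.cdf(mean + sd) - normal.cdf(mean - sd)

# Share of that same normal distribution lying below zero, which is
# impossible for a count of doctor visits.
below_zero = normal.cdf(0.0)

print(f"within +/- 1 SD: {within_one_sd:.3f}")
print(f"below zero:      {below_zero:.3f}")
```

Under the normal assumption, more than a quarter of the probability mass would sit at negative visit counts, which is why the 68% reading of the SD-based bar cannot be trusted here.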

Better: calculate the top and bottom ends of your interval directly, by calculating (e.g.) the 16% and the 84% quantiles of your observations. The range between them will again contain 68% of your data, just as the interval of the mean $\pm 1$ SD does in the normal case.
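As a rough illustration of the quantile approach, here is a Python sketch using the standard library. The `visits` data and the helper name `central_interval` are invented for this example; they are not the asker's data.

```python
from statistics import quantiles

def central_interval(data, lower_pct=16, upper_pct=84):
    """Return the empirical lower_pct and upper_pct percentiles of data."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
    # The "inclusive" method keeps every cut point inside the observed range.
    cuts = quantiles(data, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

# Hypothetical yearly visit counts, invented for illustration only.
visits = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 6, 9, 16]

low, high = central_interval(visits)
print(f"16% quantile: {low}, 84% quantile: {high}")
```

Because the interval is built from the observations themselves, its lower end can never fall below the smallest observed count, so it cannot go negative.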

Alternatively, you can fit a distribution. For instance, a mean of 3 and an SD of 5 are consistent with a negative binomial distribution with a mean of 3 and a size parameter of $\frac{3^2}{5^2-3}$ (see R's help page ?qnbinom - there are many different parameterizations of the negbin). For such a distribution, we can again calculate the parametric 16%/84% quantiles, which turns out to give us an interval $[0,6]$:

> qnbinom(pnorm(c(-1,1)),mu=3,size=3^2/(5^2-3))
[1] 0 6
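For readers without R, the same calculation can be sketched in Python with only the standard library. The helpers `nbinom_pmf` and `nbinom_quantile` are written for this example (they are not library functions) and assume the same (size, mu) parameterization as R's qnbinom, where the variance equals mu + mu^2/size.

```python
from math import exp, lgamma, log
from statistics import NormalDist

def nbinom_pmf(k, size, mu):
    """Negative binomial pmf in the (size, mu) parameterization."""
    p = size / (size + mu)
    log_pmf = (lgamma(k + size) - lgamma(size) - lgamma(k + 1)
               + size * log(p) + k * log(1.0 - p))
    return exp(log_pmf)

def nbinom_quantile(q, size, mu):
    """Smallest count k whose cumulative probability is at least q."""
    k, cdf = 0, nbinom_pmf(0, size, mu)
    while cdf < q:
        k += 1
        cdf += nbinom_pmf(k, size, mu)
    return k

mean, sd = 3.0, 5.0
# Method of moments: solve sd^2 = mean + mean^2 / size for size.
size = mean**2 / (sd**2 - mean)

# Probabilities matching -1 and +1 SD under a standard normal.
std_normal = NormalDist()
probs = [std_normal.cdf(-1.0), std_normal.cdf(1.0)]

interval = [nbinom_quantile(q, size, mean) for q in probs]
print(interval)  # [0, 6]
```

The lower end is 0 because this distribution already puts about 42% of its mass on zero visits, well above the 16% target.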

Related Solutions

confidence-interval – Confidence Interval Coverage for Discrete Functions: In-Depth Analysis

Neyman's confidence intervals make no attempt to provide coverage of the parameter in the case of any particular interval. Instead they provide coverage over all possible parameter values in the long run. In a sense they attempt to be globally accurate at the expense of local accuracy.

Confidence intervals for binomial proportions offer a clear illustration of this issue. Neymanian assessment of intervals yields the irregular coverage plots like this, which is for 95% Clopper-Pearson intervals for n=10 Binomial trials:

[Figure: Clopper-Pearson coverage plot]
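The kind of global coverage curve described above can be recomputed with a short Python sketch. This is an illustration under stated assumptions, not the code behind the original plot: the interval limits are found by bisection on the binomial tail probabilities, and all function names are made up for this example.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def solve_increasing(g, target, iters=60):
    """Bisection on [0, 1] for g(p) = target, with g increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def clopper_pearson(x, n, alpha=0.05):
    """Clopper-Pearson interval for x successes in n trials."""
    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2.0)
    # Upper limit: the p at which P(X <= x) equals alpha / 2,
    # i.e. P(X > x) equals 1 - alpha / 2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: 1.0 - binom_cdf(x, n, p), 1.0 - alpha / 2.0)
    return lower, upper

def coverage(p, n, alpha=0.05):
    """Probability, at true proportion p, that the interval contains p."""
    total = 0.0
    for x in range(n + 1):
        lower, upper = clopper_pearson(x, n, alpha)
        if lower <= p <= upper:
            total += binom_pmf(x, n, p)
    return total

n = 10
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}  coverage = {coverage(p, n):.4f}")
```

Evaluating `coverage` on a fine grid of p values traces out the irregular, sawtooth-like curve; it never drops below the nominal 95% because the Clopper-Pearson interval is conservative by construction.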

There is an alternative way to do coverage, one that I personally think is much more intuitively approachable and (thus) useful. The coverage by intervals can be specified conditional on the observed result. That coverage would be local coverage. Here is a plot showing local coverage for three different methods of calculating confidence intervals for binomial proportions: Clopper-Pearson, Wilson's score, and a conditional exact method that yields intervals identical to Bayesian intervals with a uniform prior:

[Figure: Conditional coverage for three types of interval]

Notice that the 95% Clopper-Pearson method gives over 98% local coverage but the exact conditional intervals are, well, exact.

One way to think of the difference between global and local intervals is to regard the global ones as inversions of Neyman-Pearson hypothesis tests, where the outcome is a decision made by considering long-run error rates for the current experiment as a member of the global set of all experiments that might be run. The local intervals are more akin to inversions of Fisherian significance tests, which yield a P value that represents the evidence against the null from this particular experiment.

(As far as I know, the distinction between global and local statistics was first made in an unpublished Master’s thesis by Claire F Leslie (1998) Lack of confidence : a study of the suppression of certain counter-examples to the Neyman-Pearson theory of statistical inference with particular reference to the theory of confidence intervals. That thesis is held by the Baillieu library at The University of Melbourne.)

Solved – Error bars and coefficient computation on linear regressions in Matlab

It sounds like you have some prior notion of the precision of your measurements. You can use

fitlm(x,y,'linear','Weights',w)

to supply a vector of weights. It is common to use the inverse variance as a weight. So if the error bars are +/- 1 standard deviation, you might supply the reciprocals of the squares of these standard deviations.

If you do that, the coefCI method will produce coefficient confidence intervals that take the weighting into account.
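To show what inverse-variance weighting does to the fit itself, here is a small Python sketch of weighted least squares for a straight line. It is not a reimplementation of fitlm or coefCI: it only computes the point estimates, and the function name and data are hypothetical.

```python
def weighted_linear_fit(x, y, sd):
    """Fit y = intercept + slope * x by inverse-variance weighted least squares."""
    # Each point is weighted by 1 / variance, so precise points count for more.
    w = [1.0 / s**2 for s in sd]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical measurements with per-point standard deviations
# (the half-lengths of the error bars), invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
sd = [0.1, 0.1, 0.5, 0.5, 1.0]

intercept, slope = weighted_linear_fit(x, y, sd)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Points with small error bars pull the line toward themselves, while points with large error bars have little influence on the estimated coefficients.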
