[Math] What to call & how to compute errors in a very asymmetric sample

standard deviation, statistics

Consider the following sample: $\{1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 10.0\}$. The mean is $\mu = 3$ and the (population) standard deviation is $\sigma \approx 2.69$. I am wondering how to compute, and what to call, error bars in the context of this sample.

Using the standard deviation ($\mu \pm \sigma$) will obviously not be very useful, as the interval would cover the complete sample except for the one outlier. Using the first and third quartiles, $[1.75, 2.5]$, describes the sample much better but seems odd for plotting error bars, as the mean is not inside this interval.

One could use the mean deviation of all upper (above the mean) and all lower (below the mean) observations, which would give $[2.,10.]$, but there does not seem to be a proper name for this concept (discussed here). I tend to favor this option but have trouble referring to what is plotted in a convenient way (calling it "mean deviation", "standard deviation", "standard errors", or "mean errors" is just plain wrong; calling it "errors" or "upper/lower errors" seems unspecific and also seems to imply some kind of estimation which is not involved; calling it "upper/lower deviation" is unspecific and readers would likely assume it to refer to something like the standard deviation).
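The quantities discussed so far can be reproduced with a short sketch. It uses only Python's standard library; the variable names (`lower_dev`, `upper_dev`) are illustrative and not an established terminology.

```python
from statistics import fmean, pstdev

sample = [1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 10.0]

mean = fmean(sample)
sd = pstdev(sample)  # population SD (divides by n)

# Mean absolute deviation from the mean, computed separately
# for the observations below and above the mean.
lower_dev = fmean([mean - x for x in sample if x < mean])
upper_dev = fmean([x - mean for x in sample if x > mean])

print("mean:", mean)
print("SD:", round(sd, 2))
print("mean +/- SD:", (round(mean - sd, 2), round(mean + sd, 2)))
print("asymmetric bars:", (mean - lower_dev, mean + upper_dev))
```

The last line gives the interval $[2, 10]$ described above, while `mean +/- SD` covers every observation except the outlier.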

How are error bars usually computed for such a sample and what are they then usually referred to (other than "error bars")?

Best Answer

ERROR BARS. As far as I can discover, the term 'error bar' can refer to almost any line that indicates a degree of uncertainty. In different disciplines and applications they can indicate the standard deviation (SD) of data, the standard error (often SD divided by square root of sample size), or confidence interval (90%, 95%, or 99%). See the very brief Wikipedia 'error bars' article.

ASYMMETRICAL SAMPLES. However, I think your main question has to do with asymmetrical populations, samples from them, and error bars representing an estimate of a population parameter. A common asymmetrical family of distributions in statistics is the gamma family, which includes highly right-skewed exponential distributions.

As an example of asymmetrical confidence intervals, I will show that confidence intervals for estimates of the mean of an exponential distribution are asymmetrical.

The population mean of an exponential distribution is estimated by the sample mean. Suppose the data are $X_1, \dots, X_{10}$ from an exponential distribution with mean $\mu.$ Then the ratio $\bar X/\mu$ of the sample mean to the population mean has a gamma distribution with shape parameter $n = 10$ and scale parameter $1/n = 0.1$ (rate parameter $n = 10$). Let the numbers $L$ and $U$ cut probability 2.5% from the lower and upper tails of this distribution, respectively. Then $$P(L < \bar X/\mu < U) = P(\bar X/U < \mu < \bar X/L) = 0.95,$$ so that $(\bar X/U, \bar X/L)$ is a 95% confidence interval for $\mu.$ Specifically, for $n = 10$: $L = 0.480$ and $U = 1.708$ (from statistical software).
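For readers without statistical software at hand, the cutoffs $L$ and $U$ can be approximated with the standard library alone. The sketch below relies on the fact that a gamma distribution with integer shape (an Erlang distribution) has a closed-form CDF, and inverts it by bisection. The helper names `erlang_cdf` and `erlang_quantile` are made up for this illustration; a statistics package's gamma quantile function would do the same job.

```python
import math

def erlang_cdf(x, shape, rate):
    """CDF of a gamma distribution with integer shape (Erlang)."""
    if x <= 0:
        return 0.0
    t = rate * x
    term = 1.0   # t**k / k! for k = 0
    total = 1.0
    for k in range(1, shape):
        term *= t / k
        total += term
    return 1.0 - math.exp(-t) * total

def erlang_quantile(p, shape, rate, lo=0.0, hi=100.0, tol=1e-10):
    """Invert the CDF by bisection (the CDF is increasing in x)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if erlang_cdf(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n = 10
L = erlang_quantile(0.025, shape=n, rate=n)
U = erlang_quantile(0.975, shape=n, rate=n)
print(round(L, 3), round(U, 3))
```

The two printed values should agree with the cutoffs quoted above to three decimal places.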

CI FOR EXPONENTIAL MEAN. As a numerical example, consider the following sorted data (perhaps lifetimes in weeks of 10 electronic components subjected to unfavorable operating conditions):

0.546, 0.742, 1.005, 3.160, 3.594, 4.057, 4.156, 5.483, 12.590, 21.383.

Here, $\bar X = 5.6716$ and the 95% CI for $\mu$ is $(3.32, 11.83).$
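As a check on the arithmetic, take the cutoffs to one more digit than quoted, $L \approx 0.4795$ and $U \approx 1.7085$ (an assumption about the unrounded software values). Then the endpoints are

$$\frac{\bar X}{U} = \frac{5.6716}{1.7085} \approx 3.32, \qquad \frac{\bar X}{L} = \frac{5.6716}{0.4795} \approx 11.83.$$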

Notice that the data are strongly right skewed, with several small values close together at the left end and a 'tail' with large values scattered out towards the right end. Also, notice that 5.67 is closer to the left end of the CI than to the right end. (Typically, a CI will not include ALL of the data; its purpose is to give an interval of values in which we might logically expect the population mean to lie.)

Notes: (1) Error bars often occur in clusters. Suppose we conducted half a dozen experiments of this type on components of different specifications and summarized the data as six confidence intervals in a column. Even though each CI can be considered to have an 'error probability' of only 5%, the error probability of the COLLECTION of six CIs taken together may be considerably higher. So we would have to be cautious about drawing conclusions about PATTERNS of behavior among the six kinds of components. (There are ways to make CIs so that the error rate of the FAMILY of CIs is only 5%, but that is another topic.)

(2) Suppose we had not noticed the skewness or known that the data above are from an exponential distribution. Then we might INCORRECTLY have assumed a symmetrical normal distribution and used a symmetrical CI based on the t-distribution. That incorrect symmetrical interval would be of the form $\bar X \pm 2.262 S/\sqrt{10},$ which would compute to $(1.01, 10.34).$ [For the data shown above $S = 6.52,$ a statistic not needed for the correct CI.]
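Both notes can be checked numerically. The sketch below recomputes the symmetric t interval from note (2), with the critical value 2.262 taken from the text rather than computed, and then the chance that at least one of six 95% intervals misses its target, under the added assumption that the six intervals are independent.

```python
import math
from statistics import fmean, stdev

data = [0.546, 0.742, 1.005, 3.160, 3.594,
        4.057, 4.156, 5.483, 12.590, 21.383]

t_crit = 2.262            # t quantile for 9 degrees of freedom, from the text
n = len(data)
xbar = fmean(data)
s = stdev(data)           # sample SD (divides by n - 1)
half_width = t_crit * s / math.sqrt(n)
t_interval = (xbar - half_width, xbar + half_width)

print("xbar:", round(xbar, 4), "S:", round(s, 2))
print("t interval:", tuple(round(v, 2) for v in t_interval))

# Note (1): error rate of a family of six intervals,
# ASSUMING the six intervals are independent.
family_error = 1 - 0.95 ** 6
print("family error rate:", round(family_error, 3))
```

Under that independence assumption the family error rate comes out near 26%, which is why patterns across several intervals call for caution.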

Related Solutions

[Math] Calculating mean and standard deviation of very large sample sizes

Posting as an answer in response to comments.

Here's a way to compute the mean and standard deviation in one pass over the file. (Pseudocode.)

n = r1 = r2 = 0;                        // count, running sum, running sum of squares
while (more_samples()) {
    s = next_sample();
    n += 1;
    r1 += s;
    r2 += s*s;
}
mean = r1 / n;                          // requires n > 0
stddev = sqrt(r2/n - (mean * mean));    // population SD (divides by n)

Essentially, you keep a running total of the sum of the samples and the sum of their squares. This lets you easily compute the standard deviation at the end.
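One caveat: subtracting `mean * mean` from `r2/n` can lose precision when the mean is large relative to the spread. A commonly used alternative is Welford's one-pass update, sketched below in Python. This is a different technique from the pseudocode above, not a transcription of it, and the function name `running_mean_sd` is illustrative.

```python
import math

def running_mean_sd(samples):
    """One pass over an iterable; returns (mean, population SD)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n == 0:
        raise ValueError("need at least one sample")
    return mean, math.sqrt(m2 / n)

# Works on any iterable, e.g. values streamed from a file.
mean, sd = running_mean_sd(iter([1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 10.0]))
print(mean, round(sd, 2))
```

Like the pseudocode, this keeps only a few running numbers in memory, so the file never has to be loaded in full.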

[Math] Is it possible to calculate the mean and standard deviation from a median and quartiles

It's mathematically impossible to deduce mean or standard deviation from median/quartiles, because medians and quartiles discard most of the data on which the mean and standard deviation are based.

Example:

data   frequency  
   0       50      
 1.4        4     
   2       50

That has a mean of 1.0 and a standard deviation of 1.0. (I'm rounding to one decimal place so I don't have to go into population versus sample standard deviation.)

data     frequency    
   0       30        
 1.4       44        
   2       30

That data also has the median and quartiles the same as in your example, but now the mean is 1.2 and the standard deviation is 0.8.
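The comparison of the first two tables can be verified with a short standard-library sketch. The `expand` helper is introduced here only for illustration, and `statistics.quantiles` uses its default method, which is one of several quartile conventions.

```python
from statistics import fmean, pstdev, quantiles

def expand(table):
    """Turn (value, frequency) pairs into a flat list of observations."""
    return [value for value, freq in table for _ in range(freq)]

first = expand([(0.0, 50), (1.4, 4), (2.0, 50)])
second = expand([(0.0, 30), (1.4, 44), (2.0, 30)])

for name, data in (("first", first), ("second", second)):
    q1, median, q3 = quantiles(data, n=4)
    print(name,
          "quartiles:", (q1, median, q3),
          "mean:", round(fmean(data), 2),
          "SD:", round(pstdev(data), 2))
```

Both tables report the same three quartiles while the means and standard deviations differ, which is the point of the example.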

data     frequency        
   0       50        
 1.4        3        
   2       50        
10000000    1

Now I've changed my maximum without changing the median or quartiles. You can see even more clearly how the median and quartiles exclude extreme data, because the mean is now 96000 and the standard deviation is 980000 (to 2 significant figures).
