Standard Deviation – Relationship Between the Range and the Standard Deviation

descriptive statistics, range, standard deviation

In an article I found the following formula for the standard deviation of a sample of size $N$:

$\sigma=\frac{\overline{R}}{2.534}$

where $\overline{R}$ is the average range of subsamples (each of size $6$) taken from the main sample. How is the number $2.534$ calculated? Is it the correct number?

Best Answer

In a sample $x$ of $n$ independent values from a distribution $F$ with pdf $f$, the joint pdf of the extremes $\min(x)=x_{[1]}$ and $\max(x)=x_{[n]}$ is proportional to

$$f(x_{[1]})\left(F(x_{[n]})-F(x_{[1]})\right)^{n-2}f(x_{[n]})dx_{[1]}dx_{[n]} = H_F(x_{[1]}, x_{[n]})dx_{[1]}dx_{[n]}.$$

(The constant of proportionality is the multinomial coefficient $\binom{n}{1,n-2,1} = n(n-1)$, which counts the ways of choosing which of the $n$ values is the smallest and which is the largest. Intuitively, this joint PDF expresses the chance of finding the smallest value in the range $[x_{[1]},x_{[1]}+dx_{[1]})$, the largest value in the range $[x_{[n]},x_{[n]}+dx_{[n]})$, and the middle $n-2$ values between them within the range $[x_{[1]}+dx_{[1]}, x_{[n]})$. When $F$ is continuous, we may replace that middle range by $(x_{[1]}, x_{[n]}]$, thereby neglecting only an "infinitesimal" amount of probability. The associated probabilities, to first order in the differentials, are $f(x_{[1]})dx_{[1]}$ for the smallest value, $f(x_{[n]})dx_{[n]}$ for the largest, and $F(x_{[n]})-F(x_{[1]})$ for each of the $n-2$ middle values, which makes it clear where the formula comes from.)

Taking the expectation of the range $x_{[n]} - x_{[1]}$ gives $2.53441\ \sigma$ for any Normal distribution with standard deviation $\sigma$ and $n=6$. The expected range as a multiple of $\sigma$ depends on the sample size $n$:

[Figure: expected range as a multiple of $\sigma$ versus sample size $n$, for Normal distributions]

These values were computed by numerically integrating $\binom{n}{1,n-2,1}\left(y-x\right)H_F(x,y)dxdy$ over $\{(x,y)\in\mathbb{R}^2|x\le y\}$, with $F$ set to the standard Normal CDF, and dividing by the standard deviation of $F$ (which is just $1$).
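The integration described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the values in this answer: the function name `expected_range_multiplier`, the truncation of the plane to $[-8, 8]^2$, and the simple midpoint rule are all assumptions chosen to keep the example dependency-free.

```python
import math

def normal_pdf(z):
    """Standard Normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard Normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_range_multiplier(n, pdf=normal_pdf, cdf=normal_cdf,
                              lo=-8.0, hi=8.0, steps=800):
    """Approximate E[max - min] for n i.i.d. draws with a midpoint rule
    over the region {x <= y}, truncated to [lo, hi] x [lo, hi].

    Assumes the distribution has unit standard deviation, so the result
    is already the multiplier of sigma.
    """
    h = (hi - lo) / steps
    pts = [lo + (i + 0.5) * h for i in range(steps)]   # cell midpoints
    f = [pdf(p) for p in pts]
    F = [cdf(p) for p in pts]
    total = 0.0
    for i in range(steps):              # index of the minimum, x
        for j in range(i + 1, steps):   # index of the maximum, y > x
            total += (pts[j] - pts[i]) * f[i] * (F[j] - F[i]) ** (n - 2) * f[j]
    # n(n-1) is the multinomial coefficient; h*h is the area of one grid cell.
    return n * (n - 1) * total * h * h

multiplier = expected_range_multiplier(6)
print(f"n = 6: expected range is about {multiplier:.4f} sigma")
```

For $n=6$ the result should land close to the $2.53441$ quoted above. A production version would more likely use an adaptive quadrature routine, but the grid version makes the integrand explicit.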

A similar multiplicative relationship between the expected range and the standard deviation will hold for any location-scale family of distributions, because it is a property of the shape of the distribution alone. For instance, here is a comparable plot for uniform distributions:

[Figure: expected range as a multiple of $\sigma$ versus sample size $n$, for uniform distributions]

and exponential distributions:

[Figure: expected range as a multiple of $\sigma$ versus sample size $n$, for exponential distributions]

The values in the preceding two plots were obtained by exact (not numerical) integration, which is possible due to the relatively simple algebraic forms of $f$ and $F$ in each case. For the uniform distributions they equal $\frac{n-1}{n+1}\sqrt{12}$ and for the exponential distributions they are $\gamma + \psi(n) = \gamma + \frac{\Gamma'(n)}{\Gamma(n)}$, where $\gamma$ is Euler's constant and $\psi$ is the digamma function (the polygamma function of order zero), the logarithmic derivative of Euler's Gamma function.
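The two closed forms are easy to evaluate directly. The sketch below uses hypothetical function names and relies on the standard identity $\gamma + \psi(n) = \sum_{k=1}^{n-1} 1/k$ for positive integers $n$, which avoids needing a digamma implementation.

```python
import math

def uniform_multiplier(n):
    """Expected range / standard deviation for a uniform distribution."""
    return (n - 1) / (n + 1) * math.sqrt(12.0)

def exponential_multiplier(n):
    """Expected range / standard deviation for an exponential distribution.

    For integer n >= 1, gamma + psi(n) equals the harmonic number H_{n-1}.
    """
    return sum(1.0 / k for k in range(1, n))

for n in range(2, 11):
    print(n, round(uniform_multiplier(n), 4), round(exponential_multiplier(n), 4))
```

At $n=6$ these give about $2.474$ for the uniform case and about $2.283$ for the exponential case.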

Although they differ (because these distributions display a wide range of shapes), the three roughly agree around $n=6$, showing that the multiplier $2.5$ does not depend heavily on the shape and therefore can serve as an omnibus, robust assessment of the standard deviation when ranges of small subsamples are known. (Indeed, the very heavy-tailed Student $t$ distribution with three degrees of freedom still has a multiplier around $2.3$ for $n=6$, not far at all from $2.5$.)
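To see the formula from the question in action, here is a small simulation. It is only a sketch under stated assumptions: the helper `estimate_sigma_from_ranges` is a hypothetical name, the data are simulated Normal values with a known $\sigma$, and the subsamples are taken as consecutive blocks of six, which the original article may or may not have done.

```python
import random

def estimate_sigma_from_ranges(data, subgroup_size=6, multiplier=2.534):
    """Average the ranges of consecutive subgroups and divide by the multiplier."""
    ranges = []
    # Step through the data in blocks of subgroup_size, ignoring a short final block.
    for start in range(0, len(data) - subgroup_size + 1, subgroup_size):
        block = data[start:start + subgroup_size]
        ranges.append(max(block) - min(block))
    if not ranges:
        raise ValueError("need at least one full subgroup")
    mean_range = sum(ranges) / len(ranges)
    return mean_range / multiplier

rng = random.Random(12345)            # fixed seed so the run is repeatable
true_sigma = 2.0
sample = [rng.gauss(10.0, true_sigma) for _ in range(6000)]

sigma_hat = estimate_sigma_from_ranges(sample)
print(f"true sigma = {true_sigma}, range-based estimate = {sigma_hat:.3f}")
```

With a thousand subgroups the estimate should sit close to the true $\sigma$. Swapping `rng.gauss` for draws from another distribution is a quick way to check how much the shape matters.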