Solved – Expected standard deviation for a sample from a uniform distribution

distributionssamplesamplinguniform distribution

I've been trying to find information on the sampling distribution of the standard deviation for uniform distributions and have been having a heck of a time figuring out the expected value for the standard deviation of a sample. Hopefully someone can help out!

Basically, I'm trying to find the expected standard deviation given sample size n. Given a uniform, continuous distribution with population range (b-a), I know that the population standard deviation would be (b-a)/sqrt(12); however, what would be the expected standard deviation for a sample size n drawn from that continuous uniform distribution?

My hope is that given a Uniform Distribution with range W (or "PopRange" in the example below), I would like to calculate an expected standard deviation for a given sample size. Since the mean sample standard deviation changes as a function of the sample size, I'm hoping to quantify it. Here's some MATLAB code illustrating what I am hoping to achieve.

% Number of samples for the brute-force estimation
totalSamples = 10000;

% The population range (b-a)
PopRange = 30;

for n = 2:100
    % Draw samples of sample size n
    theSamples = rand(n,totalSamples)*PopRange;

    % Calculate the mean of the standard deviations and ranges.  This
    % should be  equivalent to the means of the sampling distributions
    % of the standard deviation and range
    uniformSampleSD(n) = mean(std(theSamples,[],1));
    uniformSampleRange(n) = mean(max(theSamples,[],1)-min(theSamples,[],1));

    % Calculate the expected standard deviation, note that this is just the
    % population standard deviation at the moment and does not incorporate
    % the sample size
    uniformSampleESD(n) = PopRange/sqrt(12);
    % SEE: https://en.wikipedia.org/wiki/Prediction_interval#Non-parametric_methods
    % Calculate the expected sample range
    uniformSampleERange(n) = (n-1)/(n+1)*PopRange;
end

% Note the red curve underestimating the population standard deviation for
% smaller sample sizes? This is fine, but I would like to find a numerical
% estimate for it
figure
plot(uniformSampleSD,'r')
hold on
plot(uniformSampleESD,'b')

Any help or pointers would be greatly appreciated!


Edit:

Perhaps I mis-worded what I am looking for? I am looking for the mean of the sampling distribution of the standard deviation for a continuous distribution.

If I draw 100000 sets from a uniform distribution [0,1] of sample size n, the distribution of standard deviations will not have a mean of sqrt((1-0)^2 / 12) = 0.289. Empirically, the mean of the sampling distribution of the sd will be:

n = 2, sd_mean = 0.235
n = 3, sd_mean = 0.263
n = 4, sd_mean = 0.273
n = 5, sd_mean = 0.278
n = 6, sd_mean = 0.280
n = 7, sd_mean = 0.282
n = 8, sd_mean = 0.283
n = 9, sd_mean = 0.284
n = 10, sd_mean = 0.285

Note that it asymptotically approaches sqrt((b – a)^2 / 12), but that this is inaccurate for small n. See these examples of the sampling distributions for those small n:

http://imgur.com/a/Asgkz (note: the "population" and "observed" sd line labels were accidentally swapped)

These were created using this code:
totalSamples = 100000;

for n = 2:10
    theSamples = rand(n,totalSamples);
    setSDs = std(theSamples,[],1);
    popSD = sqrt((1-0)^2/12);
    nelements = hist(setSDs,50);

    figure
    hist(setSDs,50)
    hold on
    plot([popSD popSD],[0,max(nelements)],'r')
    plot([mean(setSDs) mean(setSDs)],[0,max(nelements)],'g')
    xlabel('Count')
    ylabel('Sample Standard Deviation')
    title({'Sampling Distribution of the Standard Deviation with 100000 sets';sprintf('drawn from a uniform distribution [0,1] with sample size %u',n)})
    legend('Observations',sprintf('Pop SD = %f',popSD),sprintf('Observed Mean Sample SD = %f',mean(setSDs)))
end

Best Answer

The integration is difficult even with as few as $3$ values. Why not estimate the bias in the sample SD by using a surrogate measure of spread? One set of choices is afforded by differences in the order statistics.

Consider, for instance, Tukey's H-spread. For a data set of $n$ values, let $m = \lfloor\frac{n+1}{2}\rfloor$ and set $h = \frac{m+1}{2}$. Let $n$ be such that $h$ is integral; values $n = 4i+1$ will work. In these cases the H-spread is the difference between the $h^\text{th}$ highest value $y$ and $h^\text{th}$ lowest value $x$. (For large $n$ it will be very close to the interquartile range.) The beauty of using the H-spread is that, being based on order statistics, its distribution can be obtained analytically, because the joint PDF of the $j,k$ order statistics $(x,y)$ is proportional to

$$x^{j-1}(1-y)^{n-k}(y-x)^{k-j-1},\ 0\le x\le y\le 1.$$

From this we can obtain the expectation of $y-x$ as

$$s(n; j,k) = \mathbb{E}(y-x) = \frac{k-j}{n+1}.$$

Set $j=h$ and $k=n+1-h$ for the H-spread itself. When $n=4i+1$, $j=i+1$ and $k=3i+1$, whence $s(4i+1; i+1, 3i+1)=\frac{2i}{4i+1}.$

At this point, consider regressing simulated (or even calculated values) of the expected SD ($sd(n)$) against the H-spreads $s(4i+1,i+1,3i+1) = s(n).$ We might expect to find an asymptotic series for $sd(n)/s(n)$ in negative powers of $n$:

$$sd(n)/s(n) = \alpha_0 + \alpha_1 n^{-1} + \alpha_2 n^{-2} + \cdots.$$

By spending two minutes to simulate values of $sd(n)$ and regressing them against computed values of $s(n)$ in the range $5\le n\le 401$ (at which point the bias becomes very small), I find that $\alpha_0 \approx 0.5774$ (which estimates $2\sqrt{1/12}\approx 0.57735$), $\alpha_1\approx 1.091,$ and $\alpha_2 \approx 1.$ The fit is excellent. For instance, basing the regression on the cases $n\ge 9$ and extrapolating down to $n=5$ is a pretty severe test and this fit passes with flying colors. I expect it to give four significant figures of accuracy for all $n\ge 5$.

Figure

#
# Expected spread of the j and kth order statistics (k > j) in n
# iid uniform values.
#
sd.r <- function(n,j,k) (k-j)/(n+1)
#
# Expected sd of n iid uniform values.
#
sim <- function(n, effort=10^6) {
  x <- matrix(runif(n * ceiling(effort/n)), ncol=n)
  y <- apply(x, 1, sd)
  mean(y)
}
#
# Study the relationship between sd.r and sim.
#
i <- c(1:7, 9, 15, 30, 300)
system.time({
  d <- replicate(9, t(sapply(i, function(i) c(4*i+1, sim(4*i+1), i))))
})
#
# Plot the results.
#
data <- as.data.frame(matrix(aperm(d, c(2,1,3)), ncol=3, byrow=TRUE))
colnames(data) <- c("n", "y", "i")
data$x <- with(data, sd.r(4*i+1,i+1,3*i+1))

plot(subset(data, select=c(x,y)), col="Gray", cex=1.2,
     xlab="Expected H-spread", ylab="Expected SD (via simulation)")

fit <- lm(y ~ x + I(x/n) + I(x/n^2) - 1, data=subset(data, n > 5))
j <- seq(1, 1000, by=1/4)
x <- sd.r(4*j+1, j+1, 3*j+1)
y <- cbind(x,  x/(4*j+1), x/(4*j+1)^2) %*% coef(fit)
lines(x[-(1:4)], y[-(1:4)], col="#606060", lwd=2, lty=2)
lines(x[(1:5)], y[(1:5)], col="#b0b0b0", lwd=2, lty=3)

points(subset(data, select=c(x,y)), col=rainbow(length(i)), pch=19)
#
# Report the fit.
#
summary(fit)
par(mfrow=c(2,2))
plot(fit)
par(mfrow=c(1,1))
#
# The fit based on all the data.
#
summary(fit <- lm(y ~ x + I(x/n) + I(x/n^2) - 1, data=data))
# 
# An alternative fit (fixing alpha_0).
#
summary(fit <- lm((y - sqrt(1/12))/x ~ I(1/n) + I(1/n^2) + I(1/n^3) - 1, data=data))
Related Question