I've been trying to find information on the sampling distribution of the standard deviation for uniform distributions and have been having a heck of a time figuring out the expected value for the standard deviation of a sample. Hopefully someone can help out!
Basically, I'm trying to find the expected standard deviation given sample size n. Given a uniform, continuous distribution with population range (b-a), I know that the population standard deviation would be (b-a)/sqrt(12); however, what would be the expected standard deviation for a sample size n drawn from that continuous uniform distribution?
My hope is that given a Uniform Distribution with range W (or "PopRange" in the example below), I would like to calculate an expected standard deviation for a given sample size. Since the mean sample standard deviation changes as a function of the sample size, I'm hoping to quantify it. Here's some MATLAB code illustrating what I am hoping to achieve.
% Number of samples for the brute-force estimation
totalSamples = 10000;
% The population range (b-a)
PopRange = 30;
for n = 2:100
% Draw samples of sample size n
theSamples = rand(n,totalSamples)*PopRange;
% Calculate the mean of the standard deviations and ranges. This
% should be equivalent to the means of the sampling distributions
% of the standard deviation and range
uniformSampleSD(n) = mean(std(theSamples,[],1));
uniformSampleRange(n) = mean(max(theSamples,[],1)-min(theSamples,[],1));
% Calculate the expected standard deviation, note that this is just the
% population standard deviation at the moment and does not incorporate
% the sample size
uniformSampleESD(n) = PopRange/sqrt(12);
% SEE: https://en.wikipedia.org/wiki/Prediction_interval#Non-parametric_methods
% Calculate the expected sample range
uniformSampleERange(n) = (n-1)/(n+1)*PopRange;
end
% Note the red curve underestimating the population standard deviation for
% smaller sample sizes? This is fine, but I would like to find a numerical
% estimate for it
figure
plot(uniformSampleSD,'r')
hold on
plot(uniformSampleESD,'b')
Any help or pointers would be greatly appreciated!
Edit:
Perhaps I mis-worded what I am looking for? I am looking for the mean of the sampling distribution of the standard deviation for a continuous distribution.
If I draw 100000 sets from a uniform distribution [0,1] of sample size n, the distribution of standard deviations will not have a mean of sqrt((1-0)^2 / 12) = 0.289. Empirically, the mean of the sampling distribution of the sd will be:
n = 2, sd_mean = 0.235
n = 3, sd_mean = 0.263
n = 4, sd_mean = 0.273
n = 5, sd_mean = 0.278
n = 6, sd_mean = 0.280
n = 7, sd_mean = 0.282
n = 8, sd_mean = 0.283
n = 9, sd_mean = 0.284
n = 10, sd_mean = 0.285
Note that it asymptotically approaches sqrt((b – a)^2 / 12), but that this is inaccurate for small n. See these examples of the sampling distributions for those small n:
http://imgur.com/a/Asgkz (note: the "population" and "observed" sd line labels were accidentally swapped)
These were created using this code:
totalSamples = 100000;
for n = 2:10
theSamples = rand(n,totalSamples);
setSDs = std(theSamples,[],1);
popSD = sqrt((1-0)^2/12);
nelements = hist(setSDs,50);
figure
hist(setSDs,50)
hold on
plot([popSD popSD],[0,max(nelements)],'r')
plot([mean(setSDs) mean(setSDs)],[0,max(nelements)],'g')
xlabel('Count')
ylabel('Sample Standard Deviation')
title({'Sampling Distribution of the Standard Deviation with 100000 sets';sprintf('drawn from a uniform distribution [0,1] with sample size %u',n)})
legend('Observations',sprintf('Pop SD = %f',popSD),sprintf('Observed Mean Sample SD = %f',mean(setSDs)))
end
Best Answer
The integration is difficult even with as few as $3$ values. Why not estimate the bias in the sample SD by using a surrogate measure of spread? One set of choices is afforded by differences in the order statistics.
Consider, for instance, Tukey's H-spread. For a data set of $n$ values, let $m = \lfloor\frac{n+1}{2}\rfloor$ and set $h = \frac{m+1}{2}$. Let $n$ be such that $h$ is integral; values $n = 4i+1$ will work. In these cases the H-spread is the difference between the $h^\text{th}$ highest value $y$ and $h^\text{th}$ lowest value $x$. (For large $n$ it will be very close to the interquartile range.) The beauty of using the H-spread is that, being based on order statistics, its distribution can be obtained analytically, because the joint PDF of the $j,k$ order statistics $(x,y)$ is proportional to
$$x^{j-1}(1-y)^{n-k}(y-x)^{k-j-1},\ 0\le x\le y\le 1.$$
From this we can obtain the expectation of $y-x$ as
$$s(n; j,k) = \mathbb{E}(y-x) = \frac{k-j}{n+1}.$$
Set $j=h$ and $k=n+1-h$ for the H-spread itself. When $n=4i+1$, $j=i+1$ and $k=3i+1$, whence $s(4i+1; i+1, 3i+1)=\frac{2i}{4i+1}.$
At this point, consider regressing simulated (or even calculated values) of the expected SD ($sd(n)$) against the H-spreads $s(4i+1,i+1,3i+1) = s(n).$ We might expect to find an asymptotic series for $sd(n)/s(n)$ in negative powers of $n$:
$$sd(n)/s(n) = \alpha_0 + \alpha_1 n^{-1} + \alpha_2 n^{-2} + \cdots.$$
By spending two minutes to simulate values of $sd(n)$ and regressing them against computed values of $s(n)$ in the range $5\le n\le 401$ (at which point the bias becomes very small), I find that $\alpha_0 \approx 0.5774$ (which estimates $2\sqrt{1/12}\approx 0.57735$), $\alpha_1\approx 1.091,$ and $\alpha_2 \approx 1.$ The fit is excellent. For instance, basing the regression on the cases $n\ge 9$ and extrapolating down to $n=5$ is a pretty severe test and this fit passes with flying colors. I expect it to give four significant figures of accuracy for all $n\ge 5$.