Solved – Standard deviation of (assumed) normal distribution

Tags: mean, normal distribution, standard deviation

Say I measure a system for $N$ periods of time $T$. In each of these time periods $T_i$, with $1 \leq i \leq N$, I get $M_i$ occurrences.

I have reason to assume that the variable $M$ follows a normal distribution. I believe that the calculation of the mean of this distribution is straightforward: just take the average. I am not so sure about the standard deviation, though:

At first I thought I should just do it the "regular way", by taking the square root of the average of the squared deviations from the mean, but now I'm not sure this is completely correct. The mean I get will obviously have some uncertainty, as I don't have the time to make an infinite number of measurements. Shouldn't this somehow be taken into account when calculating the standard deviation? After all, the standard deviation is a function of the mean, but it's not possible to know the mean of the distribution without uncertainty. So I can only know the mean of my sample, which is not the same as the mean of the distribution. What I want to know is how this difference propagates to the standard deviation.
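(To make the computation concrete, here is a minimal Python/NumPy sketch of the two calculations in question; the counts in `M` are made up, and the `ddof` argument controls whether the divisor is $N$ or $N-1$, which is exactly the point the answer below turns on.)

```python
import numpy as np

# Hypothetical counts M_i from N = 8 measurement periods (illustrative values only).
M = np.array([12, 15, 9, 14, 11, 13, 10, 16])

mean_M = M.mean()            # sample mean: the straightforward average
sd_mle = M.std(ddof=0)       # "regular way": divide the sum of squared deviations by N
sd_bessel = M.std(ddof=1)    # Bessel-corrected: divide by N - 1 instead

print(mean_M, sd_mle, sd_bessel)
```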

Best Answer

(Let's set aside the question of whether your data are normally distributed. It isn't relevant to the issue at hand.)

It is true that when you estimate a mean from sample data, you get an imprecise estimate in the sense that it is subject to sampling error. Moreover, anything you estimate subsequently that takes the mean into account (e.g., the standard deviation, the skewness, a regression model, etc.) is made more imprecise than it otherwise might have been by virtue of relying on an imperfect estimate of the mean. There isn't really anything that you have to do about this, though. This is all well understood and is automatically addressed in the various formulas that have been derived for statistical quantities.

In contrast to the arithmetic average of a sample, which is an unbiased estimator of the population mean, the 'population' variance formula (the maximum likelihood estimator of the variance) is a biased estimator of the population variance. That is because you have 'consumed' a degree of freedom from your data by previously estimating the mean. With small samples (technically, any sample size $<\infty$), the variance estimated by maximum likelihood will be too small on average. Bessel's correction (dividing the sum of squared deviations from the mean by $N-1$ instead of $N$) was developed to correct for this by inflating the estimated variance slightly. (For what it's worth, the standard deviation, when estimated by taking the square root of the sample variance, is again a biased estimate; see here.)
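To make the bias concrete, here is a small simulation sketch (assuming NumPy; the true parameters, sample size, and seed are arbitrary choices, not from the original post). It draws many small normal samples and averages the two variance estimators:

```python
import numpy as np

rng = np.random.default_rng(0)

true_sd = 2.0      # assumed population standard deviation (true variance = 4.0)
n = 5              # small sample size, where the bias is most visible
reps = 200_000     # number of simulated samples

samples = rng.normal(loc=10.0, scale=true_sd, size=(reps, n))
var_mle = samples.var(axis=1, ddof=0)       # divide by N:   biased low on average
var_bessel = samples.var(axis=1, ddof=1)    # divide by N-1: unbiased for the variance

print("true variance:          ", true_sd**2)                  # 4.0
print("mean of MLE estimate:   ", var_mle.mean())              # ~ 4.0 * (n-1)/n = 3.2
print("mean of Bessel estimate:", var_bessel.mean())           # ~ 4.0
print("mean of sqrt(s^2):      ", np.sqrt(var_bessel).mean())  # still below 2.0
```

With $n = 5$ the maximum likelihood estimate averages roughly $\sigma^2 (N-1)/N = 3.2$, the Bessel-corrected estimate averages close to the true 4.0, and the square root of the corrected variance still comes out slightly below the true standard deviation, matching the parenthetical remark above.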

The uncertainty in the estimated value of the population mean does propagate as I noted above. But it doesn't show up in the point estimate of (say) the standard deviation. Instead, it has the effect of widening the confidence interval for the estimate¹. If you knew the mean a priori, then the confidence interval would be calculated differently (and should be narrower)².

1. It is when you calculate the confidence interval for the standard deviation that it really matters whether the underlying distribution is normal, cf. here and here.
2. N.B., I don't know what the formula would be; I've never seen it. Presumably someone has worked it out, but only as a curiosity: we just don't have many situations where the mean is known a priori, the distribution is exactly normal, and we need to estimate the standard deviation with its confidence interval.