Solved – Intuition behind standard deviation

intuition, standard deviation

I'm trying to gain a better intuitive understanding of standard deviation.

From what I understand, it is representative of the average distance of the observations in a data set from the mean of that data set. However, it is NOT actually equal to the average of those distances, as it gives more weight to observations further from the mean.

Say I have the following population of values: $\{1, 3, 5, 7, 9\}$

The mean is $5$.

If I take a measure of spread based on absolute values, I get

$$\frac{\sum_{i = 1}^5 |x_i - \mu|}{5} = 2.4$$

If I instead take the standard deviation as my measure of spread, I get

$$\sqrt{\frac{\sum_{i = 1}^5 (x_i - \mu)^2}{5}} \approx 2.83$$

The result using standard deviation is larger, as expected, because of the extra weight it gives to values further from the mean.
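A quick numeric check of both measures (a minimal sketch using NumPy; the array is just the example population above):

```python
import numpy as np

x = np.array([1, 3, 5, 7, 9])
mu = x.mean()                          # population mean: 5.0

mad = np.abs(x - mu).mean()            # mean absolute deviation: 2.4
sd = np.sqrt(((x - mu) ** 2).mean())   # population standard deviation: ~2.83

print(mad, sd)    # 2.4 2.8284...
print(np.std(x))  # same as sd; np.std uses the population (ddof=0) formula by default
```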

But if I were just told that I was dealing with a population with a mean of $5$ and a standard deviation of $2.83$, how would I infer that the population was composed of values something like $\{1, 3, 5, 7, 9\}$? The figure of $2.83$ just seems very arbitrary; I don't see how you are supposed to interpret it. Does $2.83$ mean the values are spread very wide, or are they all tightly clustered around the mean?

When you are presented with a statement that you are dealing with a population with a mean of $5$ and a standard deviation of $2.83$, what does that tell you about the population?

Best Answer

My intuition is that the standard deviation is a measure of the spread of the data.

You have a good point that whether it is wide or tight depends on our underlying assumption about the distribution of the data.

Caveat: a measure of spread is most helpful when the distribution of your data is symmetric around the mean and shaped roughly like the Normal distribution, that is, when the data is approximately Normal.

In the case where data is approximately Normal, the standard deviation has a canonical interpretation:

  • Region: sample mean $\pm$ 1 standard deviation contains roughly 68% of the data
  • Region: sample mean $\pm$ 2 standard deviations contains roughly 95% of the data
  • Region: sample mean $\pm$ 3 standard deviations contains roughly 99.7% of the data

(see the first graphic in the Wikipedia article)
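As a sanity check, the rule is easy to verify by simulation (a sketch assuming the data really is Normal; the loc/scale values echo the example's mean 5 and standard deviation 2.83):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2.83, size=1_000_000)
mean, sd = sample.mean(), sample.std()

for k in (1, 2, 3):
    inside = np.mean(np.abs(sample - mean) <= k * sd)
    print(f"mean +/- {k} sd covers {inside:.1%}")  # ~68.3%, ~95.4%, ~99.7%
```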

This means that if we know the population mean is 5 and the standard deviation is 2.83, and we assume the distribution is approximately Normal, then I can be reasonably certain that if we make (a great) many observations, only about 5% of them will be smaller than $-0.66 = 5 - 2 \times 2.83$ or bigger than $10.66 = 5 + 2 \times 2.83$.
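Under the Normality assumption, the exact tail mass can be read off the Normal CDF (a sketch using scipy.stats; the numbers are the example's mean and standard deviation):

```python
from scipy.stats import norm

mu, sd = 5, 2.83
lo, hi = mu - 2 * sd, mu + 2 * sd  # -0.66 and 10.66

outside = norm.cdf(lo, mu, sd) + norm.sf(hi, mu, sd)
print(f"{outside:.1%} of observations fall outside [{lo:.2f}, {hi:.2f}]")  # ~4.6%
```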

Notice the impact of the standard deviation on this interval: the more spread, the more uncertainty.

Furthermore, in the general case where the data is not even approximately Normal, but is still symmetric, you know that there exists some $\alpha$ for which:

  • Region: sample mean $\pm$ $\alpha$ standard deviations contains roughly 95% of the data

You can either learn $\alpha$ from a sub-sample, or assume $\alpha = 2$; this often gives you a good rule of thumb for working out in your head what future observations to expect, or which new observations can be considered outliers. (Keep the caveat in mind, though!)
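One way to learn $\alpha$ from data is to look at how many standard deviations away from the mean the sample points actually sit (a sketch; alpha_for_coverage is a hypothetical helper, not a library function):

```python
import numpy as np

def alpha_for_coverage(sample, coverage=0.95):
    """Smallest alpha such that mean +/- alpha * sd covers the given fraction of the sample."""
    z = np.abs(sample - sample.mean()) / sample.std()
    return np.quantile(z, coverage)

rng = np.random.default_rng(1)
print(alpha_for_coverage(rng.normal(5, 2.83, 100_000)))  # ~1.96 for Normal data
print(alpha_for_coverage(rng.laplace(5, 2.0, 100_000)))  # heavier tails give a larger alpha
```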

I don't see how you are supposed to interpret it. Does 2.83 mean the values are spread very wide or are they all tightly clustered around the mean...

I guess every question asking "wide or tight?" should also specify: "in relation to what?" One suggestion is to use a well-known distribution as a reference. Depending on the context, it might be useful to ask: "Is it much wider, or tighter, than a Normal/Poisson?"

EDIT: Based on a useful hint in the comments, here is one more aspect of the standard deviation, as a distance measure.

Yet another intuition for the usefulness of the standard deviation $s_N$ is that it is a distance measure between the sample data $x_1, \ldots, x_N$ and its mean $\bar{x}$:

$$s_N = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2}$$

As a comparison, the mean squared error (MSE), one of the most popular error measures in statistics, is defined as:

$$\operatorname{MSE} = \frac{1}{n}\sum_{i=1}^n (\hat{Y}_i - Y_i)^2$$
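Side by side, $s_N$ is the root-mean-square distance of the data from its own mean, while the (root) MSE is the same kind of distance between predictions and targets (a sketch; the y and y_hat arrays are made-up illustrative values):

```python
import numpy as np

x = np.array([1, 3, 5, 7, 9])
s_N = np.sqrt(np.mean((x - x.mean()) ** 2))  # distance of the data from its mean: ~2.83

y = np.array([1.0, 3.0, 5.0])      # hypothetical targets
y_hat = np.array([1.5, 2.5, 5.5])  # hypothetical predictions
mse = np.mean((y_hat - y) ** 2)    # 0.25
rmse = np.sqrt(mse)                # 0.5, back on the scale of y
```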

The question can be raised: why this particular distance function? Why squared distances, and not absolute distances, for example? And why are we taking the square root?

Quadratic distance (or error) functions have the advantage that we can both differentiate them and easily minimise them. As far as the square root is concerned, it adds to interpretability by converting the error back to the scale of the observed data.
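To make the "easily minimise" point concrete, here is the standard one-line calculus argument that the sample mean is exactly the point minimising the sum of squared distances:

$$\frac{d}{dc}\sum_{i=1}^N (x_i - c)^2 = -2\sum_{i=1}^N (x_i - c) = 0 \quad\Longrightarrow\quad c = \frac{1}{N}\sum_{i=1}^N x_i = \bar{x}$$

The absolute-distance criterion, by contrast, is minimised at the median and is not differentiable there, which is one reason the squared version is so much more convenient to work with.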
