If the goal of the standard deviation is to summarise the spread of a symmetrical data set (i.e. in general how far each datum is from the mean), then we need a good way of measuring that spread.
The benefits of squaring include:
- Squaring always gives a non-negative value, so positive and negative deviations cannot cancel out and the sum will always be zero or higher.
- Squaring emphasizes larger differences, a feature that turns out to be both good and bad (think of the effect outliers have).
Squaring does, however, have a problem as a measure of spread: the units are all squared, whereas we might prefer the spread to be in the same units as the original data (think of squared pounds, squared dollars, or squared apples). Taking the square root therefore returns us to the original units.
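To make the units point concrete, here is a tiny R sketch with made-up weights in pounds:

```r
x <- c(150, 160, 170, 180, 190)       # weights in pounds

mean((x - mean(x))^2)                 # mean squared deviation: 200 "squared pounds"
sqrt(mean((x - mean(x))^2))           # square root brings us back to pounds
sd(x)                                 # R's sd() divides by n - 1, so it differs slightly
```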
I suppose you could say that absolute difference assigns equal weight to each datum's deviation, whereas squaring emphasises the extremes. Technically though, as others have pointed out, squaring makes the algebra much easier to work with and offers properties that the absolute method does not (for example, the variance is equal to the expected value of the square of the distribution minus the square of the mean of the distribution, as derived below).
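For the record, that identity follows in one line from expanding the square and using linearity of expectation:

$$\operatorname{Var}(X) = \operatorname{E}\left[(X - \operatorname{E}[X])^2\right] = \operatorname{E}[X^2] - 2\operatorname{E}[X]\,\operatorname{E}[X] + (\operatorname{E}[X])^2 = \operatorname{E}[X^2] - (\operatorname{E}[X])^2$$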
It is important to note, however, that there's no reason you couldn't take the absolute difference if that is how you prefer to view 'spread' (sort of how some people see 5% as some magical threshold for $p$-values, when in fact it is situation dependent). Indeed, there are several competing methods for measuring spread; a few are compared in the sketch below.
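Here is a small R sketch comparing a few of those competing measures on made-up data with one outlier:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 40)   # 40 is an outlier

sd(x)                    # standard deviation: sensitive to the outlier
mean(abs(x - mean(x)))   # mean absolute deviation
mad(x)                   # median absolute deviation (scaled): robust to the outlier
IQR(x)                   # interquartile range: also robust
```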
My view is to use the squared values because I like to think of how it relates to the Pythagorean Theorem of Statistics: $c = \sqrt{a^2 + b^2}$. This also helps me remember that when working with independent random variables, variances add, standard deviations don't. But that's just my personal subjective preference, which I mostly use as a memory aid, so feel free to ignore this paragraph.
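As a quick check of that mnemonic, here is a small simulation sketch (the sample size and scales are arbitrary):

```r
set.seed(1)
a <- rnorm(1e5, sd = 3)   # independent random variables
b <- rnorm(1e5, sd = 4)

var(a + b)      # close to 3^2 + 4^2 = 25: variances add
sd(a + b)       # close to 5 = sqrt(3^2 + 4^2), the Pythagorean c
sd(a) + sd(b)   # close to 7: standard deviations do NOT add
```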
I think there are two potential sources of confusion here: (1) What the variance pertains to. (2) What kind of intervals are computed.
The variance is the predicted variance of the response and not the variance of the predicted mean. Thus, it does not correspond to `predict(..., se.fit = TRUE)` in an `lm()` regression. Instead, it would correspond to the residual variance, which is assumed to be constant in `lm()` regressions but not in beta regressions.

As the variance pertains to the response (see 1), you should not use +/- 2 times the standard deviation to obtain a confidence interval. A normal approximation makes no sense here. Instead, it would be better to look at the, say, 2.5% and 97.5% quantiles of the predicted beta distribution. You can obtain these in `betareg` with `predict(..., type = "quantile", at = c(0.025, 0.975))`.

Finally, if you are indeed looking for the equivalent of `predict.lm(..., se.fit = TRUE)`, then `betareg` currently has no infrastructure for this. An alternative would be to bootstrap the predictions yourself. Or you could look at what the package `lsmeans` offers for `betareg` objects.
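For concreteness, here is a minimal sketch of the quantile approach, using the `GasolineYield` example data that ships with `betareg` (the model formula is just the package's standard example, not your model):

```r
library(betareg)

data("GasolineYield", package = "betareg")
m <- betareg(yield ~ batch + temp, data = GasolineYield)

# 2.5% and 97.5% quantiles of the predicted beta distribution:
# a 95% interval for the response itself, not for the mean
q <- predict(m, type = "quantile", at = c(0.025, 0.975))
head(q)
```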
Best Answer
In some sense this is a trivial question, but in another, it is actually quite deep!
As others have mentioned, taking the square root implies $\operatorname{Stdev}(X)$ has the same units as $X$.
Taking the square root gives you absolute homogeneity, aka absolute scalability. For any scalar $\alpha$ and random variable $X$, we have: $$ \operatorname{Stdev}[\alpha X] = |\alpha| \operatorname{Stdev}[X]$$ Absolute homogeneity is a required property of a norm. The standard deviation can be interpreted as a norm (on the vector space of mean-zero random variables) in a similar way that $\sqrt{x^2 + y^2+z^2}$ is the standard Euclidean norm in a three-dimensional space. The standard deviation is a measure of distance between a random variable and its mean.
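A quick numerical check of absolute homogeneity (the data and the scalar are arbitrary):

```r
set.seed(42)
x <- rnorm(1000)
alpha <- -2.5

sd(alpha * x)        # matches |alpha| * sd(x)
abs(alpha) * sd(x)
```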
Standard deviation and the $L_2$ norm
Finite dimension case:
In an $n$ dimensional vector space, the standard Euclidean norm, aka the $L_2$ norm, is defined as:
$$\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$$
More broadly, the $p$-norm $\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p \right)^{\frac{1}{p}}$ takes the $p$th root to get absolute homogeneity: $\|\alpha \mathbf{x}\|_p = \left( \sum_i |\alpha x_i|^p \right)^\frac{1}{p} = | \alpha | \left( \sum_i |x_i|^p \right)^\frac{1}{p} = |\alpha | \|\mathbf{x}\|_p $.
If you have weights $q_i$ then the weighted sum $\sqrt{\sum_i x_i^2 q_i}$ is also a valid norm. Furthermore, it's the standard deviation if the $q_i$ represent probabilities and $\operatorname{E}[\mathbf{x}] \equiv \sum_i x_i q_i = 0$.
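To see the weighted norm coincide with the standard deviation, here is a tiny R sketch with a made-up mean-zero discrete distribution:

```r
x <- c(-2, 1)        # values of a discrete random variable
q <- c(1/3, 2/3)     # probabilities summing to 1

sum(x * q)           # E[x] = 0, so the weighted norm below is the sd
sqrt(sum(x^2 * q))   # sqrt(4/3 + 2/3) = sqrt(2): the standard deviation
```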
Infinite dimension case:
In an infinite-dimensional Hilbert space we may similarly define the $L_2$ norm:
$$ \|X\|_2 = \sqrt{\int_\Omega X(\omega)^2 \, dP(\omega) }$$
If $X$ is a mean-zero random variable and $P$ is the probability measure, what's the standard deviation? It's the same: $\sqrt{\int_\Omega X(\omega)^2 \, dP(\omega) }$.
Summary:
Taking the square root means the standard deviation satisfies absolute homogeneity, a required property of a norm.
On a space of random variables, $\langle X, Y \rangle = \operatorname{E}[XY]$ is an inner product and $\|X\|_2 = \sqrt{\operatorname{E}[X^2]}$ is the norm induced by that inner product. Thus the standard deviation is the norm of a demeaned random variable: $$\operatorname{Stdev}[X] = \|X - \operatorname{E}[X]\|_2$$ It's a measure of the distance from the mean $\operatorname{E}[X]$ to $X$.
(Technical point: while $\sqrt{\operatorname{E}[X^2]}$ is a norm, the standard deviation $\sqrt{\operatorname{E}[(X - \operatorname{E}[X])^2]}$ isn't a norm over random variables in general, because a requirement for a normed vector space is that $\|x\| = 0$ if and only if $x = \mathbf{0}$. A standard deviation of 0 doesn't imply the random variable is the zero element.)
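For instance, a constant sample is not the zero element, yet its standard deviation is 0:

```r
x <- rep(7, 100)   # a constant "random variable"
sd(x)              # 0, even though x is not the zero vector
```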