If the goal of the standard deviation is to summarise the spread of a symmetrical data set (i.e. roughly how far each datum is from the mean), then we need a sensible way of measuring that spread.
The benefits of squaring include:
- Squaring always gives a non-negative value, so the sum will always be zero or higher (raw deviations from the mean cancel out to exactly zero, so on their own they cannot measure spread).
- Squaring emphasizes larger differences, a feature that turns out to be both good and bad (think of the effect outliers have).
Squaring does, however, have a problem as a measure of spread: the units end up squared, whereas we might prefer the spread to be in the same units as the original data (think of squared pounds, squared dollars, or squared apples). Taking the square root returns us to the original units.
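As a minimal sketch of that arithmetic (the values below are made-up numbers, imagined as dollars):

```python
# Variance is the mean of the squared deviations; the standard deviation
# takes the square root so the result is back in the original units.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical values, say in dollars

mean = sum(data) / len(data)                       # 5.0 dollars
squared_devs = [(x - mean) ** 2 for x in data]     # each term is in squared dollars
variance = sum(squared_devs) / len(data)           # 4.0 squared dollars (population variance)
std_dev = variance ** 0.5                          # 2.0 dollars -- same units as the data

print(mean, variance, std_dev)
```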
I suppose you could say that absolute difference assigns equal weight to each deviation, whereas squaring emphasises the extremes. Technically, though, as others have pointed out, squaring makes the algebra much easier to work with and offers properties that the absolute method does not (for example, the variance is equal to the expected value of the square of the distribution minus the square of the mean of the distribution).
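Written out, that identity is

$$\operatorname{Var}(X) = E\!\left[X^2\right] - \left(E[X]\right)^2,$$

which has no comparably simple counterpart for the mean absolute deviation.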
It is important to note, however, that there's no reason you couldn't take the absolute difference if that is how you prefer to view 'spread' (sort of how some people treat 5% as a magical threshold for $p$-values, when in fact it is situation dependent). There are in fact several competing methods for measuring spread.
My view is to use the squared values because I like to think of how it relates to the Pythagorean Theorem of Statistics, $c = \sqrt{a^2 + b^2}$: this also helps me remember that, when working with independent random variables, variances add but standard deviations don't. But that's just my personal, subjective preference, which I mostly use as a memory aid, so feel free to ignore this paragraph.
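If you want to see that fact numerically, here is a rough simulation sketch (the distributions and sample size are arbitrary choices, and it assumes NumPy is available):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent random variables (arbitrary example distributions).
x = rng.normal(loc=0.0, scale=3.0, size=1_000_000)  # sd = 3, variance = 9
y = rng.normal(loc=0.0, scale=4.0, size=1_000_000)  # sd = 4, variance = 16

s = x + y

print(np.var(x) + np.var(y))   # ~25: variances add
print(np.var(s))               # ~25: matches
print(np.std(x) + np.std(y))   # ~7: standard deviations do NOT add
print(np.std(s))               # ~5 = sqrt(3^2 + 4^2), the Pythagorean form
```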
An interesting analysis can be read here:
I think these concepts are easy to explain, so I would rather just describe them here. Many elementary statistics books cover this, including my book "The Essentials of Biostatistics for Physicians, Nurses and Clinicians."
Think of a target with a bulls-eye in the middle. The mean square error represents the average squared distance between an arrow shot at the target and the center. Now if your arrows scatter evenly around the center, then the shooter has no aiming bias and the mean square error is the same as the variance.
But in general the arrows can scatter around a point away from the center of the target. The average squared distance of the arrows from their own center is the variance. That center can be thought of as the shooter's aim point. The distance from the shooter's aim point to the center of the target is the absolute value of the bias.
Now think of a right triangle, where the square of the hypotenuse is the sum of the squares of the two sides. The squared distance from an arrow to the center of the target is the square of the distance from the arrow to the aim point plus the square of the distance from the aim point to the center of the target. Averaging all these squared distances gives the mean square error as the sum of the variance and the squared bias.
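A small simulation of that picture (the aim point, spread, and number of arrows are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Target center at the origin; the shooter's aim point is offset (the bias).
aim_point = np.array([1.5, -0.5])                               # hypothetical aim point
arrows = aim_point + rng.normal(scale=2.0, size=(100_000, 2))   # scatter around the aim point

# Mean squared distance from each arrow to the target center.
mse = np.mean(np.sum(arrows ** 2, axis=1))

# Variance: mean squared distance from each arrow to the arrows' own center.
center = arrows.mean(axis=0)
variance = np.mean(np.sum((arrows - center) ** 2, axis=1))

# Squared bias: squared distance from the arrows' center to the target center.
bias_sq = np.sum(center ** 2)

print(mse, variance + bias_sq)   # the two values agree, up to tiny rounding
```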
In theory, this should be determined by how important different-sized errors are to you; in other words, by your loss function.
In the real world, people put ease of use first: RMS deviations (or the related variances) are easier to combine and easier to calculate in a single pass, while average absolute deviations are more robust to outliers and exist for more distributions. Basic linear regression and many of its offshoots are based on minimising RMS errors.
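As a sketch of the single-pass point (the function names and data are just illustrative):

```python
import math

def single_pass_sd(stream):
    """Standard deviation from one pass, using running sums of x and x**2.
    For large data a Welford-style update is more numerically stable, but the
    point here is that one pass over the data suffices."""
    n, sum_x, sum_x2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        sum_x += x
        sum_x2 += x * x
    mean = sum_x / n
    return math.sqrt(sum_x2 / n - mean * mean)   # population SD

def mean_abs_dev(data):
    """Mean absolute deviation needs the mean first, so two passes over the data."""
    mean = sum(data) / len(data)                           # pass 1
    return sum(abs(x - mean) for x in data) / len(data)   # pass 2

print(single_pass_sd([2, 4, 4, 4, 5, 5, 7, 9]))   # 2.0
print(mean_abs_dev([2, 4, 4, 4, 5, 5, 7, 9]))     # 1.5
```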
Another point is that the mean will minimise RMS deviations while the median will minimise absolute deviations, and you may prefer one of these.
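A quick numerical illustration of that last point (using an arbitrary small, skewed data set and a grid search over candidate centres):

```python
data = [1, 2, 2, 3, 14]   # hypothetical data; mean = 4.4, median = 2

def sum_sq(c):
    return sum((x - c) ** 2 for x in data)

def sum_abs(c):
    return sum(abs(x - c) for x in data)

# Scan candidate centres on a fine grid and see which value wins each criterion.
grid = [i / 100 for i in range(0, 1501)]
best_sq = min(grid, key=sum_sq)     # 4.4, the mean minimises the squared deviations
best_abs = min(grid, key=sum_abs)   # 2.0, the median minimises the absolute deviations

print(best_sq, best_abs)
```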