If the goal of the standard deviation is to summarise the spread of a symmetrical data set (i.e. roughly how far each datum lies from the mean), then we need a sensible way of measuring that spread.
The benefits of squaring include:
- Squaring always gives a non-negative value, so positive and negative deviations cannot cancel out and the sum is always zero or higher.
- Squaring emphasises larger differences, a feature that turns out to be both good and bad (think of the effect outliers have).
Squaring does, however, have a problem as a measure of spread: the units are all squared, whereas we might prefer the spread to be in the same units as the original data (think of squared pounds, squared dollars, or squared apples). Taking the square root returns us to the original units.
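To make those steps concrete, here is a minimal sketch in plain Python, using a small hypothetical data set measured in dollars:

```python
# A minimal sketch of the squaring-then-square-rooting steps,
# using a small hypothetical data set (units: dollars).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(data) / len(data)                    # dollars
squared_devs = [(x - mean) ** 2 for x in data]  # squared dollars
variance = sum(squared_devs) / len(data)        # still squared dollars
std_dev = variance ** 0.5                       # back to dollars

print(mean, variance, std_dev)  # 5.0 4.0 2.0
```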
I suppose you could say that the absolute difference assigns equal weight to each deviation, whereas squaring emphasises the extremes. Technically, though, as others have pointed out, squaring makes the algebra much easier to work with and offers properties that the absolute method does not (for example, the variance is equal to the expected value of the square of the distribution minus the square of the mean of the distribution).
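Written out, that identity follows in one line (writing $\mu = E[X]$):
$$\operatorname{Var}(X) = E\big[(X - \mu)^2\big] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2.$$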
It is important to note, however, that there's no reason you couldn't take the absolute difference if that is how you prefer to view 'spread' (much as some people see 5% as some magical threshold for $p$-values, when in fact it is situation dependent). Indeed, there are several competing methods for measuring spread.
My view is to use the squared values because I like to think of how it relates to the Pythagorean Theorem of Statistics, $c = \sqrt{a^2 + b^2}$: this also helps me remember that when working with independent random variables, variances add but standard deviations don't. But that's just my personal subjective preference, which I mostly use as a memory aid; feel free to ignore this paragraph.
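If you want to see that numerically, here is a quick check (the two independent normal variables and their parameters are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Two independent random variables (hypothetical parameters).
x = rng.normal(0.0, 3.0, n)  # sd = 3, variance = 9
y = rng.normal(0.0, 4.0, n)  # sd = 4, variance = 16

s = x + y
print(np.var(s))  # ~25 = 9 + 16: variances add
print(np.std(s))  # ~5 = sqrt(3**2 + 4**2), not 3 + 4 = 7
```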
It sounds like you're talking about what's sometimes called a regressogram, with a log-scaled x-variable.
There are a number of issues here, not necessarily in logical order:
- The quantity you're plotting is a mean, so if you want to plot the median absolute deviation, it's the MAD of the means you want.
- Your suggestion $\text{MAD}/\sqrt n$ leads to the question "when is the MAD of the mean equal to the MAD of the data divided by $\sqrt n$?"
- When you say "it seems that median absolute deviation is a better estimator than mean absolute deviation", that depends on what we're talking about: a better estimator of what, and under what circumstances?
So, "when is the MAD of the mean equal to the MAD of the data divided by $\sqrt n$?"
The answer is that, unlike the situation with the standard deviation, this is not generally the case. The reason standard deviations of averages scale as they do is that variances of independent random variables add (more precisely, the variance of the sum is the sum of the variances when the variables are independent), irrespective of the distributions of the components (as long as the variances all exist). It is this property that largely accounts for the popularity of variances and standard deviations.
Neither the median deviation nor the mean deviation has that property in general.
However, when the data are normal, they will in effect inherit that property, since the ratio of the population mean deviation (or median deviation) to the standard deviation is a constant for the normal distribution, normals are closed under convolution, and standard deviations scale that way.
If the data were reasonably close to normal, that approximation could perhaps be adequate.
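A small simulation makes this concrete. The sketch below (numpy only; $n = 30$ and the exponential as the skewed example are arbitrary choices) compares the MAD of the sample means with the MAD of the data divided by $\sqrt n$:

```python
import numpy as np

rng = np.random.default_rng(1)

def mad(x):
    """Median absolute deviation from the median (unscaled)."""
    return np.median(np.abs(x - np.median(x)))

n, reps = 30, 20_000

for name, draw in [("normal", lambda size: rng.normal(0.0, 1.0, size)),
                   ("exponential", lambda size: rng.exponential(1.0, size))]:
    samples = draw((reps, n))                   # reps samples of size n
    mad_of_means = mad(samples.mean(axis=1))    # MAD of the sample means
    scaled = mad(samples.ravel()) / np.sqrt(n)  # MAD of the data / sqrt(n)
    print(f"{name}: MAD of means = {mad_of_means:.4f}, MAD/sqrt(n) = {scaled:.4f}")
```

For the normal the two quantities roughly agree; for the exponential they differ by a substantial factor.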
What else might be done? One way to estimate the standard error of a statistic is via the bootstrap; for the mean deviation, being itself a mean, this should do well in large samples. Unfortunately, medians don't do so well under the bootstrap, and this issue carries over to median absolute deviations.
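As a sketch of that approach (numpy only; a hypothetical exponential sample stands in for real data):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(1.0, 200)  # hypothetical observed sample

def mean_abs_dev(x):
    return np.mean(np.abs(x - np.mean(x)))

# Resample with replacement and recompute the statistic each time;
# the spread of the replicates estimates its standard error.
boot = np.array([mean_abs_dev(rng.choice(data, size=data.size, replace=True))
                 for _ in range(5000)])
print(mean_abs_dev(data), boot.std(ddof=1))
```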
If you have some probability model for your data, there's also simulation as a way of approaching the problem.
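For example, if an exponential model were plausible for the data (purely an assumption here), a parametric simulation of the sampling distribution of the MAD might look like this:

```python
import numpy as np

rng = np.random.default_rng(3)

def mad(x):
    return np.median(np.abs(x - np.median(x)))

# Simulate fresh samples from the assumed (hypothetically fitted) model
# to see how the MAD varies from sample to sample.
n, scale = 200, 1.0  # hypothetical sample size and fitted scale
sim = np.array([mad(rng.exponential(scale, n)) for _ in range(5000)])
print(sim.mean(), sim.std(ddof=1))  # average MAD and its standard error
```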
Robustness to outliers is a double-edged sword. Sometimes we want to estimate things in a way that is robust to outliers, meaning that large outliers do not throw the estimate off. At other times we want large outliers to show up, so we want to estimate things in a way that is not robust to outliers. Similarly with measures of spread: sometimes we want a measure that is robust to outliers, so that large outliers do not increase it; at other times we want the measure of spread to reflect the presence of large outliers by manifesting in a larger value.
In decision theory, issues like this are dealt with by specifying a penalty/loss function which penalises you for your error in estimating a quantity. Two common loss functions are absolute-error loss and squared-error loss (illustrated in the plots in this answer by Jean-Paul).
Absolute-error loss penalises you according to the absolute deviation of your estimate from the true value. This form of loss function leads to estimation using medians. This form of loss function is robust to outliers in the sense that outliers contribute a penalty that is proportionate to their size. Measures of spread in this context reflect the expected loss of a particular estimate of central location, with the expected loss being a weighted sum of absolute deviations from the estimated central location.
Squared-error loss penalises you according to the squared deviation of your estimate from the true value. This form of loss function leads to estimation using means. This form of loss function is sensitive to outliers in the sense that outliers contribute a penalty that is proportionate to their squared deviation, which magnifies the effect of large outliers. Measures of spread in this context reflect the expected loss of a particular estimate of central location, with the expected loss being a weighted sum of squared deviations from the estimated central location.
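A quick numerical illustration of why these two loss functions lead to the median and the mean respectively (plain numpy, hypothetical data with one outlier): the candidate value minimising total absolute loss turns out to be the median, while the one minimising total squared loss turns out to be the mean.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical, with one outlier
grid = np.linspace(0.0, 100.0, 100001)        # candidate estimates

abs_loss = np.abs(data[:, None] - grid).sum(axis=0)
sq_loss = ((data[:, None] - grid) ** 2).sum(axis=0)

print(grid[abs_loss.argmin()], np.median(data))  # 3.0 = median minimises absolute loss
print(grid[sq_loss.argmin()], data.mean())       # 22.0 = mean minimises squared loss
```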
In regard to the choice between the median absolute deviation and the standard deviation, these same considerations apply. The former is a measure of spread that represents expected absolute-error loss and is more robust to outliers; in this case, outliers do not manifest in large increases in the measure of spread. The latter is a measure of spread that represents expected squared-error loss and is more sensitive to outliers; in this case, outliers will manifest in large increases in the measure of spread.
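To see that contrast directly, compare the two measures on the same hypothetical data before and after adding a single large outlier:

```python
import numpy as np

def mad(x):
    """Median absolute deviation from the median (unscaled)."""
    return np.median(np.abs(x - np.median(x)))

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(clean, 100.0)

print(clean.std(ddof=1), mad(clean))                # ~1.58 and 1.0
print(with_outlier.std(ddof=1), mad(with_outlier))  # ~39.6 and 1.5
```

One outlier multiplies the standard deviation roughly twenty-five-fold while the MAD barely moves.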