Solved – Why the first moment is standardized before computing higher moments, but higher moments are not


Wikipedia says:

For the second and higher moments, the central moments (moments about the mean, with c being the mean) are usually used rather than the moments about zero, because they provide clearer information about the distribution's shape.
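In the notation of that article, for a random variable $X$ with mean $\mu = \operatorname{E}[X]$, the $k$th moment about zero and the $k$th central moment are

$$\mu'_k = \operatorname{E}\left[X^k\right], \qquad \mu_k = \operatorname{E}\left[(X - \mu)^k\right].$$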

Could someone explain/convince me why this is true? Why is there a discrepancy?
This has always bugged me and I have never seen a good explanation for it. I just don't quite understand why or how standardization provides "clearer" information in one case but not in another.

For example:

  1. To compute the skewness, why not standardize both the mean and the variance?
  2. To compute the kurtosis, why not standardize the mean, the variance, and the skewness?
  3. To compute the $n$th moment, why not first standardize all the $m$th moments for $m < n$?
    If standardization is useful, then why only do this for $m = 1$?

Best Answer

Since the question was updated, I have updated my answer:

The first part (To compute the skewness, why not standardize both the mean and the variance?) is easy: that is precisely how it's done! See the definitions of skewness and kurtosis on Wikipedia.
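For reference, those definitions make skewness and kurtosis the third and fourth standardized moments, i.e. moments taken after both the mean and the variance have been normalized away:

$$\text{skewness} = \operatorname{E}\left[\left(\frac{X - \mu}{\sigma}\right)^{3}\right] = \frac{\mu_3}{\sigma^3}, \qquad \text{kurtosis} = \operatorname{E}\left[\left(\frac{X - \mu}{\sigma}\right)^{4}\right] = \frac{\mu_4}{\sigma^4}.$$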

The second part is both easy and hard. On one hand we could say that it is impossible to normalize a random variable to satisfy three moment conditions, since a linear transformation $X \to aX + b$ has only two free parameters. But on the other hand, why should we limit ourselves to linear transformations? Sure, shift and scale are by far the most prominent (maybe because they are sufficient most of the time, say for limit theorems), but what about higher-order polynomials, or taking logs, or convolving the variable with itself? In fact, isn't that what the Box-Cox transform is all about: removing skew?
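As a quick illustration of that last point, here is a minimal sketch (the lognormal sample, seed, and sample size are arbitrary choices, not from the original answer) showing scipy's Box-Cox transform driving the sample skewness toward zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # positive, strongly right-skewed

y, lam = stats.boxcox(x)  # lambda is chosen by maximum likelihood

print(f"skewness before: {stats.skew(x):.2f}")  # large and positive
print(f"skewness after:  {stats.skew(y):.2f}")  # close to zero
print(f"fitted lambda:   {lam:.2f}")            # near 0, i.e. roughly a log transform
```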

But in the case of more complicated transformations, I think, the context and the transformation itself become important, and maybe that is why there are no more "moments with names". That does not mean that random variables are not transformed or that the moments are not calculated; on the contrary. You just choose your transformation, calculate what you need, and move on.


The old answer, about why central moments represent shape better than raw moments:

The keyword is shape. As whuber suggested, by shape we want to consider the properties of the distribution that are invariant to translation and scaling. That is, when you consider the variable $X + c$ instead of $X$, you get the same distribution function (just shifted to the right or left), so we would like to say that its shape stayed the same.

The raw moments do change when you translate the variable, so they reflect not only the shape but also the location. In fact, you can take any random variable and shift it, $X \to X + c$, appropriately to get any value for, say, its raw third moment.

The same observation holds for all odd moments, and to a lesser extent for even moments (as you vary the shift, an even raw moment is bounded from below, and that lower bound does depend on the shape).

The central moments, on the other hand, do not change when you translate the variable, and that is why they are more descriptive of the shape. For example, if an even central moment is large, you know that the random variable has some mass not too close to the mean. Or if an odd central moment is zero, that suggests some symmetry around the mean (symmetry implies it, though the converse need not hold).
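A quick numerical check of both claims (a minimal sketch; the exponential sample is just an arbitrary skewed example):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)  # a skewed distribution

def raw_moment(a, k):
    return np.mean(a ** k)                    # moment about zero

def central_moment(a, k):
    return np.mean((a - a.mean()) ** k)       # moment about the (sample) mean

for c in (0.0, 5.0, -3.0):
    y = x + c                                 # pure translation: same shape
    print(f"shift {c:+.0f}: raw mu3' = {raw_moment(y, 3):9.2f},"
          f" central mu3 = {central_moment(y, 3):.2f}")
# the raw third moment swings with the shift; the central one stays put
```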

The same argument extends to scale, which is the transformation $X \to cX$. The usual normalization in this case is division by the standard deviation, and the corresponding moments are called normalized (or standardized) moments, at least on Wikipedia.
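Putting translation and scale together, a last sketch (same caveats as above): the normalized third moment, i.e. the skewness, is unchanged by any affine map $X \to aX + b$ with $a > 0$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=100_000)

# standardized moments divide out both location and scale, so any
# affine map a*X + b with a > 0 leaves them unchanged
print(f"skew(x)         = {stats.skew(x):.3f}")
print(f"skew(2.5*x + 7) = {stats.skew(2.5 * x + 7.0):.3f}")  # same value
print("kurtosis unchanged:",
      np.isclose(stats.kurtosis(x), stats.kurtosis(2.5 * x + 7.0)))
```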