Solved – Intuition for moments about the mean of a distribution

intuition, kurtosis, mathematical-statistics, moments, skewness

Can someone provide an intuition on why the higher moments of a probability distribution $p_X$, like the third and fourth moments, correspond to skewness and kurtosis respectively? Specifically, why does the deviation about the mean raised to the third or fourth power end up translating into a measure of skewness and kurtosis? Is there a way to relate this to the third or fourth derivatives of the function?

Consider this definition of skewness and kurtosis:

$$\begin{matrix}
\text{Skewness}(X) = \mathbb{E}[(X - \mu_{X})^3] / \sigma^3, \\[6pt]
\text{Kurtosis}(X) = \mathbb{E}[(X - \mu_{X})^4] / \sigma^4. \\[6pt]
\end{matrix}$$

In these equations we raise the normalised value $(X-\mu)/\sigma$ to a power and take its expected value. It is not clear to me why raising the normalised random variable to the power of four gives "peakedness" or why raising the normalised random variable to the power of three should give "skewness". This seems magical and mysterious!

Best Answer

There is a good reason for these definitions, which becomes clearer when you look at the general form for moments of standardised random variables. To answer this question, first consider the general form of the $k$th standardised central moment:$^\dagger$

$$\phi_k = \mathbb{E} \Bigg[ \Bigg( \frac{X - \mathbb{E}[X]}{\mathbb{S}[X]} \Bigg)^k \text{ } \Bigg].$$

The first two standardised central moments are the values $\phi_1=0$ and $\phi_2=1$, which hold for all distributions for which the above quantity is well-defined. Hence, we can consider the non-trivial standardised central moments that occur for values $k \geqslant 3$. To facilitate our analysis we define:
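The claim that $\phi_1=0$ and $\phi_2=1$ for every admissible distribution can be checked numerically. The following is a minimal sketch, not part of the original answer: the helper name `standardised_moment` and the choice of an exponential sample are illustrative assumptions, and the standard deviation is taken with divisor $n$ so that the sample identities hold exactly up to floating-point error.

```python
import math
import random

def standardised_moment(xs, k):
    """Sample analogue of phi_k (hypothetical helper), using the divisor-n standard deviation."""
    n = len(xs)
    mean = math.fsum(xs) / n
    sd = math.sqrt(math.fsum((x - mean) ** 2 for x in xs) / n)
    return math.fsum(((x - mean) / sd) ** k for x in xs) / n

random.seed(0)
# Arbitrary skewed sample; any sample with non-zero variance would do.
sample = [random.expovariate(0.5) for _ in range(50_000)]

phi_1 = standardised_moment(sample, 1)
phi_2 = standardised_moment(sample, 2)
print(phi_1, phi_2)  # approximately 0 and 1, up to floating-point error
```

Because these two values are fixed by the standardisation itself, they carry no information about the shape of the distribution, which is why the interesting moments start at $k=3$.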

$$\begin{equation} \begin{aligned} \phi_k^+ &= \mathbb{E} \Bigg[ \Bigg| \frac{X - \mathbb{E}[X]}{\mathbb{S}[X]} \Bigg|^k \text{ } \Bigg| X > \mathbb{E}[X] \Bigg] \cdot \mathbb{P}(X > \mathbb{E}[X]), \\[8pt] \phi_k^- &= \mathbb{E} \Bigg[ \Bigg| \frac{X - \mathbb{E}[X]}{\mathbb{S}[X]} \Bigg|^k \text{ } \Bigg| X < \mathbb{E}[X] \Bigg] \cdot \mathbb{P}(X < \mathbb{E}[X]). \end{aligned} \end{equation}$$

These are non-negative quantities that give the expected $k$th absolute power of the standardised random variable on the event that it is above or below its expected value, weighted by the probability of that event. We will now decompose the standardised central moment into these parts.
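The decomposition can be written out in one step. As a sketch, write $Z = (X - \mathbb{E}[X])/\mathbb{S}[X]$ and let $\mathbb{I}(\cdot)$ denote an indicator function (both are shorthand introduced here, not notation from the original answer). Since $\mathbb{S}[X] > 0$, the event $Z > 0$ is the same as $X > \mathbb{E}[X]$, and the event $Z = 0$ contributes nothing to the moment, so:

$$\begin{aligned} \phi_k &= \mathbb{E}[Z^k \cdot \mathbb{I}(Z > 0)] + \mathbb{E}[Z^k \cdot \mathbb{I}(Z < 0)] \\[6pt] &= \mathbb{E}[|Z|^k \cdot \mathbb{I}(Z > 0)] + (-1)^k \, \mathbb{E}[|Z|^k \cdot \mathbb{I}(Z < 0)] \\[6pt] &= \phi_k^+ + (-1)^k \phi_k^-. \end{aligned}$$

The sign attached to $\phi_k^-$ is therefore negative when $k$ is odd and positive when $k$ is even.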


Odd values of $k$ measure the skew in the tails: For any odd value of $k \geqslant 3$ we have an odd power in the moment equation and so we can write the standardised central moment as $\phi_k = \phi_k^+ - \phi_k^-$. From this form we see that the standardised central moment gives us the difference between the $k$th absolute power of the standardised random variable, conditional on it being above or below its mean respectively.

Thus, for any odd power $k \geqslant 3$ we will get a measure that gives positive values if the expected absolute power of the standardised random variable is higher for values above the mean than for values below the mean, and gives negative values if the expected absolute power is lower for values above the mean than for values below the mean. Any of these quantities could reasonably be regarded as a measure of a type of "skewness", with higher powers giving greater relative weight to values that are far from the mean.

Since this phenomenon occurs for every odd power $k \geqslant 3$, the natural choice for an archetypal measure of "skewness" is to define $\phi_3$ as the skewness. This is a lower standardised central moment than the higher odd powers, and it is natural to explore lower-order moments before consideration of higher-order moments. In statistics we have adopted the convention of referring to this standardised central moment as the skewness, since it is the lowest standardised central moment that measures this aspect of the distribution. (The higher odd powers $k=5,7,9,\ldots$ also measure types of skewness, but with greater and greater emphasis on values far from the mean; these are sometimes called measures of "hyperskewness".)
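To see the odd-power decomposition at work, the sketch below computes sample analogues of $\phi_3^+$, $\phi_3^-$ and $\phi_3$ for a right-skewed sample and for its mirror image. The helper name `split_moments` and the exponential sample are assumptions made for illustration only.

```python
import math
import random

def split_moments(xs, k):
    """Return sample analogues of (phi_k_plus, phi_k_minus, phi_k); hypothetical helper."""
    n = len(xs)
    mean = math.fsum(xs) / n
    sd = math.sqrt(math.fsum((x - mean) ** 2 for x in xs) / n)
    zs = [(x - mean) / sd for x in xs]
    # Dividing by n (not by the count on each side) builds in the probability weight.
    plus = math.fsum(abs(z) ** k for z in zs if z > 0) / n
    minus = math.fsum(abs(z) ** k for z in zs if z < 0) / n
    phi = math.fsum(z ** k for z in zs) / n
    return plus, minus, phi

random.seed(1)
right_skewed = [random.expovariate(1.0) for _ in range(50_000)]
left_skewed = [-x for x in right_skewed]  # mirror image about zero

plus_r, minus_r, phi_r = split_moments(right_skewed, 3)
plus_l, minus_l, phi_l = split_moments(left_skewed, 3)

print(phi_r, plus_r - minus_r)  # the two values agree, and are positive
print(phi_l, plus_l - minus_l)  # the two values agree, and are negative
```

Mirroring the sample swaps the roles of the upper and lower parts, so the skewness changes sign while its magnitude stays the same.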


Even values of $k$ measure fatness of tails: For any even value of $k \geqslant 4$ we have an even power in the moment equation and so we can write the standardised central moment as $\phi_k = \phi_k^+ + \phi_k^-$. From this form we see that the standardised central moment gives us the sum of the $k$th absolute power of the standardised random variable, conditional on it being above or below its mean respectively.

Thus, for any even power $k \geqslant 4$ we will get a measure that gives non-negative values, with higher values occurring if the tails of the distribution of the standardised random variable are fatter. Note that this is a result with respect to the standardised random variable, and so a change in scale (changing the variance) has no effect on this measure. Rather, it is effectively a measure of the fatness of the tails, after standardising for the variance of the distribution. Any of these quantities could reasonably be regarded as a measure of a type of "kurtosis", with higher powers giving greater relative weight to values that are far from the mean.

Since this phenomenon occurs for every even power $k \geqslant 4$, the natural choice for an archetypal measure of kurtosis is to define $\phi_4$ as the kurtosis. This is a lower standardised central moment than the higher even powers, and it is natural to explore lower-order moments before consideration of higher-order moments. In statistics we have adopted the convention of referring to this standardised central moment as the "kurtosis", since it is the lowest standardised central moment that measures this aspect of the distribution. (The higher even powers also measure types of kurtosis, but with greater and greater emphasis on values far from the mean; these are sometimes called measures of "hyperkurtosis".)
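The tail-fatness reading of $\phi_4$, and its invariance to scale, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the helper name `kurtosis` and the three comparison distributions (uniform, normal, Laplace) are choices made here, not taken from the original answer. Their theoretical kurtosis values are 1.8, 3 and 6 respectively, so the sample values should come out in that order.

```python
import math
import random

def kurtosis(xs):
    """Sample analogue of phi_4 (not excess kurtosis), with divisor n; hypothetical helper."""
    n = len(xs)
    mean = math.fsum(xs) / n
    var = math.fsum((x - mean) ** 2 for x in xs) / n
    return math.fsum((x - mean) ** 4 for x in xs) / (n * var ** 2)

random.seed(2)
n = 50_000
uniform = [random.uniform(-1.0, 1.0) for _ in range(n)]   # thin tails
normal = [random.gauss(0.0, 1.0) for _ in range(n)]       # reference case
# Laplace draw built as an exponential magnitude with a random sign (fat tails).
laplace = [random.choice((-1.0, 1.0)) * random.expovariate(1.0) for _ in range(n)]

k_uniform = kurtosis(uniform)
k_normal = kurtosis(normal)
k_laplace = kurtosis(laplace)

# Rescaling changes the variance but leaves the standardised moment unchanged.
k_rescaled = kurtosis([10.0 * x for x in laplace])

print(k_uniform, k_normal, k_laplace)  # increasing with tail fatness
print(k_laplace, k_rescaled)           # equal up to floating-point error
```

The rescaled sample has one hundred times the variance of the original, yet its kurtosis is unchanged, which matches the point that the measure describes the tails only after standardising for scale.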


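$^\dagger$ The standardised random variable in this equation is well defined for any distribution whose first two moments exist and which has non-zero variance; the moment $\phi_k$ itself additionally requires the $k$th moment to be finite. We will assume that the distribution of interest falls in this class for the rest of the analysis.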
