Solved – Intuition for moments about the mean of a distribution

Tags: intuition, kurtosis, mathematical-statistics, moments, skewness

Can someone provide an intuition on why the higher moments of a probability distribution $p_X$, like the third and fourth moments, correspond to skewness and kurtosis respectively? Specifically, why does the deviation about the mean raised to the third or fourth power end up translating into a measure of skewness and kurtosis? Is there a way to relate this to the third or fourth derivatives of the function?

Consider this definition of skewness and kurtosis:

$$\begin{matrix}
\text{Skewness}(X) = \mathbb{E}[(X - \mu_{X})^3] / \sigma^3, \\[6pt]
\text{Kurtosis}(X) = \mathbb{E}[(X - \mu_{X})^4] / \sigma^4. \\[6pt]
\end{matrix}$$

In these equations we raise the normalised value $(X-\mu)/\sigma$ to a power and take its expected value. It is not clear to me why raising the normalised random variable to the power of four gives "peakedness" or why raising the normalised random variable to the power of three should give "skewness". This seems magical and mysterious!

Best Answer

There is a good reason for these definitions, which becomes clearer when you look at the general form for moments of standardised random variables. To answer this question, first consider the general form of the $k$th standardised central moment:$^\dagger$

$$\phi_k = \mathbb{E} \Bigg[ \Bigg( \frac{X - \mathbb{E}[X]}{\mathbb{S}[X]} \Bigg)^k \text{ } \Bigg].$$

The first two standardised central moments are the values $\phi_1=0$ and $\phi_2=1$, which hold for all distributions for which the above quantity is well-defined. Hence, we can consider the non-trivial standardised central moments that occur for values $k \geqslant 3$. To facilitate our analysis we define:

$$\begin{equation} \begin{aligned} \phi_k^+ &= \mathbb{E} \Bigg[ \Bigg| \frac{X - \mathbb{E}[X]}{\mathbb{S}[X]} \Bigg|^k \text{ } \Bigg| X > \mathbb{E}[X] \Bigg] \cdot \mathbb{P}(X > \mathbb{E}[X]), \\[8pt] \phi_k^- &= \mathbb{E} \Bigg[ \Bigg| \frac{X - \mathbb{E}[X]}{\mathbb{S}[X]} \Bigg|^k \text{ } \Bigg| X < \mathbb{E}[X] \Bigg] \cdot \mathbb{P}(X < \mathbb{E}[X]). \end{aligned} \end{equation}$$

These are non-negative quantities that give the $k$th absolute power of the standardised random variable conditional on it being above or below its expected value. We will now decompose the standardised central moment into these parts.
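One way to make the decomposition explicit is to write $Z = (X - \mathbb{E}[X])/\mathbb{S}[X]$ as shorthand (a symbol introduced here for convenience; it is not part of the notation above). Since $Z^k = |Z|^k$ when $Z > 0$ and $Z^k = (-1)^k |Z|^k$ when $Z < 0$, splitting the expectation over the two sides of the mean gives:

$$\phi_k = \mathbb{E}[Z^k \cdot \mathbb{I}(Z > 0)] + \mathbb{E}[Z^k \cdot \mathbb{I}(Z < 0)] = \phi_k^+ + (-1)^k \phi_k^-.$$

The event $Z = 0$ contributes nothing to the expectation, so the sign attached to $\phi_k^-$ depends only on whether $k$ is odd or even.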

Odd values of $k$ measure the skew in the tails: For any odd value of $k \geqslant 3$ we have an odd power in the moment equation and so we can write the standardised central moment as $\phi_k = \phi_k^+ - \phi_k^-$. From this form we see that the standardised central moment gives us the difference between the probability-weighted expected $k$th absolute powers of the standardised random variable above and below its mean.

Thus, for any odd power $k \geqslant 3$ we will get a measure that gives positive values if the expected absolute power of the standardised random variable is higher for values above the mean than for values below the mean, and gives negative values if the expected absolute power is lower for values above the mean than for values below the mean. Any of these quantities could reasonably be regarded as a measure of a type of "skewness", with higher powers giving greater relative weight to values that are far from the mean.
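As a rough numerical illustration of the odd-power case, the sketch below computes $\phi_k$, $\phi_k^+$ and $\phi_k^-$ exactly for a small discrete distribution. The distribution (`values`, `probs`) and the helper name `standardised_parts` are made up for this example; only the formulas come from the definitions above.

```python
from math import sqrt

def standardised_parts(values, probs, k):
    """Return (phi_k, phi_k_plus, phi_k_minus) for a discrete distribution."""
    mean = sum(p * x for x, p in zip(values, probs))
    sd = sqrt(sum(p * (x - mean) ** 2 for x, p in zip(values, probs)))
    z = [(x - mean) / sd for x in values]  # standardised support points
    phi = sum(p * zi ** k for zi, p in zip(z, probs))
    # E[|Z|^k | Z > 0] * P(Z > 0) is just the sum over points above the mean
    plus = sum(p * abs(zi) ** k for zi, p in zip(z, probs) if zi > 0)
    minus = sum(p * abs(zi) ** k for zi, p in zip(z, probs) if zi < 0)
    return phi, plus, minus

# Hypothetical right-skewed distribution: most mass near 0, a little far out.
values = [0.0, 1.0, 2.0, 10.0]
probs = [0.4, 0.3, 0.2, 0.1]
mirrored = [-x for x in values]  # same shape, reflected about zero

for k in (3, 5):
    phi, plus, minus = standardised_parts(values, probs, k)
    phi_m, _, _ = standardised_parts(mirrored, probs, k)
    print(f"k={k}: phi={phi:.4f}  phi+={plus:.4f}  phi-={minus:.4f}  "
          f"mirrored phi={phi_m:.4f}")
```

For this right-skewed example the odd moments come out positive, and reflecting the distribution flips their sign, which is the behaviour one would want from a skewness measure.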

Since this phenomenon occurs for every odd power $k \geqslant 3$, the natural choice for an archetypal measure of "skewness" is to define $\phi_3$ as the skewness. This is a lower standardised central moment than the higher odd powers, and it is natural to explore lower-order moments before consideration of higher-order moments. In statistics we have adopted the convention of referring to this standardised central moment as the skewness, since it is the lowest standardised central moment that measures this aspect of the distribution. (The higher odd powers $k=5,7,9,\ldots$ also measure types of skewness, but with greater and greater emphasis on values far from the mean; these are sometimes called measures of "hyperskewness".)

Even values of $k$ measure fatness of tails: For any even value of $k \geqslant 4$ we have an even power in the moment equation and so we can write the standardised central moment as $\phi_k = \phi_k^+ + \phi_k^-$. From this form we see that the standardised central moment gives us the sum of the probability-weighted expected $k$th absolute powers of the standardised random variable above and below its mean.

Thus, for any even power $k \geqslant 4$ we will get a measure that gives non-negative values, with higher values occurring if the tails of the distribution of the standardised random variable are fatter. Note that this is a result with respect to the standardised random variable, and so a change in scale (changing the variance) has no effect on this measure. Rather, it is effectively a measure of the fatness of the tails, after standardising for the variance of the distribution. Any of these quantities could reasonably be regarded as a measure of a type of "kurtosis", with higher powers giving greater relative weight to values that are far from the mean.
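The even-power case and the scale invariance can be checked the same way. This is a minimal sketch using two made-up symmetric distributions (`light` and `heavy` are illustrative names, not from the original answer): one with all its mass exactly one standard deviation from the mean, and one with most mass at the centre and a little far out.

```python
from math import sqrt

def phi_parts(values, probs, k):
    """Return (phi_k, phi_k_plus, phi_k_minus) for a discrete distribution."""
    mean = sum(p * x for x, p in zip(values, probs))
    sd = sqrt(sum(p * (x - mean) ** 2 for x, p in zip(values, probs)))
    z = [(x - mean) / sd for x in values]  # standardised support points
    phi = sum(p * zi ** k for zi, p in zip(z, probs))
    plus = sum(p * abs(zi) ** k for zi, p in zip(z, probs) if zi > 0)
    minus = sum(p * abs(zi) ** k for zi, p in zip(z, probs) if zi < 0)
    return phi, plus, minus

# Hypothetical symmetric examples with different tail weight.
light = ([-1.0, 1.0], [0.5, 0.5])               # no tails at all
heavy = ([-3.0, 0.0, 3.0], [0.05, 0.90, 0.05])  # rare, distant values

phi4_light, _, _ = phi_parts(*light, 4)   # approximately 1
phi4_heavy, plus, minus = phi_parts(*heavy, 4)   # approximately 10

# Rescaling the support changes the variance but not the standardised moment.
rescaled = ([7.0 * x for x in heavy[0]], heavy[1])
phi4_rescaled, _, _ = phi_parts(*rescaled, 4)

print(phi4_light, phi4_heavy, plus + minus, phi4_rescaled)
```

The two-point distribution attains $\phi_4 = 1$, while moving a small amount of probability far from the mean pushes $\phi_4$ up sharply even though the standardisation has already removed the effect of the variance.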

Since this phenomenon occurs for every even power $k \geqslant 4$, the natural choice for an archetypal measure of kurtosis is to define $\phi_4$ as the kurtosis. This is a lower standardised central moment than the higher even powers, and it is natural to explore lower-order moments before consideration of higher-order moments. In statistics we have adopted the convention of referring to this standardised central moment as the "kurtosis", since it is the lowest standardised central moment that measures this aspect of the distribution. (The higher even powers also measure types of kurtosis, but with greater and greater emphasis on values far from the mean; these are sometimes called measures of "hyperkurtosis".)

$^\dagger$ This equation is well defined for any distribution whose first two moments exist, and which has non-zero variance (for a given $k$, the value is finite only if the $k$th absolute moment is also finite). We will assume that the distribution of interest falls in this class for the rest of the analysis.

Related Solutions

Solved – Exponential weighted moving skewness/kurtosis

The formulas are straightforward but they are not as simple as intimated in the question.

Let $Y$ be the previous EWMA and let $X = x_n$, which is presumed independent of $Y$. By definition, the new weighted average is $Z = \alpha X + (1 - \alpha)Y$ for a constant value $\alpha$. For notational convenience, set $\beta = 1-\alpha$. Let $F$ denote the CDF of a random variable and $\phi$ denote its moment generating function, so that

$$\phi_X(t) = \mathbb{E}_F[\exp(t X)] = \int_\mathbb{R}{\exp(t x) dF_X(x)}.$$

With Kendall and Stuart, let $\mu_k^{'}(Z)$ denote the non-central moment of order $k$ for the random variable $Z$; that is, $\mu_k^{'}(Z) = \mathbb{E}[Z^k]$. The skewness and kurtosis are expressible in terms of the $\mu_k^{'}$ for $k = 1,2,3,4$; for example, the skewness is defined as $\mu_3 / \mu_2^{3/2}$ where

$$\mu_3 = \mu_3^{'} - 3 \mu_2^{'}\mu_1^{'} + 2{\mu_1^{'}}^3 \text{ and }\mu_2 = \mu_2^{'} - {\mu_1^{'}}^2$$

are the third and second central moments, respectively.
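The kurtosis follows the same pattern. Assuming the usual definition of kurtosis as $\mu_4 / \mu_2^2$, the standard binomial expansion of the fourth central moment in the same notation is

$$\mu_4 = \mu_4^{'} - 4 \mu_3^{'}\mu_1^{'} + 6 \mu_2^{'}{\mu_1^{'}}^2 - 3{\mu_1^{'}}^4.$$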

By standard elementary results,

$$\eqalign{ &1 + \mu_1^{'}(Z) t + \frac{1}{2!} \mu_2^{'}(Z) t^2 + \frac{1}{3!} \mu_3^{'}(Z) t^3 + \frac{1}{4!} \mu_4^{'}(Z) t^4 +O(t^5) \cr &= \phi_Z(t) \cr &= \phi_{\alpha X}(t) \phi_{\beta Y}(t) \cr &= \phi_X(\alpha t) \phi_Y(\beta t) \cr &= (1 + \mu_1^{'}(X) \alpha t + \frac{1}{2!} \mu_2^{'}(X) \alpha^2 t^2 + \cdots) (1 + \mu_1^{'}(Y) \beta t + \frac{1}{2!} \mu_2^{'}(Y) \beta^2 t^2 + \cdots). } $$

To obtain the desired non-central moments, multiply the latter power series through fourth order in $t$ and equate the result term-by-term with the terms in $\phi_Z(t)$.
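A minimal sketch of that term-by-term multiplication, assuming the raw moments of $X$ and $Y$ through order four are already available (the function names and the list layout, with index $k$ holding $\mu_k^{'}$ and index 0 holding 1, are choices made for this example). Equating coefficients amounts to $\mu_k^{'}(Z) = \sum_{j=0}^{k} \binom{k}{j} \alpha^j \beta^{k-j} \mu_j^{'}(X)\, \mu_{k-j}^{'}(Y)$, which the code obtains as a Cauchy product of the two truncated series.

```python
from math import factorial

def combine_raw_moments(mx, my, alpha, order=4):
    """Raw moments of Z = alpha*X + (1 - alpha)*Y for independent X and Y.

    mx[k] and my[k] hold E[X^k] and E[Y^k] for k = 0..order (index 0 is 1).
    """
    beta = 1.0 - alpha
    # Taylor coefficients of the MGFs of alpha*X and beta*Y, truncated at `order`.
    cx = [mx[k] * alpha ** k / factorial(k) for k in range(order + 1)]
    cy = [my[k] * beta ** k / factorial(k) for k in range(order + 1)]
    mz = []
    for k in range(order + 1):
        coeff = sum(cx[j] * cy[k - j] for j in range(k + 1))  # t^k coefficient
        mz.append(coeff * factorial(k))                       # undo the 1/k!
    return mz

def skewness_kurtosis(m):
    """Skewness and (non-excess) kurtosis from raw moments m[0..4]."""
    m1, m2, m3, m4 = m[1], m[2], m[3], m[4]
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu3 / mu2 ** 1.5, mu4 / mu2 ** 2

# Illustration: both inputs standard normal, so Z is normal as well.
normal_raw = [1.0, 0.0, 1.0, 0.0, 3.0]
mz = combine_raw_moments(normal_raw, normal_raw, alpha=0.3)
print(skewness_kurtosis(mz))
```

With two standard normal inputs the combination is again normal, so the skewness should come out near 0 and the kurtosis near 3, which gives a simple sanity check on the bookkeeping.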

Solved – (Inter)quantile-based Kurtosis measure

The denominator is essentially just some kind of "middle" part of the distribution for the tail to be large or not so large relative to; you need something to scale the $p$-distance by. For my illustrations here I chose the middle half (interquartile range) as a reasonable default; it's the most obvious one to try.

So let's take, say, $p=0.01$ and $\eta=0.25$ (I don't claim that's an ideal choice, but $\eta=0.25$ seems a reasonable default).

Then $k = R_{0.25,0.01}= \frac{q_{0.99}-q_{0.01}}{q_{0.75}-q_{0.25}}$.

This will then be larger when the tail is heavy (since a small $p$ with a heavy tail makes the numerator large) and also larger when the distribution is peaked (since the larger $\eta$ with a peaked distribution makes the denominator small).

[Figure: illustration for the normal distribution]

For that particular pair $\eta,p$ we have:

 distribution     k
   uniform       1.96
   normal        3.45
   Cauchy        31.8
   exponential   4.18

You can of course choose different values for $\eta$ and $p$.
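For anyone who wants to try other choices, here is a small sketch that evaluates the ratio from closed-form quantile functions. The function name `quantile_ratio` and the dictionary of quantile functions are made up for this example; the normal quantile uses `statistics.NormalDist().inv_cdf` from the Python standard library, and the others are the standard inverse CDFs of the uniform(0,1), standard Cauchy and unit exponential distributions.

```python
from math import log, pi, tan
from statistics import NormalDist

def quantile_ratio(quantile, p=0.01, eta=0.25):
    """(q_{1-p} - q_p) / (q_{1-eta} - q_eta) for a given quantile function."""
    tail_range = quantile(1 - p) - quantile(p)
    middle_range = quantile(1 - eta) - quantile(eta)
    return tail_range / middle_range

quantile_functions = {
    "uniform": lambda u: u,                     # uniform on (0, 1)
    "normal": NormalDist().inv_cdf,             # standard normal
    "Cauchy": lambda u: tan(pi * (u - 0.5)),    # standard Cauchy
    "exponential": lambda u: -log(1 - u),       # rate-1 exponential
}

for name, q in quantile_functions.items():
    print(f"{name:12s} {quantile_ratio(q):.3g}")
```

With the defaults $p=0.01$ and $\eta=0.25$ this should reproduce the table above to three significant figures; passing other `p` and `eta` values shows how sensitive the ranking of distributions is to that choice.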
