Solved – Covariance of Categorical variables

categorical datacovariance

I know that the following is true for one categorical variable $X \in \{1,…,k\}$

$cov(X=i, X=j) = -p_i p_j$

However, I don't have intuition behind this and cannot find a proof for this.
Is there a proof behind this calculation?

To illustrate exactly why I'm interested in this, I'd like to understand the calculation behind the covariance of multinomial random variables. Given two multinomial random variables $Y_a$ and $Y_b$ from the same multinomial distribution with $k$ categories, I know that the covariance can be calculated as follows for $n$ trials

$cov(Y_a, Y_b) = \sum\limits_{s=1}^{n} \sum\limits_{t=1}^{n} cov(1[X=a], 1[X=b]) = -n p_a p_b$

And this depends on the covariance of categorical variables.

Best Answer

Use your crayons!

That's all you need to know. The rest of this answer elaborates on it, for those who have not read the link, and then it supplies a formal demonstration of the claim in that link: coloring rectangles in a scatterplot really does give the correct covariance in all cases.


Figure

The figure shows two indicator variables $X$ and $Y$: $(X,Y)=(1,0)$ with probability $p_1$, $(X,Y)=(0,1)$ with probability $p_2$, and otherwise $(X,Y)=(0,0)$. The probabilities are indicated by sets of points, where a proportion $p_1$ of them all are located close to $(1,0)$ (but spread about so you can see each of them), $p_2$ are located close to $(0,1)$, and the remaining fraction $1-p_1-p_2$ around $(0,0)$. All possible rectangles that use some two of these points have been drawn. As explained in the linked post, rectangles are positive (and drawn in red) when the points are at the upper right and lower left and otherwise they are negative (and drawn in cyan).

It is always the case that many of the rectangles cannot be seen because their width, their height, or both are zero. In the present situation, many of the rest are extremely slender because of the slight spreads of the points: they really should be invisible, too. The ones that can be seen all use one point near $(1,0)$ and one point near $(0,1)$. That makes them all negative, explaining the overall cyan cast to the picture.


Solution

A fraction $p_1$ of all rectangles have a corner at $(1,0)$. Independently of that, the proportion of those with another corner at $(0,1)$ is $p_2$. When the locations are not spread out, all such rectangles have unit width $1-0$ and unit height $1-0$ and they are negative. Therefore the covariance is

$$\text{Cov}(X,Y) = p_1 p_2 (1-0)(1-0)(-1) = -p_1p_2,$$

QED.


Mathematical Proof

The question asks for a proof.

To get started, let's establish the notation. Suppose $X$ is a discrete random variable that takes on the values $x_1,x_2,\ldots,x_k$ with probabilities $p_1,p_2,\ldots,p_k$, respectively. Let $Y_i$ be the indicator of $x_i$; that is, $Y_i = 1$ when $X = x_i$ and otherwise $Y_i=0$. Let $i\ne j$. The chance that $(Y_i,Y_j)=(1,0)$, which corresponds to $X=x_i$, is $p_i$; and the chance that $(Y_i,Y_j)=(0,1)$, which corresponds to $X=x_j$, is $p_j$. Since it is impossible for $(Y_i,Y_j)=(1,1)$, the chance that $(Y_i,Y_j)=(0,0)$ must be $1-p_i-p_j$, corresponding to $X\ne x_i$ and $X\ne x_j$. (The vector-valued random variable $(Y_1,Y_2,\ldots,Y_k)$ has a Multinomial$(1;p_1,p_2,\ldots,p_k)$ distribution.)

The question asks for the covariances of $Y_i$ and $Y_j$ for any indexes $i$ and $j$ in $1,2,\ldots, k$.

The proof uses two ideas. Their demonstrations are simple and easy.

  1. Let $(X,Y)$ be any bivariate random variable. Suppose $(X^\prime, Y^\prime)$ is another random variable with the same distribution but is independent of $(X,Y)$. Then

    $$\text{Cov}(X,Y) = \frac{1}{2}\mathbb{E}((X-X^\prime)(Y-Y^\prime)).$$

    To see why this is so, note that the right hand side remains unchanged when $X$ and $X^\prime$ are shifted by the same amount and also when $Y$ and $Y^\prime$ are shifted by some common amount. We may therefore apply suitable shifts to make the expectations all zero. In this situation

    $$\eqalign{\text{Cov}(X,Y) &= \mathbb{E}(XY)\\ & = \frac{1}{2}\mathbb{E}(XY + X^\prime Y^\prime)\\ & = \frac{1}{2}\mathbb{E}(XY + X^\prime Y^\prime) + \frac{1}{2}\mathbb{E}(X)\mathbb{E}(Y^\prime) + \frac{1}{2}\mathbb{E}(X^\prime)\mathbb{E}(Y) \\ & = \frac{1}{2}\mathbb{E}(XY + X^\prime Y^\prime) + \frac{1}{2}\mathbb{E}(XY^\prime) + \frac{1}{2}\mathbb{E}(X^\prime Y) \\ &= \frac{1}{2}\mathbb{E}((X-X^\prime)(Y-Y^\prime)). }$$

    Those extra terms like $\mathbb{E}(X)\mathbb{E}(Y^\prime)$ could be freely added in the middle step because they are all zero. The equalities of the form $\mathbb{E}(X)\mathbb{E}(Y^\prime) = \mathbb{E}(X Y^\prime)$ in the following step result from the independence of $X$ and $Y^\prime$ and of $X^\prime$ and $Y$.

  2. Where did that factor of $1/2$ go in the crayon calculation? When $(X,Y)$ has a discrete distribution with values $(x_i,y_i)$ and associated probabilities $\pi_{i}$,

    $$\eqalign{ \frac{1}{2}\mathbb{E}((X-X^\prime)\mathbb{E}(Y-Y^\prime)) &=\sum_{i,j=1}^k (x_i-x_j^\prime)(y_i-y_j^\prime)\pi_{i}\pi_{j} \\ &= \frac{1}{2}\left(\sum_{i\gt j} + \sum_{i \lt j} + \sum_{i=j}\right)(x_i-x_j^\prime)(y_i-y_j^\prime)\pi_{i}\pi_{j} \\ &= \frac{1}{2}\left(2\sum_{i\gt j} (x_i-x_j^\prime)(y_i-y_j^\prime)\pi_{i}\pi_{j}\right) + \sum_{i=j} 0 \\ &= \sum_{i \gt j} (x_i-x_j^\prime)(y_i-y_j^\prime)\pi_{i}\pi_{j}. }$$

    In words: the expectation averages over ordered pairs of indices, causing each non-empty rectangle to be counted twice. That's why the factor of $1/2$ is needed in the formula but does not need to be used in the crayon calculation, which counts each distinct rectangle just once.

Applying these two ideas to the bivariate $(Y_i,Y_j)$ in the question, which takes on only four possible values $(0,0),(1,0),(0,1),(1,1)$ with probabilities $1-p_ip_j, p_i, p_j$, and $0$, gives a sum that has only one nonzero term arising from $(1,0)$ and $(0,1)$ equal to

$$\text{Cov}(Y_i,Y_j) = p_i p_j (1-0)(0-1) = -p_i p_j,$$

QED.

Related Question