How can we derive the dot product from the covariance of two random vectors

covariance, inner-products, probability, proof-explanation

I'm trying to understand a geometric interpretation of covariance as explained in this lecture pdf, which states:

If $X$, $Y$ are two random variables of zero mean, then the covariance $Cov[XY] = E[X \cdot Y]$ is the dot product of $X$ and $Y$. The standard deviation of $X$ is the length of $X$. The correlation is the cosine of the angle between the two vectors.

I can understand how, given a zero mean, the standard deviation of $X$ is the length of $X$. What I don't understand is how the covariance of $X$ and $Y$ is equal to their dot product.

Specifically, the covariance is defined as:

$Cov(X,Y) = E[(X - \mu_X)(Y - \mu_Y)]$

Which we can write as:

$Cov(X,Y) = \sum\limits_{i}\sum\limits_{j} p_{ij} (x_i - \mu_X) (y_j - \mu_Y)$

And since for two centered random variables the means are zero, and the probability is uniform, we can rewrite the formula as:

$Cov(X,Y) = \dfrac{1}{N} \sum\limits_{i}\sum\limits_{j} x_i y_j$ (Eqn 1)

The dot product, in contrast, consists of a single summation:

$dot(X,Y) = \sum\limits_{i} x_i y_i$ (Eqn 2)

Note that the $\frac{1}{N}$ factors out of Eqn 1 when I divide by the standard deviations of $X$ and $Y$. So I can almost derive the dot product from the covariance, but the covariance still contains a double summation while the dot product has a single element-wise summation. What am I missing?
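To make the mismatch concrete, here is a quick numerical sketch (made-up centered data; aside from standard numpy calls, the names are mine) comparing Eqn 1 as written against the scaled dot product:

```python
import numpy as np

# Made-up centered data on a uniform probability space of size N.
rng = np.random.default_rng(0)
X = rng.normal(size=5); X -= X.mean()
Y = rng.normal(size=5); Y -= Y.mean()
N = len(X)

single = np.dot(X, Y) / N                      # (1/N) * Eqn 2
double = sum(x * y for x in X for y in Y) / N  # Eqn 1 exactly as written

print(np.isclose(single, np.cov(X, Y, bias=True)[0, 1]))  # True: matches Cov(X, Y)
print(double)  # ~0: the double sum collapses to (sum X)(sum Y) / N
```

The single summation reproduces the covariance; the double summation does not.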

UPDATE

I don't have the answer, but I realize that this formula:

$Cov(X,Y) = \sum\limits_{i}\sum\limits_{j} p_{ij} (x_i - \mu_X) (y_j - \mu_Y)$

… is not equal to Eqn 1 (where $N$ is the dimension of the vectors):

$Cov(X,Y) = \dfrac{1}{N} \sum\limits_{i}\sum\limits_{j} x_i y_j$ (Eqn 1)

If $p_{ij}$ equaled $\dfrac{1}{N}$, the probabilities would sum to more than one in the double summation; $p_{ij}$ only equals $\dfrac{1}{N}$ when it's a single summation, which maps easily to the dot product. I've posted a new question here to try to address this confusion in isolation, which I hope will in turn clarify this derivation problem.

Best Answer

The notes seem to be giving a result before stating the interpretation on which the result is based. If you only read the first page of the notes, the statements about vectors don't make sense.

On the second page we start to get an explanation of what is really meant by the use of vector notation on the first page. We are supposed to consider a probability space $\Omega$ with $n$ elements ($n$ a positive integer), where each element has the same probability measure as any other. The elements are listed in sequential order from the $1$st element to the $n$th.

We are then supposed to consider a vector in $\mathbb R^n$ as a function that takes the sequential number of an element of the probability space as input, and gives a real number as output. This matches the definition of a real random variable over that probability space, which is a function that maps each element of the probability space to a real number.

Now let's consider something like the expected value of the square of a random variable $X$, that is, $E[X^2]$. The formula $E[X^2] = \sum_i x_i^2 p(x_i)$, where $p$ is the pmf of the random variable $X$ and the values of $X$ are represented by $x_i$, is not the only way to compute this expected value. For a finite probability space we can also compute the expected value by taking the sum of a different sequence of products, $$ E[X^2] = \sum_{\omega \in \Omega} \mu(\omega) (X(\omega))^2, \tag1 $$ where $\omega$ ranges over the elements of the probability space $\Omega$, $\mu(\omega)$ is the probability measure of the element $\omega,$ and $X(\omega)$ is the value of the random variable $X$ for input $\omega$.

The equation $$ E[X^2] = \frac1n \sum_{i=1}^n X_i^2 \tag2 $$ is just a rewriting of Equation $(1)$ for a probability space with $n$ elements of equal probability $\frac1n$, where $X_1,\ldots, X_n$ are the values of the random variable $X$ for the $1$st through $n$th elements of the probability space. The right-hand side of $(2)$ is also $$ \frac1n (X \cdot X) = \frac1n \lVert X\rVert^2 $$ if $X$ is represented by a vector as suggested on the second page of the notes.
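For a concrete sanity check, here is a short Python sketch (the five data values are invented for illustration) computing $E[X^2]$ three ways: via the pmf, via Equation $(1)$, and via Equation $(2)$:

```python
import numpy as np
from collections import Counter

# A uniform probability space with n = 5 outcomes; the values of X are made up.
X = np.array([11.0, 11.0, 12.0, 12.0, 13.0])
n = len(X)

# Equation (1): sum over the outcomes, each with probability measure 1/n.
e1 = sum((1 / n) * x**2 for x in X)

# The pmf route: sum over the distinct values x_i weighted by p(x_i).
pmf = {x: c / n for x, c in Counter(X).items()}
e_pmf = sum(p * x**2 for x, p in pmf.items())

# Equation (2): (1/n) * ||X||^2, viewing X as a vector in R^n.
e2 = np.dot(X, X) / n

print(e1, e_pmf, e2)  # all three agree
```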

Similarly, for $E[XY],$ the expected value of the product of random variables $X$ and $Y$, we can write $$ E[XY] = \sum_{\omega \in \Omega} \mu(\omega) X(\omega) Y(\omega). $$

For a probability space of $n$ elements of probability $\frac1n$ each, using vectors to represent $X$ and $Y,$ this translates to $$ E[XY] = \frac1n \sum_{i=1}^n X_i Y_i. \tag3$$

Note: this is not a double summation over $i$ and $j$ of terms in $X_i Y_j$, and there's a good reason for that. The summation in this formula was really over the elements of the probability space. Each index value $i$ picks out one element. Once you have picked an element of the probability space, each random variable's value is determined. So when we pick the $i$th element, the value of $X$ is $X_i$, the value of $Y$ is $Y_i$, and the value of $XY$ is $X_i Y_i$.

Now note that if we use vectors for $X$ and $Y$, the right-hand side of $(3)$ can also be written $$ \frac1n (X \cdot Y). $$

So the notes are literally correct when talking about a finite probability space with a uniform probability measure, with vectors in $\mathbb R^n$ interpreted as random variables over that probability space, except for one detail: they seem to have omitted the factor $\frac1n$.
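Here is a quick numerical check of the corrected statements (again with made-up centered data; the variable names are mine): the covariance is $\frac1n$ times the dot product, and the $\frac1n$ factors cancel in the correlation, leaving exactly the cosine of the angle between the vectors.

```python
import numpy as np

# Made-up centered data on a uniform n-point probability space.
rng = np.random.default_rng(1)
X = rng.normal(size=6); X -= X.mean()
Y = rng.normal(size=6); Y -= Y.mean()
n = len(X)

cov = np.dot(X, Y) / n                 # E[XY] = (1/n) X . Y
sd_X = np.linalg.norm(X) / np.sqrt(n)  # std dev = length of X, scaled by 1/sqrt(n)
sd_Y = np.linalg.norm(Y) / np.sqrt(n)

corr = cov / (sd_X * sd_Y)
cosine = np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y))

print(np.isclose(cov, np.cov(X, Y, bias=True)[0, 1]))  # True
print(np.isclose(corr, cosine))  # True: the 1/n factors cancel
```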


A further note to explain why we sometimes use a single summation for covariance and sometimes a double summation. Suppose we have a probability space $\Omega = \{1,2,3,4,5\}$ with uniform probability measure (each element has probability measure $\frac15$). Suppose $X$ and $Y$ are random variables and that \begin{align} X(1) &= 11, & Y(1) &= 21, \\ X(2) &= 11, & Y(2) &= 22, \\ X(3) &= 12, & Y(3) &= 21, \\ X(4) &= 12, & Y(4) &= 23, \\ X(5) &= 13, & Y(5) &= 21. \\ \end{align} So we see that the possible values of $X$ are $x_1 = 11,$ $x_2 = 12,$ and $x_3 = 13,$ whereas the possible values of $Y$ are $y_1 = 21,$ $y_2 = 22,$ and $y_3 = 23.$ When we take pairs $(x_i,y_j)$ there are therefore $3\times 3 = 9$ possible pairs, and when we write $$ \sum_{i=1}^3 \sum_{j=1}^3 x_i y_j P(X=x_i, Y=y_j) $$ we have to add up $9$ terms in total from the double summation.

But actually, according to the definitions of $X$ and $Y$ above, there are only $5$ possible values of $(x_i,y_j)$:

$$ (11,21), (11,22), (12,21), (12,23), (13,21). $$

So the double summation is "wasteful" in that it includes several value pairs that have zero probability. The single summation $$ \frac15 \sum_{i=1}^5 X(i) Y(i), $$ on the other hand, counts each of the $5$ pairs of values from $X$ and $Y$ that can occur and does not bother to count any of the zero-probability pairs. So, among the outcomes that actually can occur, each sum counts exactly the same results. They just use different ways of identifying those results.
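Here is a short sketch (Python; the names are mine) that tallies both sums for this example and confirms they agree:

```python
import numpy as np
from collections import Counter

# The five equally likely outcomes defined above.
X = np.array([11., 11., 12., 12., 13.])
Y = np.array([21., 22., 21., 23., 21.])
n = len(X)

# Single summation over the probability space: (1/5) * sum_i X(i) Y(i).
single = np.dot(X, Y) / n

# Joint pmf P(X = x_i, Y = y_j), then the double summation over all
# 3 x 3 = 9 value pairs; the 4 zero-probability pairs contribute nothing.
joint = Counter(zip(X, Y))
double = sum(joint.get((x, y), 0) / n * x * y
             for x in sorted(set(X)) for y in sorted(set(Y)))

print(single, double)  # both equal E[XY]
```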
