Solved – Variance of the sum of random vectors

covariance, non-independent, variance

Let $X_1, X_2, \dots , X_n \sim G$ where $G$ is some distribution and the samples are not independent. If $X_i \in \mathbb{R}$, then I know that

$$\text{Var}\left(\sum_{i=1}^{n} X_i \right) = n\text{Var}X_i + 2 \sum_{i<j}\text{Cov}(X_i, X_j) $$
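For instance, with $n = 2$ and identically distributed $X_1, X_2$ this is just the familiar expansion
$$\text{Var}(X_1 + X_2) = \text{Var}(X_1) + \text{Var}(X_2) + 2\,\text{Cov}(X_1, X_2) = 2\,\text{Var}(X_1) + 2\,\text{Cov}(X_1, X_2)\,.$$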

Does this still hold true when $X_i \in \mathbb{R}^d$ ($G$ being defined on $\mathbb{R}^d$ as well)? Here each $X_i = (X_i^{(1)}, X_i^{(2)}, \dots, X_i^{(d)})^T$. Let

$$\sum_{i=1}^{n}X_i = \left(\sum_{i=1}^{n}X_i^{(1)}, \sum_{i=1}^{n}X_i^{(2)}, \dots, \sum_{i=1}^{n}X_i^{(d)} \right)^T\,. $$

I am interested in the formula for the variance-covariance matrix of $\sum_{i=1}^{n} X_i$.

It seems that the same formula as before cannot simply be extended to this case, since if
$$\text{Var}\left(\sum_{i=1}^{n} X_i \right) = n\text{Var}X_i + 2 \sum_{i<j}\text{Cov}(X_i, X_j) \quad \quad (1)$$

where both terms are now $d \times d$ matrices, then the $(1,2)$ entry of the matrix in the second term is
$$\text{Cov}_{1,2}(X_i, X_j) = \text{Cov}(X^{(1)}_i, X^{(2)}_j) $$

This may not be equal to
$$\text{Cov}_{2,1}(X_i, X_j) = \text{Cov}(X^{(2)}_i, X^{(1)}_j),$$

and so the overall covariance matrix may not be symmetric.

Question: Does Equation (1) hold for random vectors?

Best Answer

You are asking for $\text{Var}(\sum_i X_i)$ when $\sum_i X_i$ is a vector with multiple components, so what you are really after is its covariance matrix (the generalization of variance to a vector).

You can solve this in a similar way to the univariate case.

Suppose we have two variables, $x, y \in \mathbb{R}^1$, each with $n$ observations. We want to calculate $$ \text{Cov}\left(\sum_i x_i, \sum_j y_j\right). $$ Once you have that, you can generalize to $x \in \mathbb{R}^d$ by building the covariance matrix entry by entry, applying the same formula to each pair of dimensions in the roles of $x$ and $y$. So,

$$ \begin{split} \text{Cov}\left(\sum_i x_i, \sum_j y_j\right) &= \left\langle \left( \sum_i x_i - \left\langle \sum_i x_i \right\rangle \right) \left( \sum_j y_j - \left\langle \sum_j y_j \right\rangle \right) \right\rangle \\ &= \sum_{i,j} \left\langle x_i y_j + \langle x_i \rangle \langle y_j \rangle - x_i \langle y_j \rangle - \langle x_i \rangle y_j \right\rangle \\ &= \sum_{i,j} \left( \langle x_i y_j \rangle - \langle x_i \rangle \langle y_j \rangle\right) \\ &= \sum_{i,j} \text{Cov}(x_i,y_j), \end{split} $$

where the sum runs over all $n^2$ pairings of $i,j$. In the second line, I used the fact that $\langle \sum_i x_i \rangle = \sum_i \langle x_i \rangle$, which is true even if the terms are not independent. To see this, express the expectation as an integral over the $n$-dimensional space of observations, with a joint distribution $p(x)$. Lack of independence means you cannot factor $p(x)$ into a product of individual distributions, but you can still pull the sum out of the integral, and the expectation of each term is simply taken with respect to the whole joint distribution.
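Spelled out, writing the expectation as an integral against the joint density $p(x_1, \dots, x_n)$ (just making the step above explicit):
$$\left\langle \sum_i x_i \right\rangle = \int \left( \sum_i x_i \right) p(x_1, \dots, x_n)\, dx_1 \cdots dx_n = \sum_i \int x_i \, p(x_1, \dots, x_n)\, dx_1 \cdots dx_n = \sum_i \langle x_i \rangle\,.$$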

This is certainly symmetric, in the sense that swapping the roles of $x$ and $y$ (and relabeling the summation indices) gives the same value. It also reduces to the univariate case ($y = x$): you can split the sum into the diagonal elements (the $\text{Var}$ terms) plus twice the upper-triangle elements (the $\text{Cov}$ terms for $i < j$, which equal the corresponding lower-triangle terms). In the general case, however, you cannot simply take the upper-triangle terms and double them, because $\text{Cov}(x_i, y_j) \ne \text{Cov}(x_j, y_i)$ in general, so you have to sum over all pairings.
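Concretely, setting $y = x$ and splitting the double sum recovers the univariate formula from the question:
$$\sum_{i,j} \text{Cov}(x_i, x_j) = \sum_i \text{Var}(x_i) + 2 \sum_{i<j} \text{Cov}(x_i, x_j)\,.$$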

Thus, in general, the $(k,m)$ entry of the covariance matrix of $\sum_i X_i$ is $$ \text{Cov}\left(\sum_i X_i^{(k)}, \sum_j X_j^{(m)} \right) = \sum_{i,j} \text{Cov} \left( X_i^{(k)}, X_j^{(m)} \right), $$ or, in matrix form, $\text{Var}\left(\sum_i X_i\right) = \sum_{i,j} \text{Cov}(X_i, X_j)$. The off-diagonal terms cannot be folded into $2\sum_{i<j}\text{Cov}(X_i, X_j)$ as in Equation (1), because $\text{Cov}(X_j, X_i) = \text{Cov}(X_i, X_j)^T$ rather than $\text{Cov}(X_i, X_j)$.
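If it helps to see the identity numerically, here is a minimal NumPy sketch (not part of the original answer; the sample size, dimension, and the shared-component dependence structure are arbitrary choices for illustration). It checks that the empirical covariance matrix of $\sum_i X_i$ matches the sum of all $n^2$ pairwise cross-covariance matrices $\text{Cov}(X_i, X_j)$:

```python
# Minimal Monte Carlo check (illustrative assumptions: n=3 dependent Gaussian
# vectors in R^2, correlated through a shared component; 200k replications).
import numpy as np

rng = np.random.default_rng(0)
reps, n, d = 200_000, 3, 2

# X_1, ..., X_n are identically distributed but dependent: each is a shared
# standard-normal vector plus independent noise, so Cov(X_i, X_j) = I for i != j.
common = rng.standard_normal((reps, 1, d))
noise = rng.standard_normal((reps, n, d))
X = common + noise                          # shape (reps, n, d)

# Empirical covariance matrix of the sum over i.
S = X.sum(axis=1)                           # shape (reps, d)
cov_of_sum = np.cov(S, rowvar=False)

# Sum of all n^2 pairwise cross-covariance matrices Cov(X_i, X_j).
total = np.zeros((d, d))
for i in range(n):
    for j in range(n):
        # np.cov on the stacked pair gives a 2d x 2d block matrix; the
        # upper-right d x d block is the cross-covariance Cov(X_i, X_j).
        block = np.cov(np.hstack([X[:, i, :], X[:, j, :]]), rowvar=False)
        total += block[:d, d:]

print(np.round(cov_of_sum, 2))   # should match...
print(np.round(total, 2))        # ...this, up to Monte Carlo error
```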
