I have two random matrices (matrix-valued random variables) $X$ and $Y$, both of dimension $n \times n$. Is there a notion of covariance between the two random matrices, i.e., $\text{Cov}(X,Y)$? If yes, how can I calculate it?
Solved – Covariance between two random matrices
covariance, covariance-matrix
Related Solutions
Background
A covariance matrix $\mathbb{A}$ for a vector of random variables $X=(X_1, X_2, \ldots, X_n)^\prime$ embodies a procedure to compute the variance of any linear combination of those random variables. The rule is that for any vector of coefficients $\lambda = (\lambda_1, \ldots, \lambda_n)$,
$$\operatorname{Var}(\lambda X) = \lambda \mathbb{A} \lambda ^\prime.\tag{1}$$
In other words, the rules of matrix multiplication describe the rules of variances.
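As a quick numerical illustration of $(1)$, here is a minimal NumPy sketch; the covariance matrix and coefficient vector are arbitrary illustrative choices, not taken from the question.

```python
# Minimal check of rule (1): the empirical variance of lam @ X should match
# lam @ A @ lam. A and lam here are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # a symmetric, positive-definite covariance matrix
lam = np.array([3.0, -1.0])       # coefficients of the linear combination

# Draw many samples with covariance A, then compare empirical and formula values.
samples = rng.multivariate_normal(mean=np.zeros(2), cov=A, size=200_000)
empirical = (samples @ lam).var()
formula = lam @ A @ lam

print(empirical, formula)         # the two values should agree to a few decimals
```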
Two properties of $\mathbb{A}$ are immediate and obvious:
Because variances are expectations of squared values, they can never be negative. Thus, for all vectors $\lambda$, $$0 \le \operatorname{Var}(\lambda X) = \lambda \mathbb{A} \lambda ^\prime.$$ Covariance matrices must be non-negative-definite.
Variances are just numbers--or, if you read the matrix formulas literally, they are $1\times 1$ matrices. Thus, they do not change when you transpose them. Transposing $(1)$ gives $$\lambda \mathbb{A} \lambda ^\prime = \operatorname{Var}(\lambda X) = \operatorname{Var}(\lambda X) ^\prime = \left(\lambda \mathbb{A} \lambda ^\prime\right)^\prime = \lambda \mathbb{A}^\prime \lambda ^\prime.$$ Since this holds for all $\lambda$, $\mathbb{A}$ must equal its transpose $\mathbb{A}^\prime$: covariance matrices must be symmetric.
The deeper result is that any non-negative-definite symmetric matrix $\mathbb{A}$ is a covariance matrix. This means there actually is some vector-valued random variable $X$ with $\mathbb{A}$ as its covariance. We may demonstrate this by explicitly constructing $X$. One way is to notice that the (multivariate) density function $f(x_1,\ldots, x_n)$ with the property $$\log(f) \propto -\frac{1}{2} (x_1,\ldots,x_n)\mathbb{A}^{-1}(x_1,\ldots,x_n)^\prime$$ has $\mathbb{A}$ for its covariance. (Some delicacy is needed when $\mathbb{A}$ is not invertible--but that's just a technical detail.)
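The density above is (up to its normalizing constant) a multivariate normal density with covariance $\mathbb{A}$, so sampling from it is one concrete way to exhibit such an $X$. A minimal sketch, assuming $\mathbb{A}$ is invertible; the example matrix is illustrative.

```python
# Sketch of the existence argument: sample from a multivariate normal with
# covariance A and confirm the empirical covariance. A is an arbitrary
# symmetric positive-definite example.
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])

X = rng.multivariate_normal(mean=np.zeros(3), cov=A, size=500_000)
print(np.cov(X, rowvar=False))    # empirical covariance, close to A
```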
Solutions
Let $\mathbb{X}$ and $\mathbb{Y}$ be covariance matrices. Obviously they are square; and if their sum is to make any sense they must have the same dimensions. We need only check the two properties.
The sum.
- Symmetry $$(\mathbb{X}+\mathbb{Y})^\prime = \mathbb{X}^\prime + \mathbb{Y}^\prime = (\mathbb{X} + \mathbb{Y})$$ shows the sum is symmetric.
- Non-negative definiteness. Let $\lambda$ be any vector. Then $$\lambda(\mathbb{X}+\mathbb{Y})\lambda^\prime = \lambda \mathbb{X}\lambda^\prime + \lambda \mathbb{Y}\lambda^\prime \ge 0 + 0 = 0$$ proves the point using basic properties of matrix multiplication.
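If you want a numerical companion to this argument, here is a small sketch that tests the two properties directly for the sum; the helper `is_covariance_matrix` and the example matrices are illustrative, not part of the original argument.

```python
# Numerical companion to the proof: test symmetry and non-negative-definiteness
# of X + Y. The helper and the example matrices are illustrative choices.
import numpy as np

def is_covariance_matrix(M, tol=1e-10):
    """Symmetric with all eigenvalues >= 0 (up to numerical tolerance)."""
    symmetric = np.allclose(M, M.T)
    nonnegative = np.all(np.linalg.eigvalsh((M + M.T) / 2) >= -tol)
    return symmetric and nonnegative

X = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
Y = np.array([[3.0, 0.0],
              [0.0, 1.0]])

print(is_covariance_matrix(X), is_covariance_matrix(Y))  # True True
print(is_covariance_matrix(X + Y))                       # True, as proved above
```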
I leave this as an exercise.
This one is tricky. One method I use to think through challenging matrix problems is to do some calculations with $2\times 2$ matrices. There are some common, familiar covariance matrices of this size, such as $$\pmatrix{a & b \\ b & a}$$ with $a^2 \ge b^2$ and $a \ge 0$. The concern is that $\mathbb{XY}$ might not be non-negative-definite: that is, could it produce a negative value when computing a variance? If so, then we had better have some negative coefficients in the matrices. That suggests considering $$\mathbb{X} = \pmatrix{a & -1 \\ -1 & a}$$ for $a \ge 1$. To get something interesting, we might gravitate initially to matrices $\mathbb{Y}$ with different-looking structures. Diagonal matrices come to mind, such as $$\mathbb{Y} = \pmatrix{b & 0 \\ 0 & 1}$$ with $b\ge 0$. (Notice how we may freely pick some of the coefficients, such as $-1$ and $1$, because we can rescale all the entries in any covariance matrix without changing its fundamental properties. This simplifies the search for interesting examples.)
I leave it to you to compute $\mathbb{XY}$ and test whether it always is a covariance matrix for any allowable values of $a$ and $b$.
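If you would like to check your hand computation numerically, here is a hedged sketch of such a test; the helper and the particular values of $a$ and $b$ are illustrative allowable choices.

```python
# Test harness for the exercise: compute X @ Y for a few allowable (a, b) pairs
# and check whether the product passes the covariance-matrix test. The helper
# and the chosen values are illustrative.
import numpy as np

def is_covariance_matrix(M, tol=1e-10):
    return np.allclose(M, M.T) and np.all(np.linalg.eigvalsh((M + M.T) / 2) >= -tol)

for a, b in [(1.0, 1.0), (2.0, 3.0), (5.0, 0.5)]:
    X = np.array([[a, -1.0],
                  [-1.0, a]])
    Y = np.array([[b, 0.0],
                  [0.0, 1.0]])
    print(f"a={a}, b={b}: X @ Y is a covariance matrix? {is_covariance_matrix(X @ Y)}")
```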
I will write matrix square roots of $\Sigma$ as $\Sigma=A A^T$, to be consistent with the Cholesky decomposition, which is written as $\Sigma = L L^T$ with $L$ lower triangular. So let $X$ be a random vector with $\DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\Cov}{\mathbb{Cov}}\DeclareMathOperator{\Var}{\mathbb{Var}} \E X=\mu$ and $\Var X =\Sigma$. Now let $Z$ be a random vector with expectation zero and unit covariance matrix.
Note that there are (except for the scalar case) infinitely many matrix square roots. If we let $A$ be one of them, then we can find all the others as $A \mathcal{O}$, where $\mathcal{O}$ is any orthogonal matrix, that is, $\mathcal{O} \mathcal{O}^T = \mathcal{O}^T \mathcal{O} =I$. This is known as the unitary freedom of square roots.
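A quick numerical check of this unitary freedom; the particular $\Sigma$ and rotation angle below are arbitrary illustrative choices.

```python
# Check of the unitary freedom: if A A' = Sigma and Q is orthogonal, then
# (A Q)(A Q)' = Sigma as well. Sigma and the rotation angle are illustrative.
import numpy as np

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

A = np.linalg.cholesky(Sigma)               # one particular square root
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # an orthogonal (rotation) matrix

B = A @ Q                                   # another square root
print(np.allclose(B @ B.T, Sigma))          # True: B B' = A Q Q' A' = A A' = Sigma
```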
Let us look at some particular matrix square roots.
First, a square root from the spectral decomposition. Write $\Sigma = U \Lambda U^T = U\Lambda^{1/2}(U\Lambda^{1/2})^T$, so $A=U\Lambda^{1/2}$ is a square root; it can be interpreted as the PCA (principal component analysis) of $\Sigma$. (The closely related symmetric square root is $\Sigma^{1/2}=U\Lambda^{1/2}U^T$.)
The Cholesky decomposition is $\Sigma=L L^T$ with $L$ lower triangular. We can represent $X$ as $X=\mu + L Z$. Multiplying out to get scalar equations, we get a triangular system in $Z$, which in the time series case can be interpreted as an MA (moving average) representation.
In the general case $A= L \mathcal{O}$, combining the above, we can interpret $X = \mu + AZ = \mu + L(\mathcal{O}Z)$ as an MA representation applied after rotating $Z$.
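Here is a sketch of these constructions in NumPy, assuming an illustrative $\Sigma$ and $\mu$: the PCA root, the symmetric root, the Cholesky root, and the representation $X = \mu + LZ$.

```python
# Sketch of the constructions above: the PCA root U Lambda^{1/2}, the symmetric
# root U Lambda^{1/2} U', the Cholesky root L, and sampling X = mu + L Z.
# Sigma and mu are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(2)

Sigma = np.array([[3.0, 1.0, 0.2],
                  [1.0, 2.0, 0.4],
                  [0.2, 0.4, 1.0]])
mu = np.array([1.0, -2.0, 0.5])

lam, U = np.linalg.eigh(Sigma)              # spectral decomposition Sigma = U diag(lam) U'
A_pca = U @ np.diag(np.sqrt(lam))           # PCA square root
A_sym = A_pca @ U.T                         # symmetric square root
L = np.linalg.cholesky(Sigma)               # lower-triangular (Cholesky) square root

for A in (A_pca, A_sym, L):
    print(np.allclose(A @ A.T, Sigma))      # True for each square root

# Representing X = mu + L Z with Z of mean zero and unit covariance reproduces
# the desired mean and covariance.
Z = rng.standard_normal((3, 300_000))
X = mu[:, None] + L @ Z
print(X.mean(axis=1))                       # approximately mu
print(np.cov(X))                            # approximately Sigma
```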
Best Answer
The most common thing to do is probably to simply consider the covariance between the entries of the matrices. Defining $\DeclareMathOperator{\vec}{\mathrm{vec}}\vec(A)$ to be the vectorization of a matrix $A$ (that is, stack up its columns into a single column vector), you can look at $\DeclareMathOperator{\Cov}{\mathrm{Cov}}\Cov(\vec(X), \vec(Y))$. For $m \times n$ matrices (here $m = n$), this is an $mn \times mn$ matrix.
If you preferred, you could instead define an $m \times n \times m \times n$ tensor, which would be essentially the same thing, just reshaped.
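As a sketch of what this looks like numerically, the following estimates $\operatorname{Cov}(\operatorname{vec}(X), \operatorname{vec}(Y))$ from simulated samples; the generating mechanism ($Y$ equal to $X$ plus noise) and the dimensions are arbitrary illustrative choices.

```python
# Estimate Cov(vec(X), vec(Y)) from simulated random matrices. The generating
# mechanism (Y = X + noise) and the dimensions m, n are illustrative; the point
# is the column-stacking and the resulting mn x mn shape.
import numpy as np

rng = np.random.default_rng(3)
m, n, N = 2, 3, 100_000                     # matrix dimensions and sample size

X = rng.standard_normal((N, m, n))
Y = X + 0.5 * rng.standard_normal((N, m, n))   # dependent on X, for illustration

# vec(): stack the columns of each m x n sample into a length-mn vector.
vecX = X.transpose(0, 2, 1).reshape(N, m * n)
vecY = Y.transpose(0, 2, 1).reshape(N, m * n)

# Sample cross-covariance matrix Cov(vec(X), vec(Y)).
Xc = vecX - vecX.mean(axis=0)
Yc = vecY - vecY.mean(axis=0)
cross_cov = Xc.T @ Yc / (N - 1)

print(cross_cov.shape)                      # (mn, mn) = (6, 6)
```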
In the matrix normal distribution, for example, we assume that the covariance matrix of the single random matrix $X$ factors as the Kronecker product of an $m \times m$ row covariance $U$ and an $n \times n$ column covariance $V$, in which case you can often just work with $U$ or $V$.
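A minimal sketch of that Kronecker structure: if $X = AZB^T$ with $Z$ an $m \times n$ matrix of iid standard normals, $AA^T = U$ and $BB^T = V$, then $\operatorname{Cov}(\operatorname{vec}(X)) = V \otimes U$. The particular $U$, $V$, and sizes below are illustrative.

```python
# Matrix normal sketch: X = A Z B' with A A' = U, B B' = V and Z iid standard
# normal gives Cov(vec(X)) = V kron U. U, V and the sizes are illustrative.
import numpy as np

rng = np.random.default_rng(4)
m, n, N = 2, 2, 200_000

U = np.array([[2.0, 0.5],
              [0.5, 1.0]])                  # m x m row covariance
V = np.array([[1.0, 0.3],
              [0.3, 1.5]])                  # n x n column covariance
A = np.linalg.cholesky(U)
B = np.linalg.cholesky(V)

Z = rng.standard_normal((N, m, n))
X = A @ Z @ B.T                             # each sample is A Z_i B'

vecX = X.transpose(0, 2, 1).reshape(N, m * n)
print(np.cov(vecX, rowvar=False))           # approximately np.kron(V, U)
print(np.kron(V, U))
```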