Given $v \sim N_{p+1}(\mu, \Sigma), v = (x, y)', \mu = (\mu_x, \mu_y)', \Sigma = \begin{pmatrix}
\Sigma_{xx} & \sigma_{xy} \\
\sigma_{yx} & \sigma_{yy}
\end{pmatrix}$, how do you prove the formula for multiple correlation (the Pearson correlation between $\mu_{y|x}$ and $y$): $$\rho(y,x) = Corr(y, E[y|x]) = \left( \frac{\sigma_{yx}\Sigma_{xx}^{-1}\sigma_{xy}}{\sigma_{yy}} \right)^{1/2}$$
I tried to use $\mu_{y|x} = \mu_y + \sigma_{yx}\Sigma_{xx}^{-1}(x-\mu_x)$ and plug it into $Corr(y, E[y|x]) = \frac{\operatorname{Cov}(y, E[y|x])}{\sqrt{V(y)}\sqrt{V(E[y|x])}}$, without success.
Multiple Correlation – Deriving and Understanding the Multiple Correlation Formula in Multivariate Analysis
conditional, correlation, multivariate-analysis, pearson-r
Related Solutions
It is indeed something. To find out, we need to examine what we know about correlation itself.
The correlation matrix of a vector-valued random variable $\mathbf{X}=(X_1,X_2,\ldots,X_p)$ is the variance-covariance matrix, or simply "variance," of the standardized version of $\mathbf{X}$. That is, each $X_i$ is replaced by its recentered, rescaled version $$\frac{X_i - E[X_i]}{\sqrt{\operatorname{Var}(X_i)}}.\tag{1}$$
The covariance of $X_i$ and $X_j$ is the expectation of the product of their centered versions. That is, writing $X^\prime_i = X_i - E[X_i]$ and $X^\prime_j = X_j - E[X_j]$, we have
$$\operatorname{Cov}(X_i,X_j) = E[X^\prime_i X^\prime_j].\tag{2}$$
The variance of $\mathbf{X}$, which I will write $\operatorname{Var}(\mathbf{X})$, is not a single number. It is the array of values $$\operatorname{Var}(\mathbf{X})_{ij}=\operatorname{Cov}(X_i,X_j).\tag{3}$$
The way to think of the covariance for the intended generalization is to consider it a tensor. That means it's an entire collection of quantities $v_{ij}$, indexed by $i$ and $j$ ranging from $1$ through $p$, whose values change in a particularly simple predictable way when $\mathbf{X}$ undergoes a linear transformation. Specifically, let $\mathbf{Y}=(Y_1,Y_2,\ldots,Y_q)$ be another vector-valued random variable defined by
$$Y_i = \sum_{j=1}^p a_i^{\,j}X_j.$$
The constants $a_i^{\,j}$ ($i$ and $j$ are indexes--$j$ is not a power) form a $q\times p$ array $\mathbb{A} = (a_i^{\,j})$, $j=1,\ldots, p$ and $i=1,\ldots, q$. The linearity of expectation implies
$$\operatorname{Var}(\mathbf Y)_{ij} = \sum_{k,l} a_i^{\,k}a_j^{\,l}\operatorname{Var}(\mathbf X)_{kl}.\tag{4}$$
In matrix notation,
$$\operatorname{Var}(\mathbf Y) = \mathbb{A}\operatorname{Var}(\mathbf X) \mathbb{A}^\prime .$$
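Because the sample covariance matrix is itself bilinear in the data, this transformation law can be verified exactly on simulated data. A quick check in Python/NumPy (the dimensions and seed below are arbitrary choices):

```python
import numpy as np

# Check Var(Y) = A Var(X) A' on sample covariances. The identity is exact
# (up to rounding), since the sample covariance is bilinear in the data.
rng = np.random.default_rng(0)
p, q, n = 3, 2, 500
X = rng.normal(size=(n, p))        # n observations of a p-vector
A = rng.normal(size=(q, p))        # the q x p coefficient array
Y = X @ A.T                        # Y_i = sum_j a_i^j X_j, row by row

var_X = np.cov(X, rowvar=False)    # p x p sample variance of X
var_Y = np.cov(Y, rowvar=False)    # q x q sample variance of Y
assert np.allclose(var_Y, A @ var_X @ A.T)
```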
All the components of $\operatorname{Var}(\mathbf{X})$ can be expressed in terms of univariate variances, thanks to the Polarization Identity
$$4\operatorname{Cov}(X_i,X_j) = \operatorname{Var}(X_i+X_j) - \operatorname{Var}(X_i-X_j).$$
This tells us that if you understand variances of univariate random variables, you already understand covariances of bivariate variables: they are "just" linear combinations of variances.
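For a concrete check: sample variances and covariances obey the same algebra, so the Polarization Identity holds exactly on data (the seed and coefficients below are arbitrary):

```python
import numpy as np

# Numerical check of the Polarization Identity
#   4 Cov(X_i, X_j) = Var(X_i + X_j) - Var(X_i - X_j).
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)   # correlated with x

lhs = 4 * np.cov(x, y)[0, 1]          # np.cov uses ddof=1 ...
rhs = np.var(x + y, ddof=1) - np.var(x - y, ddof=1)  # ... so match it here
assert np.isclose(lhs, rhs)
```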
The expression in the question is perfectly analogous: the variables $X_i$ have been standardized as in $(1)$. We can understand what it represents by considering what it means for any variable, standardized or not. We would replace each $X_i$ by its centered version, as in $(2)$, and form quantities having three indexes,
$$\mu_3(\mathbf{X})_{ijk} = E[X_i^\prime X_j^\prime X_k^\prime].$$
These are the central (multivariate) moments of degree $3$. As in $(4)$, they form a tensor: when $\mathbf{Y} = \mathbb{A}\mathbf{X}$, then
$$\mu_3(\mathbf{Y})_{ijk} = \sum_{l,m,n} a_i^{\,l}a_j^{\,m}a_k^{\,n} \mu_3(\mathbf{X})_{lmn}.$$
The indexes in this triple sum range over all combinations of integers from $1$ through $p$.
The analog of the Polarization Identity is
$$\begin{aligned} 24\mu_3(\mathbf{X})_{ijk} ={}& \mu_3(X_i+X_j+X_k) - \mu_3(X_i-X_j+X_k)\\ &- \mu_3(X_i+X_j-X_k) + \mu_3(X_i-X_j-X_k).\end{aligned}$$
On the right hand side, $\mu_3$ refers to the (univariate) central third moment: the expected value of the cube of the centered variable. When the variables are standardized, this moment is usually called the skewness. Accordingly, we may think of $\mu_3(\mathbf{X})$ as being the multivariate skewness of $\mathbf{X}$. It is a tensor of rank three (that is, with three indices) whose values are linear combinations of the skewnesses of various sums and differences of the $X_i$. If we were to seek interpretations, then, we would think of these components as measuring in $p$ dimensions whatever the skewness is measuring in one dimension. In many cases,
The first moments measure the location of a distribution;
The second moments (the variance-covariance matrix) measure its spread;
The standardized second moments (the correlations) indicate how the spread varies in $p$-dimensional space; and
The standardized third and fourth moments are taken to measure the shape of a distribution relative to its spread.
To elaborate on what a multidimensional "shape" might mean, observe that we can understand principal component analysis (PCA) as a mechanism to reduce any multivariate distribution to a standard version, located at the origin and with equal spreads in all directions. After PCA is performed, then, $\mu_3$ would provide the simplest indicators of the multidimensional shape of the distribution. These ideas apply equally well to data as to random variables, because data can always be analyzed in terms of their empirical distribution.
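As a sanity check, the three-index identity $24\mu_3(\mathbf X)_{ijk}=\cdots$ quoted above can be verified on sample central moments; since both sides are the same polynomial in the centered data, it holds to rounding error (the skewed data below are an arbitrary choice):

```python
import numpy as np

# Check 24 E[x'y'z'] = mu3(x+y+z) - mu3(x-y+z) - mu3(x+y-z) + mu3(x-y-z),
# where mu3(w) = E[(w - E[w])^3] is the univariate central third moment.
rng = np.random.default_rng(2)
x, y, z = rng.exponential(size=(3, 1000))   # skewed data

def mu3(w):
    """Sample central third moment of a 1-D array."""
    return np.mean((w - w.mean()) ** 3)

xc, yc, zc = x - x.mean(), y - y.mean(), z - z.mean()
lhs = 24 * np.mean(xc * yc * zc)
rhs = mu3(x + y + z) - mu3(x - y + z) - mu3(x + y - z) + mu3(x - y - z)
assert np.isclose(lhs, rhs)
```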
Reference
Alan Stuart & J. Keith Ord, Kendall's Advanced Theory of Statistics Fifth Edition, Volume 1: Distribution Theory; Chapter 3, Moments and Cumulants. Oxford University Press (1987).
Appendix: Proof of the Generalized Polarization Identity
Let $x_1,\ldots, x_n$ be algebraic variables. There are $2^n$ ways to add and subtract all $n$ of them. When we raise each of these sums-and-differences to the $n^\text{th}$ power, pick a suitable sign for each of those results, and add them up, we will get a multiple of $x_1x_2\cdots x_n$.
More formally, let $S=\{1,-1\}^n$ be the set of all $n$-tuples of $\pm 1$, so that any element $s\in S$ is a vector $s=(s_1,s_2,\ldots,s_n)$ whose coefficients are all $\pm 1$. The claim is
$$2^n n!\, x_1x_2\cdots x_n = \sum_{s\in S} \color{red}{s_1s_2\cdots s_n}(s_1x_1+s_2x_2+\cdots+s_nx_n)^n.\tag{1}$$
A prettier way of writing this equality helps explain the factor of $2^n n!$ that appears: upon dividing by $2^n$ we obtain the average of the terms on the right side (since $S$ has $|S|=2^n$ elements) and the $n!$ counts the distinct ways to form the monomial $x_1\cdots x_n$ from products of its components--namely, it counts the elements of the symmetric group $\mathfrak{S}^n.$ Thus, upon abbreviating $s_1s_2\cdots s_n=\chi(\mathbf s)$ and letting $\mathbf{s}\cdot \mathbf{x} = s_1x_1+ \cdots + s_nx_n$ be the (usual) dot product of vectors,
$$\sum_{\sigma\in\mathfrak{S}^n} x_{\sigma(1)}x_{\sigma(2)}\cdots x_{\sigma(n)} = \frac{1}{|S|}\sum_{\mathbf s\in S} \color{red}{\chi(\mathbf s)}(\mathbf{s}\cdot \mathbf{x} )^n.\tag{1}$$
Indeed, the Multinomial Theorem states that the coefficient of the monomial $x_1^{i_1}x_2^{i_2}\cdots x_n^{i_n}$ (where the $i_j$ are nonnegative integers summing to $n$) in the expansion of any term on the right hand side is
$$\binom{n}{i_1,i_2,\ldots,i_n}s_1^{i_1}s_2^{i_2}\cdots s_n^{i_n}.$$
In the sum $(1)$, the coefficients involving $x_1^{i_1}$ appear in pairs where one of each pair involves the case $s_1=1$, with coefficient proportional to $ \color{red}{s_1}$ times $s_1^{i_1}$, equal to $1$, and the other of each pair involves the case $s_1=-1$, with coefficient proportional to $\color{red}{-1}$ times $(-1)^{i_1}$, equal to $(-1)^{i_1+1}$. They cancel in the sum whenever $i_1+1$ is odd. The same argument applies to $i_2, \ldots, i_n$. Consequently, the only monomials that occur with nonzero coefficients must have odd powers of all the $x_i$. The only such monomial is $x_1x_2\cdots x_n$. It appears with coefficient $\binom{n}{1,1,\ldots,1}=n!$ in all $2^n$ terms of the sum. Consequently its coefficient is $2^nn!$, QED.
We need take only half of each pair associated with $x_1$: that is, we can restrict the right hand side of $(1)$ to the terms with $s_1=1$ and halve the coefficient on the left hand side to $2^{n-1}n!$ . That gives precisely the two versions of the Polarization Identity quoted in this answer for the cases $n=2$ and $n=3$: $2^{2-1}2! = 4$ and $2^{3-1}3!=24$.
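The general identity $(1)$ is also easy to confirm by brute force for small $n$ (a short Python sketch; the range of $n$ and the tolerances are arbitrary):

```python
from itertools import product
import math
import random

# Check 2^n * n! * (x_1 ... x_n) = sum over sign vectors s of
#   chi(s) * (s . x)^n,  with chi(s) = s_1 s_2 ... s_n.
random.seed(3)
for n in range(1, 6):
    x = [random.uniform(-2, 2) for _ in range(n)]
    lhs = 2 ** n * math.factorial(n) * math.prod(x)
    rhs = 0.0
    for s in product((1, -1), repeat=n):        # all 2^n sign vectors
        chi = math.prod(s)
        dot = sum(si * xi for si, xi in zip(s, x))
        rhs += chi * dot ** n
    assert math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-6)
```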
Of course the Polarization Identity for algebraic variables immediately implies it for random variables: let each $x_i$ be a random variable $X_i$. Take expectations of both sides. The result follows by linearity of expectation.
If $X$ and $Y$ are correlated (univariate) normal random variables and $Z = AX+BY+C$, then the linearity of expectation and the bilinearity of the covariance function gives us that
\begin{align} E[Z] &= AE[X] + BE[Y] + C,\tag{1}\\ \operatorname{cov}(Z,X) &= \operatorname{cov}(AX+BY+C,X) = A\operatorname{var}(X) + B\operatorname{cov}(Y,X)\\ \operatorname{cov}(Z,Y) &= \operatorname{cov}(AX+BY+C,Y) = B\operatorname{var}(Y) + A\operatorname{cov}(X,Y)\\ \operatorname{var}(Z) &= \operatorname{var}(AX+BY+C) \quad = A^2\operatorname{var}(X) + B^2\operatorname{var}(Y) + 2AB \operatorname{cov}(X,Y), \tag{2}\\ \end{align} but it is not necessarily true that $Z$ is a normal (a.k.a Gaussian) random variable. That $X$ and $Y$ are jointly normal random variables is sufficient to assert that $Z = AX+BY+C$ is a normal random variable. Note that $X$ and $Y$ are not required to be independent; they can be correlated as long as they are jointly normal. For examples of normal random variables $X$ and $Y$ that are not jointly normal and yet their sum $X+Y$ is normal, see the answers to Is joint normality a necessary condition for the sum of normal random variables to be normal?. As pointed out at the end of my own answer there, joint normality means that all linear combinations $aX+bY$ are normal, whereas in the special case being discussed there, only one linear combination $X+Y$ of non-jointly normal random variables is proven to be normal; most other linear combinations are not normal.
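The scalar identities $(1)$-$(2)$ hold exactly for sample moments as well, which gives a quick check (the constants and seed below are arbitrary):

```python
import numpy as np

# With Z = A X + B Y + C, verify
#   cov(Z, X) = A var(X) + B cov(Y, X)  and
#   var(Z) = A^2 var(X) + B^2 var(Y) + 2 A B cov(X, Y).
rng = np.random.default_rng(4)
A, B, C = 2.0, -1.5, 3.0
x = rng.normal(size=1000)
y = 0.3 * x + rng.normal(size=1000)   # correlated with x
z = A * x + B * y + C

S = np.cov(np.vstack([x, y, z]))      # 3 x 3 sample covariance matrix
vx, vy, vz = S[0, 0], S[1, 1], S[2, 2]
cxy, czx = S[0, 1], S[2, 0]
assert np.isclose(czx, A * vx + B * cxy)
assert np.isclose(vz, A**2 * vx + B**2 * vy + 2 * A * B * cxy)
```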
More generally, if $X$ and $Y$ are (column) $n$-vector random variables with $n\times n$ covariance matrices $\Sigma_{X,X}$, $\Sigma_{Y,Y}$, and $n\times n$ crosscovariance matrix $\Sigma_{X,Y}$, $A$ and $B$ are $m\times n$ nonrandom matrices, and $Z = AX+BY+C$ and $C$ are (column) $m$-vectors, then it is indeed true that \begin{align} E[Z] &= AE[X] + BE[Y] + C &\quad \scriptstyle{\text{compare with } (1)}\\ \Sigma_{Z,Z} &= A\Sigma_{X,X}A^T + B\Sigma_{Y,Y}B^T + A\Sigma_{X,Y}B^T + B\Sigma_{X,Y}^T A^T &\quad \scriptstyle{\text{compare with } (2)}\\ \end{align} (the last two cross terms play the role of $2AB\operatorname{cov}(X,Y)$ in the scalar case; they collapse to $2A\Sigma_{X,Y}B^T$ only when that matrix is symmetric), but, as in the univariate case, it is not necessarily true that $Z$ is a normal vector (in the sense that the $m$ components $Z_i$ are jointly normal random variables). Once again, joint normality of $(X_1, X_2, \ldots, X_n, Y_1, Y_2, \ldots, Y_n)$ suffices to allow the assertion that $Z$ is a normal random vector.
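A corresponding check for the vector case, written with the general cross terms $A\Sigma_{X,Y}B^T + B\Sigma_{Y,X}A^T$ (these reduce to a single doubled term only when $A\Sigma_{X,Y}B^T$ happens to be symmetric; the dimensions and seed below are arbitrary):

```python
import numpy as np

# With vector Z = A X + B Y + C, verify
#   S_ZZ = A S_XX A' + B S_YY B' + A S_XY B' + B S_YX A'
# on sample covariances, where it holds exactly by bilinearity.
rng = np.random.default_rng(5)
n_dim, m_dim, n_obs = 3, 2, 400
X = rng.normal(size=(n_obs, n_dim))
Y = X @ rng.normal(size=(n_dim, n_dim)) + rng.normal(size=(n_obs, n_dim))
A = rng.normal(size=(m_dim, n_dim))
B = rng.normal(size=(m_dim, n_dim))
C = rng.normal(size=m_dim)
Z = X @ A.T + Y @ B.T + C

full = np.cov(np.hstack([X, Y]), rowvar=False)   # joint 2n x 2n covariance
S_XX, S_XY = full[:n_dim, :n_dim], full[:n_dim, n_dim:]
S_YX, S_YY = full[n_dim:, :n_dim], full[n_dim:, n_dim:]
S_ZZ = np.cov(Z, rowvar=False)
assert np.allclose(S_ZZ, A @ S_XX @ A.T + B @ S_YY @ B.T
                         + A @ S_XY @ B.T + B @ S_YX @ A.T)
```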
Best Answer
The approach you started with does work; carrying it through:
$$\rho(x, y) = \frac{\text{Cov}(\mu_{y|x}, y)}{\sqrt{V(\mu_{y|x}) V(y)}}.$$
Then, $$V(y) = \sigma_{yy},$$
\begin{align*} V(\mu_{y|x}) &= V(\mu_y + \sigma_{yx}\Sigma_{xx}^{-1}(x-\mu_x)) \\ &= V(\sigma_{yx}\Sigma_{xx}^{-1}x) && \small{ | \mu_y \text{ and } \sigma_{yx}\Sigma_{xx}^{-1}\mu_x \text{ are constants} }\\ &= \sigma_{yx}\Sigma_{xx}^{-1}V(x)(\sigma_{yx}\Sigma_{xx}^{-1})' && \small{ | \sigma_{xy} = \sigma_{yx}' \text{ and } \Sigma_{xx}^{-1} = (\Sigma_{xx}^{-1})' } \\ &= \sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xx}\Sigma_{xx}^{-1}\sigma_{xy} \\ &= \sigma_{yx}\Sigma_{xx}^{-1}\sigma_{xy} \end{align*} and
\begin{align*} \text{Cov}(\mu_{y|x}, y) &= \text{Cov}(\mu_{y} + \sigma_{yx}\Sigma_{xx}^{-1}(x-\mu_x), y) \\ &= \text{Cov}(\sigma_{yx}\Sigma_{xx}^{-1}x, y) \\ &= \sigma_{yx}\Sigma_{xx}^{-1} \text{Cov}(x, y) \\ &= \sigma_{yx}\Sigma_{xx}^{-1} \sigma_{xy} \end{align*} Here, we used $\text{Cov}(ax + b, cy) = a\text{Cov}(x, y)c'$ and $V(x) = \text{Cov}(x,x)$. We can plug these results back into the formula and simplify:
\begin{align} \rho(x, y) &= \frac{\text{Cov}(\mu_{y|x}, y)}{\sqrt{V(\mu_{y|x}) V(y)}} \\ &= \frac{\sigma_{yx}\Sigma_{xx}^{-1} \sigma_{xy}}{\sqrt{\sigma_{yx}\Sigma_{xx}^{-1}\sigma_{xy} \sigma_{yy}}} \\ &= \sqrt{\frac{\sigma_{yx}\Sigma_{xx}^{-1} \sigma_{xy}}{\sigma_{yy}}} \end{align}
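The result can also be checked by simulation: draw from a specific multivariate normal (the $\Sigma$ below is an arbitrary positive definite choice), form $E[y\mid x]$ for each draw, and compare the sample correlation with the closed form:

```python
import numpy as np

# Monte Carlo check: for v = (x, y)' ~ N(mu, Sigma), the correlation
# between y and E[y|x] should be sqrt(s_yx Sxx^{-1} s_xy / s_yy).
rng = np.random.default_rng(6)
p = 2
Sigma = np.array([[2.0, 0.6, 0.8],
                  [0.6, 1.5, 0.5],
                  [0.8, 0.5, 1.0]])   # last row/column corresponds to y
mu = np.array([1.0, -2.0, 0.5])

Sxx, s_xy = Sigma[:p, :p], Sigma[:p, p]
s_yy = Sigma[p, p]
rho = np.sqrt(s_xy @ np.linalg.solve(Sxx, s_xy) / s_yy)  # closed form

v = rng.multivariate_normal(mu, Sigma, size=200_000)
x, y = v[:, :p], v[:, p]
beta = np.linalg.solve(Sxx, s_xy)    # regression coefficients of E[y|x]
m = mu[p] + (x - mu[:p]) @ beta      # E[y|x] evaluated at each draw
assert abs(np.corrcoef(y, m)[0, 1] - rho) < 0.01
```

Note that $E[y\mid x]$ is a linear function of $x$, so this sample correlation is also the square root of the usual $R^2$ from regressing $y$ on $x$.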