Solved – Canonical Correlation analysis without raw data (algebra of CCA)

algorithms, canonical-correlation, r

I want to run a Canonical Correlation (in R) but I don't have the original (raw) data. I have only the correlation matrix of all the variables.

I have seen some questions here about this, but my question remains unsolved. One user gave a partial solution (http://www.stat.wmich.edu/wang/561/egs/Rcancor.html), but I need the canonical loadings, the percentage of variance in set Y explained by set X, and the significance of the variables.

Could anyone here help me?

P.S.: I am a new R user. I have experience only with Eviews, GRETL and SPSS (and a little with Stata).

Best Answer

If you don't have the original casewise data but know the correlations (and hopefully the variances and the sample size), you may simply generate random data having those correlations and analyze that dataset as usual with any canonical correlation program that takes raw data as input. This way, every output will be correct except the values of the canonical variates, since computing those would require the true data you don't have.

But anyway, if you want to program canonical correlation analysis (CCA) yourself, here is a step-by-step algorithm. You may use any language that has basic linear algebra matrix functions.
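As a rough illustration of the "generate data with those correlations" idea, here is a minimal NumPy sketch (Python rather than R, since only basic matrix functions are needed). The function name `simulate_with_correlation` and the 5-variable correlation matrix are made up for the example. The trick, which is one possible approach and not the only one, is to whiten random noise and then colour it with the Cholesky factor of the target matrix, so the sample correlations match the target exactly rather than only on average.

```python
import numpy as np

def simulate_with_correlation(R, n, seed=0):
    """Return an (n, p) array whose sample correlation matrix equals R (needs n > p)."""
    R = np.asarray(R, dtype=float)
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, R.shape[0]))
    X -= X.mean(axis=0)                       # centre the noise
    # Whiten: strip the accidental sample covariance from the noise.
    L_noise = np.linalg.cholesky(np.cov(X, rowvar=False))
    X = np.linalg.solve(L_noise, X.T).T
    # Colour: impose the target correlations through the Cholesky factor of R.
    return X @ np.linalg.cholesky(R).T

# Hypothetical correlation matrix of 5 variables (illustration only).
R = np.array([
    [1.00, 0.45, 0.36, 0.40, 0.23],
    [0.45, 1.00, 0.15, 0.50, 0.20],
    [0.36, 0.15, 1.00, 0.06, 0.15],
    [0.40, 0.50, 0.06, 1.00, 0.18],
    [0.23, 0.20, 0.15, 0.18, 1.00],
])

# Use the real sample size here if you know it; 100 is a placeholder.
data = simulate_with_correlation(R, n=100)
```

The simulated columns have mean 0 and variance 1. If the true standard deviations are known, multiplying each column by its standard deviation restores the original scale without changing the correlations.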


Let $\bf R_1$ be the correlations (or covariances) within Set1 of $p_1$ variables, $\bf R_2$ the correlations (or covariances) within Set2 of $p_2$ variables, and $\bf R_{12}$ the $p_1 \times p_2$ correlations (or covariances) between the sets.

Make $\bf S_1$ the diagonal matrix containing the standard deviations in Set1; likewise $\bf S_2$ the diagonal matrix with the standard deviations in Set2. If you don't know the variances (for example, when you know only the correlations), assume that they all equal 1. Then the unstandardized canonical coefficients will be equal to the standardized ones.

Doing analysis on covariance matrices is equivalent to analyzing centered variables, while doing analysis on correlation matrices is equivalent to analyzing z-standardized variables.
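To make the notation concrete, the following NumPy sketch cuts one full correlation matrix into the three blocks. The matrix values and the split (first 2 variables in Set1, remaining 3 in Set2) are assumptions for illustration only.

```python
import numpy as np

# Hypothetical correlation matrix of all 5 variables (illustration only).
R = np.array([
    [1.00, 0.45, 0.36, 0.40, 0.23],
    [0.45, 1.00, 0.15, 0.50, 0.20],
    [0.36, 0.15, 1.00, 0.06, 0.15],
    [0.40, 0.50, 0.06, 1.00, 0.18],
    [0.23, 0.20, 0.15, 0.18, 1.00],
])

p1 = 2                     # number of variables in Set1 (assumed)
p2 = R.shape[0] - p1       # the remaining variables form Set2

R1 = R[:p1, :p1]           # within-Set1 correlations, p1 x p1
R2 = R[p1:, p1:]           # within-Set2 correlations, p2 x p2
R12 = R[:p1, p1:]          # between-set correlations, p1 x p2

# Only correlations are known, so all standard deviations are taken as 1.
S1 = np.eye(p1)
S2 = np.eye(p2)
```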


Find $\bf H_1$, the Cholesky root of $\bf R_1$: an upper-triangular matrix whereby $\bf{H_1'H_1=R_1}$. (Please note that Wikipedia shows it transposed, as "L", lower-triangular.) Likewise, find $\bf H_2$, the Cholesky root of $\bf R_2$.
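The upper/lower distinction matters in practice. As a small sketch, NumPy's `np.linalg.cholesky` returns the lower-triangular factor, so it has to be transposed to get the upper-triangular root used here. The 2 x 2 matrix is a made-up example.

```python
import numpy as np

# Hypothetical within-set correlation matrix for Set1 (illustration only).
R1 = np.array([
    [1.00, 0.45],
    [0.45, 1.00],
])

# np.linalg.cholesky gives lower-triangular L with L @ L.T == R1;
# the algorithm needs the upper-triangular root, so transpose it.
H1 = np.linalg.cholesky(R1).T

print(H1)                           # upper-triangular root
print(np.allclose(H1.T @ H1, R1))   # checks H1'H1 = R1
```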


Compute $\bf W$:

$\bf W = {H_1'}^{-1} R_{12} {H_2}^{-1}$, if $p_1 \le p_2$; or

$\bf W = {H_2'}^{-1} R_{12}' {H_1}^{-1}$, if $p_1 \gt p_2$.

Do singular-value decomposition of $\bf W$, whereby $\bf W=UDV'$.

The canonical correlations $\gamma_1, \gamma_2, \dots, \gamma_m$, where $m=\min(p_1,p_2)$, stand on the diagonal of $\bf D$. For how to test them for significance, see here.
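The steps up to this point can be sketched in NumPy as follows. The correlation matrix and the split into sets are assumptions for illustration, and the example covers only the $p_1 \le p_2$ case.

```python
import numpy as np

# Hypothetical correlation matrix; first 2 variables are Set1, last 3 are Set2.
R = np.array([
    [1.00, 0.45, 0.36, 0.40, 0.23],
    [0.45, 1.00, 0.15, 0.50, 0.20],
    [0.36, 0.15, 1.00, 0.06, 0.15],
    [0.40, 0.50, 0.06, 1.00, 0.18],
    [0.23, 0.20, 0.15, 0.18, 1.00],
])
p1 = 2
R1, R2, R12 = R[:p1, :p1], R[p1:, p1:], R[:p1, p1:]

# Upper-triangular Cholesky roots (NumPy returns the lower one, hence .T).
H1 = np.linalg.cholesky(R1).T
H2 = np.linalg.cholesky(R2).T

# W = H1'^{-1} R12 H2^{-1}  (the p1 <= p2 case)
W = np.linalg.solve(H1.T, R12) @ np.linalg.inv(H2)

# np.linalg.svd returns U, the singular values, and V' (not V).
U, d, Vt = np.linalg.svd(W)

print(d)   # canonical correlations, largest first
```

A quick sanity check is that the squared canonical correlations equal the eigenvalues of $\bf R_1^{-1}R_{12}R_2^{-1}R_{12}'$, which is the textbook eigenproblem form of CCA.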


Compute standardized canonical coefficients $\bf K_1$ (for Set1) and $\bf K_2$ (for Set2):

$\bf K_1 = H_1^{-1}U$ and $\bf K_2 = H_2^{-1}V$ (first $p_1$ columns of $\bf K_2$), if $p_1 \le p_2$; or

$\bf K_1 = H_1^{-1}V$ (first $p_2$ columns of $\bf K_1$) and $\bf K_2 = H_2^{-1}U$, if $p_1 \gt p_2$.

Standardized coefficients correspond to decomposing the $\bf R$-matrices as if they were correlation matrices, even if they were actually covariance matrices; hence the label "standardized".
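A minimal NumPy sketch of the coefficient step, again for the $p_1 \le p_2$ case and with a made-up correlation matrix. The closing checks rely on two identities that follow from the construction: $\bf K_1'R_1K_1 = I$ and $\bf K_1'R_{12}K_2 = D$.

```python
import numpy as np

# Hypothetical correlation matrix; first 2 variables are Set1, last 3 are Set2.
R = np.array([
    [1.00, 0.45, 0.36, 0.40, 0.23],
    [0.45, 1.00, 0.15, 0.50, 0.20],
    [0.36, 0.15, 1.00, 0.06, 0.15],
    [0.40, 0.50, 0.06, 1.00, 0.18],
    [0.23, 0.20, 0.15, 0.18, 1.00],
])
p1 = 2
R1, R2, R12 = R[:p1, :p1], R[p1:, p1:], R[:p1, p1:]
H1, H2 = np.linalg.cholesky(R1).T, np.linalg.cholesky(R2).T
U, d, Vt = np.linalg.svd(np.linalg.solve(H1.T, R12) @ np.linalg.inv(H2))

# K1 = H1^{-1} U ;  K2 = H2^{-1} V, keeping only the first p1 columns of V.
K1 = np.linalg.solve(H1, U)
K2 = np.linalg.solve(H2, Vt.T[:, :p1])

# The variates have unit variance and correlate across sets only pairwise.
print(np.allclose(K1.T @ R1 @ K1, np.eye(p1)))
print(np.allclose(K1.T @ R12 @ K2, np.diag(d)))
```

Note that the signs of the columns of $\bf K_1$ and $\bf K_2$ are arbitrary (an SVD property), so another program may report a pair of columns with flipped signs.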


Compute unstandardized canonical coefficients $\bf C_1$ (for Set1) and $\bf C_2$ (for Set2):

$\bf C_1 = S_1^{-1}K_1$ and $\bf C_2 = S_2^{-1}K_2$.

When the three input $\bf R$-matrices are correlations, not covariances, and the two $\bf S$ diagonals consist of ones (which corresponds to the analysis of z-standardized variables), the standardized and unstandardized coefficients are the same. Some CCA programs don't display unstandardized coefficients at all, mostly those that base the analysis only on correlations; such programs may omit the label "standardized" when they output the (standardized) coefficients.


Compute canonical loadings $\bf A_1$ (for Set1) and $\bf A_2$ (for Set2):

$\bf A_1 = S_1^{-1}(S_1R_1S_1)C_1$ and $\bf A_2 = S_2^{-1}(S_2R_2S_2)C_2$ .

Mean squares in the columns of $\bf A_1$ are the proportions of variance in Set1 explained by its own canonical variates. The same holds for $\bf A_2$ and Set2.
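Here is a NumPy sketch of the unstandardized coefficients and the loadings. The correlation matrix and the standard deviations (`sd1`, `sd2`) are invented for the example. With correlation input, the loading formula collapses to $\bf A_1 = R_1K_1$, so the standard deviations cancel out and the loadings do not depend on them.

```python
import numpy as np

# Hypothetical correlation matrix; first 2 variables are Set1, last 3 are Set2.
R = np.array([
    [1.00, 0.45, 0.36, 0.40, 0.23],
    [0.45, 1.00, 0.15, 0.50, 0.20],
    [0.36, 0.15, 1.00, 0.06, 0.15],
    [0.40, 0.50, 0.06, 1.00, 0.18],
    [0.23, 0.20, 0.15, 0.18, 1.00],
])
p1 = 2
R1, R2, R12 = R[:p1, :p1], R[p1:, p1:], R[:p1, p1:]
H1, H2 = np.linalg.cholesky(R1).T, np.linalg.cholesky(R2).T
U, d, Vt = np.linalg.svd(np.linalg.solve(H1.T, R12) @ np.linalg.inv(H2))
K1 = np.linalg.solve(H1, U)
K2 = np.linalg.solve(H2, Vt.T[:, :p1])

# Hypothetical standard deviations (use np.eye(...) if they are unknown).
S1 = np.diag([2.0, 5.0])
S2 = np.diag([1.5, 0.5, 3.0])

# Unstandardized coefficients: C = S^{-1} K
C1 = np.linalg.inv(S1) @ K1
C2 = np.linalg.inv(S2) @ K2

# Loadings: A = S^{-1} (S R S) C
A1 = np.linalg.inv(S1) @ (S1 @ R1 @ S1) @ C1
A2 = np.linalg.inv(S2) @ (S2 @ R2 @ S2) @ C2

# Proportion of each set's variance explained by its own variates.
var_explained_1 = (A1**2).mean(axis=0)
var_explained_2 = (A2**2).mean(axis=0)
print(var_explained_1, var_explained_2)
```

In this example Set1 has as many variables as there are variates, so its proportions sum to 1; Set2 has more variables than variates, so its proportions sum to less than 1.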


Compute canonical cross-loadings $\bf A_{12}$ (for Set1) and $\bf A_{21}$ (for Set2):

$\bf A_{12} = S_1^{-1}(S_1R_{12}S_2)C_2$ and $\bf A_{21} = S_2^{-1}(S_1R_{12}S_2)'C_1$ .

Mean squares in the columns of $\bf A_{12}$ are the proportions of variance in Set1 explained by the opposite set's canonical variates. The same holds for $\bf A_{21}$ and Set2.
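The "variance in one set explained by the other set" asked about in the question comes from these cross-loadings. Below is a NumPy sketch under the same assumptions as before (made-up correlation matrix, unit standard deviations, so $\bf A_{12} = R_{12}K_2$ and $\bf A_{21} = R_{12}'K_1$).

```python
import numpy as np

# Hypothetical correlation matrix; first 2 variables are Set1, last 3 are Set2.
R = np.array([
    [1.00, 0.45, 0.36, 0.40, 0.23],
    [0.45, 1.00, 0.15, 0.50, 0.20],
    [0.36, 0.15, 1.00, 0.06, 0.15],
    [0.40, 0.50, 0.06, 1.00, 0.18],
    [0.23, 0.20, 0.15, 0.18, 1.00],
])
p1 = 2
R1, R2, R12 = R[:p1, :p1], R[p1:, p1:], R[:p1, p1:]
H1, H2 = np.linalg.cholesky(R1).T, np.linalg.cholesky(R2).T
U, d, Vt = np.linalg.svd(np.linalg.solve(H1.T, R12) @ np.linalg.inv(H2))
K1 = np.linalg.solve(H1, U)
K2 = np.linalg.solve(H2, Vt.T[:, :p1])

# Cross-loadings with all standard deviations equal to 1 (S1 = S2 = I).
A12 = R12 @ K2        # Set1 variables vs. Set2 variates
A21 = R12.T @ K1      # Set2 variables vs. Set1 variates

# Proportion of variance in each set explained by the opposite set's variates.
explained_in_set1_by_set2 = (A12**2).mean(axis=0)
explained_in_set2_by_set1 = (A21**2).mean(axis=0)
print(explained_in_set1_by_set2, explained_in_set2_by_set1)
```

Each cross-loading column is the matching own-set loading column multiplied by the canonical correlation, so the explained proportions here are the own-set proportions scaled by $\gamma_j^2$.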


Compute canonical variates scores (if you have casewise data at hand):

Variates extracted from Set1 $\bf Z_1=X_1K_1$ and variates extracted from Set2 $\bf Z_2=X_2K_2$, where $\bf X_1$ and $\bf X_2$ are the (centered) variables of Set1 and Set2.

The variates are produced standardized (mean = 0, st. dev. = 1). The Pearson correlation between variates $Z_{1(j)}$ and $Z_{2(j)}$ is the canonical correlation $\gamma_j$. For a visual explanation of the idea of canonical correlations, please see here.
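When casewise data are available, the claims above are easy to verify numerically. The sketch below uses simulated data (an assumption made purely so that there is something to score), computes the correlation matrix from it, runs the same steps, and applies the standardized coefficients to z-standardized variables.

```python
import numpy as np

# Simulated casewise data (illustration only): 2 variables in Set1, 3 in Set2.
rng = np.random.default_rng(1)
n, p1 = 200, 2
X = rng.standard_normal((n, 5))
X[:, 2:] += 0.8 * X[:, :1]        # make Set2 depend on the first Set1 variable

# z-standardize, since the analysis below runs on the correlation matrix.
Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

R1, R2, R12 = R[:p1, :p1], R[p1:, p1:], R[:p1, p1:]
H1, H2 = np.linalg.cholesky(R1).T, np.linalg.cholesky(R2).T
U, d, Vt = np.linalg.svd(np.linalg.solve(H1.T, R12) @ np.linalg.inv(H2))
K1 = np.linalg.solve(H1, U)
K2 = np.linalg.solve(H2, Vt.T[:, :p1])

# Canonical variate scores.
Z1 = Xz[:, :p1] @ K1
Z2 = Xz[:, p1:] @ K2

# Correlation of each pair of variates; should reproduce d.
pair_corr = np.array([np.corrcoef(Z1[:, j], Z2[:, j])[0, 1] for j in range(p1)])
print(d)
print(pair_corr)
```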
