Solved – Calculating CCA “scores” by hand in R

canonical-correlationeigenvaluesr

I'm trying to compute "by hand" the output of some popular Canonical Correlation Analysis functions in R, in order to be sure I understand the underlying math.

I can produce the "canonical coefficients" that can be used to transform the original variables to the "scores" that make up the canonical variates.

But I am struggling to actually compute the same scores that the popular R functions produce:

cancor() (base R)

cc() (lib cca)

cca() (lib yacca)

The ?help documentation is a little light for the functions above, in terms of not being specific about how scores are calculated. ?cca has the most detailed explanation, but it does not elaborate about how "standardized coefficients" are calculated as an intermediate step to computing the scores.

The raw canonical coefficients output for these functions are accessed via "..\$xcoef" and "..\$ycoef" (to get the canonical coeffs for the "x" and "y" variables, respectively).

The "xcoef" and "ycoef" output for the 3 functions is not identical for a given dataset, but the difference is just a constant in each column (so the output is the same besides a scaling difference).

I can produce raw coeff output by hand that is consistent with these R functions, by pulling out the eigenvectors of the appropriate Covariance Matrix.

My understanding is that taking the eigenvectors of the two matrices below is correct, and that this is one of several formulas out there that should work:

$$
C_x = C_{xx}^{-1}C_{xy}C_{yy}^{-1}C_{yx}
$$
$$
C_y = C_{yy}^{-1}C_{yx}C_{xx}^{-1}C_{xy}
$$

Where
$C_x$ is the canonical covariance matrix for the X variables, from which we pull the eigenvalues and eigenvectors for CCA.

$C_y$ is the canonical covariance matrix of the Y variables, from which we pull the eigenvalues and eigenvectors for CCA.

$C_{xx}$ is the covariance matrix of the X variables

$C_{yy}$ is the covariance matrix of the Y variables

$C_{xy}$ is the covariance matrix of the X and Y variables

$C_{yx}$ is the covariance matrix of the Y and X variables

Below I take the familiar Iris dataset to create a simple example to do by hand and compare to what the 3 R functions produce:

library(CCA)                                                                                                                                                           
library(yacca)                                                                                                                                                         

data(iris)                                                                                                                                                             
head(iris)                                                                                                                                                                                                                                                               

irisNum = iris[, 1:4] ## strip out species data                                                                                                                        
X = iris[, 1:2] ## Sepal variables                                                                                                                                     
Y = iris[, 3:4] ## Petal variables  

totalCov = cov(irisNum)                                                                                                                                                
Cxx = totalCov[1:2, 1:2]                                                                                                                                               
Cyy = totalCov[3:4, 3:4]                                                                                                                                               
Cxy = totalCov[1:2, 3:4]                                                                                                                                               
Cyx = totalCov[3:4, 1:2]                                                                                                                                               
CxxInv = solve(Cxx)                                                                                                                                                    
CyyInv = solve(Cyy)                                                                                                                                                    

Cx = CxxInv %*% Cxy %*% CyyInv %*% Cyx                                                                                                                                                                          
Cy = CyyInv %*% Cyx %*% CxxInv %*% Cxy      

vX = eigen(Cx)$vectors  ## my raw coefficient matrix for X variables                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
vY = eigen(Cy)$vectors  ## my raw coefficient matrix for Y variables

c1 = cancor(X, Y)  ## base R function                                                                                                                                                   
c2 = cc(X, Y)      ## lib CCA                                                                                                                                  
c3 = cca(X, Y)     ## lib yacca

## R functions all produce canonical coeffs that only differ by a scalar multiple:

c1$xcoef / c2$xcoef    
##                    [,1]       [,2]                                                                                                                                    
## Sepal.Length 0.08192319 0.08192319                                                                                                                                    
## Sepal.Width  0.08192319 0.08192319                                                                                                                                    

c2$xcoef / c3$xcoef                                                                                                                                            
##              [,1] [,2]                                                                                                                                                
## Sepal.Length   -1    1                                                                                                                                                
## Sepal.Width    -1    1                                                                                                                           

## and my raw coefficient matrix also only differs from them by a scalar multiple:
c2$xcoef/vX
##                   [,1]     [,2]                                                                                                                                       
## Sepal.Length -1.368843 2.223195                                                                                                                                       
## Sepal.Width  -1.368843 2.223195                                                                                                                                       

c2$ycoef/vY                                                                                                                                              
##                   [,1]    [,2]                                                                                                                                        
## Petal.Length -1.096528 5.17813                                                                                                                                        
## Petal.Width  -1.096528 5.17813 

So getting raw coefficients is fine. But now I don't know how to use them to reproduce the scores that come out of the cc() and cca() functions (note the cancor() function does not produce scores):

## X variable scores using cc()                                                                                                                                                                                                                                                        
head(c2$scores[[1]])                                                                                                                                                   
##           [,1]       [,2]                                                                                                                                             
## [1,] 1.1730856  0.5191447                                                                                                                                             
## [2,] 0.9593861 -0.6699407                                                                                                                                             
## [3,] 1.3441807 -0.3566336                                                                                                                                             
## [4,] 1.3655796 -0.6292350                                                                                                                                             
## [5,] 1.3654828  0.6757983                                                                                                                                             
## [6,] 1.1943878  1.5515766                                                                                                                                             

## X variable scores using cca()                                                                                                                                                           
head(c3$canvarx) ## differs from cc() output only in sign                                                                                                                                                                                                                                                                                      
##            CV 1       CV 2                                                                                                                                            
## [1,] -1.1730856  0.5191447                                                                                                                                            
## [2,] -0.9593861 -0.6699407                                                                                                                                            
## [3,] -1.3441807 -0.3566336                                                                                                                                            
## [4,] -1.3655796 -0.6292350                                                                                                                                            
## [5,] -1.3654828  0.6757983                                                                                                                                            
## [6,] -1.1943878  1.5515766    

I believe from looking at the R ?cca documentation that to get the scores that are coming out of that function's output I have to multiply my original matrix of variables by a matrix of canonical coefficients, but instead of using the raw coefficients I calculated above, I need to standardize them first…I am unsure how to do this properly. Also, I'm unsure if I need to standardize my original matrix of variables first before transforming them with the matrix of standardized coefficients.

Best Answer

You can always explore the code of the functions by just using yacca::cca or base::cancor, cca::cc.

To calculate the scores you need to apply the loadings to the original points (if they weren't scaled) by using a matrix multiplication. It hey were scales, use then you need to subtract from the original values the mean, this will be your centered values.