Solved – How to interpret results from Canonical Correlation Analysis (CCA)

canonical-correlation

I am learning CCA following an example posted here: https://stats.idre.ucla.edu/r/dae/canonical-correlation-analysis/

I have questions about how to interpret the canonical coefficients, the canonical loadings, and the figures from the statistical tests. The original article is vague on these points and does not answer my questions, so I have adapted its example below. I realize I am asking many questions here, but an answer to any one of them would be highly appreciated!

Data: two sets of variables, exactly as in the example on that page:

  • X as psychology traits (control, concept, motivation)
  • Y as academic achievements (read, write, math, science, sex)

Canonical Coefficients calculated by R:

$xcoef
              [,1]    [,2]    [,3]
Control    -1.2538 -0.6215 -0.6617
Concept     0.1000 -1.1877  0.8267
Motivation -1.2624  2.0273  2.0002

$ycoef
             [,1]      [,2]      [,3]
Read    -0.044621 -0.004910  0.021381
Write   -0.035877  0.042071  0.091307
Math    -0.623417  0.004229  0.009398
Science -0.000125 -0.085162 -0.109835

Canonical Loadings calculated by R:

$corr.X.xscores
               [,1]    [,2]    [,3]
Control    -0.90405 -0.3897 -0.1756
Concept    -0.02084 -0.7087  0.7052
Motivation -0.56715  0.3509  0.7451

$corr.Y.yscores
            [,1]     [,2]    [,3]
Read    -0.8404 -0.35883  0.1354
Write   -0.8765  0.06484  0.2546
Math    -0.7639 -0.29795  0.1478
Science -0.0564 -0.67680 -0.2304

Interpretation based on the cited article: Let's focus on the first canonical variate pair. From the article, I understand the following, using the 'Science' variable as an example:

  • canonical coefficient: a one-unit increase in Science changes the first canonical variate of Set 2 by -0.000125, with the other variables in the set held constant
  • canonical loading: Science has an almost negligible correlation (-0.0564) with the first canonical variate of Set 2
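
One way to check the coefficient reading above is on simulated data. The following is a minimal sketch using base R's cancor (not the CCA package used in the article); the data are randomly generated stand-ins, not the UCLA data set, and the variable names are reused only for readability. It shows that adding one unit to a single variable shifts the first canonical variate by exactly that variable's coefficient.

```r
set.seed(1)
n <- 200

# Hypothetical stand-in data (NOT the UCLA data set)
X <- matrix(rnorm(n * 3), n, 3,
            dimnames = list(NULL, c("control", "concept", "motivation")))
Y <- X %*% matrix(rnorm(12), 3, 4) + matrix(rnorm(n * 4), n, 4)
colnames(Y) <- c("read", "write", "math", "science")

fit <- cancor(X, Y)

# First canonical variate of the Y set: centered Y times the coefficients
Yc <- sweep(Y, 2, fit$ycenter)
v1 <- Yc %*% fit$ycoef[, 1]

# Add one unit to science for every observation, keeping the original centering
Y_plus <- Y
Y_plus[, "science"] <- Y_plus[, "science"] + 1
v1_plus <- sweep(Y_plus, 2, fit$ycenter) %*% fit$ycoef[, 1]

# Every observation's score moves by the science coefficient
shift <- drop(v1_plus - v1)
print(range(shift))
print(fit$ycoef["science", 1])
```

Note that cancor scales its coefficients differently from the CCA package, so the magnitudes are not comparable with the output above; the interpretation as a per-unit weight is the same.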

QUESTION 1: If these readings are correct, what do they actually 'mean'? Does it make sense to say that, comparatively, 'Science' is NOT a useful variable for studying the relation between the Set 1 and Set 2 variables? My reasons are: 1) compared to the other variables in Set 2 (Read, Write, Math, Sex), its contribution to the first canonical variate, as indicated by its canonical coefficient (-0.000125), is orders of magnitude lower; and 2) changes in its value lead to almost no noticeable change in the first canonical variate, as indicated by its correlation with that variate (its canonical loading, -0.0564).

QUESTION 2: Ignoring 'Science' for now, how do we interpret the relation between the remaining Set 2 variables and the Set 1 variables? Does it make sense to say that 'Math' appears to be the most influential variable in Set 2 because it has the largest coefficient (in absolute value) among all variables, and that the Set 2 variables generally seem to relate to one's Control and Motivation? And is the relation positive, i.e., the higher we score on Math, the better we are at Control and Motivation?

The article then goes on to test the statistical significance of each canonical variate pair, and reports the following statistics:

      WilksL      F df1  df2         p
 [1,] 0.7544 11.716  15 1635 7.498e-28
 [2,] 0.9614  2.944   8 1186 2.905e-03
 [3,] 0.9892  2.165   3  594 9.109e-02

with an explanation 'the first test of the canonical dimensions tests whether all three dimensions are significant (they are, F = 11.72), the next test tests whether dimensions 2 and 3 combined are significant (they are, F = 2.94). Finally, the last test tests whether dimension 3, by itself, is significant (it is not)'. This is where I am totally lost.

First, why is the first test for all three dimensions (i.e., canonical variate pairs), not the first dimension? And the second test for dimensions 2 and 3 combined, not just the second dimension? What does 'combined' mean and why do we measure them combined, not each individual dimension?
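
As background for the 'combined' wording: Wilks' lambda in row k is the product of (1 - r_i^2) over the canonical correlations r_k, ..., r_m, so each row tests the null hypothesis that dimension k and all later dimensions are zero. A small sketch using only the WilksL column above illustrates this nesting; the recovered correlations are approximate because the table values are rounded.

```r
# WilksL column from the table above (rounded values)
wilks <- c(0.7544, 0.9614, 0.9892)

# Lambda_k = prod_{i >= k} (1 - r_i^2), so Lambda_k / Lambda_{k+1} = 1 - r_k^2
# (with Lambda_{m+1} taken as 1 for the last row)
ratio <- wilks / c(wilks[-1], 1)
rho <- sqrt(1 - ratio)
print(round(rho, 3))  # approximate canonical correlations, dimension 1 to 3

# Rebuilding the column from the correlations: row 1 uses all three
# dimensions, row 2 uses dimensions 2 and 3, row 3 uses dimension 3 only
rebuilt <- rev(cumprod(rev(1 - rho^2)))
print(rebuilt)
```

Read this way, the first row does not test dimension 1 alone: it asks whether there is any association at all across dimensions 1 to 3, and each later row drops the leading dimension.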

Second, what value of F counts as significant? The conclusions based on F seem to make no sense: the first F = 11.7 and the second F = 2.9 are both 'significant' even though they differ by almost 9, while the third F = 2.165 is only about 0.78 less than the second, yet the author says it is not significant. How do we interpret F? How large is 'large'? In my opinion, the p-value makes more sense here, because only the last p-value is > 0.05.
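
For context on reading F: an F statistic is judged against an F distribution with its own degrees of freedom (df1, df2), so F values from different rows are not directly comparable. The sketch below uses base R's pf and qf with the values from the table above; the 0.05 threshold is an assumed conventional choice, and the column names are only illustrative.

```r
# F statistics and degrees of freedom copied from the table above
tests <- data.frame(
  Fstat = c(11.716, 2.944, 2.165),
  df1   = c(15, 8, 3),
  df2   = c(1635, 1186, 594)
)

# p-value: upper-tail probability of F(df1, df2) beyond the observed statistic
tests$p <- pf(tests$Fstat, tests$df1, tests$df2, lower.tail = FALSE)

# Critical value at an assumed significance level of 0.05
alpha <- 0.05
tests$F_crit <- qf(1 - alpha, tests$df1, tests$df2)
tests$significant <- tests$Fstat > tests$F_crit

print(tests)
```

Each row has its own critical value because the degrees of freedom differ, and F > F_crit is equivalent to p < alpha, so the F and p columns cannot disagree with each other.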

Third, and linked to the second question, how should we interpret the significance test overall, given all three of these metrics (Wilks' lambda, F, p-value)? Do they all have to indicate 'significance', or does one suffice?

Best Answer

I think the loadings and the coefficients are getting mixed up. The coefficients are the weights that build each canonical variate (dimension) from the original variables, whereas a loading is the correlation of a variable with that variate. You can see this in the R source code:

    function (X, Y, res) 
    {
        # Center each variable; no rescaling
        X.aux = scale(X, center = TRUE, scale = FALSE)
        Y.aux = scale(Y, center = TRUE, scale = FALSE)
        # After centering, replace missing values with 0
        X.aux[is.na(X.aux)] = 0
        Y.aux[is.na(Y.aux)] = 0
        # Canonical variates: centered data times the canonical coefficients
        xscores = X.aux %*% res$xcoef
        yscores = Y.aux %*% res$ycoef
        # Loadings: correlations of the original variables with the variates
        corr.X.xscores = cor(X, xscores, use = "pairwise")
        corr.Y.xscores = cor(Y, xscores, use = "pairwise")
        corr.X.yscores = cor(X, yscores, use = "pairwise")
        corr.Y.yscores = cor(Y, yscores, use = "pairwise")
        return(list(xscores = xscores, yscores = yscores,
            corr.X.xscores = corr.X.xscores,
            corr.Y.xscores = corr.Y.xscores,
            corr.X.yscores = corr.X.yscores,
            corr.Y.yscores = corr.Y.yscores))
    }

So you can see that corr.X.xscores is the correlation of the original variables (X) with the dimension xscores, which is itself built from the coefficients xcoef. (This is from the comput function, which is called by rcc, which in turn is called by cc in the CCA package.)
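
The same computation can be reproduced without the CCA package. This is a minimal sketch using base R's cancor on simulated data (hypothetical stand-ins, not the UCLA data); it mirrors the steps in the function above rather than calling it, and it skips the missing-value handling because the simulated data have no NAs.

```r
set.seed(42)
n <- 300

# Hypothetical stand-in data (NOT the UCLA data set)
X <- matrix(rnorm(n * 3), n, 3,
            dimnames = list(NULL, c("control", "concept", "motivation")))
Y <- X %*% matrix(runif(12, -1, 1), 3, 4) + matrix(rnorm(n * 4), n, 4)
colnames(Y) <- c("read", "write", "math", "science")

fit <- cancor(X, Y)
k <- length(fit$cor)  # number of canonical dimensions

# Coefficients are weights: variates = centered data %*% coefficients
X.aux <- scale(X, center = TRUE, scale = FALSE)
Y.aux <- scale(Y, center = TRUE, scale = FALSE)
xscores <- X.aux %*% fit$xcoef[, 1:k]
yscores <- Y.aux %*% fit$ycoef[, 1:k]

# Loadings are correlations between the original variables and the variates
corr.X.xscores <- cor(X, xscores)
corr.Y.yscores <- cor(Y, yscores)

print(round(fit$xcoef[, 1:k], 3))   # coefficients (weights)
print(round(corr.X.xscores, 3))     # loadings (correlations, within [-1, 1])

# Paired variates correlate at the canonical correlations
print(all.equal(diag(cor(xscores, yscores)), fit$cor))
```

The coefficient and loading matrices generally differ, because a loading also reflects correlations among the variables within a set, while a coefficient is the variable's weight with the others held fixed.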