Solved – Canonical correlation analysis with continuous and binary data

binary-data canonical-correlation

I came across an interesting article on an application of canonical correlation analysis (CCA). The authors apply classical CCA to a dataset of mixed variables (both the independent and the dependent sets include continuous and binary variables).

To sum up their main contribution: the "usual" correlation matrix (i.e., based on Pearson's correlation), which is designed for continuous measurements, is not appropriate, so they propose a "new" measure to capture the correlations between variables of different data types. This part of the analysis seems clear to me.

I'm struggling to grasp the procedure for calculating canonical variate scores and canonical loadings in the context of mixed data. In the usual setting (i.e., when all variables are continuous), canonical variate scores are found by multiplying the raw data by the canonical weights. Canonical loadings are then found by correlating the raw variables with the variate scores.
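To make the "usual" procedure concrete, here is a minimal NumPy sketch of classical CCA that computes weights, variate scores, and loadings. It is only an illustration of the standard covariance-based approach, not the article's method with its "new" correlation measure. The function name `classical_cca` and the simulated data are made up for this example. Note that the data are centered before being multiplied by the weights; if the weights come from a correlation matrix instead, they would apply to standardized variables.

```python
import numpy as np

def classical_cca(X, Y):
    """Classical CCA from sample covariances (hypothetical helper for illustration)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    # Center each variable; the weights below apply to centered data.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten both sets with Cholesky factors: Sxx = Lx Lx^T, Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    # M = Lx^{-1} Sxy Ly^{-T}; its singular values are the canonical correlations.
    M = np.linalg.solve(Ly, np.linalg.solve(Lx, Sxy).T).T
    U, rho, Vt = np.linalg.svd(M, full_matrices=False)
    # Canonical weights expressed in the original (centered) variables.
    A = np.linalg.solve(Lx.T, U)
    B = np.linalg.solve(Ly.T, Vt.T)
    # Variate scores: centered data times canonical weights.
    U_scores = Xc @ A
    V_scores = Yc @ B
    # Loadings: correlation of each original variable with each variate of its own set.
    p, q = X.shape[1], Y.shape[1]
    load_x = np.corrcoef(Xc, U_scores, rowvar=False)[:p, p:]
    load_y = np.corrcoef(Yc, V_scores, rowvar=False)[:q, q:]
    return rho, A, B, U_scores, V_scores, load_x, load_y

# Simulated mixed data (an assumption for the demo): continuous and 0/1 columns.
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
X = np.column_stack([z + rng.normal(size=n),
                     (z + rng.normal(size=n) > 0).astype(float)])
Y = np.column_stack([z + rng.normal(size=n),
                     rng.normal(size=n),
                     (rng.random(n) < 0.5).astype(float)])
rho, A, B, U_scores, V_scores, load_x, load_y = classical_cca(X, Y)
```

With this construction each column of scores has unit sample variance, and the correlation between the i-th pair of variates equals `rho[i]`.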

I would like to know how to derive the variate scores in this particular setting. Is it appropriate to simply multiply the canonical weights by the corresponding raw data matrix to obtain the variate scores?

Thanks in advance for any suggestions.

Best Answer

Usually categorical variables are transformed into dummy variables. In the case of binary variables it is even easier: 0 for one category and 1 for the other.
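As a small illustration of that coding step, here is a sketch in NumPy. The helper `dummy_code` is hypothetical, and dropping the first level as the reference category is an assumption made here: keeping a column for every level would make the columns sum to one, which leaves the covariance matrix singular.

```python
import numpy as np

def dummy_code(labels):
    """Return the sorted levels and 0/1 dummy columns (first level is the reference)."""
    labels = np.asarray(labels)
    levels = np.unique(labels)  # sorted distinct categories
    # One 0/1 column per level except the first (reference) level.
    dummies = (labels[:, None] == levels[None, 1:]).astype(float)
    return levels, dummies

# Three categories give two dummy columns.
levels3, D3 = dummy_code(["a", "b", "c", "a"])

# A binary variable reduces to a single 0/1 column.
levels2, D2 = dummy_code(["no", "yes", "yes"])
```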

Computing correlations with these numerically coded binary variables is fine. See this other question and its answers.
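One way to see this: the Pearson correlation between a 0/1 variable and a continuous variable is the same number as the point-biserial correlation. A quick numerical check, using simulated data that is purely an assumption for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
b = (rng.random(n) < 0.3).astype(float)   # binary variable coded 0/1
y = 2.0 * b + rng.normal(size=n)          # continuous variable related to b

# Ordinary Pearson correlation on the 0/1 coding.
r_pearson = np.corrcoef(b, y)[0, 1]

# Point-biserial formula: (M1 - M0) / s_y * sqrt(p * q),
# with s_y the population standard deviation (ddof=0).
p = b.mean()
q = 1.0 - p
m1 = y[b == 1].mean()
m0 = y[b == 0].mean()
r_pb = (m1 - m0) / y.std(ddof=0) * np.sqrt(p * q)
```

The two values agree up to floating-point error, so nothing special is needed to correlate a 0/1 column with a continuous one.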