Correlation – Finding Correlations Between 2 Multivariate Data Sets

correlation, discriminant analysis, factor analysis, multivariate analysis

I have two data sets, and both are multivariate.

The first data set has the format below. The first column is the country of origin of a group of test subjects (only two countries, so it is a binary label).

Column 2 is the ID of an audio file, and the other variables (V1:V30) are the averaged responses of the test subjects on which emotions they heard in the audio file (V1=angry, V2=sad,…). Because these are averages, V1:V30 sum to 1:

country fileID V1 V2 V3 V30
0 0001.mp3 0.1 0.5 0.0 0.01
0 0002.mp3 0.3 0.6 0.0 0.00
0 2519.mp3 0.3 0.6 0.0 0.00
1 0001.mp3 0.9 0.00 0.0 0.01
1 0002.mp3 0.1 0.7 0.0 0.00
1 2519.mp3 0.3 0.6 0.0 0.00

The second data set has exactly the same first two columns as the first, but different variables on a different scale (ratings from 1-9):

country fileID V31 V32 V33 V53
0 0001.mp3 5.6 4.7 3.3 7.8
0 0002.mp3 4.3 3.5 6.2 4.2
0 2519.mp3 3.5 5.2 4.4 6.8
1 0001.mp3 4.5 7.2 6.7 4.3
1 0002.mp3 5.8 4.1 3.8 8.2
1 2519.mp3 6.6 4.4 3.3 2.2

The goal of the analysis is to find out whether the second data set (variables on a 1-9 scale) can be used to predict the emotion in the first data set, via some kind of correlation between the variables in the two data sets. This is for an introductory multivariate course, and the professor has only introduced factor analysis, PCA, and linear discriminant analysis as methods of analysis. No logistic regression yet.

I am having a lot of trouble using the methods that have been introduced. Shapiro-Wilk tests for normality do not run in R because the data set is too large. Correlation matrices show nothing above .02, and factor analysis only returns p-values of 0.

I would really appreciate some guidance on how to approach and conduct this analysis.

Best Answer

As the first comment suggests, canonical correlation analysis can tell you whether the two sets of covariates (V1-V30) and (V31-V53) are correlated. Remember to drop one covariate from (V1-V30) first: the 30 covariates sum to 1, so otherwise that set will not have full rank. It may be sufficient to perform the general test for the presence of any canonical correlation given in the wiki link, and to report the test statistic and p-value as evidence that the two sets of variables are correlated.
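To make the canonical correlation step concrete, here is a minimal sketch in Python with NumPy. It is only an illustration under stated assumptions: the data are synthetic stand-ins (not your data), the rows of the two blocks are assumed to be already matched on country and fileID, and the function names `canonical_correlations` and `bartlett_statistic` are my own. The test statistic is Bartlett's chi-square approximation for the hypothesis that all canonical correlations are zero, which I am assuming is the general test referred to above.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two blocks whose rows are paired.

    Assumes both centered blocks have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the two column spaces.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx' Qy are the canonical correlations (descending).
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def bartlett_statistic(r, n, p, q):
    """Bartlett's chi-square approximation for H0: all canonical correlations are 0."""
    stat = -(n - 1 - (p + q + 1) / 2) * np.sum(np.log(1 - r**2))
    return stat, p * q  # statistic and its degrees of freedom

# Synthetic stand-in data: 5 rating-like columns and a 4-part composition.
rng = np.random.default_rng(0)
n = 500
ratings = rng.uniform(1, 9, size=(n, 5))            # plays the role of V31..V53
latent = 0.3 * ratings[:, :4] + rng.normal(size=(n, 4))
emotions = np.exp(latent) / np.exp(latent).sum(axis=1, keepdims=True)  # rows sum to 1

X = emotions[:, :-1]  # drop one composition column so the block has full rank
r = canonical_correlations(X, ratings)
stat, df = bartlett_statistic(r, n, X.shape[1], ratings.shape[1])
print("canonical correlations:", np.round(r, 3))
print("Bartlett statistic:", round(stat, 1), "on", df, "degrees of freedom")
```

The statistic would then be compared with a chi-square distribution on `df` degrees of freedom to get a p-value. With all composition columns left in, the centered block is rank deficient, which is why one column is dropped before the analysis.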

Next, I recommend performing a PCA on the variables (V31-V53), remembering to center and scale them first. Pick a number of principal components that explains enough of the variation in the original covariates, then draw a scatter plot for each pair of these components with points labeled by country, to assess visually whether the data may be linearly separable along any of them. If so, you can use linear discriminant analysis with the principal components as covariates. It is also worth noting that factor analysis (the normal linear factor model) is closely related to principal component analysis: the number of latent variables to include in the factor model will usually be similar to the number of principal components suggested by diminishing returns in explained variation.
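The PCA-then-LDA idea can be sketched as follows, again in Python with NumPy. This is a rough illustration under assumptions, not a recipe for your data: the ratings are simulated from two made-up latent factors, the country effect is invented, and `pca_scores` and `fisher_direction` are hypothetical helper names. The discriminant here is the two-class Fisher direction, and the reported accuracy is in-sample, so it is optimistic.

```python
import numpy as np

def pca_scores(X, n_components):
    """Center and scale X, then return PC scores and explained-variance ratios."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Z @ Vt[:n_components].T, explained

def fisher_direction(scores, labels):
    """Two-class Fisher LDA direction: inverse pooled covariance times mean difference."""
    a, b = scores[labels == 0], scores[labels == 1]
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))
    pooled = ((len(a) - 1) * cov_a + (len(b) - 1) * cov_b) / (len(a) + len(b) - 2)
    return np.linalg.solve(pooled, b.mean(axis=0) - a.mean(axis=0))

# Synthetic stand-in data: 6 rating columns driven by 2 latent factors,
# with country 1 shifted along the first factor (an invented effect).
rng = np.random.default_rng(1)
n = 400
country = np.repeat([0, 1], n // 2)
factors = rng.normal(size=(n, 2))
factors[country == 1, 0] += 1.5
loadings = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
ratings = np.clip(5 + factors @ loadings + 0.5 * rng.normal(size=(n, 6)), 1, 9)

scores, explained = pca_scores(ratings, n_components=2)
print("explained variance ratios:", np.round(explained, 3))

# Project the PC scores on the discriminant direction and cut at the midpoint
# between the two projected class means.
w = fisher_direction(scores, country)
proj = scores @ w
cut = (proj[country == 0].mean() + proj[country == 1].mean()) / 2
predicted = (proj > cut).astype(int)
accuracy = (predicted == country).mean()
print("in-sample accuracy:", round(accuracy, 3))
```

The explained-variance ratios are what you would inspect for diminishing returns when choosing the number of components, and the columns of `scores` are what you would plot pairwise, colored by country, before deciding whether a linear discriminant is reasonable.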