Correlation – Clustering Variables Based on Correlation Matrix

Questions:

I have a large correlation matrix. Instead of clustering individual correlations, I want to cluster variables based on their correlations to each other, i.e. if variable A and variable B have similar correlations to variables C to Z, then A and B should be part of the same cluster. A good real-life example of this is different asset classes: intra-asset-class correlations are higher than inter-asset-class correlations.
I am also considering clustering variables in terms of the strength of the relationship between them, e.g. when the correlation between variables A and B is close to 0, they act more or less independently. If some underlying conditions suddenly change and a strong correlation arises (positive or negative), we can think of these two variables as belonging to the same cluster. So instead of looking for positive correlation, one would look for relationship versus no relationship. I guess an analogy could be a cluster of positively and negatively charged particles. If the charge falls to 0, the particle drifts away from the cluster. However, both positive and negative charges attract particles to the relevant clusters.

I apologise if some of this isn't very clear. Please let me know and I will clarify specific details.

factanal(na.omit(x), 5, rotation = "Promax")

Uniquenesses:
   A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3    E4    E5    N1
0.848 0.630 0.642 0.829 0.442 0.566 0.635 0.572 0.504 0.603 0.541 0.457 0.541 0.420 0.549 0.272
   N2    N3    N4    N5    O1    O2    O3    O4    O5
0.321 0.526 0.514 0.675 0.625 0.804 0.544 0.630 0.814

Loadings (columns are Factor1 to Factor5; small loadings are blanked in the printout, so a row may list fewer than five values, and the column position of each value is not preserved here):
A1  0.242 -0.154 -0.253 -0.164
A2  0.570
A3 -0.100  0.522  0.114
A4  0.137  0.351 -0.158
A5 -0.145  0.691
C1  0.630  0.184
C2  0.131  0.120  0.603
C3  0.154  0.638
C4  0.167 -0.656
C5  0.149 -0.571  0.125
E1  0.618  0.125 -0.210 -0.120
E2  0.665 -0.204
E3 -0.404  0.332  0.289
E4 -0.506  0.555 -0.155
E5  0.175 -0.525  0.234  0.228
N1  0.879 -0.150
N2  0.875 -0.152
N3  0.658
N4  0.406  0.342 -0.148  0.196
N5  0.471  0.253  0.140 -0.101
O1 -0.108  0.595
O2 -0.145  0.421  0.125  0.199
O3 -0.204  0.605
O4  0.244  0.548
O5  0.139  0.177 -0.441

               Factor1 Factor2 Factor3 Factor4 Factor5
SS loadings      2.610   2.138   2.075   1.899   1.570
Proportion Var   0.104   0.086   0.083   0.076   0.063
Cumulative Var   0.104   0.190   0.273   0.349   0.412

Test of the hypothesis that 5 factors are sufficient.
The chi square statistic is 767.57 on 185 degrees of freedom.
The p-value is 5.93e-72

Best Answer

Here's a simple example in R using the bfi dataset: bfi is a dataset of 25 personality test items organised around 5 factors.

library(psych)
data(bfi)
x <- bfi

A hierarchical cluster analysis using the Euclidean distance between variables, based on the absolute correlations between variables, can be obtained like so:

plot(hclust(dist(abs(cor(na.omit(x))))))

The dendrogram shows how items generally cluster with other items according to theorised groupings (e.g., N (Neuroticism) items group together). It also shows how some items within clusters are more similar (e.g., C5 and C1 might be more similar than C5 with C3). It also suggests that the N cluster is less similar to other clusters.
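To get explicit cluster memberships rather than only a dendrogram, the tree can be cut with cutree(). The following is a minimal sketch on simulated data; the variable names (a1 to b3), the simulated structure, and the choice of k = 2 are assumptions for illustration, not part of the original example. It also shows a common alternative, using 1 - |r| directly as the dissimilarity, next to the approach above of taking Euclidean distances between rows of the absolute correlation matrix.

```r
# Simulated data: two blocks of variables, each driven by its own latent factor.
# All names and parameters here are hypothetical, chosen only for illustration.
set.seed(1)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
sim <- data.frame(
  a1 =  f1 + rnorm(n, sd = 0.6),
  a2 =  f1 + rnorm(n, sd = 0.6),
  a3 = -f1 + rnorm(n, sd = 0.6),  # negatively related to a1 and a2
  b1 =  f2 + rnorm(n, sd = 0.6),
  b2 =  f2 + rnorm(n, sd = 0.6),
  b3 = -f2 + rnorm(n, sd = 0.6)   # negatively related to b1 and b2
)

# Absolute correlations: strength of relationship, sign ignored
r <- abs(cor(sim))

# Approach used above: Euclidean distance between rows of |r|,
# so variables with similar correlation profiles end up close together
hc_profile <- hclust(dist(r))

# Alternative: treat 1 - |r| itself as the dissimilarity between two variables
hc_direct <- hclust(as.dist(1 - r))

# Cut each tree into two clusters and show the membership of every variable
groups_profile <- cutree(hc_profile, k = 2)
groups_direct  <- cutree(hc_direct,  k = 2)
print(groups_profile)
print(groups_direct)
```

Because the sign is dropped before clustering, a3 and b3 land in the same cluster as the variables they are negatively correlated with, which matches the "relationship versus no relationship" idea in the question.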

Alternatively, you could run a standard factor analysis; the factanal() call and its output are the listing shown before this answer.
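If a hard assignment of variables to groups is wanted from a factor analysis, one simple heuristic is to assign each variable to the factor on which its absolute loading is largest. The sketch below illustrates this on simulated data. The data, the variable names, and the two-factor choice are assumptions for illustration, and it uses base R's "promax" rotation rather than the "Promax" rotation from psych, so that it runs without extra packages.

```r
# Simulated data: two blocks of three variables, one latent factor per block.
# All names and parameters here are hypothetical, chosen only for illustration.
set.seed(2)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
sim <- data.frame(
  a1 =  f1 + rnorm(n, sd = 0.6),
  a2 =  f1 + rnorm(n, sd = 0.6),
  a3 = -f1 + rnorm(n, sd = 0.6),  # loads negatively on the first latent factor
  b1 =  f2 + rnorm(n, sd = 0.6),
  b2 =  f2 + rnorm(n, sd = 0.6),
  b3 = -f2 + rnorm(n, sd = 0.6)   # loads negatively on the second latent factor
)

# Maximum-likelihood factor analysis with an oblique (promax) rotation
fa <- factanal(sim, factors = 2, rotation = "promax")

# Convert the loadings object to a plain numeric matrix (variables x factors)
L <- unclass(loadings(fa))

# Heuristic: assign each variable to the factor with the largest |loading|
assigned <- apply(abs(L), 1, which.max)
print(assigned)
```

Unlike hierarchical clustering, this lets a variable show sizeable loadings on more than one factor before the hard assignment is made, so it is worth inspecting the full loading matrix and not only the assigned labels.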
