Solved – Reducing variables and correlated variables in a dataset

clustering, model selection, pca

I am attempting to produce models on a dataset with approximately 60 variables and 800 subjects. Other datasets are also being considered for use.
The datasets were not originally constructed for this type of analysis.
The dataset also comprises a mixture of variable types: some are continuous and some discrete. Some of the variables are correlated with one another, as can be seen from a correlation matrix and also just by eye/common sense.

Currently, variable clustering is being used to reduce the number of variables and to avoid feeding correlated variables into the models, which could cause over-fitting: the most representative variable from each cluster is chosen. However, I am unclear whether this is being done correctly and am not confident it will not result in incorrect models.

Previously, what has been done is simply to load all the variables into the platform and produce a cluster summary. The most representative member variable is picked out of each cluster and models are made. In many clusters there is a variable that shows a low R-squared with the cluster it is in, yet it has been included. That variable also shows no correlation with the other group members when checked against the correlation matrix. It is almost as if it has been forced in where it fits best rather than left on its own (sometimes a cluster does contain only one variable). Also, sometimes two separate clusters appear to contain variables that are correlated with each other.

Thus I feel that, at times, correlated variables are being used in the models while uncorrelated variables are being missed.
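To make the procedure concrete, here is a rough sketch of what I believe the platform is doing when it picks the "most representative" variable: choosing, within each cluster, the variable with the highest R-squared against the cluster's first principal component. This is my assumption, not the platform's documented behaviour, and the function names and cluster layout below are made up for illustration.

```python
import numpy as np

def first_pc_scores(X_cluster):
    """Scores of the subjects on the cluster's first principal component."""
    Z = (X_cluster - X_cluster.mean(axis=0)) / X_cluster.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending order
    return Z @ eigvecs[:, -1]                  # last column = first PC

def most_representative(X, cluster_columns):
    """Column index of the variable with the highest R^2 against the cluster's first PC."""
    if len(cluster_columns) == 1:
        return cluster_columns[0]
    scores = first_pc_scores(X[:, cluster_columns])
    r2 = [np.corrcoef(X[:, j], scores)[0, 1] ** 2 for j in cluster_columns]
    return cluster_columns[int(np.argmax(r2))]
```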

I am very new to this level of statistics and am struggling to wrap my head around eigenvalues and so on. It was implied that this is simply an aspect of PCA, but I have found that this is not really the case.

Information on the web tends to get into some quite complex math on the subject.

Here is my understanding; it may well be off.

I understand that PCA is a transformation of the dataset such that the first principal component explains the most variation in the variables, the second PC is orthogonal to (i.e. uncorrelated with) the first, and there is one PC for each variable. The eigenvalues of all the PCs sum to the number of variables (because the original axes are transformed so that each variable has variance 1 and mean 0). The PC that explains the most variation in the data takes the greatest share of this total, followed by the second, and so on. Each PC is built from variable weighting coefficients, where each variable is given a weighting reflecting its influence on that PC. Eigenvectors are where I fall apart!!
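To check my understanding, here is a small worked example (simulated numbers, not my actual data) showing standardized variables, eigenvalues that sum to the number of variables, and the share of variance each PC takes:

```python
import numpy as np

# Toy stand-in for the real dataset: 800 subjects, 5 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 5))
X[:, 1] += 0.8 * X[:, 0]                      # make two variables correlated

# Standardize each variable to mean 0 and variance 1
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via the eigendecomposition of the correlation matrix
R = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)  # eigh returns ascending order
eigenvalues = eigenvalues[::-1]                # sort descending
eigenvectors = eigenvectors[:, ::-1]

print(eigenvalues.sum())                       # equals the number of variables (5)
print(eigenvalues / eigenvalues.sum())         # share of variance taken by each PC
# Each column of `eigenvectors` holds the weighting coefficients of one PC
```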

Clustering involves looking at all the variables and performing iterative splits. A split involves a PCA on the cluster members; then, quoting a guide:
'the cluster with the largest second eigenvalue is chosen to be split into two new clusters'
Again I fall apart here, because I do not really understand eigenvalues.
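From what I can gather, the splitting rule from the guide could be sketched roughly like this (my own interpretation, with made-up function names, not the software's actual code):

```python
import numpy as np

def second_eigenvalue(X_cluster):
    """Second-largest eigenvalue of the correlation matrix of a cluster's variables.
    A large value means a single PC does not summarize the cluster well,
    so the cluster is a candidate for splitting."""
    if X_cluster.shape[1] < 2:
        return 0.0
    R = np.corrcoef(X_cluster, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
    return eigvals[1]

def pick_cluster_to_split(X, clusters):
    """clusters: dict mapping a cluster label to the column indices of its variables."""
    return max(clusters, key=lambda label: second_eigenvalue(X[:, clusters[label]]))
```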

I think the tools I have been using were suggested as a 'dump and click' solution, which I can't really believe, but I cannot argue the case because of my limited understanding.
Long post, but I'm getting a bit lost!

Best Answer

This should probably go as a comment but I cannot add it there because of low reputation. However, I think it can also partly serve as an answer to the OP's question.

As @NickCox alluded to, you should tackle this piecemeal. Read through the theory, work with a small number of variables, and try to develop an intuitive sense of what is happening. Developing this intuition will take some time, but it will be worth it when you are trying to interpret the results of your principal component analysis.

If you look at the link @NickCox suggested and scroll down to whuber's answer, you will find a good geometric explanation of what eigenvectors and eigenvalues mean; it will give you a visual feel for what you are looking at. I will add to that answer: if you collapse his ellipsoid into a two-dimensional space, i.e. look at only the first two principal components, you get an ellipse with one major and one minor axis. The major axis is the "direction" that explains most of the variance in your data. It is represented by the first eigenvector, and its length corresponds to the first eigenvalue (when you arrange the eigenvalues in descending order). It is also a linear combination of all the variables in your dataset; more specifically, it is the linear combination that explains the most variance in your data.
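If it helps, you can reproduce that geometric picture numerically. Here is a minimal sketch with simulated data (assuming nothing about your actual variables): generate two correlated variables, take the eigendecomposition of their covariance matrix, and the first eigenvector points along the major axis of the point cloud, with the square root of its eigenvalue giving the spread along that axis.

```python
import numpy as np

# Two correlated variables: the point cloud forms an ellipse
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.9 * x + 0.5 * rng.normal(size=500)
data = np.column_stack([x, y])

cov = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("major-axis direction (first eigenvector):", eigenvectors[:, 0])
print("spread along major/minor axes:", np.sqrt(eigenvalues))
```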

Now, going back to your data, you should look into textbooks on multivariate methods for ways of analyzing the covariance structure of your dataset. I am afraid I cannot point to anything specific, but I think any intermediate-level book would do. Additionally, I remember reading about applied principal component analysis in the book Statistical Methods in the Atmospheric Sciences by Daniel Wilks. If you get hold of this book, you will find that it gives a good overview of the methodology in Chapter 11 (atmospheric scientists refer to PCA as Empirical Orthogonal Functions, but it is the same thing). The author works with multivariate data, so it is structurally similar to what you have.

Lastly, in addition to clustering and PCA, you could also look into Canonical Correlation Analysis (CCA). It is also used to reduce the dimensionality of a dataset, but it looks at the relationships between two sets of variables. I cannot say whether it would be applicable in your case.
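For completeness, here is a short sketch of what CCA looks like in practice, assuming you can split your variables into two meaningful blocks (the blocks and numbers below are invented for illustration):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical split of the variables into two blocks
rng = np.random.default_rng(2)
X_block = rng.normal(size=(800, 4))    # e.g. one set of measurements
Y_block = rng.normal(size=(800, 3))    # e.g. another set of measurements
Y_block[:, 0] += X_block[:, 0]         # induce a shared signal between the blocks

cca = CCA(n_components=2)
X_scores, Y_scores = cca.fit_transform(X_block, Y_block)

# Correlation between each pair of canonical variates
for i in range(2):
    r = np.corrcoef(X_scores[:, i], Y_scores[:, i])[0, 1]
    print(f"canonical correlation {i + 1}: {r:.2f}")
```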
