I will, as is my custom, take a step back and ask what it is you are trying to do, exactly. Factor analysis is designed to find latent variables. If you want to find latent variables and cluster them, then what you are doing is correct. But you say you simply want to reduce the number of variables - that suggests principal component analysis instead.
However, with either of those, you then have to interpret a cluster analysis run on the new variables, and those new variables are simply weighted sums of the old ones, as the sketch below illustrates.
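Here is a minimal sketch in R (using the built-in USArrests data purely as a stand-in) showing that component scores really are just weighted sums of the standardized original variables:

    # Component scores are weighted sums of the (standardized) originals
    pca <- prcomp(USArrests, scale. = TRUE)
    scores_by_hand <- scale(USArrests) %*% pca$rotation  # weighted sums
    all.equal(unname(scores_by_hand), unname(pca$x))     # TRUE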
How many variables have you got? How correlated are they? If there are far too many, and they are very strongly correlated, then you could look for all correlations above some very high threshold and delete one variable from each such pair. This reduces the number of variables while leaving the retained variables exactly as they are; a rough sketch follows.
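A rough R sketch of that idea (the 0.90 cutoff is arbitrary, and here one member of each pair is dropped deterministically rather than at random, for reproducibility; mtcars is just a stand-in dataset):

    # Drop one variable from each very highly correlated pair
    drop_one_per_pair <- function(X, cutoff = 0.90) {
      r <- abs(cor(X))
      r[upper.tri(r, diag = TRUE)] <- 0            # consider each pair once
      pairs <- which(r > cutoff, arr.ind = TRUE)   # offending pairs
      drop  <- unique(rownames(r)[pairs[, "row"]]) # one variable per pair
      X[, setdiff(colnames(X), drop), drop = FALSE]
    }
    reduced <- drop_one_per_pair(mtcars)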
Let me also echo @StasK about the need to do this at all, and @rolando2 about the usefulness of finding something different from what has been found before. As my favorite professor in grad school used to say, "If you're not surprised, you haven't learned anything."
You are correct: Stata is weird about this. Stata gives different results from SAS, R, and SPSS, and it is difficult (in my opinion) to understand why without delving quite deeply into the world of factor analysis and PCA.
Here's how you know that something weird is happening: the sum of the squared loadings for a component is equal to the eigenvalue for that component.
Rotation changes the individual eigenvalues, but their total stays the same. Now add up the sums of the squared loadings from your output (this is why I asked you, in my comment, to include the output without blanks). With Stata's defaults, the squared loadings for each component sum to 1.00 (within rounding error). With SPSS (and R, SAS, and every other factor analysis program I've looked at), they sum to the eigenvalue for that factor. In your SPSS output, the total sum of squared loadings equals the sum of the eigenvalues (i.e. 3.8723 + 1.40682), both pre- and post-rotation.
In Stata, the sum of the squared loadings for each factor is equal to 1.00, so Stata has rescaled the loadings: it appears to be reporting the unit-length eigenvectors themselves.
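You can see the relationship between the two conventions in R (a minimal sketch; USArrests is just a stand-in dataset):

    # princomp() reports unit-length eigenvectors, like Stata's default;
    # multiplying each by the square root of its eigenvalue gives
    # SPSS/SAS-style loadings.
    pca  <- princomp(USArrests, cor = TRUE)
    vecs <- unclass(pca$loadings)   # unit-norm eigenvectors
    eig  <- pca$sdev^2              # eigenvalues
    colSums(vecs^2)                 # each column sums to 1.00 (Stata-style)
    loadings_spss <- vecs %*% diag(sqrt(eig))
    colSums(loadings_spss^2)        # each column sums to its eigenvalue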
The only mention of this (that I have found) in the Stata documentation is in the estat loadings section of the help, where it says:
    cnorm(unit | eigen | inveigen), an option used with estat loadings,
    selects the normalization of the eigenvectors, the columns of the
    principal-component loading matrix. The following normalizations are
    available
However, this appears to apply only to the unrotated component matrix, not the rotated component matrix. I haven't been able to get the unnormalized rotated matrix after PCA.
The people at Stata seem to know what they are doing, and usually have a good reason for doing things the way they do. This one is beyond me, though.
(For future reference, it would have made my life easier if you'd used a dataset that I could access, and if you'd included all output, without blanks).
Edit: My usual go-to site for information about how to get the same results from different programs is UCLA IDRE. They don't cover PCA in Stata: http://www.ats.ucla.edu/stat/AnnotatedOutput/ I have to wonder if that's because they couldn't get the same result. :)
The skewness issue in PCA is the same as in regression: the long tail, if it is really long relative to the whole range of the distribution, behaves like a big outlier. It pulls the fit line (the principal component, in your case) strongly toward itself, because its influence is enhanced; its influence is enhanced because it is so far from the mean. In the context of PCA, allowing very skewed variables is pretty similar to doing PCA without centering the data (i.e., doing PCA on a cosine matrix rather than a correlation matrix). It is you who decides whether to let the long tail influence the results so greatly (and leave the data as they are) or not (and transform the data). The issue is not connected with how you interpret the loadings.
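A tiny illustration in R (simulated data, so purely a sketch) of why a point far from the mean has such enhanced influence: one point far out in the tail can nearly determine a correlation, and correlation-based PCA is built from exactly those correlations:

    set.seed(1)
    x <- c(rnorm(100), 20)    # one point far out in the tail
    y <- c(rnorm(100), 20)
    cor(x, y)                 # large, driven almost entirely by that point
    cor(x[-101], y[-101])     # near zero without it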
As you like.
KMO is an index that tells you whether the partial correlations are reasonably small for submitting the data to factor analysis; in factor analysis we generally expect a factor to load more than just two variables. Your KMO is rather low. You can improve it if you drop from the analysis the variables with low individual KMO values (these form the diagonal of the anti-image matrix; you can request this matrix in SPSS's Factor procedure). Can transforming the variables to be less skewed raise the KMO? Who knows. Maybe. Note that KMO is important mostly for the factor analysis model, not the principal components model: in FA you fit the pairwise correlations, whereas in PCA you don't.
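If you work in R rather than SPSS, a quick sketch (assuming the psych package is installed) gives both the overall index and the per-variable values, i.e. the diagonal of the anti-image matrix mentioned above:

    library(psych)
    kmo <- KMO(cor(mtcars))  # accepts a correlation matrix or raw data
    kmo$MSA                  # overall KMO
    kmo$MSAi                 # individual (per-variable) KMO values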