I want to conduct a principal component analysis (PCA) in SPSS. One assumption for PCA is that there are no significant outliers. How can I identify outliers in SPSS?
Solved – How to identify outliers and conduct robust PCA
Tags: outliers, pca, robust, spss
Related Solutions
You are correct. Stata is weird about this. Stata gives different results from SAS, R and SPSS, and it is difficult (in my opinion) to understand why without delving quite deep into the world of factor analysis and PCA.
Here's how you know that something weird is happening: the sum of the squared loadings for a component is equal to the eigenvalue for that component.
From pre- to post-rotation, the individual eigenvalues change, but their total doesn't. Add up the squared loadings in your output (this is why I asked you to remove the blanks in my comment). With Stata's default, the squared loadings for each component sum to 1.00 (within rounding error). With SPSS (and R, and SAS, and every other factor analysis program I've looked at) they sum to the eigenvalue for that factor. In your SPSS output, the total sum of squared loadings equals the sum of the eigenvalues (i.e. 3.8723 + 1.40682), both pre- and post-rotation.
In Stata, the sum of the squared loadings for each factor is equal to 1.00, and so Stata has rescaled the loadings.
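To see the two conventions side by side, here is a minimal numpy sketch on simulated data (numpy only, not actual SPSS or Stata output; the data are random, so only the structural identities matter):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 4))

# PCA via the eigendecomposition of the correlation matrix
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]           # sort components largest-first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Stata-style "unit" normalization: the loading matrix is just the
# eigenvectors, so each column of squared loadings sums to 1.00.
stata_ssl = (eigvecs ** 2).sum(axis=0)

# SPSS/SAS/R-style loadings: eigenvectors scaled by sqrt(eigenvalue),
# so each column of squared loadings sums to that component's eigenvalue.
loadings = eigvecs * np.sqrt(eigvals)
spss_ssl = (loadings ** 2).sum(axis=0)

print(np.allclose(stata_ssl, 1.0))      # True
print(np.allclose(spss_ssl, eigvals))   # True
```

In other words, the two programs report the same eigenvectors up to a per-column rescaling by the square root of the eigenvalue.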
The only mention of this (that I have found) in the Stata documentation is in the estat loadings section of the help, where it says:
cnorm(unit | eigen | inveigen), an option used with estat loadings, selects the normalization of the eigenvectors, the columns of the principal-component loading matrix. The following normalizations are available
However, this appears to apply only to the unrotated component matrix, not the rotated component matrix. I can't get the unnormalized rotated matrix after PCA.
The people at Stata seem to know what they are doing, and usually have a good reason for doing things the way that they do. This one is beyond me though.
(For future reference, it would have made my life easier if you'd used a dataset that I could access, and if you'd included all output, without blanks).
Edit: My usual go-to site for information about how to get the same results for different programs is the UCLA IDRE. They don't cover PCA in Stata: http://www.ats.ucla.edu/stat/AnnotatedOutput/ I have to wonder if that's because they couldn't get the same result. :)
It looks to me as though the proposed method at its core uses robust estimates of location and covariance based on the MCD (Minimum Covariance Determinant) algorithm (the link is to the FastMCD variant). This algorithm randomly samples the data hundreds of times, constructs a covariance matrix estimate for each subsample, and then selects the one with the minimum determinant.
From your perspective, the important part is that "randomly samples" bit. It means that the estimated covariance matrix at the core of the pcaCoDa algorithm is non-deterministic, and so the output eigenvectors are too. Given how different the results are from run to run, I'd guess there's some parameter tuning in the calls to the FastMCD algorithm that isn't working well for this problem. Since it doesn't appear that you can alter the parameters passed to FastMCD through any of the arguments to pcaCoDa, you may have to modify the code, or seek another approach altogether.
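To illustrate the underlying issue (this is not pcaCoDa itself, which lives in R), here is a small sketch using scikit-learn's MinCovDet, an implementation of FastMCD whose random_state argument makes the subsampling explicit:

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(42)
# 100 inliers plus a handful of gross outliers shifted far away
X = np.vstack([rng.standard_normal((100, 3)),
               rng.standard_normal((5, 3)) * 0.5 + 10])

# FastMCD subsamples the data, so different seeds can produce
# different robust covariance estimates ...
cov_a = MinCovDet(random_state=0).fit(X).covariance_
cov_b = MinCovDet(random_state=1).fit(X).covariance_
print(np.abs(cov_a - cov_b).max())

# ... and therefore different eigenvectors for any robust PCA
# built on top of the MCD estimate.
evecs_a = np.linalg.eigh(cov_a)[1]
evecs_b = np.linalg.eigh(cov_b)[1]
print(np.abs(np.abs(evecs_a) - np.abs(evecs_b)).max())
```

On a well-separated toy problem like this the seeds often agree; the run-to-run instability the question describes suggests the real data sit in a harder regime for the subsampling.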
Best Answer
Robust PCA is a very active research area, and identifying and removing outliers in a sound way is quite delicate. (I've written two papers in this field, so I do know a bit about it.) While I don't know SPSS, you may be able to implement the relatively simple Algorithm (1) here.
This algorithm (not mine) has rigorous guarantees but requires only some basic computations and a "while" loop. Assuming you are searching for $d$ principal components, the basic procedure is: run PCA, check whether any point's projection onto the top $d$ components is "too large", throw away one such point "at random", and repeat until none remain. Everything in quotation marks is a heuristic; you can find the details in the paper.
The idea behind this procedure is that vectors whose projection after PCA is large may have affected the estimate too much, and so you may want to throw them away. It turns out that choosing the ones to throw away "at random" is actually a reasonable thing to do.
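As a rough illustration of that kind of loop (a hypothetical Python sketch, not the paper's actual Algorithm (1): the "too large" rule here is a z-score cutoff I made up as a stand-in):

```python
import numpy as np

def robust_pca_sketch(X, d, thresh=3.0, rng=None):
    """Iterative outlier-removal PCA, sketched from the description:
    repeat PCA, and while some point projects 'too far' onto the
    top-d components, discard one such point chosen at random.
    The z-score cutoff below is a placeholder for the paper's rule."""
    rng = np.random.default_rng(rng)
    keep = np.arange(len(X))
    while True:
        Xc = X[keep] - X[keep].mean(axis=0)
        # top-d principal directions of the points still kept
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        proj = np.linalg.norm(Xc @ Vt[:d].T, axis=1)
        big = np.where(proj > proj.mean() + thresh * proj.std())[0]
        if len(big) == 0:
            return Vt[:d], keep
        # throw away one offending point uniformly at random
        keep = np.delete(keep, rng.choice(big))

# toy demo: 60 inliers plus one gross outlier at index 60
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((60, 3)), [[50.0, 0.0, 0.0]]])
V, kept = robust_pca_sketch(X, d=2, rng=0)
```

The loop removes at most one point per iteration, so it always terminates; in the toy demo the gross outlier dominates the first principal direction and is discarded early.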
If anyone actually wants to take the time to write the SPSS code for this, I'm sure @cathy would appreciate it.