Solved – principal component analysis with missing data

clustering, multivariate analysis, pca

For a prospective study of parameters affecting students' success in graduate school, I am looking at a population of about 1500 med students. I have performed a cluster analysis (using Gower's universal similarity index and average linkage) showing that students fall into 3 main groups (at ~ 0.2), each with several subgroups (at ~ 0.8). Cluster membership correlates poorly with study success and with any of the other parameters.

Normally, I'd perform a PCA in the hope of uncovering hidden variables that determine clustering. However, about 1/3 of the students fail the course at various time points, so there is a lot of missing data. To make matters worse, student performance on the various exams (e.g., NBME Step1 and Step2) is only weakly correlated (r^2 ~ 0.3), so "filling" the table with calculated (imputed) values would be questionable. Unfortunately, standard PCA reacts poorly to missing data.

I should add that the parameters are a mix of nominal, binary, ordinal and ratio variables.

Any help would be much appreciated.

Engelbert

Best Answer

In a similar situation, but in a different field, we used the correlation matrix shrinkage of Ledoit and Wolf. The idea is to calculate the covariance (or correlation) matrix pairwise, using all available data. If you instead drop every observation in which any of a student's values is missing, there is nothing left of the dataset. So for each pair of items we use the observations where both items are present, rather than the observations that are complete across the entire set. Because each entry is then computed from a different subset of rows, the resulting correlation matrix can end up not being positive semidefinite (PSD), so we apply the shrinkage method mentioned above before plugging the matrix into PCA.
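
The steps can be sketched roughly as follows. This is only a minimal illustration under several assumptions, not our actual code: the function names (`pairwise_corr`, `shrink_to_identity`, `pca_from_corr`) and the toy data are made up for the example, all columns are assumed to be numeric (nominal variables would need to be encoded first), and the shrinkage step is a plain linear shrinkage toward the identity with the smallest weight that makes the matrix positive definite. It is not the Ledoit-Wolf estimate of the shrinkage intensity, which is derived for complete data.

```python
import numpy as np


def pairwise_corr(X, min_overlap=3):
    """Correlation matrix where each entry uses only the rows observed in both columns."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    R = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            if both.sum() < min_overlap:
                raise ValueError(f"columns {i} and {j} share too few rows")
            R[i, j] = R[j, i] = np.corrcoef(X[both, i], X[both, j])[0, 1]
    return R


def shrink_to_identity(R, alpha=None, eps=1e-6):
    """Return (1 - alpha) * R + alpha * I and the alpha that was used.

    If alpha is None, pick the smallest alpha that lifts the smallest
    eigenvalue to eps (an assumption for this sketch, not Ledoit-Wolf's
    data-driven intensity).
    """
    lam_min = np.linalg.eigvalsh(R).min()
    if alpha is None:
        alpha = 0.0 if lam_min >= eps else (eps - lam_min) / (1.0 - lam_min)
    return (1.0 - alpha) * R + alpha * np.eye(R.shape[0]), alpha


def pca_from_corr(R, n_components=2):
    """Eigen-decompose a correlation matrix; return the leading eigenvalues and loadings."""
    evals, evecs = np.linalg.eigh(R)
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], evecs[:, order]


# Toy data: 200 rows, 5 numeric variables, roughly 30% of the cells missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += X[:, 0]                      # make two columns correlated
X[rng.random(X.shape) < 0.3] = np.nan   # knock out cells at random

R = pairwise_corr(X)
R_shrunk, alpha = shrink_to_identity(R)
evals, loadings = pca_from_corr(R_shrunk)
```

Note that this gives you eigenvalues and loadings only. Computing component scores for students with missing values is a separate problem that the shrinkage does not solve.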