Solved – principal component analysis with missing data

clustering, multivariate analysis, pca

For a prospective study of parameters affecting students' success in graduate school, I am looking at a population of about 1500 med students. I have performed a cluster analysis (using Gower's universal similarity index and average linkage) showing that students fall into 3 main groups (at ~ 0.2), each with several subgroups (at ~ 0.8). Clustering correlates poorly with study success and with any of the other parameters.

Normally, I'd perform a PCA in the hope of uncovering hidden variables that determine clustering. However, about 1/3 of the students fail the course at various time-points, so that there are a lot of missing data. To make matters worse, student performance at various exams (e.g., NBME Step1 and Step2) is only weakly correlated (r^2 ~ 0.3), so "filling" the table with calculated values would be questionable. Unfortunately, standard PCA reacts "poorly" to missing data.

I should add that the parameters are a mix of nominal, binary, ordinal and ratio-scale variables.

Any help would be much appreciated.

Engelbert

Best Answer

In a similar situation, but in a different field, we used the correlation matrix shrinkage method of Ledoit and Wolf. The idea is to calculate the pairwise correlation matrix using all available data. If you instead drop every observation (student) that has any missing value, there is almost nothing left of the dataset. So, for each pair of variables, we use the observations where both are present, rather than requiring an observation to be complete across the entire set. Since a correlation matrix built this way can end up not being positive semi-definite (PSD), we apply the shrinkage method to it before plugging it into PCA.
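A minimal sketch of that workflow in base R, under several assumptions: the data are simulated and purely numeric (the mixed variable types in the question would need a different correlation measure), and the shrinkage intensity `lambda = 0.2` is picked by hand. Ledoit and Wolf estimate that intensity from the data, so this shows only the general shape of the approach, not their estimator. The helper `shrink_cor` is a hypothetical name.

```r
set.seed(1)
n <- 200
p <- 5

# Hypothetical data: a shared factor makes the columns correlate
f <- rnorm(n)
X <- sapply(1:p, function(j) f + rnorm(n))

# Knock out about 30% of the entries at random (illustration only)
X[sample(length(X), 0.3 * length(X))] <- NA

# Pairwise-complete correlations: each entry uses the rows where
# both variables are observed, so the result may not be PSD
R_pair <- cor(X, use = "pairwise.complete.obs")

# Linear shrinkage toward the identity matrix; lambda is an assumed,
# hand-picked intensity, not the Ledoit-Wolf estimate
shrink_cor <- function(R, lambda) {
  (1 - lambda) * R + lambda * diag(nrow(R))
}
R_shrunk <- shrink_cor(R_pair, lambda = 0.2)

# PCA as an eigendecomposition of the shrunk correlation matrix
eig <- eigen(R_shrunk, symmetric = TRUE)
explained <- eig$values / sum(eig$values)  # share of variance per component
loadings <- eig$vectors                    # one column per component
```

Shrinking toward the identity raises every eigenvalue to `(1 - lambda) * value + lambda`, which is what pulls slightly negative eigenvalues back above zero.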

Related Solutions

Solved – Nonlinear principal component analysis MATLAB code

See Matthias Scholz's site for the MATLAB code (the homepage includes references).

Solved – Principal component analysis with group data

In general, I see no reason why you couldn't do a PCA to visualize and interpret your multivariate dataset (however, since you didn't provide data, I cannot say for sure). As for your second question, I would keep the two groups (drought, control) and not subtract them from each other. That way you will be able to see whether the component scores (illustrated as points in the plot) cluster, and how the component loadings (illustrated as vectors in the plot) relate to them.

Here is an example to illustrate what I mean (this also covers your third question):

Generate an example dataset (based on your description):

set.seed(13)
d <- data.frame(
  treatment = rep(c("drought", "control"), each = 15),
  y1 = rnorm(30, 3, 1),
  y2 = rnorm(30, 6, 3),
  y3 = rnorm(30, 4, 2),
  y4 = rnorm(30, 9, 4),
  y5 = rnorm(30, 5, 2),
  y6 = rnorm(30, 12, 5)
)

The following steps can be done in many different ways with different R packages, some of them perhaps better or more efficient. But here is what usually works for me:

PCA using FactoMineR:

library(FactoMineR)
# Columns 2 to 7 hold y1..y6; scale.unit = TRUE standardizes each variable
my.pca <- PCA(d[, 2:7], scale.unit = TRUE, graph = FALSE)

# EXTRACTING VALUES FROM my.pca FOR PLOT BELOW
PC1.ind <- my.pca$ind$coord[,1]
PC2.ind <- my.pca$ind$coord[,2]
PC1.var <- my.pca$var$coord[,1]
PC2.var <- my.pca$var$coord[,2]
PC1.expl <- round(my.pca$eig[1,2],2)
PC2.expl <- round(my.pca$eig[2,2],2)
Treatment <- factor(d$treatment,levels=c('drought', 'control'))
labs.var<- rownames(my.pca$var$coord)

Build the plot with the ggplot2 package:

library(ggplot2)
library(grid)  # provides arrow() and unit() used below
ggplot() +
  geom_point(aes(x = PC1.ind, y = PC2.ind, fill = Treatment), colour='black', pch = 21,size = 2.2) + 
  scale_fill_manual(values = c("red", "blue")) +   
  coord_fixed(ratio = 1) +
  geom_segment(aes(x = 0, y = 0, xend = PC1.var*2.8, yend = PC2.var*2.8), arrow = arrow(length = unit(1/2, 'picas')), color = "grey30") + 
  geom_text(aes(x = PC1.var*3.2, y = PC2.var*3.2),label = labs.var, size = 3) + 
  xlab(paste('PC1 (',PC1.expl,'%',')', sep ='')) + 
  ylab(paste('PC2 (',PC2.expl,'%',')', sep ='')) +
  ggtitle("Some title")

The vectors represent the component loadings, which are the correlations of the principal components with the original variables. The strength of the correlation is indicated by the vector length, and the direction indicates which accessions (the points) have high values for the original variables.
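To make the "loadings are correlations" statement concrete, here is a small check in base R. It uses `prcomp` rather than FactoMineR (my substitution, to keep it dependency-free) and a simulated three-variable dataset. For standardized variables, each eigenvector scaled by the component's standard deviation should equal the correlation between the variables and the component scores.

```r
set.seed(13)

# Hypothetical data in the spirit of the example above
X <- cbind(
  y1 = rnorm(30, 3, 1),
  y2 = rnorm(30, 6, 3),
  y3 = rnorm(30, 4, 2)
)

# PCA on standardized variables (analogous to scale.unit = TRUE)
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Loadings computed two ways:
# 1) eigenvectors scaled by the standard deviation of each component
loadings_from_eigen <- sweep(pc$rotation, 2, pc$sdev, `*`)
# 2) plain correlations between original variables and component scores
loadings_from_cor <- cor(X, pc$x)

# The two should agree up to numerical precision
agree <- isTRUE(all.equal(loadings_from_eigen, loadings_from_cor,
                          check.attributes = FALSE))
```

Because these are correlations, no arrow can be longer than 1 before any rescaling applied for plotting.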

Also I would suggest having a look here for more information on how to interpret PCAs in general (if that's needed at all).

Also, since you have predetermined groups (drought and control), you might have a look at linear discriminant analysis (LDA). Both PCA and LDA are rotation-based techniques. While PCA tries to maximize the total variance explained in the dataset, LDA maximizes the separation (or discriminates) between groups. For more information you could have a look at the candisc function in the candisc package, or the lda() function in the MASS package, for example (both in R).
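A minimal sketch of the LDA route with `MASS::lda()`, reusing the simulated data layout from above (the data are random, so no real group separation should be expected). With two groups there is only one discriminant axis, LD1.

```r
library(MASS)

set.seed(13)
# Same simulated layout as the PCA example; treatment is made a factor here
d <- data.frame(
  treatment = factor(rep(c("drought", "control"), each = 15)),
  y1 = rnorm(30, 3, 1),
  y2 = rnorm(30, 6, 3),
  y3 = rnorm(30, 4, 2),
  y4 = rnorm(30, 9, 4),
  y5 = rnorm(30, 5, 2),
  y6 = rnorm(30, 12, 5)
)

# Fit LDA with treatment as the grouping variable and y1..y6 as predictors
fit <- lda(treatment ~ ., data = d)

# Discriminant scores (pred$x) and predicted classes (pred$class)
pred <- predict(fit)

# Cross-tabulate observed against predicted group membership
confusion <- table(observed = d$treatment, predicted = pred$class)
```

The scores in `pred$x` can be plotted by treatment (for example as a boxplot or density plot) to see how well the single axis separates the groups.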