PCA: 91% of explained variance on one principal component

Tags: pca, r

I am new to PCA and wanted to experiment a bit with my data set (using R) just to see what it looks like. I cannot share the data here because it is confidential. However, if some other statistic or visualization would help you answer my questions, please let me know and I will provide it.

I found the following information about the explained variance:

Component Prop.Var
1         0.911804348
2         0.033618098
3         0.020827269
4         0.011772988
5         0.006611746
6         0.005372772
7         0.004464788
8         0.003436401
9         0.002091589
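As a quick check on these numbers, the cumulative proportion of variance can be computed directly from the table. This is only a sketch: the values are copied from the table above, and the 95% threshold is an arbitrary illustration, not a recommended cut-off.

```r
# Proportions of variance copied from the table above.
prop_var <- c(0.911804348, 0.033618098, 0.020827269, 0.011772988,
              0.006611746, 0.005372772, 0.004464788, 0.003436401,
              0.002091589)

# Running total: how much variance the first k components retain.
cum_var <- cumsum(prop_var)
print(round(cum_var, 4))

# Smallest number of components that reaches a chosen threshold.
# The 0.95 value is an assumption made for illustration only.
k95 <- which(cum_var >= 0.95)[1]
print(k95)
```

With these figures, the first two components together retain about 94.5% of the variance, so a 95% threshold would need three components.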

This raises the following questions:

  1. Am I justified in removing the other 8 principal components?
  2. How do I interpret 91% of explained variance on one component?
  3. If I only kept one component what would be the best way to visualize the data?

Below is a plot of the first two principal components. A spread like this is not surprising given how little of the variance lies on the second component.

[Figure: Principal Components 1 and 2]

As I mentioned, I am new to PCA, so I really do not know whether this kind of dimensionality reduction yields any useful information here. Any insight would be appreciated.

Best Answer

I am (very) new to this, but I'll do my best to help. Here are my answers to your questions:

Am I justified in removing the other 8 principal components?

I do not think you are "justified". However, if you want a first, coarse assessment of the data, you can concentrate on the first PC; just bear in mind that you are neglecting roughly 9% of the total variability. This raises several further questions: were the variables expected to be so strongly correlated? Could you simulate or explain the remaining 9% of variability simply as measurement error?
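One way to look at what is being neglected is to rebuild the data from the first PC alone and inspect the residual. The sketch below uses synthetic data, since the real data are confidential: the single shared factor, the noise level of 0.3, and all variable names are assumptions chosen only to give a PC1 share in the same ballpark.

```r
set.seed(1)
n <- 200
latent <- rnorm(n)  # one shared underlying factor (assumed)

# Nine hypothetical variables: the shared factor plus independent noise.
X <- sapply(1:9, function(j) latent + rnorm(n, sd = 0.3))

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Rebuild the centred data from PC1 alone (rank-1 approximation).
Xc <- scale(X, center = TRUE, scale = FALSE)
X1 <- pca$x[, 1, drop = FALSE] %*% t(pca$rotation[, 1, drop = FALSE])

# Share of total variance kept by PC1 and share left in the residual.
pc1_share   <- pca$sdev[1]^2 / sum(pca$sdev^2)
resid_share <- sum((Xc - X1)^2) / sum(Xc^2)
print(c(pc1 = pc1_share, residual = resid_share))

# Residual standard deviation per variable, to compare against whatever
# measurement error you would expect for each variable.
resid_sd <- apply(Xc - X1, 2, sd)
print(round(resid_sd, 3))
```

If the per-variable residual spread is about the size of the known measurement error, treating the remaining components as noise is easier to defend; if it is clearly larger, they may carry real structure.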

How do I interpret 91% of explained variance on one component?

You interpret it as a very high degree of correlation among the many variables you included, or between at least two variables while the others show a much smaller dispersion. When you express the first PC in terms of the original measurements, how many of them contribute significantly?
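Both possibilities can be checked by looking at the correlations, the PC1 loadings, and the spread of each variable. This is a minimal sketch on synthetic data; the variable names `v1` to `v9` and the data-generating model are hypothetical stand-ins for the real data set.

```r
set.seed(2)
n <- 200
latent <- rnorm(n)  # one shared underlying factor (assumed)
X <- sapply(1:9, function(j) latent + rnorm(n, sd = 0.3))
colnames(X) <- paste0("v", 1:9)  # hypothetical variable names

# Pairwise correlations: uniformly high values point to one shared factor.
cors <- cor(X)
print(round(range(cors[upper.tri(cors)]), 2))

# Loadings of PC1: the weight of each original variable in the component.
pca <- prcomp(X, center = TRUE, scale. = FALSE)
loadings1 <- pca$rotation[, 1]
print(round(loadings1, 2))

# Per-variable standard deviations: a variable with a much larger spread
# can dominate PC1 on its own when the data are not standardised.
print(round(apply(X, 2, sd), 2))
```

If one or two variables have a far larger standard deviation than the rest, it may be worth rerunning `prcomp` with `scale. = TRUE` and seeing whether the 91% figure survives.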

If I only kept one component what would be the best way to visualize the data?

If you kept only one component, your final description of the data would be one-dimensional, so a single axis would do the job. I am repeating myself, and please do not take this as patronizing, but I would first try to understand whether the PC you calculated makes sense given the data.
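A one-dimensional description can be drawn as a histogram or as points along a single axis. The sketch below is one possible way to do it in base R, again on synthetic data; `plot_pc1` is a hypothetical helper name, not part of any package.

```r
set.seed(3)
n <- 200
latent <- rnorm(n)  # one shared underlying factor (assumed)
X <- sapply(1:9, function(j) latent + rnorm(n, sd = 0.3))

pca <- prcomp(X, center = TRUE, scale. = FALSE)
scores1 <- pca$x[, 1]  # one PC1 score per observation

# Hypothetical helper: histogram on top, jittered strip chart below.
plot_pc1 <- function(scores) {
  op <- par(mfrow = c(2, 1))
  on.exit(par(op))  # restore the previous layout afterwards
  hist(scores, breaks = 30, xlab = "PC1",
       main = "Distribution of PC1 scores")
  stripchart(scores, method = "jitter", pch = 16, xlab = "PC1",
             main = "PC1 scores on a single axis")
  invisible(NULL)
}
```

Calling `plot_pc1(scores1)` in an interactive session draws both panels. The histogram shows the overall shape (clusters, skew, outliers), while the strip chart keeps individual observations visible.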
