Solved – How to handle data imbalance in Principal Component Analysis

pca, unbalanced-classes

PCA reduces the dimension of a data set while trying to retain most of the variation in the data.

PCA can be used as a dimension-reduction technique before discrimination; however, for discrimination one wants to retain the most discriminating power rather than the most variation, since more variation does not necessarily mean the separation between the groups is greatest. Chang (1983), in his paper "On using principal components before separating a mixture of two multivariate normal distributions", therefore suggests using the Mahalanobis distance in some way to measure this. His paper can be found here: http://www.jstor.org/discover/10.2307/2347949?uid=3738032&uid=2129&uid=2&uid=70&uid=4&sid=21100901419681
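As a rough illustration of the distance mentioned above (this is not code from Chang's paper; the data, group sizes, and means are invented), here is a minimal sketch of the squared Mahalanobis distance between two group means under a pooled within-group covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical two-group data: 3 features, group means differing by 2 per feature
a = rng.normal(0.0, 1.0, size=(200, 3))
b = rng.normal(2.0, 1.0, size=(200, 3))

# pooled within-group covariance matrix
cov = ((len(a) - 1) * np.cov(a, rowvar=False)
       + (len(b) - 1) * np.cov(b, rowvar=False)) / (len(a) + len(b) - 2)

# squared Mahalanobis distance between the group means
diff = a.mean(axis=0) - b.mean(axis=0)
d2 = diff @ np.linalg.solve(cov, diff)
print(d2 ** 0.5)  # Mahalanobis distance between the two group centers
```

A large distance on this scale means the groups are well separated relative to their spread, which is the kind of criterion one could use to pick components for discrimination.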

Does the method work well when the group sizes are very different (consider two groups only), say with a sample-size ratio of 1:99? If it does not work well, how can the problem be handled?

I have read literature about rebalancing the data set for PCA, but there is no theoretical or mathematical explanation of how rebalancing the data can improve the performance of PCA. Is there any literature covering the mathematical explanation?

Best Answer

I addressed your confusion about PCA in your other question. The correct version of your statement is that PCA attempts to characterize most of the variation in the data in a lower-dimensional space. As I explained in the answer to the other post, PCA is not used for discrimination. It has been used to do clustering after reducing the dimension to a subset of the principal components.
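As a sketch of that use (reduce with PCA, then cluster on the leading components), assuming synthetic two-cluster data; the thresholding step at the end is only a stand-in for a real clustering method:

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical data: two well-separated Gaussian clusters in 5 dimensions
x = np.vstack([rng.normal(0.0, 1.0, (100, 5)),
               rng.normal(4.0, 1.0, (100, 5))])

# PCA by hand: center the data, take the leading right singular vectors
xc = x - x.mean(axis=0)
_, _, vt = np.linalg.svd(xc, full_matrices=False)
scores = xc @ vt[:2].T  # projections onto the first two principal components

# crude "clustering" on PC1: split at zero
# (the sign of a principal component is arbitrary, so which side is which
# cluster is not determined)
labels = (scores[:, 0] > 0).astype(int)
```

Because the between-cluster direction carries most of the variance here, PC1 lines up with it and the two clusters land on opposite sides of zero.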

If by imbalance of the data you mean that one cluster center has many more points around it than another, then that may affect the ability of any clustering method to identify the smaller cluster, but the magnitude of separation between the clusters is probably more important.
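To see the point about separation mattering more than balance, here is an illustrative simulation (all numbers invented) with a 1:99 size ratio but widely separated groups; the small group still stands apart on the first principal component:

```python
import numpy as np

rng = np.random.default_rng(2)
# hypothetical 1:99 imbalance with large separation between group centers
big = rng.normal(0.0, 1.0, (990, 5))
small = rng.normal(8.0, 1.0, (10, 5))
x = np.vstack([big, small])

# PCA by hand: center, project onto the leading right singular vector
xc = x - x.mean(axis=0)
_, _, vt = np.linalg.svd(xc, full_matrices=False)
pc1 = xc @ vt[0]

# crude cut halfway between the extremes of PC1; with this much
# separation the 10 minority points fall on one side by themselves
cut = (pc1.min() + pc1.max()) / 2
labels = pc1 > cut
```

Shrinking the distance between the group centers (say, from 8 to 1 per feature) makes the minority group vanish into the bulk long before the 1:99 ratio itself is the limiting factor.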