Solved – linearly combine the results of PCA with the variance of each feature

Tags: dimensionality-reduction, pca

I'm trying to reduce a dataset from 8 features down to 1 using the principal component analysis (PCA) algorithm. The reduced dataset needs to be one-dimensional (1D) so I can use it for matrix factorization and other algorithms.

However, after applying PCA, the reduced 1D dataset keeps only 45% of the variance, so I'm losing a lot of information. I also tried reducing the dimension while keeping 95% of the variance, but the resulting dataset is 3D.
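This kind of check can be done directly with scikit-learn. Below is a minimal sketch on synthetic stand-in data (the actual 8-feature dataset isn't shown, so the numbers here are illustrative only):

```python
# Sketch: inspecting how much variance a 1-component PCA keeps,
# and how many components a 95% variance threshold requires.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))           # stand-in for the 8-feature dataset

pca_1d = PCA(n_components=1).fit(X)
print("variance kept by 1 component:", pca_1d.explained_variance_ratio_.sum())

# Passing a float in (0, 1) asks for the smallest number of components
# whose cumulative explained variance reaches that fraction.
pca_95 = PCA(n_components=0.95).fit(X)
print("components needed for 95%:", pca_95.n_components_)
```

On real, correlated features the second fit would report 3 components, matching the situation described above.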

I know the variance of each feature in the reduced dataset. I was therefore thinking of using the variances as weights and linearly combining them with the reduced values to bring the dataset down to 1D, for example:

value_1D = variance_f1*value_f1 + variance_f2*value_f2 + variance_f3*value_f3 

Would it be correct to do this? Do you know of any other alternative?

Thanks in advance

Best Answer

PCA already finds the best linear combinations of the features (the principal components) for explaining variance. If you start combining the principal components, you will not explain more variance, since PCA already found the best linear combinations. Essentially you are trying to perform PCA and then PCA again. You could try performing PCA twice, but you will find the exact same result.
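The "PCA twice gives the same result" claim can be verified numerically. A sketch with scikit-learn on synthetic data (the scores match up to a sign flip per axis, assuming distinct eigenvalues):

```python
# Sketch: applying PCA to the output of PCA reproduces the same scores.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 8))  # correlated features

Z = PCA(n_components=3).fit_transform(X)    # first PCA: 8 -> 3
Z2 = PCA(n_components=3).fit_transform(Z)   # "PCA again" on the scores

# The scores are already uncorrelated and sorted by variance, so the
# second PCA can only reproduce them (each axis possibly sign-flipped).
assert np.allclose(np.abs(Z), np.abs(Z2), atol=1e-6)
```

The second fit has nothing left to rotate: the covariance of the scores is already diagonal with decreasing entries.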

What you could perhaps do is the following. Say you have a dataset with 2 features, one measured in m and the other in km. It is justified to change the units so that all features share common units, and doing so can inflate the amount of variance explained by the first component.

However, be careful. Taken to the extreme, you could decide that feature 1 is much more important than feature 2, conclude that feature 2 is useless, and multiply it by zero. Feature 1 would then explain all of the variance (100%), but the result is meaningless, since you effectively deleted feature 2. So don't start reweighting all your features just to increase the amount of variance explained ;) Note that any such rescaling should be done before PCA.
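The rescaling effect is easy to demonstrate. A sketch on synthetic 2-feature data (the factor of 1000 stands in for a km-to-m unit change):

```python
# Sketch: rescaling one feature before PCA inflates the variance
# share of the first component, without adding any information.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))           # two features on comparable scales

base = PCA().fit(X).explained_variance_ratio_[0]

Xs = X.copy()
Xs[:, 0] *= 1000.0                      # e.g. converting km back to m
inflated = PCA().fit(Xs).explained_variance_ratio_[0]

print("before rescaling:", base)
print("after rescaling: ", inflated)    # close to 1.0
```

The rescaled first component explains nearly 100% of the variance, yet the dataset contains exactly the same information as before.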

Finally, you could ask yourself whether it is really necessary to keep a lot of variance for your goal. If your goal is classification, for example, you could consider another method (one that finds a linear combination of features that separates the classes as well as possible, for instance). Otherwise it might be wise to keep the 3D dataset and look for a method that can work with multidimensional datasets.
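One standard method of that supervised kind is linear discriminant analysis (LDA). A hedged sketch on synthetic two-class data (the class structure here is invented for illustration):

```python
# Sketch: LDA projects to at most (n_classes - 1) dimensions -> 1D for
# two classes, chosen to separate classes rather than preserve variance.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
# Two hypothetical classes in 8 features, shifted apart
X0 = rng.normal(size=(100, 8))
X1 = rng.normal(size=(100, 8)) + 2.0
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

Z = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
print(Z.shape)                          # a 1D projection of the data
```

Unlike PCA, this uses the labels, so the single retained axis is the one most useful for telling the classes apart, even if it explains little of the total variance.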
