Should I standardize or normalize variables before conducting a principal components analysis?

machine learning, pca, terminology

I am confused by what I am reading about PCA. Some sources say that I should normalize my data before applying PCA, and others say that I should standardize it. As I understand it, normalization only rescales my values into the range [0,1]. Standardization, on the other hand, changes each variable's mean to 0 and its standard deviation to 1.

Sources that say I should standardize my variables:

https://onlinecourses.science.psu.edu/stat505/node/55

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html#the-effect-of-standardization-on-pca-in-a-pattern-classification-task

Sources that say I should normalize my variables:

https://datafai.com/2017/10/27/data-standardization-or-normalization/

Why do we need to normalize data before principal component analysis (PCA)?

Best Answer

The Wikipedia page on "Normalization" notes:

In statistics and applications of statistics, normalization can have a range of meanings.

It then goes on to list six examples of "normalizations in statistics," including both what you have called "standardization" and what you have called "normalization."

"Normalization" onto [0,1] is called "feature scaling" or "unity-based normalization" on the Wikipedia page. "Normalization" based on the observed mean and standard deviation (called "Student's t-statistic" on that page; "standardization" in more frequent but not universal usage) is typically what you want for PCA, because PCA is driven by variances and variables measured in larger units would otherwise dominate the components.
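To make the two operations concrete, here is a minimal NumPy sketch of both, plus a quick look at how each one changes the share of variance per principal component. The function names (`min_max_scale`, `standardize`, `explained_variance_ratio`) and the random data are made up for illustration; they are not taken from any of the linked sources.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column onto [0, 1] ("feature scaling")."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    return (X - col_min) / (X.max(axis=0) - col_min)

def standardize(X):
    """Shift each column to mean 0 and scale it to sample standard deviation 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def explained_variance_ratio(X):
    """Share of total variance carried by each principal component, largest first."""
    # eigvalsh returns eigenvalues in ascending order, so reverse them.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

# Made-up data: three independent variables whose units differ by orders of magnitude.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])

for name, data in [("raw", X),
                   ("min-max", min_max_scale(X)),
                   ("standardized", standardize(X))]:
    print(name, np.round(explained_variance_ratio(data), 3))
```

On the raw data the first component is almost entirely the large-unit variable. After standardization, the covariance matrix of the data equals the correlation matrix of the original variables, so each variable enters the PCA on an equal footing.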

This type of terminological confusion happens often in practice. Consider, for example, the frequent use of "multivariate" to describe a model with multiple predictors, when that word might best be reserved for situations with multiple types of outcomes.

So I wouldn't worry too much about the terminology that other people are using. Look into what they actually did, not what they called it. Then when you report your study, explain clearly what you did and try to use the best current terminology yourself.
