Solved – Does using a covariance matrix of scaled and centered variables compare with using a correlation matrix

correlation matrix, covariance-matrix, pca

I have some data whose features are measured in different units. The rows are the observations and the columns are the features. The features are also correlated with one another. Hence, principal component analysis seems best suited to explain the variance in my data and to identify how much each feature contributes to the total variance explained by a principal component.

I read somewhere that principal component analysis (PCA) can be performed using either the covariance matrix or the correlation matrix of the data. If I scale and center my data (z-scores), compute its covariance matrix, and compare that with the correlation matrix, will the PCA results differ? Since both the correlation matrix and the covariance matrix of scaled and centered data amount to standardizing the data, I assume the results (the variance contributed by each feature in a principal component) should be the same. Am I wrong to assume this? I'm a novice trying to understand principal component analysis.

Best Answer

You will find a nice summary by user @ttnphns here: https://stats.stackexchange.com/q/22520.

In particular:

  • If you center the columns (variables) of $\mathbf{A}$, then $\mathbf{A}'\mathbf{A}$ is the scatter (or, to be rigorous, co-scatter) matrix and $\mathbf{A}'\mathbf{A}/(n-1)$ is the covariance matrix, where $n$ is the number of rows (observations).
  • If you z-standardize the columns of $\mathbf{A}$ (subtract the column mean and divide by the sample standard deviation, computed with the $n-1$ denominator), then $\mathbf{A}'\mathbf{A}/(n-1)$ is the Pearson correlation matrix: correlation is covariance for standardized variables.
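
As a quick numerical check of the two points above, here is a minimal NumPy sketch. The data are simulated and the variable names are made up for illustration; it is not taken from the linked answer. It z-scores three features on very different scales, forms $\mathbf{A}'\mathbf{A}/(n-1)$, and compares it (and its eigendecomposition, which is what PCA uses) with the correlation matrix of the raw data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: rows are observations, columns are features
# measured on very different scales, two of them correlated.
latent = rng.normal(size=(n, 1))
X = np.hstack([
    1000.0 * latent + rng.normal(scale=300.0, size=(n, 1)),
    0.01 * latent + rng.normal(scale=0.01, size=(n, 1)),
    rng.normal(size=(n, 1)),
])

# z-standardize the columns, using the sample standard deviation (ddof=1)
# so that it matches the n - 1 denominator used below.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance matrix of the z-scores: A'A / (n - 1) with A = Z.
cov_of_z = Z.T @ Z / (n - 1)

# Pearson correlation matrix of the raw, unscaled data.
corr_of_x = np.corrcoef(X, rowvar=False)

same_matrix = np.allclose(cov_of_z, corr_of_x)

# PCA is the eigendecomposition of that matrix, so it agrees as well.
evals_cov, evecs_cov = np.linalg.eigh(cov_of_z)
evals_corr, evecs_corr = np.linalg.eigh(corr_of_x)
same_eigenvalues = np.allclose(evals_cov, evals_corr)
# Eigenvectors are only defined up to sign, so compare absolute values.
same_eigenvectors = np.allclose(np.abs(evecs_cov), np.abs(evecs_corr))

print(same_matrix, same_eigenvalues, same_eigenvectors)
```

Note that if the standard deviation were computed with the $n$ denominator (NumPy's default, `ddof=0`), the two matrices would differ by the constant factor $n/(n-1)$; the principal directions would still be the same, only the eigenvalues would be rescaled.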

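So the answer to your question is yes: the covariance matrix of z-scored data is the correlation matrix of the original data, so PCA gives the same result either way. In general you should always center your data when performing PCA. As explained here, not centering your data can give misleading results.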
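
To illustrate why centering matters, here is a small sketch on made-up two-dimensional data (the numbers and the helper `leading_direction` are assumptions for illustration only). The cloud varies mostly along the first axis but sits far from the origin along the second. Without centering, the leading eigenvector of $\mathbf{A}'\mathbf{A}/(n-1)$ mostly tracks the mean rather than the direction of greatest variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: large spread along the first axis, small spread
# along the second, and a mean located far from the origin on the second axis.
X = rng.normal(size=(n, 2)) * np.array([3.0, 0.5]) + np.array([0.0, 50.0])


def leading_direction(M):
    """Return the eigenvector of symmetric M with the largest eigenvalue."""
    evals, evecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return evecs[:, -1]


# Centered: this is PCA on the covariance matrix.
Xc = X - X.mean(axis=0)
centered = leading_direction(Xc.T @ Xc / (n - 1))

# Not centered: the same formula applied to the raw columns.
uncentered = leading_direction(X.T @ X / (n - 1))

# Unit vector pointing from the origin to the data mean.
mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))

print("centered first PC:  ", centered)
print("uncentered first PC:", uncentered)
print("mean direction:     ", mean_dir)
```

In this setup the centered first component lines up with the high-variance axis, while the uncentered one is almost parallel to the mean direction (up to sign).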