Principal Component Analysis – Behavior of PCA with No Correlation in Dataset

machine-learning, pca

We all know that Principal Component Analysis is executed on a covariance/correlation matrix, but what if we have very high-dimensional data, say 75 features and 157849 rows?
How does PCA tackle this?

  • Does it tackle this problem in the same way as it does for correlated datasets?
  • Will my explained variance be equally distributed among the 75 features?
  • I came across Bartlett's test and the KMO test, which help us:

    • identify whether there is any correlation present or not, and
    • estimate the proportion of variance that might be common variance among the variables,

respectively. I can certainly leverage these two tests to make an informed decision (a short sketch of running them is included below, just before the practical example), but I am still looking for an answer to:

  • How does PCA behave when there is no correlation in the dataset?

I want an interpretation of this that I could explain to my non-technical brother.
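As a side note, both tests mentioned above can be run with the factor_analyzer package (a minimal sketch, assuming factor_analyzer is installed; X below is a random stand-in for the real 157849 x 75 DataFrame):

import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 75)))      # stand-in for the real data

chi2, p_value = calculate_bartlett_sphericity(X)   # H0: the correlation matrix is an identity matrix
kmo_per_variable, kmo_total = calculate_kmo(X)     # share of variance that might be common variance

print(chi2, p_value)
print(kmo_total)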

Practical example using Python:

import numpy as np
import pandas as pd

# Identity matrix as a toy dataset with no correlation structure
s = pd.Series(data=[1, 1, 1], index=['a', 'b', 'c'])
diag_data = np.diag(s)
df = pd.DataFrame(diag_data, index=s.index, columns=s.index)

# Normalizing: zero mean and unit variance per column
df = (df.subtract(df.mean())).divide(df.std())

Which looks like:

        a            b          c
a   1.154701    -0.577350   -0.577350
b   -0.577350   1.154701    -0.577350
c   -0.577350   -0.577350   1.154701

The correlation matrix (which equals the covariance matrix here, since the data is standardized) looks like this:

Cor = np.corrcoef(df.T)
Cor

array([[ 1. , -0.5, -0.5],
       [-0.5,  1. , -0.5],
       [-0.5, -0.5,  1. ]])

Now, calculating PCA Projections:

eigen_vals, eigen_vects = np.linalg.eig(Cor)           # eigendecomposition of the correlation matrix
projections = pd.DataFrame(np.dot(df, eigen_vects))    # project the standardized data onto the eigenvectors

And projections are:

        0             1             2
0   1.414214    -2.012134e-17   -0.102484
1   -0.707107   -2.421659e-16   -1.170283
2   -0.707107   -1.989771e-16   1.272767

The explained variance ratio seems to be equally split between two of the components:

[0.5000000000000001, -9.680089716721685e-17, 0.5000000000000001]
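For reference, this ratio is just each eigenvalue divided by the sum of all eigenvalues (a small sketch continuing from the variables above; note that np.linalg.eig does not sort the eigenvalues):

explained_ratio = eigen_vals / eigen_vals.sum()
print(list(explained_ratio))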

Now, when I tried calculating the Q-Residual error in order to find the reconstruction error, I got zero for all the features:

a    0.0
b    0.0
c    0.0
dtype: float64
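One way to compute the Q-residuals shown above is to reconstruct the data from the projections and measure the per-feature squared error (a sketch; since all three components are kept, the reconstruction is exact up to floating-point error):

# Invert the projection step and compare against the original standardized data
reconstruction = pd.DataFrame(np.dot(projections, np.linalg.inv(eigen_vects)),
                              index=df.index, columns=df.columns)
q_residuals = ((df - reconstruction) ** 2).sum(axis=0)   # per-feature squared error
print(q_residuals)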

This would indicate that PCA on an uncorrelated dataset such as the identity matrix gives projections from which the original data points can be recovered almost exactly. The same results are obtained with a general diagonal matrix.

If the reconstruction error is very low, this would suggest that we can fix PCA as a step in a single pipeline: if the dataset carries little correlation we get essentially the same data back after the PCA transformation, while for a dataset with highly correlated features the same step helps us fight the curse of dimensionality.
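A sketch of what such a fixed pipeline could look like in scikit-learn (the step names and the 0.95 variance threshold are my own illustrative choices):

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# PCA keeps enough components to explain 95% of the variance:
# on uncorrelated data it keeps (almost) all 75 features,
# on highly correlated data it can drop many of them.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
])
# reduced = pipe.fit_transform(X)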

Public views on this?

Best Answer

If you have no observed correlation, then your covariance matrix is diagonal, and the PCA diagonalizes a matrix that is already diagonal (so it does nothing).

If you have no population correlation but observe small sample correlations due to sampling variability, then PCA diagonalizes a covariance matrix that is nearly diagonal, and the resulting features will differ only minimally from the original ones.
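A quick simulation illustrates this (a sketch; the sample size, dimensionality, and seed are arbitrary choices of mine):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(10000, 75))    # 75 independent features: no population correlation

pca = PCA().fit(X)
# Every component explains roughly the same share, about 1/75 ≈ 0.0133
print(pca.explained_variance_ratio_.round(4))

With this many rows the sample correlations are tiny, so the eigenvalues are nearly equal and PCA gives essentially no dimensionality reduction.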