Solved – Influence of Normalization and PCA on Distance Metrics

Tags: data transformation, k nearest neighbour, pca

After standardizing my dataset, I perform a principal component analysis. Then I do a nearest neighbour search.

I observed that after performing the PCA, even though I kept (for testing purposes) all resulting dimensions (so the PCA-transformed dataset has the same number of features as the input dataset), the quality of my nearest neighbour results drops.

As a distance metric in the nearest neighbour search I use the cosine distance. My intuitive assumption was that this distance metric should not be influenced by the PCA, as long as I don't reduce the dimensionality of my dataset.

How can the performance drop be explained? Did I perhaps do something wrong?

Here is how I perform the PCA. I use Eigen and C++. traindata is my untouched input dataset where each row is a sample and each column a feature:

  // Mean-center the data (featureMeans is a row vector of per-feature means).
  featureMeans = traindata.colwise().mean();
  Eigen::MatrixXf centered = traindata.rowwise() - featureMeans;
  // Featurewise standard deviations from the diagonal of the sample covariance matrix.
  Eigen::MatrixXf cov = (centered.adjoint() * centered) / float(centered.rows() - 1);
  Eigen::VectorXf variances = cov.diagonal();
  stdDevs = variances.cwiseSqrt();
  // Scale each feature to unit variance.
  for (Eigen::Index i = 0; i < centered.rows(); i++) {
    centered.row(i) = centered.row(i).array() / stdDevs.array().transpose();
  }

  // SVD of the standardized data; the right singular vectors are the principal axes.
  Eigen::JacobiSVD<Eigen::MatrixXf> svd(centered, Eigen::ComputeThinV);
  // Transformation matrix (columns are the principal components).
  pcaTransform = svd.matrixV();
  // Project the dataset onto the principal components.
  traindata = centered * pcaTransform;

And later, when I get a new data point, I transform it into the space of the principal components like this:

Eigen::VectorXf NearestNeighbour::transform(Eigen::VectorXf& vec) {
  // Apply the same standardization as used for the training data
  // (featureMeans is the row vector computed above, so transpose it).
  vec -= featureMeans.transpose();
  vec = vec.array() / stdDevs.array();
  // Project onto the principal components, keeping the result a column vector.
  vec = pcaTransform.transpose() * vec;
  return vec;
}

Then in the nearest neighbour algorithm I compute the cosine similarity like this:

float NearestNeighbour::calcCosineSimilarity(Eigen::VectorXf& vecOne, Eigen::VectorXf& vecTwo) {
  float dot = vecOne.dot(vecTwo);
  float cosine = dot / (vecOne.norm() * vecTwo.norm());
  return cosine;
}

I wrote a test script in Python using my dataset. The input vector is randomly created, as I only wanted to test the concept:

import numpy as np
from scipy.spatial.distance import cosine
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

inpvec = [[ 0.73977109,  0.03620438,  0.25417753,  0.11561778,  0.82897718,
         0.13585422,  0.16245644,  0.97201561,  0.68201026,  0.52702283,
         0.21245685,  0.246901  ,  0.72042289,  0.52233973,  0.89980493,
         0.3394559 ,  0.92817351,  0.64084039,  0.73594745,  0.825488  ,
         0.87527608,  0.02777485,  0.30630228,  0.26867405,  0.10130528,
         0.43129711,  0.41076364,  0.22625131,  0.2616146 ,  0.52088176,
         0.23174206,  0.1674724 ,  0.81184377,  0.68945395,  0.12719359,
         0.97440578,  0.18162815,  0.29054626,  0.22535362,  0.44556911,
         0.15830425,  0.15641608,  0.00425475,  0.77260893,  0.84462181,
         0.98222192,  0.12986739,  0.16809029,  0.58549871,  0.38430837,
         0.39776035,  0.76900314]]


def loadTrainingData(path):
  tData = []
  with open(path) as f:
    for line in f:
      lsplit = line.split(",")
      datapoint = [float(feature) for feature in lsplit]
      tData.append(datapoint)
  data = np.array(tData)
  X = np.delete(data, data.shape[1] - 1, 1)  # Strip the class label column.
  return X


inpvec = np.array(inpvec)
X = loadTrainingData("trainingfile.csv")


normalizer = Normalizer()
X_norm = normalizer.fit_transform(X)
inpvecNorm = normalizer.transform(inpvec)[0]  # Normalize the query once, outside the loop.
cosdists = []
for datapoint in X_norm:
  cosdists.append(cosine(inpvecNorm, datapoint))
value = min(cosdists)
index = [i for i, j in enumerate(cosdists) if j == value]
print("min dist at " + str(index) + " with " + str(value))


# PCA: standardize, then project, keeping all components.
pipeline = Pipeline([('scaling', StandardScaler()), ('pca', PCA(n_components=X.shape[1]))])
X_reduced = pipeline.fit_transform(X)

inpVecReduced = pipeline.transform(inpvec)[0]
cosdistsTwo = []
for datapointPCA in X_reduced:
  cosdistsTwo.append(cosine(inpVecReduced, datapointPCA))
valuePCA = min(cosdistsTwo)
indexPCA = [i for i, j in enumerate(cosdistsTwo) if j == valuePCA]
print "min dist at " + str(indexPCA) + " with " + str(valuePCA)

The output of this is:

min dist at [462] with 0.322760977886
min dist at [304] with 0.461332258519

My assumption was that it should output the same index in both cases. Why is this happening?

Best Answer

PCA is a linear transformation, and if you are keeping every dimension, the distances between your points should not change. Supposing that your original data is some matrix $X$ and your resulting data is the transformed matrix $Y = AX$, where $A$ is the orthogonal matrix of principal components, then the distance is unchanged:

$$ d(y_i, y_j) = d(A x_i, A x_j) $$

and if you are using the cosine distance,

$$ d(A x_i, A x_j) = 1 - \frac{(A x_i)^T (A x_j)}{\|A x_i\| \, \|A x_j\|} $$

and since $\| A x_i \| = \| x_i \|$ (an orthogonal transformation preserves norms),

$$ = 1 - \frac{x_i^T A^T A x_j}{\| x_i \| \, \| x_j \|} $$

and since $A^T A = I$,

$$ = 1 - \frac{x_i^T x_j}{\| x_i \| \, \| x_j \|} = d(x_i, x_j) $$

so all this was a very roundabout way to say that distance calculations should not be affected.
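
As a quick numerical illustration (not part of the original post), here is a minimal sketch, assuming numpy and scipy are installed: it draws some synthetic data, applies a random orthogonal matrix, and checks that the cosine distances and the nearest-neighbour index are unchanged. The data sizes and the use of scipy's ortho_group are arbitrary choices for the illustration.

import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import ortho_group

rng = np.random.default_rng(0)
X = rng.random((100, 52))                # synthetic data: 100 samples, 52 features
query = rng.random(52)                   # synthetic query vector
A = ortho_group.rvs(52, random_state=0)  # random orthogonal matrix, so A.T @ A = I

# Cosine distances before and after the orthogonal transformation.
d_orig = np.array([cosine(query, x) for x in X])
d_rot = np.array([cosine(A @ query, A @ x) for x in X])

print(np.allclose(d_orig, d_rot))             # expected: True
print(np.argmin(d_orig) == np.argmin(d_rot))  # expected: True (same nearest neighbour)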

Edit, now that the code has appeared: in the calculation of centered, you are also scaling the features to unit variance. This might be a good idea in general, but it is going to change the distances between points, since you are essentially weighting some dimensions to be more or less important than others. Your resulting data is then the transformed matrix $Y = ANX$, where $A$ is some orthogonal basis (so $A^T A = I$ as before), but $N$ is a diagonal matrix that is not necessarily $I$. In that case you cannot show that $d(y_i, y_j) = d(x_i, x_j)$. However, you can show that $d(y_i, y_j) = d(N x_i, N x_j)$. That means that if you transform your input data to have unit variance in each dimension, the distances will be the same as in the post-PCA data.
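
To make that last point concrete, here is a minimal sketch of the check (my own addition, with synthetic data standing in for trainingfile.csv, which isn't available here; the helper nearestIndex is made up for the illustration): the nearest neighbour found on the merely standardized data agrees with the one found after standardization plus full-dimensional PCA.

import numpy as np
from scipy.spatial.distance import cosine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)
X = rng.random((500, 52))    # synthetic stand-in for the training data
query = rng.random((1, 52))  # synthetic stand-in for inpvec

def nearestIndex(q, data):
  # Index of the nearest neighbour of the single row q under cosine distance.
  return int(np.argmin([cosine(q[0], row) for row in data]))

scaler = StandardScaler().fit(X)  # this plays the role of N x_i
pipeline = Pipeline([('scaling', StandardScaler()),
                     ('pca', PCA(n_components=X.shape[1]))]).fit(X)  # y_i = A N x_i

idxScaled = nearestIndex(scaler.transform(query), scaler.transform(X))
idxPca = nearestIndex(pipeline.transform(query), pipeline.transform(X))
print(idxScaled == idxPca)  # expected: True, since the PCA step is orthogonal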
