Machine Learning – Physical Meaning of Eigenvectors of the Gram/Kernel Matrix

Tags: kernel trick, machine learning, pca

If we have a centered dataset $X$, then the eigenvectors of $X^T X$ are the principal components of the dataset; their physical meaning is that they are the directions along which the data vary in the original feature space.

Now, in the same way, we can form the kernel (Gram) matrix $X X^T$ of the centered data and compute its eigenvectors and eigenvalues.
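For concreteness, here is a minimal NumPy sketch (not part of the original question; the data and sizes are made up) of the two objects being compared: the $p \times p$ matrix $X^T X$ and the $n \times n$ Gram matrix $X X^T$, both built from the same centered data.

```python
import numpy as np

# Toy centered dataset: n samples (rows), p features (columns); sizes are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)            # center each feature

C = X.T @ X                       # p x p matrix whose eigenvectors are the principal directions
G = X @ X.T                       # n x n Gram ("kernel") matrix of the centered data

dir_vals, dir_vecs = np.linalg.eigh(C)    # eigendecomposition in feature space
gram_vals, gram_vecs = np.linalg.eigh(G)  # eigendecomposition of the Gram matrix
```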

I understand that this decomposition can be useful, for example by keeping only a subset of the top components as discussed in this paper by J. Shawe-Taylor et al. However, I am not sure I understand the physical meaning of the decomposition itself. What are the directions now, and what do the eigenvalues mean?

Best Answer

The eigenvalues are actually the same as those of the covariance matrix. Let $X = U \Sigma V^T$ be the singular value decomposition; then $$X X^T = U \Sigma \underbrace{V^T V}_{I} \Sigma U^T = U \Sigma^2 U^T$$ and similarly $X^T X = V \Sigma^2 V^T$. Note that in the typical case where $X$ is $n \times p$ with $n \gg p$, most of the eigenvalues of the Gram matrix will be zero. If you're using, say, an RBF kernel, none will be zero (though some will probably be incredibly small).
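A quick numerical check of this identity, as a sketch with NumPy on made-up data: the nonzero eigenvalues of the Gram matrix coincide with those of $X^T X$, both equal the squared singular values, and the remaining $n - p$ Gram eigenvalues are (numerically) zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5                                  # n >> p, so the Gram matrix is rank-deficient
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

eig_cov = np.linalg.eigvalsh(X.T @ X)[::-1]    # eigenvalues of X^T X, descending
eig_gram = np.linalg.eigvalsh(X @ X.T)[::-1]   # eigenvalues of X X^T, descending

print(np.allclose(eig_cov, s**2))              # True: same nonzero spectrum
print(np.allclose(eig_gram[:p], s**2))         # True: top p Gram eigenvalues match
print(np.allclose(eig_gram[p:], 0.0))          # True: the other n - p are (numerically) zero
```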

The eigenvectors of the Gram matrix are thus the left singular vectors of $X$, i.e. the columns of $U$. One way to interpret these is:

  • The right singular vectors (columns of $V$, the eigenvectors of the covariance matrix) give the directions along which the data tend to lie in the feature space.
  • The singular values (diagonal of $\Sigma$, square root of the eigenvalues of either matrix) give how important each component is to the dataset as a whole.
  • The left singular vectors (columns of $U$, the eigenvectors of the Gram matrix) tell you how strongly each data point loads on each component, relative to how strongly that component is used across the whole dataset. (The columns of $U \Sigma$ give the scores: the linear coefficients of each component when the data are expressed in the basis $V$.)

If you take only the first few columns of $U$ (and the corresponding block of $\Sigma$), you get the data projected as well as possible onto the dominant components; this is exactly PCA.
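As a sketch (NumPy, illustrative data and a made-up cutoff $k$): the truncated SVD pieces give both the scores $U \Sigma$ and the best rank-$k$ reconstruction, and projecting $X$ onto the top $k$ right singular vectors yields the same scores.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U * s                                 # U @ diag(s): coordinates of each point in the basis V

k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]       # best rank-k approximation of X
proj = X @ Vt[:k].T                            # project data onto the top k right singular vectors
print(np.allclose(proj, scores[:, :k]))        # True: same as keeping the first k score columns
```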

If a data point's row of $U$ has high norm, that point uses the components much more heavily than the other points do, i.e. it has high leverage / "sticks out." If $p > n$, these norms will all be one, in which case you can either look only at the first $k$ columns (leverage scores corresponding to the best rank-$k$ approximation) or do some kind of soft-thresholding instead. Doing that with $k=1$, which is computationally easier, gives you PageRank.
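A hedged sketch of the leverage idea (NumPy; the data and the cutoff $k$ are made up): leverage is just the squared row norm of $U$, or of its first $k$ columns for the rank-$k$ version.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

leverage = np.sum(U**2, axis=1)                # squared row norms of U (not all 1 here, since n > p)
k = 2
leverage_k = np.sum(U[:, :k]**2, axis=1)       # leverage w.r.t. the best rank-k approximation

print(np.argsort(leverage_k)[-3:])             # indices of the three points that "stick out" the most
```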

See also this thread and the links therein for everything you'd ever want to know about SVD/PCA, which (perhaps without your realizing it) was really what your question was about.
