Machine Learning – Physical Meaning of Eigenvectors of the Gram/Kernel Matrix

Tags: kernel trick, machine learning, pca

If we have a centered dataset $X$, then the eigenvectors of $X^T X$ are the principal components of the dataset; their physical meaning is that they are the directions along which the data vary in the original feature space.

Now, in the same way, we can form the kernel (Gram) matrix $X X^T$ of the centered data and compute its eigenvectors and eigenvalues.
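For concreteness, here is a minimal NumPy sketch (not part of the original question; the data and sizes are made up) of the two objects being compared: the $p \times p$ matrix $X^T X$ and the $n \times n$ Gram matrix $X X^T$, both built from the same centered data.

```python
import numpy as np

# Toy centered dataset: n samples (rows), p features (columns); sizes are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)            # center each feature

C = X.T @ X                       # p x p matrix whose eigenvectors are the principal directions
G = X @ X.T                       # n x n Gram ("kernel") matrix of the centered data

dir_vals, dir_vecs = np.linalg.eigh(C)    # eigendecomposition in feature space
gram_vals, gram_vecs = np.linalg.eigh(G)  # eigendecomposition of the Gram matrix
```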

I understand that this decomposition can be useful, for example by keeping only a subset of the top components as discussed in this paper by J. Shawe-Taylor et al. However, I am not sure I understand the physical meaning of the decomposition itself. What are the directions now, and what do the eigenvalues mean?

Best Answer

The eigenvalues are actually the same as those of the covariance matrix. Let $X = U \Sigma V^T$ be the singular value decomposition; then $$X X^T = U \Sigma \underbrace{V^T V}_{I} \Sigma U^T = U \Sigma^2 U^T$$ and similarly $X^T X = V \Sigma^2 V^T$. Note that in the typical case where $X$ is $n \times p$ with $n \gg p$, most of the eigenvalues of the Gram matrix will be zero. If you're using, say, an RBF kernel, none will be zero (though some will probably be incredibly small).
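A quick numerical check of this identity, as a sketch with NumPy on made-up data: the nonzero eigenvalues of the Gram matrix coincide with those of $X^T X$, both equal the squared singular values, and the remaining $n - p$ Gram eigenvalues are (numerically) zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5                                  # n >> p, so the Gram matrix is rank-deficient
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

eig_cov = np.linalg.eigvalsh(X.T @ X)[::-1]    # eigenvalues of X^T X, descending
eig_gram = np.linalg.eigvalsh(X @ X.T)[::-1]   # eigenvalues of X X^T, descending

print(np.allclose(eig_cov, s**2))              # True: same nonzero spectrum
print(np.allclose(eig_gram[:p], s**2))         # True: top p Gram eigenvalues match
print(np.allclose(eig_gram[p:], 0.0))          # True: the other n - p are (numerically) zero
```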

The eigenvectors of the Gram matrix are thus the left singular vectors of $X$, i.e. the columns of $U$. One way to interpret these is:

  • The right singular vectors (columns of $V$, the eigenvectors of the covariance matrix) give the directions along which the data tend to lie in the feature space.
  • The singular values (diagonal of $\Sigma$, square root of the eigenvalues of either matrix) give how important each component is to the dataset as a whole.
  • The left singular vectors (columns of $U$, the eigenvectors of the Gram matrix) tell you how strongly each data point loads on each component, relative to how strongly that component is used across the whole dataset. (The columns of $U \Sigma$ give the scores: the linear coefficients of each component when the data are expressed in the basis $V$.)

If you take only the first few columns of $U$ (and the corresponding block of $\Sigma$), you get the data projected as well as possible onto the dominant components; this is exactly PCA.
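As a sketch (NumPy, illustrative data and a made-up cutoff $k$): the truncated SVD pieces give both the scores $U \Sigma$ and the best rank-$k$ reconstruction, and projecting $X$ onto the top $k$ right singular vectors yields the same scores.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U * s                                 # U @ diag(s): coordinates of each point in the basis V

k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]       # best rank-k approximation of X
proj = X @ Vt[:k].T                            # project data onto the top k right singular vectors
print(np.allclose(proj, scores[:, :k]))        # True: same as keeping the first k score columns
```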

If a data point's row of $U$ has high norm, that point uses the components much more heavily than the other points do, i.e. it has high leverage / "sticks out." If $p > n$, these norms will all be one, in which case you can either look only at the first $k$ columns (leverage scores corresponding to the best rank-$k$ approximation) or do some kind of soft-thresholding instead. Doing that with $k=1$, which is computationally easier, gives you PageRank.
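A hedged sketch of the leverage idea (NumPy; the data and the cutoff $k$ are made up): leverage is just the squared row norm of $U$, or of its first $k$ columns for the rank-$k$ version.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

leverage = np.sum(U**2, axis=1)                # squared row norms of U (not all 1 here, since n > p)
k = 2
leverage_k = np.sum(U[:, :k]**2, axis=1)       # leverage w.r.t. the best rank-k approximation

print(np.argsort(leverage_k)[-3:])             # indices of the three points that "stick out" the most
```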

See also this thread and the links therein for everything you'd ever want to know about SVD/PCA, which (perhaps without your realizing it) was really what your question was about.
