Solved – Relation between variance of eigenvalues and the effectiveness of PCA on the data

eigenvalues, machine-learning, pca, variance

If the covariance matrix has eigenvalues $$\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d > 0,$$ why is the variance of the eigenvalues, $$\sigma^2=\frac{1}{d}\sum_{i=1}^d (\lambda_i-\bar \lambda)^2,$$ a measure of whether or not PCA would be useful for analyzing the data (the higher the value of $\sigma^2$, the more useful PCA)?
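For concreteness, here is a minimal sketch (assuming NumPy and a made-up toy dataset) of how this $\sigma^2$ would be computed from the sample covariance matrix:

```python
import numpy as np

# Toy data: 500 samples in 4 dimensions with very different scales,
# so the covariance eigenvalues are spread out.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

S = np.cov(X, rowvar=False)                  # sample covariance matrix
lam = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigenvalues, sorted descending

# Variance of the eigenvalues, as in the question
sigma2 = np.mean((lam - lam.mean()) ** 2)
print(lam)
print(sigma2)   # large here, because the coordinate scales differ a lot
```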

Best Answer

As @ttnphns said, if the eigenvalues are all similar to one another, the covariance matrix of your multivariate vector is close to spherical.

Before getting to the point of the answer: suppose for a moment that all eigenvalues are different.

Then $\lambda_1$ is associated with the one-dimensional affine subspace onto which the projection of the data shows the highest variance (it is usually said to be the direction that "explains" most of the variance).

Then $\lambda_2$ is associated with the direction, orthogonal to the first one, that "explains" most of the variance among all directions orthogonal to it (here "explain" is in the same sense as above: projecting the data onto this subspace shows greater variance than any other projection, subject to being orthogonal to the direction associated with $\lambda_1$).

And with $\lambda_3, \lambda_4, \ldots, \lambda_d$ you can keep finding directions, orthogonal to the previous ones, that explain variance. By this construction, the first direction explains most of the variance of the data, the second explains less than the first but more than the rest, and so on. So, sometimes, the first $p < d$ principal components (i.e., the projections of the data onto these directions) are used to describe the data, as they retain most of its variability while reducing the dimensionality of the dataset.
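As a small illustration (a sketch with simulated correlated data, assuming NumPy), the principal directions and the fraction of variance each one explains can be read directly off the eigendecomposition of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
z = rng.normal(size=(n, 1))
# Correlated 3-d data: most of the variability lies along a single direction.
X = np.hstack([z + 0.1 * rng.normal(size=(n, 1)),
               2.0 * z + 0.1 * rng.normal(size=(n, 1)),
               rng.normal(size=(n, 1))])

S = np.cov(X, rowvar=False)
lam, V = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]            # sort eigenpairs in descending order
lam, V = lam[order], V[:, order]

explained = lam / lam.sum()              # fraction of variance per direction
print(np.round(explained, 3))            # the first direction dominates
print(np.round(np.cumsum(explained), 3)) # a single PC already keeps most of it

# The first p principal components: project centred data onto the top p directions
p = 1
scores = (X - X.mean(axis=0)) @ V[:, :p]
```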

And here is the big point: all of this holds only if the eigenvalues differ from one another. If $\lambda_1 = \lambda_2$, then you cannot find a single 1-d subspace that explains most of the variance; you need a 2-d plane to do that (the one spanned by the eigenvectors associated with $\lambda_1$ and $\lambda_2$). You cannot (essentially) say anything about the individual directions by themselves, but you can be sure that the plane they span explains a higher percentage of the variance than any other 2-d plane. So data reduction will, at most, be possible down to 2-d; no 1-d representation makes sense.
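A quick numerical illustration of this case (again a sketch with simulated data, assuming NumPy): when the first two eigenvalues are equal, every unit direction inside the plane they span captures essentially the same projected variance, so no single 1-d direction stands out:

```python
import numpy as np

rng = np.random.default_rng(2)
# Population covariance has lambda_1 = lambda_2 = 4 and lambda_3 = 0.09,
# so no single 1-d direction is best: only the (x1, x2) plane is.
X = np.hstack([2.0 * rng.normal(size=(5000, 2)),
               0.3 * rng.normal(size=(5000, 1))])

S = np.cov(X, rowvar=False)
print(np.round(np.sort(np.linalg.eigvalsh(S))[::-1], 2))  # first two nearly equal

# The projected variance u' S u is (almost) the same for every unit vector u
# lying in the plane spanned by the first two eigenvectors:
for angle in np.linspace(0.0, np.pi / 2, 4):
    u = np.array([np.cos(angle), np.sin(angle), 0.0])
    print(round(float(u @ S @ u), 2))    # about 4, regardless of the direction
```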

In the extreme case in which all eigenvalues are the same, the data cannot be projected optimally (in the sense of variance explanation understood as above), and no data reduction can be performed. If the variance of the eigenvalues is small, then something like this last case is happening: the eigenvalues are very similar to each other, so successive directions explain almost the same percentage of the variance (near $\frac{1}{d}$ each in the extreme, almost-equal case). PCA would then not be useful as a dimension-reduction technique, because in order to get a representation of the data that retains a significant amount of the original variability, you would need close to $d$ dimensions, so no reduction really takes place. For example, if $d=8$ and the eigenvalues are almost the same (i.e., their variance is near zero), then using 3 PCs explains only around $\frac{3}{8} = 37.5\%$ of your original data variance: no low-dimensional PC representation keeps track of your data's variability, so it may hide a lot of its behaviour and should not be used.
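This extreme case can be checked the same way (a sketch with isotropic simulated data, assuming NumPy): the eigenvalue variance is near zero, and 3 of the 8 PCs explain only about $3/8$ of the total variance:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
# Isotropic data: all population eigenvalues are equal, so their variance is ~0.
X = rng.normal(size=(20000, d))

S = np.cov(X, rowvar=False)
lam = np.sort(np.linalg.eigvalsh(S))[::-1]

print(round(float(np.var(lam)), 4))      # eigenvalue variance close to zero
cum = np.cumsum(lam) / lam.sum()
print(round(float(cum[2]), 3))           # 3 PCs explain only about 3/8 = 0.375
```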
