Does the first principal component differ from simply computing the mean of all variables?

Tags: feature-engineering, mean, pca

I am computing the first principal component for a dataset of 18 variables, and I was wondering: is it different from simply adding all the variables and taking their mean, i.e. from computing the variable that is the mean of my variables?

In fact, does the first principal component account for more variance than that mean variable? Why, and how?

Best Answer

Some purely geometric meditation on the question. How can we graphically compare "mean as the construct" with "PC1 as the construct"?

Let us have two variables $X$ and $Y$ with this covariance matrix

[Image: the covariance matrix of $X$ and $Y$, given in the original post.]

and we have computed the variable which is the sum $S$ or the mean $M$ of $X$ and $Y$. We also performed PCA of the $X,Y$ centered data (i.e. PCA based on the covariance matrix). [Please note that it makes no difference whether you compute $M$ (or $S$) from the raw variables and then center the result, or you first center the variables and then compute $M$. The same, centered $M$ appears.]
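
As a quick numeric sanity check of the bracketed note about centering, here is a minimal sketch. The data are simulated under an assumed covariance structure (the post's actual covariance matrix appears only as an image), so only the qualitative behaviour matters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative covariance: correlation ~ .7071 and unequal variances.
cov = np.array([[36.0, 0.7071 * 6.0 * 5.0],
                [0.7071 * 6.0 * 5.0, 25.0]])
X, Y = rng.multivariate_normal([10.0, 20.0], cov, size=500).T

# Mean variable computed from the raw variables, then centered ...
M_then_center = (X + Y) / 2 - ((X + Y) / 2).mean()
# ... versus the variables centered first, then averaged.
center_then_M = ((X - X.mean()) + (Y - Y.mean())) / 2

print(np.allclose(M_then_center, center_then_M))   # True: the same centered M

# PCA based on the covariance matrix = eigen-decomposition of the sample covariance.
sample_cov = np.cov(X, Y)
eigvals = np.linalg.eigvalsh(sample_cov)            # ascending order
print("variances of P2, P1:", eigvals)
```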

Below are vector representations of the variables in subject space, which is convenient for depicting analyses of a few correlated variables (see other answers where I have used such plots). On the pictures, since the data were centered, the lengths of all the vectors shown equal the st. deviations of the respective variables. For example, the length of $X$ is $\sqrt{\sigma_x^2}$. Angles display correlations: the cosine of the angle between the $X$ and $Y$ vectors is $\cos 45^\circ = .7071$, the Pearson $r$ observed between the two variables.

Construct which is the Mean of the variables (pic. 1). The $M$ variable is half of the vector $S$, which is the $X+Y$ variable. The squared lengths of $S$ and $M$ - the variances $\sigma_s^2$ and $\sigma_m^2$ - are easily computed by the parallelogram rule. We consider the red vector ($M$, or $S$) as the construct, i.e. the one variable which serves as a substitute for the two, $X$ and $Y$. It is like principal component 1 (also a construct), but it is different.
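
For the record, the parallelogram-rule computation amounts to the familiar variance-of-a-sum formulas (writing $\sigma_{xy} = r\,\sigma_x\sigma_y$ for the covariance):

$$\sigma_s^2 = \sigma_x^2 + \sigma_y^2 + 2\sigma_{xy}, \qquad \sigma_m^2 = \tfrac{1}{4}\,\sigma_s^2 .$$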

[Pic. 1: $X$, $Y$ and the mean construct $M$ (half of $S=X+Y$) in subject space, with loadings $a$ and complementary coordinates $h$.]

We may project $X$ and $Y$ on the red construct to obtain their coordinates on that "axis", $a_{xs}$ and $a_{ys}$. Note that these quantities can be called "loadings", by analogy with loadings in PCA (in PCA, loadings are the coordinates you see on a loading plot or biplot). Now, what is the most interesting thing for us on pic. 1? It is the equality of the variables' coordinates on the axis complementary and orthogonal to the construct: $h_x=h_y$; call this quantity $h$. So, what is most characteristic of taking the mean of two variables as the construct is that this decision constrains the left-out, complementary (to the $a$'s) loadings $h_x$ and $h_y$ to be equal.$^1$

$^1$ A reader familiar with subject-space vector representations will immediately recognize in the $a$'s and the $h$'s the facets of simple regression (compare, e.g., with the 2nd picture here). Clearly, loading $a_{xs}$, for example, is (the st. dev. of) the prediction of $X$ by $M$, and $h_x$ is (the st. dev. of) the error term of that regression. Since we have only two variables defining the mean, the two $h$'s are equal.
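
Written out, these are the usual simple-regression quantities (assuming, as here, positive correlations so that no sign flips are needed):

$$a_{xs} = \frac{\operatorname{cov}(X,M)}{\sigma_m} = r_{xm}\,\sigma_x, \qquad h_x = \sigma_x\sqrt{1-r_{xm}^2},$$

and likewise for $Y$.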

Principal components (pic. 2). On the following picture, the PCA is displayed. The construct is principal component 1, $P_1$, and we notice that it - gravitating toward the longer variable - has gone lower than the parallelogram's diagonal, the construct $M$ (=$S$). Its complement, $P_2$, is principal component 2, which is to be left out. The squared lengths of $P_1$ and $P_2$ - the variances $\sigma_1^2$ and $\sigma_2^2$ - are the eigenvalues of the covariance matrix and are computed accordingly.
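
For a $2\times 2$ covariance matrix these eigenvalues have a simple closed form (again with $\sigma_{xy}$ the covariance of $X$ and $Y$):

$$\sigma_{1,2}^2 = \lambda_{1,2} = \frac{\sigma_x^2+\sigma_y^2}{2} \pm \sqrt{\left(\frac{\sigma_x^2-\sigma_y^2}{2}\right)^{\!2} + \sigma_{xy}^2}\, .$$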

[Pic. 2: principal components $P_1$ and $P_2$ of $X$ and $Y$ in subject space, with their loadings.]

The loadings - the coordinates of the variable vectors on the component vectors - are the subscripted quantities shown in black bold. We know that the constraint which is the essence of PCA is to maximize the variance of $P_1$ (which equals the sum of its squared loadings, $a_{x1}^2+a_{y1}^2$) and consequently to minimize the variance of $P_2$ (equal to $a_{x2}^2+a_{y2}^2$). In other words, the grey shaded area on the picture is what gets minimized. This is a different constraint from the equality of the two loadings that was the constraint seen on pic. 1.
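
For covariance-based PCA, the loadings are the (unit-norm) eigenvector entries scaled by the component's st. deviation, which is why the sum of squared loadings of a component equals its variance:

$$a_{x1} = v_{x1}\sqrt{\lambda_1}, \quad a_{y1} = v_{y1}\sqrt{\lambda_1}, \qquad a_{x1}^2 + a_{y1}^2 = (v_{x1}^2 + v_{y1}^2)\,\lambda_1 = \lambda_1 = \sigma_1^2 .$$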

The Mean can be rescaled into Component 1, but it is weaker than Principal component 1 (pic. 3). The "axiom" of PCA is that all the components together restore the total variance of the variables: $\sigma_1^2 + \sigma_2^2 = \sigma_x^2 + \sigma_y^2$. Actually, this is true of any component analysis, not just a "principal" one. Can we turn our mean (or sum) variable $M$, chosen as the construct on pic. 1, into component 1 (albeit not "principal component 1") of a component analysis which could thus be labelled, for a moment, a "mean component analysis"?
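
In matrix terms, this restoration of total variance by the principal components is simply the trace of the covariance matrix:

$$\sigma_1^2 + \sigma_2^2 = \lambda_1 + \lambda_2 = \operatorname{tr}\begin{pmatrix}\sigma_x^2 & \sigma_{xy}\\ \sigma_{xy} & \sigma_y^2\end{pmatrix} = \sigma_x^2 + \sigma_y^2 .$$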

[Pic. 3: the mean construct $M$ rescaled into component $C_1$; based on pic. 1.]

Yes, why not (see pic. 3, which is based on pic. 1). We only have to rescale the variance (length) of $M$ so that it, together with its left-out orthogonal complement, sums to $\sigma_x^2 + \sigma_y^2$. We can compute the variance of component $C_1$ (the rescaled $M$) as soon as we know its loadings, as $a_{x1}^2+a_{y1}^2$; and we can know the loadings by the Pythagorean theorem as soon as we know the error term $h$. The latter can be found as the altitude of the triangle (shaded beige) with known sides.
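
One way to write that computation down, equivalent to finding the altitude of the beige triangle (using the quantities defined earlier, with $M$ giving the direction of the projection axis):

$$h^2 = \sigma_x^2 - \left(\frac{\operatorname{cov}(X,M)}{\sigma_m}\right)^{\!2}, \qquad \sigma_{C_1}^2 = a_{x1}^2 + a_{y1}^2 = (\sigma_x^2 - h^2) + (\sigma_y^2 - h^2) = \sigma_x^2 + \sigma_y^2 - 2h^2 .$$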

We do it, and find that the variance of $C_1$ is smaller than the variance of $P_1$ (the vector lengths, i.e. the st. deviations, are 5.988 vs 6.006). Selecting the mean as (the basis of) the one component we keep as a substitute for the initial variables is possible and can make sense, yet it is less optimal as a keeper of variance than the true PCA component, which is therefore the recommended strategy.
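
Here is a minimal sketch of that comparison. The covariance matrix below is assumed for illustration (the one in the post is available only as an image), so the printed st. deviations will differ from the 5.988 vs 6.006 quoted above; but the ordering $\sigma_{C_1} \le \sigma_{P_1}$ always holds, with equality only when the mean direction coincides with the first eigenvector:

```python
import numpy as np

# Assumed example: correlation cos(45 deg) ~ .7071, unequal variances.
# The post's actual covariance matrix was shown only as an image.
sx2, sy2, r = 36.0, 25.0, 0.7071
cov = np.array([[sx2, r * np.sqrt(sx2 * sy2)],
                [r * np.sqrt(sx2 * sy2), sy2]])

# Variance of principal component 1 = largest eigenvalue of the covariance matrix.
var_p1 = np.linalg.eigvalsh(cov)[-1]

# Mean construct M = (X + Y) / 2.
w = np.array([0.5, 0.5])
var_m = w @ cov @ w                 # variance of M
cov_xm, cov_ym = cov @ w            # cov(X, M) and cov(Y, M)

# Error term h (equal for X and Y when the construct is their mean).
h2 = sx2 - (cov_xm / np.sqrt(var_m)) ** 2

# "Mean component" C1: its variance is the sum of squared loadings.
var_c1 = (sx2 - h2) + (sy2 - h2)

print("st. dev. of P1:", np.sqrt(var_p1))
print("st. dev. of C1:", np.sqrt(var_c1))   # never exceeds the st. dev. of P1
```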

This answer has examined the relation between PCA and the mean-construct in the simple case of two variables (I believe it could be extended, with some effort, to more variables, except for the inability to draw pictures in 4+ dimensions and the complicating fact that the error terms $h$ will generally not all be equal then). Another answer elsewhere considers conditions under which the mean-construct can be a good substitute for PCA.

See also a related discussion regarding computing scores vs summation/averaging in factor analysis.
