Does the first principal component differ from simply computing the mean of all variables?

Tags: feature-engineering, mean, pca

I am computing the first principal component for a dataset of 18 variables, and I was wondering: is it different from simply adding all the variables and taking their mean, i.e. from computing the variable that is the mean of my variables?

In fact, does the first principal component account for more variance than that mean variable? Why, and how?

Best Answer

Some purely geometric meditation on the question. How can we graphically compare "mean as the construct" with "PC1 as the construct"?

Let us have two variables $X$ and $Y$ with this covariance matrix

[Image: the covariance matrix of $X$ and $Y$, given in the original post.]

and we have computed the variable which is the sum $S$ or the mean $M$ of $X$ and $Y$. We also performed PCA of the $X,Y$ centered data (i.e. PCA based on the covariance matrix). [Please note that it makes no difference whether you compute $M$ (or $S$) from the raw variables and then center the result, or you first center the variables and then compute $M$. The same, centered $M$ appears.]
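
As a quick numeric sanity check of the bracketed note about centering, here is a minimal sketch. The data are simulated under an assumed covariance structure (the post's actual covariance matrix appears only as an image), so only the qualitative behaviour matters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative covariance: correlation ~ .7071 and unequal variances.
cov = np.array([[36.0, 0.7071 * 6.0 * 5.0],
                [0.7071 * 6.0 * 5.0, 25.0]])
X, Y = rng.multivariate_normal([10.0, 20.0], cov, size=500).T

# Mean variable computed from the raw variables, then centered ...
M_then_center = (X + Y) / 2 - ((X + Y) / 2).mean()
# ... versus the variables centered first, then averaged.
center_then_M = ((X - X.mean()) + (Y - Y.mean())) / 2

print(np.allclose(M_then_center, center_then_M))   # True: the same centered M

# PCA based on the covariance matrix = eigen-decomposition of the sample covariance.
sample_cov = np.cov(X, Y)
eigvals = np.linalg.eigvalsh(sample_cov)            # ascending order
print("variances of P2, P1:", eigvals)
```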

Below are vector representations of the variables in subject space, which is convenient for depicting analyses of a few correlated variables (see other answers where I have used such plots). On the pictures, since the data were centered, the lengths of all the vectors shown equal the st. deviations of the respective variables. For example, the length of $X$ is $\sqrt{\sigma_x^2}$. Angles display correlations: the cosine of the angle between the $X$ and $Y$ vectors is $\cos 45^\circ = .7071$, the Pearson $r$ observed between the two variables.

Construct which is the Mean of the variables (pic. 1). The $M$ variable is half of the vector $S$, which is the $X+Y$ variable. The squared lengths of $S$ and $M$ - the variances $\sigma_s^2$ and $\sigma_m^2$ - are easily computed by the parallelogram rule. We consider the red vector ($M$, or $S$) as the construct, i.e. the one variable which serves as a substitute for the two, $X$ and $Y$. It is like principal component 1 (also a construct), but it is different.
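
For the record, the parallelogram-rule computation amounts to the familiar variance-of-a-sum formulas (writing $\sigma_{xy} = r\,\sigma_x\sigma_y$ for the covariance):

$$\sigma_s^2 = \sigma_x^2 + \sigma_y^2 + 2\sigma_{xy}, \qquad \sigma_m^2 = \tfrac{1}{4}\,\sigma_s^2 .$$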

[Pic. 1: $X$, $Y$ and the mean construct $M$ (half of $S=X+Y$) in subject space, with loadings $a$ and complementary coordinates $h$.]

We may project $X$ and $Y$ on the red construct to obtain their coordinates on that "axis", $a_{xs}$ and $a_{ys}$. Note that these quantities can be called "loadings", by analogy with loadings in PCA (in PCA, loadings are the coordinates you see on a loading plot or biplot). Now, what is the most interesting thing for us on pic. 1? It is the equality of the variables' coordinates on the axis complementary and orthogonal to the construct: $h_x=h_y$; call this quantity $h$. So, what is most characteristic of taking the mean of two variables as the construct is that this decision constrains the left-out, complementary (to the $a$'s) loadings $h_x$ and $h_y$ to be equal.$^1$

$^1$ A reader familiar with subject-space vector representations will immediately recognize in the $a$'s and the $h$'s the facets of simple regression (compare, e.g., with the 2nd picture here). Clearly, loading $a_{xs}$, for example, is (the st. dev. of) the prediction of $X$ by $M$, and $h_x$ is (the st. dev. of) the error term of that regression. Since we have only two variables defining the mean, the two $h$'s are equal.
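
Written out, these are the usual simple-regression quantities (assuming, as here, positive correlations so that no sign flips are needed):

$$a_{xs} = \frac{\operatorname{cov}(X,M)}{\sigma_m} = r_{xm}\,\sigma_x, \qquad h_x = \sigma_x\sqrt{1-r_{xm}^2},$$

and likewise for $Y$.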

Principal components (pic. 2). On the following picture, the PCA is displayed. The construct is principal component 1, $P_1$, and we notice that it - gravitating toward the longer variable - has gone lower than the parallelogram's diagonal, the construct $M$ (=$S$). Its complement, $P_2$, is principal component 2, which is to be left out. The squared lengths of $P_1$ and $P_2$ - the variances $\sigma_1^2$ and $\sigma_2^2$ - are the eigenvalues of the covariance matrix and are computed accordingly.
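
For a $2\times 2$ covariance matrix these eigenvalues have a simple closed form (again with $\sigma_{xy}$ the covariance of $X$ and $Y$):

$$\sigma_{1,2}^2 = \lambda_{1,2} = \frac{\sigma_x^2+\sigma_y^2}{2} \pm \sqrt{\left(\frac{\sigma_x^2-\sigma_y^2}{2}\right)^{\!2} + \sigma_{xy}^2}\, .$$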

[Pic. 2: principal components $P_1$ and $P_2$ of $X$ and $Y$ in subject space, with their loadings.]

The loadings - the coordinates of the variable vectors on the component vectors - are the subscripted quantities shown in black bold. We know that the constraint which is the essence of PCA is to maximize the variance of $P_1$ (which equals the sum of its squared loadings, $a_{x1}^2+a_{y1}^2$) and consequently to minimize the variance of $P_2$ (equal to $a_{x2}^2+a_{y2}^2$). In other words, the grey shaded area on the picture is what gets minimized. This is a different constraint from the equality of the two loadings that was the constraint seen on pic. 1.
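
For covariance-based PCA, the loadings are the (unit-norm) eigenvector entries scaled by the component's st. deviation, which is why the sum of squared loadings of a component equals its variance:

$$a_{x1} = v_{x1}\sqrt{\lambda_1}, \quad a_{y1} = v_{y1}\sqrt{\lambda_1}, \qquad a_{x1}^2 + a_{y1}^2 = (v_{x1}^2 + v_{y1}^2)\,\lambda_1 = \lambda_1 = \sigma_1^2 .$$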

The Mean can be rescaled into Component 1, but it is weaker than Principal component 1 (pic. 3). The "axiom" of PCA is that all the components together restore the total variance of the variables: $\sigma_1^2 + \sigma_2^2 = \sigma_x^2 + \sigma_y^2$. Actually, this is true of any component analysis, not just a "principal" one. Can we turn our mean (or sum) variable $M$, chosen as the construct on pic. 1, into component 1 (albeit not "principal component 1") of a component analysis which could thus be labelled, for a moment, a "mean component analysis"?
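
In matrix terms, this restoration of total variance by the principal components is simply the trace of the covariance matrix:

$$\sigma_1^2 + \sigma_2^2 = \lambda_1 + \lambda_2 = \operatorname{tr}\begin{pmatrix}\sigma_x^2 & \sigma_{xy}\\ \sigma_{xy} & \sigma_y^2\end{pmatrix} = \sigma_x^2 + \sigma_y^2 .$$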

[Pic. 3: the mean construct $M$ rescaled into component $C_1$; based on pic. 1.]

Yes, why not (see pic. 3, which is based on pic. 1). We only have to rescale the variance (length) of $M$ so that it, together with its left-out orthogonal complement, sums to $\sigma_x^2 + \sigma_y^2$. We can compute the variance of component $C_1$ (the rescaled $M$) as soon as we know its loadings, as $a_{x1}^2+a_{y1}^2$; and we can know the loadings by the Pythagorean theorem as soon as we know the error term $h$. The latter can be found as the altitude of the triangle (shaded beige) with known sides.
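
One way to write that computation down, equivalent to finding the altitude of the beige triangle (using the quantities defined earlier, with $M$ giving the direction of the projection axis):

$$h^2 = \sigma_x^2 - \left(\frac{\operatorname{cov}(X,M)}{\sigma_m}\right)^{\!2}, \qquad \sigma_{C_1}^2 = a_{x1}^2 + a_{y1}^2 = (\sigma_x^2 - h^2) + (\sigma_y^2 - h^2) = \sigma_x^2 + \sigma_y^2 - 2h^2 .$$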

We do it, and find that the variance of $C_1$ is smaller than the variance of $P_1$ (the vector lengths, i.e. the st. deviations, are 5.988 vs 6.006). Selecting the mean as (the basis of) the one component we keep as a substitute for the initial variables is possible and can make sense, yet it is less optimal as a keeper of variance than the true PCA component, which is therefore the recommended strategy.
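
Here is a minimal sketch of that comparison. The covariance matrix below is assumed for illustration (the one in the post is available only as an image), so the printed st. deviations will differ from the 5.988 vs 6.006 quoted above; but the ordering $\sigma_{C_1} \le \sigma_{P_1}$ always holds, with equality only when the mean direction coincides with the first eigenvector:

```python
import numpy as np

# Assumed example: correlation cos(45 deg) ~ .7071, unequal variances.
# The post's actual covariance matrix was shown only as an image.
sx2, sy2, r = 36.0, 25.0, 0.7071
cov = np.array([[sx2, r * np.sqrt(sx2 * sy2)],
                [r * np.sqrt(sx2 * sy2), sy2]])

# Variance of principal component 1 = largest eigenvalue of the covariance matrix.
var_p1 = np.linalg.eigvalsh(cov)[-1]

# Mean construct M = (X + Y) / 2.
w = np.array([0.5, 0.5])
var_m = w @ cov @ w                 # variance of M
cov_xm, cov_ym = cov @ w            # cov(X, M) and cov(Y, M)

# Error term h (equal for X and Y when the construct is their mean).
h2 = sx2 - (cov_xm / np.sqrt(var_m)) ** 2

# "Mean component" C1: its variance is the sum of squared loadings.
var_c1 = (sx2 - h2) + (sy2 - h2)

print("st. dev. of P1:", np.sqrt(var_p1))
print("st. dev. of C1:", np.sqrt(var_c1))   # never exceeds the st. dev. of P1
```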

This answer has examined the relation between PCA and the mean-construct in the simple case of two variables (I believe it could be extended, with some effort, to more variables, except for the inability to draw pictures in 4+ dimensions and the complicating fact that the error terms $h$ will generally not all be equal then). Another answer elsewhere considers conditions under which the mean-construct can be a good substitute for PCA.

See also a related discussion regarding computing scores vs summation/averaging in factor analysis.
