PCA – Using Multiple Regression to Predict One Principal Component (PC) from Others

pcaregression

A while ago a user on R-help mailing list asked about the soundness of using PCA scores in a regression. The user is trying to use some PC scores to explain variation in another PC (see full discussion here). The answer was that no, this is not sound because PCs are orthogonal to each other.

Can someone explain in a bit more detail why this is so?

Best Answer

A principal component is a weighted linear combination of all your factors (X's).

example: PC1 = 0.1X1 + 0.3X2

There will be one component for each factor (though in general a small number are selected).

The components are created such that they have zero correlation (are orthogonal), by design.

Therefore, component PC1 should not explain any variation in component PC2.

You may want to do regression on your Y variable and the PCA representation of your X's, as they will not have multi-collinearity. However, this could be hard to interpret.

If you have more X's than observations, which breaks OLS, you can regress on your components, and simply select a smaller number of the highest variation components.

Principal Component Analysis by Jollife a very in-depth and highly cited book on the subject

This is also good: http://www.statsoft.com/textbook/principal-components-factor-analysis/

Related Solutions

PCA Eigenvectors – Why PCA Eigenvectors Are Orthogonal and Their Relation to Uncorrelated PCA Scores

I will try to explain how orthogonality of $a_1$ and $a_2$ ensures that $y_1$ and $y_2$ be uncorrelated. We want $a_1$ to maximize $Var(y_1)=a_1^T \Sigma a_1$. This will not be achieved unless we constrain $a_1$, in this case by $a_1^T a_1=1$. This optimization calls for the use of a Lagrange Multiplier (it's not too complicated, read about it on Wikipedia). We thus try to maximize \begin{equation} a_1^T \Sigma a_1 - \lambda(a_1^T a_1-1) \end{equation} with respect to both $a_1$ and $\lambda$. Notice that differentiation with respect to $\lambda$ and then equating to $0$ gives our constraint $a_1^T a_1=1$. Differentiation with respect to $a_1$ gives \begin{equation} \Sigma a_1 -\lambda a_1 =0 \end{equation} or \begin{equation} (\Sigma -\lambda I_p)a_1=0 \end{equation} variance of $y_1$ will be maximized by the greatest eigenvalue $\lambda_1$. Thus $\lambda_1 a_1=\Sigma a_1$. Here comes the part that will answer your question. Some elementary calculations using the definition of covariance will show that \begin{equation} Cov(y_1,y_2)=Cov(a^T_1 x,a^T_2 x)=a^T_1\Sigma a_2=a^T_2\Sigma a_1=a^T_2\lambda_1 a_1=\lambda_1 a^T_2 a_1 \end{equation} which will equal $0$ if and only if $a^T_2 a_1=0$.

Solved – Can MDS coordinates be used as variables in further analyses

As someone who has used that analogy and tells people that it is incorrect or inappropriate to take nMDS "axes" and use them in subsequent analyses as if they were principal components, I should probably address this.

Firstly, the analogy; when I have used this, perhaps I have not been clear enough to specifically mention the issue of mapping. Yes, people use easting and northing, or latitude and longitude as explanatory variables in subsequent models, but they do this because these variables mean something. By that I mean these variables are related to features of interest, such as climate. A species probably doesn't care nor know what latitude it's supposed to live at, but where that species is found is determined by many things, some of which are related in some way to the latitudes of the locations where that species is observed to inhabit.

What I have always meant to express with the analogy is that nMDS is also a mapping of the original dissimilarities into a (usually) 2D space. If you were to map locations of towns or cities in Europe say, you'd do a poor job if you only used one of the X or Y spatial coordinates. Put another way, seeing as these locations are already mapped, you'd do a pretty poor job of explaining or summarizing that map if you only used one of the X or Y spatial coordinates. You need both of these coordinates to do a reasonable job of preserving the distances between the locations.

Even that analogy is incomplete however. At least there, an easting and northing or a latitude and longitude describe locations of places in a system where the two axes of the space relate physically to something. These are not arbitrary coordinate systems at the scales at which they are used. It's not like latitude actually measures distances from a line passing through the earth at some randomly determined angle to the equator.

In an nMDS, the mapping you get is just that, a mapping. But there the "axes" are entirely arbitrary; you can rotate and reflect the resultant mapping an infinite number of ways and you will never change the information contained in the mapping or configuration of points. To me, it seems quite strange to think that including an arbitrary "axis" like this in a model makes any kind of intuitive sense. Why use the horizontal nMDS "axis" when a line at 7 degrees to the horizontal might better explain the spread of samples in the 2D space? Or 37 degrees, or 12? nMDS is trying to find a low-D mapping of the data that only tries to preserve the rank order of the original dissimilarities between sites. It is a step removed from the original data. PCA on the other hand is directly decomposing the data so that the largest directions of variation in the original data are represented using only the first few PCs.

Remember also that this is a mapping that aims to preserve in $k$ dimensions the rank order of the dissimilarities between sites. If you ask for a 2D mapping, you need both coordinates to preserve the resultant mapping. Neither of the horizontal or vertical "axes" correspond to the mapping you'd get if you asked for a 1D solution or a 3D solution. If you want an nMDS mapping you need to use the full solution. You can't pick and choose.

Why doing what you describe might work quite well in practice (not that that is a good reason to do anything) is because of tricks that are often employed in software that implements nMDS. In the vegan package for R, Jari Oksanen implemented some best-practice recommendations from the ecology literature. One of those involves rotating the resultant nMDS mapping using principal components so that we get a nicer map; what this does it set the direction of greatest spread in the locations of samples in the original nMDS mapping as the new horizontal axis (NMDS1), and the next largest direction of spread (subject to that direction being orthogonal to the first) is used as the new vertical axis of the plot (NMDS2). Now, if there are strong gradients in your species data, these should be evident in the dissimilarities and evident in the nMDS mapping, and because we just rotated the mapping using principal components, it is not surprising that NMDS1 can also be interpreted in terms of the environment or as representing some component of species compositional change for use in a subsequent model. So taking NMDS1 only works as well as it does in practice because we made the arbitrary choice to rotate the original nMDS solution using PCA.

Finally, a comment on the use of PCA axes. Often people just take PC1. Or perhaps they take PCs 1-3 because some test or other suggested that they were the "significant" axes. What they are doing is taking the coordinates of a mapping of the data into 1 or 3 dimensions and then using all of the dimensions that represent structure in the original data. That some people sometimes take PC2 only, reflects the fact that the individual PCs are orthogonal to one another — PC2 explains a different part of the variation in the original data set (when decomposed via linear combinations of the original variables). nMDS solutions are not necessarily (and are unlikely to be) orthogonal in terms of the original data (the fact that in some software we do rotation so that in the rotated coordinate system the new axes are orthogonal to one another, this is not with respect to the original data).

in short nMDS "axes" and PCA axes are two very different things and shouldn't be thought of as being the same kind of thing.

This is not to say that you can't take the nMDS mapping and use it in a model; we already do that in two different ways in vegan for example. In envfit() we fit the linear model

$$y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$

where $y$ is the thing we're interested in and $X_1$ and $X_2$ are the coordinates on NMDS1 and NMDS2 respectively. This fits a plane through the mapping in terms of these two coordinates. You could extend this to an interaction term too, although because of the PCA rotation we do with metaMDS(), this is probably not that useful if you're assuming a linear model. I also suspect that this is going to give you a terrible fit if the truth is not linear and I think we would have good reason to think the underlying relationships are nonlinear because the nMDS only aims to preserve the rank order of the original dissimilarities.

We also fit a nonlinear model of a similar nature using ordisurf(), which fits the GAM

$$y = f(X_1, X_2) + \varepsilon$$

with the same meaning for $X_1$ and $X_2$. Now the model describes a smooth surface.

If you were to include something like the latter in a subsequent model, then I would argue that this usage is perfectly acceptable. What I disagree with is why it makes any sense whatsoever to pick and choose from the resulting nMDS "axes" to use one of them in a model. What the GAM/ordisurf() method is doing is determining if there is any relationship between the spatial configuration of the samples in nMDS space and the response. As we're using the configuration as a whole, we're respecting the way that configuration was determined and the special nature of the rank-order preserving behaviour of Euclidean distances among the samples in the configuration.

Best Answer

Related Solutions

PCA Eigenvectors – Why PCA Eigenvectors Are Orthogonal and Their Relation to Uncorrelated PCA Scores

Solved – Can MDS coordinates be used as variables in further analyses

Related Question