Solved – Conclusions from output of a principal component analysis


I am trying to understand the output of a principal component analysis performed as follows:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> res = prcomp(iris[1:4], scale=T)
> res
Standard deviations:
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation:
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
> 
> summary(res)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
> 

I tend to conclude the following from the above output:

  1. The proportion of variance indicates how much of the total variance is captured by a particular principal component. Hence, PC1 alone explains about 73% of the total variance of the data (this can be checked numerically; see the sketch after this list).

  2. The rotation values shown are the same as the 'loadings' mentioned in some descriptions.

  3. Considering the rotations on PC1, one can conclude that Sepal.Length, Petal.Length and Petal.Width are directly related, and that they are all inversely related to Sepal.Width (which has a negative value in the rotation for PC1).

  4. There may be a factor in plants (some chemical/physical functional system, etc.) that affects all these variables: Sepal.Length, Petal.Length and Petal.Width in one direction, and Sepal.Width in the opposite direction.

  5. If I want to show all rotations in one graph, I can show their relative contribution to total variation by multiplying each rotation by the proportion of variance of that principal component. For example, for PC1, the rotations of 0.52, -0.26, 0.58 and 0.56 would all be multiplied by 0.73 (the proportion of variance for PC1 shown in the summary(res) output).
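
A quick way to check points 1 and 2 is to recompute them from prcomp's documented return values: sdev holds the component standard deviations and rotation holds the loadings.

res <- prcomp(iris[1:4], scale = TRUE)

# (1) Proportion of variance: each component's variance over the total
prop_var <- res$sdev^2 / sum(res$sdev^2)
round(prop_var, 4)
# 0.7296 0.2285 0.0367 0.0052  (matches the summary(res) row above)

# (2) The "Rotation" matrix is the matrix of loadings, one column per component
res$rotation[, "PC1"]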

Am I right about the above conclusions?

Edit regarding question 5: I want to show all the rotations in a simple bar chart as follows:

[bar chart: loadings of the four variables, grouped by principal component]

Since PC2, PC3 and PC4 make progressively smaller contributions to the variation, would it make sense to adjust (reduce) the loadings of the variables there?

Best Answer

  1. Yes. This is the correct interpretation.
  2. Yes, the rotation values are the component loading values. This is confirmed by the prcomp documentation, though I'm not sure why they label this part of the output "Rotation", as it implies the loadings have been rotated using some orthogonal (likely) or oblique (less likely) method.
  3. While it does appear to be the case that Sepal.Length, Petal.Length, and Petal.Width are all positively associated, I would not put as much stock in the small negative loading of Sepal.Width on PC1; it loads much more strongly (almost exclusively) on PC2. To be clear, Sepal.Width is still likely negatively associated with the other three variables, but it just doesn't seem to be strongly related to the first principal component.
  4. Based on this question, I wonder whether you would be better served by a common factor (CF) analysis rather than a principal components analysis (PCA). CF is the more appropriate data-reduction technique when your goal is to uncover meaningful theoretical dimensions, such as the plant factor that you hypothesize may affect Sepal.Length, Petal.Length, and Petal.Width. I appreciate you're from some sort of biological science, botany perhaps, but there's some good writing in psychology on the PCA vs. CF distinction by Fabrigar et al., 1999, Widaman, 2007, and others. The core distinction between the two is that PCA assumes all variance is true-score variance, with no error assumed, whereas CF partitions true-score variance from error variance before factors are extracted and factor loadings are estimated. Ultimately you might get a similar-looking solution, people sometimes do, but when the two diverge, PCA tends to overestimate loading values and underestimate the correlations between components. An additional perk of the CF approach is that you can use maximum likelihood estimation to perform significance tests of loading values, while also getting some indices of how well your chosen solution (1 factor, 2 factors, 3 factors, or 4 factors) explains your data (see the factanal() sketch after this list).
  5. I would plot the factor loading values as you have, without weighting their bars by the proportion of variance for their respective components. I understand what you want to show with such an approach, but I think it would likely lead readers to misunderstand the component loading values from your analysis. However, if you want a visual way of showing the relative magnitude of variance accounted for by each component, you might consider varying the opacity of the grouped bars (in ggplot2, this is done with the alpha aesthetic) based on the proportion of variance explained by each component (i.e., more solid colors = more variance explained); a sketch of this follows below. That said, in my experience your figure is not a typical way of presenting the results of a PCA; a table or two (loadings + variance explained in one, component correlations in another) would be much more straightforward.
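
For point 4, a minimal sketch of the common factor approach in base R: factanal() fits a maximum-likelihood factor model and reports a chi-square test of whether the chosen number of factors is sufficient. (With only four variables, one factor is the most it can fit before the degrees of freedom go negative.)

# ML common factor analysis on the same four measurements
fa <- factanal(iris[1:4], factors = 1)
fa
# The printout shows the factor loadings (estimated after partitioning
# out unique/error variance) and a test of the hypothesis that one
# factor is sufficient to explain the observed correlations.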
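For point 5, a minimal ggplot2 sketch of the alpha idea (a sketch only; the data frame layout and styling choices here are my own): reshape the rotation matrix to long format and map each component's proportion of variance to bar opacity, so the plotted loadings themselves stay unweighted.

library(ggplot2)

res <- prcomp(iris[1:4], scale = TRUE)

# One row per variable x component; as.vector() reads the matrix column-wise
load_df <- data.frame(
  Variable  = rep(rownames(res$rotation), times = ncol(res$rotation)),
  Component = rep(colnames(res$rotation), each  = nrow(res$rotation)),
  Loading   = as.vector(res$rotation)
)

# Proportion of variance per component drives the bar opacity
prop_var <- res$sdev^2 / sum(res$sdev^2)
load_df$PropVar <- prop_var[match(load_df$Component, colnames(res$rotation))]

ggplot(load_df, aes(x = Variable, y = Loading, fill = Component, alpha = PropVar)) +
  geom_col(position = "dodge") +
  scale_alpha(range = c(0.3, 1), guide = "none") +
  labs(y = "Component loading")

More solid bars then correspond to components that explain more variance, without altering the loading values being displayed.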

References

Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4, 272-299.

Widaman, K. F. (2007). Common factors versus components: Principals and principles, errors, and misconceptions. In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historic developments and future directions (pp. 177-203). Mahwah, NJ: Lawrence Erlbaum.