Solved – How to calculate principal components scores in a compositional data PCA

compositional-datapcar

I hope someone can help. Please let me know if my question is not clear.

I have compositional data, where there are three variables which sum to 100% for each row, something like this:

        grp1     grp2     grp3
1        40       22       38
2        46       31       23
3        18       55       27
4        40       16       44
5        39       38       23
6        32       41       27
7        49       29       22
8        13       69       19
9        37       21       42
10       58        3       38

I'm using R, and within the package robCompositions I found the function pcaCoDa to carry out a principal components analysis on compositional data like this, as I want to reduce the data down to the first principal component (in the hope that this will provide a single variable that explains a majority of the variation across the three variables).

I then ran the PCA with:

pca.res <- pcaCoDa(data[,3:5], method="robust")

where data[,3:5] describes the columns in my full dataset where groups 1-3 are.

I know that I can get the scores for each row of the dataset along each of the principal component vectors directly from the PCA output simply by using pca.res$scores[,1] (for the first PC). But I would like to understand how these scores are calculated (partly for my understanding, and partly so I can project new data into the same PC space). I've done this previously in a 'normal' PCA (I mean, not with compositional data). My understanding is that the PC score is calculated by firstly centering the data, then by multiplying the centered data with the PCA loadings.

I also think I understand that within the compositional data PCA, the data is transformed using a centered log-ratio transformation, so I use this transformation on my raw data initially. Then I try the PC calculation I describe above with this data. When I compare my calculated PC scores with the scores in the PCA output, I get something close-but-not-really, and I don't understand where I'm going wrong. I'm hoping someone could explain? Is there something particular to the compositional data PCA that I am missing?

Thank you very much for your time and consideration!

I've put some more detail here about my attempt to calculate the PC scores manually.

My CLR-transformed data of the above 3 variables looks like this:

I started by transforming my raw data with a centered log-ratio transformation, my transformed data looks like this:

         grp1       grp2      grp3
1   0.21637677 -0.38146024  0.1650835
2   0.36260046 -0.03205373 -0.3305467
3  -0.50747551  0.60948592 -0.1020104
4   0.27366018 -0.64263055  0.3689704
5   0.18468097  0.15870549 -0.3433865
6  -0.02597904  0.22185712 -0.1958781
7   0.44176744 -0.08275703 -0.3590104
8  -0.68288226  0.98627489 -0.3033926
9   0.14654792 -0.41984755  0.2732996
10  1.12822919 -1.83360153  0.7053723

I created a blank variable in my dataset to put the calculated scores in, and alongside it saved the 'real' scores from the PCA output so I could compare them (only working with PC1 for now):

data$PC1.calc <- 0
data$PC1 <- pca.res$scores[,1]

I created a matrix object with the centered values of the data to be projected – I used the CLR-transformed data, and the centering value as saved in the PCA output:

centdat <- as.matrix(data[,20:22] - pca.res$princompOutputClr$center[1])

I then multiplied this centered data by the loadings for PC1 from the PCA output, and saved this all to the empty variable PC1.calc:

for (i in 1:10) {data$PC1.calc[i] <- sum(pca.res$loadings[,1] * centdat[i,])}

A comparison of the PC1 scores from the PCA output (PC1) and the calculated PC1 scores (PC1.calc) using my data example as above, gives the following:

       PC1      PC1.calc
1   0.3098496   0.3892512
2  -0.2857503  -0.2063488
3  -0.5890632  -0.5096616
4   0.6394197   0.7188213
5  -0.4320175  -0.3526159
6  -0.3751055  -0.2957040
7  -0.2689980  -0.1895964
8  -1.0001454  -0.9207439
9   0.4126019   0.4920034
10  1.7304654   1.8098670

Thank you again for any help anyone can offer!

Best Answer

I don't know exactly what robCompositions is doing. Remember, that this package is applying robust statistics.

If you want, you can calculate the first principal coordinate using the coda.base package, or calculate this orthornormal coordinates from the clr-coordinates.

Using `coda.base`

# I am using the first three variables of iris dataset as a composition
X = iris[,1:3]

library(coda.base)
H_pc = coordinates(X, 'pc')

# You have the first coordinate in
H_pc[,1]

The basis is obtained directly from the principal coordinate basis.

attr(H_pc, 'basis')

## Which is equivalent to
H_clr = coordinates(X, 'clr')
prcomp(H_clr)

Without package

To obtain the first principal component you can use the following code

lX = log(X)
clrX = lX - rowMeans(lX)
B = prcomp(clrX)
as.matrix(clrX) %*% B$rotation[,1]

Related Solutions

Solved – Multivariate data analyis of compositional data

Make sure you understand the algorithms before using them.

E.g. k-means minimizes variance, and of course an attribute with a larger scale, with have a much larger variance, too. Therefore, standardizing data is often beneficial there.

But with e.g. hierarchical clustering, you need to give a distance function. Euclidean distance is just one of the many options; and maybe you can be much more specific as to which attribute should have which amount of effect on the result.

The key question is: what is a sensible measure of similarity for your domain. There is no universal measure. With hierarchical clustering, this is just more explicit - K-means is based on the sum-of-squared-deviations, so there you need to rescale / transform your data to give appropriate weight, which is much more limited than specifying a similarity measure for your data.

So: when are two soil samples alike - as you can see this is a domain and purpose question, not so much a statistical question.

PCA Loadings – How to Interpret PCA Loadings

Loadings (which should not be confused with eigenvectors) have the following properties:

Their sums of squares within each component are the eigenvalues (components' variances).
Loadings are coefficients in linear combination predicting a variable by the (standardized) components.

You extracted 2 first PCs out of 4. Matrix of loadings $\bf A$ and the eigenvalues:

A (loadings)
         PC1           PC2
X1   .5000000000   .5000000000 
X2   .5000000000   .5000000000 
X3   .5000000000  -.5000000000 
X4   .5000000000  -.5000000000
Eigenvalues:
    1.0000000000  1.0000000000

In this instance, both eigenvalues are equal. It is a rare case in real world, it says that PC1 and PC2 are of equal explanatory "strength".

Suppose you also computed component values, Nx2 matrix $\bf C$, and you z-standardized (mean=0, st. dev.=1) them within each column. Then (as point 2 above says), $\bf \hat {X}=CA'$. But, because you left only 2 PCs out of 4 (you lack 2 more columns in $\bf A$) the restored data values $\bf \hat {X}$ are not exact, - there is an error (if eigenvalues 3, 4 are not zero).

OK. What are the coefficients to predict components by variables? Clearly, if $\bf A$ were full 4x4, these would be $\bf B=(A^{-1})'$. With non-square loading matrix, we may compute them as $\bf B= A \cdot diag(eigenvalues)^{-1}=(A^+)'$, where diag(eigenvalues) is the square diagonal matrix with the eigenvalues on its diagonal, and + superscript denotes pseudoinverse. In your case:

diag(eigenvalues):
1 0
0 1

B (coefficients to predict components by original variables):
    PC1           PC2
X1 .5000000000   .5000000000 
X2 .5000000000   .5000000000 
X3 .5000000000  -.5000000000 
X4 .5000000000  -.5000000000

So, if $\bf X$ is Nx4 matrix of original centered variables (or standardized variables, if you are doing PCA based on correlations rather than covariances), then $\bf C=XB$; $\bf C$ are standardized principal component scores. Which in your example is:

PC1 = 0.5*X1 + 0.5*X2 + 0.5*X3 + 0.5*X4 ~ (X1+X2+X3+X4)/4

"the first component is proportional to the average score"

PC2 = 0.5*X1 + 0.5*X2 - 0.5*X3 - 0.5*X4 = (0.5*X1 + 0.5*X2) - (0.5*X3 + 0.5*X4)

"the second component measures the difference between the first pair of scores and the second pair of scores"

In this example it appeared that $\bf B=A$, but in general case they are different.

Note: The above formula for the coefficients to compute component scores, $\bf B= A \cdot diag(eigenvalues)^{-1}$, is equivalent to $\bf B=R^{-1}A$, with $\bf R$ being the covariance (or correlation) matrix of variables. The latter formula comes directly from linear regression theory. The two formulas are equivalent within PCA context only. In factor analysis, they are not and to compute factor scores (which are always approximate in FA) one should rely on the second formula.