Solved – How to calculate principal components scores in a compositional data PCA

compositional-datapcar

I hope someone can help. Please let me know if my question is not clear.

I have compositional data, where there are three variables which sum to 100% for each row, something like this:

        grp1     grp2     grp3
1        40       22       38
2        46       31       23
3        18       55       27
4        40       16       44
5        39       38       23
6        32       41       27
7        49       29       22
8        13       69       19
9        37       21       42
10       58        3       38

I'm using R, and within the package robCompositions I found the function pcaCoDa to carry out a principal components analysis on compositional data like this, as I want to reduce the data down to the first principal component (in the hope that this will provide a single variable that explains a majority of the variation across the three variables).

I then ran the PCA with:

pca.res <- pcaCoDa(data[,3:5], method="robust")

where data[,3:5] describes the columns in my full dataset where groups 1-3 are.

I know that I can get the scores for each row of the dataset along each of the principal component vectors directly from the PCA output simply by using pca.res$scores[,1] (for the first PC). But I would like to understand how these scores are calculated (partly for my understanding, and partly so I can project new data into the same PC space). I've done this previously in a 'normal' PCA (I mean, not with compositional data). My understanding is that the PC score is calculated by firstly centering the data, then by multiplying the centered data with the PCA loadings.

I also think I understand that within the compositional data PCA, the data is transformed using a centered log-ratio transformation, so I use this transformation on my raw data initially. Then I try the PC calculation I describe above with this data. When I compare my calculated PC scores with the scores in the PCA output, I get something close-but-not-really, and I don't understand where I'm going wrong. I'm hoping someone could explain? Is there something particular to the compositional data PCA that I am missing?

Thank you very much for your time and consideration!

I've put some more detail here about my attempt to calculate the PC scores manually.

My CLR-transformed data of the above 3 variables looks like this:

I started by transforming my raw data with a centered log-ratio transformation, my transformed data looks like this:

         grp1       grp2      grp3
1   0.21637677 -0.38146024  0.1650835
2   0.36260046 -0.03205373 -0.3305467
3  -0.50747551  0.60948592 -0.1020104
4   0.27366018 -0.64263055  0.3689704
5   0.18468097  0.15870549 -0.3433865
6  -0.02597904  0.22185712 -0.1958781
7   0.44176744 -0.08275703 -0.3590104
8  -0.68288226  0.98627489 -0.3033926
9   0.14654792 -0.41984755  0.2732996
10  1.12822919 -1.83360153  0.7053723

I created a blank variable in my dataset to put the calculated scores in, and alongside it saved the 'real' scores from the PCA output so I could compare them (only working with PC1 for now):

data$PC1.calc <- 0
data$PC1 <- pca.res$scores[,1]

I created a matrix object with the centered values of the data to be projected – I used the CLR-transformed data, and the centering value as saved in the PCA output:

centdat <- as.matrix(data[,20:22] - pca.res$princompOutputClr$center[1])

I then multiplied this centered data by the loadings for PC1 from the PCA output, and saved this all to the empty variable PC1.calc:

for (i in 1:10) {data$PC1.calc[i] <- sum(pca.res$loadings[,1] * centdat[i,])}

A comparison of the PC1 scores from the PCA output (PC1) and the calculated PC1 scores (PC1.calc) using my data example as above, gives the following:

       PC1      PC1.calc
1   0.3098496   0.3892512
2  -0.2857503  -0.2063488
3  -0.5890632  -0.5096616
4   0.6394197   0.7188213
5  -0.4320175  -0.3526159
6  -0.3751055  -0.2957040
7  -0.2689980  -0.1895964
8  -1.0001454  -0.9207439
9   0.4126019   0.4920034
10  1.7304654   1.8098670

Thank you again for any help anyone can offer!

Best Answer

I don't know exactly what robCompositions is doing. Remember, that this package is applying robust statistics.

If you want, you can calculate the first principal coordinate using the coda.base package, or calculate this orthornormal coordinates from the clr-coordinates.

Using coda.base

# I am using the first three variables of iris dataset as a composition
X = iris[,1:3]

library(coda.base)
H_pc = coordinates(X, 'pc')

# You have the first coordinate in
H_pc[,1]

The basis is obtained directly from the principal coordinate basis.

attr(H_pc, 'basis')

## Which is equivalent to
H_clr = coordinates(X, 'clr')
prcomp(H_clr)

Without package

To obtain the first principal component you can use the following code

lX = log(X)
clrX = lX - rowMeans(lX)
B = prcomp(clrX)
as.matrix(clrX) %*% B$rotation[,1]