Solved – How to perform isometric log-ratio transformation

compositional-data, data transformation, multivariate analysis, r

I have data on movement behaviours (time spent sleeping, sedentary, and doing physical activity) that sums to approximately 24 (as in hours per day). I want to create a variable that captures the relative time spent in each of these behaviours – I've been told that an isometric log-ratio transformation would accomplish this.

It looks like I should use the ilr function in R, but can't find any actual examples with code. Where do I start?

The variables I have are time spent sleeping, average sedentary time, average light physical activity, average moderate physical activity, and average vigorous physical activity. Sleep was self-reported, while the others are averages from valid days of accelerometer data. So for these variables, cases do not sum to exactly 24.

My guess:
I'm working in SAS, but it looks like R will be much easier to use for this part. So first import data with only the variables of interest.
Then use acomp() function. Then I can't figure out the syntax for the ilr() function. Any help would be much appreciated.

Best Answer

The ILR (Isometric Log-Ratio) transformation is used in the analysis of compositional data. Any given observation is a set of positive values summing to unity, such as the proportions of chemicals in a mixture or proportions of total time spent in various activities. The sum-to-unity invariant implies that although there may be $k\ge 2$ components to each observation, there are only $k-1$ functionally independent values. (Geometrically, the observations lie on a $k-1$-dimensional simplex in $k$-dimensional Euclidean space $\mathbb{R}^k$. This simplicial nature is manifest in the triangular shapes of the scatterplots of simulated data shown below.)

Typically, the distributions of the components become "nicer" when log transformed. This transformation can be scaled by dividing all values in an observation by their geometric mean before taking the logs. (Equivalently, the logs of the data in any observation are centered by subtracting their mean.) This is known as the "Centered Log-Ratio" transformation, or CLR. The resulting values still lie within a hyperplane in $\mathbb{R}^k$, because the scaling causes the sum of the logs to be zero. The ILR consists of choosing any orthonormal basis for this hyperplane: the $k-1$ coordinates of each transformed observation become its new data. Equivalently, the hyperplane is rotated (or reflected) to coincide with the plane with vanishing $k^\text{th}$ coordinate and one uses the first $k-1$ coordinates. (Because rotations and reflections preserve distance they are isometries, whence the name of this procedure.)
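
To make the centering step concrete, here is a minimal base-R sketch of the CLR. The helper name `clr_rows` and the toy compositions are my own, made up purely for illustration; the check at the end shows the property the ILR relies on, namely that each transformed row sums to zero.

```r
# Minimal CLR sketch in base R (the helper name `clr_rows` is hypothetical).
clr_rows <- function(x) {
  y <- log(x)          # logs of each component
  y - rowMeans(y)      # centre each row: same as log(x / geometric mean)
}

# Two toy compositions (rows sum to 1); values are made up for illustration.
x <- rbind(c(0.1, 0.2, 0.7),
           c(0.3, 0.3, 0.4))
z <- clr_rows(x)
rowSums(z)             # zero up to floating-point error
```

Because every row of `z` sums to zero, the rows lie in a $(k-1)$-dimensional hyperplane, which is why only $k-1$ coordinates are needed afterwards.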

Tsagris, Preston, and Wood state that "a standard choice of [the rotation matrix] $H$ is the Helmert sub-matrix obtained by removing the first row from the Helmert matrix."

The Helmert matrix of order $k$ is constructed in a simple manner (see Harville p. 86 for instance). Its first row is all $1$s. The next row is one of the simplest that can be made orthogonal to the first row, namely $(1, -1, 0, \ldots, 0)$. Row $j$ is among the simplest that is orthogonal to all preceding rows: its first $j-1$ entries are $1$s, which guarantees it is orthogonal to rows $2, 3, \ldots, j-1$, and its $j^\text{th}$ entry is set to $1-j$ to make it orthogonal to the first row (that is, its entries must sum to zero). All rows are then rescaled to unit length.

Here, to illustrate the pattern, is the $4\times 4$ Helmert matrix before its rows have been rescaled:

$$\pmatrix{1&1&1&1 \\ 1&-1&0&0 \\ 1&1&-2&0 \\ 1&1&1&-3}.$$
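
The construction just described can be sketched directly in base R. The function name `helmert` is hypothetical and the code assumes $k \ge 2$; it builds the rows exactly as in the text, rescales them, and checks that the result is orthonormal.

```r
# Build the k x k Helmert matrix row by row, following the description above.
# Assumes k >= 2; the function name `helmert` is hypothetical.
helmert <- function(k) {
  H <- matrix(0, k, k)
  H[1, ] <- 1                          # first row: all ones
  for (j in 2:k) {
    H[j, seq_len(j - 1)] <- 1          # first j-1 entries are ones
    H[j, j] <- 1 - j                   # makes the row sum to zero
  }
  H / sqrt(rowSums(H^2))               # rescale each row to unit length
}

H <- helmert(4)
round(H %*% t(H), 12)                  # identity matrix: rows are orthonormal
```

Dropping the first row, `H[-1, ]`, gives the Helmert sub-matrix that serves as the rotation.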

(Edit added August 2017) One particularly nice aspect of these "contrasts" (which are read row by row) is their interpretability. The first row is dropped, leaving $k-1$ remaining rows to represent the data. The second row is proportional to the difference between the second variable and the first. The third row is proportional to the difference between the third variable and the first two. Generally, row $j$ ($2\le j \le k$) reflects the difference between variable $j$ and all those that precede it, variables $1, 2, \ldots, j-1$. This leaves the first variable $j=1$ as a "base" for all contrasts. I have found these interpretations helpful when following the ILR by Principal Components Analysis (PCA): it enables the loadings to be interpreted, at least roughly, in terms of comparisons among the original variables. I have inserted a line into the R implementation of ilr below that gives the output variables suitable names to help with this interpretation. (End of edit.)

Since R provides a function contr.helmert to create such matrices (albeit without the scaling, and with rows and columns negated and transposed), you don't even have to write the (simple) code to do it. Using this, I implemented the ILR (see below). To exercise and test it, I generated $1000$ independent draws from a Dirichlet distribution (with parameters $1,2,3,4$) and plotted their scatterplot matrix. Here, $k=4$.

The points all clump near the lower left corners and fill triangular patches of their plotting areas, as is characteristic of compositional data.

Their ILR has just three variables, again plotted as a scatterplot matrix:

This does indeed look nicer: the scatterplots have acquired more characteristic "elliptical cloud" shapes, better amenable to second-order analyses such as linear regression and PCA.

Tsagris et al. generalize the CLR by using a Box-Cox transformation, which generalizes the logarithm. (The log is a Box-Cox transformation with parameter $0$.) It is useful because, as the authors (correctly IMHO) argue, in many applications the data ought to determine their transformation. For these Dirichlet data a parameter of $1/2$ (which is halfway between no transformation and a log transformation) works beautifully:

"Beautiful" refers to the simple description this picture permits: instead of having to specify the location, shape, size, and orientation of each point cloud, we need only observe that (to an excellent approximation) all the clouds are circular with similar radii. In effect, the CLR has simplified an initial description requiring at least 16 numbers into one that requires only 12 numbers and the ILR has reduced that to just four numbers (three univariate locations and one radius), at a price of specifying the ILR parameter of $1/2$--a fifth number. When such dramatic simplifications happen with real data, we usually figure we're on to something: we have made a discovery or achieved an insight.

This generalization is implemented in the ilr function below. The command to produce these "Z" variables was simply

z <- ilr(x, 1/2)

One advantage of the Box-Cox transformation is its applicability to observations that include true zeros: it is still defined provided the parameter is positive.
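
A small sketch of that point, with the Box-Cox transformation written directly as a power. The helper name `box_cox` and the toy composition are assumptions for illustration only.

```r
# Box-Cox transform written directly as a power (helper name is hypothetical).
box_cox <- function(x, p) if (p == 0) log(x) else (x^p - 1) / p

x <- c(0, 0.25, 0.75)   # a toy composition with a true zero
box_cox(x, 0)           # first value is -Inf: the log fails at zero
box_cox(x, 1/2)         # finite everywhere; the zero maps to -1/p = -2
```

With a positive parameter every component stays finite, so the centering and rotation steps can proceed as usual.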

References

Michail T. Tsagris, Simon Preston and Andrew T.A. Wood, A data-based power transformation for compositional data. arXiv:1106.1451v2 [stat.ME] 16 Jun 2011.

David A. Harville, Matrix Algebra From a Statistician's Perspective. Springer Science & Business Media, Jun 27, 2008.

Here is the R code.

#
# ILR (Isometric log-ratio) transformation.
# `x` is an `n` by `k` matrix of positive observations with k >= 2.
#
ilr <- function(x, p=0) {
  y <- log(x)
  if (p != 0) y <- (exp(p * y) - 1) / p       # Box-Cox transformation
  y <- y - rowMeans(y, na.rm=TRUE)            # Recentered values
  k <- dim(y)[2]
  H <- contr.helmert(k)                       # Dimensions k by k-1
  H <- t(H) / sqrt((2:k)*(2:k-1))             # Dimensions k-1 by k
  z <- y %*% t(H)                             # Rotated/reflected values
  if(!is.null(colnames(x)))                   # (Helps with interpreting output)
    colnames(z) <- paste0(colnames(x)[-1], ".ILR")
  return(z)
}
#
# Specify a Dirichlet(alpha) distribution for testing.
#
alpha <- c(1,2,3,4)
#
# Simulate and plot compositional data.
#
n <- 1000
k <- length(alpha)
x <- matrix(rgamma(n*k, alpha), nrow=n, byrow=TRUE)
x <- x / rowSums(x)
colnames(x) <- paste0("X.", 1:k)
pairs(x, pch=19, col="#00000040", cex=0.6)
#
# Obtain the ILR.
#
y <- ilr(x)
colnames(y) <- paste0("Y.", 1:(k-1))
#
# Plot the ILR.
#
pairs(y, pch=19, col="#00000040", cex=0.6)
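
For data like those in the question, the same steps (log, center, rotate) might look as follows. This is only a sketch: the hours and the column names are invented, and the steps are written inline so that the block runs on its own.

```r
# Hypothetical hours per day for three people; column names are assumptions.
hours <- cbind(sleep     = c(7.5, 8.0, 6.5),
               sedentary = c(9.0, 10.5, 11.0),
               light     = c(5.0, 4.0, 4.5),
               moderate  = c(1.5, 0.8, 1.0),
               vigorous  = c(0.5, 0.2, 0.3))

comp <- hours / rowSums(hours)         # closure: each row now sums to 1

# Helmert-based ILR with p = 0: log, centre, rotate.
k <- ncol(comp)
y <- log(comp)
y <- y - rowMeans(y)
H <- t(contr.helmert(k)) / sqrt((2:k) * (1:(k - 1)))   # (k-1) x k, orthonormal rows
z <- y %*% t(H)
colnames(z) <- paste0(colnames(comp)[-1], ".ILR")
z                                      # k - 1 = 4 coordinates per person
```

Note that with the plain log ($p=0$) the centering step removes any common scale factor, so rows that do not sum to exactly 24 give the same coordinates whether or not they are closed first.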

Related Solutions

Solved – Why is isometric log-ratio transformation preferred over the additive(alr) or centered(clr) with compositional data

Continuing off of marianess's answer, clr is really not suitable due to the collinearity issue. In other words, if you try to make inferences with clr-transformed data, you may fall into the trap of trying to infer increases or decreases of individual variables, which you can never do with proportions in the first place.

The ilr transformation attempts to resolve this by sticking to ratios of partitions, since ratios are stable quantities. These partitions can be represented as trees, where each internal node in the tree represents the log ratio of the geometric means of its two subtrees. These log ratios of subtrees are known as balances.
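
As a rough illustration, a single balance can be computed in base R as below. The helper names `gmean` and `balance` and the toy composition are my own; the scaling factor $\sqrt{rs/(r+s)}$, with $r$ and $s$ the sizes of the two subtrees, is the usual normalization that makes the balances orthonormal coordinates.

```r
# Balance between two groups of parts (helper names are hypothetical).
gmean <- function(v) exp(mean(log(v)))   # geometric mean

balance <- function(x, num, den) {
  r <- length(num); s <- length(den)
  sqrt(r * s / (r + s)) * log(gmean(x[num]) / gmean(x[den]))
}

x <- c(a = 0.1, b = 0.2, c = 0.3, d = 0.4)        # toy composition
balance(x, num = c("a", "b"), den = c("c", "d"))  # root split {a,b} vs {c,d}
balance(x, num = "a", den = "b")                  # split within the left subtree
balance(10 * x, num = "a", den = "b")             # same value: balances ignore scale
```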

I'd also recommend checking out these publications, since they all have nice explanations of how to interpret the ilr transform.

http://msystems.asm.org/content/2/1/e00162-16

https://peerj.com/articles/2969/

https://elifesciences.org/content/6/e21887

Here is an IPython notebook that goes into the details of how to calculate balances given a tree.

I also gave a description of how to do this with the modules in scikit-bio here, in case you are curious.

Solved – How to use isometric logratio ilr() from a package “compositions”

If your data is compositional, it means that the only relevant information available in your data is the relative information you have between parts. Therefore, you are interested in studying the relations between parts relatively. The log-ratio methodology studies these relative relations using the quotient between parts or, to be more precise, the logarithm of ratios between parts.

You can see that all the possible ratios between the parts of a $k$-part composition, $(x_1,\dots,x_k)$, are completely characterized by $k-1$ suitably chosen (independent) log-ratios. A common approach is to consider all the ratios against the last component, $(\log\frac{x_1}{x_k}, \dots, \log\frac{x_{k-1}}{x_k})$; this approach is known as the additive log-ratio transformation (alr). It is possible to interpret the alr transformation as the coordinates with respect to a certain basis.
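
A minimal base-R sketch of the alr, assuming the compositions are stored as rows of a matrix (the helper name `alr_rows` and the toy values are hypothetical):

```r
# Additive log-ratio sketch (helper name `alr_rows` is hypothetical).
alr_rows <- function(x) {
  k <- ncol(x)
  log(x[, -k, drop = FALSE] / x[, k])   # log(x_i / x_k) for i = 1..k-1
}

# Two toy compositions, one per row.
x <- rbind(c(0.2, 0.3, 0.5),
           c(0.6, 0.1, 0.3))
alr_rows(x)                             # k - 1 = 2 coordinates per row
```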

A first problem with the alr transformation is that the basis used to obtain the transformed values is not orthogonal but oblique, and a second problem is that the simplex does not seem to have a standard basis.

However, you can define an orthonormal basis in the simplex, and with this basis you can define a transformation. This transformation is commonly known as the isometric log-ratio transformation (ilr), and the function implemented in the package compositions obtains the ilr coordinates with respect to an orthonormal basis.
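
To see what orthonormality buys you, here is a small base-R check on two made-up compositions. The matrix `V` is one possible orthonormal basis of the clr hyperplane for three parts, written out by hand as an assumption for this sketch; distances computed from the ilr coordinates match the distances between the centered logs, whereas the alr coordinates distort them.

```r
# Two toy compositions, one per row.
x <- rbind(c(0.2, 0.3, 0.5),
           c(0.6, 0.1, 0.3))
clr <- log(x) - rowMeans(log(x))        # centred logs

# One orthonormal basis of the clr hyperplane (in columns), for k = 3.
V <- cbind(c(-1,  1, 0) / sqrt(2),
           c(-1, -1, 2) / sqrt(6))
ilr_xy <- clr %*% V                     # orthonormal (ilr) coordinates
alr_xy <- log(x[, 1:2] / x[, 3])        # oblique (alr) coordinates

dist(clr)      # distance between the two compositions in clr space
dist(ilr_xy)   # identical: orthonormal coordinates preserve it
dist(alr_xy)   # different: the alr basis is oblique
```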

With your example (I've reduced the number of compositions),

set.seed(1)
# loading library
library(compositions)

# Generate data
dataset <- data.frame(
x = runif(5, min = 0.2, max = 0.65), 
y = runif(5, min = 0.2, max = 0.4),   
z = runif(5, min = 0.1, max = 0.7))  

# Make data compositional
dataset.compositional = acomp(dataset)

you have that the coordinates of your sample

(X <- dataset.compositional/rowSums(dataset.compositional))
#         x         y         z        
# [1,] 0.3462279 0.4114672 0.2423048
# [2,] 0.3818417 0.4041619 0.2139964
# [3,] 0.3515582 0.2550841 0.3933578
# [4,] 0.4811888 0.2575718 0.2612394
# [5,] 0.2730063 0.1993929 0.5276008

are

(dataset.ilr <- ilr(dataset.compositional))
#            [,1]       [,2]
# [1,]  0.12206935 -0.3618850
# [2,]  0.04017035 -0.4959822
# [3,] -0.22682715  0.2226876
# [4,] -0.44191428 -0.2435951
# [5,] -0.22218525  0.6662235

By default, the ilr function uses the basis (in columns)

(B <- exp(ilrBase(D=3)))
# 1 0.4930687 0.6648138
# 2 2.0281150 0.6648138
# 3 1.0000000 2.2625592

which means that your original sample can be obtained from the orthonormal basis B and the ilr coordinates.

X1 = t(apply(dataset.ilr, 1, function(x){
  B[,1]^x[1] * B[,2]^x[2]
}))
X1 / rowSums(X1)
#            1         2         3
# [1,] 0.3462279 0.4114672 0.2423048
# [2,] 0.3818417 0.4041619 0.2139964
# [3,] 0.3515582 0.2550841 0.3933578
# [4,] 0.4811888 0.2575718 0.2612394
# [5,] 0.2730063 0.1993929 0.5276008

In this case, taking the basis B into account, we see that (except for a constant factor) the first column of the ilr coordinates compares the first component against the second component

1/sqrt(2) * log(X[,2]/X[,1]) # Compare with first column of dataset.ilr
# [1]  0.12206935  0.04017035 -0.22682715 -0.44191428 -0.22218525

the second column compares (except for a constant factor) the third component against the other components (in fact, against the geometric mean of the first and second components)

sqrt(2)/sqrt(3) * log(X[,3] / (X[,1]*X[,2])^(1/2))
# [1] -0.3618850 -0.4959822  0.2226876 -0.2435951  0.6662235

Finally, I think a good place to find more information about the subject (articles, books, ...) is http://www.compositionaldata.com/material.php