Solved – Standardization of compositional data in PCA versus using real data

Tags: compositional-data, correlation, data-transformation, eigenvalues, pca

I have a question about conducting a PCA on variables that are measured in different units. I understand the importance of using a correlation matrix rather than a covariance matrix so that differences in scale do not dominate the result. The data I'm working with is not normally distributed and has not been transformed in other tests.

For example, there are three variables A, B, and C, and 20 observations, where 10 observations are measured using one set of units, and the other 10 observations are measured using another set of units*. The measurements in the two sets of units differ considerably in value and variance (as expected). The data is not normal in either set of units and has not been transformed.

The measurements using the first set of units are 2 to 3 orders of magnitude higher than those measured using the other units (as expected). I have conducted a PCA using a correlation matrix and interpreted the results. However, a non-statistician recommended that I "standardize" the measured data, so that I'm using ratios or fractions for all the observations for each of the variables: Variable A/(sum of all 3 variables), and likewise for Variables B and C.

However, a PCA using the contributing fraction of each variable differs from a PCA using the measured values in their different units: the eigenvalues and eigenvectors are different, which leads to two different scientific interpretations.

Beyond using the appropriate association matrix, is this "standardization" step valid and/or necessary from a statistical perspective? Update: Should PCA be done on compositional data?

Best Answer

First, whether you use the covariance matrix, or the correlation matrix (equivalent to standardizing each variable before carrying out PCA on the covariance matrix), or transform the data in any other way before carrying out PCA, the results of the PCA apply to that transformed data. So you should not be surprised to see different eigensystems using different transformations; any interpretations you make may of course be different, but they are not conflicting. If they seem to conflict, you must be misinterpreting them.
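To make the equivalence in the first point concrete, here is a minimal NumPy sketch. The data are simulated and purely hypothetical (20 observations of three variables on very different scales); the helper name `pca_eigenvalues` is my own and not from any library. It suggests that PCA on the correlation matrix gives the same eigenvalues as PCA on the covariance matrix of the standardized variables, while PCA on the raw covariance matrix gives a different eigensystem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 observations of three variables A, B, C
# whose scales differ by orders of magnitude (an assumption for illustration).
X = rng.lognormal(mean=0.0, sigma=1.0, size=(20, 3)) * np.array([1.0, 100.0, 1000.0])


def pca_eigenvalues(S):
    """Return the eigenvalues of a symmetric matrix in descending order."""
    return np.sort(np.linalg.eigvalsh(S))[::-1]


# PCA on the raw covariance matrix and on the correlation matrix.
cov = np.cov(X, rowvar=False)
corr = np.corrcoef(X, rowvar=False)

# Standardize each column (zero mean, unit sample standard deviation),
# then take the covariance matrix of the standardized data.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_of_standardized = np.cov(Z, rowvar=False)

eig_cov = pca_eigenvalues(cov)
eig_corr = pca_eigenvalues(corr)
eig_std = pca_eigenvalues(cov_of_standardized)

print("covariance PCA eigenvalues:   ", eig_cov)
print("correlation PCA eigenvalues:  ", eig_corr)
print("standardized-data eigenvalues:", eig_std)
```

The last two lines of output should agree up to floating-point error, and the correlation-matrix eigenvalues sum to the number of variables. The first line is on a completely different scale, which is just the point above restated: each eigensystem describes the data as transformed.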

Second, whether it's more meaningful to express each variable as a fraction of the sum of variables for each individual is for you to decide, before thinking about principal components. If it is more meaningful, PCA on the data thus transformed may not be what you want: any one variable is expressible in terms of the other two, which are still constrained not to exceed unity in total. A scatterplot would be an obvious method to look at three variables, using barycentric co-ordinates if you like. If you still need PCA for something, Aitchison (1983), Biometrika 70 (1) discusses the issues, & gives useful transformations to use for vectors of proportions, & you may be interested in the R packages compositions & robCompositions.
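The constraint mentioned in the second point can be seen numerically. The sketch below uses simulated, hypothetical positive data and plain NumPy rather than the R packages named above. It converts each row to fractions of the row total, checks that the resulting covariance matrix has a zero eigenvalue (because the fractions sum to one), and then applies the centred log-ratio (clr) transformation, which is one of the log-ratio transformations commonly used for vectors of proportions. Treat it as an illustration under those assumptions, not as a reproduction of any particular paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical strictly positive measurements of A, B, C (20 observations).
X = rng.lognormal(mean=0.0, sigma=0.5, size=(20, 3))

# Express each variable as a fraction of the row total.
P = X / X.sum(axis=1, keepdims=True)

# Because each row of P sums to 1, the covariance matrix is singular:
# its smallest eigenvalue is zero up to rounding error.
cov_P = np.cov(P, rowvar=False)
eig_P = np.sort(np.linalg.eigvalsh(cov_P))[::-1]

# Centred log-ratio: log of each part minus the row mean of the logs.
logP = np.log(P)
clr = logP - logP.mean(axis=1, keepdims=True)

# PCA on the covariance matrix of the clr-transformed data.
cov_clr = np.cov(clr, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov_clr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores on the first two components (clr rows sum to zero,
# so only two components carry variance for three parts).
scores = clr @ eigvecs[:, :2]

print("eigenvalues, raw fractions:", eig_P)
print("eigenvalues, clr data:     ", eigvals)
```

Note that the clr values are unchanged if a row of the original measurements is multiplied by a constant, so the units of a given observation drop out. Whether that is the comparison you want is still the substantive question to settle first.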
