PCA – What is the Horseshoe Effect and Arch Effect in Correspondence Analysis?

correspondence-analysis, ecology, exploratory-data-analysis, pca

There are many techniques in ecological statistics for exploratory data analysis of multidimensional data. These are called 'ordination' techniques. Many are the same or closely related to common techniques elsewhere in statistics. Perhaps the prototypical example would be principal components analysis (PCA). Ecologists might use PCA, and related techniques, to explore 'gradients' (I am not entirely clear what a gradient is, but I've been reading a little bit about it.)

On this page, the last item under Principal Components Analysis (PCA) reads:

PCA has a serious problem for vegetation data: the horseshoe effect. This is caused by the curvilinearity of species distributions along gradients. Since species response curves are typically unimodal (i.e. very strongly curvilinear), horseshoe effects are common.

Further down the page, under Correspondence Analysis or Reciprocal Averaging (RA), it refers to "the arch effect":

RA has a problem: the arch effect. It is also caused by nonlinearity of distributions along gradients.

The arch is not as serious as the horseshoe effect of PCA, because the ends of the gradient are not convoluted.

Can someone explain this? I have recently seen this phenomenon in plots that re-represent data in a lower dimensional space (viz., correspondence analysis and factor analysis).

What would a "gradient" correspond to more generally (i.e., in a non-ecological context)?
If this happens with your data, is it a "problem" ("serious problem")? For what?
How should one interpret output where a horseshoe / arch shows up?
Does a remedy need to be applied? What? Would transformations of the original data help? What if the data are ordinal ratings?

The answers may exist in other pages on that site (e.g., for PCA, CA, and DCA). I have been trying to work through those. But the discussions are couched in sufficiently unfamiliar ecological terminology and examples that it is harder to understand the issue.

Best Answer

Q1

Ecologists talk of gradients all the time. There are lots of kinds of gradients, but it may be best to think of them as some combination of whatever variable(s) you want or are important for the response. So a gradient could be time, or space, or soil acidity, or nutrients, or something more complex such as a linear combination of a range of variables required by the response in some way.

We talk about gradients because we observe species in space or time and a whole host of things vary with that space or time.

Q2

I have come to the conclusion that in many cases the horseshoe in PCA is not a serious problem if you understand how it arises and don't do silly things like take PC1 alone when the "gradient" is actually represented by PC1 and PC2 together (well, it is also split into higher PCs too, but hopefully a 2-d representation is OK).

In CA I guess I think the same (now having been forced to think a bit about it). The solution can form an arch when there is no strong 2nd dimension in the data, such that a folded version of the first axis, which satisfies the orthogonality requirement of the CA axes, explains more "inertia" than another direction in the data. This may be more serious, as this is made-up structure, whereas with PCA the arch is just a way to represent species abundances at sites along a single dominant gradient.

I've never quite understood why people worry so much about the wrong ordering along PC1 with a strong horseshoe. I would counter that you shouldn't take just PC1 in such cases, and then the problem goes away; the pairs of coordinates on PC1 and PC2 get rid of the reversals on any one of those two axes.
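To make the point about PC1 and PC2 concrete, here is a minimal simulation sketch. It is not taken from any of the sources cited here: the Gaussian response curves, the numbers of sites and species, and the niche width `tol` are all assumptions chosen for illustration.

```r
# Simulated single gradient with unimodal (Gaussian) species responses.
# All settings below are illustrative assumptions, not values from the text.
n_sites   <- 60
n_species <- 25
gradient  <- seq(0, 100, length.out = n_sites)      # site positions on the gradient
optima    <- seq(-10, 110, length.out = n_species)  # species optima along it
tol       <- 12                                     # common niche width

# Rows = sites, columns = species
abund <- sapply(optima, function(u) 50 * exp(-(gradient - u)^2 / (2 * tol^2)))

pca    <- prcomp(abund, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]

# Sites are joined in their true order along the gradient
plot(scores, type = "b", asp = 1, xlab = "PC1", ylab = "PC2",
     main = "One simulated gradient in the PC1-PC2 plane")
```

With data of this kind the sites usually trace a curved path in the plane. Reading positions off PC1 alone can then pull the two ends of the gradient towards each other, whereas following the path through the (PC1, PC2) pairs keeps the sites in their true order.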

Q3

If I saw the horseshoe in a PCA biplot, I would interpret the data as having a single dominant gradient or direction of variation.

If I saw the arch, I would probably conclude the same, but I would be very wary of trying to explain CA axis 2 at all.

I would not apply DCA - it just twists the arch away (in the best circumstances) so that you don't see the oddities in 2-d plots, but in many cases it produces other spurious structures, such as diamond or trumpet shapes in the arrangement of samples in the DCA space. For example:

library("vegan")
data(BCI)
plot(decorana(BCI), display = "sites", type = "p") ## does DCA

[Figure: DCA ordination of the BCI sites]

We see a typical fanning out of sample points towards the left of the plot.

Q4

I would suggest that the answer to this question depends on the aims of your analysis. If the arch/horseshoe was due to a single dominant gradient, then rather than have to represent this as $m$ PCA axes, it would be beneficial if we could estimate a single variable that represents the positions of sites/samples along the gradient.

This would suggest finding a nonlinear direction in the high-dimensional space of the data. One such method is the principal curve of Hastie & Stuetzle, but other non-linear manifold methods are available which might suffice.

For example, for some pathological data

[Figure: ordination plot of the example data]

We see a strong horseshoe. The principal curve tries to recover this underlying gradient or arrangement/ordering of samples via a smooth curve in the m dimensions of the data. The figure below shows how the iterative algorithm converges on something approximating the underlying gradient. (I think it wanders away from the data at the top of the plot so as to be closer to the data in higher dimensions, and partly because of the self-consistency criterion for a curve to be declared a principal curve.)

[Figure: iterations of the principal curve fit to the example data]

I have more details including code on my blog post from which I took those images. But the main point here is that the principal curve easily recovers the known ordering of samples whereas PC1 or PC2 on its own does not.

In the PCA case, it is common to apply transformations in ecology. Popular transformations are those that can be thought of as returning some non-Euclidean distance when the Euclidean distance is computed on the transformed data. For example, the Hellinger distance is

$$D_{\mathrm{Hellinger}}(y_1, y_2) = \sqrt{\sum_{j=1}^p \left [ \sqrt{\frac{y_{1j}}{y_{1+}}} - \sqrt{\frac{y_{2j}}{y_{2+}}} \right ]^2}$$

where $y_{ij}$ is the abundance of the $j$th species in sample $i$, and $y_{i+}$ is the sum of the abundances of all species in the $i$th sample. If we convert the data to proportions and apply a square-root transformation, then the Euclidean distance-preserving PCA will represent the Hellinger distances in the original data.
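A quick numerical check of that statement, as a sketch on made-up counts (the matrix `Y` is hypothetical). It compares the Euclidean distance on square-rooted proportions with the Hellinger formula evaluated directly for samples 1 and 2.

```r
set.seed(1)
# Hypothetical abundance table: 5 samples (rows) by 4 species (columns)
Y <- matrix(rpois(20, lambda = 5) + 1, nrow = 5)

# Hellinger transformation: square root of the row proportions
Y_hel <- sqrt(Y / rowSums(Y))

# Euclidean distances computed on the transformed data
d_euclid <- as.matrix(dist(Y_hel))

# Hellinger distance between samples 1 and 2, straight from the formula
d_12 <- sqrt(sum((sqrt(Y[1, ] / sum(Y[1, ])) - sqrt(Y[2, ] / sum(Y[2, ])))^2))

all.equal(unname(d_euclid[1, 2]), d_12)   # the two should agree
```

A PCA run on `Y_hel` (for example with `prcomp()`) therefore works with Hellinger distances between samples. In vegan, `decostand(Y, method = "hellinger")` applies the same transformation.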

The horseshoe has been known and studied for a long time in ecology; some of the early literature (plus a more modern look) is

Goodall D.W. (1954) Objective methods for the classification of vegetation. III. An essay in the use of factor analysis. Australian Journal of Botany 2, 304–324.
Noy-Meir I. & Austin M.P. (1970) Principal Component Ordination and Simulated Vegetational Data. Ecology 51, 551–552.
Podani J. & Miklós I. (2002) Resemblance Coefficients and the Horseshoe Effect in Principal Coordinates Analysis. Ecology 83, 3331–3343.
Swan J.M.A. (1970) An Examination of Some Ordination Problems By Use of Simulated Vegetational Data. Ecology 51, 89–102.

The main principal curve references are

De’ath G. (1999) Principal Curves: a new technique for indirect and direct gradient analysis. Ecology 80, 2237–2253.
Hastie T. & Stuetzle W. (1989) Principal Curves. Journal of the American Statistical Association 84, 502–516.

With the former being a very ecological presentation.

Related Solutions

PCA vs Correspondence Analysis – Relation to Biplot

SVD

Singular-value decomposition is at the root of the three kindred techniques. Let $\bf X$ be an $r \times c$ table of real values. SVD is $\bf X = U_{r\times r}S_{r\times c}V_{c\times c}'$. We may use just the first $m$ $[m \le\min(r,c)]$ latent vectors and roots to obtain $\bf X_{(m)}$ as the best $m$-rank approximation of $\bf X$: $\bf X_{(m)} = U_{r\times m}S_{m\times m}V_{c\times m}'$. Further, we'll notate $\bf U=U_{r\times m}$, $\bf V=V_{c\times m}$, $\bf S=S_{m\times m}$.

Singular values $\bf S$ and their squares, the eigenvalues, represent scale, also called inertia, of the data. Left eigenvectors $\bf U$ are the coordinates of the rows of the data onto the $m$ principal axes; while right eigenvectors $\bf V$ are the coordinates of the columns of the data onto those same latent axes. The entire scale (inertia) is stored in $\bf S$ and so the coordinates $\bf U$ and $\bf V$ are unit-normalized (column SS=1).
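As a small sketch of these statements in base R (the random table `X` and the choice `m = 2` are assumptions for illustration; note that R's `svd()` returns the thin decomposition, so `u` has $\min(r,c)$ columns rather than $r$):

```r
set.seed(42)
X  <- matrix(rnorm(8 * 5), nrow = 8, ncol = 5)   # hypothetical 8 x 5 table
sv <- svd(X)                                     # sv$u, sv$d (singular values), sv$v

m <- 2
U <- sv$u[, 1:m, drop = FALSE]
S <- diag(sv$d[1:m], nrow = m)
V <- sv$v[, 1:m, drop = FALSE]

X_m <- U %*% S %*% t(V)                          # rank-m approximation of X

colSums(U^2)                                     # unit-normalized: each should be 1
colSums(V^2)

# Squared approximation error equals the inertia of the discarded axes
err       <- sum((X - X_m)^2)
discarded <- sum(sv$d[-(1:m)]^2)
all.equal(err, discarded)
```

The last comparison shows where the "scale" lives: whatever is not captured by the first $m$ squared singular values is exactly what the rank-$m$ approximation misses.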

Principal Component Analysis by SVD

In PCA, it is agreed upon to consider rows of $\bf X$ as random observations (which can come or go), but to consider columns of $\bf X$ as a fixed number of dimensions or variables. Hence it is appropriate and convenient to remove the effect of the number of rows (and only rows) on the results, particularly on the eigenvalues, by applying svd to $\mathbf Z=\mathbf X/\sqrt{r}$ instead of $\bf X$. Note that this corresponds to eigen-decomposition of $\mathbf {X'X}/r$, $r$ being the sample size $n$. (Often, mostly with covariances, to make them unbiased, we'll prefer to divide by $r-1$, but it is a nuance.)

The multiplication of $\bf X$ by a constant affects only $\bf S$; $\bf U$ and $\bf V$ remain the unit-normalized coordinates of rows and of columns.

From here and everywhere below we redefine $\bf S$, $\bf U$ and $\bf V$ as given by svd of $\bf Z$, not of $\bf X$; $\bf Z$ being a normalized version of $\bf X$, and the normalization varies between types of analysis.

By multiplying $\mathbf U\sqrt{r}=\bf U_*$ we bring the mean square in the columns of $\bf U$ to 1. Given that rows are random cases to us, it is logical. We've thus obtained what is called in PCA standard or standardized principal component scores of observations, $\bf U_*$. We do not do the same thing with $\bf V$ because variables are fixed entities.

We then can confer rows with all the inertia, to obtain unstandardized row coordinates, also called in PCA raw principal component scores of observations: $\bf U_*S$. This formula we'll call "direct way". The same result is returned by $\bf XV$; we'll label it "indirect way".

Analogously, we can confer columns with all the inertia, to obtain unstandardized column coordinates, also called in PCA the component-variable loadings: $\bf VS'$ [may ignore transpose if $\bf S$ is square], - the "direct way". The same result is returned by $\bf Z'U$, - the "indirect way". (The above standardized principal component scores can also be computed from the loadings as $\bf X(AS^{-2})$, where $\bf A$ are the loadings and $\bf S^{2}$ holds the eigenvalues: since $\bf A=VS$ and $\bf XV=U_*S$, we get $\bf XAS^{-2}=U_*$.)
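The direct and indirect routes can be checked numerically. This is only a sketch on random, uncentered data (the matrix `X` is made up); the score formula from loadings is taken as $\bf XAS^{-2}$ with $\bf S$ the singular values of $\bf Z$, so that $\bf S^2$ are the eigenvalues.

```r
set.seed(1)
X <- matrix(rnorm(20 * 4), nrow = 20)   # hypothetical data, analysed as is (no centering)
r <- nrow(X)

Z  <- X / sqrt(r)
sv <- svd(Z)
U <- sv$u; V <- sv$v; S <- diag(sv$d)

# Eigenvalues of X'X / r are the squared singular values of Z
eig <- eigen(crossprod(X) / r, symmetric = TRUE)$values

U_std           <- U * sqrt(r)          # standardized component scores (mean square 1)
scores_direct   <- U_std %*% S          # raw scores, direct way
scores_indirect <- X %*% V              # raw scores, indirect way

A          <- V %*% S                   # loadings, direct way
A_indirect <- t(Z) %*% U                # loadings, indirect way

# Standardized scores recovered from the loadings
U_from_A <- X %*% A %*% solve(S %*% S)

all.equal(scores_direct, scores_indirect)
all.equal(A, A_indirect)
all.equal(U_from_A, U_std)
```

Because both routes use the same `svd()` result, there is no sign ambiguity to worry about in these comparisons.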

Biplot

Consider biplot in the sense of a dimensionality reduction analysis in its own right, not simply as "a dual scatterplot". This analysis is very similar to PCA. Unlike PCA, both rows and columns are treated, symmetrically, as random observations, which means that $\bf X$ is seen as a random two-way table of varying dimensionality. Then, naturally, normalize it by both $r$ and $c$ before svd: $\mathbf Z=\mathbf X/\sqrt{rc}$.

After svd, compute standard row coordinates as we did it in PCA: $\mathbf U_*=\mathbf U\sqrt{r}$. Do the same thing (unlike PCA) with column vectors, to obtain standard column coordinates: $\mathbf V_*=\mathbf V\sqrt{c}$. Standard coordinates, both of rows and of columns, have mean square 1.

We may confer the inertia of the eigenvalues on row and/or column coordinates as we do in PCA. Unstandardized row coordinates: $\bf U_*S$ (direct way). Unstandardized column coordinates: $\bf V_*S'$ (direct way). What about the indirect way? You can easily deduce by substitutions that the indirect formula for the unstandardized row coordinates is $\mathbf {XV_*}/c$, and for the unstandardized column coordinates is $\mathbf {X'U_*}/r$.
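The same kind of check works for the biplot formulas. A sketch on a made-up table (the name `c_n` stands for $c$, simply to avoid masking R's `c()` function):

```r
set.seed(2)
X   <- matrix(runif(7 * 4), nrow = 7)   # hypothetical 7 x 4 table
r   <- nrow(X)
c_n <- ncol(X)

Z  <- X / sqrt(r * c_n)                 # normalize by both rows and columns
sv <- svd(Z)
S  <- diag(sv$d)

U_std <- sv$u * sqrt(r)                 # standard row coordinates
V_std <- sv$v * sqrt(c_n)               # standard column coordinates

row_direct   <- U_std %*% S             # unstandardized rows, direct way
row_indirect <- X %*% V_std / c_n       # unstandardized rows, indirect way
col_direct   <- V_std %*% S             # unstandardized columns, direct way
col_indirect <- t(X) %*% U_std / r      # unstandardized columns, indirect way

all.equal(row_direct, row_indirect)
all.equal(col_direct, col_indirect)
colMeans(U_std^2)                       # mean squares of standard coordinates: 1
colMeans(V_std^2)
```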

PCA as a particular case of Biplot. From the above descriptions you probably learned that PCA and biplot differ only in how they normalize $\bf X$ into $\bf Z$ which is then decomposed. Biplot normalizes by both the number of rows and the number of columns; PCA normalizes only by the number of rows. Consequently, there is a little difference between the two in the post-svd computations. If in doing biplot you set $c=1$ in its formulas you will get exactly PCA results. Thus, biplot can be seen as a generic method and PCA as a particular case of biplot.

[Column centering. Some users may say: stop, doesn't PCA first of all require centering of the data columns (variables) in order to explain variance, while biplot may not do the centering? My answer: only PCA-in-narrow-sense does the centering and explains variance; I'm discussing linear PCA-in-general-sense, PCA which explains some sort of sum of squared deviations from the chosen origin; you might choose it to be the data mean, the native 0 or whatever you like. Thus, the "centering" operation isn't what could distinguish PCA from biplot.]

Passive rows and columns

In biplot or PCA, you can set some rows and/or columns to be passive, or supplementary. Passive row or column does not influence the SVD and therefore does not influence the inertia or the coordinates of other rows/columns, but receives its coordinates in the space of principal axes produced by the active (not passive) rows/columns.

To set some points (rows/columns) to be passive, (1) define $r$ and $c$ be the number of active rows and columns only. (2) Set to zero passive rows and columns in $\bf Z$ before svd. (3) Use the "indirect" ways to compute coordinates of passive rows/columns, since their eigenvector values will be zero.
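The three steps can be sketched as follows for one passive row. The table and the choice of which row is passive are made up for illustration.

```r
set.seed(3)
X       <- matrix(runif(6 * 4), nrow = 6)   # hypothetical 6 x 4 table
passive <- 6                                # row to be treated as passive

r   <- nrow(X) - 1                          # (1) count active rows only
c_n <- ncol(X)                              #     all columns are active here

Z <- X / sqrt(r * c_n)
Z[passive, ] <- 0                           # (2) zero the passive row before svd
sv <- svd(Z)

V_std <- sv$v * sqrt(c_n)
rows_direct   <- (sv$u * sqrt(r)) %*% diag(sv$d)  # passive row stays at the origin
rows_indirect <- X %*% V_std / c_n                # (3) indirect way places it

# Reference: svd of the active rows alone gives the same inertia
sv_active <- svd(X[-passive, ] / sqrt(r * c_n))
all.equal(sv$d, sv_active$d)

rows_indirect[passive, ]                    # coordinates received by the passive row
```

For the active rows the direct and indirect results coincide, so in practice the indirect formula can be used for all rows at once.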

In PCA, when you compute component scores for new incoming cases with the help of loadings obtained on old observations (using the score coefficient matrix), you are actually doing the same thing as taking these new cases into the PCA and keeping them passive. Similarly, computing correlations/covariances of some external variables with the component scores produced by a PCA is equivalent to taking those variables into that PCA and keeping them passive.

Arbitrary spreading of inertia

The column mean squares (MS) of standard coordinates are 1. The column mean squares (MS) of unstandardized coordinates are equal to the inertia of the respective principal axes: all the inertia of eigenvalues was donated to eigenvectors to produce the unstandardized coordinates.

In biplot: row standard coordinates $\bf U_*$ have MS=1 for each principal axis. Row unstandardized coordinates, also called row principal coordinates $\mathbf {U_*S} = \mathbf {XV_*}/c$ have MS = corresponding eigenvalue of $\bf Z$. The same is true for column standard and unstandardized (principal) coordinates.

Generally, it is not required that one endows coordinates with inertia either in full or in none. Arbitrary spreading is allowed, if needed for some reason. Let $p_1$ be the proportion of inertia which is to go to rows. Then the general formula of row coordinates is: $\bf U_*S^{p1}$ (direct way) = $\mathbf {XV_*S^{p1-1}}/c$ (indirect way). If $p_1=0$ we get standard row coordinates, whereas with $p_1=1$ we get principal row coordinates.

Likewise $p_2$ be the proportion of inertia which is to go to columns. Then the general formula of column coordinates is: $\bf V_*S^{p2}$ (direct way) = $\mathbf {X'U_*S^{p2-1}}/r$ (indirect way). If $p_2=0$ we get standard column coordinates, whereas with $p_2=1$ we get principal column coordinates.

The general indirect formulas are universal in that they allow to compute coordinates (standard, principal or in-between) also for the passive points, if there are any.
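A small helper makes the role of $p_1$ and $p_2$ easy to explore. This is a sketch only; the function name `biplot_coords` and the test table are hypothetical, and it covers the classic (unweighted) biplot with all points active.

```r
# Classic biplot coordinates with an arbitrary spreading of inertia.
# p1, p2: proportions of inertia given to rows and to columns.
biplot_coords <- function(X, p1 = 1, p2 = 1) {
  r   <- nrow(X)
  c_n <- ncol(X)
  sv  <- svd(X / sqrt(r * c_n))
  U_std <- sv$u * sqrt(r)
  V_std <- sv$v * sqrt(c_n)
  list(
    d    = sv$d,
    rows = sweep(U_std, 2, sv$d^p1, "*"),   # U_* S^p1
    cols = sweep(V_std, 2, sv$d^p2, "*")    # V_* S^p2
  )
}

set.seed(4)
X <- matrix(runif(10 * 4), nrow = 10)       # hypothetical table

form <- biplot_coords(X, p1 = 1,   p2 = 0)   # row-principal, column-standard
symm <- biplot_coords(X, p1 = 0.5, p2 = 0.5) # inertia split evenly

colMeans(form$rows^2)   # should equal form$d^2 (the eigenvalues)
colMeans(form$cols^2)   # should all be 1
colMeans(symm$rows^2)   # should equal symm$d

# With p1 + p2 = 1 and all axes kept, row-by-column inner products rebuild X
all.equal(form$rows %*% t(form$cols), X)
all.equal(symm$rows %*% t(symm$cols), X)
```

The last two lines illustrate why the case $p_1+p_2=1$ is singled out: the inertia is used exactly once, so the scalar products between row and column points reproduce the analysed table.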

If $p_1+p_2=1$ they say the inertia is distributed between row and column points. The $p_1=1,p_2=0$, i.e. row-principal-column-standard, biplots are sometimes called "form biplots" or "row-metric preservation" biplots. The $p_1=0,p_2=1$, i.e. row-standard-column-principal, biplots are often called within PCA literature "covariance biplots" or "column-metric preservation" biplots; they display variable loadings (which are juxtaposed to covariances) plus standardized component scores, when applied within PCA.

In correspondence analysis, $p_1=p_2=1/2$ is often used and is called "symmetric" or "canonical" normalization by inertia - it allows us (albeit at some expense of Euclidean geometric strictness) to compare proximity between row and column points, as we can do on a multidimensional unfolding map.

Correspondence Analysis (Euclidean model)

Two-way (=simple) correspondence analysis (CA) is biplot used to analyze a two-way contingency table, that is, a non-negative table whose entries bear the meaning of some sort of affinity between a row and a column. When the table holds frequencies, chi-square model correspondence analysis is used. When the entries are, say, means or other scores, a simpler Euclidean model CA is used.

Euclidean model CA is just the biplot described above, only that the table $\bf X$ is additionally preprocessed before it enters the biplot operations. In particular, the values are normalized not only by $r$ and $c$ but also by the total sum $N$.

The preprocessing consists of centering, then normalizing by the mean mass. Centering can be various, most often: (1) centering of columns; (2) centering of rows; (3) two-way centering which is the same operation as computation of frequency residuals; (4) centering of columns after equalizing column sums; (5) centering of rows after equalizing row sums. Normalizing by the mean mass is dividing by the mean cell value of the initial table. At preprocessing step, passive rows/columns, if exist, are standardized passively: they are centered/normalized by the values computed from active rows/columns.

Then usual biplot is done on the preprocessed $\bf X$, starting from $\mathbf Z=\mathbf X/\sqrt{rc}$.

Weighted Biplot

Imagine that the activity or importance of a row or a column can be any number between 0 and 1, and not only 0 (passive) or 1 (active) as in the classic biplot discussed so far. We could weight the input data by these row and column weights and perform weighted biplot. With weighted biplot, the greater the weight, the more influential that row or column is on all the results: the inertia and the coordinates of all the points on the principal axes.

The user supplies row weights and column weights. Each set is first normalized separately to sum to 1. Then the normalization step is $\mathbf{Z_{ij} = X_{ij}}\sqrt{w_i w_j}$, with $w_i$ and $w_j$ being the weights for row i and column j. Exactly zero weight designates the row or the column to be passive.

At that point we may discover that classic biplot is simply this weighted biplot with equal weights $1/r$ for all active rows and equal weights $1/c$ for all active columns; $r$ and $c$ the numbers of active rows and active columns.

Perform svd of $\bf Z$. All operations are the same as in classic biplot, the only difference being that $w_i$ is in place of $1/r$ and $w_j$ is in place of $1/c$. Standard row coordinates: $\mathbf {U_{*i}=U_i}/\sqrt{w_i}$ and standard column coordinates: $\mathbf {V_{*j}=V_j}/\sqrt{w_j}$. (These are for rows/columns with nonzero weight. Leave values as 0 for those with zero weight and use the indirect formulas below to obtain standard or whatever coordinates for them.)

Give inertia to coordinates in the proportion you want (with $p_1=1$ and $p_2=1$ the coordinates will be fully unstandardized, or principal; with $p_1=0$ and $p_2=0$ they will stay standard). Rows: $\bf U_*S^{p1}$ (direct way) = $\bf X[Wj]V_*S^{p1-1}$ (indirect way). Columns: $\bf V_*S^{p2}$ (direct way) = $\bf ([Wi]X)'U_*S^{p2-1}$ (indirect way). Matrices in brackets here are the diagonal matrices of the column and the row weights, respectively. For passive points (that is, with zero weights) only the indirect way of computation is suited. For active (positive weights) points you may go either way.
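A sketch of weighted biplot along these lines. The function name `weighted_biplot`, the table and the weights are all made up, and for simplicity the sketch assumes strictly positive weights (no passive points), so the division by $\sqrt{w}$ is always defined.

```r
# Weighted biplot for strictly positive row and column weights.
weighted_biplot <- function(X, w_row, w_col, p1 = 1, p2 = 1) {
  w_row <- w_row / sum(w_row)              # normalize each set of weights to sum to 1
  w_col <- w_col / sum(w_col)
  Z  <- X * sqrt(outer(w_row, w_col))      # Z_ij = X_ij * sqrt(w_i * w_j)
  sv <- svd(Z)
  U_std <- sv$u / sqrt(w_row)              # row point i divided by sqrt(w_i)
  V_std <- sv$v / sqrt(w_col)              # column point j divided by sqrt(w_j)
  list(
    d = sv$d, U_std = U_std, V_std = V_std,
    rows  = sweep(U_std, 2, sv$d^p1, "*"), # U_* S^p1
    cols  = sweep(V_std, 2, sv$d^p2, "*"), # V_* S^p2
    w_row = w_row, w_col = w_col
  )
}

set.seed(6)
X   <- matrix(runif(8 * 3), nrow = 8)      # hypothetical 8 x 3 table
fit <- weighted_biplot(X, w_row = c(3, 1, 1, 2, 1, 1, 1, 2), w_col = c(1, 2, 1))

# Indirect formulas for principal coordinates (p1 = p2 = 1)
rows_indirect <- X %*% diag(fit$w_col) %*% fit$V_std
cols_indirect <- t(diag(fit$w_row) %*% X) %*% fit$U_std
all.equal(fit$rows, rows_indirect)
all.equal(fit$cols, cols_indirect)

# Equal weights should reproduce the classic biplot inertia
classic <- svd(X / sqrt(nrow(X) * ncol(X)))$d
equal_w <- weighted_biplot(X, rep(1, nrow(X)), rep(1, ncol(X)))$d
all.equal(classic, equal_w)
```

In the weighted case the "mean square" of the coordinates is a weighted one: `colSums(fit$w_row * fit$rows^2)` should return the eigenvalues `fit$d^2`.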

PCA as a particular case of Biplot revisited. When considering unweighted biplot earlier I mentioned that PCA and biplot are equivalent, the only difference being that biplot sees columns (variables) of the data as random cases symmetrically to observations (rows). Having extended now biplot to more general weighted biplot we may once again claim it, observing that the only difference is that (weighted) biplot normalizes the sum of column weights of input data to 1, and (weighted) PCA - to the number of (active) columns. So here is the weighted PCA introduced. Its results are proportionally identical to those of weighted biplot. Specifically, if $c$ is the number of active columns, then the following relationships are true, for weighted as well as classic versions of the two analyses:

eigenvalues of PCA = eigenvalues of biplot $\cdot c$;
loadings = column coordinates under "principal normalization" of columns;
standardized component scores = row coordinates under "standard normalization" of rows;
eigenvectors of PCA = column coordinates under "standard normalization" of columns $/ \sqrt c$;
raw component scores = row coordinates under "principal normalization" of rows $\cdot \sqrt c$.
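These relationships can be verified on any table. A sketch for the classic (equal-weight) versions, using made-up data; absolute values are compared because the two `svd()` calls may in principle return axes with opposite signs.

```r
set.seed(5)
X   <- matrix(rnorm(15 * 3), nrow = 15)    # hypothetical 15 x 3 table
r   <- nrow(X)
c_n <- ncol(X)

pca <- svd(X / sqrt(r))                    # PCA normalization
bip <- svd(X / sqrt(r * c_n))              # biplot normalization

# eigenvalues of PCA = eigenvalues of biplot * c
all.equal(pca$d^2, bip$d^2 * c_n)

# loadings = principal column coordinates
loadings      <- pca$v %*% diag(pca$d)
col_principal <- (bip$v * sqrt(c_n)) %*% diag(bip$d)
all.equal(abs(loadings), abs(col_principal))

# standardized component scores = standard row coordinates
all.equal(abs(pca$u * sqrt(r)), abs(bip$u * sqrt(r)))

# eigenvectors of PCA = standard column coordinates / sqrt(c)
all.equal(abs(pca$v), abs(bip$v * sqrt(c_n)) / sqrt(c_n))

# raw component scores = principal row coordinates * sqrt(c)
raw_scores    <- (pca$u * sqrt(r)) %*% diag(pca$d)
row_principal <- (bip$u * sqrt(r)) %*% diag(bip$d)
all.equal(abs(raw_scores), abs(row_principal) * sqrt(c_n))
```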

Correspondence Analysis (Chi-square model)

This is technically a weighted biplot where the weights are computed from the table itself rather than supplied by the user. It is used mostly to analyze frequency cross-tables. This biplot will approximate, by Euclidean distances on the plot, chi-square distances in the table. Chi-square distance is mathematically the Euclidean distance inversely weighted by the marginal totals. I will not go further into the details of Chi-square model CA geometry.

The preprocessing of frequency table $\bf X$ is as follows: divide each frequency by the expected frequency, then subtract 1. It is the same as to first obtain the frequency residual and then to divide by the expected frequency. Set row weights to $w_i=R_i/N$ and column weights to $w_j=C_j/N$, where $R_i$ is the marginal sum of row i (active columns only), $C_j$ is the marginal sum of column j (active rows only), $N$ is the table total active sum (the three numbers come from the initial table).

Then do weighted biplot: (1) Normalize $\bf X$ into $\bf Z$. (2) The weights are never zero (zero $R_i$ and $C_j$ are not allowed in CA); however you can force rows/columns to become passive by zeroing them in $\bf Z$, so their weights are ineffective at svd. (3) Do svd. (4) Compute standard and inertia-vested coordinates as in weighted biplot.

In Chi-square model CA as well as in Euclidean model CA using two-way centering one last eigenvalue is always 0, so the maximal possible number of principal dimensions is $\min(r-1,c-1)$.
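The whole chi-square model recipe fits in a few lines of base R. This is a sketch on a made-up frequency table with no passive points; it checks two well-known consequences, namely that the total inertia equals the Pearson chi-square statistic divided by $N$ and that the last eigenvalue is zero.

```r
# Hypothetical 4 x 3 frequency table
X <- matrix(c(20, 10,  5,
              15, 25, 10,
               5, 10, 30,
              10,  5, 15), nrow = 4, byrow = TRUE)

N     <- sum(X)
w_row <- rowSums(X) / N                    # row weights R_i / N
w_col <- colSums(X) / N                    # column weights C_j / N
expected <- N * outer(w_row, w_col)        # expected frequencies

Q  <- X / expected - 1                     # preprocessing: observed / expected - 1
Z  <- Q * sqrt(outer(w_row, w_col))        # weighted-biplot normalization
sv <- svd(Z)
inertia <- sv$d^2                          # eigenvalues (principal inertias)

chi2 <- sum((X - expected)^2 / expected)   # Pearson chi-square statistic
all.equal(sum(inertia), chi2 / N)          # total inertia = chi-square / N

inertia[length(inertia)]                   # last eigenvalue: numerically zero

# Symmetric normalization (p1 = p2 = 1/2), first two dimensions
k <- 1:2
rows_sym <- sweep(sv$u[, k] / sqrt(w_row), 2, sv$d[k]^0.5, "*")
cols_sym <- sweep(sv$v[, k] / sqrt(w_col), 2, sv$d[k]^0.5, "*")
```

Here $\min(r-1,c-1)=2$, so the two dimensions kept in `rows_sym` and `cols_sym` carry all of the inertia.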

See also a nice overview of chi-square model CA in this answer.

Illustrations

Here is some data table.

 row     A     B     C     D     E     F
   1     6     8     6     2     9     9
   2     0     3     8     5     1     3
   3     2     3     9     2     4     7
   4     2     4     2     2     7     7
   5     6     9     9     3     9     6
   6     6     4     7     5     5     8
   7     7     9     6     6     4     8
   8     4     4     8     5     3     7
   9     4     6     7     3     3     7
  10     1     5     4     5     3     6
  11     1     5     6     4     8     3
  12     0     6     7     5     3     1
  13     6     9     6     3     5     4
  14     1     6     4     7     8     4
  15     1     1     5     2     4     3
  16     8     9     7     5     5     9
  17     2     7     1     3     4     4
  18     5     3     3     9     6     4
  19     6     7     6     2     9     6
  20    10     7     4     4     8     7

Several dual scatterplots (in 2 first principal dimensions) built on analyses of these values follow. Column points are connected with the origin by spikes for visual emphasis. There were no passive rows or columns in these analyses.

The first biplot is SVD results of the data table analyzed "as is"; the coordinates are the row and the column eigenvectors.

[Figure: biplot of the row and column eigenvectors from SVD of the table as is]

Below is one of possible biplots coming from PCA. PCA was done on the data "as is", without centering the columns; however, as it is adopted in PCA, normalization by the number of rows (the number of cases) was done initially. This specific biplot displays principal row coordinates (i.e. raw component scores) and principal column coordinates (i.e. variable loadings).

[Figure: PCA biplot with principal row and column coordinates]

Next is biplot sensu stricto: The table was initially normalized both by the number of rows and the number of columns. Principal normalization (inertia spreading) was used for both row and column coordinates - as with PCA above. Note the similarity with the PCA biplot: the only difference is due to the difference in the initial normalization.

[Figure: biplot sensu stricto with principal row and column coordinates]

Chi-square model correspondence analysis biplot. The data table was preprocessed in the special manner, it included two-way centering and a normalization using marginal totals. It is a weighted biplot. Inertia was spread over the row and the column coordinates symmetrically - both are halfway between "principal" and "standard" coordinates.

[Figure: chi-square model correspondence analysis biplot, symmetric normalization]

The coordinates displayed on all these scatterplots:

point      dim1_1   dim2_1   dim1_2   dim2_2   dim1_3   dim2_3   dim1_4   dim2_4
1            .290     .247   16.871    3.048    6.887    1.244    -.479    -.101
2            .141    -.509    8.222   -6.284    3.356   -2.565    1.460    -.413
3            .198    -.282   11.504   -3.486    4.696   -1.423     .414    -.820
4            .175     .178   10.156    2.202    4.146     .899    -.421     .339
5            .303     .045   17.610     .550    7.189     .224    -.171    -.090
6            .245    -.054   14.226    -.665    5.808    -.272    -.061    -.319
7            .280     .051   16.306     .631    6.657     .258    -.180    -.112
8            .218    -.248   12.688   -3.065    5.180   -1.251     .322    -.480
9            .216    -.105   12.557   -1.300    5.126    -.531     .036    -.533
10           .171    -.157    9.921   -1.934    4.050    -.789     .433     .187
11           .194    -.137   11.282   -1.689    4.606    -.690     .384     .535
12           .157    -.384    9.117   -4.746    3.722   -1.938    1.121     .304
13           .235     .099   13.676    1.219    5.583     .498    -.295    -.072
14           .210    -.105   12.228   -1.295    4.992    -.529     .399     .962
15           .115    -.163    6.677   -2.013    2.726    -.822     .517    -.227
16           .304     .103   17.656    1.269    7.208     .518    -.289    -.257
17           .151     .147    8.771    1.814    3.581     .741    -.316     .670
18           .198    -.026   11.509    -.324    4.699    -.132     .137     .776
19           .259     .213   15.058    2.631    6.147    1.074    -.459     .005
20           .278     .414   16.159    5.112    6.597    2.087    -.753     .040
A            .337     .534    4.387    1.475    4.387    1.475    -.865    -.289
B            .461     .156    5.998     .430    5.998     .430    -.127     .186
C            .441    -.666    5.741   -1.840    5.741   -1.840     .635    -.563
D            .306    -.394    3.976   -1.087    3.976   -1.087     .656     .571
E            .427     .289    5.556     .797    5.556     .797    -.230     .518
F            .451     .087    5.860     .240    5.860     .240    -.176    -.325
