PCA – What is the Horseshoe Effect and Arch Effect in Correspondence Analysis?

There are many techniques in ecological statistics for exploratory data analysis of multidimensional data. These are called 'ordination' techniques. Many are the same as, or closely related to, common techniques elsewhere in statistics. Perhaps the prototypical example would be principal components analysis (PCA). Ecologists might use PCA, and related techniques, to explore 'gradients' (I am not entirely clear what a gradient is, but I've been reading a little bit about it).

On this page, the last item under Principal Components Analysis (PCA) reads:

  • PCA has a serious problem for vegetation data: the horseshoe effect. This is caused by the curvilinearity of species distributions along gradients. Since species response curves are typically unimodal (i.e. very strongly curvilinear), horseshoe effects are common.

Further down the page, under Correspondence Analysis or Reciprocal Averaging (RA), it refers to "the arch effect":

  • RA has a problem: the arch effect. It is also caused by nonlinearity of distributions along gradients.
  • The arch is not as serious as the horseshoe effect of PCA, because the ends of the gradient are not convoluted.

Can someone explain this? I have recently seen this phenomenon in plots that re-represent data in a lower dimensional space (viz., correspondence analysis and factor analysis).

  1. What would a "gradient" correspond to more generally (i.e., in a non-ecological context)?
  2. If this happens with your data, is it a "problem" ("serious problem")? For what?
  3. How should one interpret output where a horseshoe / arch shows up?
  4. Does a remedy need to be applied? What? Would transformations of the original data help? What if the data are ordinal ratings?

The answers may exist in other pages on that site (e.g., for PCA, CA, and DCA). I have been trying to work through those. But the discussions are couched in sufficiently unfamiliar ecological terminology and examples that it is harder to understand the issue.

Best Answer

Q1

Ecologists talk of gradients all the time. There are lots of kinds of gradients, but it may be best to think of them as some combination of whatever variable(s) you want or are important for the response. So a gradient could be time, or space, or soil acidity, or nutrients, or something more complex such as a linear combination of a range of variables required by the response in some way.

We talk about gradients because we observe species in space or time and a whole host of things vary with that space or time.

Q2

I have come to the conclusion that in many cases the horseshoe in PCA is not a serious problem, provided you understand how it arises and don't do silly things like take only PC1 when the "gradient" is actually represented jointly by PC1 and PC2 (it is also split across higher PCs, but hopefully a 2-d representation is OK).

In CA I guess I think the same (now having been forced to think a bit about it). The solution can form an arch when there is no strong second dimension in the data, such that a folded version of the first axis, which satisfies the orthogonality requirement of the CA axes, explains more "inertia" than any other direction in the data. This may be more serious, as it is made-up structure, whereas with PCA the arch is just a way to represent species abundances at sites along a single dominant gradient.

I've never quite understood why people worry so much about the wrong ordering along PC1 with a strong horseshoe. I would counter that you shouldn't take just PC1 in such cases, and then the problem goes away; the pairs of coordinates on PC1 and PC2 get rid of the reversals on any one of those two axes.
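To make this concrete, here is a minimal sketch in base R. It assumes a made-up, noise-free data set (the names `gradient`, `optima`, `tol` and `abund` are purely illustrative): sites evenly spaced along one gradient, and species with Gaussian, i.e. unimodal, response curves. In data like these, PC1 should broadly track the gradient while PC2 is roughly a quadratic function of it, which is what draws the curve in the PC1-PC2 plane.

```r
# Hypothetical simulated data: 41 sites along a single gradient,
# 21 species with Gaussian (unimodal) response curves.
gradient <- seq(0, 100, length.out = 41)  # site positions on the gradient
optima   <- seq(0, 100, by = 5)           # species optima
tol      <- 15                            # common tolerance (SD of each curve)

# sites x species abundance matrix (no noise added)
abund <- outer(gradient, optima,
               function(x, u) 100 * exp(-(x - u)^2 / (2 * tol^2)))

pca    <- prcomp(abund)   # centred, unscaled PCA
scores <- pca$x[, 1:2]    # site scores on PC1 and PC2

# How PC1 relates to the gradient, and PC2 to the squared (centred) gradient
cor_pc1 <- cor(scores[, 1], gradient)
cor_pc2 <- cor(scores[, 2], (gradient - mean(gradient))^2)
print(c(PC1_vs_gradient = cor_pc1, PC2_vs_gradient_squared = cor_pc2))

# Join the sites in gradient order to see the curved configuration
if (interactive()) {
  plot(scores, type = "b", asp = 1, xlab = "PC1", ylab = "PC2")
}
```

Shrinking `tol` relative to the length of the gradient makes the species curves narrower and should make the curvature more pronounced.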

Q3

If I saw the horseshoe in a PCA biplot, I would interpret the data as having a single dominant gradient or direction of variation.

If I saw the arch, I would probably conclude the same, but I would be very wary of trying to explain CA axis 2 at all.
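The reason for being wary of CA axis 2 can be illustrated with a small sketch. It assumes simulated single-gradient data (all names here are illustrative, not from any real data set) and computes CA by hand via the SVD of the standardised residuals, rather than through a package. Because the data contain only one gradient, any structure on axis 2 has to be a function of that same gradient.

```r
# Hypothetical simulated data: one gradient, Gaussian species responses
gradient <- seq(0, 100, length.out = 41)
optima   <- seq(0, 100, by = 5)
tol      <- 15
abund <- outer(gradient, optima,
               function(x, u) 100 * exp(-(x - u)^2 / (2 * tol^2)))

# Correspondence analysis computed by hand
P        <- abund / sum(abund)   # correspondence matrix
row_mass <- rowSums(P)           # site masses
col_mass <- colSums(P)           # species masses
expected <- outer(row_mass, col_mass)

# Standardised residuals from the independence model, then SVD
S   <- (P - expected) / sqrt(expected)
dec <- svd(S)

# Principal coordinates of the sites: D_r^{-1/2} U D
site_scores <- sweep(dec$u, 2, dec$d, "*") / sqrt(row_mass)
ca1 <- site_scores[, 1]
ca2 <- site_scores[, 2]

inertia  <- dec$d^2              # inertia (eigenvalue) of each axis
arch_cor <- cor(ca2, ca1^2)      # axis 2 as a function of axis 1 squared
print(round(inertia[1:2], 4))
print(arch_cor)

if (interactive()) {
  plot(ca1, ca2, type = "b", xlab = "CA1", ylab = "CA2")
}
```

A strong relationship between `ca2` and `ca1^2` is the signature of the arch: axis 2 is a folded version of axis 1 rather than a second, independent gradient.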

I would not apply DCA. It just twists the arch away (in the best circumstances) so that you don't see the oddities in 2-d plots, but in many cases it produces other spurious structures, such as diamond or trumpet shapes in the arrangement of samples in the DCA space. For example:

```r
library("vegan")
data(BCI)
## decorana() fits the DCA; plot the site scores as points
plot(decorana(BCI), display = "sites", type = "p")
```

[Figure: DCA site scores for the BCI data]

We see a typical fanning out of sample points towards the left of the plot.

Q4

I would suggest that the answer to this question depends on the aims of your analysis. If the arch/horseshoe was due to a single dominant gradient, then rather than have to represent this as $m$ PCA axes, it would be beneficial if we could estimate a single variable that represents the positions of sites/samples along the gradient.

This would suggest finding a nonlinear direction in the high-dimensional space of the data. One such method is the principal curve of Hastie & Stuetzle, but other non-linear manifold methods are available which might suffice.

For example, for some pathological data

[Figure: ordination plot of the pathological data]

We see a strong horseshoe. The principal curve tries to recover this underlying gradient or arrangement/ordering of samples via a smooth curve in the $m$ dimensions of the data. The figure below shows how the iterative algorithm converges on something approximating the underlying gradient. (I think it wanders away from the data at the top of the plot so as to be closer to the data in higher dimensions, and partly because of the self-consistency criterion for a curve to be declared a principal curve.)

[Figure: iterations of the principal curve algorithm converging on the underlying gradient]

I have more details, including code, in the blog post from which I took those images. But the main point here is that the principal curve easily recovers the known ordering of samples, whereas PC1 or PC2 on its own does not.
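As a rough sketch of the idea (not the code from the blog post), one could fit a principal curve with the `princurve` package. The assumptions here are that `princurve` version 2.1.0 or later is installed, so that `principal_curve()` is available, and that the data are a made-up, noise-free single-gradient simulation. The `lambda` component of the fit gives each sample's position along the fitted curve, which can be compared with the known gradient.

```r
# Hypothetical simulated data: one gradient, Gaussian species responses
gradient <- seq(0, 100, length.out = 41)
optima   <- seq(0, 100, by = 5)
tol      <- 15
abund <- outer(gradient, optima,
               function(x, u) 100 * exp(-(x - u)^2 / (2 * tol^2)))

fit <- NULL
# Only run if a recent enough princurve is available (an assumption)
if (requireNamespace("princurve", quietly = TRUE) &&
    packageVersion("princurve") >= "2.1.0") {
  # By default the curve is started from the first principal component
  fit <- princurve::principal_curve(abund)

  # lambda: arc-length position of each sample along the fitted curve.
  # The direction of the curve is arbitrary, so look at the absolute
  # rank correlation with the known gradient.
  print(abs(cor(fit$lambda, gradient, method = "spearman")))
}
```

How well the ordering is recovered will depend on the data, the starting configuration, and the smoother settings, so it is worth inspecting the fitted curve rather than relying on the default fit.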

In the PCA case, it is common to apply transformations in ecology. Popular transformations are those that can be thought of as returning some non-Euclidean distance when the Euclidean distance is computed on the transformed data. For example, the Hellinger distance is

$$D_{\mathrm{Hellinger}}(y_1, y_2) = \sqrt{\sum_{j=1}^p \left [ \sqrt{\frac{y_{1j}}{y_{1+}}} - \sqrt{\frac{y_{2j}}{y_{2+}}} \right ]^2}$$

where $y_{ij}$ is the abundance of the $j$th species in sample $i$, and $y_{i+}$ is the sum of the abundances of all species in the $i$th sample. If we convert the data to proportions and apply a square-root transformation, then the Euclidean distance-preserving PCA will represent the Hellinger distances in the original data.
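A small base R check of this equivalence, using made-up counts (the matrix `y` below is purely illustrative): the Euclidean distance between square-root-transformed proportions matches the Hellinger formula, and a PCA that keeps all components reproduces those distances among the samples. In vegan, `decostand(y, method = "hellinger")` applies the same transformation.

```r
# Made-up abundances: 5 samples (rows) x 4 species (columns)
y <- matrix(c(10, 0, 3, 1,
               4, 6, 0, 2,
               0, 8, 5, 1,
               2, 2, 2, 2,
               7, 1, 0, 9),
            nrow = 5, byrow = TRUE)

# Hellinger transformation: proportions within each sample, then square root
hel <- sqrt(y / rowSums(y))

# Euclidean distances on the transformed data
d_euclid <- dist(hel)

# Hellinger distance between samples 1 and 2, straight from the formula
d12 <- sqrt(sum((sqrt(y[1, ] / sum(y[1, ])) -
                 sqrt(y[2, ] / sum(y[2, ])))^2))

# PCA of the transformed data; with all components kept, distances
# among the sample scores equal the Hellinger distances
pca   <- prcomp(hel)
d_pca <- dist(pca$x)

print(all.equal(as.matrix(d_euclid)[1, 2], d12))
print(all.equal(as.vector(d_pca), as.vector(d_euclid)))
```

In practice only the first few components are plotted, so the ordination is a low-dimensional approximation of these distances rather than an exact representation.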

The horseshoe has been known and studied for a long time in ecology; some of the early literature (plus a more modern look) is

The main principal curve references are

With the former being a very ecological presentation.