MANOVA – How MANOVA is Related to LDA

Tags: anova, discriminant analysis, manova

In several places I saw a claim that MANOVA is like ANOVA plus linear discriminant analysis (LDA), but it was always made in a hand-waving sort of way. I would like to know what exactly it is supposed to mean.

I found various textbooks describing all the details of MANOVA computations, but it seems very hard to find a good general discussion (let alone pictures) accessible to somebody who is not a statistician.

Best Answer

In a nutshell

Both one-way MANOVA and LDA start with decomposing the total scatter matrix $\mathbf T$ into the within-class scatter matrix $\mathbf W$ and the between-class scatter matrix $\mathbf B$, such that $\mathbf T = \mathbf W + \mathbf B$. Note that this is fully analogous to how one-way ANOVA decomposes the total sum-of-squares $T$ into the within-class and between-class sums-of-squares: $T=B+W$. In ANOVA the ratio $B/W$ is then computed and used to find the p-value: the bigger this ratio, the smaller the p-value. MANOVA and LDA construct an analogous multivariate quantity, the matrix $\mathbf W^{-1} \mathbf B$.
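To make this concrete, here is a minimal numpy sketch of the decomposition (the helper name `scatter_matrices` is mine; $\mathbf W^{-1}\mathbf B$ itself is then just `np.linalg.solve(W, B)`):

```python
import numpy as np

def scatter_matrices(X, y):
    """Total, within-class and between-class scatter of data X (n x p) with class labels y."""
    Xc = X - X.mean(axis=0)
    T = Xc.T @ Xc                                   # total scatter
    W = np.zeros_like(T)
    for g in np.unique(y):
        Xg = X[y == g] - X[y == g].mean(axis=0)
        W += Xg.T @ Xg                              # pooled within-class scatter
    B = T - W                                       # between-class scatter, so T = W + B
    return T, W, B
```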

From here on they are different. The sole purpose of MANOVA is to test if the means of all groups are the same; this null hypothesis would mean that $\mathbf B$ should be similar in size to $\mathbf W$. So MANOVA performs an eigendecomposition of $\mathbf W^{-1} \mathbf B$ and finds its eigenvalues $\lambda_i$. The idea is now to test if they are big enough to reject the null. There are four common ways to form a scalar statistic out of the whole set of eigenvalues $\lambda_i$. One way is to take the sum of all eigenvalues (the Hotelling-Lawley trace). Another way is to take the maximal eigenvalue (Roy's largest root); the remaining two, Wilks' lambda and Pillai's trace, combine all the eigenvalues in slightly different ways. In each case, if the chosen statistic indicates strong enough group separation, the null hypothesis is rejected.
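For reference, here are the four statistics written as functions of the eigenvalues (a sketch from the usual textbook formulas; note that Roy's root is sometimes reported as $\lambda_{\max}/(1+\lambda_{\max})$ instead):

```python
import numpy as np

def manova_statistics(eigvals):
    """Classical MANOVA test statistics as functions of the eigenvalues of W^{-1}B."""
    lam = np.asarray(eigvals, dtype=float)
    return {
        "Wilks' lambda":      np.prod(1.0 / (1.0 + lam)),  # small value -> reject
        "Pillai's trace":     np.sum(lam / (1.0 + lam)),   # large value -> reject
        "Hotelling-Lawley":   np.sum(lam),                 # the sum of all eigenvalues
        "Roy's largest root": lam.max(),                   # the maximal eigenvalue
    }
```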

In contrast, LDA performs the eigendecomposition of $\mathbf W^{-1} \mathbf B$ and looks at the eigenvectors (not the eigenvalues). These eigenvectors define directions in the variable space and are called discriminant axes. The projection of the data onto the first discriminant axis has the highest class separation (measured as $B/W$); the projection onto the second one has the second highest; and so on. When LDA is used for dimensionality reduction, the data can be projected e.g. onto the first two axes, and the remaining axes are discarded.
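A corresponding sketch of the LDA side (again, the helper name is mine); since $\mathbf W^{-1}\mathbf B$ is not symmetric in general, a general eigensolver is used and the axes are sorted by decreasing eigenvalue:

```python
import numpy as np
from scipy.linalg import eig

def discriminant_axes(W, B):
    """Eigenvalues and eigenvectors of W^{-1}B, sorted by decreasing eigenvalue."""
    eigvals, eigvecs = eig(np.linalg.solve(W, B))   # W^{-1}B is not symmetric in general
    order = np.argsort(eigvals.real)[::-1]
    return eigvals.real[order], eigvecs.real[:, order]

# Projecting the data onto the discriminant axes: Z = X @ eigvecs.
# The columns of Z are uncorrelated, in order of decreasing class separation B/W.
```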

See also an excellent answer by @ttnphns in another thread which covers almost the same ground.

Example

Let us consider a one-way case with $M=2$ dependent variables and $k=3$ groups of observations (i.e. one factor with three levels). I will take Fisher's well-known Iris dataset and consider only sepal length and sepal width (to make it two-dimensional). Here is the scatter plot:

Fisher Iris scatter plot

We can start by computing ANOVAs on sepal length and sepal width separately. Imagine the data points projected vertically or horizontally onto the x and y axes, and a one-way ANOVA performed on each to test if the three groups have the same means. We get $F_{2,147}=119$ and $p=10^{-31}$ for sepal length, and $F_{2,147}=49$ and $p=10^{-17}$ for sepal width. Okay, so my example is pretty bad, as the three groups are significantly different with ridiculous p-values on both measures, but I will stick to it anyway.
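These two univariate ANOVAs can be reproduced, e.g., with scipy and the copy of the Iris data shipped with scikit-learn (a sketch; it should give the F and p values quoted above, up to rounding):

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data[:, :2], iris.target                 # sepal length and sepal width only

for j, name in enumerate(["sepal length", "sepal width"]):
    groups = [X[y == g, j] for g in np.unique(y)]     # split one variable into the 3 classes
    F, p = f_oneway(*groups)
    print(f"{name}: F = {F:.1f}, p = {p:.2g}")
```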

Now we can perform LDA to find an axis that maximally separates the three clusters. As described above, we compute the total scatter matrix $\mathbf{T}$, the within-class scatter matrix $\mathbf{W}$, and the between-class scatter matrix $\mathbf{B}=\mathbf{T}-\mathbf{W}$, and find the eigenvectors of $\mathbf{W}^{-1}\mathbf{B}$. I can plot both eigenvectors on the same scatterplot:

Fisher Iris LDA

Dashed lines are discriminant axes. I plotted them with arbitrary lengths, but the longer axis shows the eigenvector with larger eigenvalue (4.1) and the shorter one --- the one with smaller eigenvalue (0.02). Note that they are not orthogonal, but the mathematics of LDA guarantees that the projections on these axes have zero correlation.

If we now project our data onto the first (longer) discriminant axis and then run the ANOVA, we get $F=305$ and $p=10^{-53}$, which is lower than before and is in fact the lowest p-value achievable among all linear projections (that was the whole point of LDA). The projection onto the second axis gives only $p=10^{-5}$.

If we run MANOVA on the same data, we compute the same matrix $\mathbf{W}^{-1}\mathbf{B}$ and look at its eigenvalues in order to compute the p-value. In this case the larger eigenvalue is equal to 4.1, which is equal to $B/W$ for ANOVA along the first discriminant (indeed, $F=B/W \cdot (N-k)/(k-1) = 4.1\cdot 147/2 = 305$, where $N=150$ is the total number of data points and $k=3$ is the number of groups).
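Using the `scatter_matrices` and `discriminant_axes` helpers sketched above on the two sepal variables, one can check both the eigenvalues and this $F=\lambda\,(N-k)/(k-1)$ identity (a sketch; the printed numbers should match the ones above up to rounding):

```python
from scipy.stats import f_oneway
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data[:, :2], iris.target           # sepal length and sepal width, 3 classes
N, k = len(X), 3

T, W, B = scatter_matrices(X, y)               # helper sketched in the "In a nutshell" section
lam, V = discriminant_axes(W, B)
print(lam)                                     # roughly [4.1, 0.02]

z = X @ V[:, 0]                                # projection onto the first discriminant axis
F, p = f_oneway(*(z[y == g] for g in range(k)))
print(F, lam[0] * (N - k) / (k - 1))           # these two numbers coincide
```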

There are several commonly used statistical tests that calculate a p-value from the eigenspectrum (in this case $\lambda_1=4.1$ and $\lambda_2=0.02$) and give slightly different results. MATLAB gives me Wilks' test, which reports $p=10^{-55}$. Note that this value is lower than what we had before with any ANOVA, and the intuition here is that MANOVA's p-value "combines" the two p-values obtained with ANOVAs on the two discriminant axes.
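For completeness, the same test can be run in Python with statsmodels (a sketch, assuming the current statsmodels MANOVA API); `mv_test()` reports all four statistics mentioned above, including Wilks' lambda:

```python
import pandas as pd
from sklearn.datasets import load_iris
from statsmodels.multivariate.manova import MANOVA

iris = load_iris()
df = pd.DataFrame(iris.data[:, :2], columns=["sep_len", "sep_wid"])
df["species"] = iris.target_names[iris.target]   # string labels, treated as categorical

maov = MANOVA.from_formula("sep_len + sep_wid ~ species", data=df)
print(maov.mv_test())                            # Wilks, Pillai, Hotelling-Lawley, Roy
```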

Is it possible to get the opposite situation: a higher p-value with MANOVA? Yes, it is. For this we need a situation where only one discriminant axis gives a significant $F$, and the second one does not discriminate at all. I modified the above dataset by adding seven points with coordinates $(8,4)$ to the "green" class (the big green dot represents these seven identical points):

Fisher Iris LDA modified

The second discriminant axis is gone: its eigenvalue is almost zero. ANOVAs on the two discriminant axes give $p=10^{-55}$ and $p=0.26$. But now MANOVA reports only $p=10^{-54}$, which is a bit higher than the ANOVA on the first axis. The intuition behind it is (I believe) that MANOVA's increase of the p-value accounts for the fact that we fitted the discriminant axis to get the minimum possible value, and corrects for a possible false positive. More formally, one would say that MANOVA consumes more degrees of freedom. Imagine that there are 100 variables and only along $\sim 5$ directions one gets $p\approx0.05$ significance; this is essentially multiple testing and those five cases are likely false positives, so MANOVA takes this into account and reports an overall non-significant $p$.
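Here is a sketch of that manipulation, reusing the helpers from above. I am assuming that the "green" class corresponds to versicolor (label 1 in scikit-learn's encoding); if the modified class is actually a different one, the exact numbers will of course differ from those quoted above:

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data[:, :2], iris.target

# add seven identical points at (8, 4) to one class (assumed to be label 1, see above)
X = np.vstack([X, np.tile([8.0, 4.0], (7, 1))])
y = np.concatenate([y, np.full(7, 1)])

T, W, B = scatter_matrices(X, y)                 # helpers sketched earlier
lam, V = discriminant_axes(W, B)
print(lam)                                       # the second eigenvalue should now be near zero

for j in range(2):                               # ANOVA along each discriminant axis
    z = X @ V[:, j]
    print(f_oneway(*(z[y == g] for g in np.unique(y))))
```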

MANOVA vs. LDA as machine learning vs. statistics

This now seems to me an exemplary case of how differently the machine learning community and the statistics community approach the same thing. Every textbook on machine learning covers LDA and shows nice pictures, etc., but never even mentions MANOVA (e.g. Bishop, Hastie, and Murphy). Probably this is because people there are more interested in LDA classification accuracy (which roughly corresponds to the effect size) and have no interest in the statistical significance of the group difference. On the other hand, textbooks on multivariate analysis discuss MANOVA ad nauseam and provide lots of tabulated data (arrrgh), but rarely mention LDA and even more rarely show any plots (e.g. Anderson, or Harris; however, Rencher & Christensen do, and Huberty & Olejnik is even called "MANOVA and Discriminant Analysis").

Factorial MANOVA

Factorial MANOVA is much more confusing, but it is interesting to consider because it differs from LDA in the sense that "factorial LDA" does not really exist, and factorial MANOVA does not directly correspond to any "usual LDA".

Consider a balanced two-way MANOVA with two factors (or independent variables, IVs). One factor (factor A) has three levels, and the other factor (factor B) has two levels, making $3\cdot 2=6$ "cells" in the experimental design (using ANOVA terminology). For simplicity I will only consider two dependent variables (DVs):

factorial MANOVA

In this figure all six "cells" (I will also call them "groups" or "classes") are well separated, which of course rarely happens in practice. Note that there are obviously significant main effects of both factors here, as well as a significant interaction effect (because the upper-right group is shifted to the right; if I moved it to its "grid" position, there would be no interaction effect).

How do MANOVA computations work in this case?

First, MANOVA computes the pooled within-class scatter matrix $\mathbf W$. But the between-class scatter matrix depends on what effect we are testing. Consider the between-class scatter matrix $\mathbf B_A$ for factor A. To compute it, we find the global mean (represented in the figure by a star) and the means conditional on the levels of factor A (represented in the figure by three crosses). We then compute the scatter of these conditional means (weighted by the number of data points in each level of A) relative to the global mean, arriving at $\mathbf B_A$. Now we can consider the usual $\mathbf W^{-1} \mathbf B_A$ matrix, compute its eigendecomposition, and run MANOVA significance tests based on the eigenvalues.

For factor B, there will be another between-class scatter matrix $\mathbf B_B$, and analogously (a bit more complicated, but straightforward) there will be yet another between-class scatter matrix $\mathbf B_{AB}$ for the interaction effect, so that in the end the total scatter matrix is decomposed into a neat $$\mathbf T = \mathbf B_A + \mathbf B_B + \mathbf B_{AB} + \mathbf W.$$ [Note that this decomposition works only for a balanced dataset with the same number of data points in each cell. For an unbalanced dataset, the between-class scatter cannot be uniquely decomposed into a sum of three factor contributions because the factors are not orthogonal anymore; this is similar to the discussion of Type I/II/III sums of squares in ANOVA.]
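Here is a sketch of this decomposition on a synthetic balanced $3\times 2$ design with two DVs (all names are mine; $\mathbf B_{AB}$ is computed from the cell means adjusted for both main effects, and the final check only holds because the design is balanced, as noted above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, nA, nB = 20, 3, 2                                  # 20 points per cell, 3 x 2 design
A = np.repeat(np.arange(nA), n * nB)                  # factor A labels
Bf = np.tile(np.repeat(np.arange(nB), n), nA)         # factor B labels
X = rng.normal(size=(len(A), 2)) + np.c_[2.0 * A, 3.0 * Bf]   # two DVs with main effects

def between_scatter(X, labels):
    """Scatter of group means around the grand mean, weighted by group sizes."""
    gm, S = X.mean(axis=0), np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(labels):
        d = (X[labels == g].mean(axis=0) - gm)[:, None]
        S += (labels == g).sum() * d @ d.T
    return S

def within_scatter(X, labels):
    """Pooled scatter of the points around their own group means."""
    S = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(labels):
        Xg = X[labels == g] - X[labels == g].mean(axis=0)
        S += Xg.T @ Xg
    return S

cells = A * nB + Bf                                   # the 6 cells of the design
Xc = X - X.mean(axis=0)
T = Xc.T @ Xc
W = within_scatter(X, cells)                          # pooled within-cell scatter
B_A, B_B = between_scatter(X, A), between_scatter(X, Bf)

# interaction scatter: cell means relative to what the two main effects alone predict
gm, B_AB = X.mean(axis=0), np.zeros((2, 2))
for i in range(nA):
    for j in range(nB):
        cell = (A == i) & (Bf == j)
        d = (X[cell].mean(axis=0) - X[A == i].mean(axis=0)
             - X[Bf == j].mean(axis=0) + gm)[:, None]
        B_AB += cell.sum() * d @ d.T

assert np.allclose(T, B_A + B_B + B_AB + W)           # holds only because the design is balanced
```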

Now, our main question here is how MANOVA corresponds to LDA. There is no such thing as "factorial LDA". Consider factor A. If we wanted to run LDA to classify the levels of factor A (forgetting about factor B altogether), we would have the same between-class matrix $\mathbf B_A$, but a different within-class scatter matrix $\mathbf W_A=\mathbf T - \mathbf B_A$ (think of merging together the two little ellipsoids within each level of factor A in my figure above). The same is true for the other factor. So there is no "simple LDA" that directly corresponds to the three tests that MANOVA runs in this case.

However, of course nothing prevents us from looking at the eigenvectors of $\mathbf W^{-1} \mathbf B_A$, and from calling them "discriminant axes" for factor A in MANOVA.
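Continuing the sketch above (with `T`, `W`, and `B_A` from the previous snippet assumed to be in scope), the difference between the two choices of within-class matrix is just:

```python
import numpy as np
from scipy.linalg import eig

# MANOVA-style discriminant axes for factor A: eigenvectors of W^{-1} B_A,
# with W the pooled within-cell scatter
vals_manova, vecs_manova = eig(np.linalg.solve(W, B_A))

# a "plain LDA on factor A" would instead use W_A = T - B_A as its within-class scatter
W_A = T - B_A
vals_lda, vecs_lda = eig(np.linalg.solve(W_A, B_A))

# the eigenvalues (and in general the axes) differ, reflecting the discussion above
print(np.sort(vals_manova.real)[::-1], np.sort(vals_lda.real)[::-1])
```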