Solved – Why is the number of components in Linear Discriminant Analysis bounded by the number of classes

dimensionality reduction, discriminant analysis, linear model, pca

Resources about LDA usually say that the number of components is bounded by the number of classes minus 1. E.g., in the binary case, only one component can be found.

In LDA, the first discriminant direction $\Phi_1$ is calculated as argmax of $\frac{\Phi_1^T S_b \Phi_1}{\Phi_1^T S_w \Phi_1}$ where $S_b$ and $S_w$ are the between-class and within-class covariance matrices, respectively. Why can't we continue this way and compute the $i$th direction $\Phi_i$ to be the argmax of $\frac{\Phi_i^T S_b \Phi_i}{\Phi_i^T S_w \Phi_i}$ under the constraint of orthogonality to $\Phi_1, \Phi_2 \dots \Phi_{i-1}$, as is done in PCA?

Ostensibly, in the binary case, where each $\Phi_i$ is a vector, one can do it $n$ times if the inputs are $n$-dimensional vectors.

Best Answer

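The rank of the between-class scatter matrix $S_B$ for the whole data set is at most $c-1$, where $c$ is the number of classes. The individual between-class scatter matrix $S_{Bi}$ for one class has rank at most $1$, since it is the outer product of the vector $\mu_i - \mu$ (class mean minus grand mean) with itself. $S_B$ is the weighted sum of these $c$ matrices, which by itself would give rank at most $c$; but the deviations $\mu_i - \mu$, weighted by the class sizes, sum to zero, so they are linearly dependent and the rank is at most $c-1$.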

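Since $\operatorname{rank}(AB)\le \min(\operatorname{rank}(A), \operatorname{rank}(B))$, you have $\operatorname{rank}(S^{-1}_W S_B)\le \operatorname{rank}(S_B)\le c-1$. The discriminant directions are the eigenvectors of $S^{-1}_W S_B$, so at most $c-1$ of them correspond to a nonzero eigenvalue; the remaining eigenvectors have eigenvalue zero, meaning that the class means coincide when projected onto them.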
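The rank bound can be checked numerically. The following is a minimal sketch on synthetic data; the Gaussian blobs, the seed, and all variable names are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_classes, n_per_class = 5, 3, 40

# Synthetic data (an assumption): each class is a Gaussian blob
# around its own randomly drawn centre.
X = np.vstack([
    rng.normal(loc=rng.normal(scale=3.0, size=n_features),
               size=(n_per_class, n_features))
    for _ in range(n_classes)
])
y = np.repeat(np.arange(n_classes), n_per_class)

grand_mean = X.mean(axis=0)
S_W = np.zeros((n_features, n_features))
S_B = np.zeros((n_features, n_features))
for k in range(n_classes):
    X_k = X[y == k]
    mean_k = X_k.mean(axis=0)
    centred = X_k - mean_k
    S_W += centred.T @ centred               # within-class scatter
    d = (mean_k - grand_mean)[:, None]
    S_B += len(X_k) * (d @ d.T)              # rank-1 term per class

rank_S_B = np.linalg.matrix_rank(S_B)

# Eigenvalues of S_W^{-1} S_B; count those that are numerically nonzero.
eigvals = np.sort(np.linalg.eigvals(np.linalg.solve(S_W, S_B)).real)[::-1]
n_nonzero = int(np.sum(eigvals > 1e-10 * eigvals[0]))

print(rank_S_B, n_nonzero)
```

With 5 features and 3 classes, both printed numbers should be at most $c-1=2$, even though $S_W^{-1}S_B$ is a $5\times 5$ matrix.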

Related Solutions

Discriminant Analysis – Comparing Bayesian and Fisher’s Approaches

I will provide only a short informal answer and refer you to the section 4.3 of The Elements of Statistical Learning for the details.

Update: "The Elements" happen to cover in great detail exactly the questions you are asking here, including what you wrote in your update. The relevant section is 4.3, and in particular 4.3.2-4.3.3.

(2) Are the two approaches related to each other, and if so, how?

They certainly are. What you call the "Bayesian" approach is more general and only assumes a Gaussian distribution for each class. Your log-likelihood is then, up to terms that do not depend on $x$, determined by the squared Mahalanobis distance from $x$ to the centre of each class.

You are of course right that for each class it is a linear function of $x$. However, note that the ratio of the likelihoods for two different classes (that you are going to use in order to perform an actual classification, i.e. choose between classes) -- this ratio is not going to be linear in $x$ if different classes have different covariance matrices. In fact, if one works out boundaries between classes, they turn out to be quadratic, so it is also called quadratic discriminant analysis, QDA.

An important insight is that equations simplify considerably if one assumes that all classes have identical covariance [Update: if you assumed it all along, this might have been part of the misunderstanding]. In that case decision boundaries become linear, and that is why this procedure is called linear discriminant analysis, LDA.

It takes some algebraic manipulations to realize that in this case the formulas actually become exactly equivalent to what Fisher worked out using his approach. Think of that as a mathematical theorem. See Hastie's textbook for all the math.
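As a sketch of the algebra involved (the notation follows Hastie et al., section 4.3; the class priors $\pi_k$ are an assumption not mentioned above), the Gaussian log-posterior for class $k$ is, up to a constant shared by all classes,

$$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log\pi_k,$$

which is quadratic in $x$. If all classes share the same covariance, $\Sigma_k=\Sigma$, then the terms $\log|\Sigma|$ and $x^T\Sigma^{-1}x$ cancel in the comparison between any two classes $k$ and $l$:

$$\delta_k(x)-\delta_l(x) = \log\frac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l),$$

which is linear in $x$, so the boundary $\delta_k(x)=\delta_l(x)$ is a hyperplane.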

(1) Can we do dimension reduction using the Bayesian approach?

If by "Bayesian approach" you mean dealing with different covariance matrices in each class, then no. At least it will not be a linear dimensionality reduction (unlike LDA), because of what I wrote above.

However, if you are happy to assume a shared covariance matrix, then yes, certainly, because the "Bayesian approach" is then simply equivalent to LDA. But if you check Hastie 4.3.3, you will see that the correct projections are not given by $\Sigma^{-1} \mu_k$ as you wrote, but by the first [generalized] eigenvectors of $\boldsymbol \Sigma^{-1} \mathbf{M}$, where $\mathbf{M}$ is the covariance matrix of the class centroids $\mu_k$. (I don't even understand what $\Sigma^{-1} \mu_k$ should mean as a projection: it depends on $k$, whereas a projection usually means a way to project all points from all classes onto the same lower-dimensional manifold.)
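A minimal NumPy sketch of this recipe, assuming synthetic balanced classes and a pooled within-class covariance estimate for $\boldsymbol\Sigma$ (the data, seed, and variable names are illustrative assumptions, not taken from Hastie et al.):

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_classes, n_per_class = 4, 3, 50

# Synthetic data (an assumption): Gaussian blobs with unit covariance.
centres = rng.normal(scale=4.0, size=(n_classes, n_features))
X = np.vstack([rng.normal(loc=c, size=(n_per_class, n_features))
               for c in centres])
y = np.repeat(np.arange(n_classes), n_per_class)

# Class centroids mu_k and the pooled (shared) within-class covariance Sigma.
mu = np.array([X[y == k].mean(axis=0) for k in range(n_classes)])
residuals = X - mu[y]
Sigma = residuals.T @ residuals / (len(X) - n_classes)

# M: covariance matrix of the class centroids (classes are balanced here).
M = np.cov(mu, rowvar=False)

# Eigenvectors of Sigma^{-1} M, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sigma, M))
order = np.argsort(eigvals.real)[::-1]
eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]

# Keep the leading c - 1 directions; the SAME matrix W projects every
# point, regardless of its class.
W = eigvecs[:, : n_classes - 1]
Z = X @ W
print(Z.shape)  # (150, 2)
```

Unlike $\Sigma^{-1}\mu_k$, the matrix `W` does not depend on $k$: all points from all classes are mapped into the same $(c-1)$-dimensional space.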

Solved – Comparing four formulations of class scatter matrices

Let's go over your four definitions one by one.

1. Duda et al. 2012. These are the standard definitions of scatter matrices: within-class, between-class, and the total scatter matrix. They obey a nice and useful property $$S_W+S_B=S_T,$$ so one can talk about the "decomposition of the scatter matrix" similar to the "decomposition of the sum of squares" in a univariate situation (one-way ANOVA). For the purposes of linear discriminant analysis (LDA), one only needs the product $S_W^{-1}S_B$.

A scatter matrix differs from the corresponding covariance matrix only by a scalar multiplier: the sample covariance matrix is equal to the scatter matrix divided by $n$ (for the maximum likelihood estimate) or by $n-1$ (for the unbiased estimate).

2. Webb 2002. These definitions differ from (1) only by the $1/n$ factor; otherwise they are identical. Because this factor cancels, the product $S_W^{-1}S_B$ computed using these definitions is identical to the one in (1), so definitions (1) and (2) are equivalent as far as LDA is concerned.

Of course, Webb's $S_T$ is just the sample covariance matrix (ML estimate), so one might think that these definitions simply replace scatter matrices with covariance matrices. But the situation is tricky here, because the between-class covariance matrix is usually estimated with a $C-1$ denominator (instead of $n$) and the within-class covariance matrix with an $n-C$ denominator: these are the respective degrees of freedom. If one uses these factors, then the decomposition of the total covariance matrix into between-class and within-class covariance matrices does not hold (and using the same factor for both does not make much sense). This is why it is easier to work with scatter matrices instead of covariance matrices and side-step these problems.

The reason Webb 2002 uses the $1/n$ factor is probably so that his $S_T$ equals the total covariance matrix, which is a very familiar object. However, if Webb uses the $1/n$ factor and still calls these "scatter matrices", then that is very non-standard terminology.

3. Johnson and Wichern 2007. This is a non-standard definition of the between-class scatter matrix (the within-class one is the same here as in (1)), and the authors do not seem to motivate it in their textbook. So I can only guess at the rationale behind it; see my answer to "What is the correct formula for between-class scatter matrix in LDA?". As I wrote there, this approach can actually be useful when the classes are unbalanced (different numbers of data points per class). One can call this a "re-balanced between-class scatter matrix".

4. @amoeba 2015. The between-class scatter matrix from (3) does not turn into the standard between-class scatter matrix from (1) when all $n_i$ are equal to each other: a scalar factor $\bar n$ is missing.

Another problem with (3) is that there is no meaningful definition of a total scatter matrix that preserves the decomposition equation $S_W+S_B=S_T$. Definitions (4) were my attempt to suggest a set of definitions for re-balanced scatter matrices so that they (i) preserve the decomposition property and (ii) reduce to the standard definitions (1) when classes are balanced.

The idea is that maybe you have unequal $n_i$ due to some experimental limitations, but you would still like to have a guess at what would happen if the $n_i$ were equal (perhaps you expect them to be equal in the test dataset or in the future). So even for the within-class covariance matrix you want to weigh the contribution of each class equally.

No, I have never seen it described in the literature.
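The properties discussed under definitions (1) and (2) can be verified numerically. This is a minimal sketch on synthetic, deliberately unbalanced data (class sizes, seed, and variable names are assumptions for illustration). It checks the decomposition $S_W+S_B=S_T$ for the standard scatter matrices, the invariance of $S_W^{-1}S_B$ to a common $1/n$ factor, and the failure of the decomposition when degrees-of-freedom denominators are used.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [20, 35, 50]   # deliberately unbalanced classes (an assumption)
n_features = 3
X = np.vstack([
    rng.normal(loc=rng.normal(scale=2.0, size=n_features),
               size=(n_k, n_features))
    for n_k in sizes
])
y = np.repeat(np.arange(len(sizes)), sizes)
n, C = len(X), len(sizes)

# Standard scatter matrices, as in definition (1).
grand_mean = X.mean(axis=0)
S_T = (X - grand_mean).T @ (X - grand_mean)
S_W = np.zeros_like(S_T)
S_B = np.zeros_like(S_T)
for k in range(C):
    X_k = X[y == k]
    mean_k = X_k.mean(axis=0)
    S_W += (X_k - mean_k).T @ (X_k - mean_k)
    S_B += len(X_k) * np.outer(mean_k - grand_mean, mean_k - grand_mean)

# (1): the decomposition S_W + S_B = S_T.
decomposition_holds = np.allclose(S_W + S_B, S_T)

# (2): a common 1/n factor cancels in the product S_W^{-1} S_B.
same_product = np.allclose(np.linalg.solve(S_W, S_B),
                           np.linalg.solve(S_W / n, S_B / n))

# Degrees-of-freedom denominators break the decomposition.
df_decomposition_holds = np.allclose(
    S_W / (n - C) + S_B / (C - 1), S_T / (n - 1))

print(decomposition_holds, same_product, df_decomposition_holds)
# True True False
```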
