Solved – Why is Linear Discriminant Analysis a linear classifier

classification, discriminant analysis, linear model, machine learning

Question

As the title suggests, I am wondering why we consider LDA a linear classifier.

More specifically, I would like to know why LDA is considered a linear classifier in the following case:

  • The response consists of two classes, coded as 1 and 0.
  • The threshold for a given observation to be classified as 1 is $a$, where $a \in [0,\,1]$.

Current attempt

A simpler problem

The second condition is important: I already know why LDA is considered a linear classifier when an observation is classified as "1" whenever the estimated posterior probability of that class exceeds 0.5.

Namely, one proof uses the discriminant functions corresponding to the two classes. Since the ranking of the discriminant functions always matches the ranking of the posterior probabilities, $p_1(X) = p_0(X) \iff \delta_1(X) = \delta_0(X)$. This equation is the decision boundary, and since $\delta_1(X) = \delta_0(X)$ describes a hyperplane in $X$, we consider the model linear.
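For concreteness, under the usual shared-covariance Gaussian assumptions these discriminant functions take the standard form

$$
\delta_c(X) = X'\Sigma^{-1}\mu_c - \frac{1}{2}\mu_c'\Sigma^{-1}\mu_c + \log\pi_c,
$$

which is affine in $X$, so $\delta_1(X) = \delta_0(X)$ is indeed a hyperplane.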

Current problem

In this more general problem, I can no longer use $\delta_1(X) = \delta_0(X)$ to derive the decision boundary. Instead, the decision boundary is of the form $p_1(X) = a$. I don't see a direct way to show that this equation is linear in $X$, so I wonder whether it can be rewritten in an equivalent form $\delta_1(X) = b$, where $b$ is either independent of $X$ or linear in $X$; the conclusion that LDA is linear in this case would then follow. I just cannot see how to derive such a $b$, either.

Best Answer

Each observation belongs to exactly one of $C$ classes, indexed $c=1,\ldots,C$, and a vector of $K$ predictors $x$ is informative about class membership.

We are interested in the class posteriors $Pr(G\mid X)$. Let $f_c(x)$ denote the density of $X$ in class $G=c$, and $\pi_c$ the prior probability of class $c$. Then
$$
Pr(G=c\mid X=x)=\frac{f_c(x)\pi_c}{\sum_{\ell=1}^C f_\ell(x)\pi_\ell}.
$$
Classical LDA assumes multivariate Gaussianity,
$$
f_c(x)=\frac{1}{(2\pi)^{K/2}|\Sigma_c|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu_c)'\Sigma_c^{-1}(x-\mu_c)\right\},
$$
together with a common covariance matrix,
$$
\Sigma_c=\Sigma\qquad c=1,\ldots,C.
$$
The log-odds then become, as the normalization factors and the quadratic terms cancel when $\Sigma_c=\Sigma$,
\begin{eqnarray*}
\log\frac{Pr(G=c\mid X=x)}{Pr(G=d\mid X=x)}&=&\log\frac{f_c(x)}{f_d(x)}+\log\frac{\pi_c}{\pi_d}\\
&=&\log\frac{\pi_c}{\pi_d}-\frac{1}{2}(\mu_c+\mu_d)'\Sigma^{-1}(\mu_c-\mu_d)\\
&&\quad +x'\Sigma^{-1}(\mu_c-\mu_d)
\end{eqnarray*}
The decision boundary between any two classes is therefore linear in $x$, hence the name.
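To tie this back to the threshold $a$ in the question (two classes coded 1 and 0): thresholding the posterior at $a$ is equivalent to thresholding the log-odds, because the logit transformation is monotone,
$$
p_1(x) > a \iff \log\frac{Pr(G=1\mid X=x)}{Pr(G=0\mid X=x)} > \log\frac{a}{1-a},
$$
and the right-hand side is a constant that does not depend on $x$. Setting the linear log-odds above equal to $\log\frac{a}{1-a}$ therefore still defines a hyperplane for any $a\in(0,1)$; changing $a$ only shifts that hyperplane, it does not bend it.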

Not assuming $\Sigma_c=\Sigma$ would instead lead to a quadratic decision boundary, a.k.a. quadratic discriminant analysis (QDA).
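As a numerical sanity check of the linear case, here is a minimal Python/NumPy sketch for two classes (the means, covariance, and priors are made-up illustration values): it verifies that the log-odds above is an affine function $w'x + b$ of $x$, and that thresholding the posterior at an arbitrary $a$ only shifts the resulting hyperplane.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes with a shared covariance matrix (the classical LDA assumption).
# All parameter values below are arbitrary illustration choices.
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
pi0, pi1 = 0.6, 0.4
Sigma_inv = np.linalg.inv(Sigma)

def log_odds_direct(x):
    """log Pr(G=1 | x) / Pr(G=0 | x), computed straight from the Gaussian densities."""
    def log_kernel(x, mu):
        d = x - mu
        return -0.5 * d @ Sigma_inv @ d  # normalizing constants cancel in the ratio
    return log_kernel(x, mu1) - log_kernel(x, mu0) + np.log(pi1 / pi0)

# The same quantity written as an affine function w'x + b (last line of the derivation).
w = Sigma_inv @ (mu1 - mu0)
b = np.log(pi1 / pi0) - 0.5 * (mu1 + mu0) @ Sigma_inv @ (mu1 - mu0)

x = rng.normal(size=2)
assert np.isclose(log_odds_direct(x), w @ x + b)

# Thresholding the posterior p_1(x) at any a in (0, 1) is the same as thresholding
# the *linear* log-odds at the constant log(a / (1 - a)): the boundary stays a hyperplane.
a = 0.8
p1 = 1.0 / (1.0 + np.exp(-(w @ x + b)))
assert (p1 > a) == (w @ x + b > np.log(a / (1 - a)))
```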

Here is a little visualization with three classes and two predictors:

[Figure: LDA decision regions for three classes with two predictors]
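For readers who want to recreate a plot of this kind, here is a hedged sketch using scikit-learn's LinearDiscriminantAnalysis; the simulated class means, shared covariance, and sample sizes are arbitrary illustration values.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Three Gaussian classes sharing one covariance matrix, two predictors.
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0]), np.array([1.0, 3.0])]
Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])
X = np.vstack([rng.multivariate_normal(m, Sigma, size=200) for m in means])
y = np.repeat([0, 1, 2], 200)

lda = LinearDiscriminantAnalysis().fit(X, y)

# Colour a grid by the predicted class; with a shared covariance the
# borders between the coloured regions come out as straight lines.
xx, yy = np.meshgrid(np.linspace(-4, 7, 300), np.linspace(-4, 7, 300))
grid = np.column_stack([xx.ravel(), yy.ravel()])
zz = lda.predict(grid).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, s=8)
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.title("LDA: linear decision boundaries")
plt.show()
```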
