Discriminant Analysis – The Discriminant Function in Linear Discriminant Analysis


This is the Bayes posterior probability for class $k$ (the covariance $\Sigma$ is the same for all $K$ classes); the observation is assigned to the class with the largest $p_k(x)$, where $\pi_k$ is the prior probability that the response variable belongs to the $k$th class:

$$ p_k(x) = \frac{\pi_k \, f_k(x)}{\sum_{l=1}^{K} \pi_l \, f_l(x)} $$

This is a statement from the book An Introduction to Statistical Learning: taking the log of the Bayes equation above and rearranging the terms, it can be shown that this is equivalent to assigning the observation to the class for which the equation below is largest.

$$ \delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k $$

The summation term still remains; how do we go about eliminating it?

Best Answer

it can be shown that this is equivalent to assigning the observation to the class for which the equation below is largest

Deriving the discriminant function for LDA

For LDA we assume that the random variable $X$ is a vector $\mathbf{X} = (X_1, X_2, \dots, X_p)$ drawn from a multivariate Gaussian with a class-specific mean vector and a common covariance matrix $\Sigma$. In other words, the covariance matrix is common to all $K$ classes: $\mathrm{Cov}(X) = \Sigma$, which has shape $p \times p$.

Since $x$ follows a multivariate Gaussian distribution, the density $f_k(x) = p(X = x \mid Y = k)$ is given by:

$$ f_k(x) = \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)$$

($\mu_k$ is the mean of inputs for category $k$)
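
As a quick numerical illustration (a minimal sketch with made-up means, covariance, and observation, not taken from the book), $f_k(x)$ can be evaluated directly with scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters for illustration: p = 2 features, K = 2 classes
mu = {0: np.array([0.0, 0.0]),        # class-specific mean vectors mu_k
      1: np.array([2.0, 1.0])}
Sigma = np.array([[1.0, 0.3],         # common covariance matrix Sigma (p x p)
                  [0.3, 1.0]])

x = np.array([1.0, 0.5])              # a single observation

# f_k(x): multivariate Gaussian density of x under each class
f = {k: multivariate_normal(mean=mu[k], cov=Sigma).pdf(x) for k in mu}
print(f)
```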

Assume that we know the prior distribution exactly: $P(Y = k) = \pi_k$. Then we can find the posterior distribution using Bayes' theorem as

$$ p_k(x) = p(Y = k | X = x) = \frac{f_k(x) \ \pi_k}{P(X = x)} = C \times f_k(x) \ \pi_k $$

The summation term still remains; how do we go about eliminating it?

If you look at the term containing the summation, it is actually equal to $P(X = x) = \sum_{l=1}^{K} \pi_l f_l(x)$ in the equation above (by the law of total probability). Since $P(X = x)$ does not depend on $k$, and we are only interested in the terms that are functions of $k$ (see below), we can absorb it into the constant $C$.
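
To see this numerically, here is a toy sketch (reusing the made-up parameters above plus an assumed prior $\pi_k$): dividing by the summation term rescales every class by the same factor, so the class with the largest unnormalized $\pi_k f_k(x)$ is also the class with the largest posterior $p_k(x)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Same hypothetical setup as before, plus an assumed prior pi_k
mu = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 1.0])}
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
pi = {0: 0.6, 1: 0.4}
x = np.array([1.0, 0.5])

f = {k: multivariate_normal(mean=mu[k], cov=Sigma).pdf(x) for k in mu}

unnormalized = {k: pi[k] * f[k] for k in mu}        # pi_k * f_k(x)
P_x = sum(unnormalized.values())                    # the summation term = P(X = x)
posterior = {k: unnormalized[k] / P_x for k in mu}  # p_k(x)

# Dividing by P(X = x) rescales every class by the same factor,
# so the argmax over k is unchanged.
assert max(unnormalized, key=unnormalized.get) == max(posterior, key=posterior.get)
```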

We will now proceed to expand and simplify the algebra, absorbing all constant terms into $C, C', C''$, etc.

$$ \begin{aligned} p_k(x) &= p(Y = k \mid X = x) = \frac{f_k(x) \ \pi_k}{P(X = x)} = C \times f_k(x) \ \pi_k \\ &= C \ \pi_k \ \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right) \\ &= C' \pi_k \ \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right) \end{aligned} $$

Take the log of both sides:

$$ \begin{aligned} \log p_k(x) &= \log \left( C' \pi_k \ \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right) \right) \\ &= \log C' + \log \pi_k - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \end{aligned} $$

Since the term $\log C'$ does not depend on $k$ and we aim to maximize the posterior probability over $k$, we can ignore it:

$$ \begin{aligned} & \log \pi_k - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \\ &= \log \pi_k - \frac{1}{2} \left[ x^T \Sigma^{-1} x + \mu^T_k \Sigma^{-1} \mu_k \right] + x^T \Sigma^{-1} \mu_k \\ &= C'' + \log \pi_k - \frac{1}{2} \mu^T_k \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k \end{aligned} $$
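
For reference, the middle line uses the standard expansion of the quadratic form; because $\Sigma^{-1}$ is symmetric, the two cross terms are equal and combine into $x^T \Sigma^{-1} \mu_k$:

$$ (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) = x^T \Sigma^{-1} x - 2\, x^T \Sigma^{-1} \mu_k + \mu^T_k \Sigma^{-1} \mu_k $$

and the term $-\frac{1}{2} x^T \Sigma^{-1} x$ does not depend on $k$, so it is absorbed into $C''$.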

And so the objective function, sometimes called the linear discriminant function, is:

$$ \delta_k(x) = \log \pi_k - \frac{1}{2} \mu^T_k \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k $$

This means that, given an input $x$, we predict the class with the highest value of $\delta_k(x)$.

See here for an implementation in Python
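
For a rough end-to-end illustration of the derivation (this is a sketch with ad hoc names and made-up data, not the linked implementation), the following NumPy code estimates $\pi_k$, $\mu_k$, and the pooled $\Sigma$ from training data and classifies a new point by the largest $\delta_k(x)$:

```python
import numpy as np

def lda_fit(X, y):
    """Estimate pi_k, mu_k and the pooled covariance Sigma from training data."""
    classes = np.unique(y)
    n, p = X.shape
    pi, mu = {}, {}
    Sigma = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        pi[k] = len(Xk) / n                      # prior pi_k
        mu[k] = Xk.mean(axis=0)                  # class mean mu_k
        Sigma += (Xk - mu[k]).T @ (Xk - mu[k])   # within-class scatter
    Sigma /= (n - len(classes))                  # pooled (common) covariance estimate
    return pi, mu, Sigma

def lda_predict(x, pi, mu, Sigma):
    """Assign x to the class with the largest discriminant delta_k(x)."""
    Sigma_inv = np.linalg.inv(Sigma)
    def delta(k):
        return (np.log(pi[k])
                - 0.5 * mu[k] @ Sigma_inv @ mu[k]
                + x @ Sigma_inv @ mu[k])
    return max(pi, key=delta)

# Toy usage with made-up data
rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=50)
X1 = rng.multivariate_normal([2, 1], [[1, 0.3], [0.3, 1]], size=50)
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

pi, mu, Sigma = lda_fit(X, y)
print(lda_predict(np.array([1.8, 0.9]), pi, mu, Sigma))   # likely class 1
```

Note that $\delta_k(x)$ is linear in $x$, which is why the method is called linear discriminant analysis.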