Deriving the discriminant function for LDA

It can be shown that maximizing the posterior probability is equivalent to assigning the observation to the class for which the discriminant function derived below is largest.
For LDA we assume that the random variable $X$ is a vector $\mathbf{X} = (X_1, X_2, \ldots, X_p)$ drawn from a multivariate Gaussian with a class-specific mean vector and a common covariance matrix $\Sigma$. In other words, the covariance matrix is common to all $K$ classes: $\mathrm{Cov}(X) = \Sigma$, a $p \times p$ matrix.
Since $X$ follows a multivariate Gaussian distribution, the class-conditional density $f_k(x) = p(X = x | Y = k)$ is given by:
$$ f_k(x) = \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)$$
where $\mu_k$ is the mean vector of the inputs for class $k$.
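As a quick numeric sketch, $f_k(x)$ can be evaluated directly with scipy; the means, shared covariance, and query point below are made up for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class example in p = 2 dimensions.
mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]  # class means mu_k (made up)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])                     # shared covariance matrix

x = np.array([1.0, 0.5])

# f_k(x) for each class, all using the common covariance Sigma.
f = [multivariate_normal(mean=m, cov=Sigma).pdf(x) for m in mu]
print(f)
```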
Assume that we know the prior distribution exactly: $P(Y = k) = \pi_k$. Then, by Bayes' theorem, the posterior distribution is

$$ p_k(x) = p(Y = k | X = x) = \frac{\pi_k \, f_k(x)}{P(X = x)} = \frac{\pi_k \, f_k(x)}{\sum_{l=1}^{K} \pi_l \, f_l(x)} = C \times \pi_k \, f_k(x) $$
The summation term in the denominator remains; how do we go about eliminating it? Note that this sum is exactly $P(X = x)$. Since $P(X = x)$ does not depend on $k$, and we are only interested in the terms that are functions of $k$ (see below), we can absorb it into a constant $C$.
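To check numerically that dropping the normalizing constant cannot change which class attains the maximum, here is a small sketch (all values made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]  # class means (made up)
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])         # shared covariance
pi = np.array([0.6, 0.4])                          # priors pi_k

x = np.array([1.0, 0.5])
fx = np.array([multivariate_normal(mean=m, cov=Sigma).pdf(x) for m in mu])

unnormalized = pi * fx                         # pi_k * f_k(x)
posterior = unnormalized / unnormalized.sum()  # divide by P(X = x)

# The normalizing constant does not change which class wins.
assert np.argmax(unnormalized) == np.argmax(posterior)
print(posterior)
```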
We will now expand and simplify the algebra, absorbing all constant terms into $C, C', C''$, etc.:
$$
\begin{aligned}
p_k(x) &= p(Y = k | X = x) = \frac{\pi_k \, f_k(x)}{P(X = x)} = C \times \pi_k \, f_k(x)
\\
&= C \, \pi_k \, \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)
\\
&= C' \, \pi_k \, \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)
\end{aligned}
$$
Take the log of both sides:
$$
\begin{aligned}
\log p_k(x) &= \log \left( C' \, \pi_k \, \exp \left( - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right) \right)
\\
&= \log C' + \log \pi_k - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)
\end{aligned}
$$
Since the term $\log C'$ does not depend on $k$ and we aim to maximize the posterior probability over $k$, we can ignore it. Expanding the quadratic form, the two cross terms are equal because $\Sigma^{-1}$ is symmetric ($x^T \Sigma^{-1} \mu_k = \mu^T_k \Sigma^{-1} x$), so they combine:

$$
\begin{aligned}
& \log \pi_k - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)
\\
&= \log \pi_k - \frac{1}{2} \left[ x^T \Sigma^{-1} x + \mu^T_k \Sigma^{-1} \mu_k \right] + x^T \Sigma^{-1} \mu_k
\\
&= C'' + \log \pi_k - \frac{1}{2} \mu^T_k \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k
\end{aligned}
$$

where $C''$ absorbs the term $-\frac{1}{2} x^T \Sigma^{-1} x$, which depends on $x$ but not on $k$.
And so the objective function, sometimes called the linear discriminant function, is:
$$ \delta_k(x) = \log \pi_k - \frac{1}{2} \mu^T_k \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k $$
This means that, given an input $x$, we predict the class with the highest value of $\delta_k(x)$.
See here for an implementation in Python.
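A minimal sketch of such an implementation, assuming we estimate $\pi_k$, $\mu_k$, and the pooled $\Sigma$ from labeled training data (the helper names and data below are made up for illustration):

```python
import numpy as np

def lda_fit(X, y):
    """Estimate priors pi_k, class means mu_k, and the pooled covariance Sigma."""
    classes = np.unique(y)
    pi = np.array([np.mean(y == k) for k in classes])
    mu = np.array([X[y == k].mean(axis=0) for k in classes])
    n, p = X.shape
    Sigma = np.zeros((p, p))
    for k, m in zip(classes, mu):
        d = X[y == k] - m
        Sigma += d.T @ d
    Sigma /= n - len(classes)   # pooled (common) covariance estimate
    return classes, pi, mu, Sigma

def lda_predict(x, classes, pi, mu, Sigma):
    """Assign x to the class maximizing delta_k(x)."""
    Sinv = np.linalg.inv(Sigma)
    delta = [np.log(pk) - 0.5 * mk @ Sinv @ mk + x @ Sinv @ mk
             for pk, mk in zip(pi, mu)]
    return classes[int(np.argmax(delta))]

# Made-up training data: two Gaussian blobs with a shared covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([2.0, 1.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

params = lda_fit(X, y)
print(lda_predict(np.array([1.5, 0.8]), *params))
```

On data like this the result should agree with a library implementation such as scikit-learn's LinearDiscriminantAnalysis, up to estimation details.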
Best Answer
I found the following: Raudys and Duin, "Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix".
A wider search, looking at covariance estimation more generally, has also been fruitful. Under assumptions on the data distribution, the following papers, found via the references in Vershynin's paper, look interesting:
Adamczak et al., "Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles"
Adamczak et al., "Sharp bounds on the rate of convergence of the empirical covariance matrix"
Rudelson, "Random vectors in the isotropic position"
Vershynin, "How close is the sample covariance matrix to the actual covariance matrix?"
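As a rough numerical illustration of the question these papers address (all values below are made up), one can track the operator-norm error $\| \hat{\Sigma} - \Sigma \|$ of the empirical covariance matrix as the sample size $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 20
# Made-up ground-truth covariance: a random symmetric positive-definite matrix.
A = rng.normal(size=(p, p))
Sigma = A @ A.T / p + np.eye(p)

for n in [50, 200, 1000, 5000]:
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Sigma_hat = X.T @ X / n                          # empirical covariance (mean known to be 0)
    err = np.linalg.norm(Sigma_hat - Sigma, ord=2)   # operator (spectral) norm
    print(f"n = {n:5d}   error = {err:.3f}")
```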