Solved – Differences between linear discriminant analysis and Gaussian mixture models

Tags: differences, discriminant-analysis, gaussian-mixture-distribution

I know there are existing topics about this question, but in my view the answers are not clear enough. I don't understand the main difference between Linear Discriminant Analysis (LDA) and Gaussian Mixture Models (GMM).

Both have the same purpose: determine the posterior $P(G=j|X=x)$ and maximize it over $j$ in order to assign class $j$ to $x$. I have the feeling that the difference lies in how the parameters are estimated in a GMM (the EM algorithm). Or maybe the difference is that in LDA we want to draw a hyperplane in order to classify any future data?
Because we agree that, basically, LDA data correspond to a Gaussian mixture model. It's just the way the parameters are estimated that differs, no?
Well, as you can see, I'm a bit confused. I hope someone can explain this to me.

Best Answer

The building blocks of LDA and GMM are similar, i.e. both are built from Gaussians, but there are important differences. In a GMM we are trying to estimate a density of the following form:
$ p(\boldsymbol{x}\mid\theta) = \sum_{z=1}^K \pi_z \, \mathcal{N}(\boldsymbol{x} \mid \tilde{\boldsymbol{\mu}}_z, \tilde{\boldsymbol{\Sigma}}_z) $

This is a density estimation problem: we are trying to estimate the density of an arbitrary distribution. The variable $z$ is a hidden (latent) variable, and the parameters $(\pi_z, \tilde{\boldsymbol{\mu}}_z, \tilde{\boldsymbol{\Sigma}}_z)$ are obtained via the EM algorithm. If you would like to do supervised classification for two classes, you would train one model per class, $p(\boldsymbol{x}\mid\theta_1)$ and $p(\boldsymbol{x}\mid\theta_2)$, and select the class whose model gives the largest likelihood:

$ \hat{y}=\underset{y}{\operatorname{arg\,max}}\, p(\boldsymbol{ x}|\theta_y) $
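As a rough illustration (my own sketch, not from the original answer), here is what this per-class density-estimation approach can look like with scikit-learn's `GaussianMixture`; the toy data, the choice of two mixture components per class, and the equal-prior assumption are all made up for the example:

```python
# Sketch of the "one GMM per class" idea: fit a separate mixture to each class
# and pick the class whose fitted density gives the higher log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy 2-D data: each class is itself bimodal, so a single Gaussian per class
# would fit poorly, but a 2-component GMM per class can capture it.
X0 = np.vstack([rng.normal([0, 0], 0.5, (100, 2)),
                rng.normal([3, 3], 0.5, (100, 2))])
X1 = np.vstack([rng.normal([0, 3], 0.5, (100, 2)),
                rng.normal([3, 0], 0.5, (100, 2))])

# Density-estimation step: EM fits (pi_z, mu_z, Sigma_z) for each class separately.
gmm0 = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X0)
gmm1 = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X1)

# Classification step: arg max over classes of the fitted log-likelihood
# (add log-priors here if the classes are not balanced).
x_new = np.array([[2.8, 0.2]])
log_lik = np.array([gmm0.score_samples(x_new), gmm1.score_samples(x_new)]).ravel()
print("predicted class:", np.argmax(log_lik))   # expected: 1
```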

LDA approaches the problem by assuming that the class-conditional densities $p(x\mid y=0)$ and $p(x\mid y=1)$ are multivariate normal distributions with mean and covariance parameters $(\mu_0, \Sigma_0)$ and $(\mu_1, \Sigma_1)$. You would select a class as follows:

$ \hat{y}=\underset{y}{\operatorname{arg\,max}}\, p(y|x)=\underset{y}{\operatorname{arg\,max}}\, p(x|y)p(y) $

where $p(y)$ is the class prior. With some algebra one can show that this is equivalent to the rule:

$ (x-\mu_0)^T \Sigma_0^{-1} (x-\mu_0) + \ln|\Sigma_0| - (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) - \ln|\Sigma_1| \ > \ T $

where we predict a point as belonging to the second class if the log of the likelihood ratio $\ln\big(p(x\mid y=0)/p(x\mid y=1)\big)$ is below some threshold; multiplying that condition by $-2$ gives the inequality above with threshold $T$. With separate covariances $\Sigma_0 \neq \Sigma_1$ this boundary is quadratic (QDA); under the LDA assumption of a shared covariance $\Sigma_0 = \Sigma_1 = \Sigma$ the quadratic terms in $x$ cancel and the decision boundary becomes linear, i.e. a hyperplane.
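For concreteness, here is a small sketch (again my own, with made-up data and equal priors assumed) of the decision rule above, computed directly from per-class Gaussian fits; with separate covariances this is the quadratic form, and forcing $\Sigma_0 = \Sigma_1$ would recover the LDA hyperplane:

```python
# Sketch of the class-conditional-Gaussian decision rule: fit one Gaussian per
# class, then compare the per-class discriminant terms.
import numpy as np

rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], 200)
X1 = rng.multivariate_normal([2, 2], [[1.0, 0.3], [0.3, 1.0]], 200)

# Maximum-likelihood estimates of the class-conditional Gaussians.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0, S1 = np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)

def quad(x, mu, S):
    """(x - mu)^T S^{-1} (x - mu) + ln|S| -- the per-class discriminant term."""
    d = x - mu
    return d @ np.linalg.solve(S, d) + np.log(np.linalg.det(S))

x_new = np.array([1.8, 1.9])

# Decision rule from the answer: predict the second class if the statistic
# exceeds T (T = 0 for equal priors; otherwise T = 2*ln(p(y=0)/p(y=1))).
stat = quad(x_new, mu0, S0) - quad(x_new, mu1, S1)
print("predicted class:", int(stat > 0))   # expected: 1
```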