I will provide only a short informal answer and refer you to Section 4.3 of The Elements of Statistical Learning for the details.
Update: "The Elements" happens to cover in great detail exactly the questions you are asking here, including what you wrote in your update. The relevant section is 4.3, in particular 4.3.2-4.3.3.
(2) Do the two approaches relate to each other, and if so, how?
They certainly do. What you call the "Bayesian" approach is more general and only assumes a Gaussian distribution for each class. Your likelihood function is essentially the Mahalanobis distance from $x$ to the centre of each class.
You are of course right that for each class it is a linear function of $x$. However, note that the ratio of the likelihoods for two different classes (the ratio you would use to perform an actual classification, i.e. to choose between classes) is not going to be linear in $x$ if the classes have different covariance matrices. In fact, if one works out the boundaries between classes, they turn out to be quadratic; this is why the procedure is also called quadratic discriminant analysis, QDA.
An important insight is that equations simplify considerably if one assumes that all classes have identical covariance [Update: if you assumed it all along, this might have been part of the misunderstanding]. In that case decision boundaries become linear, and that is why this procedure is called linear discriminant analysis, LDA.
It takes some algebraic manipulations to realize that in this case the formulas actually become exactly equivalent to what Fisher worked out using his approach. Think of that as a mathematical theorem. See Hastie's textbook for all the math.
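A quick numerical sanity check of the shared-covariance claim (a minimal sketch with made-up means `mu0`, `mu1` and a shared covariance `Sigma`, none of which come from the thread above): with identical covariance matrices, the quadratic terms in the log-likelihood ratio cancel and what remains is affine in $x$.

```python
import numpy as np

# Made-up two-class setup with a shared covariance matrix
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Si = np.linalg.inv(Sigma)

def neg2_log_ratio(x):
    """-2 * log[p(x|class 0) / p(x|class 1)] up to cancelling constants:
    the difference of the two squared Mahalanobis distances."""
    q0 = (x - mu0) @ Si @ (x - mu0)
    q1 = (x - mu1) @ Si @ (x - mu1)
    return q0 - q1  # the x^T Si x terms cancel, so this is affine in x
```

An affine function $f$ satisfies $f(x) + f(y) - f(0) = f(x+y)$, which you can verify numerically for `neg2_log_ratio`; with two different covariance matrices the quadratic terms no longer cancel and the identity fails, which is exactly the QDA case.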
(1) Can we do dimension reduction using Bayesian approach?
If by "Bayesian approach" you mean dealing with different covariance matrices in each class, then no. At least it will not be a linear dimensionality reduction (unlike LDA), because of what I wrote above.
However, if you are happy to assume a shared covariance matrix, then yes, certainly, because the "Bayesian approach" is then simply equivalent to LDA. However, if you check Hastie 4.3.3, you will see that the correct projections are not given by $\Sigma^{-1} \mu_k$ as you wrote (that expression depends on $k$, whereas a projection usually means a single map taking all points from all classes onto the same lower-dimensional manifold), but by the first [generalized] eigenvectors of $\boldsymbol \Sigma^{-1} \mathbf{M}$, where $\mathbf{M}$ is the covariance matrix of the class centroids $\mu_k$.
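To make the eigenvector recipe concrete, here is a minimal numpy sketch with invented data (3 class centroids in 4 dimensions and an assumed shared covariance `Sigma`, both made up for illustration): build $\mathbf{M}$ from the centroids and take the leading eigenvectors of $\boldsymbol\Sigma^{-1}\mathbf{M}$ as the discriminant directions.

```python
import numpy as np

# Hypothetical setup: K = 3 class centroids in 4 dimensions,
# with an assumed shared within-class covariance Sigma.
rng = np.random.default_rng(1)
Sigma = np.eye(4) + 0.2 * np.ones((4, 4))
centroids = rng.normal(size=(3, 4))   # rows are the mu_k

# M: covariance matrix of the class centroids
M = np.cov(centroids.T, bias=True)

# Discriminant directions = leading eigenvectors of Sigma^{-1} M;
# with K classes at most K - 1 eigenvalues are nonzero.
evals, evecs = np.linalg.eig(np.linalg.inv(Sigma) @ M)
order = np.argsort(evals.real)[::-1]
W = evecs.real[:, order[:2]]          # the K - 1 = 2 useful directions

def project(X):
    """Project points onto the shared lower-dimensional discriminant subspace."""
    return X @ W
```

Note that, unlike $\Sigma^{-1}\mu_k$, the matrix `W` does not depend on $k$: every point from every class is mapped through the same `project`.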
In the reference at the bottom $^*$, I see the training involves the following:
Initialize the HMM & GMM parameters (randomly or using prior assumptions).
Then repeat the following until convergence criteria are satisfied:
1. Do a forward pass and a backward pass to find the probabilities associated with the training sequences and the parameters of the GMM-HMM.
2. Recalculate the HMM & GMM parameters (the means, covariances, and mixture coefficients of each mixture component at each state, and the transition probabilities between states), all calculated using the probabilities found in step 1.
$*$ University of Edinburgh GMM-HMM slides (Google: Hidden Markov Models and Gaussian Mixture Models). This reference gives a lot of detail and suggests doing these calculations in the log domain.
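The loop above can be sketched in numpy. This is a deliberately simplified illustration, not the slides' algorithm: each state emits from a single 1-D Gaussian (a one-component GMM), and it uses per-step scaling rather than the log domain the slides recommend; all parameter values in the test are invented.

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def baum_welch_step(x, pi, A, mu, var):
    """One EM iteration for an HMM whose states emit from a single 1-D
    Gaussian (a GMM with one component per state; more components would
    add a per-component responsibility inside each state)."""
    T, S = len(x), len(pi)
    B = np.array([[gauss(x[t], mu[s], var[s]) for s in range(S)]
                  for t in range(T)])

    # Forward pass, scaled at every step to avoid numerical underflow
    alpha = np.zeros((T, S)); c = np.zeros(T)
    alpha[0] = pi * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]

    # Backward pass, reusing the same scaling constants
    beta = np.zeros((T, S)); beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]

    # State and transition posteriors: the "probabilities found in step 1"
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)

    # Step 2: re-estimate initial, transition, and emission parameters
    pi_new = gamma[0]
    A_new = xi.sum(0) / gamma[:-1].sum(0)[:, None]
    mu_new = gamma.T @ x / gamma.sum(0)
    var_new = (gamma * (x[:, None] - mu_new) ** 2).sum(0) / gamma.sum(0)
    return pi_new, A_new, mu_new, var_new, np.log(c).sum()
```

Repeating `baum_welch_step` until the returned log-likelihood stops improving implements the "repeat until convergence" loop; EM guarantees the log-likelihood never decreases between iterations.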
Best Answer
The building blocks of LDA and GMM are similar, i.e. both are Gaussian, but there are many differences. In a GMM we are trying to estimate a distribution of the following form:
$ p(\boldsymbol{ x}|\theta) = \sum_{z=1}^K\pi_z \mathcal{N}(\boldsymbol{x|\tilde{\mu_z},\tilde{\Sigma_z}}) $
This is a density estimation problem: we are trying to estimate the density of an arbitrary distribution. The variable $z$ is a hidden variable, and the parameters $(\pi_z,\tilde{\mu_z},\tilde{\Sigma_z})$ are obtained via the EM algorithm. If you would like to do supervised classification for two classes, you would train one model per class, $p(\boldsymbol{ x}|\theta_1)$ and $p(\boldsymbol{ x}|\theta_2)$, and select the model with the larger likelihood.
$ \hat{y}=\underset{y}{\operatorname{arg\,max}}\, p(\boldsymbol{ x}|\theta_y) $
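A minimal sketch of this decision rule, with made-up GMM parameters standing in for models already fitted by EM (the mixture weights, means, and covariances below are invented for illustration):

```python
import numpy as np

def gmm_density(x, weights, means, covs):
    """p(x | theta) = sum_z pi_z N(x | mu_z, Sigma_z)."""
    d = len(x)
    p = 0.0
    for w, mu, S in zip(weights, means, covs):
        diff = x - mu
        quad = diff @ np.linalg.solve(S, diff)
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
        p += w * np.exp(-0.5 * quad) / norm
    return p

# Hypothetical parameters, as if already fitted by EM: one GMM per class
theta = {
    0: (np.array([0.6, 0.4]),
        [np.zeros(2), np.array([1.0, 1.0])],
        [np.eye(2), np.eye(2)]),
    1: (np.array([0.5, 0.5]),
        [np.array([4.0, 4.0]), np.array([5.0, 3.0])],
        [np.eye(2), np.eye(2)]),
}

def classify(x):
    """argmax_y p(x | theta_y): pick the class whose GMM assigns x
    the largest likelihood."""
    return max(theta, key=lambda y: gmm_density(x, *theta[y]))
```

For example, `classify(np.array([0.2, 0.1]))` picks class 0, whose mixture components sit near the origin, while a point near $(4.5, 3.8)$ is assigned to class 1.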
LDA approaches the problem by assuming that the class-conditional probability density functions $p(x|y=0)$ and $p(x|y=1)$ are multivariate normal distributions with mean and covariance parameters $( \mu_0, \Sigma_0)$ and $( \mu_1, \Sigma_1)$. You would select a class as follows:
$ \hat{y}=\underset{y}{\operatorname{arg\,max}}\, p(y|x)=\underset{y}{\operatorname{arg\,max}}\, p(x|y)p(y) $
Where $p(y)$ is the prior. With some math one can show this is the same as:
$ (x- \mu_0)^T \Sigma_0^{-1} ( x- \mu_0) + \ln|\Sigma_0| - ( x- \mu_1)^T \Sigma_1^{-1} ( x- \mu_1) - \ln|\Sigma_1| \ > \ T $
Where we predict a point as being from the second class when the quantity on the left exceeds the threshold $T$, i.e. when the log of the likelihood ratio $p(x|y=0)/p(x|y=1)$ falls below a corresponding threshold.
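The inequality above translates almost line for line into code. A minimal sketch with invented means and covariances, assuming equal priors so that the threshold is $T = 0$:

```python
import numpy as np

def qda_statistic(x, mu0, S0, mu1, S1):
    """The left-hand side of the decision inequality: the difference of
    the two Mahalanobis terms plus the log-determinant terms."""
    d0, d1 = x - mu0, x - mu1
    return (d0 @ np.linalg.solve(S0, d0) + np.log(np.linalg.det(S0))
            - d1 @ np.linalg.solve(S1, d1) - np.log(np.linalg.det(S1)))

# Hypothetical two-class parameters; with equal priors, T = 0
mu0, S0 = np.array([0.0, 0.0]), np.eye(2)
mu1, S1 = np.array([3.0, 0.0]), np.array([[2.0, 0.0], [0.0, 0.5]])

def predict(x, T=0.0):
    """Predict the second class when the statistic exceeds T."""
    return 1 if qda_statistic(x, mu0, S0, mu1, S1) > T else 0
```

With unequal priors $p(y)$, the same rule applies with $T = 2\ln\bigl(p(y=0)/p(y=1)\bigr)$, since the statistic is $-2$ times the log-likelihood ratio up to cancelled constants.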