The way LDA does dimension reduction follows the same methodology as PCA. In LDA you maximize the criterion J(W), the ratio of the between-class scatter to the within-class scatter, and the maximizing directions come from the matrix
W = inv(Sw)*Sb
Once you have W, you get the result with:
V = eigenvectors of W
final_result = V'*your_data
This is where the trick comes into play. Just as in PCA, you compute the eigenvectors and eigenvalues of W and discard the eigenvectors with the smallest corresponding eigenvalues. If you then multiply your data by that reduced eigenvector matrix V, you get a lower-dimensional representation.
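Here is a minimal NumPy sketch of those steps (the function name `lda_project` and the variables `X`, `y`, `n_components` are mine, not from the answer): it builds the within-class and between-class scatter matrices, takes the eigenvectors of inv(Sw)*Sb, keeps those with the largest eigenvalues, and projects the data.

```python
import numpy as np

def lda_project(X, y, n_components):
    """Project X (n_samples x n_features) onto the top LDA discriminants."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]

    Sw = np.zeros((n_features, n_features))  # within-class scatter
    Sb = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)

    # Eigendecomposition of W = inv(Sw) * Sb; keep the eigenvectors with the
    # largest eigenvalues and discard the rest (the "trick" described above).
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    V = eigvecs[:, order[:n_components]].real

    # final_result = V' * your_data for column data; with rows this is X @ V.
    return X @ V
```

Note that Sb has rank at most (number of classes − 1), so only that many eigenvalues are nonzero and `n_components` cannot usefully exceed it.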
I will provide only a short informal answer and refer you to Section 4.3 of The Elements of Statistical Learning for the details.
Update: "The Elements" happens to cover in great detail exactly the questions you are asking here, including what you wrote in your update. The relevant section is 4.3, in particular 4.3.2-4.3.3.
(2) Do the two approaches relate to each other, and if so, how?
They certainly do. What you call the "Bayesian" approach is more general and only assumes Gaussian distributions for each class. Your likelihood function is essentially the Mahalanobis distance from $x$ to the centre of each class.
You are of course right that for each class it is a linear function of $x$. However, note that the ratio of the likelihoods for two different classes (which is what you actually use in order to perform the classification, i.e. to choose between classes) is not going to be linear in $x$ if the classes have different covariance matrices. In fact, if one works out the boundaries between classes, they turn out to be quadratic, which is why the procedure is also called quadratic discriminant analysis, QDA.
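To make this concrete, here is the standard working (a sketch in the notation of Hastie et al., Section 4.3, with class priors $\pi_k$): for Gaussian class densities the log-posterior ratio between classes $k$ and $l$ is
$$
\log\frac{P(k\mid x)}{P(l\mid x)}
= \log\frac{\pi_k}{\pi_l}
- \tfrac{1}{2}\log\frac{|\Sigma_k|}{|\Sigma_l|}
- \tfrac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)
+ \tfrac{1}{2}(x-\mu_l)^\top\Sigma_l^{-1}(x-\mu_l),
$$
and the two quadratic forms cancel only when $\Sigma_k = \Sigma_l$; otherwise the decision boundary $\{x : P(k\mid x) = P(l\mid x)\}$ is a quadratic surface in $x$.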
An important insight is that the equations simplify considerably if one assumes that all classes have identical covariance matrices [Update: if you assumed this all along, that might have been part of the misunderstanding]. In that case the decision boundaries become linear, and that is why the procedure is called linear discriminant analysis, LDA.
It takes some algebraic manipulations to realize that in this case the formulas actually become exactly equivalent to what Fisher worked out using his approach. Think of that as a mathematical theorem. See Hastie's textbook for all the math.
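In the two-class case the equivalence is easy to see: with a shared $\Sigma$ the quadratic terms cancel and the log-posterior ratio becomes linear in $x$,
$$
\log\frac{P(1\mid x)}{P(2\mid x)}
= \log\frac{\pi_1}{\pi_2}
- \tfrac{1}{2}(\mu_1+\mu_2)^\top\Sigma^{-1}(\mu_1-\mu_2)
+ x^\top\Sigma^{-1}(\mu_1-\mu_2),
$$
so classification depends on $x$ only through its projection onto $\Sigma^{-1}(\mu_1-\mu_2)$, which is exactly Fisher's discriminant direction (with the pooled within-class covariance estimating $\Sigma$).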
(1) Can we do dimension reduction using Bayesian approach?
If by "Bayesian approach" you mean dealing with different covariance matrices in each class, then no. At least it will not be a linear dimensionality reduction (unlike LDA), because of what I wrote above.
However, if you are happy to assume a shared covariance matrix, then yes, certainly, because the "Bayesian approach" is then simply equivalent to LDA. But note that if you check Hastie 4.3.3, you will see that the correct projections are not given by $\Sigma^{-1} \mu_k$ as you wrote (I do not even see what that would mean: these vectors depend on $k$, whereas a projection is usually understood as a way to project all points from all classes onto the same lower-dimensional manifold), but by the first [generalized] eigenvectors of $\boldsymbol \Sigma^{-1} \mathbf{M}$, where $\mathbf{M}$ is the covariance matrix of the class centroids $\mu_k$.
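Incidentally, this is essentially the same recipe as in the eigendecomposition answer at the top: if the centroids are weighted by class sizes, the between-class scatter
$$
S_b = \sum_k n_k (\mu_k - \mu)(\mu_k - \mu)^\top
$$
is proportional to $\mathbf{M}$, and the pooled estimate of $\boldsymbol\Sigma$ is proportional to $S_w$, so the [generalized] eigenvectors of $\boldsymbol\Sigma^{-1}\mathbf{M}$ span the same directions as the eigenvectors of $S_w^{-1} S_b$.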
Best Answer
"Fisher's Discriminant Analysis" is simply LDA in a situation of 2 classes. When there is only 2 classes computations by hand are feasible and the analysis is directly related to Multiple Regression. LDA is the direct extension of Fisher's idea on situation of any number of classes and uses matrix algebra devices (such as eigendecomposition) to compute it. So, the term "Fisher's Discriminant Analysis" can be seen as obsolete today. "Linear Discriminant analysis" should be used instead. See also. Discriminant analysis with 2+ classes (multi-class) is canonical by its algorithm (extracts dicriminants as canonical variates); rare term "Canonical Discriminant Analysis" usually stands simply for (multiclass) LDA therefore (or for LDA + QDA, omnibusly).
Fisher used what were then called "Fisher classification functions" to classify objects after the discriminant function has been computed. Nowadays, a more general Bayes approach is used within the LDA procedure to classify objects.
For your request for explanations of LDA I can refer you to these answers of mine: extraction in LDA, classification in LDA, LDA among related procedures. See also these questions and answers: this, this, this.
Just as ANOVA requires an assumption of equal variances, LDA requires an assumption of equal variance-covariance matrices (between the input variables) across the classes. This assumption is important for the classification stage of the analysis. If the matrices differ substantially, observations will tend to be assigned to the class where variability is greater. To overcome this problem, QDA was invented: QDA is a modification of LDA that allows for the above heterogeneity of the classes' covariance matrices.
If you have heterogeneity (as detected, for example, by Box's M test) and you don't have QDA at hand, you may still use LDA in the regime of using individual covariance matrices of the discriminants (rather than the pooled matrix) at classification. This partly solves the problem, though less effectively than QDA, because - as just pointed out - these are covariance matrices between the discriminants and not between the original variables (which are the matrices that differed).
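As a rough illustration of those two regimes (a sketch using scikit-learn, which the answer itself does not mention; the dataset and parameter choices below are mine): compare plain LDA with a pooled covariance matrix, plain QDA on the original variables, and a hybrid that first extracts the LDA discriminants and then uses individual covariance matrices of those discriminants for classification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) Plain LDA: pooled covariance matrix, linear boundaries.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# 2) Plain QDA: individual covariance matrices of the original variables.
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

# 3) The regime described above: extract the LDA discriminants first,
#    then classify with per-class covariance matrices of the discriminants.
hybrid = make_pipeline(LinearDiscriminantAnalysis(n_components=2),
                       QuadraticDiscriminantAnalysis()).fit(X_train, y_train)

for name, model in [("LDA", lda), ("QDA", qda), ("LDA -> per-class cov", hybrid)]:
    print(name, model.score(X_test, y_test))
```

Here the per-class covariance matrices estimated in the last model are those of the discriminant scores, not of the original variables, which is why this regime only partly compensates for the heterogeneity that QDA handles directly.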
Let me leave analyzing your example data for yourself.
Reply to @zyxue's answer and comments
LDA is what you defined FDA to be in your answer. LDA first extracts linear constructs (called discriminants) that maximize the between-to-within separation, and then uses those to perform (Gaussian) classification. If (as you say) LDA were not tied to the task of extracting the discriminants, then LDA would appear to be just a Gaussian classifier, and no name "LDA" would be needed at all.
It is at that classification stage where LDA assumes both normality and variance-covariance homogeneity of the classes. The extraction or "dimensionality reduction" stage of LDA assumes linearity and variance-covariance homogeneity; the two assumptions together make "linear separability" feasible. (We use a single pooled $S_w$ matrix to produce the discriminants, which therefore have an identity pooled within-class covariance matrix, and that gives us the right to apply the same set of discriminants when classifying to all the classes. If all the per-class $S_w$s are the same, those within-class covariances are all the same identity matrix, and the right to use them becomes absolute.)
The Gaussian classifier (the second stage of LDA) uses Bayes' rule to assign observations to classes on the basis of the discriminants. The same result can be accomplished via the so-called Fisher linear classification functions, which utilize the original features directly. However, the Bayes approach based on the discriminants is a little more general in that it also allows using separate class covariance matrices of the discriminants, in addition to the default way of using one pooled matrix. It also allows basing the classification on a subset of the discriminants.
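A worked form of that Bayes rule on the discriminants (a sketch under the pooled-covariance default, in which the extraction stage leaves the discriminants with an identity within-class covariance): writing $d$ for an observation's vector of discriminant scores and $\bar d_k$ for the class centroids in discriminant space,
$$
P(k \mid d) \;\propto\; \pi_k \exp\!\big(-\tfrac{1}{2}\,\lVert d - \bar d_k \rVert^2\big),
$$
and the observation goes to the class with the largest posterior. Using separate class covariance matrices $S_k$ of the discriminants instead replaces $\lVert d - \bar d_k \rVert^2$ with the Mahalanobis distance $(d-\bar d_k)^\top S_k^{-1}(d-\bar d_k)$ and introduces an additional $-\tfrac{1}{2}\log\lvert S_k\rvert$ term in the log-posterior.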
When there are only two classes, both stages of LDA can be described together in a single pass, because "latents extraction" and "observations classification" then reduce to the same task.