I hope I've understood your question. The first thing to note is that the two quantities you suggested might be equal are not, in general, equal (as you yourself deduced). To see this more explicitly (again, as you have pointed out),
\begin{align}
P_{suc,H_1} &= \int_{R_1} p(x|H_1) \text{d}x\\
&= \int_{R_1} 1 \cdot p(x|H_1) \text{d}x + \int_{R_1^c} 0 \cdot p(x|H_1) \text{d}x
\end{align}
and for comparison, the other quantity is
\begin{align}
\mathbb{E}_{x \sim H_1}\left[ p(H_1 |x) \right] &= \int p(H_1|x)p(x|H_1) \text{d}x\\
&= \int_{R_1} p(H_1|x)p(x|H_1) \text{d}x + \int_{R_1^c} p(H_1|x)p(x|H_1) \text{d}x
\end{align}
To answer your first question, see that
\begin{align}
P_{suc,H_1} &= \int_{R_1} p(x|H_1) \text{d}x = \int \mathbb{1}_{\{ x \in R_1 \}} p(x|H_1) \text{d}x = \mathbb{E}_{x \sim H_1} \left[\mathbb{1}_{\{ x \in R_1 \}} \right]
\end{align}
in other words, if you were to vary $x$ according to $H_1$, what would be the expectation of the random variable $\mathbb{1}_{\{ x \in R_1 \}}$? This is different from $\mathbb{E}_{x \sim H_1}\left[ p(H_1 |x) \right]$, which asks a similar question but replaces $\mathbb{1}_{\{ x \in R_1 \}}$ with $p(H_1|x)$. The first random variable is an on-off quantity, while the latter is typically a smooth function of $x$.
To answer your second question: if you are willing to accept an approximate answer for $P_{suc,H_1} = \mathbb{E}_{x \sim H_1} \left[\mathbb{1}_{\{ x \in R_1 \}} \right]$, try randomly sampling $x \sim H_1$, computing $\mathbb{1}_{\{ x \in R_1 \}}$ for each sample, and taking the sample mean.
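In case it helps, here is a minimal sketch of that Monte Carlo estimate on a toy setup of my own choosing (two equally likely scalar Gaussian hypotheses and the MAP decision region $R_1 = \{x : p(H_1|x) > p(H_0|x)\}$); it also averages $p(H_1|x)$ over the same samples, so you can see that the two quantities come out different:

```python
# Minimal sketch: toy binary hypothesis test with equal priors (assumed setup).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed likelihoods: H0 -> N(0, 1), H1 -> N(1, 1); priors p(H0) = p(H1) = 0.5.
mu0, mu1, sigma = 0.0, 1.0, 1.0

def posterior_H1(x):
    """p(H1 | x) via Bayes' rule (the equal priors cancel)."""
    l0 = norm.pdf(x, mu0, sigma)
    l1 = norm.pdf(x, mu1, sigma)
    return l1 / (l0 + l1)

# Sample x ~ p(x | H1) and take sample means of the two random variables.
x = rng.normal(mu1, sigma, size=200_000)
in_R1 = posterior_H1(x) > 0.5                  # the indicator 1{x in R1}
print("P_suc,H1   ~", in_R1.mean())            # estimate of E[1{x in R1}]
print("E[p(H1|x)] ~", posterior_H1(x).mean())  # a different number in general
```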
I will provide only a short informal answer and refer you to Section 4.3 of The Elements of Statistical Learning for the details.
Update: "The Elements" happen to cover in great detail exactly the questions you are asking here, including what you wrote in your update. The relevant section is 4.3, and in particular 4.3.2-4.3.3.
(2) Do the two approaches relate to each other, and if so, how?
They certainly do. What you call the "Bayesian" approach is more general and only assumes a Gaussian distribution for each class. Your likelihood function is essentially the Mahalanobis distance from $x$ to the centre of each class.
You are of course right that for each class it is a linear function of $x$. However, note that the ratio of the likelihoods for two different classes (which is what you actually use to perform a classification, i.e. to choose between classes) is not going to be linear in $x$ if the classes have different covariance matrices. In fact, if one works out the boundaries between classes, they turn out to be quadratic, which is why this procedure is also called quadratic discriminant analysis, QDA.
An important insight is that the equations simplify considerably if one assumes that all classes share an identical covariance matrix [Update: if you assumed this all along, that might have been part of the misunderstanding]. In that case the decision boundaries become linear, and that is why this procedure is called linear discriminant analysis, LDA.
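To make the linear-vs-quadratic point concrete, here is a small sketch on made-up means and covariances (nothing from your question): the second difference of the log likelihood ratio along a segment is zero exactly when that ratio is affine in $x$.

```python
# Sketch: check the shape of log p(x|class 1) - log p(x|class 2) for two Gaussians.
# It is affine in x when the covariances are shared, quadratic otherwise.
import numpy as np
from scipy.stats import multivariate_normal as mvn

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S_shared = np.array([[1.0, 0.3], [0.3, 1.0]])
S_other  = np.array([[2.0, -0.4], [-0.4, 0.5]])   # a different covariance matrix

def log_ratio(x, S1, S2):
    return mvn.logpdf(x, mu1, S1) - mvn.logpdf(x, mu2, S2)

a, b = np.array([1.0, -1.0]), np.array([-2.0, 3.0])

def second_difference(S1, S2):
    # zero (up to rounding) iff the log-ratio is affine along the segment a--b
    return log_ratio(a, S1, S2) + log_ratio(b, S1, S2) - 2 * log_ratio((a + b) / 2, S1, S2)

print("shared covariance   :", second_difference(S_shared, S_shared))  # ~0 -> linear boundary (LDA)
print("different covariance:", second_difference(S_shared, S_other))   # nonzero -> quadratic (QDA)
```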
It takes some algebraic manipulations to realize that in this case the formulas actually become exactly equivalent to what Fisher worked out using his approach. Think of that as a mathematical theorem. See Hastie's textbook for all the math.
(1) Can we do dimension reduction using the Bayesian approach?
If by "Bayesian approach" you mean dealing with different covariance matrices in each class, then no. At least it will not be a linear dimensionality reduction (unlike LDA), because of what I wrote above.
However, if you are happy to assume the shared covariance matrix, then yes, certainly, because the "Bayesian approach" is simply equivalent to LDA. But if you check Hastie 4.3.3, you will see that the correct projections are not given by $\Sigma^{-1} \mu_k$ as you wrote (I don't even understand what that would mean: these projections depend on $k$, whereas what is usually meant by a projection is a way to project all points from all classes onto the same lower-dimensional manifold), but by the first [generalized] eigenvectors of $\boldsymbol \Sigma^{-1} \mathbf{M}$, where $\mathbf{M}$ is the covariance matrix of the class centroids $\mu_k$.
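If it helps to see this in code, here is a rough sketch on made-up data (pooled within-class covariance as $\Sigma$, covariance of the class centroids as $\mathbf{M}$; exact scaling conventions differ between textbooks):

```python
# Sketch: discriminant directions as the leading eigenvectors of Sigma^{-1} M
# (toy data generated here, not taken from the question).
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0, 0.0], [3.0, 1.0, 0.0], [1.0, 4.0, 0.0]])  # 3 classes in R^3
X = np.vstack([rng.multivariate_normal(m, np.eye(3), size=100) for m in means])
y = np.repeat(np.arange(3), 100)

# Pooled within-class covariance Sigma and covariance M of the class centroids
centroids = np.array([X[y == k].mean(axis=0) for k in range(3)])
Sigma = sum(np.cov(X[y == k].T) for k in range(3)) / 3
M = np.cov(centroids.T, bias=True)

# Discriminant directions: eigenvectors of Sigma^{-1} M with the largest eigenvalues
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sigma, M))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real   # at most (number of classes - 1) = 2 useful directions
Z = X @ W                        # the reduced-rank representation of the data
print(W)
```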
Best Answer
If you have two classes, each with data distributed according to densities $f_1$ and $f_2$, then the Bayes rule that minimizes the expected classification error (with equal loss for each type of error and equal prior probabilities for the two classes) selects class 1 for an observation vector $x$ if $f_1(x)/f_2(x)>1$ and selects class 2 otherwise. LDA becomes this Bayes rule under special conditions on the multivariate distributions $f_1$ and $f_2$: both have to be multivariate normal with the same covariance matrix and (presumably) different mean vectors.
The details of this can be found in any of the following three sources.
Duda, Hart and Stork: Pattern Classification
My book, Bootstrap Methods Chapter 2
McLachlan's Discriminant Analysis book
I have previously given an explanation like this on the post you referenced. Is this clear to you now? Why was it not clear the first time?
I am adding to the answer because it is now clear that the OP is asking how to compute the LDA and not how it works in theory.
LDA creates a separating hyperplane. The hyperplane is defined by setting a linear combination of the variables equal to a constant; $f_1(x)/f_2(x)=1$ defines this hyperplane. So you multiply the variables by their coefficients and sum them. Then this sum is compared to the constant to determine on which side of the hyperplane the vector $x$ lies (i.e. whether or not $f_1(x)>f_2(x)$).
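For concreteness, here is a minimal sketch of that computation under the equal-covariance, equal-prior Gaussian assumptions above (the means, covariance matrix, and test point are made up):

```python
# Sketch: classify a point with the LDA hyperplane (two Gaussian classes,
# shared covariance, equal priors -- all values here are illustrative).
import numpy as np

mu1 = np.array([2.0, 1.0])
mu2 = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])   # shared covariance matrix

# Coefficients of the linear combination and the constant defining the hyperplane
w = np.linalg.solve(Sigma, mu1 - mu2)        # Sigma^{-1} (mu1 - mu2)
c = 0.5 * (mu1 + mu2) @ w                    # threshold for equal priors

x = np.array([1.5, 0.2])                     # observation vector to classify
score = x @ w                                # multiply variables by coefficients and sum
print("class 1" if score > c else "class 2") # side of the hyperplane <=> f1(x) > f2(x)
```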