Why is it that with the Bayes classifier we achieve the best performance that can be achieved? What is the formal proof/explanation for this?
Usually, a dataset $D$ is considered to consist of $n$ i.i.d. samples $x_i$ of a distribution that generates your data. Then, you build a predictive model from the given data: given a sample $x_i$, you predict the class $\hat{f}(x_i)$, whereas the real class of the sample is $f(x_i)$.
However, in theory, you could decide not to choose one particular model $\hat{f}_\text{chosen}$, but rather consider all possible models $\hat{f}$ at once and combine them somehow into one big model $\hat F$.
Of course, given the data, many of the smaller models could be quite improbable or inappropriate (for example, models that predict only one value of the target, even though there are multiple values of the target in your dataset $D$).
In any case, you want to predict the target value of new samples, which are drawn from the same distribution as the $x_i$. A good measure $e$ of the performance of your model would be
$$e(\text{model}) = P[f(X) = \text{model}(X)]\text{,}$$ i.e., the probability that you predict the true target value for a randomly sampled $X$.
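(Since the probability of an event is the expectation of its indicator, this can equivalently be written as $e(\text{model}) = \mathrm{E}\left[I_{\{f(X) = \text{model}(X)\}}\right]$; the last answer below uses exactly this identity, only for the misclassification rate instead of the accuracy.)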
Using Bayes' formula, you can compute the probability that a new sample $x$ has target value $v$, given the data $D$:
$$P(v\mid D) = \sum_{\hat{f}} P(v\mid \hat{f}) P(\hat{f}\mid D)\text{.}$$
One should stress that
- usually, $P(v\mid \hat{f})$ is either $0$ or $1$, since $\hat{f}$ is a deterministic function of $x$: it equals $1$ exactly when $\hat{f}$ predicts $v$ for the given $x$,
- almost all the time (not just usually), it is impossible to estimate $P(\hat{f}\mid D)$ (except for trivial cases),
- almost all the time (not just usually), the number of possible models $\hat{f}$ is too big for the above sum to be evaluated.
Hence, it is very hard to obtain/estimate $P(v\mid D)$ in most cases.
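As a toy illustration (the numbers here are made up), suppose there are only three candidate models $\hat{f}_1, \hat{f}_2, \hat{f}_3$ with posteriors $P(\hat{f}_1\mid D) = 0.4$ and $P(\hat{f}_2\mid D) = P(\hat{f}_3\mid D) = 0.3$, and that for a new sample $x$ the model $\hat{f}_1$ predicts $v = {+}$ while $\hat{f}_2$ and $\hat{f}_3$ predict $v = {-}$. Then
$$P(+\mid D) = 1\cdot 0.4 + 0\cdot 0.3 + 0\cdot 0.3 = 0.4\text{,}\qquad P(-\mid D) = 0\cdot 0.4 + 1\cdot 0.3 + 1\cdot 0.3 = 0.6\text{,}$$
so the combined model predicts $-$, even though the single most probable model $\hat{f}_1$ predicts $+$.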
Now, we proceed to the Optimal Bayes classifier. For a given $x$, it predicts the value
$$\hat{v} = \text{argmax}_v \sum_{\hat{f}} P(v\mid \hat{f}) P(\hat{f}\mid D)\text{.}$$
Since this is the most probable value among all possible target values $v$, the Optimal Bayes classifier maximizes the performance measure $e$ defined above.
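To make the rule concrete, here is a minimal sketch, assuming a small, explicitly enumerable set of hypotheses with known posteriors; the function name and toy numbers are illustrative, not taken from the answer above.

```python
from collections import defaultdict

def optimal_bayes_predict(x, hypotheses, posteriors, values):
    """Return argmax_v  sum_f P(v | f, x) * P(f | D).

    hypotheses: deterministic classifiers f(x) -> predicted value
    posteriors: P(f | D) for each hypothesis, in the same order
    values:     the possible target values v
    """
    score = defaultdict(float)
    for f, p in zip(hypotheses, posteriors):
        # P(v | f, x) is 1 for the value f predicts at x and 0 otherwise,
        # so each hypothesis adds its posterior mass to its own prediction.
        score[f(x)] += p
    return max(values, key=lambda v: score[v])

# Toy hypothesis space matching the worked example above.
hypotheses = [lambda x: "+", lambda x: "-", lambda x: "-"]
posteriors = [0.4, 0.3, 0.3]
print(optimal_bayes_predict(None, hypotheses, posteriors, ["+", "-"]))  # prints "-"
```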
This is why the Bayes classifier is used as a benchmark against which the performance of all other classifiers is compared.
Probably, you use the naive version of the Bayes classifier. It is easy to implement and works reasonably well most of the time, but it computes only a naive estimate of $P(v\mid D)$.
I would try not to conflate naive Bayes and the concept of a Bayes classifier. The former is a specific kind of classification model, whereas the latter should really just be viewed as an "optimal" classifier in a given setting (it could be any type of model, so long as it's the "true" model).
The reason we call the optimal classifier a Bayes classifier is that the best classifier needs to use Bayesian updating when making predictions, by which we mean that we follow Bayes' theorem (it is a theorem, after all) when updating our expectations based on evidence.
To say that a naive Bayes classifier is the Bayes classifier would just mean that no classifier can perform better in terms of misclassification rate (that is, it "knows" all the marginal distributions of the predictors for each class and correctly assumes they're all independent).
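For contrast, here is a minimal sketch of what naive Bayes actually computes, assuming categorical features and already-estimated class priors and per-feature conditional probabilities; all names and numbers below are made up for illustration.

```python
from math import prod

def naive_bayes_posterior(x, priors, cond_probs):
    """Naive Bayes estimate of P(v | x), assuming the features are
    conditionally independent given the class:
        score(v) = P(v) * prod_j P(x_j | v), then normalize.

    priors:     dict  class -> P(v)
    cond_probs: dict  class -> list (one entry per feature) of dicts
                mapping feature value -> P(x_j | v)
    """
    scores = {
        v: priors[v] * prod(cond_probs[v][j][xj] for j, xj in enumerate(x))
        for v in priors
    }
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}

# Made-up two-feature text-classification toy example.
priors = {"spam": 0.4, "ham": 0.6}
cond_probs = {
    "spam": [{"free": 0.8, "hello": 0.2}, {"!": 0.7, ".": 0.3}],
    "ham":  [{"free": 0.1, "hello": 0.9}, {"!": 0.2, ".": 0.8}],
}
print(naive_bayes_posterior(("free", "!"), priors, cond_probs))
```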
To your second question, I believe naive Bayes gained popularity because it's easy to implement and historically (despite its generally false assumptions) it often performed well at certain tasks, particularly text classification.
Best Answer
There's no need to treat separately the cases where $X$ is discrete, continuous or neither. We are just taking an iterated expectation here, which is always permissible. The idea is that since probabilities are simply expectations of indicator random variables we can always write something like
\begin{align} P(g(X) \neq Y) &= \text{E} \left (I_{ \{ g(X) \neq Y \}} \right ) \\ &= \text{E} \left [ \text{E} \left (I_{ \{ g(X) \neq Y \}} \mid X \right ) \right ] \\ &= \text{E}[ P(g(X) \neq Y \mid X)] \end{align}
and the integral above is just another way of writing this down. Again, this holds true no matter what "type" of random variable $X$ happens to be.
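As a quick numerical sanity check of that identity (the joint distribution and the classifier $g$ below are made up purely for illustration), one can estimate both sides by simulation and see that they agree up to Monte Carlo error:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy joint distribution: X uniform over {0, 1, 2}, and Y | X Bernoulli with
# P(Y = 1 | X = x) = p[x].
p = np.array([0.2, 0.5, 0.9])
X = rng.integers(0, 3, size=n)
Y = (rng.random(n) < p[X]).astype(int)

g = lambda x: (p[x] > 0.5).astype(int)  # some fixed classifier g(X)

# Left-hand side: E[ I{g(X) != Y} ], estimated directly.
lhs = np.mean(g(X) != Y)

# Right-hand side: E[ P(g(X) != Y | X) ], computed exactly per value of X
# and then averaged over the samples of X.
err_given_x = np.where(g(np.arange(3)) == 1, 1 - p, p)  # P(g(x) != Y | X = x)
rhs = np.mean(err_given_x[X])

print(lhs, rhs)  # the two estimates agree up to Monte Carlo error
```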