This paper seems to prove (I can't follow the math) that naive Bayes is good not only when the features are independent, but also when the dependencies among features are distributed similarly across the classes (or cancel each other out):
In this paper, we propose a novel explanation on the superb classification performance of naive Bayes. We show that, essentially, the dependence distribution; i.e., how the local dependence of a node distributes in each class, evenly or unevenly, and how the local dependencies of all nodes work together, consistently (supporting a certain classification) or inconsistently (canceling each other out), plays a crucial role. Therefore, no matter how strong the dependences among attributes are, naive Bayes can still be optimal if the dependences distribute evenly in classes, or if the dependences cancel each other out.
The general term Naive Bayes refers to the strong independence assumptions in the model, rather than to the particular distribution of each feature. A Naive Bayes model assumes that each of the features it uses is conditionally independent of the others given some class. More formally, if I want to calculate the probability of observing features $f_1$ through $f_n$ given some class $c$, then under the Naive Bayes assumption the following holds:
$$ p(f_1,..., f_n|c) = \prod_{i=1}^n p(f_i|c)$$
This means that when I want to use a Naive Bayes model to classify a new example, the posterior probability is much simpler to work with:
$$ p(c|f_1,...,f_n) \propto p(c)p(f_1|c)...p(f_n|c) $$
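To make that arithmetic concrete, here is a minimal sketch in Python. All of the numbers, the class names ("spam"/"ham"), and the three binary features are made up for illustration; the point is just that the posterior is the prior times a product of per-feature likelihoods, then normalized:

```python
import numpy as np

# Hypothetical priors p(c) and per-feature likelihoods p(f_i = 1 | c)
# for a toy two-class problem with three binary features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": np.array([0.8, 0.1, 0.7]),
    "ham":  np.array([0.2, 0.4, 0.3]),
}

observed = np.array([1, 0, 1])  # observed values of f_1, f_2, f_3

unnormalized = {}
for c in priors:
    # p(f_i | c) is the tabulated probability if f_i = 1, its complement if f_i = 0
    p_f = np.where(observed == 1, likelihoods[c], 1 - likelihoods[c])
    unnormalized[c] = priors[c] * p_f.prod()  # p(c) * prod_i p(f_i | c)

total = sum(unnormalized.values())
posterior = {c: v / total for c, v in unnormalized.items()}
print(posterior)  # posterior p(c | f_1, f_2, f_3) for each class
```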
Of course, these assumptions of independence are rarely true, which may explain why some have referred to the model as the "Idiot Bayes" model. In practice, however, Naive Bayes models have performed surprisingly well, even on complex tasks where it is clear that the strong independence assumptions are false.
Up to this point we have said nothing about the distribution of each feature. In other words, we have left $p(f_i|c)$ undefined. The term Multinomial Naive Bayes simply lets us know that each $p(f_i|c)$ is a multinomial distribution, rather than some other distribution. This works well for data which can easily be turned into counts, such as word counts in text.
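As a rough illustration (the tiny corpus and labels below are made up, and scikit-learn's `CountVectorizer` and `MultinomialNB` are just one convenient way to set this up), a multinomial Naive Bayes text classifier over word counts might look like:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: each document becomes a vector of word counts,
# and MultinomialNB models those counts with a per-class multinomial.
docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap pills now", "agenda for the next meeting"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse matrix of word counts
model = MultinomialNB().fit(counts, labels)

new_doc = vectorizer.transform(["buy cheap pills"])
print(model.predict(new_doc))                # -> ['spam'] on this toy data
```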
The distribution you had been using with your Naive Bayes classifier is a Gaussian p.d.f., so I guess you could call it a Gaussian Naive Bayes classifier.
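For comparison, here is a similar sketch using scikit-learn's `GaussianNB` on made-up continuous features (again, just an illustration of the idea, not your actual setup): the model fits a per-class mean and variance for each feature and plugs them into a Gaussian p.d.f. for $p(f_i|c)$.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy continuous data: two classes with different means, unit variance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),   # class 0
               rng.normal(2.0, 1.0, size=(50, 2))])  # class 1
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
print(clf.predict([[0.1, -0.2], [2.3, 1.9]]))        # expected: [0 1] on this toy data
```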
In summary, "Naive Bayes classifier" is a general term that refers to the conditional independence of each of the features in the model, while "Multinomial Naive Bayes classifier" is a specific instance of a Naive Bayes classifier that uses a multinomial distribution for each of the features.
References:
Stuart J. Russell and Peter Norvig. 2003. Artificial Intelligence: A Modern Approach (2nd ed.). Pearson Education. See p. 499 for the reference to "idiot Bayes" as well as the general definition of the Naive Bayes model and its independence assumptions.
Best Answer
I would try not to conflate naive Bayes and the concept of a Bayes classifier. The former is a specific kind of classification model, whereas the latter should really just be viewed as an "optimal" classifier in a given setting (it could be any type of model, so long as it's the "true" model).
The reason we call the optimal classifier a Bayes classifier is that the best classifier needs to use Bayesian updating when making predictions, by which we mean that we follow Bayes' theorem (it is a theorem, after all) when updating our expectations based on evidence.
To say that a naive Bayes classifier is the Bayes classifier would just mean that no classifier can perform better in terms of misclassification rate (that is, it "knows" all the marginal distributions of the predictors for each class and correctly assumes they're all independent).
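A quick way to see this is a toy simulation (assumed Gaussian class-conditionals and equal priors, nothing from the papers above): generate data where the features really are conditionally independent given the class, and compare naive Bayes against the exact Bayes rule computed from the true densities. With enough training data, their error rates should be essentially the same.

```python
import numpy as np
from scipy.stats import norm
from sklearn.naive_bayes import GaussianNB

# Simulate two classes with truly independent, unit-variance Gaussian features.
rng = np.random.default_rng(1)
n = 5000
y = rng.integers(0, 2, size=n)
means = np.array([[0.0, 0.0], [1.5, -1.0]])   # assumed true per-class means
X = rng.normal(means[y], 1.0)

# True Bayes rule: pick the class with the larger true log-posterior (equal priors).
log_post = np.stack([norm.logpdf(X, means[c], 1.0).sum(axis=1) for c in (0, 1)], axis=1)
bayes_pred = log_post.argmax(axis=1)

# Naive Bayes trained on the first 4000 points, evaluated on the rest.
nb_pred = GaussianNB().fit(X[:4000], y[:4000]).predict(X[4000:])
print("Bayes rule error:  ", np.mean(bayes_pred[4000:] != y[4000:]))
print("naive Bayes error: ", np.mean(nb_pred != y[4000:]))
```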
To your second question, I believe naive Bayes gained popularity because it's easy to implement and historically (despite its generally false assumptions) it often performed well at certain tasks, particularly text classification.