The general term Naive Bayes refers to the strong independence assumptions in the model, rather than to the particular distribution of each feature. A Naive Bayes model assumes that each of the features it uses is conditionally independent of every other feature given some class. More formally, if I want to calculate the probability of observing features $f_1$ through $f_n$, given some class $c$, then under the Naive Bayes assumption the following holds:
$$ p(f_1,..., f_n|c) = \prod_{i=1}^n p(f_i|c)$$
This means that when I want to use a Naive Bayes model to classify a new example, the posterior probability is much simpler to work with:
$$ p(c|f_1,...,f_n) \propto p(c)p(f_1|c)...p(f_n|c) $$
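Classification then just amounts to picking the class that maximises this product (the usual MAP decision rule):
$$ \hat{c} = \arg\max_{c} \; p(c)\prod_{i=1}^{n} p(f_i|c) $$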
Of course these assumptions of independence are rarely true, which may explain why some have referred to the model as the "Idiot Bayes" model. In practice, however, Naive Bayes models have performed surprisingly well, even on complex tasks where the strong independence assumptions are clearly false.
Up to this point we have said nothing about the distribution of each feature. In other words, we have left $p(f_i|c)$ undefined. The term Multinomial Naive Bayes simply lets us know that each $p(f_i|c)$ is a multinomial distribution, rather than some other distribution. This works well for data which can easily be turned into counts, such as word counts in text.
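As a rough illustration (not tied to any particular library, and with made-up toy counts), here is how the multinomial version might estimate each $p(f_i|c)$ from word counts with Laplace smoothing and then score a new document in log space:

```python
import math
from collections import Counter

# Toy word counts per class (hypothetical data, purely for illustration)
class_word_counts = {
    "sports":   Counter({"ball": 10, "goal": 7, "vote": 1}),
    "politics": Counter({"vote": 12, "ball": 1, "goal": 0}),
}
class_priors = {"sports": 0.5, "politics": 0.5}
vocab = {"ball", "goal", "vote"}

def log_posterior(doc_counts, c, alpha=1.0):
    """log p(c) + sum over words of count(w) * log p(w|c), with Laplace smoothing."""
    total = sum(class_word_counts[c].values())
    score = math.log(class_priors[c])
    for word, n in doc_counts.items():
        if word not in vocab:
            continue  # skip words never seen in training
        p_w_given_c = (class_word_counts[c][word] + alpha) / (total + alpha * len(vocab))
        score += n * math.log(p_w_given_c)
    return score

new_doc = Counter({"goal": 2, "vote": 1})
prediction = max(class_priors, key=lambda c: log_posterior(new_doc, c))
print(prediction)  # "sports" for these toy numbers
```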
The distribution you have been using with your Naive Bayes classifier is a Gaussian p.d.f., so I suppose you could call it a Gaussian Naive Bayes classifier.
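In that case each per-feature conditional density takes the familiar Gaussian form, with a mean and variance estimated separately for each feature and class:
$$ p(f_i|c) = \frac{1}{\sqrt{2\pi\sigma_{i,c}^2}} \exp\!\left(-\frac{(f_i - \mu_{i,c})^2}{2\sigma_{i,c}^2}\right) $$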
In summary, Naive Bayes classifier is a general term which refers to the conditional independence of each of the features in the model, while a Multinomial Naive Bayes classifier is a specific instance of a Naive Bayes classifier which uses a multinomial distribution for each of the features.
References:
Stuart J. Russell and Peter Norvig. 2003. Artificial Intelligence: A Modern Approach (2nd ed.). Pearson Education. See p. 499 for the reference to "idiot Bayes" as well as the general definition of the Naive Bayes model and its independence assumptions.
If you're doing this yourself, as opposed to using a package, it's fairly straightforward to do all three of these things. If you're using an off-the-shelf implementation, whether this is possible depends on what you're using. In these explanations I'm going to assume the attributes take categorical values (as most simple versions of NB do). I'll describe the approach for a single continuous-valued feature (say $f_w$, the frequency of some word in your text, normalised by the document length), with three bins in the histogram:
very rare: $0 \le f_w < 0.001$
rare: $0.001 \le f_w < 0.01$
frequent: $0.01 \le f_w \le 1$
Thus for our word $w$, its feature value must always fall in exactly one of these three intervals. Now to answer your questions:
1) The parameters can be updated as you see new examples by maintaining counts over the three bins for all the documents you've seen. The probability of each bin in subsequent documents is its count divided by the sum of the counts; each time you see a document, increment the counts (see the sketch after this list).
2) Technically the NB model is the likelihood: you train a model as above for each class. Multiply the likelihood by the prior to get the posterior probability of the class, but be aware that in NB your likelihoods often swamp your priors, because the assumption of independence leads to very sharp distributions (see the paper by Hand and Yu).
3) Easy: just change the feature $f_w$ from the normalised frequency of $w$ to the tf-idf of $w$. Be aware that you'll need to specify new, sensible bins in your histogram if you stick with the categorical approach (the alternative is to specify continuous distributions on your features, but it's tricky to come up with good ones).
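To make points 1) and 2) concrete, here is a minimal sketch (the class names and data are invented, and starting each bin count at 1 is a small Laplace-smoothing addition on my part) of keeping per-class bin counts that can be incremented one document at a time, then combining the binned likelihood with the prior:

```python
import math

BINS = ["very rare", "rare", "frequent"]

def bin_of(f_w):
    """Map the normalised frequency of word w into one of the three bins above."""
    if f_w < 0.001:
        return "very rare"
    elif f_w < 0.01:
        return "rare"
    return "frequent"

# Per-class counts over the bins, started at 1 (Laplace smoothing) so no bin
# ever has zero probability. The class names are just placeholders.
counts = {c: {b: 1 for b in BINS} for c in ("spam", "ham")}
class_counts = {"spam": 0, "ham": 0}

def update(f_w, c):
    """Point 1): increment the counts each time a labelled document is seen."""
    counts[c][bin_of(f_w)] += 1
    class_counts[c] += 1

def log_posterior(f_w, c):
    """Point 2): log prior plus log likelihood of the observed bin under class c."""
    prior = (class_counts[c] + 1) / (sum(class_counts.values()) + len(class_counts))
    likelihood = counts[c][bin_of(f_w)] / sum(counts[c].values())
    return math.log(prior) + math.log(likelihood)

# Training: feed in (frequency, label) pairs as they arrive.
for f_w, label in [(0.0005, "ham"), (0.02, "spam"), (0.015, "spam")]:
    update(f_w, label)

# Prediction: pick the class with the larger (log) posterior.
print(max(counts, key=lambda c: log_posterior(0.012, c)))  # "spam" here
```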
You should construct your features (in this case, the words you're including as descriptors of each document) based only on your training set. From the training set you estimate the probability of observing a certain word given that the document belongs to a particular class: $P(w_i|c_k)$. In case you're wondering, this probability is needed when calculating the probability of a document belonging to some class: $P(c_k|\text{document})$.
When you want to predict the class for a new document in the test set, ignore the words that are not included in the training set. The reason is that you can't use the test set for anything other than testing your predictions. Furthermore, the training set must be representative of the test set. Otherwise, you won't get a good classifier. Therefore, it is to be expected that the majority of the words in the test set are also included in the training set.
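A small sketch of that idea (the documents and class labels are invented for illustration): the vocabulary and the per-class word counts come only from the training documents, and at prediction time any word outside that vocabulary is simply skipped:

```python
from collections import Counter

# Training documents with labels (toy data, purely illustrative)
train_docs = [("win a prize now", "spam"), ("meeting agenda attached", "ham")]

# The vocabulary and counts are built from the training set only.
vocab = {w for text, _ in train_docs for w in text.split()}
word_counts = {"spam": Counter(), "ham": Counter()}
for text, label in train_docs:
    word_counts[label].update(text.split())

def known_words(text):
    """At prediction time, drop any word that never appeared in the training set."""
    return [w for w in text.split() if w in vocab]

print(known_words("win the lottery now"))  # ['win', 'now']
```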
Some people add an extra column for unknown words and try to calculate a probability of such words given a certain class: $P(\text{unknown} | c_{k})$. I don't think this is necessary or even appropriate, because in order to obtain this probability you would need to peek at the test set. That's something you must never do.