You should construct your features (in this case, the words you're including as descriptors of each document) based only on your training set. From the training set you estimate the probability of observing a certain word given that the document belongs to a particular class: $P(w_i|c_k)$. In case you're wondering, this probability is needed when calculating the probability of a document belonging to some class: $P(c_{k}|\text{document})$
When you want to predict the class for a new document in the test set, ignore the words that are not included in the training set. The reason is that you can't use the test set for anything other than testing your predictions. Furthermore, the training set must be representative of the test set. Otherwise, you won't get a good classifier. Therefore, it is to be expected that the majority of the words in the test set are also included in the training set.
Some people add an extra column for unknown words and try to calculate a probability of such words given a certain class: $P(\text{unknown} | c_{i})$. I don't think this is necessary or even appropriate because in order to obtain this probability, you need to peek at the test set. That's something you must never do.
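As a minimal sketch of the point above, here is how building the vocabulary from the training set alone works in plain Python (the documents and words are made up for illustration): out-of-vocabulary words in a test document are simply ignored when computing counts.

```python
from collections import Counter

# Hypothetical toy corpus; the vocabulary is built from the training set ONLY.
train_docs = ["the cat sat", "the dog barked", "a cat barked"]
test_docs = ["the cat meowed"]  # "meowed" never appears in training

vocabulary = sorted({w for doc in train_docs for w in doc.split()})

def to_counts(doc, vocab):
    """Count occurrences of each vocabulary word; out-of-vocabulary
    words (here, "meowed") are silently dropped."""
    counts = Counter(w for w in doc.split() if w in vocab)
    return [counts[w] for w in vocab]

X_train = [to_counts(d, vocabulary) for d in train_docs]
X_test = [to_counts(d, vocabulary) for d in test_docs]
```

Libraries such as scikit-learn follow the same convention: a vectorizer fit on the training documents will discard unseen words when transforming the test documents.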
You're correct that the supervised learner can often be redundant with TF-IDF weighting. Here's the basic outline of why: In one typical form of TF-IDF weighting, the rescaling is logarithmic, so the weighting for a word $w$ in a document $d$ is
$$
\text{TF-IDF}(w,d) = (\text{no. occurrences of $w$ in $d$}) \cdot f(w)
$$ for $N$ the number of documents in the corpus and $f(w)=\log\left(\frac{N}{\text{no. documents containing $w$}}\right)$. When $f(w)>0$, TF-IDF just amounts to a rescaling of the term frequency. So if we write the matrix counting the number of occurrences of each word in each document as $X$, then a linear model has the form $X\beta$. If we use TF-IDF instead of the term frequency alone, the linear model can be written as $X(kI)\tilde{\beta}$, where $kI$ is the diagonal matrix whose diagonal entries are the weights $k_i=f(w_i)$. The effect of $kI$ is to rescale each column of $X$. In this setting, the choice to use TF-IDF or TF alone is inconsequential, because you'll get the same predictions: using the substitution $(kI)\tilde{\beta}=\beta$, we can see the effect is just a rescaling of $\beta$.
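The rescaling argument can be checked numerically. Below is a small NumPy sketch with a random toy term-frequency matrix and made-up positive weights $k_i$ (the $f(w)>0$ case): the TF model with coefficients $\beta$ and the TF-IDF model with coefficients $\tilde{\beta} = (kI)^{-1}\beta$ produce identical predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-frequency matrix X (documents x words) and IDF-style weights k.
X = rng.integers(0, 5, size=(6, 4)).astype(float)
k = np.array([0.5, 1.2, 2.0, 0.8])  # k_i = f(w_i), all strictly positive

beta = rng.normal(size=4)   # coefficients of the TF model
beta_tilde = beta / k       # the substitution (kI) beta_tilde = beta

# Predictions from TF alone and from TF-IDF (column rescaling) coincide.
pred_tf = X @ beta
pred_tfidf = (X * k) @ beta_tilde  # X diag(k) beta_tilde
```

Any model that is equivariant to per-feature rescaling (e.g. unregularized linear regression) behaves this way; regularized models generally do not, since the penalty on $\tilde{\beta}$ differs from the penalty on $\beta$.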
But there are at least two scenarios where the choice to use TF-IDF is consequential for supervised learning.
The first case is when $f(w)=0$. This happens whenever a term occurs in every document, such as very common words like "and" or "the." In this case, TF-IDF will zero out the column in $X(kI)$, resulting in a matrix which is not full-rank. A rank-deficient matrix is often not preferred for supervised learning, so instead these words are simply dropped from $X$ because they add no information. In this way, TF-IDF provides automatic screening for the most common words.
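To see the first case concretely, here is a tiny illustration (with a made-up three-document corpus) of the IDF weight $f(w)$ from the formula above: a word appearing in every document gets weight $\log(N/N) = 0$.

```python
import math

# Toy corpus as sets of words: "the" appears in every document.
docs = [{"the", "cat"}, {"the", "dog"}, {"the", "fish"}]
N = len(docs)

def idf(word):
    """f(w) = log(N / no. documents containing w)."""
    df = sum(word in d for d in docs)
    return math.log(N / df)
```

Here `idf("the")` is exactly zero, so the corresponding column of $X(kI)$ vanishes, while a rarer word like `"cat"` keeps a positive weight.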
The second case is when the matrix $X(kI)$ has its document vectors rescaled to the same norm. Since a longer document is very likely to have a much larger vocabulary than a shorter document, it can be hard to compare documents of different lengths. Rescaling each document vector also suppresses the importance of a word within its document, independently of how rare or common that word is in the corpus. Moreover, rescaling each document's vector to have the same norm after computing TF-IDF gives a design matrix which is not a linear transformation of $X$, so the original matrix cannot be recovered by a linear rescaling.
Rescaling the document vectors has a close connection to cosine similarity, since both methods involve comparing unit-length vectors.
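The connection can be sketched in a few lines of NumPy (the two document vectors are hypothetical term counts): the dot product of the L2-normalized vectors is, by definition, the cosine similarity of the originals.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two toy document vectors of different lengths.
u = np.array([3.0, 1.0, 0.0])
v = np.array([1.0, 2.0, 2.0])

# Dot product of the unit-length vectors equals the cosine similarity.
dot_of_units = float(np.dot(l2_normalize(u), l2_normalize(v)))
cos_uv = cosine_similarity(u, v)
```

So a linear model on L2-normalized document vectors is implicitly scoring documents by cosine similarity with the coefficient vector.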
The popularity of TF-IDF in some settings does not necessarily impose a limitation on the methods you use. Recently, it has become very common to use word and token vectors that are either pre-trained on a large corpus or trained by the researcher for their particular task. Depending on what you're doing, the scale of your data, and the goal of your analysis, it might be more expedient to use TF-IDF, word2vec, or another method to represent natural language information.
A number of resources can be found here, which I reproduce for convenience.
K. Sparck Jones. "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation, 28 (1). 1972.
G. Salton, E. Fox, and H. Wu. "Extended Boolean information retrieval". Communications of the ACM, 26 (11). 1983.
G. Salton and M. J. McGill. "Introduction to modern information retrieval". 1983.
G. Salton and C. Buckley. "Term-weighting approaches in automatic text retrieval". Information Processing & Management, 24 (5). 1988.
H. Wu, R. Luk, K. Wong, and K. Kwok. "Interpreting TF-IDF term weights as making relevance decisions". ACM Transactions on Information Systems, 26 (3). 2008.
It is always hard to assess a priori the performance of a pre-treatment of the data. Even something as simple as normalizing the data does not have an obvious influence on the performance of the classifiers trained afterwards (see, for example, this post: Normalizing data worsens the performance of CNN?).
However, the following links may help you implement your idea:
In Text Classification With Word2Vec, the author assesses the performance of various classifiers on text documents with a word2vec embedding. It turns out that the best performance is attained with the "classical" linear support vector classifier and a TF-IDF encoding (the post is really helpful in terms of code, especially if you work with Python and scikit-learn).
Regarding SVMs, there are kernels designed for text. I once had good results with information diffusion kernels and a TF-IDF encoding. There are also kernels that work directly on strings (see Text Classification using String Kernels), but their implementations are scarcer...