Have you thought about storing the term frequencies and the document frequencies separately for the training set? Then, when you add a new document, you only need to update the document frequencies to include it and recompute tf-idf. Or am I missing something?
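For what it's worth, a minimal sketch of that idea in Python (the class and method names are my own, and the smoothed idf variant is just one common choice):

```python
from collections import Counter, defaultdict
import math

class IncrementalTfIdf:
    """Keep raw term frequencies per document and a global document-frequency
    table, so adding a document only updates the df counts."""

    def __init__(self):
        self.doc_tf = []            # one Counter of term frequencies per document
        self.df = defaultdict(int)  # document frequency per term

    def add_document(self, tokens):
        tf = Counter(tokens)
        self.doc_tf.append(tf)
        for term in tf:             # each term counted once per document
            self.df[term] += 1

    def tfidf(self, doc_index):
        n_docs = len(self.doc_tf)
        tf = self.doc_tf[doc_index]
        # smoothed idf; other idf variants plug in the same way
        return {t: f * math.log((1 + n_docs) / (1 + self.df[t]))
                for t, f in tf.items()}

corpus = IncrementalTfIdf()
corpus.add_document("the cat sat on the mat".split())
corpus.add_document("the dog sat".split())   # only the df counts are updated
print(corpus.tfidf(0))
```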
Kernel methods can be used for supervised and unsupervised problems. Well-known examples are the support vector machine and kernel spectral clustering, respectively.
Kernel methods provide a structured way to apply a linear algorithm in a transformed feature space, where the transformation is typically nonlinear and maps to a higher dimensional space. The key advantage of this so-called kernel trick is that nonlinear patterns can be found at a reasonable computational cost.
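As a tiny illustration of the trick (a toy example of my own, not from the answer): the degree-2 polynomial kernel $(\mathbf{x}^\top\mathbf{z})^2$ equals an inner product in an explicit quadratic feature space, yet the kernel evaluation never builds that space:

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

# explicit degree-2 feature map: all products x_i * x_j (9 features for d = 3)
phi = lambda v: np.outer(v, v).ravel()

lhs = phi(x) @ phi(z)          # inner product in the expanded feature space
rhs = (x @ z) ** 2             # kernel evaluation in the original input space
print(np.isclose(lhs, rhs))    # True: same value, without constructing phi explicitly
```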
Note that I said the computational cost is reasonable, but not negligible. Kernel methods typically construct a kernel matrix $\mathbf{K} \in \mathbb{R}^{N\times N}$ with $N$ the number of training instances. The complexity of kernel methods is therefore a function of the number of training instances, rather than the number of input dimensions. Support vector machines, for example, have a training complexity between $O(N^2)$ and $O(N^3)$. For problems with very large $N$, this complexity is currently prohibitive.
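To make that concrete, here is a small sketch (assuming scikit-learn is available): storing the Gram matrix alone already costs $O(N^2)$ memory, before any training happens.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# small example; memory and training cost grow with N, not with d
N, d = 2_000, 10_000
X = np.random.randn(N, d)

K = rbf_kernel(X)                          # the N x N Gram matrix kernel methods build
print(K.shape)                             # (2000, 2000)
print(f"{K.nbytes / 1e6:.0f} MB for N={N}")
# storage grows as N^2: N = 100_000 would already need ~80 GB just for K,
# and training is O(N^2) to O(N^3) on top of that
```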
This makes kernel methods very interesting from a computational perspective when the number of dimensions is large and the number of samples is relatively low (say, less than 1 million).
Related: Linear kernel and non-linear kernel for support vector machine?
SVM for Large Scale Problems
For very high dimensional problems, such as the 10,000 dimensions you mention in the question, there is often no need to map to a higher dimensional feature space: the input space is already good enough. For such problems, linear methods are orders of magnitude faster with almost the same predictive performance. Examples of these methods can be found in LIBLINEAR or Vowpal Wabbit.
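For illustration, a minimal sketch with scikit-learn's LinearSVC, which is built on LIBLINEAR; the sparse random data is only a placeholder for a real high-dimensional (e.g. text) dataset:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import LinearSVC

# synthetic stand-in for high-dimensional sparse input (e.g. bag-of-words features)
rng = np.random.default_rng(0)
X = sparse_random(5_000, 10_000, density=0.001, format="csr", random_state=0)
y = rng.integers(0, 2, size=5_000)

# linear SVM trained directly in the input space; no N x N kernel matrix is built
clf = LinearSVC(C=1.0).fit(X, y)
print(clf.score(X, y))
```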
Linear methods are particularly interesting when you have many samples in a high dimensional input space. When you have only $500$ samples, using a nonlinear kernel method will also be cheap (since $N$ is small). If you had, say, 5,000,000 samples in 10,000 dimensions, kernel methods would be infeasible.
For low-dimensional problems with many training instances (so-called large $N$ small $p$ problems), linear methods may yield poor predictive accuracy. For such problems, ensemble methods such as EnsembleSVM provide nonlinear decision boundaries at significantly reduced computational cost compared to standard SVM.
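As an illustration of the underlying idea only (this uses scikit-learn's BaggingClassifier around standard SVMs, not the EnsembleSVM library itself): training many small SVMs on subsamples keeps each kernel matrix tiny while the ensemble still gives a nonlinear decision boundary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# large-N, small-p setting: many samples, few dimensions
X, y = make_classification(n_samples=50_000, n_features=10, random_state=0)

# each base SVM sees only 2% of the data, so each Gram matrix stays small,
# while the combined ensemble remains a nonlinear classifier
ens = BaggingClassifier(
    SVC(kernel="rbf", C=1.0),
    n_estimators=20,
    max_samples=0.02,
    random_state=0,
)
ens.fit(X, y)
print(ens.score(X, y))
```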
Best Answer
There are two advantages (or rather two use-cases):
For every application of SVMs, a user has to choose which kernel to use and sometimes even has to design their own kernel matrices. Is it possible to alleviate the burden of choosing kernels or designing specialized kernels? MKL was a step towards that.
The second case, IMHO, is by far the more compelling one. Suppose your input is video data plus closed captions (cc). The feature representation of each video then consists of video features, audio features, and text features. Such data is known as multi-modal data. Each set of features may require a different notion of similarity (a different kernel). Instead of building a specialized kernel for such applications, is it possible to just define a kernel for each of these modes and linearly combine them?
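A minimal sketch of that linear combination (the per-mode features are synthetic placeholders and the weights $\beta_m$ are hand-picked here; an actual MKL method would learn them):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
X_video = rng.normal(size=(n, 50))    # stand-ins for per-mode feature blocks
X_audio = rng.normal(size=(n, 20))
X_text  = rng.normal(size=(n, 300))
y = rng.integers(0, 2, size=n)

# one kernel per modality, each with its own notion of similarity
K_video = rbf_kernel(X_video, gamma=0.01)
K_audio = rbf_kernel(X_audio, gamma=0.1)
K_text  = linear_kernel(X_text)

# fixed convex combination of the per-mode kernels; MKL would learn these weights
beta = [0.5, 0.3, 0.2]
K = beta[0] * K_video + beta[1] * K_audio + beta[2] * K_text

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))
```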