Kernel methods can be used for supervised and unsupervised problems. Well-known examples are the support vector machine and kernel spectral clustering, respectively.
Kernel methods provide a structured way to use a linear algorithm in a transformed feature space, where the transformation is typically nonlinear and maps into a higher-dimensional space. The key advantage of this so-called kernel trick is that nonlinear patterns can be found at a reasonable computational cost.
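To make the trick concrete with a standard textbook example: for $\mathbf{x}, \mathbf{z} \in \mathbb{R}^2$, the polynomial kernel $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$ satisfies $$ k(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^T\phi(\mathbf{z}), \quad \text{with } \phi(\mathbf{x}) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2), $$ so a linear algorithm can operate in the three-dimensional feature space while only ever evaluating $k$ on the original two-dimensional inputs, without computing $\phi$ explicitly.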
Note that I said the computational cost is reasonable, but not negligible. Kernel methods typically construct a kernel matrix $\mathbf{K} \in \mathbb{R}^{N\times N}$ with $N$ the number of training instances. The complexity of kernel methods is therefore a function of the number of training instances, rather than the number of input dimensions. Support vector machines, for example, have a training complexity between $O(N^2)$ and $O(N^3)$. For problems with very large $N$, this complexity is currently prohibitive.
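To illustrate where that cost comes from, here is a minimal NumPy sketch that builds the $N \times N$ matrix for the commonly used RBF kernel (the function name and $\gamma$ value are just for illustration). Both memory and time scale as $O(N^2)$, regardless of the input dimensionality:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Construct the N x N RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negatives from rounding

# N = 2000 samples already yield a 2000 x 2000 matrix: cost grows as O(N^2),
# while the number of input dimensions (50 here) only affects the dot products.
X = np.random.randn(2000, 50)
K = rbf_kernel_matrix(X)
print(K.shape)  # (2000, 2000)
```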
This makes kernel methods very interesting from a computational perspective when the number of dimensions is large and the number of samples is relatively low (say, less than 1 million).
SVM for Large Scale Problems
For very high dimensional problems, such as the 10,000 dimensions you mention in the question, there is often no need to map to a higher dimensional feature space: the input space is already rich enough. For such problems, linear methods are orders of magnitude faster with almost the same predictive performance. Examples of these methods can be found in LIBLINEAR or Vowpal Wabbit.
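As a rough sketch of what this looks like in practice, assuming scikit-learn is available (its LinearSVC is backed by LIBLINEAR), the following compares a linear SVM against a kernelized one on synthetic high-dimensional data; the dataset sizes are arbitrary and only meant for illustration:

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

# Synthetic high-dimensional problem: many features, moderate sample count
X, y = make_classification(n_samples=2000, n_features=10000,
                           n_informative=100, random_state=0)

for name, clf in [("LinearSVC (LIBLINEAR)", LinearSVC()),
                  ("SVC (RBF kernel)", SVC(kernel="rbf"))]:
    start = time.time()
    clf.fit(X, y)
    print(f"{name}: trained in {time.time() - start:.1f}s, "
          f"train accuracy {clf.score(X, y):.3f}")
```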
Linear methods are particularly interesting when you have many samples in a high dimensional input space. When you have only $500$ samples, using a nonlinear kernel method will also be cheap (since $N$ is small). If you had, say, $5{,}000{,}000$ samples in $10{,}000$ dimensions, kernel methods would be infeasible.
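A back-of-the-envelope calculation shows why: with $N = 5{,}000{,}000$, the kernel matrix alone would hold $N^2 = 2.5 \times 10^{13}$ entries, i.e. roughly $200$ terabytes when stored as 8-byte doubles, before training even starts.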
For low-dimensional problems with many training instances (so-called large $N$, small $p$ problems), linear methods may yield poor predictive accuracy. For such problems, ensemble methods such as EnsembleSVM provide nonlinear decision boundaries at a significantly reduced computational cost compared to standard SVMs.
Best Answer
You have several options, the best of which is to use a kernel function tailored to your specific problem. That is also the least intuitive option, so I won't elaborate on it.
Usually, categorical data are encoded using so-called one-hot encoding, which introduces one binary feature per category. For instance, assume we have 3 categories $A$, $B$ and $C$: $$ A \rightarrow [1, 0, 0],\quad B \rightarrow [0, 1, 0],\quad C \rightarrow [0, 0, 1] $$ The crucial point is that all pairwise distances between categories must be equal; otherwise the distance will be biased towards some pairs (unless, of course, that is what you want). One-hot encoding is the simplest way to ensure that all pairwise distances between categories are equal.
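As a quick sanity check of that claim (a minimal sketch, assuming the kernel is based on Euclidean distance, as the RBF kernel is):

```python
import numpy as np

# Hypothetical three-category feature, encoded one-hot
codes = {"A": np.array([1, 0, 0]),
         "B": np.array([0, 1, 0]),
         "C": np.array([0, 0, 1])}

# Every pair of distinct categories sits at the same Euclidean distance
# (sqrt(2)), so no pair is treated as "more similar" than any other.
for a in codes:
    for b in codes:
        if a < b:
            print(a, b, np.linalg.norm(codes[a] - codes[b]))  # all print 1.414...
```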