Let $\mathcal{X}$ denote your input space, i.e. the space where your data points reside. Consider a function $\Phi:\mathcal{X} \rightarrow \mathcal{F}$ that takes a point in the input space $\mathcal{X}$ and maps it to a point in $\mathcal{F}$. Now suppose we have mapped all the data points from $\mathcal{X}$ into this new space $\mathcal{F}$. If you solve the ordinary linear SVM in $\mathcal{F}$ instead of $\mathcal{X}$, all the earlier working looks exactly the same, except that each point $x_i$ is replaced by $\Phi(x_i)$, and the dot product $x^Ty$ (the natural inner product of Euclidean space) is replaced by $\langle \Phi(x), \Phi(y) \rangle$, the natural inner product of the new space $\mathcal{F}$. So, at the end, your $w^*$ looks like
$$
w^*=\sum_{i \in SV} h_i y_i \Phi(x_i)
$$
and hence,
$$
\langle w^*, \Phi(x) \rangle = \sum_{i \in SV} h_i y_i \langle \Phi(x_i), \Phi(x) \rangle
$$
Similarly,
$$
b^*=\frac{1}{|SV|}\sum_{i \in SV}\left(y_i - \sum_{j=1}^N\left(h_j y_j \langle \Phi(x_j), \Phi(x_i)\rangle\right)\right)
$$
and your classification rule looks like $c_x=\text{sign}(\langle w^*, \Phi(x) \rangle+b^*)$.
So far so good: there is nothing new here, since we have simply applied the ordinary linear SVM in a different space. However, the magic part is this:
Let us say that there exists a function $k:\mathcal{X}\times\mathcal{X}\rightarrow \mathbb{R}$ such that $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$. Then, we can replace all the dot products above with $k(x_i, x_j)$. Such a $k$ is called a kernel function.
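To make this concrete, here is a small numerical check (a sketch in NumPy; the homogeneous degree-2 polynomial kernel is used only because it has an explicit, finite-dimensional $\Phi$ that is easy to write down):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel on R^2:
    Phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, y):
    """Kernel k(x, y) = (x^T y)^2, which equals <Phi(x), Phi(y)>."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The two numbers agree: the kernel computes the inner product in F
# without ever forming Phi(x) or Phi(y) explicitly.
print(np.dot(phi(x), phi(y)))  # 1.0
print(k(x, y))                 # 1.0
```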
Therefore, $\langle w^*, \Phi(x) \rangle$ and $b^*$ look like
$$
\langle w^*, \Phi(x) \rangle = \sum_{i \in SV} h_i y_i k(x_i, x)
$$
$$
b^*=\frac{1}{|SV|}\sum_{i \in SV}\left(y_i - \sum_{j=1}^N\left(h_j y_j k(x_j, x_i)\right)\right)
$$
For which kernel functions is the above substitution valid? That is a slightly involved question: roughly, $k$ must be a symmetric positive semi-definite (Mercer) kernel, and you may want to consult a proper reference to understand the implications. I will just add that the above holds true for the RBF kernel.
To answer your question, "Is the situation so that all the support vectors are needed for the classification?"
Yes. As you can see above, we never compute $w^*$ explicitly; instead we compute its inner product with $\Phi(x)$ as a sum over the support vectors, which is why all of them must be retained for classification.
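A minimal sketch of that classification rule (assuming the RBF kernel, and assuming `sv_x`, `sv_y`, `sv_h`, and `b` already hold the support vectors, their labels, their dual coefficients $h_i$, and the intercept $b^*$ obtained from training; these names are mine, not from any particular library):

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def classify(x, sv_x, sv_y, sv_h, b, gamma=1.0):
    """Kernelized decision rule: sign(sum_i h_i y_i k(x_i, x) + b).
    Every support vector enters the sum, which is why they all have to be stored."""
    decision = sum(h_i * y_i * rbf_kernel(x_i, x, gamma)
                   for x_i, y_i, h_i in zip(sv_x, sv_y, sv_h))
    return np.sign(decision + b)
```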
Note: the $h_i$'s here are the solutions to the dual of the SVM in the space $\mathcal{F}$, not $\mathcal{X}$. Does that mean we need to know the function $\Phi$ explicitly? Luckily, no. If you look at the dual objective, the data enter it only through inner products, and since $k$ lets us compute those inner products directly, we never need $\Phi$ explicitly. The dual problem simply looks like
$$
\max_h \; \sum_i h_i - \frac{1}{2}\sum_{i,j} y_i y_j h_i h_j k(x_i, x_j) \\
\text{subject to: } \sum_i y_i h_i = 0, \quad h_i \geq 0
$$
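For illustration only, here is one way this dual could be solved numerically on toy data (a sketch using `scipy.optimize.minimize` with the RBF kernel; real SVM implementations use dedicated solvers such as SMO, and a soft-margin formulation would additionally bound the $h_i$ above by $C$):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: two clusters labelled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (10, 2)), rng.normal(2.0, 1.0, (10, 2))])
y = np.array([-1.0] * 10 + [1.0] * 10)

gamma = 0.5
# Gram matrix K[i, j] = k(x_i, x_j); the dual needs nothing but these inner products.
K = np.exp(-gamma * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))

def neg_dual(h):
    """Negative dual objective (we minimize this to maximize the dual)."""
    return -(np.sum(h) - 0.5 * h @ (np.outer(y, y) * K) @ h)

constraints = {"type": "eq", "fun": lambda h: np.dot(y, h)}  # sum_i y_i h_i = 0
bounds = [(0.0, None)] * len(y)                              # h_i >= 0

res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=bounds, constraints=constraints)
h = res.x
support = h > 1e-6  # the support vectors are the points with non-zero h_i
print("number of support vectors:", int(np.sum(support)))
```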
Neural networks are not a kernel; they are a learning algorithm.
Plenty of kernel functions exist, such as:
- sigmoid, popular in the early days of kernel methods due to the influence of neural networks; not used heavily any more
- Tanimoto/Jaccard/diffusion, popular for binary features
- tree/graph kernels, popular in natural language processing
- histogram (intersection) kernel, popular in image processing; essentially a very fast approximation to the RBF kernel
The right kernel depends very much on the nature of the data. Often the best kernel is a custom-made one, particularly in bioinformatics. The Gaussian/RBF and linear kernels are by far the most popular ones, followed by the polynomial one.
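To make a couple of the entries above concrete, here are minimal sketches of the Tanimoto/Jaccard kernel for binary features and the histogram intersection kernel (my own toy implementations, not taken from any particular library):

```python
import numpy as np

def tanimoto_kernel(x, y):
    """Tanimoto/Jaccard kernel for binary vectors: |x AND y| / |x OR y|."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    union = np.sum(x | y)
    return np.sum(x & y) / union if union > 0 else 1.0

def histogram_intersection_kernel(x, y):
    """Histogram intersection kernel: sum_i min(x_i, y_i), for non-negative histograms."""
    return np.sum(np.minimum(x, y))

print(tanimoto_kernel([1, 0, 1, 1], [1, 1, 0, 1]))                       # 0.5
print(histogram_intersection_kernel([0.2, 0.5, 0.3], [0.4, 0.1, 0.5]))   # 0.6
```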
Best Answer
First of all, your question is not quite well-posed. The reason is that Mercer's theorem only applies to a kernel defined on a finite measure space. Practically, this means that in order to apply the theorem, the eigenfunctions $\phi_i$ are in fact taken with respect to the operator $$K_{\mu}(f)= \left( x\mapsto \int_{\mathbb{R}} K(x,y)f(y)\,\mu(dy)\right)$$ where $\mu(dy)=p(y)\,dy$ is a probability measure. The $\phi_i$ are then orthonormal with respect to the inner product defined by $\langle f,g\rangle=\int f(x)g(x)\,\mu(dx)$.
It is simple to see that the condition $\mu(\mathbb{R})<\infty$ is necessary for Mercer's theorem to hold. Consider the identity:
$$\int_{\mathbb{R}} e^{-(x-y)^2}xdx=\sqrt{\pi}y$$
This identity (which follows from substituting $u=x-y$ and using $\int e^{-u^2}\,du=\sqrt{\pi}$, $\int u\,e^{-u^2}\,du=0$) shows that the function $f(x)=x$ is an eigenfunction of the operator $f \mapsto \int K(x,y)f(y)\,dy$. But evidently $\int_{\mathbb{R}} f(x)^2\,dx=\infty$, which shows that it is not possible to construct an orthonormal basis of eigenfunctions of $K$ with respect to Lebesgue measure without introducing a weighting function $p(y)$.
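The identity itself, and hence the eigenfunction claim, is easy to check numerically (a quick sketch using `scipy.integrate.quad`):

```python
import numpy as np
from scipy.integrate import quad

def lhs(y):
    """Compute the integral of exp(-(x - y)^2) * x over the whole real line."""
    val, _ = quad(lambda x: np.exp(-(x - y) ** 2) * x, -np.inf, np.inf)
    return val

for y in [0.0, 1.0, -2.5]:
    # Both columns agree, confirming the value sqrt(pi) * y.
    print(lhs(y), np.sqrt(np.pi) * y)
```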
Secondly, I am assuming there should be a minus sign in the definition of the kernel $e^{-|x-y|}$, otherwise the resulting kernel fails to be positive definite.