Solved – SVM kernels and decision boundaries

kernel trick, machine learning, svm

I'm trying to figure out intuitively how the kernel trick gives rise to a decision boundary. I've always thought of SVMs as high-dimensional spaces with a decision plane dividing them up, and I'm trying to understand how kernels fit into that picture aside from allowing the decision plane to be nonlinear.

In my particular case I'm looking at string kernels, where strings are classified. I get that kernelization helps to make decision boundaries for nonlinearly separable data, but whenever this is explained, the examples usually only seem to show how the dot product gives you a similarity score between two data points, not how it partitions a set of data into two categories.

https://www.quora.com/What-are-kernels-in-machine-learning-and-SVM-and-why-do-we-need-them

For example, in the above link the commenter Lili simply gives an example of what looks like two data points in 3D space (x and y) and calculates the dot product between them. But I don't see an explanation of how this product can be used to determine similarity between the two points, or how it can be extended into a decision boundary between two groups of points.

Does anyone have some insight into this?

Best Answer

the examples usually only seem to show how the dot product gives you a similarity score between two data points, not how it partitions a set of data into two categories.

A linear SVM partitions space using a hyperplane. A hyperplane can be represented as $$H = \{x \in \mathbb{R}^k \mid x \cdot n = b \}$$ for some normal vector $n \in \mathbb{R}^k$ and offset $b \in \mathbb{R}$.

$H$ partitions $\mathbb{R}^k$ according to the sign of $x \cdot n - b$.
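
As a concrete illustration, classification then comes down to checking that sign. A minimal sketch, assuming hypothetical values for $n$ and $b$ (in a real SVM these would be learned from the training data):

```python
import numpy as np

# Hypothetical hyperplane parameters; a trained SVM would learn these.
n = np.array([1.0, -2.0, 0.5])  # normal vector of the hyperplane H
b = 1.0                         # offset of H along n

def classify(x):
    """Label x by which side of H it falls on: sign(x . n - b)."""
    return np.sign(np.dot(x, n) - b)

print(classify(np.array([3.0, 0.0, 0.0])))  #  1.0 (one side of H)
print(classify(np.array([0.0, 2.0, 0.0])))  # -1.0 (the other side)
```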

But I don't see an explanation of how this product can be used to determine similarity between the two points

You can calculate the squared Euclidean distance between two points using only dot products: $$\|x - y\|^2 = (x - y)\cdot (x-y) = x \cdot x - 2 x \cdot y + y \cdot y$$
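
Since this identity involves nothing but dot products, you can replace each dot product with a kernel evaluation and measure distances in the embedded feature space without ever computing the embedding itself. A small sketch, using an RBF kernel purely as an assumed example (any valid kernel works the same way):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """An example kernel; stands in for any k(x, y) = phi(x) . phi(y)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def feature_space_dist_sq(x, y, k):
    """||phi(x) - phi(y)||^2 = k(x,x) - 2 k(x,y) + k(y,y)."""
    return k(x, x) - 2 * k(x, y) + k(y, y)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# Distance in the (implicit) feature space, computed from kernel values only.
print(feature_space_dist_sq(x, y, rbf_kernel))
```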

The kernel trick lets you separate data that is not linearly separable by implicitly embedding it in some other space where it is. To actually fit the SVM you just need the kernel, since the training objective is expressed solely in terms of dot products (and never explicitly uses coordinates of the space into which you embed your data).
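
To connect this back to your string-kernel setting: here is a hedged sketch using a toy 2-spectrum string kernel (counts of shared length-2 substrings, chosen here just for illustration) together with scikit-learn's SVC(kernel='precomputed'), which accepts the Gram matrix of pairwise kernel values directly. The SVM never sees coordinates for the strings, only their dot products:

```python
import numpy as np
from collections import Counter
from sklearn.svm import SVC

def spectrum_kernel(s, t, p=2):
    """Toy p-spectrum kernel: dot product of length-p substring counts."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[sub] * ct[sub] for sub in cs)

train = ["aabb", "abab", "ccdd", "cdcd"]  # toy strings, two classes
labels = [0, 0, 1, 1]

# Gram matrix of pairwise kernel values between training strings.
gram = np.array([[spectrum_kernel(s, t) for t in train] for s in train])
clf = SVC(kernel="precomputed").fit(gram, labels)

# To predict, supply kernel values between test strings and training strings.
test = ["abba", "dccd"]
gram_test = np.array([[spectrum_kernel(s, t) for t in train] for s in test])
print(clf.predict(gram_test))  # expected: [0 1]
```

The decision boundary lives in the implicit feature space of substring counts, but both training and prediction only ever consume kernel values.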
