Solved – Same kernel for mixed/categorical data

categorical-data, kernel-trick, mixed-type-data, svm

I know it's common practice, but is it right to apply the common kernels to categorical/mixed data? If not, are there alternatives? I'm expecting answers from both theoretical and practical points of view.

The intended application is to any machine learning algorithm that supports the use of kernels, like (LS)-SVM, RVM, Gaussian Processes, Kernel K-NN, etc.

For all intents and purposes, consider that non-binary categorical variables can be coded into binary variables, under any scheme. Solutions using all levels at once are preferred, though.
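
For concreteness, a minimal sketch of the kind of coding I mean (dummy/one-hot coding with pandas; the column names are made up):

```python
import pandas as pd

# Hypothetical mixed-type data: one numeric and one categorical column.
df = pd.DataFrame({
    "age": [23, 45, 31],
    "color": ["red", "green", "red"],
})

# Dummy/one-hot coding: each level of "color" becomes a binary column,
# after which any common kernel (RBF, polynomial, ...) could be applied.
coded = pd.get_dummies(df, columns=["color"])
print(coded)
```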


Another question (Kernel methods on Categorical Data) nearly addressed the issue, but the accepted answer only mentions:

[…] using a kernel function that is tailored to your specific problem.
This is also the least intuitive option, so I won't elaborate on that.

That doesn't answer the current question.

Best Answer

From a practical point of view, there are no issues with that practice, and it has some benefits (like a simplified framework).

From a theoretical point of view, however, coincidence between categorical features might not mean much similarity by itself: how meaningful a match is depends on the probabilities of occurrence of the values involved, and those probabilities could (should?) be taken into account, adding more information to the problem.


Marco Antonio Villegas García describes some valid kernels for categorical data in his MSc thesis [1], which even beat common kernels in SVM classification benchmarks.

They are:

$$\begin{align} k_{0}(z_{i},z_{j}) &= \begin{cases}1, & z_{i} = z_{j}\\0, & z_{i} \neq z_{j}\end{cases}\\ k_{1}(z_{i},z_{j}) &= \begin{cases}h(P_{z}(z_{i})), & z_{i} = z_{j}\\0, & z_{i} \neq z_{j}\end{cases}\end{align}$$

With $h(z) = (1-z^{\alpha})^{1/\alpha}$ being a measure of "probabilistic" similarity, and $P_{z}$ a probability mass function (PMF); in other words, $P_{z}(z_{i})$ is the probability that the variable $z$ takes the value $z_{i}$.

While $k_{0}$ takes a naive approach to similarity (categories either match or they don't), $k_{1}$ takes into account the probabilities of occurrence: since $h$ is decreasing, a match on a rare category yields a higher similarity than a match on a frequent one.
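
As a rough illustration (my own sketch, not code from the thesis), here is how $k_{0}$ and $k_{1}$ could look for a single categorical variable, with the PMF estimated empirically from a sample; the function names and the default $\alpha = 1$ are assumptions:

```python
import numpy as np

def k0(zi, zj):
    """Overlap kernel: 1 if the categories match, 0 otherwise."""
    return float(zi == zj)

def estimate_pmf(sample):
    """Empirical PMF of one categorical variable from a sample."""
    values, counts = np.unique(sample, return_counts=True)
    return dict(zip(values, counts / counts.sum()))

def k1(zi, zj, pmf, alpha=1.0):
    """Probability-weighted kernel: matches on rare categories score higher."""
    if zi != zj:
        return 0.0
    return (1.0 - pmf[zi] ** alpha) ** (1.0 / alpha)  # h(P_z(z_i))

# Example: "a" occurs 3/4 of the time, "b" only 1/4.
pmf = estimate_pmf(np.array(["a", "a", "a", "b"]))
print(k1("a", "a", pmf))  # 0.25 -- frequent match, weak evidence
print(k1("b", "b", pmf))  # 0.75 -- rare match, strong evidence
```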

He also introduces a third kernel as an afterthought:

$$k_{2}(z_{i},z_{j}) = \begin{cases}h(P_{z}(z_{i})), & z_{i} = z_{j}\\g(P_{z}(z_{i}),P_{z}(z_{j})), & z_{i} \neq z_{j}\end{cases}$$

where $g$ is another inverting function, now applied to the probabilities of the two distinct values.
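
Since the excerpt above leaves the exact form of $g$ open, I won't guess at it; instead, here is a sketch of how a per-variable kernel like $k_{0}$ or $k_{1}$ could be plugged into an SVM via a precomputed Gram matrix. Averaging over the components is my own choice (it preserves positive semi-definiteness, being a scaled sum of PSD kernels), not something taken from the thesis:

```python
import numpy as np
from sklearn.svm import SVC

def gram_matrix(X, Y, kernel):
    """Average a per-component categorical kernel over all columns."""
    G = np.zeros((len(X), len(Y)))
    for a, x in enumerate(X):
        for b, y in enumerate(Y):
            G[a, b] = np.mean([kernel(xi, yi) for xi, yi in zip(x, y)])
    return G

# Toy data: two categorical features, binary target.
X = np.array([["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"]])
y = np.array([0, 0, 1, 1])

# Fit on the training Gram matrix; prediction needs K(test, train).
clf = SVC(kernel="precomputed").fit(gram_matrix(X, X, k0), y)
print(clf.predict(gram_matrix(X, X, k0)))
```

The same wiring works for any kernel machine that accepts a precomputed Gram matrix, and $k_{0}$ can be swapped for $k_{1}$ (with a fitted PMF per column) without changing anything else.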


[1] Villegas García, M. A. (2013). An investigation into new kernels for categorical variables. MSc thesis.
