Kernel Trick – Exploring the Properties of Hyperbolic Tangent Kernel


I've read from various sources that Hyperbolic Tangent kernels are not positive semi-definite and thus are not actually a valid kernel. Does this mean they are a misnomer?

Furthermore, if they are technically not a kernel, how come they perform well in certain cases? What type of data do they perform well on?

And what is the rationale behind their method of measuring similarity? By this I mean: as we can consider the rationale behind polynomial kernels to be taking the logical AND of the features of a feature vector, what is the corresponding analogy for tanh kernels?

Hyperbolic Tangent kernels are sometimes also called Sigmoid Kernels or tanh kernels and are defined as
$$
k(x,x^\prime)=\tanh\left(\nu+ x\cdot x^\prime\right)
$$
This website provides some discussion.
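For concreteness, here is a minimal NumPy sketch of this definition (the function names, the parameter name `nu`, and the example vectors are my own illustration, not part of the original question):

```python
import numpy as np

def tanh_kernel(x, x_prime, nu=1.0):
    """Hyperbolic tangent (sigmoid) kernel: k(x, x') = tanh(nu + x . x')."""
    return np.tanh(nu + np.dot(x, x_prime))

def gram_matrix(X, nu=1.0):
    """Gram matrix K with K[i, j] = tanh(nu + X[i] . X[j])."""
    return np.tanh(nu + X @ X.T)

# Two illustrative feature vectors (arbitrary values).
X = np.array([[1.0, 2.0],
              [0.5, -1.0]])
print(gram_matrix(X, nu=1.0))
```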

Best Answer

I've read from various sources that Hyperbolic Tangent kernels are not positive semi-definite and thus are not actually a valid kernel. Does this mean they are a misnomer?

Consider $X=\{x_1,x_2\}$. Denote by $K$ the Gram matrix of $X$ under the hyperbolic tangent kernel $k$. Its eigenvalues are the solutions $\lambda$ to

$$\begin{align} 0 &= |I\lambda-K| \\ &= \begin{vmatrix} \lambda-k_{11} & -k_{12} \\ -k_{21} & \lambda-k_{22} \\ \end{vmatrix} \\ &=\lambda^2-\lambda(k_{11}+k_{22})+k_{11}k_{22}-k_{12}k_{21} \end{align} $$

Clearly this is a quadratic in $\lambda$. Observe that this definition of the hyperbolic tangent kernel is symmetric because the dot product is symmetric, so $k_{12}=k_{21}$, and the quadratic formula gives $$\begin{align} \lambda &= \frac{(k_{11} + k_{22})\pm\sqrt{(k_{11} + k_{22})^2-4 (k_{11}k_{22} - k_{12}k_{21})}}{2} \\ &= \frac{1}{2}\left[ (k_{11} + k_{22}) \pm \sqrt{(k_{11} - k_{22})^2 + 4k_{12}^2} \right] \end{align} $$

A matrix is PSD if and only if all $\lambda\ge0$, but specific choices of $\nu, x_1, x_2$ will result in a negative $\lambda$. Showing indefiniteness is a matter of picking $k_{ij}$ such that one or both solutions $\lambda$ are negative, and then backing into values of $x$ which yield those $k_{ij}$; a numeric sketch of one such counterexample follows.
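Here is a small sketch of such a counterexample (the choice $\nu=-1$ and the orthonormal $x_1, x_2$ are my own, not from the answer):

```python
import numpy as np

# With nu = -1 and orthonormal x1, x2, the diagonal entries of K are
# tanh(-1 + 1) = 0 while the off-diagonal entries are tanh(-1) != 0,
# so det(K) < 0 and one eigenvalue must be negative.
nu = -1.0
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])

K = np.tanh(nu + X @ X.T)            # 2x2 Gram matrix
eigenvalues = np.linalg.eigvalsh(K)  # K is symmetric, so eigvalsh applies

print(K)
print(eigenvalues)  # approximately [-0.76, 0.76]: K is not PSD
```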

If they are technically not a kernel, how come they perform well in certain cases? What type of data do they perform well on?

It may be that they are only "slightly" non-PSD in the cases where they perform well, by which I mean that the negative eigenvalues are "close to" 0. Or it may be that the most important properties of the RKHS are retained even with an indefinite Gram matrix, and losing PSD-ness is only a slight reduction in power. Finding cases where $k$ works well is just a matter of finding a paper that puts it to use; the website you linked to provides one such example. But I'm not convinced that there are general classes of kernels that are better suited to some data than to others in a way that is distinct from the choice of how to represent the feature vectors. Indefinite kernels are an ongoing topic of research.

And what is the rationale behind their method of measuring similarity? By this I mean: as we can consider the rationale behind polynomial kernels to be taking the logical AND of the features of a feature vector, what is the corresponding analogy for $\tanh$ kernels?

In this definition of the $\tanh$ kernel, we only look at the inner product of the feature vectors. This can be viewed as a "non-normalized" cosine similarity: it is larger when two vectors are more similar and negative when they are dissimilar, with the output saturating so that $\tanh(10)\approx 1$ and $\tanh(-10)\approx -1$. The parameter $\nu$ shifts the point where the sign changes away from 0. So it's like an "AND" over the features, but with a weighting that corresponds to the magnitude of agreement -- more agreement makes the output more positive.
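A quick numeric illustration of this behavior (the particular numbers are my own, chosen only for demonstration):

```python
import numpy as np

# The kernel saturates at +/-1 as the magnitude of agreement grows.
print(np.tanh(10.0))   # ~  1: strong agreement
print(np.tanh(-10.0))  # ~ -1: strong disagreement

# nu shifts the point where the sign flips away from 0.
nu = 2.0
dot = -1.0                # mildly dissimilar vectors
print(np.tanh(nu + dot))  # tanh(1) ~ 0.76: still positive despite a negative dot product
```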
