I've written a couple ;o)
G. C. Cawley and N. L. C. Talbot, "Efficient approximate leave-one-out cross-validation for kernel logistic regression," Machine Learning, vol. 71, no. 2-3, pp. 243-264, June 2008.
This gives a reasonable method for choosing the kernel and regularisation parameters, together with an empirical evaluation.
G. C. Cawley, G. J. Janacek and N. L. C. Talbot, "Generalised kernel machines," in Proceedings of the IEEE/INNS International Joint Conference on Neural Networks (IJCNN-2007), pp. 1732-1737, Orlando, Florida, USA, August 12-17, 2007.
This basically documents a MATLAB toolbox for making kernel versions of generalised linear models, with kernel logistic regression as one of the examples. The library includes code for model selection (though sadly there is no manual yet, just some demos).
However, the earliest paper I know of that uses that particular name is
J. Zhu and T. Hastie, "Kernel logistic regression and the import vector machine," Advances in Neural Information Processing Systems, 2001 (available via Google Scholar).
It is correct to use the `predict_proba` method to get the estimated probabilities.
Just to make it a bit clearer, consider the example given here (http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html).
Below I copy just the relevant lines, so that anyone can copy and paste them and reproduce what I say next:
```python
import numpy as np
from sklearn import linear_model, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features
Y = iris.target
h = .02  # step size in the mesh (unused in this snippet)

logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, Y)
```
In the iris dataset you have three classes (0, 1, 2 = ['setosa', 'versicolor', 'virginica']), contained in Y. When you call

```python
logreg.predict_proba(X)
```
you get an array with one row per sample, each row containing p(class|x) for the three classes:
```
array([[  9.05823905e-01,   6.81672013e-02,   2.60088939e-02],
       [  7.64631786e-01,   2.16376590e-01,   1.89916235e-02],
       [  8.46908157e-01,   1.42190177e-01,   1.09016662e-02],
       [  8.15654921e-01,   1.75608861e-01,   8.73621791e-03],
       ...
```
The other alternatives are `predict_log_proba()`, which returns log p(class|x), and `predict()`, which returns the class with the highest probability for a given sample.
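For instance, continuing the snippet above:

```python
# Log-probabilities and hard class predictions for the first two samples
print(logreg.predict_log_proba(X[:2]))  # log p(class|x)
print(logreg.predict(X[:2]))            # highest-probability class, here [0 0]
```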
Best Answer
scikit-learn, unfortunately, currently supports neither kernel logistic regression nor the ANOVA kernel.
There are some workarounds, though.
The idea of the kernel trick is: since many algorithms access the data only through inner products $\langle x, y \rangle$, we can map our inputs $x$ into some feature space $\varphi(x)$, which can actually be any Hilbert space $\mathcal H$, if we can compute $\langle \varphi(x), \varphi(y) \rangle_{\mathcal H}$ as $k(x, y)$. Then we don't have to do the actual mapping $\varphi$, which is often infinite-dimensional.
But it turns out that we can often find a function $z$, where $z(x) \in \mathbb R^D$, such that $z(x)^T z(y) \approx \langle \varphi(x), \varphi(y) \rangle_{\mathcal H}$. Then we can run standard logistic regression on the $z(x)$ features, and we'll get approximately the result of kernel logistic regression.
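To make the options below concrete, here's a plausible Gram-matrix implementation of the ANOVA kernel (the name `anova_kernel` and the defaults for `gamma` and `p` are my assumptions, matching the formula given under Option 3):

```python
import numpy as np

def anova_kernel(X, Y, gamma=1.0, p=2):
    """Gram matrix of k(x, y) = (sum_i exp(-gamma * (x_i - y_i)**2))**p."""
    diffs = X[:, None, :] - Y[None, :, :]  # shape (n, m, d)
    return np.exp(-gamma * diffs ** 2).sum(axis=-1) ** p
```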
Option 1: Cholesky
The Gram matrix $K$, where $K_{ij} = k(x_i, x_j)$, is all the information these methods need about the data. $K$ is necessarily positive semi-definite, and for the ANOVA kernel I believe it is always nonsingular as well. Thus there is some matrix $R$ (for example, the Cholesky factor) such that $K = R R^T$, i.e. $K$ is the Gram matrix of the rows of $R$ under the linear kernel. So you can compute $K$ yourself and then use the rows of $R$ as the features.
This approach is exact, but the logistic regression model will be slow, since the feature dimensionality equals the number of training points, and it can't be extended to new points easily (so you need to know the test points at training time).
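A minimal sketch, reusing the `anova_kernel` function above on the iris data from earlier (the jitter term is my addition, guarding against $K$ being only semi-definite):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
K = anova_kernel(X, X)            # n x n Gram matrix
K_pd = K + 1e-8 * np.eye(len(K))  # tiny jitter so Cholesky succeeds
R = np.linalg.cholesky(K_pd)      # K = R @ R.T; row i of R is the feature vector for x_i
clf = LogisticRegression(max_iter=1000).fit(R, y)
```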
I don't recommend this; it's just here for completeness.
Note that the `anova_kernel` function above doesn't do the thing RapidMiner does in "double-counting" points where both entries are zero. I don't know why RapidMiner does that; none of the other definitions of the ANOVA kernel I've seen do.

Option 2: Nyström
This is an approximate kernel embedding which works by assuming that the kernel matrix $K$ is low-rank – which it almost always is. In Nyström, you pick a set of $D$ data points, then compute the full kernel evaluation of all the $n$ points to only those $D$ points. If you pick the first $D$ points, then the kernel matrix looks like:
$$ \begin{bmatrix} A & B \\ B^T & ? \end{bmatrix} $$ where you've computed $A$ (the pairwise kernel evaluations among the $D$ points) and $B$ (the kernel evaluations from each of the other points to the $D$ points), and $?$ will be guessed based on the assumption that $K$ is rank $D$. It turns out that the way to do that is to eigendecompose $A = U \Lambda U^T$, and then use the features $Z = \begin{bmatrix}U \Lambda^{\tfrac12} \\ B^T U \Lambda^\dagger \Lambda^{\tfrac12}\end{bmatrix}$.
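In NumPy, that recipe might look like this (a sketch of the formula above, not scikit-learn's implementation; the eigenvalue tolerance is my choice):

```python
import numpy as np

def nystroem_features(A, B):
    """A: (D, D) kernel matrix among the D chosen points.
    B: (D, n - D) kernel evaluations from the remaining points to those D.
    Returns Z of shape (n, D), with Z @ Z.T approximating the full kernel matrix."""
    lam, U = np.linalg.eigh(A)     # A = U diag(lam) U.T
    lam = np.clip(lam, 0.0, None)  # guard tiny negative eigenvalues
    sqrt_lam = np.sqrt(lam)
    inv_sqrt = np.divide(1.0, sqrt_lam, out=np.zeros_like(sqrt_lam),
                         where=sqrt_lam > 1e-10)  # Lambda^dagger Lambda^{1/2}
    Z_top = U * sqrt_lam           # U Lambda^{1/2}
    Z_bottom = B.T @ U * inv_sqrt  # B^T U Lambda^dagger Lambda^{1/2}
    return np.vstack([Z_top, Z_bottom])
```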
Compared to Cholesky, this approach is inexact, but it gives a low-dimensional embedding and easily allows extension to new data points. Plus, it's already implemented in `sklearn.kernel_approximation.Nystroem`.
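For instance, `Nystroem` accepts a callable kernel, so something like this should work (the ANOVA parameters and component count are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def anova(x, y, gamma=1.0, p=2):
    """Pairwise ANOVA kernel between two samples."""
    return np.sum(np.exp(-gamma * (x - y) ** 2)) ** p

X, y = load_iris(return_X_y=True)
clf = make_pipeline(
    Nystroem(kernel=anova, n_components=50, random_state=0),
    LogisticRegression(max_iter=1000),
)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))  # approximate kernel logistic regression
```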
Option 3: Random Fourier Features
The other major paradigm of approximate kernel embeddings is random Fourier features (also sometimes called "random kitchen sinks"). These work on shift-invariant kernels $k(x, y) = k(x - y)$ by taking the Fourier transform of $k$, which by Bochner's theorem will be (proportional to) a probability distribution, sampling $\{\omega_i\}_{i=1}^{D/2}$ from it, and then using the feature map $$z(x) = \begin{bmatrix}\sin(\omega_1^T x) & \cos(\omega_1^T x) & \dots & \sin(\omega_{D/2}^T x) & \cos(\omega_{D/2}^T x)\end{bmatrix}.$$ (Sometimes you'll see this as $\cos(\omega_i^T x + b_i)$ instead, but the paired $\sin$/$\cos$ version is better.)
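For the ordinary Gaussian RBF kernel $k(x, y) = \exp(-\gamma \|x - y\|^2)$, for instance, the transform is a Gaussian, and the whole map fits in a few lines (a sketch, not scikit-learn's `RBFSampler`):

```python
import numpy as np

def rff_gaussian(X, gamma=1.0, D=200, seed=0):
    """z(X) such that z(x)^T z(y) approximates exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    # Bochner: the Fourier transform of this kernel is N(0, 2 * gamma * I)
    omega = rng.normal(0.0, np.sqrt(2 * gamma), size=(X.shape[1], D // 2))
    proj = X @ omega  # (n, D/2)
    return np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(D // 2)
```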
scikit-learn has these implemented for the RBF and some variants of the $\chi^2$ kernel, but not for ANOVA. The ANOVA kernel is shift-invariant, though:
$$ k(\Delta) = \left( \sum_{i=1}^d \exp(-\gamma \Delta_i^2) \right)^p $$
The problem is, I don't know how to Fourier transform it in general. :) For $d = 1$, the transform is a Gaussian with mean 0 and variance $2 p \gamma$. For $p = 1$, $d > 1$, the kernel is a sum of one-dimensional Gaussian kernels, and a sum of kernels corresponds to concatenating their feature maps, so you can get an embedding for each dimension and concatenate them, as sketched below.
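Here's what that per-dimension concatenation might look like for $p = 1$ (a sketch; the sampling variance $2\gamma$ comes from the $d = 1$ transform above):

```python
import numpy as np

def rff_anova_p1(X, gamma=1.0, feats_per_dim=50, seed=0):
    """Features for k(x, y) = sum_i exp(-gamma * (x_i - y_i)**2):
    embed each dimension's 1-D Gaussian kernel, then concatenate."""
    rng = np.random.default_rng(seed)
    blocks = []
    for i in range(X.shape[1]):
        omega = rng.normal(0.0, np.sqrt(2 * gamma), size=feats_per_dim)
        proj = X[:, i:i + 1] * omega  # (n, feats_per_dim)
        blocks.append(np.hstack([np.sin(proj), np.cos(proj)])
                      / np.sqrt(feats_per_dim))
    return np.hstack(blocks)  # Z @ Z.T approximates the p = 1 ANOVA Gram matrix
```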
For integer $p$, you might be able to expand the product of sums into a sum of products and then Fourier transform each term, I suppose. I don't think that would generalize to non-integer $p$, though, and I haven't tried it for integer $p > 1$ either.