Solved – Is an SVM’s (maximum) likelihood uniquely defined as a function of hyperparameters

hyperparameter, optimization, svm

I think that I must be reading this paragraph (below) incorrectly.

Note that both types of evidence that we have defined in general depend on the inverse noise level $C$ and the kernel $K(x, x^\prime )$ separately. This is in contrast to the conventional SVM solution: The latter is found by maximizing the log-posterior (13), and the position of this maximum clearly only depends on the product $CK(x, x^\prime )$. This is an important point: It implies that properties of the conventional SVM alone—generalization error bounds, test error or cross-validation error, for example—can never be used to assign an unambiguous value to $C$. Since $C$ determines the class probabilities (10), this also means that well-determined class probabilities for SVM predictions cannot be obtained in this way. The intuition behind this observation is simple: If $C$ is varied while $CK(x, x^\prime)$ is kept fixed (which means changing the amplitude of the kernel in inverse proportion to $C$), then the position of the maximum of the posterior $P(\theta | D)$, i.e., the conventional SVM solution, remains unchanged. The shape of the posterior, on the other hand, does vary in a nontrivial way, being more peaked around the maximum for larger $C$; the evidence is sensitive to these changes in shape and so depends on $C$.

(While Dr. Sollich's discussion is in the context of Bayesian SVM methods, I'd like to set aside the Bayesian perspective for the moment and just focus on what the author is saying about conventional SVM methods.)

The paper is Sollich, Peter. 2002. "Bayesian Methods for Support Vector Machines: Evidence and Predictive Class Probabilities." Machine Learning 46(1-3): 21-52.

My interpretation of the paragraph is that, for a given data set, the conventional SVM performance surface over the hyperparameters is constant along a hyperbola defined by $\text{constant}=CK(x,x^\prime)$. Because the data are fixed, $K(x,x^\prime)$ varies only through $\gamma$, so we have
$$\text{constant}=C\gamma.$$

In the example of a radial basis function kernel with width parameter $\gamma$, this in turn implies that we don't need to search over a grid of $C\times\gamma$, but can instead fix one parameter and search over the remaining parameter. Clearly this would dramatically speed up any grid search over hyperparameters. (But yes, I am aware that grid search is sub-optimal).
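For concreteness, the reduced search that my interpretation would license looks something like this (just a sketch using scikit-learn; the synthetic data, the grid of $\gamma$ values, and the fixed $C = 0.25$ are arbitrary, and whether this shortcut is actually valid is exactly what I'm asking):

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.make_classification(n_samples=1000, n_features=10, random_state=12)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

# Fix C (here 1/C = lambda = 4) and search only over gamma,
# rather than over the full C x gamma grid.
search = GridSearchCV(svm.SVC(C=0.25, kernel='rbf'),
                      param_grid={'gamma': [1e-3, 1e-2, 1e-1, 1, 10]},
                      cv=5)
search.fit(Xtrain, ytrain)
print(search.best_params_, search.score(Xtest, ytest))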

Under my interpretation, each of the hyperbolae in the figure below has a single likelihood value along its entire length (though distinct hyperbolae might share a value, since the hyperparameter surface is nonconvex). Fixing $\frac{1}{C}=\lambda$ at, say, 4 means that we can still find a maximum likelihood on one of the many hyperbolae corresponding to alternative values of $\gamma$, and that this likelihood will be no worse than any of the other likelihoods on the same hyperbola.

[Figure: "My interpretation" — hyperbolae of constant $C\gamma$ in the $(\gamma, C)$ plane]

Is my interpretation correct?

Best Answer

The paragraph in question refers to that document's equation (13): $$ \ln P(\theta \mid D) = - \frac12 \sum_{x, x'} \theta(x) K^{-1}(x, x') \theta(x') - C \sum_i l(y_i \, \theta(x_i)) + \text{const} $$ where $K^{-1}(x, x')$ is the element of the inverse kernel matrix corresponding to points $x$ and $x'$, i.e. it actually depends on the full dataset.

You can divide through by $C$ without changing the location of the maximum, which implies that the solution depends only on $\mathbf K^{-1} / C$ (where $\mathbf K$ denotes the full training set kernel matrix), or equivalently on $C \mathbf K$.
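To spell out the single step involved, divide (13) by the positive constant $C$: $$ \frac{1}{C}\ln P(\theta \mid D) = - \frac12 \sum_{x, x'} \theta(x) \, \frac{K^{-1}(x, x')}{C} \, \theta(x') - \sum_i l(y_i \, \theta(x_i)) + \text{const}, $$ and note that $K^{-1}(x, x') / C$ is exactly the $(x, x')$ element of $(C \mathbf K)^{-1}$. Since dividing by $C$ does not move the maximum, the maximizing $\theta$ depends on $C$ and $\mathbf K$ only through the product $C \mathbf K$.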


You can also see this directly from the dual objective function:

$$ \max_\alpha \sum_i \alpha_i - \frac12 \sum_{i, j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \\\text{s.t.}\; \forall i. 0 \le \alpha_i \le C, \; \sum_i \alpha_i y_i = 0 $$

Consider replacing each $\alpha_i$ by $c \alpha_i$, $k(x_i, x_j)$ by $k(x_i, x_j) / c$, and the first constraint by $0 \le c\alpha_i \le c C$ (equivalent to the original box constraint). The new objective function is $$ \sum_i c \alpha_i - \frac12 \sum_{i, j} (c \alpha_i) (c \alpha_j) y_i y_j \frac{k(x_i, x_j)}{c} = c \left( \sum_i \alpha_i - \frac12 \sum_{i, j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \right) ,$$ the equality constraint is still satisfied, and so the optimum of the rescaled problem is the original optimal $\alpha$ scaled by $c$. Since predictions are based on $\alpha^T \mathbf k$, where $\mathbf k$ is the vector of kernel evaluations at the test point, replacing it with $(c \alpha)^T (\mathbf k / c) = \alpha^T \mathbf k$ doesn't change the predictions.
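As a quick numerical sanity check of this rescaling argument (a sketch only: scikit-learn, the synthetic data, the choice $c = 10$, and the precomputed linear kernel are all just illustrative assumptions), fitting on $\mathbf K / c$ with box parameter $cC$ should scale the dual variables by roughly $c$ while leaving the decision values, and hence the predictions, essentially unchanged:

import numpy as np
from sklearn import svm, datasets

# Small synthetic problem and a precomputed linear kernel matrix on the training set.
X, y = datasets.make_classification(n_samples=200, n_features=5, random_state=0)
K = X @ X.T

C, c = 1.0, 10.0
clf_orig   = svm.SVC(C=C,     kernel='precomputed', tol=1e-8).fit(K,     y)
clf_scaled = svm.SVC(C=c * C, kernel='precomputed', tol=1e-8).fit(K / c, y)

# Ratio of summed |dual coefficients|: should be close to c.
print(np.abs(clf_scaled.dual_coef_).sum() / np.abs(clf_orig.dual_coef_).sum())
# Largest gap between the two decision functions: should be near zero.
print(np.abs(clf_scaled.decision_function(K / c)
             - clf_orig.decision_function(K)).max())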


If we're using a Gaussian RBF kernel $K(x, x') = \exp\left(- \gamma \lVert x - x' \rVert^2 \right)$, however, it is not true that this means we can explore only a single $C \gamma = \text{const}$ curve: $K$'s dependence on $\gamma$ is nonlinear. Doubling $\gamma$ and halving $C$ will result in a different $C \mathbf K$ matrix.
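To see this for a single pair of points at squared distance $r^2 = \lVert x - x' \rVert^2$: doubling $\gamma$ while halving $C$ changes the product $C K(x, x')$ from $$ C e^{-\gamma r^2} \quad\text{to}\quad \tfrac{C}{2} e^{-2\gamma r^2}, $$ and these can only agree if $e^{-\gamma r^2} = 2$, which is impossible for $\gamma > 0$. The two settings therefore produce genuinely different $C \mathbf K$ matrices.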

Here's some empirical evidence (in Python) on some arbitrary synthetic data:

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split

# Arbitrary synthetic binary classification problem.
X, y = datasets.make_classification(
          n_samples=1000, n_features=10, random_state=12)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
Cs = [.01, .1, 1, 10, 100]

# RBF kernel, moving along the curve C * gamma = 1.
print("RBF:")
for C in Cs:
    acc = svm.SVC(C=C, gamma=1/C, kernel='rbf') \
             .fit(Xtrain, ytrain).score(Xtest, ytest)
    print("{:5}: {:.3}".format(C, acc))

# Linear kernel scaled by 1/C, so that C * K is held fixed
# (the 1/C scale is folded into the kernel callable, so gamma is not needed).
print("\nScaled linear:")
for C in Cs:
    acc = svm.SVC(C=C, kernel=lambda u, v: u.dot(v.T) / C) \
             .fit(Xtrain, ytrain).score(Xtest, ytest)
    print("{:5}: {:.3}".format(C, acc))

which prints

RBF:
 0.01: 0.84
  0.1: 0.504
    1: 0.832
   10: 0.86
  100: 0.868

Scaled linear:
 0.01: 0.88
  0.1: 0.88
    1: 0.88
   10: 0.88
  100: 0.88

The first set of numbers is for an RBF kernel with varying $C$ and $\gamma = 1/C$; because we get different classification accuracies, the predictions are clearly different.

The second is for a scaled linear kernel $K(x, x') = \gamma x^T x'$, in which case (because the dependence on $\gamma$ is linear) your reasoning is correct and all the predictions are the same.