Generalization bounds on SVM

machine-learning, svm, vc-dimension

I am interested in theoretical results for the generalization ability of Support Vector Machines, e.g. bounds on the probability of classification error and on the Vapnik-Chervonenkis (VC) dimension of these machines. However, reading through the literature I have gotten the impression that similar recurring results tend to differ slightly from author to author, particularly regarding the technical conditions required for a given bound to hold.

In the following I will recall the structure of the SVM problem and state three of the main generalization results that I have repeatedly found in one form or another; I give three main references throughout the exposition.

Problem setting:

Assume we have a data sample of independent and identically distributed (i.i.d.) pairs $(x_i,y_i)_{1\leq i\leq n}$ where, for all $i$, $x_i \in \mathbb{R}^p$ and $y_i \in \{-1,1\}$. We construct a support vector machine (SVM) that maximizes the minimal margin $m^*$ between the separating hyperplane $\{x : w \cdot x + b = 0\}$, with $w \in \mathbb{R}^p$ and $b \in \mathbb{R}$, and the closest point among $x_1,\cdots,x_n$, so as to separate the two classes defined by $y = -1$ and $y = 1$. We let the SVM admit some errors through a soft margin by introducing slack variables $\xi_1,\cdots,\xi_n$ (for notational simplicity we ignore the possibility of kernels). The solution parameters $w^*$ and $b^*$ are obtained by solving the following convex quadratic optimization program:

$$\begin{align}
\min_{w, \, b, \, \xi_1, \, \cdots, \, \xi_n} \; & \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n\xi_i
\\
\text{s.t.} \; : \; & \; y_i(w\cdot x_i+b) \geq 1 - \xi_i \, & , \, \forall \, i \in \{1,\cdots,n\}
\\
& \; \xi_i \geq 0\, & , \, \forall \, i \in \{1,\cdots,n\}
\end{align}$$
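For concreteness, here is a minimal sketch (my own, not from any of the references) of how one might solve this program in practice and recover the quantities that appear in the bounds below, namely $w^*$, $b^*$, $m^* = 1/\|w^*\|$ and $R$. The use of scikit-learn, the toy Gaussian data and the choice $C = 1$ are all illustrative assumptions.

```python
# Minimal sketch: fit the soft-margin linear SVM above and recover w*, b*,
# the margin m* = 1 / ||w*|| and the data radius R = max_i ||x_i||.
# (Toy data and C = 1 are illustrative choices, not part of the question.)
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p = 200, 2
X = np.vstack([rng.normal(-1.5, 1.0, (n // 2, p)),   # class y = -1
               rng.normal(+1.5, 1.0, (n // 2, p))])  # class y = +1
y = np.array([-1] * (n // 2) + [+1] * (n // 2))

clf = SVC(kernel="linear", C=1.0).fit(X, y)  # solves the QP stated above (via its dual)

w_star = clf.coef_.ravel()                   # w*
b_star = clf.intercept_[0]                   # b*
m_star = 1.0 / np.linalg.norm(w_star)        # geometric margin m*
R = np.max(np.linalg.norm(X, axis=1))        # R used in the bounds below
print(w_star, b_star, m_star, R)
```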

We are interested in the generalization ability of this machine.

Vapnik-Chervonenkis dimension $VC$:

A first result is due to (Vapnik, 2000), theorem 5.1, in which he bounds the VC dimension of a separating hyperplane. Letting $R = \max_{i} \|x_i\|$, we have:

$$VC \leq \min \left( \left( \frac{R}{m^*}\right)^2, \, p\right) + 1$$

This result can also be found in (Burges, 1998), theorem 6. However, Burges' theorem seems more restrictive than Vapnik's, as he needs to define a special category of classifiers, known as gap-tolerant classifiers (to which the SVM belongs), in order to state it.
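As a small illustration (my own sketch, not from either reference), the bound is straightforward to evaluate once $R$, $m^*$ and $p$ are known, e.g. from the fit sketched above; the numbers used in the example call below are arbitrary.

```python
# Sketch: evaluate the bound VC <= min((R / m*)^2, p) + 1.
def vc_dimension_bound(R, m_star, p):
    """Upper bound on the VC dimension of an m*-margin separating hyperplane in R^p."""
    return min((R / m_star) ** 2, p) + 1

# Illustrative values; in practice R, m_star and p could come from the fitted SVM above.
print(vc_dimension_bound(R=3.0, m_star=0.5, p=2))
```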

Bounds on probability of errors:

In (Vapnik, 2000), theorem 5.2 on page 139 gives the following bound on the generalization ability of the SVM:

$$\mathbb{E}[P_{\text{error}}] \leq \frac{1}{n}\mathbb{E} \left[ \min\left(p,n_{SV},(R \, \|w\|)^2 \right) \right]$$

where $n_{SV}$ is the number of support vectors of the SVM. This result seems to appear again in (Burges, 1998), equations (86) and (93) respectively. But again, Burges seems to differ from Vapnik, as he splits the components inside the minimum above across different theorems, each with different conditions.
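To make the quantity concrete, here is a rough sketch (my own) of how the term inside the expectation could be computed; it assumes a scikit-learn `SVC` fitted with a linear kernel, as in the first sketch, and evaluates only one realisation of what the theorem treats as an expectation over training samples.

```python
# Sketch: the quantity inside the expectation of the bound above,
# min(p, n_SV, (R * ||w||)^2) / n, evaluated for one fitted machine.
import numpy as np

def vapnik_expectation_term(clf, X):
    """clf: fitted sklearn.svm.SVC with a linear kernel; X: the training inputs."""
    n, p = X.shape
    R = np.max(np.linalg.norm(X, axis=1))        # R = max_i ||x_i||
    n_sv = len(clf.support_)                     # number of support vectors n_SV
    w_norm = np.linalg.norm(clf.coef_.ravel())   # ||w*||
    return min(p, n_sv, (R * w_norm) ** 2) / n

# e.g. print(vapnik_expectation_term(clf, X)) with clf and X from the first sketch.
```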

Another result appearing in (Vapnik, 2000), p. 133, is the following. Assuming again that $\|x_i\|^2 \leq R^2$ for all $i$, and letting $h \equiv VC$ and $\epsilon \in [0,1]$, we define $\zeta$ as:

$$ \zeta = 4 \, \frac{h\left( \ln\frac{2n}{h} + 1\right) - \ln\frac{\epsilon}{4}}{n}$$

We also define $n_{\text{error}}$ to be the number of training examples misclassified by the SVM. Then, with probability $1-\epsilon$, we can assert that the probability that a test example will not be separated correctly by the $m^*$-margin hyperplane (i.e. the SVM with margin $m^*$) has the bound:

$$ P_{\text{error}} \leq \frac{n_{\text{error}}}{n} + \frac{\zeta}{2} \left( 1 + \sqrt{1+ \frac{4 \, n_{\text{error}}}{n \, \zeta}} \right)$$
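Since the bound is fully explicit, here is a small sketch (my own) that evaluates $\zeta$ and the resulting bound; $n$, $h$, $\epsilon$ and $n_{\text{error}}$ are inputs, and the values used in the example call are purely illustrative.

```python
# Sketch: compute zeta and the high-probability bound on P_error given above.
import numpy as np

def vapnik_error_bound(n, h, epsilon, n_error):
    """n: sample size, h: (a bound on) the VC dimension,
    epsilon: confidence parameter, n_error: number of training errors."""
    zeta = 4.0 * (h * (np.log(2.0 * n / h) + 1.0) - np.log(epsilon / 4.0)) / n
    return n_error / n + (zeta / 2.0) * (1.0 + np.sqrt(1.0 + 4.0 * n_error / (n * zeta)))

print(vapnik_error_bound(n=200, h=3, epsilon=0.05, n_error=5))
```

Note that with $n_{\text{error}} = 0$ the right-hand side reduces to exactly $\zeta$.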

However, in (Hastie, Tibshirani and Friedman, 2009), p.438, a very similar result is found:

$$ \text{Error}_{\text{Test}} \leq \zeta $$

Conclusion:

It seems to me that there is a certain degree of conflict between these results. Moreover, two of these references, although canonical in the SVM literature, are starting to show their age (1998 and 2000), especially considering that research into the SVM algorithm started in the mid-nineties.

My questions are:

  • Are these results still valid today, or have they been proved wrong?
  • Have tighter bounds with relatively loose conditions been derived since then? If so, by whom and where can I find them?
  • Finally, is there any reference material that synthesizes the main generalization results about the SVM?

References:

Burges, C. J. C. (1998). "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, 2: 121-167

Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, 2nd edition, Springer

Vapnik, V. N. (1998). Statistical Learning Theory, 1st edition, John Wiley & Sons

Vapnik, V. N. (1999). "An Overview of Statistical Learning Theory", IEEE Transactions on Neural Networks, 10(5): 988-999

Vapnik, V. N. (2000). The Nature of Statistical Learning Theory, 2nd edition, Springer

Best Answer

I do not know the literature you are referring to in detail, but I think a comprehensive and reasonably up-to-date summary of generalization bounds can be found in Boucheron et al. (2004) (Link: https://www.researchgate.net/profile/Olivier_Bousquet/publication/238718428_Advanced_Lectures_on_Machine_Learning_ML_Summer_Schools_2003_Canberra_Australia_February_2-14_2003_Tubingen_Germany_August_4-16_2003_Revised_Lectures/links/02e7e52c5870850311000000/Advanced-Lectures-on-Machine-Learning-ML-Summer-Schools-2003-Canberra-Australia-February-2-14-2003-Tuebingen-Germany-August-4-16-2003-Revised-Lectures.pdf#page=176)

In the following I will sketch part of the SVM bound, leaving out details and proofs.

Before elaborating on the SVM bound specifically, we need to understand what generalization bounds are trying to achieve.

First, let us assume that the true conditional probability $P(Y = 1 \mid X = x)$ is known; then the best possible classifier is the Bayes classifier, i.e. \begin{align} g^* = \begin{cases} +1 & \text{if } P(Y = 1 \mid X = x) > 0.5 \\ -1 & \text{otherwise.} \end{cases} \end{align}

The goal of statistical learning theory is then to control the difference between a classifier chosen from a class $C$ (e.g. the SVM), \begin{align} \hat{g}_n = \arg\min_{g \in C} L_n(g), \end{align} and the Bayes classifier, i.e. \begin{align} L(\hat{g}_n) - L(g^*) = \big(L(\hat{g}_n) - L(g^{*}_c)\big) + \big(L(g^{*}_c) - L(g^*)\big). \end{align} Here $L(g) = \mathbb{E} \, l(g(X),Y)$ is the expected loss, $L_n$ its empirical counterpart, and $g^{*}_c$ is the best possible classifier within the model class $C$. The first term, $Z := L(\hat{g}_n) - L(g^{*}_c)$, is called the estimation error and is often the focus, because it can be bounded much more easily than the approximation error (the second term). I will also omit the approximation error here.

The estimation error $Z$ can be further decomposed as \begin{align} Z = Z - \mathbb{E}Z + \mathbb{E}Z. \end{align} This can now be bounded in two steps:

  1. Bound $Z - \mathbb{E}Z$ using McDiarmid's inequality

  2. Bound $\mathbb{E}Z$ with the Rademacher complexity $R_n(C) = \mathbb{E}\sup_{g \in C}\left|\frac{1}{n} \sum_{i=1}^{n} \sigma_i \, l(g(X_i),Y_i)\right|$, where the $\sigma_i$ are i.i.d. Rademacher signs taking values $\pm 1$ with probability $1/2$

Using McDiarmid's inequality one can show that, if the loss function takes values in an interval of length at most $B$, step one yields the bound \begin{align} Z - \mathbb{E}Z \leq 2 B \sqrt{\dfrac{\ln(1/\delta)}{2n}}, \end{align} which holds with probability at least $1-\delta$. For the second step we can show that \begin{align} \mathbb{E}Z \leq 2R_n(C). \end{align} If the loss function is discrete, i.e. non-Lipschitz such as the 0-1 loss, you need the VC dimension to bound the Rademacher complexity further. However, for $L$-Lipschitz losses such as the hinge loss it can be bounded directly by \begin{align} R_n(C) \leq \lambda L R/\sqrt{n}, \end{align} where $\lambda$ denotes the regularization constant (the bound on $\|w\|$ enforced by the regularizer).
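To make the Rademacher step concrete, here is a small sketch (my own, for the linear class $\{x \mapsto \langle w, x\rangle : \|w\| \leq \lambda\}$ rather than the loss class itself) that Monte-Carlo estimates the empirical Rademacher complexity via the Cauchy-Schwarz closed form $\sup_{\|w\|\leq\lambda} \frac{1}{n}\sum_i \sigma_i \langle w, x_i\rangle = \frac{\lambda}{n}\big\|\sum_i \sigma_i x_i\big\|$ and compares it with $\lambda R/\sqrt{n}$; the hinge-loss class then inherits this up to the Lipschitz constant $L$ by contraction. The data below are illustrative.

```python
# Sketch: Monte-Carlo estimate of the empirical Rademacher complexity of the
# linear class {x -> <w, x> : ||w|| <= lam}, using the Cauchy-Schwarz closed form
# sup_{||w|| <= lam} (1/n) sum_i sigma_i <w, x_i> = (lam / n) * ||sum_i sigma_i x_i||.
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 200, 2, 1.0
X = rng.normal(0.0, 1.0, (n, p))                  # illustrative data
R = np.max(np.linalg.norm(X, axis=1))             # data radius

def empirical_rademacher_linear(X, lam, n_draws=2000):
    n = X.shape[0]
    vals = np.empty(n_draws)
    for k in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher signs
        vals[k] = lam * np.linalg.norm(sigma @ X) / n
    return vals.mean()

print(empirical_rademacher_linear(X, lam), lam * R / np.sqrt(n))  # estimate vs. bound
```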
Since for the hinge loss $L = 1$ and $B = 1 + \lambda R$ (shown using the Cauchy-Schwarz inequality), this simplifies further. Finally, putting all the results together, we obtain a bound of \begin{align} L(\hat{g}_n) - L(g^{*}_c) \leq 2(1 + \lambda R) \sqrt{\dfrac{\ln(1/\delta)}{2n}} + 4 \lambda L R/\sqrt{n}. \end{align}
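Finally, as a numerical illustration (my own sketch, with arbitrary values for $n$, $R$, $\lambda$ and $\delta$), the resulting estimation-error bound can be evaluated directly:

```python
# Sketch: evaluate 2 * (1 + lam * R) * sqrt(ln(1/delta) / (2n)) + 4 * lam * L * R / sqrt(n),
# the bound on L(g_hat_n) - L(g*_c) above (L = 1 for the hinge loss).
import numpy as np

def estimation_error_bound(n, R, lam, delta, L=1.0):
    return (2.0 * (1.0 + lam * R) * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
            + 4.0 * lam * L * R / np.sqrt(n))

print(estimation_error_bound(n=1000, R=1.0, lam=1.0, delta=0.05))
```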