Machine-Learning – Proof of Upper Bound on Leave-One-Out Cross-Validation Error of Linear SVM

Tags: cross-validation, machine-learning, svm

I have to show that for a two-class SVM classifier trained on a linearly separable dataset $(x_i, y_i)_{i=1}^n$, the following upper bound on the leave-one-out cross-validation error holds:

$$
L_{\mathrm{LOOCV}} = \frac{1}{n} \sum_{i = 1}^n [y_i \ne f_i(x_i)] \le \frac{|SV|}{n},
$$

where, for $i = 1, \dots, n$, $f_i$ is the SVM classifier fitted on the entire dataset except the observation $(x_i, y_i)$, and $|SV|$ is the number of support vectors of the SVM classifier fitted on the entire dataset.

I have no idea how to start. Does anyone have an idea?

Best Answer

Hint: Start with the dual SVM objective with $\ell_2$ regularization

$$\text{maximize}\quad J(\mathbf{\Lambda}) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j \mathbf{K}(\mathbf{x}_i, \mathbf{x}_j)$$ $$\text{subject to}\quad \lambda_i \geq 0; \; \forall \; i.$$

Note that we use the hard-margin SVM because of the linear-separability assumption.
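One convention to fix (an assumption on my part, consistent with the dual above, which has no $\sum_i \lambda_i y_i = 0$ constraint and hence no bias term): the decision function in dual form is

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^n \lambda_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x})\right),$$

and at the optimum the hard-margin constraints read $y_i \sum_{j=1}^n \lambda^*_j y_j \mathbf{K}(\mathbf{x}_j, \mathbf{x}_i) \geq 1$ for all $i$, with equality whenever $\lambda^*_i > 0$ (complementary slackness). This is the form that appears inside the indicator below.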

Denote the objective for the SVM trained without point $\mathbf{x}_k$ as $J_{(-k)}(\mathbf{\Lambda}_{(-k)})$, or

$$J_{(-k)}(\mathbf{\Lambda}_{(-k)}) = \sum_{i \neq k} \lambda_i - \frac{1}{2} \sum_{i \neq k} \sum_{j \neq k} \lambda_i \lambda_j y_i y_j \mathbf{K}(\mathbf{x}_i, \mathbf{x}_j).$$

In terms of these variables, an equivalent statement of your theorem is that

$$\frac{1}{n} \sum_{k=1}^n \mathbf{1}\left[y_k \sum_{i\neq k}\lambda^k_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k) < 0\right] \;\le\; \frac{|SV|}{n},$$

where $\lambda^k_i$ is the $i$th element of $\mathbf{\Lambda}^k$, the dual solution obtained by training without point $\mathbf{x}_k$. The left-hand side is the leave-one-out cross-validation error written with the dual form of the decision function, and $|SV|$ is the number of nonzero $\lambda^*_i$, via complementary slackness.

We can prove this statement by analyzing the effect of "adding back" an arbitrary missing point in leave-one-out CV. To do this, show the following.

a) Let $\lambda_k^*$ be the optimal dual variable corresponding to the left-out $\mathbf{x}_k$ in the full problem. Show that when $\mathbf{x}_k$ is "added back" into the dataset with $\lambda_k$ held fixed at $\lambda^*_k$, the resulting objective to optimize over $\mathbf{\Lambda}_{(-k)}$, i.e. $J(\mathbf{\Lambda}_{(-k)})$, is (up to additive terms that do not depend on $\mathbf{\Lambda}_{(-k)}$)

$$J_{(-k)}(\mathbf{\Lambda}_{(-k)}) - \lambda^*_k y_k \sum_{i \neq k} \lambda_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k).$$
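One way to see this (a sketch): substitute $\lambda_k = \lambda^*_k$ into the full objective $J(\mathbf{\Lambda})$ and separate the terms involving index $k$, using the symmetry of $\mathbf{K}$:

$$J(\mathbf{\Lambda})\big|_{\lambda_k = \lambda^*_k} = J_{(-k)}(\mathbf{\Lambda}_{(-k)}) - \lambda^*_k y_k \sum_{i \neq k} \lambda_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k) + \underbrace{\lambda^*_k - \tfrac{1}{2} (\lambda^*_k)^2 \mathbf{K}(\mathbf{x}_k, \mathbf{x}_k)}_{\text{constant in } \mathbf{\Lambda}_{(-k)}}.$$

The underbraced terms do not depend on $\mathbf{\Lambda}_{(-k)}$, so they can be dropped when optimizing, which gives the expression above.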

b) Let $\mathbf{\Lambda}^k_{(-k)}$ be the unique maximizer of $J_{(-k)}(\mathbf{\Lambda}_{(-k)})$, and let $\mathbf{\Lambda}^*_{(-k)}$ be the unique maximizer of $J(\mathbf{\Lambda}_{(-k)})$. Use the optimality statements $J(\mathbf{\Lambda}^*_{(-k)}) \geq J(\mathbf{\Lambda}^k_{(-k)})$ and $J_{(-k)}(\mathbf{\Lambda}^k_{(-k)}) \geq J_{(-k)}(\mathbf{\Lambda}^*_{(-k)})$ to conclude that

$$y_k \sum_{i\neq k} \lambda_i^k y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k) \geq y_k \sum_{i\neq k} \lambda_i^* y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k)$$

where $\lambda_i^*$ and $\lambda_i^k$ are simply the $i$th elements of $\mathbf{\Lambda}^*_{(-k)}$ and $\mathbf{\Lambda}^k_{(-k)}$, respectively.
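Spelled out (with $g(\mathbf{\Lambda}_{(-k)}) := \lambda^*_k\, y_k \sum_{i\neq k}\lambda_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k)$, a shorthand introduced only for this display), the two optimality statements combine as

$$g(\mathbf{\Lambda}^k_{(-k)}) - g(\mathbf{\Lambda}^*_{(-k)}) \;\geq\; J_{(-k)}(\mathbf{\Lambda}^k_{(-k)}) - J_{(-k)}(\mathbf{\Lambda}^*_{(-k)}) \;\geq\; 0,$$

and dividing through by $\lambda^*_k$ (when $\lambda^*_k > 0$) gives the displayed inequality.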

c) Does your reasoning in part b) still make sense when $\lambda^*_k = 0$? Reason about why the bound still holds in that case (or simply justify why $\lambda^*_k = 0$ doesn't affect part b, if that's what you find).

Detail to consider: what assumptions do you need to make to ensure these maximizers are unique?

d) [Edit: reworded this part] Use the bound in part b), together with your reasoning in part c), to show that

$$\frac{1}{n} \sum_{k=1}^n \mathbf{1}\left[y_k \sum_{i\neq k}\lambda^k_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k) < 0\right] \le \frac{1}{n} \sum_{k=1}^n \mathbf{1}\left[\lambda^*_k > 0\right].$$

Use the fact that support vectors correspond to nonzero $\lambda_k^*$ to conclude.
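Not part of the proof, but a quick numerical sanity check of the bound can be reassuring. The sketch below assumes scikit-learn's `SVC`, with a very large `C` to approximate the hard-margin SVM (any dual QP solver would do just as well):

```python
# Sanity check of the LOOCV bound: LOOCV error <= |SV| / n.
# Assumes scikit-learn; C is set very large to approximate the hard-margin SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 40

# Two well-separated Gaussian blobs -> linearly separable with high probability.
X = np.vstack([rng.normal(loc=-3.0, size=(n // 2, 2)),
               rng.normal(loc=+3.0, size=(n // 2, 2))])
y = np.array([-1] * (n // 2) + [+1] * (n // 2))

# Fit on the full dataset and count support vectors.
full_fit = SVC(kernel="linear", C=1e6).fit(X, y)
n_sv = len(full_fit.support_)

# Leave-one-out cross-validation error.
loo_errors = 0
for k in range(n):
    mask = np.arange(n) != k
    clf = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])
    loo_errors += int(clf.predict(X[k:k + 1])[0] != y[k])

print(f"LOOCV error = {loo_errors / n:.3f}  <=  |SV|/n = {n_sv / n:.3f}")
```

On a toy dataset like this the bound typically holds with plenty of slack, which is consistent with it being only an upper bound.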
