Solved – SVM: Does C increase variance or stability (bias)

Tags: regularization, svm

I was learning about SVMs from two sources: Andrew Ng's Machine Learning course on Coursera and Stanford's 'Statistical Learning' course (by Trevor Hastie and Robert Tibshirani), and I ran into the following contradiction:

Andrew Ng gives this formula for the SVM cost function:
$$C\sum_{i=1}^m\left[y^{(i)}\,\mathrm{cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\,\mathrm{cost}_0(\theta^Tx^{(i)})\right] + \frac{1}{2}\sum_{j=1}^n\theta_j^2$$
(He doesn't go into detail about $\mathrm{cost}_1$ and $\mathrm{cost}_0$, but says they behave roughly like the log loss in logistic regression, only piecewise linear.)
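For concreteness, the usual hinge-style choices that fit his description (my guess at the exact form, since the course doesn't spell them out) are
$$\mathrm{cost}_1(z)=\max(0,\,1-z),\qquad \mathrm{cost}_0(z)=\max(0,\,1+z),$$
so a class-1 example is penalized unless $\theta^Tx^{(i)}\ge 1$ and a class-0 example unless $\theta^Tx^{(i)}\le -1$.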

Then he says that increasing $C$ leads to increased variance, and that fits my intuition from the formula above: for higher $C$ the algorithm cares less about regularization, so it fits the training data more closely. That implies lower bias, higher variance, and worse stability.
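One way to make that explicit (my own rewriting, not from either course): dividing the whole objective by $C$ doesn't change the minimizer, so it is equivalent to
$$\sum_{i=1}^m\left[y^{(i)}\,\mathrm{cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\,\mathrm{cost}_0(\theta^Tx^{(i)})\right] + \frac{\lambda}{2}\sum_{j=1}^n\theta_j^2 \quad\text{with}\quad \lambda=\frac{1}{C},$$
i.e. a large $C$ is just a small effective regularization strength $\lambda$.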

But then Trevor Hastie and Robert Tibshirani say, quote:

And that means that the bigger $C$ gets, the more stable the margin becomes.

They also show the following picture, saying that the larger margins correspond to larger $C$:
[Figure: fitted decision boundaries and margins for several values of C]

Stability implies higher bias, right? So they are saying the opposite of what Andrew Ng says.

So the question is: who is right?

Best Answer

The effect of the SVM C parameter

The first textbook description of an SVM always speaks of "maximizing the margin", but this is only the first step. If your data are not perfectly separable, there will be points on the wrong side of the separating hyperplane. To allow for such points, slack variables were introduced (the soft-margin SVM). They bring the problematic points into the objective and weight them using the C parameter. This parameter controls the tradeoff between maximizing the margin and minimizing the error.
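You can see this tradeoff numerically. Below is a minimal sketch (assuming scikit-learn and a made-up, slightly overlapping 2D dataset; all names are mine): for a linear SVM the geometric margin width is $2/\lVert w\rVert$, and it typically shrinks as C grows while the number of margin violations drops.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, slightly overlapping 2D data (for illustration only)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - [1.5, 1.5],
               rng.randn(50, 2) + [1.5, 1.5]])
y = np.array([0] * 50 + [1] * 50)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin_width = 2 / np.linalg.norm(w)  # geometric margin width = 2 / ||w||
    # decision_function(x) = w.x + b; a point violates the margin
    # when its signed label times that value is below 1
    y_signed = 2 * y - 1
    violations = int(np.sum(y_signed * clf.decision_function(X) < 1))
    print(f"C={C:>6}: margin width={margin_width:.2f}, "
          f"margin violations={violations}, support vectors={len(clf.support_)}")
```

With a small C the optimizer accepts many violations in exchange for a wide margin; with a large C it does the opposite.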

Why is this?

Imagine (or draw on paper) a perfectly separable 2D dataset, plotted like the one above. Imagine a suitable hyperplane. Imagine you have a hard-margin SVM, which does not allow for any misclassified points. Now imagine breaking the rules and intentionally placing a point on the other side of the hyperplane. The hyperplane will probably change a lot and be worse than before. If you had used a soft-margin SVM instead, the old solution would still be the better one.
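The same thought experiment can be run in code. This is a sketch with scikit-learn on invented, perfectly separable data (the names and numbers are mine); a very large C stands in for the hard-margin SVM and a small C for the soft-margin one, and we measure how far the hyperplane's normal vector rotates once the misplaced point is added.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
# Perfectly separable 2D data
X = np.vstack([rng.randn(30, 2) - [2, 2],
               rng.randn(30, 2) + [2, 2]])
y = np.array([0] * 30 + [1] * 30)

# One intentionally misplaced point: labeled 0 but deep in class 1's region
X_out = np.vstack([X, [[2.5, 2.5]]])
y_out = np.append(y, 0)

def rotation_deg(C):
    """Angle between the hyperplane normals fitted without and with the outlier."""
    w1 = SVC(kernel="linear", C=C).fit(X, y).coef_[0]
    w2 = SVC(kernel="linear", C=C).fit(X_out, y_out).coef_[0]
    cos = np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print(f"near-hard margin (C=1e6): hyperplane rotates {rotation_deg(1e6):.1f} deg")
print(f"soft margin      (C=0.1): hyperplane rotates {rotation_deg(0.1):.1f} deg")
```

In runs like this I would expect the near-hard-margin fit to move noticeably more than the soft-margin one, which is exactly the stability argument above.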

Your example

Increasing the value of the C parameter
$\iff$ the weight of misclassified points is increased
$\iff$ the margin gets smaller

And I think that is what Hastie and Tibshirani meant by "stable": in other words, closer to the hard-margin SVM.