Given the data points $x_1, \ldots, x_n \in \mathbb{R}^d$ and labels $y_1, \ldots, y_n \in \left \{-1, 1 \right\}$, the hard margin SVM primal problem is

$$ \text{minimize}_{w, w_0} \quad \frac{1}{2} w^T w $$

$$ \text{s.t.} \quad \forall i: y_i (w^T x_i + w_0) \ge 1$$

which is a quadratic program with $d + 1$ variables and $n$ inequality constraints. The dual

$$ \text{maximize}_{\alpha} \quad \sum_{i=1}^{n}{\alpha_i} - \frac{1}{2}\sum_{i=1}^{n}{\sum_{j=1}^{n}{y_i y_j \alpha_i \alpha_j x_i^T x_j}}$$

$$ \text{s.t.} \quad \forall i: \alpha_i \ge 0 \land \sum_{i=1}^{n}{y_i \alpha_i} = 0$$

is a quadratic program with $n$ variables, $n$ inequality constraints, and one equality constraint.
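Both programs can be handed to a generic QP/NLP solver. As a minimal sketch (the toy dataset, solver choice `scipy.optimize.minimize` with SLSQP, and tolerances are my own, not from the question), solving both forms and checking that they recover the same hyperplane:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical, for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# --- Primal: d + 1 variables theta = (w, w0), n inequality constraints. ---
def primal_obj(theta):
    return 0.5 * theta[:d] @ theta[:d]          # (1/2) w^T w

primal_cons = [{"type": "ineq",                  # y_i (w^T x_i + w0) - 1 >= 0
                "fun": lambda t, i=i: y[i] * (X[i] @ t[:d] + t[d]) - 1.0}
               for i in range(n)]
res_p = minimize(primal_obj, np.zeros(d + 1), method="SLSQP",
                 constraints=primal_cons)
w_p, w0_p = res_p.x[:d], res_p.x[d]

# --- Dual: n variables alpha, bounds alpha_i >= 0, one equality constraint. ---
Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):                                 # minimize the negated dual
    return 0.5 * a @ Q @ a - a.sum()

res_d = minimize(neg_dual, np.zeros(n), method="SLSQP",
                 bounds=[(0.0, None)] * n,
                 constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res_d.x
w_d = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                            # some support vector (alpha_i > 0)
w0_d = y[sv] - X[sv] @ w_d                       # from y_i (w^T x_i + w0) = 1

print(w_p, w0_p)  # both routes should give (approximately) the same hyperplane
print(w_d, w0_d)
```

For a problem this small either route is instant; the asymmetry the question asks about only shows up through the problem sizes ($d + 1$ vs $n$ variables) and through what the solution hands you ($w$ vs the $\alpha_i$).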

When implementing a hard margin SVM, **why would I solve the dual problem instead of the primal problem?** The primal problem looks more 'intuitive' to me, and I don't need to concern myself with the duality gap, the Karush-Kuhn-Tucker conditions, etc.

It would make sense to me to solve the dual problem if $d \gg n$, but I suspect there are better reasons. Is this the case?

## Best Answer

Based on the lecture notes referenced in @user765195's answer (thanks!), the most apparent reasons seem to be:

Solving the primal problem, we obtain the optimal $w$, but know nothing about the $\alpha_i$. In order to classify a query point $x$, we need to explicitly compute the scalar product $w^T x$, which may be expensive if $d$ is large.

Solving the dual problem, we obtain the $\alpha_i$ (where $\alpha_i = 0$ for all but a few points, the support vectors). In order to classify a query point $x$, we calculate

$$ w^Tx + w_0 = \left(\sum_{i=1}^{n}{\alpha_i y_i x_i} \right)^T x + w_0 = \sum_{i=1}^{n}{\alpha_i y_i \langle x_i, x \rangle} + w_0 $$

This term is calculated very efficiently if there are only few support vectors. Further, since we now have a scalar product involving only data vectors, we may apply the kernel trick.
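To make both points concrete, here is a sketch of that decision function (the toy data, the hand-computed multipliers $\alpha = (1/16, 0, 1/16, 0)$ with $w_0 = 0$, and the kernel functions are my own illustration, not from the lecture notes). The query point $x$ enters only through kernel evaluations against the support vectors, so swapping the linear kernel for, say, an RBF kernel is a one-line change:

```python
import numpy as np

# Toy data with a hand-computed dual solution: support vectors at
# indices 0 and 2, alpha = (1/16, 0, 1/16, 0), w0 = 0 (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([1 / 16, 0.0, 1 / 16, 0.0])
w0 = 0.0

def linear_kernel(u, v):
    return u @ v

def rbf_kernel(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def decision(x, kernel):
    # f(x) = sum_i alpha_i y_i k(x_i, x) + w0: only the support vectors
    # (alpha_i > 0) contribute, and x appears only inside kernel evaluations.
    sv = alpha > 0
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alpha[sv], y[sv], X[sv])) + w0

print(decision(np.array([1.0, 1.0]), linear_kernel))    # 0.5, i.e. class +1
print(decision(np.array([-1.0, -1.0]), linear_kernel))  # -0.5, i.e. class -1
```

With `rbf_kernel` in place of `linear_kernel`, the classifier works in the RBF feature space without ever forming $w$ there, which is exactly what the dual formulation buys you.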