Given the data points $x_1, \ldots, x_n \in \mathbb{R}^d$ and labels $y_1, \ldots, y_n \in \left \{-1, 1 \right\}$, the hard margin SVM primal problem is
$$ \text{minimize}_{w, w_0} \quad \frac{1}{2} w^T w $$
$$ \text{s.t.} \quad \forall i: y_i (w^T x_i + w_0) \ge 1$$
which is a quadratic program with $d+1$ variables to be optimized for and $i$ constraints. The dual
$$ \text{maximize}_{\alpha} \quad \sum_{i=1}^{n}{\alpha_i} – \frac{1}{2}\sum_{i=1}^{n}{\sum_{j=1}^{n}{y_i y_j \alpha_i \alpha_j x_i^T x_j}}$$
$$ \text{s.t.} \quad \forall i: \alpha_i \ge 0 \land \sum_{i=1}^{n}{y_i \alpha_i} = 0$$
is a quadratic program with $n + 1$ variables to be optimized for and $n$ inequality and $n$ equality constraints.
When implementing a hard margin SVM, why would I solve the dual problem instead of the primal problem? The primal problem looks more 'intuitive' to me, and I don't need to concern myself with the duality gap, the Kuhn-Tucker condition etc.
It would make sense to me to solve the dual problem if $d \gg n$, but I suspect there are better reasons. Is this the case?
Best Answer
Based on the lecture notes referenced in @user765195's answer (thanks!), the most apparent reasons seem to be:
Solving the primal problem, we obtain the optimal $w$, but know nothing about the $\alpha_i$. In order to classify a query point $x$ we need to explicitly compute the scalar product $w^Tx$, which may be expensive if $d$ is large.
Solving the dual problem, we obtain the $\alpha_i$ (where $\alpha_i = 0$ for all but a few points - the support vectors). In order to classify a query point $x$, we calculate
$$ w^Tx + w_0 = \left(\sum_{i=1}^{n}{\alpha_i y_i x_i} \right)^T x + w_0 = \sum_{i=1}^{n}{\alpha_i y_i \langle x_i, x \rangle} + w_0 $$
This term is very efficiently calculated if there are only few support vectors. Further, since we now have a scalar product only involving data vectors, we may apply the kernel trick.