What is an SVM, anyway?
I think the answer for most purposes is “the solution to the following optimization problem”:
$$
\begin{split}
\operatorname*{arg\,min}_{f \in \mathcal H} \frac{1}{n} \sum_{i=1}^n \ell_\mathit{hinge}(f(x_i), y_i) \, + \lambda \lVert f \rVert_{\mathcal H}^2
\\ \ell_\mathit{hinge}(t, y) = \max(0, 1 - t y)
,\end{split}
\tag{SVM}
$$
where $\mathcal H$ is a reproducing kernel Hilbert space, $y$ is a label in $\{-1, 1\}$, and $t = f(x) \in \mathbb R$ is a “decision value”; our final prediction will be $\operatorname{sign}(t)$. In the simplest case, $\mathcal H$ could be the space of affine functions $f(x) = w \cdot x + b$, and $\lVert f \rVert_{\mathcal H}^2 = \lVert w \rVert^2 + b^2$. (Handling of the offset $b$ varies depending on exactly what you’re doing, but that’s not important for our purposes.)
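(To make the affine case concrete, here's a tiny NumPy sketch of this objective; the function and variable names are mine.)

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """(SVM) objective for the affine case f(x) = w.x + b.

    X: (n, d) inputs; y: (n,) labels in {-1, +1}; lam: the weight lambda.
    """
    t = X @ w + b                          # decision values t_i = f(x_i)
    hinge = np.maximum(0.0, 1.0 - t * y)   # l_hinge(t_i, y_i)
    return hinge.mean() + lam * (w @ w + b * b)  # ||f||_H^2 = ||w||^2 + b^2
```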
In the ’90s through the early ’10s, there was a lot of work on solving this particular optimization problem in various smart ways, and indeed that’s what LIBSVM / LIBLINEAR / SVMlight / ThunderSVM / ... do. But I don’t think that any of these particular algorithms are fundamental to “being an SVM,” really.
Now, how do we train a deep network? Well, we try to solve something like, say,
$$
\begin{split}
\operatorname*{arg\,min}_{f \in \mathcal F} \frac{1}{n} \sum_{i=1}^n \ell_\mathit{CE}(f(x_i), y_i) + R(f)
\\
\ell_\mathit{CE}(p, y) = - y \log(p) - (1-y) \log(1 - p)
,\end{split}
\tag{$\star$}
$$
where now $\mathcal F$ is the set of deep nets we consider, which output probabilities $p = f(x) \in [0, 1]$, and the labels are recoded as $y \in \{0, 1\}$. The explicit regularizer $R(f)$ might be an L2 penalty on the weights in the network, or we might just use $R(f) = 0$. Although we could solve (SVM) up to machine precision if we really wanted, we usually can’t do that for $(\star)$ when $\mathcal F$ is more than one layer; instead we use stochastic gradient descent to find an approximate solution.
If we take $\mathcal F$ as a reproducing kernel Hilbert space and $R(f) = \lambda \lVert f \rVert_{\mathcal F}^2$, then $(\star)$ becomes very similar to (SVM), just with cross-entropy loss instead of hinge loss: this is also called kernel logistic regression. My understanding is that the reason SVMs took off in a way kernel logistic regression didn’t is largely due to a slight computational advantage of the former (more amenable to these fancy algorithms), and/or historical accident; there isn’t really a huge difference between the two as a whole, as far as I know. (There is sometimes a big difference between an SVM with a fancy kernel and a plain linear logistic regression, but that’s comparing apples to oranges.)
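To make “same problem, different loss” concrete, here’s a minimal scikit-learn sketch where the two models share the same (approximate) kernel features and differ only in the loss being minimized; the Nystroem approximation and all parameter values are just my choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=0)

def kernel_feats():
    # Finite-dimensional approximation to an RBF RKHS
    return Nystroem(kernel="rbf", n_components=100, random_state=0)

svm = make_pipeline(kernel_feats(), LinearSVC(loss="hinge", C=1.0))  # hinge loss
klr = make_pipeline(kernel_feats(), LogisticRegression(C=1.0))       # log loss
svm.fit(X, y)
klr.fit(X, y)
```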
So, what does a deep network using an SVM to classify look like? That could mean a few different things, but I think the most natural interpretation is just using $\ell_\mathit{hinge}$ in $(\star)$.
One minor issue is that $\ell_\mathit{hinge}$ isn’t differentiable at $t y = 1$; we could instead use $\ell_\mathit{hinge}^2$, if we want. (Doing this in (SVM) is sometimes called “L2-SVM” or similar names.) Or we can just ignore the non-differentiability; the ReLU activation isn’t differentiable at 0 either, and this usually doesn’t matter. This can be justified via subgradients, although note that the correctness here is actually quite subtle when dealing with deep networks.
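For intuition, here’s a small NumPy sketch (names are mine) of the subgradient choice for $\ell_\mathit{hinge}$, and of the gradient of $\ell_\mathit{hinge}^2$, both with respect to the decision value $t$:

```python
import numpy as np

def hinge_subgrad(t, y):
    # d/dt max(0, 1 - t*y); at the kink t*y == 1 we pick 0, a valid subgradient
    return np.where(t * y < 1, -y, 0.0)

def sq_hinge_grad(t, y):
    # l_hinge^2 is differentiable everywhere: d/dt max(0, 1 - t*y)^2
    return -2.0 * y * np.maximum(0.0, 1.0 - t * y)
```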
An ICML workshop paper – Tang, Deep Learning using Linear Support Vector Machines, ICML 2013 workshop Challenges in Representation Learning – found that using $\ell_\mathit{hinge}^2$ gave small but consistent improvements over $\ell_\mathit{CE}$ on the problems they considered. I’m sure others have tried (squared) hinge loss in deep networks since, but it certainly hasn’t taken off widely.
(You have to modify both $\ell_\mathit{CE}$ as I’ve written it and $\ell_\mathit{hinge}$ to support multi-class classification, but in the one-vs-rest scheme used by Tang, both are easy to do.)
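For concreteness, here’s one way that one-vs-rest squared hinge might look in TensorFlow; this is a sketch following Tang’s setup rather than his code, and the names are mine. (I believe tf.keras also ships a SquaredHinge loss along these lines.)

```python
import tensorflow as tf

def ovr_squared_hinge(y_true_onehot, logits):
    """One-vs-rest squared hinge: one +/-1 target per class."""
    y_pm1 = 2.0 * y_true_onehot - 1.0                # {0,1} one-hot -> {-1,+1}
    margins = tf.maximum(0.0, 1.0 - y_pm1 * logits)  # per-class hinge terms
    return tf.reduce_mean(tf.reduce_sum(margins ** 2, axis=-1))
```

Prediction is then just the argmax over the per-class decision values.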
Another thing that’s sometimes done is to train CNNs in the typical way, but then take the output of a late layer as “features” and train a separate SVM on those. This was common in the early days of transfer learning with deep features, but is, I think, less common now.
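Concretely, that recipe looks something like this; a sketch assuming a trained Keras classifier `model` and arrays `x_train` / `y_train`, with all names mine:

```python
import tensorflow as tf
from sklearn.svm import LinearSVC

# `model`, `x_train`, `y_train` are assumed to exist already.
# Use the penultimate layer as a fixed feature extractor
# (the layer index here is illustrative).
feature_extractor = tf.keras.Model(model.inputs, model.layers[-2].output)

features = feature_extractor.predict(x_train)    # (n, d) deep features
svm = LinearSVC(C=1.0).fit(features, y_train)    # separate SVM on top
```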
Something like this is also done sometimes in other contexts, e.g. in meta-learning by Lee et al., Meta-Learning with Differentiable Convex Optimization, CVPR 2019, who actually solved (SVM) on deep network features and backpropped through the whole thing. (You can even do this with a nonlinear kernel in $\mathcal H$, though they didn’t; this is also done in some other “deep kernels” contexts.) It’s a very cool approach – one that I’ve also worked on – and in certain domains this type of approach makes a ton of sense, but there are some pitfalls, and I don’t think it’s very applicable to a typical “plain classification” problem.
This page shows a nice graph: https://github.com/potterhsu/SVHNClassifier, and there's a tensorflow model here: https://github.com/potterhsu/SVHNClassifier/blob/master/model.py
So basically, as far as the model structure:
- there's just one stack of convolutional layers, which feeds into all of the softmax layers
- each 'softmax' layer comprises a fully-connected layer (`dense` in tensorflow parlance), followed by a softmax
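A rough Keras sketch of that structure; the input shape, layer sizes, and class counts here are placeholders of mine, not the values from potterhsu's model.py:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 3))
h = layers.Conv2D(32, 3, activation="relu")(inputs)   # one shared conv stack...
h = layers.Conv2D(64, 3, activation="relu")(h)
h = layers.Flatten()(h)

# ...feeding a length head and one softmax head per digit position
length = layers.Dense(7, activation="softmax", name="length")(h)
digits = [layers.Dense(10, activation="softmax", name=f"digit_{i}")(h)
          for i in range(5)]
model = tf.keras.Model(inputs, [length] + digits)
```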
As far as backprop, it will run down the entire stack, for each digit. But there are 5 available digit outputs, right? E.g., the network could output `31256`. But let's say the target number is `432`: what should we do with the two additional digit classifiers? And the answer is: no backprop happens through those two additional digit classifiers, for which there is no target in this case.

And what will happen is that `L` (the length output) for such cases will be, well, in this case it will be 3, so the prediction output from the network will simply ignore the output of the two additional digit classifiers.
But otherwise, backprop is just standard backprop, through all layers.
As far as how to backprop only through some of the digits in practice, a couple of approaches:

- get the output from your network, and feed back the exact same numbers as the target for the digits that aren't being used: that way there'll be no gradient for those, or
- use the value of `L` to modify the loss function, something like, conceptually:

  `loss = digit_one_loss * (L >= 1) + digit_two_loss * (L >= 2) + ...`

You'll need to figure out a way to turn `(L >= 2)` and so on into numbers having a value of 0 or 1; see the sketch below.
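For that second approach, here's one way to build the 0/1 mask in TensorFlow; the tensor names `digit_losses` and `L` are mine:

```python
import tensorflow as tf

# digit_losses: (batch, 5) per-digit cross-entropies; L: (batch,) target lengths
positions = tf.range(1, 6, dtype=L.dtype)            # digit positions 1..5
mask = tf.cast(L[:, None] >= positions, tf.float32)  # (L >= 1), (L >= 2), ... as 0/1
loss = tf.reduce_sum(digit_losses * mask, axis=-1)   # unused digits get zero gradient
```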
Best Answer
It's not so simple. First of all, an SVM is, in a way, a type of neural network (you can learn an SVM solution through backpropagation). See What *is* an Artificial Neural Network?. Second, you can't know beforehand which model will work better, but the thing is, with a fully neuromorphic architecture you can learn the weights end-to-end, while attaching an SVM or RF to the last hidden layer activation of a CNN is simply an ad hoc procedure. It may perform better, or it may not; we can't know without testing.
The important part is that a fully convolutional architecture is capable of representation learning, which is useful for a myriad of reasons. For one, it may reduce or eliminate the need for feature engineering in your problem altogether.
About the FC layers, they are mathematically equivalent to 1x1 convolutional layers. See Yann Lecun's post, which I transcribe below: