Solved – a main difference between RBF neural networks and SVM with a RBF kernel

Tags: neural-networks, svm

What is a main difference between RBF neural networks and SVM with a RBF kernel, from a practical point of view?

Best Answer

An RBF SVM is virtually equivalent to an RBF neural network whose first-layer weights are fixed to the feature values of all the training samples; only the second-layer weights are tuned by the learning algorithm. This makes the optimization problem convex, hence it admits a single global solution. The fact that the number of potential hidden nodes can grow with the number of samples makes this hypothetical neural network a non-parametric model (which is usually not the case when we train neural nets: we tend to fix the architecture in advance, as a hyper-parameter of the algorithm, independently of the number of samples).
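To make the correspondence concrete, here is a minimal sketch (using scikit-learn and numpy; the synthetic dataset and hyper-parameter values are arbitrary choices of this example) that fits an RBF SVM and then re-evaluates it by hand as a two-layer RBF network whose hidden units are fixed at the support vectors:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
gamma = 0.5  # arbitrary kernel bandwidth for this sketch
svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def rbf_network_decision(X_new):
    """Evaluate the fitted SVM as a two-layer RBF network."""
    # First layer (fixed): RBF units centred on the support vectors.
    sq_dists = ((X_new[:, None, :] - svm.support_vectors_[None, :, :]) ** 2).sum(-1)
    hidden = np.exp(-gamma * sq_dists)
    # Second layer (learned): dual coefficients and bias.
    return hidden @ svm.dual_coef_.ravel() + svm.intercept_[0]

# Both views of the model agree up to numerical precision.
assert np.allclose(rbf_network_decision(X), svm.decision_function(X))
```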

Of course, in practice SVM implementations do not treat all the samples of the training set as concurrently active support vectors / hidden nodes. Samples are incrementally added to the active set if they contribute enough, and pruned as soon as they are shadowed by a more recent set of support vectors (thanks to the combined use of a margin-based loss function such as the hinge loss and a regularizer such as the ℓ2 penalty). That keeps the number of parameters of the model low enough.
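You can observe this sparsity directly; in the sketch below (scikit-learn assumed, the data and hyper-parameter values are made up), the fitted model retains only a subset of the training samples as support vectors:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
for C in (0.1, 1.0, 10.0):
    svm = SVC(kernel="rbf", gamma=0.1, C=C).fit(X, y)
    # n_support_ holds the per-class support vector counts; their sum is
    # typically well below the number of training samples.
    print(f"C={C}: {svm.n_support_.sum()} support vectors out of {len(X)} samples")
```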

On the other hand, classical RBF neural networks are trained with a fixed architecture (a fixed number of hidden nodes) but with tunable input-layer parameters. If we allowed the neural network to have as many hidden nodes as samples, the expressive power of such an RBF NN would be much higher than that of the SVM model, since the weights of the first layer are tunable too; but that comes at the price of a non-convex objective function that can get stuck in local optima, preventing the algorithm from converging to good parameter values. Furthermore, this increased expressive power comes with a serious capacity to overfit: reducing the number of hidden nodes can help decrease overfitting.
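For illustration, here is a minimal, hypothetical numpy sketch of such a network on a 1-D regression toy (every name and hyper-parameter value here is invented for the example): both layers, the RBF centers and the output weights, are tuned by gradient descent on a squared loss that is non-convex in the centers.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))              # toy 1-D inputs
y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.normal(size=200)

n_hidden, gamma, lr = 10, 1.0, 0.05                    # arbitrary choices
centers = rng.uniform(-3.0, 3.0, size=(n_hidden, 1))   # tunable first layer
weights = 0.1 * rng.normal(size=n_hidden)              # tunable second layer

for step in range(2000):
    diff = X[:, None, :] - centers[None, :, :]         # (N, H, D)
    hidden = np.exp(-gamma * (diff ** 2).sum(-1))      # (N, H) RBF activations
    resid = hidden @ weights - y                       # (N,) residuals
    # Gradients of the mean squared error w.r.t. both layers; the loss is
    # non-convex in `centers`, so the run may end in a local optimum.
    grad_w = 2.0 * hidden.T @ resid / len(X)
    grad_c = (4.0 * gamma / len(X)) * weights[:, None] \
             * np.einsum("n,nh,nhd->hd", resid, hidden, diff)
    weights -= lr * grad_w
    centers -= lr * grad_c

print("final MSE:", np.mean(resid ** 2))
```

Re-running with a different random seed can land in a different local optimum, which is exactly the convergence issue mentioned above.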

To summarize from a practical point of view:

  • RBF neural nets have a larger number of hyper-parameters (the bandwidth of the RBF kernel, the number of hidden nodes, the initialization scheme of the weights, the strengths of the regularizers a.k.a. weight decay for the first and second layers, the learning rate, the momentum), plus the local-optima convergence issues (which may or may not be a problem in practice, depending on the data and the hyper-parameters)

  • RBF SVM has only 2 hyper-parameters to grid-search (the bandwidth of the RBF kernel and the strength of the regularizer), and its convergence is independent of the initialization (convex objective function); see the grid-search sketch below
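As a hedged sketch of that 2-parameter search (scikit-learn assumed; the grid values are arbitrary, and the unit-variance scaling mentioned below is folded into a pipeline so it is refit inside each cross-validation split):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# Only two knobs: the regularization strength C and the kernel bandwidth gamma.
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```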

BTW, both should be given scaled features as input (e.g. standardized to unit variance).
