Let $\mathcal{X}$ represent your input space, i.e., the space where your data points reside. Consider a function $\Phi:\mathcal{X} \rightarrow \mathcal{F}$ that takes a point from your input space $\mathcal{X}$ and maps it to a point in $\mathcal{F}$. Now, suppose we have mapped all of our data points from $\mathcal{X}$ to this new space $\mathcal{F}$. If you try to solve the usual linear SVM in this new space $\mathcal{F}$ instead of $\mathcal{X}$, you will notice that all the earlier workings look exactly the same, except that every point $x_i$ is represented as $\Phi(x_i)$ and, instead of $x^Ty$ (the dot product, which is the natural inner product for Euclidean space), we use $\langle \Phi(x), \Phi(y) \rangle$, the natural inner product in the new space $\mathcal{F}$. So, at the end, your $w^*$ would look like,
$$
w^*=\sum_{i \in SV} h_i y_i \Phi(x_i)
$$
and hence,
$$
\langle w^*, \Phi(x) \rangle = \sum_{i \in SV} h_i y_i \langle \Phi(x_i), \Phi(x) \rangle
$$
Similarly,
$$
b^*=\frac{1}{|SV|}\sum_{i \in SV}\left(y_i - \sum_{j=1}^N\left(h_j y_j \langle \Phi(x_j), \Phi(x_i)\rangle\right)\right)
$$
and your classification rule looks like: $c_x=\text{sign}(\langle w^*, \Phi(x) \rangle+b^*)$.
So far so good: there is nothing new here, since we have simply applied the usual linear SVM in a different space. However, the magic part is this:
Suppose there exists a function $k:\mathcal{X}\times\mathcal{X}\rightarrow \mathbb{R}$ such that $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$. Then we can replace all the inner products above with $k(x_i, x_j)$. Such a $k$ is called a kernel function.
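For a quick sanity check that such pairs $(\Phi, k)$ exist, take $\mathcal{X} = \mathbb{R}^2$ and the explicit map
$$
\Phi(x) = \left(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\right)^T
$$
Then
$$
\langle \Phi(x), \Phi(y) \rangle = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x^T y)^2
$$
so $k(x, y) = (x^T y)^2$ (the homogeneous polynomial kernel of degree 2) computes the inner product in $\mathcal{F} = \mathbb{R}^3$ without ever forming $\Phi(x)$.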
Therefore, the decision quantities $\langle w^*, \Phi(x) \rangle$ and $b^*$ look like,
$$
\langle w^*, \Phi(x) \rangle = \sum_{i \in SV} h_i y_i k(x_i, x)
$$
$$
b^*=\frac{1}{|SV|}\sum_{i \in SV}\left(y_i - \sum_{j=1}^N\left(h_j y_j k(x_j, x_i)\right)\right)
$$
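To make these formulas concrete, here is a minimal sketch in Python (the names `h`, `sv_x`, `sv_y`, and `b` are placeholders for the dual solution, support vectors, their labels, and the bias, all assumed to be already computed):

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def decision_value(x, sv_x, sv_y, h, b, kernel=rbf_kernel):
    """<w*, Phi(x)> + b* = sum_{i in SV} h_i y_i k(x_i, x) + b*."""
    return sum(h_i * y_i * kernel(x_i, x)
               for h_i, y_i, x_i in zip(h, sv_y, sv_x)) + b

def classify(x, sv_x, sv_y, h, b, kernel=rbf_kernel):
    """c_x = sign(<w*, Phi(x)> + b*)."""
    return np.sign(decision_value(x, sv_x, sv_y, h, b, kernel))
```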
For which kernel functions is this substitution valid? That's a slightly involved question; the short answer is that $k$ must be a positive semi-definite (Mercer) kernel, i.e., one that actually corresponds to an inner product in some feature space, and you may want to take up proper reading material to understand the implications. However, I will just add that the condition does hold for the RBF kernel.
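One practical (if only necessary, not sufficient) consequence of this condition is that the Gram matrix $K_{ij} = k(x_i, x_j)$ must be positive semi-definite for any sample of points, which you can check empirically. A small sketch for the RBF kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # 50 arbitrary points in R^3
sigma = 1.0

# Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# A Mercer kernel's Gram matrix has no negative eigenvalues
# (tiny negative values here would just be floating-point noise).
print(np.linalg.eigvalsh(K).min())
```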
To answer your question, "Is the situation so that all the support vectors are needed for the classification?"
Yes. As you may notice above, we never compute $w^*$ explicitly; we compute its inner product with $\Phi(x)$ as a sum over the support vectors. This requires us to retain all the support vectors for classification.
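You can see this concretely in scikit-learn, where a fitted `SVC` stores exactly these quantities and the decision function can be rebuilt from them by hand (a sketch using documented `SVC` attributes; `dual_coef_` holds the products $h_i y_i$):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

# k(x_i, x) for every support vector x_i and every point x
K = rbf_kernel(clf.support_vectors_, X, gamma=0.5)

# sum_i h_i y_i k(x_i, x) + b*, built from the stored support vectors
manual = clf.dual_coef_ @ K + clf.intercept_
print(np.allclose(manual.ravel(), clf.decision_function(X)))  # True
```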
Note: the $h_i$'s here are the solution to the dual of the SVM in the space $\mathcal{F}$, not $\mathcal{X}$. Does that mean we need to know the function $\Phi$ explicitly? Luckily, no. If you look at the dual objective, it involves the data only through inner products, and since $k$ lets us compute those inner products directly, we never need to know $\Phi$ explicitly. The dual problem simply looks like,
$$
\max_h \; \sum_i h_i - \frac{1}{2}\sum_{i,j} y_i y_j h_i h_j k(x_i, x_j)
$$
$$
\text{subject to: } \sum_i y_i h_i = 0, \quad h_i \geq 0
$$
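Since the data enter only through $k(x_i, x_j)$, the dual can be handed to any generic QP solver once the Gram matrix is computed. A minimal sketch using `scipy.optimize.minimize` (hard-margin form, labels in $\{-1, +1\}$; real SVM solvers such as SMO are far more efficient):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def solve_dual(X, y, gamma=0.5):
    """Return the dual variables h for a hard-margin kernel SVM."""
    n = len(y)
    Q = np.outer(y, y) * rbf_kernel(X, X, gamma=gamma)  # Q_ij = y_i y_j k(x_i, x_j)

    objective = lambda h: -np.sum(h) + 0.5 * h @ Q @ h  # minimize the negative dual
    constraint = {"type": "eq", "fun": lambda h: h @ y}  # sum_i y_i h_i = 0
    bounds = [(0.0, None)] * n                           # h_i >= 0

    return minimize(objective, np.zeros(n), bounds=bounds,
                    constraints=[constraint]).x
```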
As the previous answers say, RBF kernels embed data points into an infinite-dimensional space. But it turns out you can approximate that embedding in a finite-dimensional space, as proposed in the paper Random Features for Large-Scale Kernel Machines by Rahimi and Recht (NIPS 2007). The method is also sometimes called "random kitchen sinks."
The gist of the method is: if you want to use an RBF kernel $k(x, y) = \exp\left( - \frac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right)$, then you can get a feature map $z : \mathbb R^d \to \mathbb R^D$ such that $k(x, y) \approx z(x)^T z(y)$ by:
- Sample $D/2$ $d$-dimensional weight vectors $\omega_i \sim \mathcal{N}(0, \frac{1}{\sigma^2} I)$.
- Define $z(x) = \sqrt{\tfrac{2}{D}} \begin{bmatrix} \sin(\omega_1^T x) & \cos(\omega_1^T x) & \cdots & \sin(\omega_{D/2}^T x) & \cos(\omega_{D/2}^T x) \end{bmatrix}^T$; the $\sqrt{2/D}$ scaling makes $z(x)^T z(y)$ an unbiased estimate of $k(x, y)$.
Then you can train a linear SVM on these features, which will approximate the RBF-kernel SVM on the original features.
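Here is a minimal sketch of that construction in NumPy (my own code, not the authors'):

```python
import numpy as np

def random_fourier_features(X, D, sigma, seed=0):
    """Map X (n x d) to z(X) (n x D) so that
    z(x)^T z(y) ~= exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D // 2))  # omega_i ~ N(0, I / sigma^2)
    proj = X @ W
    # Concatenating all sines then all cosines gives the same dot products
    # as interleaving them as in the definition above.
    return np.sqrt(2.0 / D) * np.hstack([np.sin(proj), np.cos(proj)])
```

Training, say, scikit-learn's `LinearSVC` on `random_fourier_features(X, D, sigma)` then approximates the RBF-kernel `SVC` on `X`, with the approximation improving as $D$ grows.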
(Note that the linked version of the paper doesn't discuss this particular variant, but rather one that seems like it might be better, though my (very) recent paper argues it is actually worse.)
This is implemented in scikit-learn, shogun, and JSAT.
There's also a method called Fastfood (Le, Sarlós, and Smola, Fastfood – Approximating Kernel Expansions in Loglinear Time, ICML 2013) that speeds up the method for large $d$ and decreases the storage requirements. Good implementations are more complicated, though. Here's one for scikit-learn that's okay, but I might work on making it more parallelized soon; there's also a MATLAB one, and a Shark/C++ one that I haven't tested.
In general, it should not make a difference. For many methods (naive Bayes, decision trees, regression) this is not a factor at all. For SVMs, it may depend on the type of SVM and the method used to solve it: if the algorithm is approximate, not run to convergence, or involves randomness, it may produce somewhat different results.
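To illustrate the SVM caveat (my own sketch, assuming the question concerns the ordering of the training examples): permute the training set and refit; an exact solver run to convergence should give essentially identical predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
perm = np.random.default_rng(0).permutation(len(y))

a = SVC(kernel="rbf").fit(X, y)
b = SVC(kernel="rbf").fit(X[perm], y[perm])

# Fraction of points where the two models agree; expect 1.0 (or very close).
print(np.mean(a.predict(X) == b.predict(X)))
```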