Solved – Using SVD on features before SVM classification, when p >> N

dimensionality-reduction, high-dimensional, svd, svm

So I am going through Hastie's Elements of Statistical Learning, and in section 18.3.5, which deals with computational shortcuts when the number of dimensions $p$ is much larger than the number of samples $N$, the authors mention:

When $p > N$ , the computations can be carried out in an $N$-dimensional
space, rather than $p$, via the singular value decomposition […]

Here is the geometric intuition: just like
two points in three-dimensional space always lie on a line, $N$ points
in $p$-dimensional space lie in an $(N − 1)$-dimensional affine subspace.

That does indeed make a lot of sense. They proceed to mention that:

This result can be applied to many of the learning methods discussed
in this chapter, such as regularized logistic regression, linear
discriminant analysis, and support vector machines.

My question is: Can this result be applied when using the RBF or some other non-linear kernel for the SVM?

If yes, then does it make any sense to perform SVM classification on the full-feature dataset, or should one always use the SVD transform of the data when $p \gg N$?
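For concreteness, this is the reduction I have in mind, checked numerically (my own sketch using numpy; the sizes $N = 40$, $p = 5000$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 40, 5000                     # p >> N
X = rng.normal(size=(N, p))

# N points in p dimensions lie in an (N-1)-dimensional affine subspace
print(np.linalg.matrix_rank(X - X.mean(axis=0)))   # N - 1 = 39

# Thin SVD: X = U D V^T, so R = U D is an N x N representation of the data
U, d, Vt = np.linalg.svd(X, full_matrices=False)
R = U * d

# R carries the same geometry: all pairwise inner products are preserved
print(np.allclose(X @ X.T, R @ R.T))               # True
```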

Best Answer

My question is: Can this result be applied when using the RBF or some other non-linear kernel for the SVM?

There's no gain. The result rests on the fact that instead of a $p$-dimensional coefficient vector $\beta_{p\times 1}$ you only have to estimate $\beta_{N\times 1}$, which is smaller.

Consider the SVD $X = UDV^T$ and the linear kernel (Gram) matrix $K = XX^T$; since the columns of $V$ are orthonormal, $V^TV = I$:

$$K = XX^T = UDV^T\,VDU^T = UD^2U^T = (UD)(UD)^T = RR^T, \quad \text{where } R = UD \text{ is } N \times N.$$

As shown, when you apply a kernel in the SVM you are already working with the $N \times N$ Gram matrix and estimating only $N$ (dual) coefficients. So the solutions are the same, and there's no gain whatsoever. Other kernels, the RBF included, are functions of the same dot products (or of the pairwise distances they determine), so the result carries over.
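A quick numerical check (a sketch with numpy/scikit-learn; the bandwidth $\gamma$ and the toy data are arbitrary): the Gram matrices built from $X$ and from $R = UD$ coincide, so the fitted SVMs do too.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
N, p = 40, 5000
X = rng.normal(size=(N, p))
y = rng.integers(0, 2, size=N)

# X = U D V^T  ->  R = U D is the N x N representation
U, d, Vt = np.linalg.svd(X, full_matrices=False)
R = U * d

# Same Gram matrices: linear directly, RBF because pairwise distances
# are preserved by the orthonormal V.
gamma = 1e-4
print(np.allclose(X @ X.T, R @ R.T))                              # True
print(np.allclose(rbf_kernel(X, gamma=gamma),
                  rbf_kernel(R, gamma=gamma)))                    # True

# Identical Gram matrices -> identical dual problems -> identical fits
f_X = SVC(kernel="rbf", gamma=gamma).fit(X, y).decision_function(X)
f_R = SVC(kernel="rbf", gamma=gamma).fit(R, y).decision_function(R)
print(np.allclose(f_X, f_R, atol=1e-6))                           # True
```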

Basically, any kernel-based model already has this trick built into it. The trick only results in gains when kernels are not used explicitly.
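By contrast, for a primal method such as ridge regression the reduction does help: you solve for only $N$ coefficients $\theta$ on $R$ and map back to the feature space with $\beta = V\theta$. A minimal sketch, assuming scikit-learn's Ridge (intercept switched off to keep the correspondence exact):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, p = 40, 5000
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
R = U * d                                          # N x N reduced design

# Full problem: p coefficients.  Reduced problem: only N coefficients.
beta  = Ridge(alpha=1.0, fit_intercept=False).fit(X, y).coef_
theta = Ridge(alpha=1.0, fit_intercept=False).fit(R, y).coef_

# The reduced solution maps back exactly: beta = V theta
print(np.allclose(beta, Vt.T @ theta))             # True
```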