The proportion of support vectors is an upper bound on the leave-one-out cross-validation error (as the decision boundary is unaffected if you leave out a non-support vector), and thus provides an indication of the generalisation performance of the classifier. However, the bound isn't necessarily very tight (or even usefully tight), so you can have a model with lots of support vectors, but a low leave-one-out error (which appears to be the case here). There are tighter (approximate) bounds, such as the Span bound, which are more useful.
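To get a feel for how loose the bound can be, here is a minimal sketch (using scikit-learn; the dataset and hyper-parameter values are arbitrary illustrative choices, not a recommendation) comparing the support-vector fraction with the measured leave-one-out error:

```python
# Compare the support-vector fraction (the bound) with the measured LOO error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
sv_fraction = len(clf.support_) / len(X)  # upper bound on the LOO error

loo_error = 1.0 - cross_val_score(
    SVC(kernel="rbf", C=1.0, gamma="scale"), X, y, cv=LeaveOneOut()
).mean()

print(f"support-vector fraction (bound): {sv_fraction:.3f}")
print(f"measured leave-one-out error:    {loo_error:.3f}")
```

Typically the measured LOO error comes out well below the support-vector fraction, which is exactly the looseness described above.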
This commonly happens when you tune the hyper-parameters to optimise the CV error: you end up with a broad, smooth kernel and a small value of C (so margin violations are not penalised very much), in which case the margin becomes very wide and there are lots of support vectors. Essentially, both the kernel and regularisation parameters control capacity, and you can get a diagonal trough in the CV error as a function of the hyper-parameters because their effects are correlated, and different combinations of kernel parameter and regularisation provide similarly good models.
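If you want to see that trough for yourself, a quick sketch (scikit-learn again; toy data and an arbitrary grid) is to tabulate the mean CV score over a grid of (C, gamma) values:

```python
# Tabulate mean CV accuracy over a (C, gamma) grid; correlated hyper-parameter
# effects show up as a diagonal band of similarly good scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-3, 2, 6)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

# GridSearchCV enumerates the grid with the last (alphabetically sorted)
# parameter varying fastest, so rows below are indexed by C, columns by gamma.
scores = search.cv_results_["mean_test_score"].reshape(6, 6)
print(scores.round(2))
```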
It is worth noting that as soon as you tune the hyper-parameters, e.g. via CV, the SVM no longer implements a structural risk minimisation approach, as we are then just tuning the hyper-parameters directly on the data with no capacity control over them. Essentially, the performance estimates or bounds are biased (or invalidated) by optimising them directly.
My advice would be not to worry about it and just be guided by the CV error (but remember that if you use CV to tune the model, you need to use nested CV to evaluate its performance; a sketch follows). The sparsity of the SVM is a bonus, but I have found it doesn't generate enough sparsity to be really worthwhile (L1 regularisation provides greater sparsity). For small problems (e.g. 400 patterns) I use the LS-SVM, which is fully dense and generally performs similarly well.
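For reference, a minimal sketch of the nested CV pattern just mentioned (scikit-learn; the grid and data are placeholders): the inner GridSearchCV tunes the hyper-parameters, while the outer folds evaluate the whole tune-then-train procedure.

```python
# Nested CV: hyper-parameter tuning happens inside each outer training fold,
# so the outer score estimates the performance of the full procedure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

inner = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-3, 2, 6)},
    cv=5,  # inner folds: model selection
)
outer_scores = cross_val_score(inner, X, y, cv=5)  # outer folds: evaluation
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```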
One-class SVMs are hard!!
1) First, in general, the one-class SVM is an unsupervised learning technique, so there is no correct answer, just as there is no correct number of clusters in k-means. As with k-means, there may be metrics that evaluate the quality of a solution, but they are all heuristic, and therefore there are many of them (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation lists some of the cluster metrics implemented in sklearn). Unfortunately, I do not know of any quality metrics for the 1-class case. Hopefully someone on CV can answer that.
Nor can you use cross-validation to select the hyperparameter, because, again, you have no correct answer against which to measure any form of accuracy.
So, unfortunately, you have to try some possible values of the nu hyperparameter and verify whether the solution "makes sense" for your problem. As far as I know there is no simpler heuristic than that.
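As a rough sketch of that heuristic (sklearn's OneClassSVM on synthetic stand-in data), sweep nu and check what gets flagged:

```python
# Sweep nu and inspect the fraction of training points flagged as outliers;
# by construction that fraction roughly tracks nu. Whether the flagged points
# "make sense" for your problem is the part you must judge yourself.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # stand-in for your "normal" data

for nu in [0.01, 0.05, 0.1, 0.2]:
    pred = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit_predict(X)
    flagged = np.mean(pred == -1)  # -1 marks predicted outliers
    print(f"nu={nu:.2f}: fraction flagged = {flagged:.3f}")
```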
Of course, if you have the correct class for the data, that is, you know which data are normal and which are not, then you can use cross-validation to select the hyperparameters. But in that case your problem is really a classification problem, and you are using the 1-class SVM as a classifier. That is almost always a bad idea: if you have a classification problem, use a classification algorithm, not an unsupervised algorithm!
2) I do not think a linear-kernel 1-class SVM makes a lot of sense. What the 1-class SVM does is allow at most a fraction nu of your data to be considered non-normal (let us call that the positive class, and the rest the negative class), and then solve the usual SVM optimisation to find the location of the separating hyperplane. With the linear kernel, the hyperplane really is a flat plane (an odd sentence, I know!), so what you get at the end is that half of the space is called negative (the normal data) and the other half positive.
Usually what one wants is a curved, closed surface that contains the negative (normal) data, with all of the "outside" labelled positive. The normal data are then contained within that curved, closed surface (or surfaces). This can only be accomplished with a non-linear kernel; the RBF kernel will certainly do it (I don't know about the polynomial kernel).
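A small sketch of this geometric difference (sklearn, synthetic data): probe points far from the training data in opposite directions. The RBF model typically rejects both probes, because its accepted region is closed around the data; the linear model keeps accepting whatever lies on the "normal" side of its hyperplane, no matter how far away it is.

```python
# Linear vs RBF one-class SVM on probe points far from the training data.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) + 5.0  # "normal" data, offset from the origin
probes = np.array([[100.0, 100.0], [-100.0, -100.0]])  # far away, both sides

for kernel in ["linear", "rbf"]:
    model = OneClassSVM(kernel=kernel, gamma="scale", nu=0.05).fit(X)
    # +1 = accepted as normal, -1 = outlier. Expect the linear kernel to
    # accept one far probe (its half-space is unbounded); RBF rejects both.
    print(kernel, model.predict(probes))
```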
But using an RBF kernel makes your problem worse, because now you have two hyperparameters, nu and gamma, and still no way to select them.
3) There is a technique called SVDD (support vector data description) that tries to build hyperspheres around your negative data (which can be deformed by using kernels). At least in the linear case you only have one hyperparameter, nu, and your negative region is the inside of the sphere rather than a half-space, as it is for the linear 1-class SVM. I have never used SVDD, so I have no experience with it.
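For reference, the usual statement of the soft-margin SVDD problem (due to Tax and Duin) is to find the smallest sphere, with centre $\mathbf{a}$ and radius $R$, that contains the data up to slacks:
$$\min_{R, \mathbf{a}, \mathbf{\xi}} \; R^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad \|\phi(\mathbf{x}_i) - \mathbf{a}\|^2 \leq R^2 + \xi_i, \;\; \xi_i \geq 0 \;\; \forall i,$$
where $\phi$ is the (possibly implicit) kernel feature map, and the reparameterisation $C = 1/(\nu n)$ recovers a nu-style trade-off like the one above.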
And that is why 1-class SVMs are hard. Sorry, not much help there.
Best Answer
Hint: Start with the dual SVM objective with $\ell_2$ regularization
$$\text{maximize}\quad J(\mathbf{\Lambda}) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j \mathbf{K}(\mathbf{x}_i, \mathbf{x}_j)$$ $$\text{subject to}\quad \lambda_i \geq 0; \; \forall \; i.$$
Note that we take the hard-margin SVM, because of the linear separability assumption.
Denote the objective for the SVM trained without point $\mathbf{x}_k$ as $J_{(-k)}(\mathbf{\Lambda}_{(-k)})$, or
$$J_{(-k)}(\mathbf{\Lambda}_{(-k)}) = \sum_{i \neq k} \lambda_i - \frac{1}{2} \sum_{i \neq k} \sum_{j \neq k} \lambda_i \lambda_j y_i y_j \mathbf{K}(\mathbf{x}_i, \mathbf{x}_j).$$
In terms of these variables, an equivalent statement of your theorem is that your leave-one-out cross validation error is bounded by
$$\frac{1}{n} \sum_{k=1}^n \mathbf{1}\left[y_k \sum_{i\neq k}\lambda^k_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k) < 0\right]$$
where $\lambda^k_i$ is the $i$th element of $\mathbf{\Lambda}^k_{(-k)}$, the result of training without point $\mathbf{x}_k$. This is because $|SV|$ is the number of nonzero $\lambda^*_i$, via complementary slackness.
We can prove this statement by analyzing the effect of "adding back" an arbitrary missing point in leave-one-out CV. To do this, show the following.
a) Let $\lambda_k^*$ be the optimal dual variable corresponding to the left-out point $\mathbf{x}_k$. Show that when $\mathbf{x}_k$ is "added back" into the dataset with $\lambda_k$ held fixed at $\lambda^*_k$, the resulting objective to optimize over $\mathbf{\Lambda}_{(-k)}$, i.e. $J(\mathbf{\Lambda}_{(-k)})$, is, up to additive constants,
$$J_{(-k)}(\mathbf{\Lambda}_{(-k)}) - \lambda^*_k y_k \sum_{i \neq k} \lambda_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k).$$
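One way to see this (a sketch of the algebra, using only the definitions above): substituting $\lambda_k = \lambda^*_k$ into $J(\mathbf{\Lambda})$ and separating out the terms involving $k$ gives
$$J(\mathbf{\Lambda})\Big|_{\lambda_k = \lambda^*_k} = J_{(-k)}(\mathbf{\Lambda}_{(-k)}) - \lambda^*_k y_k \sum_{i \neq k} \lambda_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k) + \underbrace{\lambda^*_k - \frac{1}{2} (\lambda^*_k)^2 \mathbf{K}(\mathbf{x}_k, \mathbf{x}_k)}_{\text{constant in } \mathbf{\Lambda}_{(-k)}},$$
and the constant terms can be dropped without changing the maximizer.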
b) Let $\mathbf{\Lambda}^k_{(-k)}$ be the unique maximizer of $J_{(-k)}(\mathbf{\Lambda}_{(-k)})$, and $\mathbf{\Lambda}^*_{(-k)}$ be the unique maximizer of $J(\mathbf{\Lambda}_{(-k)})$. Use the optimality statements $J(\mathbf{\Lambda}^*_{(-k)}) \geq J(\mathbf{\Lambda}^k_{(-k)})$ and $J_{(-k)}(\mathbf{\Lambda}^k_{(-k)}) \geq J_{(-k)}(\mathbf{\Lambda}^*_{(-k)})$ to conclude that
$$y_k \sum_{i\neq k} \lambda_i^k y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k) \geq y_k \sum_{i\neq k} \lambda_i^* y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k)$$
where $\lambda_i^*, \lambda_i^k$ are simply the $i$th elements of $\mathbf{\Lambda}^*_{(-k)}, \mathbf{\Lambda}^k_{(-k)}$, respectively.
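If the pairing of the inequalities is unclear: by part a), $J(\mathbf{\Lambda}_{(-k)}) = J_{(-k)}(\mathbf{\Lambda}_{(-k)}) - \lambda^*_k y_k \sum_{i\neq k} \lambda_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k)$ up to constants, so adding the two optimality statements cancels the $J_{(-k)}$ terms and leaves
$$\lambda^*_k \, y_k \sum_{i\neq k} \lambda^k_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k) \;\geq\; \lambda^*_k \, y_k \sum_{i\neq k} \lambda^*_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k);$$
dividing by $\lambda^*_k$, assumed strictly positive for now, gives the claimed inequality (part c deals with the remaining case).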
c) Does your reasoning in part b) still make sense when $\lambda^*_k = 0$? Reason about why the bound still holds in that case (or simply justify why $\lambda^*_k = 0$ doesn't affect part b, if that's what you find).
Detail to consider: what assumptions do I need to make to ensure these maximizers are unique?
d) Use the bound in part b) to show that
$$\frac{1}{n} \sum_{k=1}^n \mathbf{1}\left[y_k \sum_{i\neq k}\lambda^k_i y_i \mathbf{K}(\mathbf{x}_i, \mathbf{x}_k) < 0\right] \leq \frac{1}{n} \sum_{k=1}^n \mathbf{1}\left[\lambda^*_k > 0\right].$$
Use the fact that support vectors correspond to nonzero $\lambda_k^*$ to conclude.