The decision value (distance to $\mathbf{w}$) for a test instance $\mathbf{z}$ is:
$$f(\mathbf{z}) = \sum_{i\in SV} \alpha_i y_i \kappa(\mathbf{x}_i, \mathbf{z}) - \rho$$
In LIBSVM models, `svm_coef` contains the products $\alpha_i y_i$, the RBF kernel is defined as $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i-\mathbf{x}_j\|^2)$, and the support vectors are stored in `SV`.
So what you need to do is compute the decision values yourself based on the content of the model object. Retraining is not necessary; everything you need is available.
You'll need to take care with the structure of the support vectors and coefficients, which is poorly documented. I don't remember the details myself anymore, so I can't provide the full code, but it's not that difficult.
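As a starting point, here is a minimal NumPy sketch of the decision-value computation above, assuming you have already pulled the support vectors, the $\alpha_i y_i$ coefficients, $\rho$, and $\gamma$ out of the model object (the array and function names below are just placeholders):

```python
import numpy as np

def rbf_decision_value(z, SV, coef, rho, gamma):
    """Compute f(z) = sum_i coef_i * K(x_i, z) - rho, where coef_i
    already holds alpha_i * y_i (as stored in the model coefficients)."""
    # Squared Euclidean distances between z and every support vector
    sq_dists = np.sum((SV - z) ** 2, axis=1)
    # RBF kernel values K(x_i, z) = exp(-gamma * ||x_i - z||^2)
    k = np.exp(-gamma * sq_dists)
    return float(np.dot(coef, k) - rho)
```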
Here's what I would recommend: Use probability rankings and class proportions in the training sample to determine the class assignments.
You have three (estimated) probabilities: $p_a, p_b,$ and $p_c$. And you have the original class proportions from the training sample: $m_a, m_b,$ and $m_c$, where $m_a$ is the percentage of records that belong to class $a$ (e.g., 0.6), and so on.
You can start with the smallest class, say $b$, and use $p_b$ to rank order all records from the highest to the lowest values. From this rank-ordered list, start assigning records to class $b$ until $m_b$ percent of the records are assigned to this class. Record the value of $p_b$ at this stage; this value will become the cut-off point for class $b$.
Now take the next smallest class, say $c$, and use $p_c$ to rank order all records and follow the same steps described in the paragraph above. At the end of this step, you will get a cut-off value for $p_c$, and $m_c$ percent of all records would be assigned to class $c$.
Finally, assign all remaining records to (the largest) class $a$.
For future scoring purposes, you can follow these steps but discard the class proportions, and let the probability cut-off values for classes $b$ and $c$ drive the class assignments.
In order to make sure that this approach yields a reasonable level of accuracy, you can review the classification matrix (and any other measures you are using) on the validation set.
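Putting the above steps together, here is a hedged Python sketch of the ranking-and-cut-off procedure (the function and variable names are my own; estimated probabilities and training proportions are assumed to be supplied as dictionaries keyed by class label):

```python
import numpy as np

def assign_by_proportions(probs, proportions, order):
    """probs: class -> array of estimated probabilities for all records.
    proportions: class -> training-sample proportion (the m values).
    order: classes listed from smallest to largest; the last one absorbs the rest.
    Returns per-record labels and the probability cut-offs for future scoring."""
    n = len(next(iter(probs.values())))
    labels = np.full(n, order[-1], dtype=object)     # default: the largest class
    unassigned = np.ones(n, dtype=bool)
    cutoffs = {}
    for cls in order[:-1]:                           # smallest classes first
        p = probs[cls]
        ranked = np.argsort(-p)                      # highest probability first
        ranked = ranked[unassigned[ranked]]          # skip already-assigned records
        k = int(round(proportions[cls] * n))         # target count for this class
        chosen = ranked[:k]
        labels[chosen] = cls
        unassigned[chosen] = False
        cutoffs[cls] = p[chosen[-1]] if len(chosen) else None  # cut-off for this class
    return labels, cutoffs
```

The returned `cutoffs` can then be stored and used to drive assignments on future data, as described above.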
Best Answer
First a brief answer, and then a longer comment:
Answer
SNE techniques compute an $N \times N$ similarity matrix both in the original data space and in the low-dimensional embedding space, in such a way that the similarities form a probability distribution over pairs of objects. Specifically, the probabilities are generally given by a normalized Gaussian kernel computed from the input data or from the embedding. In terms of classification, this immediately brings to mind instance-based learning methods. You have listed one of them: SVMs with an RBF kernel, and @amoeba has listed kNN. There are also radial basis function networks, which I am not an expert on.
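For reference (following the original SNE and t-SNE papers), the input-space similarities are conditional probabilities built from a normalized Gaussian kernel, with each bandwidth $\sigma_i$ chosen so that the resulting distribution matches the user-specified perplexity:
$$p_{j\mid i} = \frac{\exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2\right)}$$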
Comment
Having said that, I would be doubly careful about making inferences about a dataset just by looking at t-SNE plots. t-SNE does not necessarily focus on the local structure. However, you can adjust it to do so by tuning the `perplexity` parameter, which regulates (loosely) how to balance attention between local and global aspects of your data. In this context, `perplexity` itself is a stab in the dark at how many close neighbours each observation may have, and it is user-provided. The original paper states: “The performance of t-SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.” However, my experience is that getting the most from t-SNE may mean analyzing multiple plots with different perplexities.

In other words, by tuning the `learning rate` and `perplexity`, it is possible to obtain very different-looking 2-D plots for the same number of training steps and the same data (a small code sketch after the summary list below illustrates this). The Distill paper How to Use t-SNE Effectively gives a great summary of the common pitfalls of t-SNE analysis. The summary points are:
1. Those hyperparameters (e.g. learning rate, perplexity) really matter
2. Cluster sizes in a t-SNE plot mean nothing
3. Distances between clusters might not mean anything
4. Random noise doesn’t always look random
5. You can see some shapes, sometimes
6. For topology, you may need more than one plot
Specifically from points 2, 3, and 6 above, I would think twice about making inferences about the separability of the data by looking at individual t-SNE plots. There are many cases where you can 'manufacture' plots that show clear clusters using the right parameters.
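As a concrete illustration of point 1, here is a hedged sketch, assuming scikit-learn's `TSNE` implementation and a placeholder feature matrix `X` (not a real dataset), that re-runs t-SNE across several perplexities; plotting the resulting embeddings side by side typically shows noticeably different pictures of the same data:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))              # placeholder data; substitute your own features

embeddings = {}
for perplexity in (5, 30, 50):              # re-run t-SNE across several perplexities
    tsne = TSNE(n_components=2, perplexity=perplexity,
                learning_rate=200.0, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
# Compare the three 2-D embeddings (e.g. with matplotlib scatter plots): the apparent
# cluster structure can change considerably with these hyperparameters alone.
```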