It sounds right to me. People sometimes also use the word "Perceptron" to refer to the training algorithm together with the classifier. For example, someone explained this to me in the answer to this question. Also, there is nothing to stop you from using a kernel with the perceptron, and this is often a better classifier. See here for some slides (pdf) on how to implement the kernel perceptron.
The major practical difference between a (kernel) perceptron and SVM is that perceptrons can be trained online (i.e. their weights can be updated as new examples arrive one at a time) whereas SVMs cannot be. See this question for information on whether SVMs can be trained online. So, even though a SVM is usually a better classifier, perceptrons can still be useful because they are cheap and easy to re-train in a situation in which fresh training data is constantly arriving.
The most common measures of separability are based on how much the intra-class distributions overlap (probabilistic measures). There are a couple of these, Jeffries-Matusita distance, Bhattacharya distance and the transformed divergence. You can easily google up some descriptions. They are quite straightforward to implement.
There also some based on the behavior of nearest neighbors. The separability index, which basically looks at the proportion of neighbors that overlap. And the Hypothesis margin which looks at the distance from an object’s nearest neighbour of the same class (near-hit) and a nearest neighbour of the opposing class (near-miss). Then creates a measure by summing over this.
And then you also have things like class scatter matrices and collective entropy.
EDIT
Probabilistic separability measures in R
separability.measures <- function ( Vector.1 , Vector.2 ) {
# convert vectors to matrices in case they are not
Matrix.1 <- as.matrix (Vector.1)
Matrix.2 <- as.matrix (Vector.2)
# define means
mean.Matrix.1 <- mean ( Matrix.1 )
mean.Matrix.2 <- mean ( Matrix.2 )
# define difference of means
mean.difference <- mean.Matrix.1 - mean.Matrix.2
# define covariances for supplied matrices
cv.Matrix.1 <- cov ( Matrix.1 )
cv.Matrix.2 <- cov ( Matrix.2 )
# define the halfsum of cv's as "p"
p <- ( cv.Matrix.1 + cv.Matrix.2 ) / 2
# --%<------------------------------------------------------------------------
# calculate the Bhattacharryya index
bh.distance <- 0.125 *t ( mean.difference ) * p^ ( -1 ) * mean.difference +
0.5 * log (det ( p ) / sqrt (det ( cv.Matrix.1 ) * det ( cv.Matrix.2 )
)
)
# --%<------------------------------------------------------------------------
# calculate Jeffries-Matusita
# following formula is bound between 0 and 2.0
jm.distance <- 2 * ( 1 - exp ( -bh.distance ) )
# also found in the bibliography:
# jm.distance <- 1000 * sqrt ( 2 * ( 1 - exp ( -bh.distance ) ) )
# the latter formula is bound between 0 and 1414.0
# --%<------------------------------------------------------------------------
# calculate the divergence
# trace (is the sum of the diagonal elements) of a square matrix
trace.of.matrix <- function ( SquareMatrix ) {
sum ( diag ( SquareMatrix ) ) }
# term 1
divergence.term.1 <- 1/2 * trace.of.matrix (( cv.Matrix.1 - cv.Matrix.2 ) *
( cv.Matrix.2^ (-1) - cv.Matrix.1^ (-1) )
)
# term 2
divergence.term.2 <- 1/2 * trace.of.matrix (( cv.Matrix.1^ (-1) + cv.Matrix.2^ (-1) ) *
( mean.Matrix.1 - mean.Matrix.2 ) *
t ( mean.Matrix.1 - mean.Matrix.2 )
)
# divergence
divergence <- divergence.term.1 + divergence.term.2
# --%<------------------------------------------------------------------------
# and the transformed divergence
transformed.divergence <- 2 * ( 1 - exp ( - ( divergence / 8 ) ) )
indices <- data.frame(
jm=jm.distance,bh=bh.distance,div=divergence,tdiv=transformed.divergence)
return(indices)
}
And some silly reproducible examples:
##### EXAMPLE 1
# two samples
sample.1 <- c (1362, 1411, 1457, 1735, 1621, 1621, 1791, 1863, 1863, 1838)
sample.2 <- c (1362, 1411, 1457, 10030, 1621, 1621, 1791, 1863, 1863, 1838)
# separability between these two samples
separability.measures ( sample.1 , sample.2 )
##### EXAMPLE 2
# parameters for a normal distibution
meen <- 0.2
sdevn <- 2
x <- seq(-20,20,length=5000)
# two samples from two normal distibutions
normal1 <- dnorm(x,mean=0,sd=1) # standard normal
normal2 <- dnorm(x,mean=meen, sd=sdevn) # normal with the parameters selected above
# separability between these two normal distibutions
separability.measures ( normal1 , normal2 )
Note that these measures only work for two classes and 1 variable at a time, and sometimes have some assumptions (like the classes following a normal distibution) so you should read about them before using them thoroughly. But they still might suit your needs.
Best Answer
As noted in the question and comments, a dataset of $m$ points $(\boldsymbol{x}_i,y_i)$ with $\boldsymbol{x}_i\in\mathbb{R}^n$ and $y_i\in\{-1,+1\}$ is linearly separable if we can find a normal vector $\boldsymbol{a}$ and scalar bias $b$ such that the linear inequality $$y_i(\boldsymbol{a}^T\boldsymbol{x}_i+b)\geq 1$$ is satisfied for all $i=1,\ldots,m$. Physically this says that for each point $\boldsymbol{x}_i$, the signed distance $d_i$ to the separating plane has sign $y_i$ and magnitude at least $\|\boldsymbol{a}\|$.
This can be written also as $$\boldsymbol{w}^T\boldsymbol{z}_i \geq 1$$ where $$\boldsymbol{z}_i = \begin{bmatrix}\boldsymbol{x}_i \\ 1\end{bmatrix} y_i \text{ , } \boldsymbol{w}=\begin{bmatrix}\boldsymbol{a} \\ b\end{bmatrix}$$
As noted in the question, this is essentially a linear program with a "0" objective function.
The proposed "degree of separability" measure $S_\min$ can be expressed as $$ \min_{\boldsymbol{\sigma},\boldsymbol{w}} S[\boldsymbol{\sigma}] $$ where $\boldsymbol{\sigma}\in\{0,1\}^m$, the function $S$ is defined by $$ S[\boldsymbol{\sigma}] = \sum_i\sigma_i = \boldsymbol{1}^T\boldsymbol{\sigma} $$ and the minimization is subject to the constraint $$ \sigma_i(\boldsymbol{w}^T\boldsymbol{z}_i - 1) \geq 0 \text{ , } i=1,\ldots,m$$
This is a (slightly nonlinear) "mixed integer programming" problem, a problem class which is NP hard (even in the linear & binary case).
Now a more feasible alternative could be to solve a "soft-margin" classification problem using a standard approach like SVM*. The tolerance and/or lagrange multiplier variables could then be used to quantify "degree of separability".
(*For $\lambda\approx 0$ the SVM will essentially reduce to a "softened" version of the linear program above. However to get meaningful distances, you will have to normalize tolerances by $\|\boldsymbol{a}\|$.)
Update: Using something like soft-margin SVM can give you an idea of which points are "hard" to separate, but it would not necessarily address your question of "how many points must be removed to allow a hard-margin split?"
However, I think that if you pre-process the $\boldsymbol{X}$ matrix by whitening, then it should be possible to get some more consistent results. (In terms of numerical offset-scales and directional clustering of the problem points.)