Solved – measure to describe the degree of linear separability

linear · linear model · linear programming · machine learning · neural networks

I know that, given two sets of points, one can use linear programming to check whether there is a hyperplane that linearly separates the two data sets. But this only applies to complete linear separability. In the case that the two sets can only be linearly separated by neglecting some points in both sets, is there a measure that describes the degree of linear separability?


In other words, the problem might be transformed to

At least how many points do we need to remove to achieve linear
separability? (The degree could then be measured as the number of removed
points divided by the sample size.)

But is this computationally expensive to implement? Or is there a better way to measure the degree of linear separability?

Best Answer

As noted in the question and comments, a dataset of $m$ points $(\boldsymbol{x}_i,y_i)$ with $\boldsymbol{x}_i\in\mathbb{R}^n$ and $y_i\in\{-1,+1\}$ is linearly separable if we can find a normal vector $\boldsymbol{a}$ and scalar bias $b$ such that the linear inequality $$y_i(\boldsymbol{a}^T\boldsymbol{x}_i+b)\geq 1$$ is satisfied for all $i=1,\ldots,m$. Geometrically, this says that for each point $\boldsymbol{x}_i$, the signed distance $d_i$ to the separating hyperplane has sign $y_i$ and magnitude at least $1/\|\boldsymbol{a}\|$.

This can be written also as $$\boldsymbol{w}^T\boldsymbol{z}_i \geq 1$$ where $$\boldsymbol{z}_i = \begin{bmatrix}\boldsymbol{x}_i \\ 1\end{bmatrix} y_i \text{ , } \boldsymbol{w}=\begin{bmatrix}\boldsymbol{a} \\ b\end{bmatrix}$$

As noted in the question, this is essentially a linear program with a "0" objective function.
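As a sketch (assuming SciPy; the helper name is illustrative), the feasibility check can be run with `linprog` using a zero cost vector. Note that `linprog` defaults to non-negative variable bounds, so the bounds on $\boldsymbol{w}$ must be opened up explicitly:

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Feasibility LP: does some w satisfy w^T z_i >= 1 for all i?"""
    m, n = X.shape
    # z_i = y_i * [x_i; 1], stacked as rows of Z
    Z = np.hstack([X, np.ones((m, 1))]) * y[:, None]
    res = linprog(
        c=np.zeros(n + 1),              # "0" objective: feasibility only
        A_ub=-Z, b_ub=-np.ones(m),      # Z w >= 1  rewritten as  -Z w <= -1
        bounds=[(None, None)] * (n + 1),  # w is unrestricted in sign
        method="highs",
    )
    return res.status == 0  # 0 = solved (feasible), 2 = infeasible
```

For example, the XOR configuration in the plane is rejected, while two well-separated clusters pass.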

The proposed "degree of separability" measure $S_\min$ can be expressed as $$ S_\min = \min_{\boldsymbol{\sigma},\boldsymbol{w}} S[\boldsymbol{\sigma}] $$ where $\boldsymbol{\sigma}\in\{0,1\}^m$ indicates which points are removed ($\sigma_i=1$), the function $S$ counts the removed points $$ S[\boldsymbol{\sigma}] = \sum_i\sigma_i = \boldsymbol{1}^T\boldsymbol{\sigma} $$ and the minimization is subject to the constraint $$ (1-\sigma_i)(\boldsymbol{w}^T\boldsymbol{z}_i - 1) \geq 0 \text{ , } i=1,\ldots,m$$ i.e. every point that is kept must satisfy the separability inequality.

This is a (slightly nonlinear) "mixed-integer programming" problem, a problem class that is NP-hard (even in the linear, binary case).
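That said, small instances are tractable: once $\boldsymbol{w}$ is bounded, the product constraint can be linearized with a standard big-M reformulation, $\boldsymbol{w}^T\boldsymbol{z}_i \geq 1 - M\sigma_i$, and handed to an off-the-shelf MILP solver. A sketch using SciPy's `milp` (the helper name and the values of `M` and `w_bound` are illustrative assumptions; both must dominate the data scale):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def min_points_to_remove(X, y, M=100.0, w_bound=10.0):
    """Big-M MILP: minimize the number of removed points (sigma_i = 1)
    so that the kept points satisfy w^T z_i >= 1."""
    m, n = X.shape
    Z = np.hstack([X, np.ones((m, 1))]) * y[:, None]   # z_i = y_i [x_i; 1]
    # decision variables: w (n+1 continuous), sigma (m binary)
    c = np.concatenate([np.zeros(n + 1), np.ones(m)])  # minimize sum sigma
    # Z w + M sigma >= 1: setting sigma_i = 1 frees point i
    A = np.hstack([Z, M * np.eye(m)])
    cons = LinearConstraint(A, lb=np.ones(m), ub=np.inf)
    integrality = np.concatenate([np.zeros(n + 1), np.ones(m)])
    bounds = Bounds(
        np.concatenate([-w_bound * np.ones(n + 1), np.zeros(m)]),
        np.concatenate([w_bound * np.ones(n + 1), np.ones(m)]),
    )
    res = milp(c=c, constraints=cons, integrality=integrality, bounds=bounds)
    return int(round(res.fun))
```

Dividing the returned count by $m$ gives the proposed degree-of-separability ratio. Runtime still grows exponentially in the worst case, but solvers prune aggressively on benign data.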

A more computationally feasible alternative is to solve a "soft-margin" classification problem using a standard approach such as an SVM*. The slack (tolerance) and/or Lagrange multiplier variables could then be used to quantify the "degree of separability".

(*For $\lambda\approx 0$ the SVM essentially reduces to a "softened" version of the linear program above. However, to get meaningful distances, you will have to normalize the slacks by $\|\boldsymbol{a}\|$.)
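A sketch of this route, assuming scikit-learn (the function name and the choice $C=10$, which plays the role of $1/\lambda$, are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def soft_separability(X, y, C=10.0):
    """Fit a soft-margin linear SVM and summarize margin violations."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    f = y * clf.decision_function(X)       # functional margins y_i f(x_i)
    a_norm = np.linalg.norm(clf.coef_)     # ||a||, for distance scaling
    slack = np.maximum(0.0, 1.0 - f)       # soft-margin tolerances xi_i
    return {
        "frac_margin_violations": float(np.mean(f < 1.0)),
        "frac_misclassified": float(np.mean(f < 0.0)),
        "mean_slack_distance": float(slack.mean() / a_norm),
    }
```

On a separable dataset the misclassification fraction is zero; on an inseparable one, at least one margin violation must appear, and the slack-based quantities give a graded notion of "how inseparable" the data are.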


Update: Using something like soft-margin SVM can give you an idea of which points are "hard" to separate, but it would not necessarily address your question of "how many points must be removed to allow a hard-margin split?"

However, I think that if you pre-process the $\boldsymbol{X}$ matrix by whitening it, you should get more consistent results (in terms of the numerical offset scales and the directional clustering of the problem points).
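A minimal whitening sketch (ZCA via an eigendecomposition of the sample covariance; the function name and the regularizer `eps` are illustrative choices):

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """Center X and transform it so its sample covariance is ~identity."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    # ZCA whitening matrix: V diag(1/sqrt(lambda + eps)) V^T
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W
```

After this transform, slacks measured in different directions become comparable, since no single feature dominates the scale of $\|\boldsymbol{a}\|$.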