I have often seen the statement that linear separability is more easily achieved in high dimensions, but I don't see why. Is it an empirical fact? A heuristic? Plain nonsense?

# Solved – Is it true that in high dimensions, data is easier to separate linearly?

Tags: large-data, linear-model, mathematical-statistics

#### Related Solutions

The most common measures of separability are based on how much the intra-class distributions overlap (probabilistic measures). There are several of these, such as the Jeffries-Matusita distance, the Bhattacharyya distance and the transformed divergence. You can easily find descriptions online, and they are quite straightforward to implement.

There are also some based on the behavior of nearest neighbors. The separability index basically looks at the proportion of neighbors that overlap. The hypothesis margin looks at the distance from an object's nearest neighbor of the same class (near-hit) and its nearest neighbor of the opposing class (near-miss), and creates a measure by summing over these differences.
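As a rough base R sketch of the hypothesis margin (the function name and the toy data are mine, not from any particular package): for each point, subtract the distance to its near-hit from the distance to its near-miss, then sum.

```
# Hypothesis margin: sum over all points of d(near-miss) - d(near-hit),
# i.e. distance to the nearest neighbour of the opposing class minus
# distance to the nearest neighbour of the same class. Larger = more separable.
hypothesis.margin <- function(X, y) {
  D <- as.matrix(dist(as.matrix(X)))   # pairwise Euclidean distances
  diag(D) <- Inf                       # a point is not its own neighbour
  sum(sapply(seq_along(y), function(i) {
    min(D[i, y != y[i]]) - min(D[i, y == y[i]])
  }))
}

# two tight, well-separated 1-D classes give a clearly positive margin
set.seed(1)
x <- c(rnorm(20, mean = 0, sd = 0.5), rnorm(20, mean = 10, sd = 0.5))
y <- rep(c(1, 2), each = 20)
hypothesis.margin(x, y)
```

Unlike the probabilistic measures below, this makes no distributional assumption; it only needs a distance on the feature space.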

And then you also have things like class scatter matrices and collective entropy.

**EDIT**

Probabilistic separability measures in R

```
separability.measures <- function(Vector.1, Vector.2) {
  # convert vectors to one-column matrices in case they are not
  Matrix.1 <- as.matrix(Vector.1)
  Matrix.2 <- as.matrix(Vector.2)
  # means and their difference
  mean.Matrix.1 <- colMeans(Matrix.1)
  mean.Matrix.2 <- colMeans(Matrix.2)
  mean.difference <- mean.Matrix.1 - mean.Matrix.2
  # covariances of the supplied matrices
  cv.Matrix.1 <- cov(Matrix.1)
  cv.Matrix.2 <- cov(Matrix.2)
  # define the half-sum of the cv's as "p"
  p <- (cv.Matrix.1 + cv.Matrix.2) / 2
  # --%<------------------------------------------------------------------------
  # calculate the Bhattacharyya distance (proper matrix products and solve(),
  # so this also works with more than one variable)
  bh.distance <- drop(
    0.125 * t(mean.difference) %*% solve(p) %*% mean.difference +
      0.5 * log(det(p) / sqrt(det(cv.Matrix.1) * det(cv.Matrix.2)))
  )
  # --%<------------------------------------------------------------------------
  # calculate Jeffries-Matusita
  # the following formula is bounded between 0 and 2.0
  jm.distance <- 2 * (1 - exp(-bh.distance))
  # also found in the bibliography (bounded between 0 and 1414.0):
  # jm.distance <- 1000 * sqrt(2 * (1 - exp(-bh.distance)))
  # --%<------------------------------------------------------------------------
  # calculate the divergence
  # trace (the sum of the diagonal elements) of a square matrix
  trace.of.matrix <- function(SquareMatrix) sum(diag(SquareMatrix))
  # term 1
  divergence.term.1 <- 1/2 * trace.of.matrix(
    (cv.Matrix.1 - cv.Matrix.2) %*% (solve(cv.Matrix.2) - solve(cv.Matrix.1)))
  # term 2 (mean.difference %*% t(mean.difference) is the outer product)
  divergence.term.2 <- 1/2 * trace.of.matrix(
    (solve(cv.Matrix.1) + solve(cv.Matrix.2)) %*%
      mean.difference %*% t(mean.difference))
  # divergence
  divergence <- divergence.term.1 + divergence.term.2
  # --%<------------------------------------------------------------------------
  # and the transformed divergence
  transformed.divergence <- 2 * (1 - exp(-(divergence / 8)))
  data.frame(jm = jm.distance, bh = bh.distance,
             div = divergence, tdiv = transformed.divergence)
}
```

And some silly reproducible examples:

```
##### EXAMPLE 1
# two samples
sample.1 <- c (1362, 1411, 1457, 1735, 1621, 1621, 1791, 1863, 1863, 1838)
sample.2 <- c (1362, 1411, 1457, 10030, 1621, 1621, 1791, 1863, 1863, 1838)
# separability between these two samples
separability.measures ( sample.1 , sample.2 )
##### EXAMPLE 2
# parameters for a normal distribution
meen <- 0.2
sdevn <- 2
x <- seq(-20, 20, length = 5000)
# density curves of two normal distributions
normal1 <- dnorm(x, mean = 0, sd = 1)        # standard normal
normal2 <- dnorm(x, mean = meen, sd = sdevn) # normal with the parameters selected above
# separability between these two normal distributions
separability.measures ( normal1 , normal2 )
```

Note that these measures only compare two classes at a time, and some carry assumptions (such as the classes following a normal distribution), so you should read up on them before using them. But they still might suit your needs.

As noted in the question and comments, a dataset of $m$ points $(\boldsymbol{x}_i,y_i)$ with $\boldsymbol{x}_i\in\mathbb{R}^n$ and $y_i\in\{-1,+1\}$ is linearly separable if we can find a normal vector $\boldsymbol{a}$ and scalar bias $b$ such that the linear inequality $$y_i(\boldsymbol{a}^T\boldsymbol{x}_i+b)\geq 1$$ is satisfied for all $i=1,\ldots,m$. Physically this says that for each point $\boldsymbol{x}_i$, the signed distance $d_i$ to the separating plane has sign $y_i$ and magnitude at least $1/\|\boldsymbol{a}\|$.

This can be written also as $$\boldsymbol{w}^T\boldsymbol{z}_i \geq 1$$ where $$\boldsymbol{z}_i = \begin{bmatrix}\boldsymbol{x}_i \\ 1\end{bmatrix} y_i \text{ , } \boldsymbol{w}=\begin{bmatrix}\boldsymbol{a} \\ b\end{bmatrix}$$

As noted in the question, this is essentially a linear program with a "0" objective function.
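As a sketch of that feasibility check in base R (the perceptron here is my stand-in for an LP solver: it finds a separator whenever one exists, and any $\boldsymbol{w}$ with $\boldsymbol{w}^T\boldsymbol{z}_i > 0$ for all $i$ can be rescaled to satisfy $\boldsymbol{w}^T\boldsymbol{z}_i \geq 1$):

```
# Perceptron on the augmented vectors z_i = y_i * (x_i, 1): look for w with
# w'z_i > 0 for all i. It converges iff the data are linearly separable,
# so cap the iterations when using it as a separability test.
find.separator <- function(X, y, max.iter = 10000) {
  Z <- cbind(X, 1) * y               # row i is z_i' = y_i * (x_i', 1)
  w <- rep(0, ncol(Z))
  for (iter in seq_len(max.iter)) {
    bad <- which(Z %*% w <= 0)       # constraints w'z_i > 0 still violated
    if (length(bad) == 0) return(w)  # separator found
    w <- w + Z[bad[1], ]             # classic perceptron update
  }
  NULL                               # gave up: likely not separable
}

# two well-separated point clouds in the plane
set.seed(42)
X <- rbind(matrix(rnorm(40), 20, 2), matrix(rnorm(40, mean = 6), 20, 2))
y <- rep(c(-1, 1), each = 20)
w <- find.separator(X, y)
all((cbind(X, 1) %*% w) * y > 0)     # every point on its own side
```

An LP solver would answer the same yes/no question in guaranteed polynomial time; the perceptron is just the dependency-free version.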

The proposed "degree of separability" measure $S_\min$ can be expressed as $$ \min_{\boldsymbol{\sigma},\boldsymbol{w}} S[\boldsymbol{\sigma}] $$ where $\boldsymbol{\sigma}\in\{0,1\}^m$, the function $S$ counts the relaxed constraints, $$ S[\boldsymbol{\sigma}] = \sum_i (1-\sigma_i) = m - \boldsymbol{1}^T\boldsymbol{\sigma} $$ and the minimization is subject to the constraint $$ \sigma_i(\boldsymbol{w}^T\boldsymbol{z}_i - 1) \geq 0 \text{ , } i=1,\ldots,m$$ Setting $\sigma_i=0$ switches off the separation constraint for point $i$, so $S_\min$ is the smallest number of points that must be dropped to make the remainder separable. (Minimizing $\boldsymbol{1}^T\boldsymbol{\sigma}$ directly would be trivial, since the all-zero $\boldsymbol{\sigma}$ relaxes every constraint.)

This is a (slightly nonlinear) mixed-integer programming problem, a problem class that is NP-hard (even in the linear, binary case).

Now a more tractable alternative would be to solve a "soft-margin" classification problem using a standard approach like an SVM*. The slack and/or Lagrange multiplier variables could then be used to quantify the "degree of separability".

(*For $\lambda\approx 0$ the SVM essentially reduces to a "softened" version of the linear program above. However, to get meaningful distances you will have to normalize the slacks by $\|\boldsymbol{a}\|$.)
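A minimal base R sketch of this idea (a plain subgradient descent on the regularized hinge loss rather than a real SVM solver; `lambda`, `eta` and the iteration count are arbitrary choices of mine):

```
# Soft-margin linear classifier by subgradient descent on
#   lambda/2 * ||a||^2 + (1/m) * sum_i max(0, 1 - y_i (a'x_i + b)).
# Returns (a, b) and the per-point slacks, normalized by ||a|| so that
# they behave like distances past the margin.
soft.margin.slacks <- function(X, y, lambda = 1e-3, eta = 0.1, iters = 5000) {
  X <- as.matrix(X)
  m <- length(y)
  a <- rep(0, ncol(X)); b <- 0
  for (t in seq_len(iters)) {
    viol <- as.vector(y * (X %*% a + b)) < 1   # margin violators
    grad.a <- lambda * a
    grad.b <- 0
    if (any(viol)) {
      grad.a <- grad.a - colSums(X[viol, , drop = FALSE] * y[viol]) / m
      grad.b <- grad.b - sum(y[viol]) / m
    }
    a <- a - eta * grad.a
    b <- b - eta * grad.b
  }
  slack <- pmax(0, 1 - as.vector(y * (X %*% a + b)))
  list(a = a, b = b, slack = slack / sqrt(sum(a * a)))
}

# two separable clouds plus one deliberately mislabeled point
set.seed(7)
X <- rbind(matrix(rnorm(40), 20, 2),
           matrix(rnorm(40, mean = 6), 20, 2),
           c(6, 6))
y <- c(rep(-1, 20), rep(1, 20), -1)   # last point carries the wrong label
fit <- soft.margin.slacks(X, y)
which.max(fit$slack)                  # the conspicuously hard point
```

Points with zero normalized slack are cleanly separated; large slacks flag exactly the points a hard-margin split would have to remove.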

**Update:** Using something like soft-margin SVM can give you an idea of which points are "hard" to separate, but it would not necessarily address your question of "how many points must be removed to allow a hard-margin split?"

However, I think that if you pre-process the $\boldsymbol{X}$ matrix by whitening, then it should be possible to get some more consistent results. (In terms of numerical offset-scales and directional clustering of the problem points.)

## Best Answer

Trivially, if you have $N$ data points, they will be linearly separable in $N-1$ dimensions. Any structure in the data may reduce the required dimensionality for linear separation further. You might say that (a projection of) a data set either is, or is not, completely linearly separable. Achieving separation in any dimensionality lower than $N-1$ requires either additional properties of the data or of the projection into this higher dimensionality, or it can be viewed as a heuristic (for instance in the case of random projections). In general we usually do not care too much about exact separability; in that case it is sufficient that we can meaningfully separate *more* data points correctly in higher dimensions.
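A quick base R check of the trivial claim (sizes and seed are arbitrary): for $N$ random points in $\mathbb{R}^{N-1}$, the augmented matrix $[X \;\; \boldsymbol{1}]$ is $N\times N$ and almost surely invertible, so a hyperplane can interpolate *any* labelling exactly, and in particular separate it.

```
set.seed(123)
N <- 10
X <- matrix(rnorm(N * (N - 1)), N, N - 1)  # N random points in N-1 dimensions
y <- sample(c(-1, 1), N, replace = TRUE)   # an arbitrary labelling
A <- cbind(X, 1)                           # augmented N x N matrix, a.s. invertible
w <- solve(A, y)                           # hyperplane coefficients with A %*% w = y
all(sign(A %*% w) == y)                    # this labelling is linearly separated
```

Because `y` can be any of the $2^N$ labellings, this is the degenerate sense in which high dimensionality makes separation easy; with fewer dimensions than $N-1$, separation genuinely depends on the structure of the data.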