Classification – Measure for Separability in SVM and Linear Models

classificationdistancelinear modelsvm

I have a (binary) classification problem where after merging single training data points (that can be tracked back to the same source) into aggregates, test accuracy (on single data points again) increases significantly. By merging I mean adding and averaging feature vectors. A SVM classifier/linear kernel is used. (The training data is noisy since it is semi-automatically generated).

So there is the weird situation that making training and test data more 'dissimilar' (aggregates vs. single data points) increases performance. (Moreover, for plain linear regression, adding together feature vectors shouldn't make a difference, I guess).

I'm trying to find out what might be the reason for that. One hypothesis is that for large margin classifiers like SVM's separability is crucial, and that by aggregating, there is less overlap between classes caused by noise.

It would be nice to have a measure indicating the amount of 'overlap' or 'separability'. What are standard measures here?
I plan to do some plot of the first two PCA dimensions and see how things look, but some 'credible number' would be good in addition.

Best Answer

The most common measures of separability are based on how much the intra-class distributions overlap (probabilistic measures). There are a couple of these, Jeffries-Matusita distance, Bhattacharya distance and the transformed divergence. You can easily google up some descriptions. They are quite straightforward to implement.

There also some based on the behavior of nearest neighbors. The separability index, which basically looks at the proportion of neighbors that overlap. And the Hypothesis margin which looks at the distance from an object’s nearest neighbour of the same class (near-hit) and a nearest neighbour of the opposing class (near-miss). Then creates a measure by summing over this.

And then you also have things like class scatter matrices and collective entropy.

EDIT

Probabilistic separability measures in R

separability.measures <- function ( Vector.1 , Vector.2 ) {
# convert vectors to matrices in case they are not
  Matrix.1 <- as.matrix (Vector.1)
  Matrix.2 <- as.matrix (Vector.2)
# define means
mean.Matrix.1 <- mean ( Matrix.1 )
mean.Matrix.2 <- mean ( Matrix.2 )
# define difference of means
mean.difference <- mean.Matrix.1 - mean.Matrix.2
# define covariances for supplied matrices
cv.Matrix.1 <- cov ( Matrix.1 )
cv.Matrix.2 <- cov ( Matrix.2 )
# define the halfsum of cv's as "p"
p <- ( cv.Matrix.1 + cv.Matrix.2 ) / 2
# --%<------------------------------------------------------------------------
# calculate the Bhattacharryya index
bh.distance <- 0.125 *t ( mean.difference ) * p^ ( -1 ) * mean.difference +
0.5 * log (det ( p ) / sqrt (det ( cv.Matrix.1 ) * det ( cv.Matrix.2 )
)
)
# --%<------------------------------------------------------------------------
# calculate Jeffries-Matusita
# following formula is bound between 0 and 2.0
jm.distance <- 2 * ( 1 - exp ( -bh.distance ) )
# also found in the bibliography:
# jm.distance <- 1000 * sqrt (   2 * ( 1 - exp ( -bh.distance ) )   )
# the latter formula is bound between 0 and 1414.0
# --%<------------------------------------------------------------------------
# calculate the divergence
# trace (is the sum of the diagonal elements) of a square matrix
trace.of.matrix <- function ( SquareMatrix ) {
sum ( diag ( SquareMatrix ) ) }
# term 1
divergence.term.1 <- 1/2 * trace.of.matrix (( cv.Matrix.1 - cv.Matrix.2 ) * 
( cv.Matrix.2^ (-1) - cv.Matrix.1^ (-1) )
)
# term 2
divergence.term.2 <- 1/2 * trace.of.matrix (( cv.Matrix.1^ (-1) + cv.Matrix.2^ (-1) ) *
( mean.Matrix.1 - mean.Matrix.2 ) *
t ( mean.Matrix.1 - mean.Matrix.2 )
)
# divergence
divergence <- divergence.term.1 + divergence.term.2
# --%<------------------------------------------------------------------------
# and the transformed divergence
transformed.divergence  <- 2 * ( 1 - exp ( - ( divergence / 8 ) ) )
indices <- data.frame(
jm=jm.distance,bh=bh.distance,div=divergence,tdiv=transformed.divergence)
return(indices)
}

And some silly reproducible examples:

##### EXAMPLE 1
# two samples
sample.1 <- c (1362, 1411, 1457, 1735, 1621, 1621, 1791, 1863, 1863, 1838)
sample.2 <- c (1362, 1411, 1457, 10030, 1621, 1621, 1791, 1863, 1863, 1838)

# separability between these two samples
separability.measures ( sample.1 , sample.2 )

##### EXAMPLE 2
# parameters for a normal distibution
meen <- 0.2
sdevn <- 2
x <- seq(-20,20,length=5000)
# two samples from two normal distibutions
normal1 <- dnorm(x,mean=0,sd=1) # standard normal
normal2 <- dnorm(x,mean=meen, sd=sdevn) # normal with the parameters selected above

# separability between these two normal distibutions
separability.measures ( normal1 , normal2 )

Note that these measures only work for two classes and 1 variable at a time, and sometimes have some assumptions (like the classes following a normal distibution) so you should read about them before using them thoroughly. But they still might suit your needs.

Related Solutions

Machine Learning – Test for Linear Separability Techniques

Well, support vector machines (SVM) are probably, what you are looking for. For example, SVM with a linear RBF kernel, maps feature to a higher dimenional space and tries to separet the classes by a linear hyperplane. This is a nice short SVM video illustrating the idea.

You may wrap SVM with a search method for feature selection (wrapper model) and try to see if any of your features can linearly sparate the classes you have.

There are many interesting tools for using SVM including LIBSVM, MSVMPack and Scikit-learn SVM.

Solved – What do these decision boundaries indicate in random forest and svm

The images you present are the same as those here: link.

The following is some code, translated to R with some adjustments, to work through this. The RF selected (2 trees) is not acceptable. This is not apples-to-apples, so any of the authors' assertions about "entropy" can be mis-informative.

First we get the data:

#reproducibility
set.seed(136526) #I like to use question number as random seed

#libraries
library(data.table)      #to read the url
library(randomForest)    #to have randomForests 
library(miscTools)       #column medians

#main program

#get data
wine_df = fread("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv")

#conver to frame
wine_df <- as.data.frame(wine_df)

#parse data
Y <- (wine_df[,12])
X <- wine_df[,-12]

Next we find the right size of random forest for it.

max_trees <- 100     #same range
N_retest <- 35      #fair sample size

err <- matrix(0,max_trees,N_retest)  #initialize for the loop

for (i in 1:max_trees){

     for (j in 1:N_retest){

          #fit random forest with "i" number of trees
          my_rf <- randomForest(x = X, y = Y, ntree = i)

          #pop out sum of squared residuals divided by n
          temp <- mean(my_rf$mse)
          err[i,j] <- temp
     }
}

Now we can look at how many elements should be in the ensemble:

#make friendly for boxplot
err_frame <- as.data.frame(t(err))
names(err_frame) <- as.character(1:max_trees)

#central tendency
my_meds <- colMedians((err_frame))

#normalized slope of central tendency
est <- smooth.spline(x = 1:max_trees,y = my_meds,spar = 0.7)
pred <- predict(est)
my_sl <- c(diff(pred$y)/diff(pred$x))
my_sl <- (0.7-0.4)*(my_sl-min(my_sl))/(max(my_sl)-min(my_sl))+0.4

#make boxplot
boxplot(err_frame, 
        main = "MSE vs. number of trees", 
        xlab = "number of trees", 
        ylab = "forest mean MSE", xlim= c(0,75))

#draw central tendency (red)
lines(est, col="red", lwd=2)

#draw slope
lines(pred$x,c(0.4,my_sl),col="green")
points(pred$x,c(0.4,my_sl),col="green", pch=16)

grid()
legend(x = 60,y = 0.6,c("bxp","fit","slope"), 
       col = c("black","Red","Green"), 
       lty = c(NA, 1,1), 
       pch = c(22,-1,20),
       pt.cex = c(1.2,1,1) )

And it gives us this, which I then manually draw blue and black lines on in a version of midangle-skree heuristic to get a "decent" ensemble size of 30. It is two tangent lines from the slope: one at highest slope, one at right end of domain. We make a ray from intersection of those tangent lines to the slope-line along the mid-angle. The next highest point after the intersection informs tree-count.

Now that we have a decent random forest we can look at errors. First we compute the error.

# make "final" model
my_rf_fin <- randomForest(x = X, y = (Y), ntree = 30)

#predict on it
pred_fin <- predict(my_rf_fin)

#compute error
fit_err <- pred_fin - Y

The first plots to start with are basic EDA plots including the 4-plot of error.

#EDA on error
par(mfrow = n2mfrow(4) )

#run seq
plot(fit_err, type="l")
grid()

#lag plot
plot(fit_err[2:length(fit_err)],fit_err[1:(length(fit_err)-1)] )
abline(a = 0,b=1, col="Green", lwd=2)
grid()

#histogram
hist(fit_err,breaks = 128, main = "")
grid()

#normal quantile
qqnorm(fit_err, main = "")
grid()

par(mfrow = c(1,1))

Which yields:

The error is reasonably well behaved. It is narrow tailed. There is a non-Gaussian set of samples on the right side of the lag plot. The central part of the distribution looks triangular. It isn't Gaussian, but it wasn't expected to be. This is a discrete level output modeled as continuous.

Here is a variability plot of actual vs. predicted, and of error vs. predicted.

If systematically over-predicts the poorest class as better than rated, and under-predicts the highest class as poorer than rated.

This random forest is less poorly constructed, and likely is a healthier function approximator.

Next steps: make the boundary plot like yours on the first 2 principle components.

Notes on the code:

I'm not a big scikit.learn guy, so I am going to misunderstand parts of what they are doing. Standard disclaimers apply.
Two trees in an ensemble is a contradiction in terms like "one man
army". The random forest is no "one man army" because it would be
CART as a non-weak learner. The author did a disservice to an
ensemble learner by selecting 2 elements as the ensemble size. The
big joy of a random forest is you can add ensemble elements. Never
(ever) accept a random forest smaller than 20 trees. Double-check
any forest smaller than 50 trees.
The author has no split between training/validation or test. They
use all the data to fit the learners. A better way is to split into those groups then determine the ensemble parameters, then make the model with the combined train/valid data. I don't see that here.
Author does not specify whether the "y" is discretized or continuous. This means the RF might be living in regression instead of classification.

Best Answer

Related Solutions

Machine Learning – Test for Linear Separability Techniques

Solved – What do these decision boundaries indicate in random forest and svm

Related Question