Solved – SMOTE using unbalanced package in R fails on simple simulated data

Tags: machine learning, r, synthetic data, unbalanced-classes

SMOTE is a popular method to generate synthetic examples of the minority class in an unbalanced-class data set.
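
For readers unfamiliar with the idea, the core step can be sketched in base R: pick a minority point, pick one of its k nearest minority neighbours, and place a new point at a random position on the segment between them. This is only an illustration of the interpolation step; smote_sketch() is a hypothetical helper written for this post and is not how either package is implemented.

```r
# Minimal SMOTE-style interpolation in base R (illustrative only).
# smote_sketch() is a hypothetical helper, not part of any package.
smote_sketch <- function(X_min, n_new, k = 1) {
  X_min <- as.matrix(X_min)
  n <- nrow(X_min)
  # Pairwise Euclidean distances between minority points
  D <- as.matrix(dist(X_min))
  diag(D) <- Inf  # a point must not be its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)             # pick a minority point at random
    nn <- order(D[base, ])[seq_len(k)]   # indices of its k nearest neighbours
    nb <- nn[sample.int(k, 1)]           # pick one of those neighbours
    gap <- runif(1)                      # random position along the segment
    synth[i, ] <- X_min[base, ] + gap * (X_min[nb, ] - X_min[base, ])
  }
  synth
}

set.seed(1)
X_min <- matrix(rnorm(100), ncol = 2) + 5  # 50 simulated minority points
new_pts <- smote_sketch(X_min, n_new = 100, k = 1)
dim(new_pts)
```

Because every synthetic point is a convex combination of two existing minority points, the new points always stay inside the range already covered by the minority class.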

I am trying out SMOTE in the "unbalanced" package in R. I generate a simple simulated data set, but SMOTE seems to fail on it, and I am not sure what the problem is.

library(unbalanced)
set.seed(1)
X <- matrix(rnorm(1000), ncol = 2)
X[1:50,] <- X[1:50,]+5
Y <- as.factor(c(rep(1,50), rep(0,450)))
smoted <- ubSMOTE(X,Y,k=1)
#WARNING: NAs generated by SMOTE removed 
dim(smoted$X)
#[1] 50  2

I would expect smoted to be a larger data set consisting of the original data plus the synthetic examples. Using other values of k or perc.over makes no difference.

EDIT

When using the SMOTE function in the DMwR library I get the expected result. So the problem seems to be in the unbalanced library.

library(DMwR)
df <- data.frame(X,y=Y)
smoted2 <- SMOTE(y~.,data=df,k=1)
dim(smoted2)
#[1] 350   3
table(smoted2$y)
#  0   1 
#200 150

Best Answer

I know it's a very old question, but I just ran into it.

The problem is the following: unbalanced::ubSMOTE() will behave the same as DMwR::SMOTE() if its X input is given as a data.frame rather than a matrix. If you modify your code in the following way, the two functions will produce the same result.

library(unbalanced)
set.seed(1)
X <- matrix(rnorm(1000), ncol = 2)
Y <- factor(c(rep(1, 50), rep(0, 450)))
set.seed(1) # reset the seed: ubSMOTE(), like SMOTE(), draws random numbers
smoted <- ubSMOTE(data.frame(X), Y, k = 1)
smoted <- cbind(smoted$X, smoted$Y)

library(DMwR)
set.seed(1)
smoted2 <- SMOTE(y ~ ., data = data.frame(X, y = Y), k = 1)
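
A small guard can make the coercion explicit so that a matrix does not slip through again. The following is a sketch under assumptions, not part of either package: to_smote_input() is a hypothetical helper name, and the ubSMOTE() call is skipped when the unbalanced package is not installed.

```r
# Hypothetical helper: make sure the predictors are a data.frame before SMOTE.
to_smote_input <- function(X) {
  if (is.data.frame(X)) X else as.data.frame(X)
}

# Same simulated data as in the question
set.seed(1)
X <- matrix(rnorm(1000), ncol = 2)
X[1:50, ] <- X[1:50, ] + 5
Y <- factor(c(rep(1, 50), rep(0, 450)))

X_df <- to_smote_input(X)
class(X_df)

# Only run the package call if unbalanced is available.
if (requireNamespace("unbalanced", quietly = TRUE)) {
  set.seed(1)
  res <- unbalanced::ubSMOTE(X_df, Y, k = 1)
  print(table(res$Y))  # class counts after resampling
}
```

Printing the class table is a quick way to check that synthetic minority examples were actually added instead of being dropped as NAs.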