Regression – SMOTE for Logistic Regression Model Producing Worse Results


Not sure why using more samples from SMOTE() could lower the overall accuracy:

from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample the minority class up to a 0.4 minority/majority ratio
over = SMOTE(sampling_strategy=0.4)
X, y = over.fit_resample(X, y)
counter = Counter(y)
print(counter)

Counter({'no': 19548, 'yes': 7819})

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Instantiate the logistic regression classifier: logreg
logreg = LogisticRegression()
# Fit it to the training data
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[3616  293]
 [ 964  601]]
              precision    recall  f1-score   support

          no       0.79      0.93      0.85      3909
         yes       0.67      0.38      0.49      1565

    accuracy                           0.77      5474
   macro avg       0.73      0.65      0.67      5474
weighted avg       0.76      0.77      0.75      5474

The original result without SMOTE():

[[3897   41]
 [ 514   48]]
              precision    recall  f1-score   support

          no       0.88      0.99      0.93      3938
         yes       0.54      0.09      0.15       562

    accuracy                           0.88      4500
   macro avg       0.71      0.54      0.54      4500
weighted avg       0.84      0.88      0.84      4500

(Features were one-hot encoded with get_dummies and scaled with MinMaxScaler.)

Thanks a lot!!

Best Answer

The objective of SMOTE is to oversample the minority class with synthetic data (it is often combined with undersampling of the majority class), and then train the model on the resampled data.
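The core of SMOTE is interpolation between a minority sample and one of its nearest minority-class neighbours. A minimal sketch of that idea (a simplified illustration, not imblearn's actual implementation; `smote_like_oversample` is a hypothetical helper name):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=None):
    """Create synthetic minority points by interpolating between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))          # pick a minority sample
        nb = X_min[rng.choice(idx[j, 1:])]    # pick one of its neighbours
        gap = rng.random()                    # random point on the segment
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic

# Toy minority class: 20 points in 2-D
X_min = np.random.default_rng(0).normal(size=(20, 2))
X_new = smote_like_oversample(X_min, n_new=30, k=5, seed=0)
print(X_new.shape)  # (30, 2)
```

Because each synthetic point lies on a segment between two existing minority samples, SMOTE never generates data outside the convex hull of the minority class.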

This is helpful if you are more concerned about misclassifying the minority class as the majority class than vice versa. However, if accuracy (the percentage of samples labeled correctly) is all you care about, then most of that metric's value comes from labeling the majority class correctly, in which case I would not recommend SMOTE.
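To see how accuracy rewards the majority class, consider a hypothetical 90/10 split where a "classifier" simply predicts the majority class for everything:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 900 negatives, 100 positives
y_true = np.array([0] * 900 + [1] * 100)
# A degenerate model that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))             # 0.9
print(recall_score(y_true, y_pred, pos_label=1))  # 0.0
```

The model scores 90% accuracy while catching none of the minority class, which is essentially what the original (no-SMOTE) run above is doing with its 0.09 recall on "yes".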

It is possible that SMOTE's synthetic data reduces some overfitting on the minority class and thereby improves accuracy, but I would expect that effect to be overshadowed by the change in objective: from weighting every sample equally to biasing towards the minority class.

Without seeing some additional statistics, it could also be that your features or model type are not well suited to the problem; e.g., a sparse minority-class ring lying closely around a dense majority-class ring/circle cannot be separated by logistic regression's linear decision boundary.
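The ring example is easy to reproduce with `make_circles`: a plain logistic regression is near chance on concentric rings, while adding the squared radius as a feature makes the same model nearly perfect (a toy sketch, not your data):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Concentric rings: no linear boundary can separate the classes
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
acc_linear = LogisticRegression().fit(X, y).score(X, y)

# Adding the squared radius as a feature makes the classes linearly separable
X2 = np.column_stack([X, (X ** 2).sum(axis=1)])
acc_radial = LogisticRegression().fit(X2, y).score(X2, y)
print(acc_linear, acc_radial)  # roughly 0.5 vs close to 1.0
```

No amount of resampling fixes this kind of mismatch; feature engineering or a non-linear model does.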
