Solved – Improve precision/recall for class imbalance

classification, python, random-forest, scikit-learn, unbalanced-classes

Trying to get better precision/recall for both classes … any tips?

  • I have heterogeneous features [a few numeric variables, a few categorical variables, and 2 text variables]
  • The target is binary with class imbalance [about 85% class 1 and 15% class 0]
  • I don't have much training data [only around 17K rows]

Here is my pipeline:

from scipy.stats import randint as sp_randint
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder, StandardScaler

# SPLIT_PATTERN, stopwords, CAT_FEATURES, NUM_FEATURES and TEXT_FEATURES
# are defined elsewhere in my code.

cat_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])

num_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('num_scaler', StandardScaler())])

text_transformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=SPLIT_PATTERN,
                                 stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()

text_transformer_1 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=SPLIT_PATTERN,
                                 stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()

FE = ColumnTransformer(
    transformers=[
        ('cat', cat_transformer, CAT_FEATURES),
        ('num', num_transformer, NUM_FEATURES),
        ('text0', text_transformer_0, TEXT_FEATURES[0]),
        ('text1', text_transformer_1, TEXT_FEATURES[1])])

pipe = Pipeline(steps=[('feature_engineer', FE),
                       ('scales', MaxAbsScaler()),
                       ('rand_forest', RandomForestClassifier(n_jobs=-1, class_weight='balanced'))])

random_grid = {"rand_forest__max_depth": [3, 10, 100, None],
               "rand_forest__n_estimators": sp_randint(10, 100),
               "rand_forest__max_features": ["auto", "sqrt", "log2", None],
               "rand_forest__bootstrap": [True, False],
               "rand_forest__criterion": ["gini", "entropy"]}

strat_shuffle_fold = StratifiedKFold(n_splits=5,
                                     random_state=123,
                                     shuffle=True)

cv_train = RandomizedSearchCV(pipe, param_distributions=random_grid, cv=strat_shuffle_fold)
cv_train.fit(X_train, y_train)

from sklearn.metrics import classification_report, confusion_matrix
preds = cv_train.predict(X_test)
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))

On average, across many different combinations of attempts, the classification report gives me roughly:

  • class 1 => approx. 95% precision; 98% recall
  • class 0 => approx. 80-85% precision; 57-66% recall

With stratified k-fold shuffling and class_weight='balanced' I can get class 0 recall up to 66%, but I would like to reach around 75%-80%.

Questions:

  1. Are there any other feature engineering techniques I can use to improve prediction of class 0? [I have tried different things on the text such as TF-IDF, the hashing trick, SelectKBest, TruncatedSVD, and MaxAbsScaler on all features]
  2. Are there any other algorithms I should try? [I have only tried a random forest classifier]
  3. Is low recall a big deal?
  4. Mostly I have been just "plugging and playing" … anything obvious I am missing?
  5. Would applying over-sampling help? If so, how can that be done in Python / sklearn?

Any help would be much appreciated!

Best Answer

It's clear that your models are suffering from the imbalance in your data, which is something you'll need to address. Now, on to your questions:

Are there any other feature engineering techniques I can use to improve prediction of class 0? [I have tried different things on the text such as TF-IDF, the hashing trick, SelectKBest, TruncatedSVD, and MaxAbsScaler on all features]

These are all valid preprocessing steps, but no feature engineering step will fix your real problem (i.e. class imbalance). They help with other issues such as high dimensionality, overfitting, etc.

Are there any other algorithms I should try? [I have only tried a random forest classifier]

Tree-based algorithms are usually among the best suited to imbalanced data. You could try one of the gradient-boosted tree libraries that are popular these days (e.g. XGBoost, LightGBM, CatBoost).
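
For illustration, here is a minimal sketch (my own, not a prescription) of what that swap could look like with LightGBM, reusing the FE transformer and MaxAbsScaler from your pipeline; it assumes lightgbm is installed and the hyperparameter values are just placeholders to tune:

# Rough sketch: swap the random forest for LightGBM, keeping the same preprocessing.
from lightgbm import LGBMClassifier

lgbm_pipe = Pipeline(steps=[
    ('feature_engineer', FE),                         # same ColumnTransformer as above
    ('scales', MaxAbsScaler()),
    ('lgbm', LGBMClassifier(n_estimators=500,
                            learning_rate=0.05,
                            class_weight='balanced',  # same idea as in the random forest
                            n_jobs=-1))])

lgbm_pipe.fit(X_train, y_train)
print(classification_report(y_test, lgbm_pipe.predict(X_test)))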

Is low recall a big deal?

That depends on what you're aiming for... What strikes me as important isn't the low recall value in itself but the gap between the two classes: a drop from 98% recall on class 1 to 66% on class 0 is massive and should be dealt with.

Mostly I have been just "plugging and playing" ... anything obvious I am missing? Would applying over-sampling help? If so, how can that be done in Python / sklearn?

Yes, over-sampling is the first thing you should try! This can be done in Python through imbalanced-learn, which offers a large variety of under- and over-samplers. It will play well with your existing setup, as long as you swap sklearn.pipeline.Pipeline for imblearn.pipeline.Pipeline. Just note that this step has to come after converting your text to vectors (i.e. after CountVectorizer or TfidfVectorizer).
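
A minimal sketch of that setup (my own, so treat the sampler choice and parameters as illustrative), reusing FE, random_grid and strat_shuffle_fold from your code; it assumes imbalanced-learn is installed, and RandomOverSampler could be swapped for SMOTE or another sampler:

# Rough sketch: imblearn pipeline with over-sampling after the text is vectorised.
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import RandomOverSampler

imb_pipe = ImbPipeline(steps=[
    ('feature_engineer', FE),                              # text becomes vectors here first
    ('scales', MaxAbsScaler()),
    ('oversample', RandomOverSampler(random_state=123)),   # re-balances the classes
    ('rand_forest', RandomForestClassifier(n_jobs=-1))])   # class_weight likely redundant now

cv_train = RandomizedSearchCV(imb_pipe, param_distributions=random_grid, cv=strat_shuffle_fold)
cv_train.fit(X_train, y_train)   # the sampler only runs on the training folds, not at predict time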
