When is it okay to accept an overfit model for production?

classification, hyperparameter, machine-learning, neural-networks, scikit-learn

I am working on a binary classification problem using random forests (75:25 class proportion). My label 0 is the minority class. I am following the approach below:

a) execute RF with default hyperparameters

b) execute RF with the best hyperparameters (GridSearchCV with stratified K-fold and F1 scoring)

With the default hyperparameters, my training data was overfit, as can be seen from the results below. But I went ahead and tried the default parameters on the test data as well (results below).

[image: training data confusion matrix and classification report (default hyperparameters)]

default hyperparameters – Test data confusion matrix and classification report


Later, I used GridSearchCV to get the best hyperparameters based on 10 splits with stratified K-fold. However, the performance on the test data is poor (using the best hyperparameters).

Best hyperparameters – Test confusion matrix and classification report


So, my questions are:

a) The results make me feel like it is okay to stick with the overfit model, as it gives relatively good performance on the test data (compared to the best-parameter model, which performs poorly). In this case, should I go ahead with the overfit model with default parameters? When my business looks at the results, they would definitely prefer the overfit model (of course, I haven't shown them yet). Looking at the results of the ML models, it seems like a no-brainer that the overfit model is better and would help them more than the non-overfit model.

b) As my dataset is imbalanced, I chose scoring='f1'. But it looks like the model works to maximize the F1-score only for the majority class (label 1). How can I tell the model that it should maximize metrics like recall and precision for label 0 (the minority class)? Or should I invert my labels, i.e. make the 0's 1's and the 1's 0's?

c) Am I making any mistake in the model building, given that this sort of result seems impossible to obtain?

Update – code for best hyperparameters

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
rfc = RandomForestClassifier(random_state=42)
skf = StratifiedKFold(n_splits=10, shuffle=False)
model = GridSearchCV(rfc, param_grid=param_grid, cv=skf, scoring='f1')
model.fit(ord_train_t, y_train)
print(model.best_params_)
print(model.best_score_)

# refit on the full training set with the best hyperparameters found above
# (equivalently, model.best_estimator_ could be used directly)
rfc = RandomForestClassifier(random_state=42, max_features='sqrt',
                             n_estimators=500, max_depth=8, criterion='gini')
rfc.fit(ord_train_t, y_train)
y_train_pred = rfc.predict(ord_train_t)
y_test_pred = rfc.predict(ord_test_t)
y_train_proba = rfc.predict_proba(ord_train_t)
y_test_proba = rfc.predict_proba(ord_test_t)

Code for default hyperparameters

rfc = RandomForestClassifier()
rfc.fit(ord_train_t, y_train)
y_train_pred = rfc.predict(ord_train_t)
y_test_pred = rfc.predict(ord_test_t)
y_train_proba = rfc.predict_proba(ord_train_t)
y_test_proba = rfc.predict_proba(ord_test_t)

Best Answer

This is an interesting and precarious situation you are in, which should be treated with caution.

First, let's establish some terms: let's not use "overfit" and "non-overfit" to describe your approaches, as both (or neither) could potentially be overfitted. The difference between your two approaches is whether or not you tuned the hyperparameters through CV, right? So let's call them "not tuned/optimized" and "tuned/optimized".

The reason I say your situation is interesting is because it appears that the "incorrect" or "improper" approach gave a better result than the "proper" approach. The situation is precarious because, in order to come to that conclusion, you are using the test-set results as the point of comparison. This is problematic because the test set is supposed to provide a realistic error estimate of your final model, which you should have selected during cross validation. The final test set is not for multiple comparisons of models for purposes of model selection. Let me repeat: the test set is not for model selection. Otherwise you could keep retrying 1,000 different times until you finally got a great result on the final test set and think "wow, this is a great model I found", when in reality you just made many comparisons. I hope this intuitively makes sense. Any model/hyperparameter/variable selection should be done through cross validation. The test set is just for getting a realistic error estimate of the model selected through CV.

So our discussion is a bit flawed to begin with, as we are using the test set as a basis for comparison of two models/approaches (which we really shouldn't be doing). Nevertheless, your question stands: why did my un-optimized model give better results than the one that was optimized through a technique that's supposed to deliver superior results?

There are several ways of looking at this, and I will offer all the perspective I can.

1. False premise: You say that the optimized model gives worse predictions, but is this really the case? I know it appears that way, because a model (like the optimized one) which gives almost exclusively 1 as the prediction seems less useful (and its F1-score for the 0 class is worse), but consider this: the accuracy of the unoptimized model on the test set was 77.5%, whereas the optimized one gave an accuracy of 77.9%. You will say "Sure, but accuracy isn't a good performance metric here because my dataset is imbalanced; we should look at the F1 score." Yes, you're correct, but your RF model isn't trying to optimize the F1 score - your model is optimizing accuracy. Even though the dataset is imbalanced, getting a 1 correct is just as valuable to the model as getting a 0 correct (because they are weighted equally). And because there are far fewer 0's than 1's, why waste resources trying to get the 0's correct (sacrificing precision on the 1's) when you can focus heavily on getting all the 1's right? That's how you maximize overall accuracy (which is what the model is trying to do, and what it has done). The optimized model does not perform worse than the unoptimized one; it's just that, for your situation, you don't like how it arrives at that accuracy level (by basically ignoring the 0 class). We'll go into how to fix this a bit lower down, but you must understand that the model is doing the task you gave it - to get the biggest number of predictions correct, regardless of whether they are 1's or 0's.
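To make this distinction concrete, here is a minimal sketch (reusing the y_test and y_test_pred variables from the question's code) that prints overall accuracy next to the per-class F1 scores; the pos_label argument is what selects which class an F1 score refers to:

from sklearn.metrics import accuracy_score, f1_score, classification_report

# assumes y_test and y_test_pred already exist, as in the question's code
print("accuracy:", accuracy_score(y_test, y_test_pred))
print("F1, minority class (label 0):", f1_score(y_test, y_test_pred, pos_label=0))
print("F1, majority class (label 1):", f1_score(y_test, y_test_pred, pos_label=1))
print(classification_report(y_test, y_test_pred))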

2. Luck: Cross-validation hyperparameter tuning works because it essentially selects the hyperparameters which, on average, produce the greatest out-of-sample accuracy. It's "on average" because you are averaging out-of-sample fold performance across all k folds. This, of course, means that sometimes the performance may be higher or lower (you can check the performance on the individual folds to see how much spread/variance there is). What this also means is that if you randomly/arbitrarily select hyperparameters and run the model on the test set, you might have just gotten lucky - the arbitrary model did pretty well on the test set, but on average it doesn't do so well. Same thing with the optimized model - it is optimized because, on average, this is the best combination of parameters; maybe it just got a bit unlucky on your test set and happened to perform below average. My point here is that your optimized model is still your best bet, because it was chosen based on average performance, whereas you are considering your unoptimized model as a candidate based solely on one out-of-sample prediction set (the test set).

I hope you can see why comparing the two approaches (random/arbitrary hyperparameter selection vs. CV) based on test-set performance isn't correct. Think of it this way: if your arbitrary choice of hyperparameters had been the best combination on average, that would have been the combination chosen during CV. Since it wasn't, it's not the combination that does best on average, and so it's likely that the model just got "lucky" on your test set.

It's also possible, though, that hyperparameter tuning just doesn't make a huge difference in accuracy. However, you correctly observed a large difference in the distribution of predictions between the optimized and unoptimized model. You don't like the distribution of predictions of the optimized model because it basically just predicts a 1 every single time. Yet at the same time, that appears to be the optimal approach! Right? That's how the model gets the largest number of data points correct.

To get around this conundrum, you need to employ techniques to deal with your imbalanced dataset.

Essentially, right now, your model is saying: "Why bother getting good at predicting 0's? They are only 25% of the dataset and are worth the same as a 1 to me; I can do better by just catching all the 1's."

You have basically two options, which I have copy-pasted from my answer here.

Your options to fix the problem

Oversampling/undersampling - creating a new dataset which doesn't have the imbalance problem, therefore forcing the model to contend with the rare category of data.

Weighting - not changing the data, but changing the scoring so that the model is penalized differently for getting different categories wrong. (You can do this easily in random forests by adjusting a hyperparameter that's called something like "class weight", "sample weight", "weight", or something similar. Read your RF package's documentation to find this hyperparameter.)

I will explain both, but I would recommend trying weighting first, as it's usually a much easier fix than changing your sampling.

Category 1: Sampling (oversampling/undersampling)

Right off the bat, I (like many) would advise against the undersampling approach, where you would keep all of the rare-category examples but randomly sample some fraction of the "common"-category samples to include. You already have a limited amount of data and you want to make use of all of it; undersampling just gives your model less information to work with. So if you're doing some sampling approach, I would go with oversampling. Undersampling extreme example: you have 100 "common" cases and 10 "rare" cases, so you randomly sample 10 of the common cases and now you have a new, balanced dataset with 10 common cases and 10 rare cases.

Simply speaking, in oversampling you want to create a synthetic dataset where you have, in some way, repeated the rare data so that the model encounters it more often and therefore has to address it in order to find the optimal solution (i.e. the optimal solution can't be found by ignoring it anymore). Oversampling extreme example: you still have 100 "common" cases and 10 "rare" cases, so you randomly sample (with replacement) from the 10 "rare" cases until you have 100 of them (of course, 90 of the 100 are randomly selected repeats of the original 10). Now you have 100 "common" and 100 "rare" cases.
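A rough sketch of what plain random oversampling could look like with pandas and scikit-learn - the variable names (X_train, y_train) are placeholders for your own training data, with 0 as the minority class, and only the training split should ever be resampled:

import pandas as pd
from sklearn.utils import resample

# assumes X_train is a DataFrame of features and y_train a Series of 0/1 labels
train = pd.concat([X_train, y_train.rename("label")], axis=1)
minority = train[train["label"] == 0]
majority = train[train["label"] == 1]

# sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_upsampled])

X_train_bal = train_balanced.drop(columns="label")
y_train_bal = train_balanced["label"]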

There's a technique called SMOTE, which you can read about here: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/ . I personally have not used it, but it's very well-known and considered to be quite effective (it's a specific kind of oversampling technique). If simple oversampling doesn't solve the issue, consider SMOTE.
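If you do want to try SMOTE, the imbalanced-learn package provides it. A minimal sketch, assuming the ord_train_t / y_train variables from the question (again, resample only the training data, never the test set):

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(ord_train_t, y_train)

# train on the resampled data, evaluate on the untouched test set
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train_res, y_train_res)
y_test_pred = rfc.predict(ord_test_t)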

Category 2: Weighting

This is pretty straightforward too. Instead of making the rare cases more common, you can try to force the model to pay attention to them by assigning them a larger weight - that is, the model is penalized more than usual when it gets them wrong. Many algorithms allow some kind of weighting to be incorporated (as parameters).
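In scikit-learn's RandomForestClassifier the relevant hyperparameter is class_weight. A small sketch, reusing the variable names from the question; class_weight could also be added to your GridSearchCV grid so that CV picks the weighting for you:

from sklearn.ensemble import RandomForestClassifier

# 'balanced' weights each class inversely to its frequency; an explicit dict such as
# {0: 3, 1: 1} would penalize mistakes on the minority class (label 0) three times as much
rfc = RandomForestClassifier(random_state=42, class_weight='balanced')
rfc.fit(ord_train_t, y_train)
y_test_pred = rfc.predict(ord_test_t)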

I hope this helps!
