Solved – Techniques to detect overfitting

cross-validation, overfitting, regularization

I had a job interview for a data science position. During the interview, I was asked what I do to make sure a model is not overfitting. My first answer was to use cross-validation to assess the performance of the model. However, the interviewer said that even cross-validation cannot completely identify overfitting. Then I mentioned regularization, but the interviewer said that this can help to reduce overfitting (which I agree with), but not to detect it. Are there other techniques that can be used to make sure a model is not overfitting?

Best Answer

I believe that when asking about overfitting, the interviewer was looking for the "textbook answer", while you went a few steps beyond it.

A symptom of overfitting is that the classifier's performance on the training set is better than its performance on the test set. I refer to this as the "textbook answer" since it is the common answer and a reasonable approximation.
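For illustration, here is a minimal sketch of that check in Python; the scikit-learn classes are real, but the synthetic dataset and the choice of a decision tree are just placeholders:

```python
# The "textbook" check: compare performance on the training set
# against performance on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unconstrained tree can memorize the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print(f"train accuracy: {model.score(X_train, y_train):.3f}")
print(f"test accuracy:  {model.score(X_test, y_test):.3f}")
# A large train/test gap (e.g., 1.000 on train vs. ~0.85 on test)
# is the classic symptom of overfitting.
```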

Note that this answer leaves many open ends. For example, how much of a difference counts as overfitting? Also, a difference in performance between the two datasets is not necessarily due to overfitting. On the other hand, overfitting won't necessarily result in a significant difference in performance on the two datasets.

Cross-validation is a technique to evaluate the performance of a learner (e.g., a decision-tree inducer) on data it hasn't seen before. However, overfitting refers to a specific model (e.g., the rule "if f1 and not f2, predict True"). Cross-validation will show you the learner's tendency to overfit on this data, but it won't answer whether your specific model overfitted.
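A short sketch of this distinction, again with placeholder data: note that each fold trains a fresh model, so the scores characterize the learner, not any one fitted model.

```python
# Cross-validation estimates the learner's out-of-sample performance:
# each fold fits a brand-new model on the remaining folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=5)
print(f"fold scores: {scores}")
print(f"mean: {scores.mean():.3f} +/- {scores.std():.3f}")
# The five fitted models are discarded; the scores describe the
# learner's tendency on this data, not whether one specific model
# you deployed has overfitted.
```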

In order to overfit, a model needs complexity, and that is where regularization helps: it bounds (or trades off) the complexity of the model. Note that another source of overfitting is the size of the hypothesis set (which can be thought of as the number of possible models). Deciding in advance to use a restricted hypothesis set is another way to avoid overfitting.
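One way to see the effect: the sketch below uses a depth limit on a decision tree as the complexity control (a stand-in for regularization in general; the specific depth value is illustrative). Restricting depth shrinks the hypothesis set and, typically, the train/test gap.

```python
# Comparing an unbounded hypothesis set (max_depth=None) with a
# restricted one (max_depth=3) as a simple form of complexity control.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for depth in (None, 3):  # None = unrestricted; 3 = restricted in advance
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={model.score(X_train, y_train):.3f} "
          f"test={model.score(X_test, y_test):.3f}")
# Expect the restricted tree to show a smaller train/test gap,
# at the cost of a lower training score.
```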