Solved – effect of increasing the number of iterations while optimising logistic regression cost function

gradient descent, optimization, overfitting

I am taking an online Deep Learning class from Andrew Ng, and it starts with optimising a classifier based on logistic regression. During the online assignment there was one paragraph that does not make sense to me:

You can see the cost decreasing. It shows that the parameters are
being learned. However, you see that you could train the model even
more on the training set. Try to increase the number of iterations in
the cell above and rerun the cells. You might see that the training
set accuracy goes up, but the test set accuracy goes down. This is
called overfitting.

I cannot understand why increasing the number of iterations would result in overfitting. I can understand that increasing model complexity can result in overfitting, but I cannot see why increasing the number of gradient descent iterations on the logistic regression cost function would overfit.

Is the statement wrong or have I failed to understand some important concept?

Best Answer

I share your confusion about vanilla logistic regression, that is, when you are working with an unpenalised likelihood/objective function. In that setting the iterations are not really about model selection, but rather about finding the maximum of a non-linear function.

However, having said this, think about the starting point for the algorithm, which is often the intercept (or bias) set to the log odds for the whole dataset and everything else set to zero. This starting point can be seen as a "simple" model, and your actual fitted model as the "complex" model. As the iterations proceed, the parameters move from the "simple" model towards the "complex" one (see the initialisation sketch below).
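To make that starting point concrete, here is a minimal numpy sketch (not the course's code; the function name and variables are just illustrative). With the weights at zero, the sigmoid of the bias equals the overall positive rate, i.e. the model predicts the same probability for every example:

```python
import numpy as np

def initialize_simple(X, y):
    """Hypothetical "simple" starting point for gradient descent."""
    p = y.mean()              # overall fraction of positive labels
    b = np.log(p / (1 - p))   # bias = log odds of the whole dataset
    w = np.zeros(X.shape[1])  # every other parameter starts at zero
    return w, b
```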

We can then imagine finding the MLE for a logistic model with "too many" predictors. The iterations always start from a model with effectively "too few" predictors, and the "hand wavy" argument is that somewhere in the middle iterations there might be a nice "good fit". This obviously depends on how the iterations are carried out, such as how quickly the algorithm converges and how much the parameters are allowed to vary at each iteration; the sketch below tries to show the effect.
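Here is a self-contained sketch of that "simple → complex" trajectory, assuming synthetic data with a few informative features and many pure-noise ones (all names and numbers are made up for illustration, not from the course). Running plain batch gradient descent on the unpenalised log-loss and printing train/test accuracy at a few checkpoints, you may see training accuracy keep climbing with more iterations while test accuracy stalls or drops, depending on the random draw:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small sample, many noise features, so the unpenalised fit can overfit
# as the iterations pile up.
n_train, n_test, n_features = 60, 200, 50
w_true = np.zeros(n_features)
w_true[:3] = [2.0, -1.5, 1.0]  # only three features actually matter

def make_data(n):
    X = rng.normal(size=(n, n_features))
    p = 1 / (1 + np.exp(-(X @ w_true)))
    return X, (rng.uniform(size=n) < p).astype(float)

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

sigmoid = lambda z: 1 / (1 + np.exp(-z))
accuracy = lambda X, y, w, b: np.mean((sigmoid(X @ w + b) > 0.5) == y)

# Start from the "simple" model: zero weights, bias at the log odds.
w = np.zeros(n_features)
b = np.log(y_tr.mean() / (1 - y_tr.mean()))
lr = 0.1

for i in range(1, 20001):
    p = sigmoid(X_tr @ w + b)
    w -= lr * (X_tr.T @ (p - y_tr)) / n_train  # gradient of the log-loss
    b -= lr * np.mean(p - y_tr)
    if i in (100, 1000, 5000, 20000):
        print(f"iter {i:6d}  train acc {accuracy(X_tr, y_tr, w, b):.3f}"
              f"  test acc {accuracy(X_te, y_te, w, b):.3f}")
```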

Hope this helps!