Solved – Overfitting in Genetic Programming

cross-validation, genetic algorithms

I've recently started experimenting with Genetic Programming as an optimization tool. I'm still a little confused as to how to reduce overfitting in this framework.

A few techniques I've read about:

  1. Limiting the number of generations
  2. Limiting the tree (genome) size
  3. Adding a complexity penalty to the fitness function (see the sketch after this list)
  4. Incorporating cross-validation into the fitness function
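
To make 2 and 3 concrete, here is a minimal sketch of what I have in mind; the nested-tuple tree representation, the size cap and the penalty weight are just placeholders, not tied to any particular GP library:

    import numpy as np

    MAX_NODES = 50           # technique 2: hard cap on tree (genome) size
    PENALTY_PER_NODE = 0.01  # technique 3: strength of the parsimony pressure

    def count_nodes(tree):
        """Size of a GP tree represented as nested tuples: (op, child, child, ...)."""
        if not isinstance(tree, tuple):
            return 1
        return 1 + sum(count_nodes(child) for child in tree[1:])

    def penalized_fitness(program, tree, X, y):
        """Lower is better: raw training error plus a complexity penalty.
        `program` is the tree compiled to a callable X -> predictions."""
        size = count_nodes(tree)
        if size > MAX_NODES:
            return float("inf")               # reject over-sized genomes outright
        mse = float(np.mean((program(X) - y) ** 2))
        return mse + PENALTY_PER_NODE * size  # parsimony pressure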

Of the methods above, 4 is the most intriguing/confusing to me. I've read this post on SO, "How to avoid overfitting when using crossvalidation within Genetic Algorithms", but it didn't give me much insight into the actual implementation.

Let's say I'm trying to implement cross-validation in my fitness function. I can "train" my GP on some time period (let's say 2000-2010) and evaluate its fitness on another time period (let's say 2010-2014). However, won't this just introduce overfitting on the 2010-2014 period? The GP will just gravitate towards finding good programs for the testing time period.

Alternatively, you could say that you want a program that performs well on 2000-2010 (performance P1) and on 2010-2014 (performance P2). Say your fitness function requires the GP's performance on those two time periods to be high and similar (e.g. Fitness = (P1 + P2) / (1 + |P1 - P2|)), but then isn't that effectively the same as just evaluating the fitness on the whole 2000-2014 period?

I guess what I'm asking is whether it's even possible to build a form of cross-validation into the fitness function. GP isn't a parameterized algorithm like kNN, SVM, or simple linear regression, so I'm confused about how cross-validation even works in this setting. I can always just evaluate the output of the naive GP on test data, but in my experiments I haven't found good performance, because of GP's tendency to overfit.
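
For concreteness, the closest I've come to option 4 is something like the sketch below: it scores a candidate program on several chronological validation folds (using scikit-learn's TimeSeriesSplit) and rewards performance that is both high and stable. The `program` callable is a placeholder for however your GP system compiles a tree:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    def cv_fitness(program, X, y, n_splits=5):
        """Score a GP program on several chronological folds.
        There is no per-fold 'fit' step: the evolved expression is the
        model, so each fold is simply a different evaluation period."""
        scores = []
        for _, val_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
            preds = program(X[val_idx])
            scores.append(-np.mean((preds - y[val_idx]) ** 2))  # higher is better
        scores = np.asarray(scores)
        return scores.mean() - scores.std()  # reward "high and similar"

But as far as I can tell this still lets the search gravitate towards programs that happen to look good on exactly those folds, which is what prompted this question.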

Best Answer

I can "train" my GP on some time period (let's say 2000-2010) and evaluate its fitness on another time period (let's say 2010-2014). However, won't this just introduce overfitting on the 2010-2014 period? The GP will just gravitate towards finding good programs for the testing time period.

Yes, that is a fundamental problem, and it is triggered by a basic difference between evolution (the heuristic behind GAs) and using a GA for this kind of optimization: in evolution, each generation is confronted with new, unknown test cases.

Recommendation specifically for iterative optimization schemes: restrict the number of generations as much as possible
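
As a rough illustration (not a prescription), a driver along these lines keeps the generation budget small and, as an extra safeguard beyond the plain cap, stops as soon as fitness on data the GA never selects on stops improving; `step` and `held_out_fitness` are placeholders for your own GA machinery:

    MAX_GENERATIONS = 30   # keep this small: every extra generation is
                           # another chance to overfit the selection data
    PATIENCE = 5           # extra safeguard: stop when held-out fitness stalls

    def run_ga(population, step, held_out_fitness):
        """Run at most MAX_GENERATIONS generations.
        `step(population)` produces the next generation;
        `held_out_fitness(population)` scores the current best individual
        on data never used inside the GA's own fitness evaluation."""
        best, stale = -float("inf"), 0
        for _ in range(MAX_GENERATIONS):
            population = step(population)
            score = held_out_fitness(population)
            if score > best:
                best, stale = score, 0
            else:
                stale += 1
                if stale >= PATIENCE:
                    break   # no improvement on held-out data: stop early
        return population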

The underlying problem, which leads @DikranMarsupial to conclude that optimization is the root of all evil, is that algorithms searching for maximum performance will "skim" variance, i.e. they find solutions that owe their apparent performance to a lucky combination of the training/test-set split and the model hyperparameters.

For GAs, such solutions will accumulate in the elite, so you may want to switch off elitism, or at least re-evaluate the elite individuals with new cross-validation splits.
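
A minimal sketch of the re-evaluation idea; the `.program` / `.fitness` attributes and `fitness_fn` are placeholders for whatever your GP implementation uses, and for time-series data you would draw fresh contiguous blocks rather than a random subset:

    import numpy as np

    rng = np.random.default_rng(0)

    def reevaluate_elite(elite, X, y, fitness_fn, val_fraction=0.3):
        """Re-score the elite on a freshly drawn validation subset each
        generation, so individuals that merely got a lucky split do not
        accumulate at the top of the population."""
        val_idx = rng.choice(len(X), size=int(len(X) * val_fraction), replace=False)
        for ind in elite:
            ind.fitness = fitness_fn(ind.program, X[val_idx], y[val_idx])
        return elite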

One general recommendation for any kind of data-driven optimization that helps with these variance issues: use a proper scoring rule for performance evaluation. Proper scoring rules are well behaved in the sense that they react continuously to continuous changes in the model, and they typically also exhibit less variance than dichotomized (error-counting) loss functions.
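
For example, the Brier score is a proper scoring rule for probabilistic classifiers; the toy comparison below shows how it keeps distinguishing models that an error-counting loss already considers identical:

    import numpy as np

    def brier_score(y_true, p_pred):
        """Brier score (a proper scoring rule): mean squared difference
        between predicted probability and the 0/1 outcome. Lower is better,
        and it reacts continuously to changes in the predicted probabilities."""
        return float(np.mean((p_pred - y_true) ** 2))

    def error_rate(y_true, p_pred, threshold=0.5):
        """Dichotomized error counting: only changes when a prediction
        crosses the threshold, so it is a much noisier target to optimize."""
        return float(np.mean((p_pred >= threshold) != y_true))

    y   = np.array([1, 0, 1, 1, 0])
    p_a = np.array([0.9, 0.2, 0.6, 0.8, 0.1])     # well calibrated
    p_b = np.array([0.6, 0.4, 0.51, 0.55, 0.45])  # barely on the right side
    print(error_rate(y, p_a), error_rate(y, p_b))    # 0.0 0.0 -> indistinguishable
    print(brier_score(y, p_a), brier_score(y, p_b))  # 0.052 0.19222 -> distinguishable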

Two other basic cautionary steps I'd recommend:

  • Compare the performance of some kind of "baseline" model (e.g. a model where you set the hyperparameters as your knowledge of the data and the application field suggests) against the "good" results of the optimizer with a statistical test, e.g. McNemar's (see the sketch after this list).
    You can also check beforehand whether you have a realistic chance of obtaining a model for which you can actually demonstrate an increase in performance (e.g. if your baseline model is already 90% correct, is there any chance, given the total number of test cases you have, that even observing 100% correct test cases would correspond to a demonstrably better model?).

  • As the GA performs a data-driven optimization, you need in any case to validate the final set of hyperparameters by another, outer validation (loop). Calculate the same performance measure you use internally in the GA for that outer validation as well: the difference is an indicator of overfitting.
    While this doesn't prevent the overfitting, you can at least detect that there is a problem.
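
A sketch of the McNemar comparison on a held-out test set, using statsmodels (`exact=True` uses the binomial version, advisable when only a few cases disagree); the prediction arrays are placeholders for your own baseline and GA-selected models:

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    def compare_to_baseline(y_test, pred_baseline, pred_optimized):
        """McNemar's test on the pattern of disagreements between the
        baseline model and the optimizer's best model, evaluated on the
        same held-out test cases."""
        base_ok = pred_baseline == y_test
        opt_ok = pred_optimized == y_test
        table = np.array([
            [np.sum(base_ok & opt_ok),  np.sum(base_ok & ~opt_ok)],
            [np.sum(~base_ok & opt_ok), np.sum(~base_ok & ~opt_ok)],
        ])
        return mcnemar(table, exact=True)   # result has .statistic and .pvalue

    # e.g. compare_to_baseline(y_test, baseline_preds, best_gp_preds).pvalue

If only a handful of test cases disagree, no p-value from this table can become small, which is exactly the back-of-the-envelope power check mentioned above.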

Let me know if you need literature; in that case: do you read German? (I used a GA in my Diplom thesis; that's how I found out about all these problems...)
