I can "train" my GP on some time period (let's say 2000-2010) and evaluate it's fitness on another time period (let's say 2010-2014). However, won't this just introduce overfitting on the time period from 2010-2014. The GP will just gravitate towards finding good programs on the testing time period.
Yes, that is a fundamental problem, and it stems from a basic difference between evolution (the heuristic behind GAs) and GA-based optimization of this kind: in evolution, each generation is confronted with new, unknown test cases.
Recommendation specifically for iterative optimization schemes: restrict the number of generations as much as possible.
The underlying problem that leads @DikranMarsupial to conclude that optimization is the root of all evil is that algorithms searching for maximum performance will "skim" variance, i.e. they find solutions whose apparent performance comes from a lucky combination of the training/test set splits and the model hyperparameters.
For GAs, such solutions accumulate in the elite, so you may want to switch off elitism or at least re-evaluate the elite with new cross-validation splits, as in the sketch below.
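A quick toy illustration (my own, not GA code): fitness estimated on a random split is itself a random variable, and that split-to-split spread is exactly what an elite can lock in. Here a plain lm stands in for a GP individual:

set.seed(42)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)     # toy data

# Score the "individual" on a freshly drawn random 70/30 split each call
rescore <- function(dat) {
  idx  <- sample(nrow(dat), 70)                 # new split every evaluation
  fit  <- lm(y ~ x, data = dat[idx, ])
  pred <- predict(fit, newdata = dat[-idx, ])
  mean((dat$y[-idx] - pred)^2)                  # held-out MSE as "fitness"
}
replicate(5, rescore(dat))   # the spread across calls is the variance an elite can exploit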
One general recommendation for any kind of data-driven optimization that helps with these variance issues is:
Use a proper scoring rule for performance evaluation. Proper scoring rules are well behaved in the sense that they react continuously to continuous changes in the model, and they often also exhibit less variance than dichotomized (error-counting) loss functions.
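For example (a toy illustration of my own, in R), compare how the two loss types react to a small change in one predicted probability:

y <- c(1, 0, 1, 1, 0)               # true classes
p <- c(0.9, 0.2, 0.6, 0.55, 0.4)    # predicted P(class = 1)
mean((p - y)^2)                     # Brier score: a proper scoring rule
mean((p > 0.5) != y)                # dichotomized (error-counting) loss
# Nudging the 4th prediction from 0.55 to 0.45 flips a whole 0/1 error,
# but moves the Brier score only slightly.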
Two other basic cautionary steps I'd recommend:
Compare the performance of some kind of "baseline" model (e.g. a model where you set the hyperparameters as your knowledge of the data and application field suggests) against the "good" results of the optimizer with a statistical test, e.g. McNemar's (see the sketches after this list).
You can also check beforehand whether you have a realistic chance of obtaining a model for which you can actually demonstrate an increase in performance. E.g. if your baseline model is already 90% correct, is there any chance, given the total number of test cases you have, to show that even observing 100% correct test cases actually corresponds to a better model? (Again, see the sketch below.)
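Both checks are short in R; the contingency table and case counts below are made up for illustration:

# McNemar's test on test cases cross-classified by whether the
# baseline and the tuned model got them right (made-up counts)
tab <- matrix(c(80, 4, 9, 7), nrow = 2,
              dimnames = list(baseline = c("correct", "wrong"),
                              tuned    = c("correct", "wrong")))
mcnemar.test(tab)    # a large p-value: no evidence the tuned model is better

# Feasibility check against a 90% baseline: is even a perfect test
# result distinguishable, given the number of test cases?
binom.test(50, 50)$conf.int   # 50/50 correct: 95% CI lower bound ~0.93, above 0.9
binom.test(30, 30)$conf.int   # 30/30 correct: the CI still covers 0.9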
As the GA performs a data-driven optimization, you in any case need to validate the final set of hyperparameters in another, outer validation (loop). Calculate the same performance measure you use internally for the GA in the outer validation as well: the difference is an indicator of overfitting.
While this doesn't undo the overfitting, you can at least detect that there is a problem (illustrated below).
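A toy version of that comparison (my own illustration; an over-flexible lm stands in for the GA's winner):

set.seed(7)
dat <- data.frame(x = rnorm(150))
dat$y <- dat$x + rnorm(150)
outer <- 101:150                                  # never touched by the "optimization"

fit <- lm(y ~ poly(x, 5), data = dat[-outer, ])   # stand-in for the GA's chosen model
inner_mse <- mean(residuals(fit)^2)               # the score the optimizer saw
outer_mse <- mean((dat$y[outer] - predict(fit, dat[outer, ]))^2)
c(inner = inner_mse, outer = outer_mse)           # a large gap flags overfitting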
Let me know if you need literature - in that case: do you read German? (I used a GA in my Diplom thesis - that's how I found out about all those problems...)
EDIT: Apparently, I should have posted my original recommendation about the facility location problem as a reply to your question, not as an answer. I am new to the forum, but I get it now.
Anyway, I believe your problem can be formulated as follows.
Let $y_i = 1$ if supply center $i$ is opened and $0$ otherwise,
$x_{i,j} = 1$ if supply center $i$ feeds shop $j$ and $0$ otherwise, and
$c_{i,j}$ = the distance from supply center $i$ to shop $j$.
Then you want to solve the following (binary) integer linear program.
minimize $\sum_i \sum_j c_{i,j} x_{i,j}$
subject to $\sum_i x_{i,j} = 1$, $j = 1, \dots, 1000$ (or the number of shops you have)
$\sum_j x_{i,j} \le 1000\, y_i$, $i = 1, \dots, 28$ (if you have a max capacity for each supply center, use that number instead of 1000; 1000 assumes there is no max capacity, so one center could feasibly supply all shops)
$\sum_i y_i \le 20$ (max of 20 supply centers chosen)
$x_{i,j} \in \{0, 1\}$, $i = 1, \dots, 28$; $j = 1, \dots, 1000$
$y_i \in \{0, 1\}$, $i = 1, \dots, 28$
The Rsymphony package in R can solve this problem for you. You will need a vector containing the distances $c_{i,j}$ followed by a 0 for each $y_i$ (the objective coefficients), as well as a matrix to encode the constraints (the three summations after "subject to"). If you aren't familiar with the constraint matrix, each column contains the coefficients of one variable. The help file for the Rsymphony package has a couple of examples that should help with that. Don't forget to include columns for the $y_i$ in the matrix as well. You will also need to declare the $x_{i,j}$ and $y_i$ binary by using
types = "B"
As an example, suppose we have 3 shops and 2 supply centers with distances $c_{1,1} = 2.8$, $c_{1,2} = 5.4$, $c_{1,3} = 1.4$, $c_{2,1} = 4.2$, $c_{2,2} = 3.0$, $c_{2,3} = 6.3$. Then
library(Rsymphony)
# Objective: distances for x11, x12, x13, x21, x22, x23, then 0s for y1, y2
obj <- c(2.8, 5.4, 1.4, 4.2, 3.0, 6.3, 0, 0)
# Constraint matrix, filled column-wise, one column per variable
# (x11, x12, x13, x21, x22, x23, y1, y2):
#   rows 1-3: each shop is assigned to exactly one center
#   rows 4-5: a center can only supply shops if it is open (-3 = number of shops)
#   row 6:    at most 2 centers may be opened
mat <- matrix(c(1,0,0,1,0,0,0,1,0,1,0,0,0,0,1,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,-3,0,1,0,0,0,0,-3,1), nrow = 6)
dir <- c("==", "==", "==", "<=", "<=", "<=")
rhs <- c(1, 1, 1, 0, 0, 2)
max <- FALSE      # minimize total distance
types <- "B"      # all variables binary
fac.loc <- Rsymphony_solve_LP(obj, mat, dir, rhs, types = types, max = max, write_lp = TRUE)
fac.loc
tells us that we should open both supply centers: supply center 1 supplies shops 1 and 3, and supply center 2 supplies shop 2, for a total distance of 7.2.
I hope this helps.
Best Answer
Your pollutant problem probably doesn't need much of a language at all. It looks like a symbolic-regression problem rather than a control problem, in which case you can just use standard tree GP, with the features and a few useful constants as the terminal set and the relevant operators in the function set. The GP system will weed out irrelevant features, and there are techniques to handle very large datasets. Generally, specify the smallest function set that you estimate could solve the problem, and expand it with care if necessary.
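To make that concrete, here is a minimal sketch of a tree-GP individual for symbolic regression, written in R since that is the language used elsewhere on this page; the feature names, constant, and toy data are all made up:

fun_set  <- c("+", "-", "*")      # function set: keep it as small as you can
term_set <- c("x1", "x2", "1")    # terminal set: features plus a useful constant

# Grow a random expression tree of at most the given depth
random_tree <- function(depth) {
  if (depth == 0 || runif(1) < 0.3) {   # sometimes stop early: pick a terminal
    t <- sample(term_set, 1)
    if (t == "1") 1 else as.name(t)
  } else {                              # otherwise recurse on a random function
    f <- sample(fun_set, 1)
    as.call(list(as.name(f), random_tree(depth - 1), random_tree(depth - 1)))
  }
}

# Fitness = mean squared error of the tree's predictions on the data
fitness <- function(tree, dat) {
  pred <- eval(tree, dat)   # evaluate the tree with data columns as variables
  mean((dat$y - pred)^2)
}

set.seed(1)
dat <- data.frame(x1 = runif(20), x2 = runif(20))
dat$y <- dat$x1 * dat$x2 + dat$x1   # toy target for the GP to rediscover
tree <- random_tree(3)
print(tree)
fitness(tree, dat)

Selection, crossover (swapping random subtrees), and mutation then operate directly on these call objects.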
You'll need to choose between tree and linear GP early on: Lisp-style GP is tree-based, Slash/A is linear. Read up on both to understand the pros and cons, but from what you wrote I'd suggest a simple tree-GP system. It's not too hard to write your own, but there are existing Python implementations. The ones below cover evolutionary algorithms in Python in general, though not all of them do GP and some are inactive:
Also, see the (free) introductory book on GP by well-known GP authors Poli, Langdon and McPhee:
A Field Guide to Genetic Programming -- http://www.gp-field-guide.org.uk/