I can "train" my GP on some time period (let's say 2000-2010) and evaluate it's fitness on another time period (let's say 2010-2014). However, won't this just introduce overfitting on the time period from 2010-2014. The GP will just gravitate towards finding good programs on the testing time period.
Yes, that is a fundamental problem, and it stems from a basic difference between evolution (the heuristic behind GAs) and GA-based optimization of this kind: in evolution, each generation is confronted with new, unknown test cases.
Recommendation specifically for iterative optimization schemes: restrict the number of generations as much as possible.
The underlying problem that leads @DikranMarsupial to conclude that optimization is the root of all evil is that algorithms searching for maximum performance will "skim" variance, i.e. they find solutions whose apparent performance comes from a lucky combination of the training/test set splits and the model hyperparameters.
For GAs, such solutions accumulate in the elite, so you may want to switch off elitism or at least re-evaluate the elite with new cross-validation splits, as in the sketch below.
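A quick toy illustration (my own, not GA code): fitness estimated on a random split is itself a random variable, and that split-to-split spread is exactly what an elite can lock in. Here a plain lm stands in for a GP individual:

set.seed(42)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)     # toy data

# Score the "individual" on a freshly drawn random 70/30 split each call
rescore <- function(dat) {
  idx  <- sample(nrow(dat), 70)                 # new split every evaluation
  fit  <- lm(y ~ x, data = dat[idx, ])
  pred <- predict(fit, newdata = dat[-idx, ])
  mean((dat$y[-idx] - pred)^2)                  # held-out MSE as "fitness"
}
replicate(5, rescore(dat))   # the spread across calls is the variance an elite can exploit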
One general recommendation for any kind of data-driven optimization that helps with these variance issues is:
Use a proper scoring rule for performance evaluation. Proper scoring rules are well behaved in the sense that they react continuously to continuous changes in the model, and they often also exhibit less variance than dichotomized (error-counting) loss functions.
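For example (a toy illustration of my own, in R), compare how the two loss types react to a small change in one predicted probability:

y <- c(1, 0, 1, 1, 0)               # true classes
p <- c(0.9, 0.2, 0.6, 0.55, 0.4)    # predicted P(class = 1)
mean((p - y)^2)                     # Brier score: a proper scoring rule
mean((p > 0.5) != y)                # dichotomized (error-counting) loss
# Nudging the 4th prediction from 0.55 to 0.45 flips a whole 0/1 error,
# but moves the Brier score only slightly.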
Two other basic cautionary steps I'd recommend:
Compare the performance of some kind of "baseline" model (e.g. a model where you set the hyperparameters as your knowledge of the data and application field suggests) against the "good" results of the optimizer with a statistical test, e.g. McNemar's (see the sketches after this list).
You can also check beforehand whether you have a realistic chance of obtaining a model for which you can actually demonstrate an increase in performance. E.g. if your baseline model is already 90% correct, is there any chance, given the total number of test cases you have, to show that even observing 100% correct test cases actually corresponds to a better model? (Again, see the sketch below.)
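Both checks are short in R; the contingency table and case counts below are made up for illustration:

# McNemar's test on test cases cross-classified by whether the
# baseline and the tuned model got them right (made-up counts)
tab <- matrix(c(80, 4, 9, 7), nrow = 2,
              dimnames = list(baseline = c("correct", "wrong"),
                              tuned    = c("correct", "wrong")))
mcnemar.test(tab)    # a large p-value: no evidence the tuned model is better

# Feasibility check against a 90% baseline: is even a perfect test
# result distinguishable, given the number of test cases?
binom.test(50, 50)$conf.int   # 50/50 correct: 95% CI lower bound ~0.93, above 0.9
binom.test(30, 30)$conf.int   # 30/30 correct: the CI still covers 0.9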
As the GA performs a data-driven optimization, you in any case need to validate the final set of hyperparameters in another, outer validation (loop). Calculate the same performance measure you use internally for the GA in the outer validation as well: the difference is an indicator of overfitting.
While this doesn't undo the overfitting, you can at least detect that there is a problem (illustrated below).
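A toy version of that comparison (my own illustration; an over-flexible lm stands in for the GA's winner):

set.seed(7)
dat <- data.frame(x = rnorm(150))
dat$y <- dat$x + rnorm(150)
outer <- 101:150                                  # never touched by the "optimization"

fit <- lm(y ~ poly(x, 5), data = dat[-outer, ])   # stand-in for the GA's chosen model
inner_mse <- mean(residuals(fit)^2)               # the score the optimizer saw
outer_mse <- mean((dat$y[outer] - predict(fit, dat[outer, ]))^2)
c(inner = inner_mse, outer = outer_mse)           # a large gap flags overfitting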
Let me know if you need literature - in that case: do you read German? (I used a GA in my Diplom thesis - that's how I found out about all those problems...)
EDIT: Apparently, I should have posted my original recommendation about the facility location problem as a reply to your question, not as an answer. I am new to the forum, but I get it now.
Anyway, I believe your problem can be formulated as follows.
Let $y_i = 1$ if supply center $i$ is opened and $0$ otherwise,
$x_{i,j} = 1$ if supply center $i$ feeds shop $j$ and $0$ otherwise, and
$c_{i,j}$ = the distance from supply center $i$ to shop $j$.
Then you want to solve the following (binary) integer linear program.
minimize $\sum_i \sum_j c_{i,j} x_{i,j}$
subject to $\sum_i x_{i,j} = 1$, $j = 1, \dots, 1000$ (or the number of shops you have)
$\sum_j x_{i,j} \le 1000\, y_i$, $i = 1, \dots, 28$ (if you have a max capacity for each supply center, use that number instead of 1000; 1000 assumes there is no max capacity, so one center could feasibly supply all shops)
$\sum_i y_i \le 20$ (max of 20 supply centers chosen)
$x_{i,j} \in \{0, 1\}$, $i = 1, \dots, 28$; $j = 1, \dots, 1000$
$y_i \in \{0, 1\}$, $i = 1, \dots, 28$
The Rsymphony package in R can solve this problem for you. You will need a vector containing the distances $c_{i,j}$ followed by a 0 for each $y_i$ (the objective coefficients), as well as a matrix to encode the constraints (the three summations after "subject to"). If you aren't familiar with the constraint matrix, each column contains the coefficients of one variable. The help file for the Rsymphony package has a couple of examples that should help with that. Don't forget to include columns for the $y_i$ in the matrix as well. You will also need to declare the $x_{i,j}$ and $y_i$ binary by using
types = "B"
As an example, suppose we have 3 shops and 2 supply centers with distances $c_{1,1} = 2.8$, $c_{1,2} = 5.4$, $c_{1,3} = 1.4$, $c_{2,1} = 4.2$, $c_{2,2} = 3.0$, $c_{2,3} = 6.3$. Then
library(Rsymphony)
# Objective: distances for x11, x12, x13, x21, x22, x23, then 0s for y1, y2
obj <- c(2.8, 5.4, 1.4, 4.2, 3.0, 6.3, 0, 0)
# Constraint matrix, filled column-wise, one column per variable
# (x11, x12, x13, x21, x22, x23, y1, y2):
#   rows 1-3: each shop is assigned to exactly one center
#   rows 4-5: a center can only supply shops if it is open (-3 = number of shops)
#   row 6:    at most 2 centers may be opened
mat <- matrix(c(1,0,0,1,0,0,0,1,0,1,0,0,0,0,1,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,-3,0,1,0,0,0,0,-3,1), nrow = 6)
dir <- c("==", "==", "==", "<=", "<=", "<=")
rhs <- c(1, 1, 1, 0, 0, 2)
max <- FALSE      # minimize total distance
types <- "B"      # all variables binary
fac.loc <- Rsymphony_solve_LP(obj, mat, dir, rhs, types = types, max = max, write_lp = TRUE)
fac.loc
tells us that we should open both supply centers: supply center 1 supplies shops 1 and 3, and supply center 2 supplies shop 2, for a total distance of 7.2.
I hope this helps.
Best Answer
Your pollutant problem probably doesn't need much of a language at all. It looks like a symbolic-regression problem rather than a control problem, in which case you can just use standard tree GP, with the features and a few useful constants as the terminal set and the relevant operators in the function set. The GP system will weed out irrelevant features, and there are techniques to handle very large datasets. Generally, specify the smallest function set that you estimate could solve the problem, and expand it with care if necessary.
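To make that concrete, here is a minimal sketch of a tree-GP individual for symbolic regression, written in R since that is the language used elsewhere on this page; the feature names, constant, and toy data are all made up:

fun_set  <- c("+", "-", "*")      # function set: keep it as small as you can
term_set <- c("x1", "x2", "1")    # terminal set: features plus a useful constant

# Grow a random expression tree of at most the given depth
random_tree <- function(depth) {
  if (depth == 0 || runif(1) < 0.3) {   # sometimes stop early: pick a terminal
    t <- sample(term_set, 1)
    if (t == "1") 1 else as.name(t)
  } else {                              # otherwise recurse on a random function
    f <- sample(fun_set, 1)
    as.call(list(as.name(f), random_tree(depth - 1), random_tree(depth - 1)))
  }
}

# Fitness = mean squared error of the tree's predictions on the data
fitness <- function(tree, dat) {
  pred <- eval(tree, dat)   # evaluate the tree with data columns as variables
  mean((dat$y - pred)^2)
}

set.seed(1)
dat <- data.frame(x1 = runif(20), x2 = runif(20))
dat$y <- dat$x1 * dat$x2 + dat$x1   # toy target for the GP to rediscover
tree <- random_tree(3)
print(tree)
fitness(tree, dat)

Selection, crossover (swapping random subtrees), and mutation then operate directly on these call objects.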
You'll need to choose between tree and linear GP early on: Lisp-style GP is tree-based, Slash/A is linear. Read up on both to understand the pros and cons, but from what you wrote I'd suggest a simple tree-GP system. It's not too hard to write your own, but there are existing Python implementations. The ones below cover evolutionary algorithms in Python in general, though not all of them do GP and some are inactive:
Also, see the (free) introductory book on GP by well-known GP authors Poli, Langdon and McPhee:
A Field Guide to Genetic Programming -- http://www.gp-field-guide.org.uk/