Solved – How to avoid overfitting when using cross-validation within Genetic Algorithms

data mining, feature selection, genetic algorithms, machine learning

This is a long set-up, but the pure intellectual challenge will make it worthwhile, I promise 😉

I have marketing data with a treatment group and a control group (i.e., customers who receive no treatment). The event of interest (getting a loan) is relatively rare (<1%). My objective is to model the incremental lift between the response rates of the treatment and the control (treated-group book rate – control-group book rate) and use the model to decide whom to promote to in the future.

The treated group is large (600,000 records) and the control is about 15% the size.

This is a marketing exercise and we want to target those who need to be targeted to take the action of interest and not waste funds on those who will "do it anyway".

I have hundreds of variables and have experimented with various forms of uplift modelling, AKA Net Lift models. I have tried many of the state-of-the-art methods from the literature and common practice. Unfortunately, none of them was very stable on this data set.

I know (theoretically and after some experimentation) that there are a few variables that might affect the incremental lift. So I created a matrix of the combinations of the levels of these variables, recording for each combination the number of records in the treated group, the number in the control group, and the number of events of interest in each. From each row of the matrix one can therefore calculate the incremental lift. There are 84 rows in the matrix.

[Image: sample rows of the matrix of treated/control household and loan counts]

I was thinking of modelling this (difference in) proportion using a beta regression, but the counts in some rows are very sparse (perhaps no records in the control and, more frequently, no events of interest). This can be seen in the top couple of rows of the sample data above.
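As a toy illustration of the matrix and the sparseness problem (the column names match the fitness function below, but the counts here are invented):

```r
# Toy stand-in for a few rows of the 84-row matrix; the column names
# follow the fitness function in this post, the counts are made up.
dat <- data.frame(
  treatedHH    = c(5000, 1200, 300),
  TreatedLoans = c(60, 10, 0),
  controlHH    = c(800, 150, 0),
  ControlLoans = c(5, 1, 0)
)

# Incremental lift per row: treated response rate minus control response rate.
# A row with controlHH == 0 yields NaN, illustrating the sparseness problem.
dat$lift <- with(dat, TreatedLoans / treatedHH - ControlLoans / controlHH)
dat$lift
```

The first row gives a lift of 0.012 − 0.00625 = 0.00575, while the empty-control row produces NaN.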

I began thinking about searching for the optimal subset of rows of the matrix to select. For the selected rows, treatedHH and TreatedLoans are summed, along with the corresponding control counts. I am looking to maximise profit, which can be estimated from these numbers.

I pushed the data through a genetic algorithm to determine which rows to keep. I got a solution back, and the result was better than including everyone (the base case). But when I ran that selection on the validation sample I had partitioned off, the result did not hold up.

My question: is there a way to design cross-validation into this fitness function so that the solution does not overfit, which I presume happened in my first attempt?

Here is the fitness function I used:

    calcProfit <- function(selectVec = c())
    {
        # Sum the counts over the selected rows of the 84-row matrix
        TreatLoans   <- sum(selectVec * dat$TreatedLoans)
        ControlLoans <- sum(selectVec * dat$ControlLoans)
        TreatHH      <- sum(selectVec * dat$treatedHH)
        ControlHH    <- sum(selectVec * dat$controlHH)

        # Incremental response rate: treated rate minus control rate
        Incre.RR    <- (TreatLoans / TreatHH) - (ControlLoans / ControlHH)
        Incre.Loans <- Incre.RR * TreatHH
        Incre.Rev   <- Incre.Loans * 1400                     # revenue per incremental loan
        # Negated profit, since rbga.bin minimises the fitness;
        # 0.48 is the cost per treated household
        Incre.Profit <- (-1) * (Incre.Rev - (0.48 * TreatHH))

        Incre.Profit
    }

and the call in R (rbga.bin is from the genalg package):

    rbga.results <- rbga.bin(size = 84, zeroToOneRatio = 3, evalFunc = calcProfit, iters = 5000)
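One way to build cross-validation into the fitness itself is a sketch like the following (not the poster's code): split the underlying customers into K folds, rebuild the count matrix separately for each fold, and score a selection by a pessimistic summary of its profit across folds, so that selections which only pay off on part of the data score worse. `foldDat` and its two toy fold matrices are assumptions for illustration:

```r
# Hypothetical fold-specific count matrices: each data frame is the same
# row layout as the full matrix (here only 2 toy rows), rebuilt from a
# disjoint subset of the underlying customers.
foldDat <- list(
  data.frame(treatedHH = c(1000, 500), TreatedLoans = c(20, 5),
             controlHH = c(200, 100),  ControlLoans = c(2, 1)),
  data.frame(treatedHH = c(900, 600),  TreatedLoans = c(18, 9),
             controlHH = c(180, 120),  ControlLoans = c(1, 2))
)

# Profit of a selection on one count matrix (same arithmetic as calcProfit)
profitOn <- function(selectVec, d) {
  TreatLoans   <- sum(selectVec * d$TreatedLoans)
  ControlLoans <- sum(selectVec * d$ControlLoans)
  TreatHH      <- sum(selectVec * d$treatedHH)
  ControlHH    <- sum(selectVec * d$controlHH)
  Incre.RR <- TreatLoans / TreatHH - ControlLoans / ControlHH
  (-1) * (Incre.RR * TreatHH * 1400 - 0.48 * TreatHH)  # negated: rbga.bin minimises
}

# Cross-validated fitness: mean across folds plus a variance penalty,
# so unstable selections score worse (remember lower = better here).
calcProfitCV <- function(selectVec = c()) {
  scores <- sapply(foldDat, function(d) profitOn(selectVec, d))
  mean(scores) + sd(scores)
}
```

The mean-plus-sd summary is one choice among many; a maximum over folds would be even more pessimistic.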

Best Answer

Cross-validation will not eliminate over-fitting either; it will only (hopefully) reduce it. If you minimise any statistic with non-zero variance evaluated over a finite sample of data, there is a risk of over-fitting. The more choices you make, the greater the chance of over-fitting. The harder you try to minimise the statistic, the greater the chance of over-fitting, which is one of the problems with using GAs: a GA tries very hard to find the lowest minimum.

Regularisation is probably a better approach if predictive performance is what is important, as it involves fewer choices.
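A minimal sketch of that regularisation idea for the GA above (the penalty form and the `lambda` constant are assumptions, not part of the answer): wrap the existing fitness and charge a cost for every row a selection drops, so the search has to "buy" each deviation from the keep-everyone base case.

```r
# Regularised fitness: baseFitness is the original (negated-profit)
# fitness function, lambda a hypothetical tuning constant in the same
# cost units. Since rbga.bin minimises, the penalty makes dropping
# rows costly unless the profit gain outweighs it.
regularisedFitness <- function(selectVec, baseFitness, lambda = 200) {
  baseFitness(selectVec) + lambda * sum(selectVec == 0)
}

# Usage with the question's calcProfit (lambda itself would be tuned,
# e.g. on a held-out sample):
# rbga.bin(size = 84, zeroToOneRatio = 3, iters = 5000,
#          evalFunc = function(v) regularisedFitness(v, calcProfit))
```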

Essentially in statistics, optimisation is the root of all over-fitting, so the best way to avoid over-fitting is to minimise the amount of optimisation you do.
