Back-testing or cross-validating when the model-building process was interactive

cross-validation, modeling, outliers, overfitting, splines

I have some predictive models whose performance I would like to back-test (i.e., take my dataset, "rewind" it to a previous point in time, and see how the model would have performed prospectively).

The problem is that some of my models were built via an interactive process. For instance, following the advice in Frank Harrell's Regression Modeling Strategies, in one model I used restricted cubic splines to handle possible nonlinear associations between features and the response. I allocated the degrees of freedom of each spline based on a combination of domain knowledge and univariate measures of strength of association. But the degrees of freedom I want to allow my model obviously depend on the size of the dataset, which varies dramatically when backtesting. If I don't want to hand-pick degrees of freedom separately for each time point at which the model is backtested, what are my other options?
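To make the kind of rule I'd otherwise hand-tune concrete, here is a minimal sketch of what automating that allocation might look like (in Python; the one-df-per-25-observations cap and the Spearman weighting are placeholder heuristics, not anything I've validated):

```python
import numpy as np
from scipy import stats

def spline_df(x, y, floor=3, ceiling=5, obs_per_df=25):
    """Pick restricted-cubic-spline degrees of freedom for one predictor.

    Placeholder rule: the sample size caps how many df we can afford
    (roughly one df per `obs_per_df` observations), and the univariate
    Spearman correlation decides how much of that budget this predictor
    earns. All three knobs are assumptions to be set from domain knowledge.
    """
    n = len(x)
    rho, _ = stats.spearmanr(x, y)
    size_cap = int(np.clip(n // obs_per_df, floor, ceiling))
    return floor + round((size_cap - floor) * abs(rho))
```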

For another example, I'm currently working on outlier detection by finding points with high leverage. If I were happy to do this by hand, I would simply look at each high-leverage data point, sanity-check that the data were clean, and either filter the point out or clean it up by hand. But this relies on a bunch of domain knowledge, so I don't know how to automate the process.
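For illustration, the automatable half of that by-hand step might look like this (OLS here is just a stand-in for my actual models, and the 2p/n cutoff is one common rule of thumb; the "sanity-check the data" part is what I don't know how to script):

```python
import numpy as np
import statsmodels.api as sm

def flag_high_leverage(X, y, multiplier=2.0):
    """Flag observations whose hat value exceeds multiplier * p / n.

    Returns the indices of flagged rows plus all hat values, so the
    flagged points can still be inspected rather than silently dropped.
    """
    design = sm.add_constant(X)
    fit = sm.OLS(y, design).fit()
    hat = fit.get_influence().hat_matrix_diag
    threshold = multiplier * design.shape[1] / len(y)
    return np.flatnonzero(hat > threshold), hat
```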

I would appreciate advice on (a) the general problem of automating the interactive parts of the model-building process, and (b) these two cases in particular. Thanks!

Best Answer

FYI, this might be more appropriate for SE.DataScience, but for the time being, I'll answer it here.

It seems to me that you may be in a situation where you have no choice but to write a script implementing your solutions. I've never worked with splines, so my knowledge of them is strictly theoretical; please bear with me and let me know if there is anything I'm not seeing.

Broadly speaking, you have a couple of different items to resolve in order to implement this.

1.) Determining the model parameters in a dynamic fashion. You mentioned that you've used a combination of domain knowledge and univariate measures; that seems like something you should be able to handle heuristically. You will have to agree at the outset on a set of rules that your program will implement. This may or may not be a trivial task, as you will have to do some hard thinking about the potential implications of those rules. It may require you to revisit every step of your process and catalogue not just the decisions, but also the reasons behind them (see the first sketch after this list).

2.) Actually implementing your program. To make your performance testing properly dynamic and easy to maintain and modify going forward, you will have to think about how to structure it. You will likely want a loop for your main predictive-performance estimation, preferably with a user-definable length for greater flexibility. You will also want to write a separate function for each action your program takes, as this makes it easier to test functionality and to maintain and modify the program later. At a minimum, you will likely need functions for:

- dataset selection (i.e., only time periods that have "gone by" at the moment of backtesting);
- cleaning and validation (which you'll really have to think about, as data munging is a critical part of model building);
- choosing model training parameters;
- model prediction; and
- performance-measure collection and storage.

A skeleton along these lines is sketched in the second example below.
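For the first point, one way to keep that catalogue honest is to encode each decision as data rather than burying it in the code. A toy sketch, where the rule names, cutoffs, and reasons are all invented placeholders:

```python
# Hypothetical decision catalogue: each entry records the rule the
# program will apply and the reasoning behind it, so the script stays
# auditable instead of becoming a pile of magic numbers.
RULES = {
    "spline_df": {
        "value": lambda n: 5 if n >= 2000 else 3,
        "reason": "More knots only when the training window is large "
                  "enough to support them.",
    },
    "min_train_size": {
        "value": lambda n: 100,
        "reason": "Below this, skip the backtest period entirely rather "
                  "than fit an unstable model.",
    },
}

def decide(rule, n):
    """Apply a catalogued rule at training-set size n."""
    return RULES[rule]["value"](n)
```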
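For the second point, the overall shape might be something like the skeleton below. A linear model and the column names `t` and `y` are stand-ins for your actual setup, and each helper corresponds to one of the functions listed above:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def select_window(df, start=None, end=None):
    """Keep only rows whose timestamp lies in [start, end): i.e., data
    that had "gone by" at the moment being backtested."""
    mask = pd.Series(True, index=df.index)
    if start is not None:
        mask &= df["t"] >= start
    if end is not None:
        mask &= df["t"] < end
    return df[mask]

def clean(df):
    """Placeholder for your munging/validation rules."""
    return df.dropna()

def backtest(df, cutoffs, horizon, features, min_train=100):
    """Expanding-window loop: at each cutoff, train on the past and
    score on the next `horizon` worth of data."""
    rows = []
    for cutoff in cutoffs:
        train = clean(select_window(df, end=cutoff))
        test = select_window(df, start=cutoff, end=cutoff + horizon)
        if len(train) < min_train or test.empty:
            continue  # not enough history yet at this cutoff
        model = LinearRegression().fit(train[features], train["y"])
        mae = mean_absolute_error(test["y"], model.predict(test[features]))
        rows.append({"cutoff": cutoff, "n_train": len(train), "mae": mae})
    return pd.DataFrame(rows)
```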

Your question about outlier detection and handling also falls under those two concerns. I would implement it by writing smaller loops within your main program loop that continue to "clean" the data and refit the model until they reach a point you would be happy with (which, again, you'll have to define yourself).
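As a sketch of that inner loop (the hat-value threshold and the round cap are placeholders for whatever "happy with it" ends up meaning in your domain):

```python
import numpy as np

def hat_values(X):
    # Diagonal of the hat matrix X (X'X)^+ X'.
    return np.einsum("ij,jk,ik->i", X, np.linalg.pinv(X.T @ X), X)

def clean_by_leverage(X, y, multiplier=3.0, max_rounds=5):
    """Repeatedly drop rows whose leverage exceeds multiplier * p / n,
    refitting on the survivors, until nothing is flagged or the round
    cap is hit. Both knobs stand in for a domain decision."""
    keep = np.ones(len(y), dtype=bool)
    for _ in range(max_rounds):
        h = hat_values(X[keep])
        flagged = h > multiplier * X.shape[1] / keep.sum()
        if not flagged.any():
            break
        keep[np.flatnonzero(keep)[flagged]] = False
    return X[keep], y[keep]
```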

If this sounds like a big task, it's because it is; people have written entire software libraries (sometimes very lucratively) to perform this sort of task. Beyond that, it's hard to offer more specific advice without knowing more about your processes, your data structure, and the programming language you've been working in so far.

If any of this is useful to you and you'd like me to expand on any of it, leave a comment and I'd be more than happy to do so.
