Correct cross-validation procedure for single model applied to panel data

classification, cross-validation, panel data, predictive-models, time series


What is the correct CV procedure for panel data? I've been thinking of the problem as cross-validating a model fit to multiple time series data.

Is the "population informed" CV procedure the correct one to use? (link at bottom of this question). How would the CV statistic be calculated in this case?


I am doing a classification/prediction modelling exercise using a random forest and am using CV to tune my hyperparameters. I have panel data that consists of weekly data for 3 countries over a whole year. For illustration, data for the first two weeks for these countries has this structure ("lag" variables are lagged within the country):

Country      Week   lagUnemployment  Event  lagEvent
AUS          1      N/A              1      N/A
AUS          2      5                0      1
GER          1      N/A              1      N/A
GER          2      2                1      1
USA          1      N/A              0      N/A
USA          2      4                0      0

Essentially, it's panel data.

I am applying a single random forest fit to this data to make predictions about whether "Event" will occur using only the lagged variables i.e. the model does not know which country or week any given row comes from. An implication, for example, is that lagUnemployment from GER can help the model predict whether there will be an event in USA.

I would like to tune the hyperparameters of my model using cross-validation but am not sure how to apply cross-validation correctly, since my data consists of multiple time series (at least that's how I'm thinking about it).


For validating a single time series the approach in the following answer is applicable (we should use nested CV rather than K-fold in order to respect temporal correlation in the error terms):

I've also looked at how to fit one model for multiple series here:

BUT this does not go into how to perform cross-validation.

A similar scenario to mine is described here, where the nested cross-validation is "population informed" (i.e. in my case, a time series CV which accounts for country boundaries too):

However, since my model is agnostic with respect to countries, I am not sure if this is needed.

Best Answer

I am facing a similar situation in finance. One way to go about it is to add Country as a feature, which would be similar to a fixed effects model, and then do standard time series cross-validation. This is similar to what the Medium post refers to as "regular" nested cross-validation. I think this is appropriate for your problem.
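As a rough illustration of that first approach, here is a minimal sketch of forward-chaining (expanding-window) splits on stacked panel rows, with Country one-hot encoded as an extra feature. The function names, the toy 3-country by 8-week layout, and the choice of equal-sized week blocks are all assumptions made for the example, not something taken from the linked posts.

```python
from typing import List, Sequence, Tuple

Split = Tuple[List[int], List[int]]


def forward_chaining_week_splits(weeks: Sequence[int], n_splits: int) -> List[Split]:
    """Expanding-window splits over weeks for stacked panel rows.

    All rows of a given week (whatever the country) land on the same side,
    and every test block lies strictly after its training block in time.
    Trailing weeks that do not fill a whole block are left unused.
    """
    unique_weeks = sorted(set(weeks))
    block = len(unique_weeks) // (n_splits + 1)
    if block < 1:
        raise ValueError("not enough distinct weeks for n_splits")
    splits = []
    for k in range(1, n_splits + 1):
        train_weeks = set(unique_weeks[: k * block])
        test_weeks = set(unique_weeks[k * block : (k + 1) * block])
        train_idx = [i for i, w in enumerate(weeks) if w in train_weeks]
        test_idx = [i for i, w in enumerate(weeks) if w in test_weeks]
        splits.append((train_idx, test_idx))
    return splits


def one_hot_country(countries: Sequence[str]) -> List[List[int]]:
    """One 0/1 column per country, in alphabetical order of the country codes."""
    levels = sorted(set(countries))
    return [[1 if c == level else 0 for level in levels] for c in countries]


# Toy panel (made up for illustration): 3 countries x 8 weeks, stacked by country.
countries = [c for c in ("AUS", "GER", "USA") for _ in range(8)]
weeks = [w for _ in range(3) for w in range(1, 9)]

splits = forward_chaining_week_splits(weeks, n_splits=3)
country_dummies = one_hot_country(countries)  # append these columns to the lagged features
```

Each hyperparameter setting would then be fit on every training index set and scored on the matching test set, with the CV statistic taken as the mean of the per-fold scores. Index pairs in this form can also be supplied as an iterable to the `cv` argument of scikit-learn's `GridSearchCV`.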

The second approach, more appropriate for wide panel data (data from every country, for example), would be to use time series cross-validation but, within each period of the training set, withhold the data for, say, 1/3 of the countries and add it to the test set. In this case the model has to learn to fit both cross-sectionally and across time.
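One possible reading of this second approach, sketched under assumptions: forward-chaining splits over weeks where, in each fold, a rotating third of the countries is dropped from the training period and its rows are moved to the test set. The function name, the rotation rule for choosing held-out countries, and the toy 6-country panel are hypothetical choices for the example.

```python
from typing import List, Sequence, Tuple

HoldoutSplit = Tuple[List[int], List[int], List[str]]


def panel_holdout_splits(
    countries: Sequence[str],
    weeks: Sequence[int],
    n_splits: int,
    holdout_fraction: float = 1 / 3,
) -> List[HoldoutSplit]:
    """Forward-chaining splits that also withhold some countries from training.

    Train: rows in the training weeks from countries that are not held out.
    Test:  all rows in the later test weeks, plus the held-out countries'
           rows from the training weeks.
    """
    unique_weeks = sorted(set(weeks))
    unique_countries = sorted(set(countries))
    block = len(unique_weeks) // (n_splits + 1)
    if block < 1:
        raise ValueError("not enough distinct weeks for n_splits")
    n_hold = max(1, round(len(unique_countries) * holdout_fraction))
    splits = []
    for k in range(1, n_splits + 1):
        train_weeks = set(unique_weeks[: k * block])
        test_weeks = set(unique_weeks[k * block : (k + 1) * block])
        # Assumption: rotate the held-out group from fold to fold.
        start = (k - 1) * n_hold
        held_out = {
            unique_countries[(start + j) % len(unique_countries)]
            for j in range(n_hold)
        }
        train_idx, test_idx = [], []
        for i, (c, w) in enumerate(zip(countries, weeks)):
            if w in train_weeks and c not in held_out:
                train_idx.append(i)
            elif w in test_weeks or w in train_weeks:
                test_idx.append(i)
        splits.append((train_idx, test_idx, sorted(held_out)))
    return splits


# Toy wide-ish panel (made up for illustration): 6 countries x 8 weeks.
codes = ("AUS", "CAN", "FRA", "GER", "JPN", "USA")
countries = [c for c in codes for _ in range(8)]
weeks = [w for _ in codes for w in range(1, 9)]

holdout_splits = panel_holdout_splits(countries, weeks, n_splits=3)
```

The held-out countries' rows from the training weeks are contemporaneous with the training data, so they probe cross-sectional generalization, while the later weeks probe forecasting. It may be worth scoring the two parts of each test set separately before averaging.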

I am not enthusiastic about the Population-Informed approach in your case, since observations across individuals within the same time period are probably not independent. For example, macroeconomic shocks that affect the USA could also affect Germany contemporaneously.
