Solved – Cross-validating for model parameters with time series

cross-validation, model selection, regularization, time series

The context of this question is time series forecasting using regression, with multivariate training data. With a regularization method like LARS w/ LASSO, elastic net, or ridge, we need to choose the model complexity or regularization parameter: for example, the ridge penalty $\lambda$, or the number of steps to take along the LARS w/ LASSO path before reaching the OLS solution.

My first instinct is to use cross-validation to infer a decent value of the regularization parameter. For LARS w/ LASSO, I would infer the (effective) degrees of freedom that minimizes some loss such as the mean absolute error $\frac{1}{n}\sum_{i{\le}n}|\hat{y}_i-y_i|$. However, with time series data we should cross-validate out-of-sample. (No peeking into the future!) Say there are two feature time series $x_1$ and $x_2$ and I am forecasting a time series $y$. For each time step $t$, train with $x_{1,1}$ through $x_{1,t}$ and $x_{2,1}$ through $x_{2,t}$, then forecast $\hat{y}_{t+1}$ and compare it with the actual $y_{t+1}$.
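A minimal sketch of this expanding-window ("rolling origin") procedure, here applied to choosing a ridge penalty $\lambda$ (the data, the candidate grid, and the `start` cutoff are all illustrative assumptions, not part of the question):

```python
import numpy as np
from numpy.linalg import solve

rng = np.random.default_rng(0)
T = 60
X = rng.normal(size=(T, 2))                      # two feature series x1, x2
y = X @ np.array([1.5, -0.7]) + rng.normal(scale=0.5, size=T)

def ridge_fit(X, y, lam):
    # closed-form ridge estimate: (X'X + lam*I)^{-1} X'y
    p = X.shape[1]
    return solve(X.T @ X + lam * np.eye(p), X.T @ y)

def expanding_window_mae(X, y, lam, start=10):
    # at each t, train on observations 1..t and forecast observation t+1
    errs = []
    for t in range(start, len(y)):
        beta = ridge_fit(X[:t], y[:t], lam)      # no peeking past t
        y_hat = X[t] @ beta                      # one-step-ahead forecast
        errs.append(abs(y_hat - y[t]))
    return float(np.mean(errs))

lambdas = [0.01, 0.1, 1.0, 10.0]
scores = {lam: expanding_window_mae(X, y, lam) for lam in lambdas}
best = min(scores, key=scores.get)               # lambda with lowest MAE
```

Each candidate $\lambda$ is scored only on forecasts made with data available at the time, which is the out-of-sample discipline the question asks for.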

This framework makes sense from an out-of-sample perspective, but I worry that the earlier cross-validation steps (low $t$), which use much less training data, will be overemphasized when all steps are averaged with equal weight. Should those first few steps be suppressed when inferring the (regularization) model parameters? I might prefer a model complexity (regularization) level that "did better" on the later cross-validation steps, the ones using more training data.
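One hedged way to act on this worry is to down-weight the early, data-poor steps rather than drop them. The sketch below weights each step's error by its training-set size $t$; the linear-in-$t$ weighting is purely an illustrative choice, not a recommendation from the question or answer:

```python
import numpy as np

def weighted_cv_score(abs_errors, start=10):
    # abs_errors[k] is the forecast error after training on the first
    # (start + k) observations; later steps get proportionally more weight
    t = np.arange(start, start + len(abs_errors))
    w = t / t.sum()                    # weights sum to 1
    return float(np.sum(w * np.asarray(abs_errors)))
```

With these weights, an error made on the last, data-rich fold penalizes a candidate parameter more than the same error made on the first fold.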

Best Answer

You can set a minimum number of observations that you think you need to fit your model, and exclude steps with $n$ below that number from cross-validation. Obviously you can't fit a model using just the first sample, and you can't really fit one using the first two. At some reasonable point (5? 10?) you'll have enough observations to fit a valid model, so start the cross-validation there.
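The suggestion above can be sketched as a split generator that simply skips any step whose training window falls below a chosen minimum (the threshold of 10 here is a judgment call, as the answer says, not a rule):

```python
def usable_splits(n_obs, min_train=10):
    # yield (train_size, test_index) pairs for one-step-ahead forecasts,
    # starting only once min_train observations are available for training
    for t in range(min_train, n_obs):
        yield t, t                     # train on indices [0, t), test on index t
```

Any model-fitting loop can then iterate over `usable_splits(len(y))` instead of starting at $t = 1$, so the unstable early fits never enter the average.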