Solved – Three questions about support vector regression: feature pre-processing, time-series issues, and marginal accuracy contribution of each feature

Tags: regression, svm, time series

I am familiar with using support vector machines as the base classifier in a Python re-implementation of Poselets, but I am new to using support vector regression (as opposed to classification) for time-series data.

In my data set, I have a vector of target variables $y_{i,t}$ and a matrix of predictor variables (features) $X_{i,t}$ where $i$ denotes the individual in the population (rows of observations), and I have these cross-sections of targets and predictors for many time periods $t$. I use the notation $x_{i,t}$ to specify the entire set of features (entire row) for individual $i$ at time $t$. The notation $x^{j}_{t}$ would denote the column with the $j$-th feature across all individuals at time $t$. So $X_{i,t}^{j}$ is the $(i,j)$ entry of the feature matrix at time $t$.
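To make the layout concrete, here is a minimal pandas sketch of how such a panel might be held (the sizes and column names are placeholders, not part of my actual data):

```python
import numpy as np
import pandas as pd

# Illustrative panel indexed by (individual i, time t), with K feature columns and a target y.
rng = np.random.default_rng(0)
individuals, periods, K = 50, 12, 4
index = pd.MultiIndex.from_product([range(individuals), range(periods)], names=["i", "t"])
feature_cols = [f"x{j}" for j in range(1, K + 1)]
panel = pd.DataFrame(rng.normal(size=(len(index), K)), index=index, columns=feature_cols)
panel["y"] = rng.normal(size=len(index))

# The cross-section at a single time period t = 5: feature matrix X_t and target vector y_t.
X_t = panel.xs(5, level="t")[feature_cols].to_numpy()
y_t = panel.xs(5, level="t")["y"].to_numpy()
```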

In my case, all of the predictors are continuous variables, though I can imagine some categorical predictors that could be useful. We can also assume the time frequency is fixed for all target and predictor variables.

My question is threefold:

(1) Is there a set of generally accepted best practices for pre-processing the predictors (scoring, normalizing, rescaling, etc.) to provide more sensible inputs to the support vector regression step? (E.g., can you point me to a reference for this? I've already looked in the Smola and Schölkopf tutorial, the Bishop book, and Duda, Hart, and Stork, but found no useful guides or discussions of practical pre-processing concerns.)

(2) Are there similar references or best practices for handling time-series inputs to support vector regression? For example, in a linear regression setting, people are sometimes worried about cross-sectional correlation of error terms or predictor variables. One common treatment (e.g. Fama-MacBeth regression) is to run the cross-sectional regression separately in each time period and average the coefficient estimates across the time dimension; under assumptions of no autocorrelation, these averages can be shown to be unbiased estimators of the coefficients even in the presence of cross-sectional correlation.
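For concreteness, a minimal numpy sketch of that Fama-MacBeth style procedure (the function name and the list-of-matrices data layout are just illustrative assumptions):

```python
import numpy as np

def fama_macbeth(X_by_t, y_by_t):
    """Illustrative Fama-MacBeth style estimator.

    X_by_t: list of (n x K) feature matrices, one per time period.
    y_by_t: list of length-n target vectors, one per time period.
    Runs an OLS cross-sectional regression in each period and
    averages the coefficient estimates across periods.
    """
    betas = []
    for X_t, y_t in zip(X_by_t, y_by_t):
        # Add an intercept column and solve the period-t cross-sectional OLS.
        Z = np.column_stack([np.ones(len(y_t)), X_t])
        beta_t, *_ = np.linalg.lstsq(Z, y_t, rcond=None)
        betas.append(beta_t)
    betas = np.vstack(betas)
    # Time-series average of the per-period estimates, with naive standard errors.
    return betas.mean(axis=0), betas.std(axis=0, ddof=1) / np.sqrt(len(betas))
```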

I've never encountered places where support vector methods were described as sensitive to cross-sectional correlation between predictors or error terms. Is this a concern? How is it handled in support vector modeling?

At one extreme, I could imagine doing the following: for a given time period $t^{*}$, not only use the predictor variables observed at that same time period, $X_{i,t^{*}}$, but also augment this set of feature vectors with lagged versions of the variables, for whatever trailing time periods I believe are relevant. And in the scoring step of (1) I could perhaps apply a decaying weight function so that newer observations are given more weight.
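To make that concrete, a minimal sketch of the lag-augmentation idea (the function name, the dict-of-matrices layout, and the exponential decay weights are all illustrative assumptions, not a recommendation):

```python
import numpy as np

def augment_with_lags(X_by_t, t_star, n_lags, decay=0.5):
    """Sketch: augment the period-t* feature matrix with down-weighted lagged copies.

    X_by_t: dict mapping time index t -> (n x K) feature matrix,
            with the same individuals in the same row order each period.
    Returns an (n x K*(n_lags+1)) matrix with blocks
    [X_{t*}, w_1 * X_{t*-1}, ..., w_L * X_{t*-L}] where w_l = decay**l,
    so older lags contribute less before the SVR is fitted.
    """
    blocks = [X_by_t[t_star]]
    for lag in range(1, n_lags + 1):
        blocks.append((decay ** lag) * X_by_t[t_star - lag])
    return np.hstack(blocks)
```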

But to me this sounds like a bad idea. For one, the predictors $X_{i,t}$ are generally chosen because of domain-specific knowledge, intuition, or some other source of prior information about what might be a successful predictor. Just dumping in all lagged versions of the same thing feels a bit too much like data mining, and I worry it would lead to serious overfitting. For another, even with support vector methods there is still a goal of having a parsimonious model, so surely I would not want to spend degrees of freedom on a huge set of lagged versions of the predictor variables.

Given all of this, what are some examples of accepted or practically useful support vector regression treatments of time series data? How do they address these problems? (Again, a reference is desirable. I'm not looking for someone to write a graduate thesis as an answer here.)

(3) For the model being fitted, there are definitely going to be comparisons drawn with more classical models, like plain OLS or GLS. Those models have two very nice features. First, the degrees of freedom are very transparent, because there are generally no hyperparameters to tune and no pre-processing beyond linear scoring of the input variables. Second, after the regression function is fitted, it is straightforward to attribute the accuracy of the model to its different linear components.

Are there analogues for these with support vector regression? When using kernel functions, like a Gaussian RBF, to transform the inputs, is it fair to say that the only extra degrees of freedom introduced are whatever hyperparameters govern the kernel's functional form? That feels a bit wrong to me, because you're effectively allowing yourself to explore a whole space of unvetted transformations of the data, which isn't really captured in just the functional form of the kernel function. How can you fairly penalize a model like this for having much more freedom to overfit the data with non-linear transformations?

And lastly, are there analogous ways to decompose the accuracy of a fitted support vector model? My first thought was that you would somehow need to measure something like the mutual information between the model trained with predictor $x^{j}$ (the $j$-th column of the feature inputs) and the model trained without that one predictor. This measure would, in some sense, be the marginal contribution to accuracy from using that predictor. But it would require fitting at least $K+1$ models (the full model and $K$ different leave-one-predictor-out models, if there are $K$ predictors). And if you include lots of lagged variables, $K$ might be very large, and you'd be performing some type of cross-validation at each step.
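A minimal scikit-learn sketch of that leave-one-predictor-out scheme, which also makes the $K+1$ fits explicit (the MinMaxScaler, the RBF kernel, and cross-validated $R^2$ as the accuracy measure are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

def loo_predictor_contribution(X, y, cv=5):
    """Fit K+1 cross-validated SVR models: the full model plus one model per dropped feature.

    Returns the full-model CV R^2 and, for each feature j, the drop in CV R^2
    when feature j is removed, a rough measure of its marginal contribution.
    """
    def cv_score(X_sub):
        model = make_pipeline(MinMaxScaler(), SVR(kernel="rbf"))
        return cross_val_score(model, X_sub, y, cv=cv).mean()

    full = cv_score(X)
    contributions = {}
    for j in range(X.shape[1]):
        X_without_j = np.delete(X, j, axis=1)  # leave predictor j out
        contributions[j] = full - cv_score(X_without_j)
    return full, contributions
```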

It seems like getting that kind of marginal contribution statistic is vastly more expensive for a support vector model. Are there references on alternative summary statistics that capture the same kind of marginal contribution?

Best Answer

I can answer a couple of your questions, so here goes.

(1) Is there a set of generally accepted best practices for pre-processing the predictors (scoring, normalizing, rescaling, etc.) to provide more sensible inputs to the support vector regression step?

A general rule of thumb for SVM/SVR is to scale all inputs to the same interval. Common choices are $[-1,1]$ and $[0,1]$. The actual interval doesn't matter much, as long as all inputs are scaled to the same one. This prevents some input dimensions from completely dominating others when the kernel function is evaluated. I have no reference specific to regression, but for classification this is listed in the LIBSVM practical guide (and it holds for regression too).
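A minimal scikit-learn sketch of that rescaling step (the synthetic data and the particular SVR hyperparameters are just placeholders):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5)) * [1, 10, 100, 1000, 0.01]  # wildly different feature scales
y_train = X_train @ rng.normal(size=5) + rng.normal(size=200)

# MinMaxScaler maps each feature to [0, 1] using the training data only;
# the pipeline re-applies the same transformation at prediction time,
# so no single dimension dominates the RBF kernel evaluation.
model = make_pipeline(MinMaxScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X_train, y_train)
predictions = model.predict(X_train[:5])
```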

When using kernel functions, like a Gaussian RBF, to transform the inputs, is it fair to say that the only extra degrees of freedom introduced are whatever hyperparameters govern the kernel's functional form? That feels a bit wrong to me, because you're effectively allowing yourself to explore a whole space of unvetted transformations of the data, which isn't really captured in just the functional form of the kernel function. How can you fairly penalize a model like this for having much more freedom to overfit the data with non-linear transformations?

When using SVM (or SVR), the degrees of freedom are in fact bounded by the number of training instances. Each training instance can become a support vector and, as such, contribute to the separating hyperplane/regressor. Although this may seem bad, it is exactly why SVMs work in infinite-dimensional feature spaces, for example with an RBF kernel: the actual number of degrees of freedom is always finite.
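To see this in practice, a small sketch (synthetic data, arbitrary hyperparameters) that counts how many training instances end up as support vectors of a fitted SVR:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 3))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=300)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
# The fitted model is a finite sum over support vectors, never more than the
# number of training rows, regardless of the infinite-dimensional RBF feature
# space. Widening epsilon typically reduces this count.
print(svr.support_vectors_.shape[0], "support vectors out of", X.shape[0], "training instances")
```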
