Solved – Pipeline and data snooping in scikit-learn

cross-validation, data preprocessing, elastic net, scikit-learn

Working with the scikit-learn library for Python, consider a linear regression model such as the elastic net (the ElasticNet class).

Further assume that one wishes to work with a normalised feature space, for whatever reason. Two options naturally come to mind:

  1. Instantiate an ElasticNet object with the normalize argument set to True (if one also sets the fit_intercept argument, it must not be False, because normalize is ignored when fit_intercept=False; see the relevant docstring)

  2. Create a Pipeline consisting of a Normalizer (pre-processing step) and an ElasticNet with the normalize argument set to False (a minimal sketch of both options follows right after this list).
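
For concreteness, here is a sketch of the two options, using the scikit-learn API as it was at the time of writing (ElasticNet's normalize argument has since been deprecated in later releases):

```python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Option 1: let the estimator normalise internally
# (normalize is ignored unless fit_intercept=True).
enet_builtin = ElasticNet(fit_intercept=True, normalize=True)

# Option 2: make the normalisation an explicit pipeline step;
# the ElasticNet keeps its default normalize=False.
pipe = Pipeline([
    ("normalizer", Normalizer()),
    ("enet", ElasticNet(fit_intercept=True)),
])
```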

The two approaches are similar; however, the user community seems to prefer the second option.

This is because, when cross-validation is applied to a pipeline object rather than to a model object, for instance through cross_val_score(pipe, X, y), the feature-space preprocessing becomes part of the full learning process (i.e. it is re-fitted and applied within each CV fold).
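
As a hedged sketch (the synthetic data is only there to make the snippet self-contained):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

pipe = Pipeline([
    ("normalizer", Normalizer()),
    ("enet", ElasticNet()),
])

# cross_val_score clones and fits the whole pipeline on each training fold,
# so the preprocessing never sees the corresponding held-out fold.
scores = cross_val_score(pipe, X, y, cv=5)
```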

Now, suppose that instead of working with the 'naive' elastic net, one were to work with an elastic net whose hyper-parameters are determined by cross-validation (for instance, the ElasticNetCV class).

In that case, option 2 above does not seem to be the right way to go. More specifically, since the normaliser is fitted on the whole training set, the internal cross-validation (which determines the hyper-parameters) works with folds that have been normalised using data from outside each fold, which amounts to data snooping.

In other words, the pipeline approach seems fine for simple cross-validation but could be dangerous for nested cross-validation, since it could produce optimistically biased cross-validation scores.
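
Concretely, the worrying construction would look something like this (a sketch; StandardScaler stands in for any preprocessing step that genuinely learns statistics from the data):

```python
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_cv = Pipeline([
    # Fitted once on the entire training set passed to pipe_cv.fit(X, y)...
    ("scaler", StandardScaler()),
    # ...so the internal CV that selects alpha splits data that has already
    # been scaled using statistics from all of its internal folds.
    ("enet_cv", ElasticNetCV(cv=5)),
])
```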

Can someone confirm this or am I missing something?

Best Answer

First, just a note that ElasticNet's normalize=True actually isn't quite the same as Normalizer: it first centers the data (subtracting each feature's training-set mean), then scales each centered feature column, not each data point, to unit norm.

If you do a pipeline of Normalizer followed by ElasticNet(fit_intercept=True), it will actually normalize the data points to unit norm in the original space, then center the normalized data (which is a little weird).

Since ElasticNet always centers its inputs when you have fit_intercept=True, if you do StandardScaler(with_std=False) (which just centers), Normalizer, and then ElasticNet(fit_intercept=True) you'll actually center, normalize, and then re-center – you end up with slightly different data inside the model, though the overall model should be the same.

If you were only normalizing (replacing each data point $X_i$ with $X_i / \lVert X_i \rVert$), the transformation is independent of the other data, so the CV folds don't matter. Centering, though, is not data-independent.
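
A quick numerical illustration of that distinction (a sketch with synthetic data):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

rng = np.random.RandomState(0)
X_train, X_valid = rng.randn(80, 5), rng.randn(20, 5)
X_full = np.vstack([X_train, X_valid])

# Normalizer rescales each row by its own norm, so fitting on the full data
# or on the training data alone yields identical transforms.
n_train = Normalizer().fit(X_train)
n_full = Normalizer().fit(X_full)
print(np.allclose(n_train.transform(X_valid), n_full.transform(X_valid)))  # True

# Centering depends on which rows the mean was computed from.
c_train = StandardScaler(with_std=False).fit(X_train)
c_full = StandardScaler(with_std=False).fit(X_full)
print(np.allclose(c_train.transform(X_valid), c_full.transform(X_valid)))  # False in general
```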

So, you're correct that centering before ElasticNetCV will center the data based on the whole dataset, and thus technically the elastic net's CV is "cheating." To be totally correct, you should use normalize=True on the ElasticNetCV; if you want to do some other kind of preprocessing, you won't be able to (as far as I know) use ElasticNetCV properly at all. Honestly, the whole CV machinery in scikit-learn is not a great fit for cases that are at all complicated, and I often find myself rolling my own CV loops to handle these issues – but it's hard to do that while still taking advantage of the efficiency gains in ElasticNetCV.
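
For what it's worth, a hand-rolled loop that keeps the preprocessing inside each fold might look roughly like this (a sketch over a plain grid of alpha values, without ElasticNetCV's regularization-path speedups; manual_cv_scores is just an illustrative helper, not a library function):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

def manual_cv_scores(X, y, alphas, n_splits=5):
    """Mean validation R^2 per alpha, fitting the scaler on each training fold only."""
    scores = {alpha: [] for alpha in alphas}
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in kf.split(X):
        scaler = StandardScaler().fit(X[train_idx])   # no held-out rows involved
        X_tr, X_va = scaler.transform(X[train_idx]), scaler.transform(X[valid_idx])
        for alpha in alphas:
            model = ElasticNet(alpha=alpha).fit(X_tr, y[train_idx])
            scores[alpha].append(model.score(X_va, y[valid_idx]))
    return {alpha: np.mean(fold_scores) for alpha, fold_scores in scores.items()}
```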

In practice, as long as your dataset isn't tiny, I wouldn't really worry about the difference. Centering tends to be very stable across CV folds, and it's unlikely that your linear model's performance is going to be sensitive to the very small difference between centering on the full dataset and centering on 9/10ths of it.

The only parameter being estimated is $\hat \mu$; with $k$-fold CV on $n$ data points, the data snooping changes the estimate from $$\hat \mu_\text{train} = \frac{k}{n (k-1)} \sum_{i \notin \text{ fold } k} X_i$$ to \begin{align} \hat \mu_\text{all} &= \frac{1}{n} \sum_{i} X_i \\&= \frac{1}{n} \sum_{i \notin \text{ fold } k} X_i + \frac{1}{n} \sum_{i \in \text{ fold } k} X_i \\&= \frac{k-1}{k} \hat\mu_\text{train} + \frac{1}{k} \hat\mu_\text{validation} .\end{align}

Since $\hat\mu_\text{train}$ and $\hat\mu_\text{validation}$ are going to be extremely similar anyway unless your sample size is small compared to your dimension, $\hat\mu_\text{all}$ is going to be very close to $\hat\mu_\text{train}$, and the difference is not something your model is likely to be able to exploit.
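
That identity is easy to sanity-check numerically (a small sketch):

```python
import numpy as np

rng = np.random.RandomState(0)
n, k = 1000, 10
X = rng.randn(n, 3)
fold = rng.permutation(n) < n // k          # boolean mask for one held-out fold

mu_train = X[~fold].mean(axis=0)
mu_valid = X[fold].mean(axis=0)
mu_all = X.mean(axis=0)

# mu_all equals (k-1)/k * mu_train + 1/k * mu_valid, up to floating-point error.
print(np.allclose(mu_all, (k - 1) / k * mu_train + 1 / k * mu_valid))  # True
```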