Solved – Pipeline and data snooping in scikit-learn

cross-validation, data preprocessing, elastic net, scikit-learn

Working with the scikit-learn library for Python, consider a linear regression model such as the elastic net (the ElasticNet class).

Further assume that one wishes to work with a normalised feature space, for whatever reason. Two options naturally come to mind:

  1. Instantiate an ElasticNet object with the normalize argument set to True (if one also sets the fit_intercept argument, it must not be False, because normalize is ignored when fit_intercept=False; see the relevant docstring)

  2. Create a Pipeline consisting of a Normalizer (pre-processing step) and an ElasticNet with the normalize argument set to False (a minimal sketch of both options follows right after this list).
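
For concreteness, here is a sketch of the two options, using the scikit-learn API as it was at the time of writing (ElasticNet's normalize argument has since been deprecated in later releases):

```python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Option 1: let the estimator normalise internally
# (normalize is ignored unless fit_intercept=True).
enet_builtin = ElasticNet(fit_intercept=True, normalize=True)

# Option 2: make the normalisation an explicit pipeline step;
# the ElasticNet keeps its default normalize=False.
pipe = Pipeline([
    ("normalizer", Normalizer()),
    ("enet", ElasticNet(fit_intercept=True)),
])
```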

The two approaches are similar; however, the user community seems to prefer the second option.

This is because, when cross-validation is applied to a pipeline object rather than to a model object, for instance through cross_val_score(pipe, X, y), the feature-space preprocessing becomes part of the full learning process (i.e. it is re-fitted and applied within each CV fold).
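
As a hedged sketch (the synthetic data is only there to make the snippet self-contained):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

pipe = Pipeline([
    ("normalizer", Normalizer()),
    ("enet", ElasticNet()),
])

# cross_val_score clones and fits the whole pipeline on each training fold,
# so the preprocessing never sees the corresponding held-out fold.
scores = cross_val_score(pipe, X, y, cv=5)
```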

Now, suppose that instead of working with the 'naive' elastic net, one were to work with an elastic net whose hyper-parameters are determined by cross-validation (for instance, the ElasticNetCV class).

In that case, option 2 above does not seem to be the right way to go. More specifically, since the normaliser is fitted on the whole training set, the internal cross-validation (which determines the hyper-parameters) works with folds that have been normalised using data from outside each fold, which amounts to data snooping.

In other words, the pipeline approach seems fine for simple cross-validation but could be dangerous for nested cross-validation, since it could produce optimistically biased cross-validation scores.
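
Concretely, the worrying construction would look something like this (a sketch; StandardScaler stands in for any preprocessing step that genuinely learns statistics from the data):

```python
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_cv = Pipeline([
    # Fitted once on the entire training set passed to pipe_cv.fit(X, y)...
    ("scaler", StandardScaler()),
    # ...so the internal CV that selects alpha splits data that has already
    # been scaled using statistics from all of its internal folds.
    ("enet_cv", ElasticNetCV(cv=5)),
])
```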

Can someone confirm this or am I missing something?

Best Answer

First, just a note that ElasticNet's normalize=True actually isn't quite the same as Normalizer: it first centers the data (subtracting each feature's training-set mean), then scales each centered feature column, not each data point, to unit norm.

If you do a pipeline of Normalizer followed by ElasticNet(fit_intercept=True), it will actually normalize the data points to unit norm in the original space, then center the normalized data (which is a little weird).

Since ElasticNet always centers its inputs when you have fit_intercept=True, if you do StandardScaler(with_std=False) (which just centers), Normalizer, and then ElasticNet(fit_intercept=True) you'll actually center, normalize, and then re-center – you end up with slightly different data inside the model, though the overall model should be the same.

If you were only normalizing (replacing each data point $X_i$ with $X_i / \lVert X_i \rVert$), the transformation is independent of the other data, so the CV folds don't matter. Centering, though, is not data-independent.
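
A quick numerical illustration of that distinction (a sketch with synthetic data):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

rng = np.random.RandomState(0)
X_train, X_valid = rng.randn(80, 5), rng.randn(20, 5)
X_full = np.vstack([X_train, X_valid])

# Normalizer rescales each row by its own norm, so fitting on the full data
# or on the training data alone yields identical transforms.
n_train = Normalizer().fit(X_train)
n_full = Normalizer().fit(X_full)
print(np.allclose(n_train.transform(X_valid), n_full.transform(X_valid)))  # True

# Centering depends on which rows the mean was computed from.
c_train = StandardScaler(with_std=False).fit(X_train)
c_full = StandardScaler(with_std=False).fit(X_full)
print(np.allclose(c_train.transform(X_valid), c_full.transform(X_valid)))  # False in general
```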

So, you're correct that centering before ElasticNetCV will center the data based on the whole dataset, and thus technically the elastic net's CV is "cheating." To be totally correct, you should use normalize=True on the ElasticNetCV; if you want to do some other kind of preprocessing, you won't be able to (as far as I know) use ElasticNetCV properly at all. Honestly, the whole CV machinery in scikit-learn is not a great fit for cases that are at all complicated, and I often find myself rolling my own CV loops to handle these issues – but it's hard to do that while still taking advantage of the efficiency gains in ElasticNetCV.
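
For what it's worth, a hand-rolled loop that keeps the preprocessing inside each fold might look roughly like this (a sketch over a plain grid of alpha values, without ElasticNetCV's regularization-path speedups; manual_cv_scores is just an illustrative helper, not a library function):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

def manual_cv_scores(X, y, alphas, n_splits=5):
    """Mean validation R^2 per alpha, fitting the scaler on each training fold only."""
    scores = {alpha: [] for alpha in alphas}
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in kf.split(X):
        scaler = StandardScaler().fit(X[train_idx])   # no held-out rows involved
        X_tr, X_va = scaler.transform(X[train_idx]), scaler.transform(X[valid_idx])
        for alpha in alphas:
            model = ElasticNet(alpha=alpha).fit(X_tr, y[train_idx])
            scores[alpha].append(model.score(X_va, y[valid_idx]))
    return {alpha: np.mean(fold_scores) for alpha, fold_scores in scores.items()}
```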

In practice, as long as your dataset isn't tiny, I wouldn't really worry about the difference. Centering tends to be very stable across CV folds, and it's unlikely that your linear model's performance is going to be sensitive to the very small difference between centering on the full dataset and centering on 9/10ths of it.

The only parameter being estimated is $\hat \mu$; with $k$-fold CV on $n$ data points, the data snooping changes the estimate from $$\hat \mu_\text{train} = \frac{k}{n (k-1)} \sum_{i \notin \text{ fold } k} X_i$$ to \begin{align} \hat \mu_\text{all} &= \frac{1}{n} \sum_{i} X_i \\&= \frac{1}{n} \sum_{i \notin \text{ fold } k} X_i + \frac{1}{n} \sum_{i \in \text{ fold } k} X_i \\&= \frac{k-1}{k} \hat\mu_\text{train} + \frac{1}{k} \hat\mu_\text{validation} .\end{align}

Since $\hat\mu_\text{train}$ and $\hat\mu_\text{validation}$ are going to be extremely similar anyway unless your sample size is small compared to your dimension, $\hat\mu_\text{all}$ is going to be very close to $\hat\mu_\text{train}$, and the difference is not something your model is likely to be able to exploit.
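
That identity is easy to sanity-check numerically (a small sketch):

```python
import numpy as np

rng = np.random.RandomState(0)
n, k = 1000, 10
X = rng.randn(n, 3)
fold = rng.permutation(n) < n // k          # boolean mask for one held-out fold

mu_train = X[~fold].mean(axis=0)
mu_valid = X[fold].mean(axis=0)
mu_all = X.mean(axis=0)

# mu_all equals (k-1)/k * mu_train + 1/k * mu_valid, up to floating-point error.
print(np.allclose(mu_all, (k - 1) / k * mu_train + 1 / k * mu_valid))  # True
```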