Give this a try (modify the details as needed):

library(caret)
library(mlbench)
data(Sonar)

set.seed(1)
## ten training/holdout splits; returnTrain = TRUE returns the training row indices
splits <- createFolds(Sonar$Class, returnTrain = TRUE)

## set up a data frame of holdout rows and observed classes for each split
results <- lapply(splits,
                  function(x, dat) {
                    holdout <- (1:nrow(dat))[-unique(x)]
                    data.frame(index = holdout,
                               obs = dat$Class[holdout])
                  },
                  dat = Sonar)

mods <- vector(mode = "list", length = length(splits))

## foreach or lapply would do this faster
for(i in seq_along(splits)) {
  in_train <- unique(splits[[i]])
  set.seed(2)
  ## preProc is handled inside train(), so centering/scaling is
  ## estimated from the training rows only
  mod <- train(Class ~ ., data = Sonar[in_train, ],
               method = "svmRadial",
               preProc = c("center", "scale"),
               tuneLength = 8)
  results[[i]]$pred <- predict(mod, Sonar[-in_train, ])
  mods[[i]] <- mod
}

lapply(results, defaultSummary)
First, just a note that ElasticNet's normalize=True actually isn't quite the same as Normalizer: it first centers the data (subtracting the mean of the training set), then scales each centered feature (column) to unit norm, whereas Normalizer rescales each individual sample (row) to unit norm.
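If it helps to see the difference concretely, here's a minimal sketch, assuming the center-then-column-scale behavior described above (the normalize parameter was later deprecated and removed from scikit-learn's linear models, so it's reproduced by hand here):

import numpy as np
from sklearn.preprocessing import Normalizer, normalize

rng = np.random.RandomState(0)
X = rng.randn(20, 3)

# What normalize=True did: center each feature, then scale each
# *column* of the centered data to unit l2 norm.
X_param = normalize(X - X.mean(axis=0), axis=0)

# What Normalizer does: scale each *row* to unit l2 norm; no other
# samples are involved, and nothing is centered.
X_rowwise = Normalizer().fit_transform(X)

print(np.allclose(X_param, X_rowwise))  # False: genuinely different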
If you do a pipeline of Normalizer followed by ElasticNet(fit_intercept=True), it will actually normalize the data points to unit norm in the original space, then center the normalized data (which is a little weird).
Since ElasticNet always centers its inputs when you have fit_intercept=True, if you do StandardScaler(with_std=False) (which just centers), Normalizer, and then ElasticNet(fit_intercept=True), you'll actually center, normalize, and then re-center; you end up with slightly different data inside the model, though the overall model should be the same.
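For concreteness, the two orderings being compared look like this as scikit-learn pipelines (a sketch; the step names are arbitrary):

from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# Normalize rows first; ElasticNet then centers internally because
# fit_intercept=True.
pipe_norm = Pipeline([
    ("norm", Normalizer()),
    ("enet", ElasticNet(fit_intercept=True)),
])

# Center, normalize, then re-center inside ElasticNet: the model sees
# slightly different inputs, but the fitted model should come out the same.
pipe_center_norm = Pipeline([
    ("center", StandardScaler(with_std=False)),
    ("norm", Normalizer()),
    ("enet", ElasticNet(fit_intercept=True)),
])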
If you were only normalizing (replacing each data point $X_i$ with $X_i / \lVert X_i \rVert$), the transformation is independent of the other data, so the CV folds don't matter. Centering, though, is not data-independent.
So, you're correct that centering before ElasticNetCV will center the data based on the whole dataset, and thus technically the elastic net's CV is "cheating." To be totally correct, you should use normalize=True on the ElasticNetCV; if you want to do some other kind of preprocessing, you won't be able to (as far as I know) use ElasticNetCV properly at all. Honestly, the whole CV machinery in scikit-learn is not a great fit for cases that are at all complicated, and I often find myself rolling my own CV loops to handle these issues, but it's hard to do that while still taking advantage of the efficiency gains in ElasticNetCV.
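A hand-rolled loop of that kind might look like the sketch below: it gives up ElasticNetCV's trick of reusing one regularization path across alphas, but the preprocessing is fit on each training fold only (the dataset and alpha grid here are placeholders):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, random_state=0)
alphas = np.logspace(-3, 1, 20)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
mse = np.zeros((len(alphas), cv.get_n_splits()))

for j, (tr, va) in enumerate(cv.split(X)):
    # Fit the preprocessing on the training fold only, then apply it to
    # the validation fold, so nothing leaks across the split.
    scaler = StandardScaler().fit(X[tr])
    X_tr, X_va = scaler.transform(X[tr]), scaler.transform(X[va])
    for i, alpha in enumerate(alphas):
        model = ElasticNet(alpha=alpha).fit(X_tr, y[tr])
        mse[i, j] = np.mean((model.predict(X_va) - y[va]) ** 2)

best_alpha = alphas[mse.mean(axis=1).argmin()]

Wrapping the scaler and ElasticNet in a Pipeline and handing that to GridSearchCV accomplishes the same thing with less code, though it likewise refits the whole path from scratch for each alpha.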
In practice, as long as your dataset isn't tiny, I wouldn't worry much about the difference. Centering tends to be very stable across CV folds, and a linear model's performance is unlikely to be sensitive to the very small difference between centering on the full dataset and centering on 9/10ths of it. The only parameter being estimated is $\hat \mu$; with $k$-fold CV on $n$ data points, the data snooping changes the estimate from
$$\hat \mu_\text{train} = \frac{k}{n (k-1)} \sum_{i \notin \text{ fold } k} X_i$$ to
\begin{align}
\hat \mu_\text{all}
&= \frac{1}{n} \sum_{i} X_i
\\&= \frac{1}{n} \sum_{i \notin \text{ fold } k} X_i
+ \frac{1}{n} \sum_{i \in \text{ fold } k} X_i
\\&= \frac{k-1}{k} \hat\mu_\text{train}
+ \frac{1}{k} \hat\mu_\text{validation}
.\end{align}
Since $\hat\mu_\text{train}$ and $\hat\mu_\text{validation}$ are going to be extremely similar anyway unless you have a small sample size compared to your dimension, $\hat\mu_\text{all}$ is going to be very close to $\hat\mu_\text{train}$, and the difference is not going to be something that your model is likely to be able to exploit anyway.
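A quick numeric check of that identity (made-up data; any $n$ divisible by $k$ works):

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 10
X = rng.normal(size=n)
fold = rng.permutation(n) < n // k   # hold out n/k points as "fold k"

mu_train = X[~fold].mean()           # mean over the other k-1 folds
mu_val = X[fold].mean()              # mean over the held-out fold
mu_all = X.mean()

# mu_all = (k-1)/k * mu_train + 1/k * mu_val, exactly
print(np.isclose(mu_all, (k - 1) / k * mu_train + 1 / k * mu_val))  # True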
Best Answer
It is generally better practice to use cross-validation (e.g. 10-fold CV) than just a single random split of your data. It would be even better if you could use CV and then test your model's performance on a completely independent validation set. You have enough instances to do the latter.
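A minimal sketch of that workflow (the classifier and data are placeholders):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out an independent test set first; never touch it during CV.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 10-fold CV on the development set for model assessment/selection.
model = RandomForestClassifier(random_state=0)
print(cross_val_score(model, X_dev, y_dev, cv=10).mean())

# Only once choices are final: a single evaluation on the held-out set.
print(model.fit(X_dev, y_dev).score(X_test, y_test))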
Hope this helps.