Give this a try (modify the details as needed):

library(caret)
library(mlbench)
data(Sonar)

set.seed(1)
## ten training/holdout splits; returnTrain = TRUE returns the training row indices
splits <- createFolds(Sonar$Class, returnTrain = TRUE)

## set up a data frame of holdout rows and observed classes for each split
results <- lapply(splits,
                  function(x, dat) {
                    holdout <- (1:nrow(dat))[-unique(x)]
                    data.frame(index = holdout,
                               obs = dat$Class[holdout])
                  },
                  dat = Sonar)

mods <- vector(mode = "list", length = length(splits))

## foreach or lapply would do this faster
for(i in seq_along(splits)) {
  in_train <- unique(splits[[i]])
  set.seed(2)
  ## preProc is handled inside train(), so centering/scaling is
  ## estimated from the training rows only
  mod <- train(Class ~ ., data = Sonar[in_train, ],
               method = "svmRadial",
               preProc = c("center", "scale"),
               tuneLength = 8)
  results[[i]]$pred <- predict(mod, Sonar[-in_train, ])
  mods[[i]] <- mod
}

lapply(results, defaultSummary)
First, just a note that ElasticNet's normalize=True actually isn't quite the same as Normalizer: it first centers the data (subtracting the mean of the training set), then scales each centered feature (column) to unit norm, whereas Normalizer rescales each individual sample (row) to unit norm.
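If it helps to see the difference concretely, here's a minimal sketch, assuming the center-then-column-scale behavior described above (the normalize parameter was later deprecated and removed from scikit-learn's linear models, so it's reproduced by hand here):

import numpy as np
from sklearn.preprocessing import Normalizer, normalize

rng = np.random.RandomState(0)
X = rng.randn(20, 3)

# What normalize=True did: center each feature, then scale each
# *column* of the centered data to unit l2 norm.
X_param = normalize(X - X.mean(axis=0), axis=0)

# What Normalizer does: scale each *row* to unit l2 norm; no other
# samples are involved, and nothing is centered.
X_rowwise = Normalizer().fit_transform(X)

print(np.allclose(X_param, X_rowwise))  # False: genuinely different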
If you do a pipeline of Normalizer followed by ElasticNet(fit_intercept=True), it will actually normalize the data points to unit norm in the original space, then center the normalized data (which is a little weird).
Since ElasticNet always centers its inputs when you have fit_intercept=True, if you do StandardScaler(with_std=False) (which just centers), Normalizer, and then ElasticNet(fit_intercept=True), you'll actually center, normalize, and then re-center; you end up with slightly different data inside the model, though the overall model should be the same.
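For concreteness, the two orderings being compared look like this as scikit-learn pipelines (a sketch; the step names are arbitrary):

from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# Normalize rows first; ElasticNet then centers internally because
# fit_intercept=True.
pipe_norm = Pipeline([
    ("norm", Normalizer()),
    ("enet", ElasticNet(fit_intercept=True)),
])

# Center, normalize, then re-center inside ElasticNet: the model sees
# slightly different inputs, but the fitted model should come out the same.
pipe_center_norm = Pipeline([
    ("center", StandardScaler(with_std=False)),
    ("norm", Normalizer()),
    ("enet", ElasticNet(fit_intercept=True)),
])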
If you were only normalizing (replacing each data point $X_i$ with $X_i / \lVert X_i \rVert$), the transformation is independent of the other data, so the CV folds don't matter. Centering, though, is not data-independent.
So, you're correct that centering before ElasticNetCV will center the data based on the whole dataset, and thus technically the elastic net's CV is "cheating." To be totally correct, you should use normalize=True on the ElasticNetCV; if you want to do some other kind of preprocessing, you won't be able to (as far as I know) use ElasticNetCV properly at all. Honestly, the whole CV machinery in scikit-learn is not a great fit for cases that are at all complicated, and I often find myself rolling my own CV loops to handle these issues, but it's hard to do that while still taking advantage of the efficiency gains in ElasticNetCV.
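A hand-rolled loop of that kind might look like the sketch below: it gives up ElasticNetCV's trick of reusing one regularization path across alphas, but the preprocessing is fit on each training fold only (the dataset and alpha grid here are placeholders):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, random_state=0)
alphas = np.logspace(-3, 1, 20)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
mse = np.zeros((len(alphas), cv.get_n_splits()))

for j, (tr, va) in enumerate(cv.split(X)):
    # Fit the preprocessing on the training fold only, then apply it to
    # the validation fold, so nothing leaks across the split.
    scaler = StandardScaler().fit(X[tr])
    X_tr, X_va = scaler.transform(X[tr]), scaler.transform(X[va])
    for i, alpha in enumerate(alphas):
        model = ElasticNet(alpha=alpha).fit(X_tr, y[tr])
        mse[i, j] = np.mean((model.predict(X_va) - y[va]) ** 2)

best_alpha = alphas[mse.mean(axis=1).argmin()]

Wrapping the scaler and ElasticNet in a Pipeline and handing that to GridSearchCV accomplishes the same thing with less code, though it likewise refits the whole path from scratch for each alpha.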
In practice, as long as your dataset isn't tiny, I wouldn't worry much about the difference. Centering tends to be very stable across CV folds, and a linear model's performance is unlikely to be sensitive to the very small difference between centering on the full dataset and centering on 9/10ths of it. The only parameter being estimated is $\hat \mu$; with $k$-fold CV on $n$ data points, the data snooping changes the estimate from
$$\hat \mu_\text{train} = \frac{k}{n (k-1)} \sum_{i \notin \text{ fold } k} X_i$$ to
\begin{align}
\hat \mu_\text{all}
&= \frac{1}{n} \sum_{i} X_i
\\&= \frac{1}{n} \sum_{i \notin \text{ fold } k} X_i
+ \frac{1}{n} \sum_{i \in \text{ fold } k} X_i
\\&= \frac{k-1}{k} \hat\mu_\text{train}
+ \frac{1}{k} \hat\mu_\text{validation}
.\end{align}
Since $\hat\mu_\text{train}$ and $\hat\mu_\text{validation}$ are going to be extremely similar anyway unless you have a small sample size compared to your dimension, $\hat\mu_\text{all}$ is going to be very close to $\hat\mu_\text{train}$, and the difference is not going to be something that your model is likely to be able to exploit anyway.
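A quick numeric check of that identity (made-up data; any $n$ divisible by $k$ works):

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 10
X = rng.normal(size=n)
fold = rng.permutation(n) < n // k   # hold out n/k points as "fold k"

mu_train = X[~fold].mean()           # mean over the other k-1 folds
mu_val = X[fold].mean()              # mean over the held-out fold
mu_all = X.mean()

# mu_all = (k-1)/k * mu_train + 1/k * mu_val, exactly
print(np.isclose(mu_all, (k - 1) / k * mu_train + 1 / k * mu_val))  # True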
Best Answer
It is generally better practice to use cross-validation (e.g. 10-fold CV) than just a single random split of your data. It would be even better if you could use CV and then test your model's performance on a completely independent validation set. You have enough instances to do the latter.
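A minimal sketch of that workflow (the classifier and data are placeholders):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out an independent test set first; never touch it during CV.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 10-fold CV on the development set for model assessment/selection.
model = RandomForestClassifier(random_state=0)
print(cross_val_score(model, X_dev, y_dev, cv=10).mean())

# Only once choices are final: a single evaluation on the held-out set.
print(model.fit(X_dev, y_dev).score(X_test, y_test))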
Hope this helps.