Solved – all cross-validation results are higher than the result on the test dataset

Tags: boosting, cart, cross-validation

Sorry, I'm not an expert and my question could be fundamentally wrong.
I've read this interesting question because I was also wondering whether to retrain the model after cross-validation.

Now I'm boosting regression trees with this library and running a 5-fold CV against a particular objective function, so each fold is trained on 4/5 of the training set.

After re-training on the whole training dataset, I use that final model to predict the binary class labels of the test dataset. I'm a bit surprised to see that the evaluation metric on the test set is significantly worse than the validation results led me to expect.
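For concreteness, the workflow is roughly the following; in this simplified sketch, xgboost's Python API and AUC stand in for the actual library and objective function, and synthetic data replaces the real datasets.

```python
# Simplified sketch of the described workflow (xgboost and AUC are only
# stand-ins for the actual library and objective function).
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4}

# 5-fold CV: each fold is trained on 4/5 of the training set
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                    metrics="auc", seed=42)
print("CV AUC:", cv_results["test-auc-mean"].iloc[-1])

# Re-train on the whole training set, then score the untouched test set
final_model = xgb.train(params, dtrain, num_boost_round=200)
print("Test AUC:", roc_auc_score(y_test, final_model.predict(dtest)))
```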

Do you have any suggestions about what could be going wrong? My second (maybe related) question is: why does increasing the number of boosting rounds result in worse predictions?

Best Answer

My second (maybe related) question is: why does increasing the number of boosting rounds result in worse predictions?

Yes, this is most probably related: boosting re-weights models based on their performance, so any performance measurement done during the weighting process is part of the model training and is not independent of the training data.

That is, I assume the cross validation you refer to is part of the boosting; if you are talking about cross validation of 5 completely trained boosted models, we'll have to dig deeper for the reason.


In general, if cross validation is consistently and significantly too optimistic (do you have enough test cases to distinguish cross validation and test results in a statistically sound fashion?), this is a sign that there is some problem in the cross validation procedure. Probably the most typical problem is a data leak between training and test cases.
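As a quick check on that parenthetical point, assuming a simple binary metric such as accuracy: a binomial confidence interval for the test-set result shows how much of the gap plain sampling noise in a test set of that size could explain. The counts below are illustrative placeholders.

```python
# Is the CV-vs-test gap larger than random variation in a test set of this
# size could produce? Assumes a binary metric (accuracy); the counts are
# hypothetical placeholders, not the asker's numbers.
from statsmodels.stats.proportion import proportion_confint

n_test = 200      # number of test cases (hypothetical)
n_correct = 150   # correct predictions on the test set (hypothetical)
low, high = proportion_confint(n_correct, n_test, alpha=0.05, method="wilson")
print(f"test accuracy 95% CI: [{low:.3f}, {high:.3f}]")
# If the cross-validation estimate falls inside this interval, the gap may
# be test-set sampling noise rather than a problem with the cross validation.
```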

Such a leak can happen, e.g., if the data is clustered/hierarchical: something like (almost) repeated measurements, or any other confounding factor that links some cases more closely together than others (for my data, typically many measurements of one patient, measurements of solutions from the same stock solution, or measurements taken on the same day), while the test data consist of new clusters.

One way of dealing with that is to make sure the splitting for the cross validation happens at the highest level of this data hierarchy. Many off-the-shelf classifiers do not offer this. In that case, it may be better to stay away from aggressively optimizing methods (such as boosting), as they tend to overfit badly. A symptom of that would be that the internal performance estimate of the boosting algorithm is markedly overoptimistic even compared with the cross validation.
The other option would be to build your own random forest / boosting from CART plus a resampling procedure that obeys the data structure.
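If your data carries an explicit grouping variable (say, a patient ID), here is a minimal sketch of such hierarchy-respecting resampling, assuming scikit-learn's GroupKFold; the synthetic data and the generic gradient boosting classifier are only placeholders.

```python
# Sketch of cross validation that splits at the highest level of the data
# hierarchy, assuming scikit-learn; replace `groups` with whatever defines
# your clusters (patient, stock solution, measurement day, ...).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))            # placeholder features
y = rng.integers(0, 2, size=300)         # placeholder binary labels
groups = np.repeat(np.arange(30), 10)    # e.g. 30 patients, 10 measurements each

cv = GroupKFold(n_splits=5)              # no group appears in both train and validation
for train_idx, val_idx in cv.split(X, y, groups):
    model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[val_idx], model.predict_proba(X[val_idx])[:, 1])
    print(f"fold AUC: {auc:.3f}")
```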


Update 2: what to do with clustered data?

So far, I have worked with data where knowledge about the data generating process and the application makes it possible to identify important potential causes of a clustered data structure (e.g. patient data may be subject to between-patient variance) - you can deal with this type of clustering as described above.

In addition, you may try a cluster analysis to see whether there are groups within the data. Such a finding may influence the choice of model. However, unless you can trace the clusters to some "cause" (e.g. the data turns out to be grouped by day, individual, or the like), I'm not sure how to deal with that in the validation: splitting by groups found within the data is anything but independent. Yet it may be worthwhile to check how (and for which) out-of-training groups of the data the predictive ability deteriorates.
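A rough diagnostic sketch of that last check, assuming scikit-learn: clusters found by KMeans stand in for whatever grouping the cluster analysis reveals, and since the groups come from the data itself, this is a diagnostic rather than a proper validation.

```python
# Diagnostic only: hold out whole data-derived clusters and watch how the
# predictive ability changes for out-of-training groups. Assumes scikit-learn;
# the synthetic data and KMeans clusters are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6))
y = rng.integers(0, 2, size=400)

# Step 1: look for groups within the data
found_groups = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(X)

# Step 2: hold out one found group at a time and score it
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, found_groups):
    model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
    acc = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    print(f"held-out cluster {found_groups[test_idx][0]}: accuracy {acc:.3f}")
```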