Machine Learning – How to Create a Holdout Set Just for Feature Engineering

feature-engineering, machine learning, out-of-sample

I recently encountered a feature engineering technique that I haven't seen before:

  1. Create the usual training, validation, and test sets.
  2. Create another set by splitting the train set; call this the "feature engineering" set.
  3. Use the feature engineering set to fit any pre-processing that needs to be fitted, e.g. target encoding or median imputation.
  4. Apply the transformations that were obtained in (3) to the main training set, then proceed with model building.
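The steps above can be sketched as follows. This is a minimal illustration using pandas and scikit-learn's `train_test_split`; the dataset, column names, and the hand-rolled target-encoding/imputation helpers are hypothetical, not taken from the tutorial in question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: one categorical feature, one numeric feature, a binary target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice(["a", "b", "c"], size=1000),
    "x": rng.normal(size=1000),
})
df["y"] = ((df["city"] == "a").astype(int) ^ (df["x"] > 0).astype(int))

# 1. The usual train / validation / test splits.
train, test = train_test_split(df, test_size=0.2, random_state=0)
train, valid = train_test_split(train, test_size=0.25, random_state=0)

# 2. Carve a "feature engineering" set out of the training set.
train, fe = train_test_split(train, test_size=0.25, random_state=0)

# 3. Fit the pre-processing on the feature-engineering set only:
#    per-category target means, plus a median for imputation.
target_means = fe.groupby("city")["y"].mean()
global_mean = fe["y"].mean()   # fallback for categories unseen in fe
x_median = fe["x"].median()

# 4. Apply the fitted transformations to the main training set
#    (and later, identically, to validation and test).
def transform(frame):
    out = frame.copy()
    out["city_te"] = out["city"].map(target_means).fillna(global_mean)
    out["x"] = out["x"].fillna(x_median)
    return out

train_t = transform(train)
valid_t = transform(valid)
```

The point of step 3 is that the model-fitting data in `train` never contributes to the statistics (`target_means`, `x_median`) that are applied to it.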

Is there any additional benefit to the extra feature engineering split in this procedure?

My guess was that it should give more "realistic" inputs for fitting the main model, by avoiding overfitted "in-sample" outputs from the feature engineering stages.

Is this common practice? Has it been shown to produce better results?

The example I saw was in a target encoding tutorial (cells 4-5). Did I misunderstand the example?

Best Answer

I haven't seen such a setup before, but it might make sense in some scenarios. Notice that when you make decisions based on exploratory data analysis, or adjust your feature engineering pipeline, you are in effect manually doing the same kind of job a machine learning algorithm would do. For example, you could manually extract features from textual data for a standard machine learning algorithm, or use a deep learning algorithm that learns the features by itself from the "raw" data. If that is the case, even such seemingly innocuous tasks as exploratory data analysis or feature engineering can "overfit" the data they were performed on, so you should validate them on external data. You can also check the talk by Cassie Kozyrkov from Google, who makes similar remarks. So if there is a risk that decisions made during feature engineering would affect downstream tasks, it might be wise to validate them independently.

Also, keep in mind that with target encoding you are leaking information about the labels into the features, so it is a rather specific scenario where extra precautions may be needed. With most trivial feature engineering (taking the logarithm of a feature, one-hot encoding a column, multiplying two columns, etc.) I can't see how this would be useful.
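To see why target encoding specifically calls for precautions, here is a small sketch: we target-encode a high-cardinality categorical that carries no real signal, once in-sample and once on a held-out feature-engineering split. The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "cat": rng.integers(0, 500, size=n),  # high-cardinality, pure noise
    "y": rng.integers(0, 2, size=n),      # labels independent of cat
})

train, fe = df.iloc[: n // 2], df.iloc[n // 2 :]

# In-sample: fit the encoding on train and apply it to train itself.
# Each category mean contains the very labels we later predict.
in_sample = train["cat"].map(train.groupby("cat")["y"].mean())

# Out-of-sample: fit the encoding on the feature-engineering split.
out_sample = (
    train["cat"].map(fe.groupby("cat")["y"].mean()).fillna(fe["y"].mean())
)

# Correlation with the training labels reveals the leak: the in-sample
# encoding looks predictive despite the feature being pure noise, while
# the out-of-sample encoding does not.
in_corr = np.corrcoef(in_sample, train["y"])[0, 1]
out_corr = np.corrcoef(out_sample, train["y"])[0, 1]
print(in_corr, out_corr)
```

A model trained on the in-sample encoding would latch onto this spurious feature; the holdout encoding removes that avenue.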
