Feature Engineering – How to One Hot Encode Before or After Train/Test Split in R and Python

data-leakage, feature-engineering, machine-learning, python, r

I've seen quite a lot of conflicting views on whether one-hot encoding (dummy variable creation) should be done before or after the training/test split.

Some responses state that one-hot encoding before the split leads to "data leakage".

This example states that it's the industry norm to one-hot encode the entire dataset before the training/test split:

Industry Example

This example from Kaggle states that it should be done after the training/test split to avoid data leakage:

kaggle response – after split

My questions are the following:

  1. Do we perform one-hot encoding before or after the Train/Test Split?
  2. Where is the data leakage occurring in the following example?

If we take the following example, we have two columns, web_views and website (a non-ordinal categorical feature), assuming we are one-hot encoding the entire column and not dropping any dummies.

Our dataframe:

import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

df = pd.DataFrame({'web_views': [100,200,300,400], 
                  'website': ['Youtube','Facebook','Instagram', 'Google']})

Scenario 1: One-Hot Encoding/Dummy Variables before splitting into Train/Test:

np.random.seed(123)


df_before_split = pd.concat([df.drop('website', axis = 1), pd.get_dummies(df['website'])], axis=1) 

# create your X and y dataframes
X_before_split = df_before_split.drop('web_views', axis = 1)
y_before_split = df_before_split['web_views']

# perform train test split
X_train_before_split, X_test_before_split, y_train_before_split, y_test_before_split = train_test_split(X_before_split, y_before_split, test_size = 0.20)

Now viewing the dataframes we have:

# view X train dataset (this is encoding before split)
X_train_before_split

and then for test

# View X test dataset (this is encoding before split)
X_test_before_split

Scenario 2: One-Hot Encoding/Dummy Variables AFTER splitting into Train/Test:

# Perform One Hot encoding after the train/test split instead

X = df.drop('web_views', axis = 1)
y = df['web_views']

# perform data split: 
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

# perform one hot encoding on the train and test datasets: 
X_train = pd.concat([X_train.drop('website', axis = 1), pd.get_dummies(X_train['website'])], axis=1)
X_test = pd.concat([X_test.drop('website', axis = 1), pd.get_dummies(X_test['website'])], axis=1)

Viewing the X_train and X_test dataframes:

# encode after train/test split - train dataframe
X_train

# encode after train/test split - test dataframe
X_test
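As a quick check (illustrative, not in the original code, using the frames just created in Scenario 2), comparing the column sets shows why the two frames are no longer compatible: with four rows and a 20% test size, the test split holds a single website, so pd.get_dummies produces one dummy column there versus three in the training frame.

# Compare the dummy columns produced in Scenario 2
print(sorted(X_train.columns))  # three website dummies in the training frame
print(sorted(X_test.columns))   # only the dummy for the single website in the test row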

Performing Linear Regression Modelling

Now that we have split our data, to demonstrate the issue we will fit a simple linear model:

from sklearn.linear_model import LinearRegression

Before split linear model

regressor_before_split = LinearRegression()  
regressor_before_split.fit(X_train_before_split, y_train_before_split)

y_pred_before_split = regressor_before_split.predict(X_test_before_split)
y_pred_before_split

y_pred_before_split returns a predicted value, as we would expect.

After split linear model

regressor_after_split = LinearRegression()  
regressor_after_split.fit(X_train, y_train)

y_pred_after_split = regressor_after_split.predict(X_test)
y_pred_after_split

Error message from Scenario 2:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-92-c63978a198c8> in <module>()
      2 regressor_after_split.fit(X_train, y_train)
      3 
----> 4 y_pred_after_split = regressor_after_split.predict(X_test)
      5 y_pred_after_split

C:\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
    254             Returns predicted values.
    255         """
--> 256         return self._decision_function(X)
    257 
    258     _preprocess_data = staticmethod(_preprocess_data)

C:\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
    239         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
    240         return safe_sparse_dot(X, self.coef_.T,
--> 241                                dense_output=True) + self.intercept_
    242 
    243     def predict(self, X):

C:\Anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    138         return ret
    139     else:
--> 140         return np.dot(a, b)
    141 
    142 

<__array_function__ internals> in dot(*args, **kwargs)

ValueError: shapes (1,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)

My thoughts:

  1. Encoding with dummies before splitting ensures that the test data we pass in (e.g. X_test) to make predictions has the same shape as the training data the model was trained on, so the model knows how to predict values when it encounters these features. With encoding after splitting, by contrast, X_test has only one feature to predict with whereas X_train has 3 features (see the sketch after these points).
  2. Or maybe I've introduced data leakage?
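One pandas-only workaround I can think of (a sketch, assuming we stay with pd.get_dummies from Scenario 2) is to reindex the test dummies to the training columns:

# Align the test dummy columns to the training dummy columns so both frames
# have the same shape; dummies missing from the test split are filled with 0.
# Note: a category that appears only in the test set is dropped entirely here.
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)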

I'd be happy for someone to correct me if I've got things wrong or misinterpreted anything, but I'm stuck scratching my head over whether to encode before or after splitting!

Update as of 04.01.2023

You can perform categorical encoding of the feature column after you split the data into training and test sets. This can help because the test data is supposed to be "unseen"; if a previously unseen category then turns up, you can iterate on the model and add it as a new category for that particular feature.

In order to do this in Python with scikit-learn, the OneHotEncoder() has a parameter called handle_unknown; when it is set to 'ignore' it will do the following:

'ignore': When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

The default is 'error', which is why an error is raised when a new category is encountered during transform.

More information can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
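As a minimal sketch of this behaviour (the categories here are just for illustration), fitting the encoder on the training categories and then transforming an unseen one yields an all-zero row:

from sklearn.preprocessing import OneHotEncoder

# Fit only on the categories present in the training data
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['Youtube'], ['Facebook'], ['Instagram']])

# 'Google' was never seen during fit, so its encoded row is all zeros
print(enc.transform([['Google']]).toarray())   # -> [[0. 0. 0.]]
print(enc.transform([['Youtube']]).toarray())  # -> [[0. 0. 1.]]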

Also, to add: when you do perform one-hot encoding on the training/test split of the data, remember to:

  1. Apply both fit and transform (or fit_transform) to the training data
  2. Apply only transform to the test data, since the encoder has already been fitted on the training data (see the sketch below)
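Putting the two points together for the example above (a sketch that redoes the Scenario 2 split so the raw website column is still available; X and y are reused from Scenario 2), both encoded matrices end up with the same number of columns:

from sklearn.preprocessing import OneHotEncoder

# Redo the Scenario 2 split so that the raw 'website' column is still present
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.20)

enc = OneHotEncoder(handle_unknown='ignore')

# 1. fit + transform on the training data only
X_train_enc = enc.fit_transform(X_train_raw[['website']]).toarray()

# 2. transform (no fit) on the test data; an unseen website becomes all zeros
X_test_enc = enc.transform(X_test_raw[['website']]).toarray()

print(X_train_enc.shape, X_test_enc.shape)  # same number of columns in both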

Best Answer

I can see no reason that one-hot encoding (dummy encoding) could lead to data leakage. It is really a simple data transformation, fully determined by the predictor variables, with no need for estimation or training. It is comparable to calculating the logarithm of a positive variable: no training needed. There are no choices involved and no use of the $Y$ variable. So doing this before or after the train/test split should give exactly the same result.

But if the transformations are more involved, for example target encoding (see Strange encoding for categorical features), which uses the response variable $Y$, then doing it before the split would use the response variable from the test set as well, which is data leakage.

So maybe what to look at is whether your encoding uses the response variable. Such encodings may be more common in machine learning than in traditional statistics, which could explain the strange advice you refer to.
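As a rough sketch of where that leakage would come from with a target-based encoding (illustrative code only, with repeated categories so the group means are meaningful): if the per-category mean of $Y$ is computed on the full data before the split, the encoded training rows already contain information from the test responses.

import pandas as pd

df = pd.DataFrame({'web_views': [100, 200, 300, 400],
                   'website':   ['Youtube', 'Facebook', 'Youtube', 'Facebook']})
train, test = df.iloc[:2], df.iloc[2:]

# Leaky: category means computed on the full data also use the test responses
leaky_means = df.groupby('website')['web_views'].mean()

# Leak-free: category means estimated from the training rows only
safe_means = train.groupby('website')['web_views'].mean()

print(leaky_means['Youtube'], safe_means['Youtube'])  # 200.0 vs 100.0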
