Solved – Sklearn: Should I create a MinMaxScaler for the target and one for the input

data transformation, machine learning, normalization, scales, scikit-learn

Suppose I have some input data X with shape [n,m].
I also have some target data y with shape [s,p].

I want to train a model on some training data and then compare the results on the test data, as usual. I came across the idea of normalizing the data. However, I don't understand (although the documentation is pretty good here) whether I should use one min-max scaler for the input and one for the output, or just one. Also, I should fit it on the training set only, right? Not on training + test?

SCALER FOR X_TRAIN, SCALER FOR Y_TRAIN

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# divide into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,y, shuffle = False, test_size = 0.33)
# create scalers
scalerX = MinMaxScaler(feature_range = (0,1))
scalery = MinMaxScaler(feature_range = (0,1))
# fit on the training set and transform it
X_train = scalerX.fit_transform(X_train)
y_train = scalery.fit_transform(y_train)
# transform the test set with the scalers fitted on the training set
X_test = scalerX.transform(X_test)
y_test = scalery.transform(y_test)
# then once I will have the predictions
y_pred = scalery.inverse_transform(y_pred)

SCALER FOR X , SCALER FOR Y

scalerX = MinMaxScaler(feature_range = (0,1))
scalery = MinMaxScaler(feature_range = (0,1))
X = scalerX.fit_transform(X)
y = scalery.fit_transform(y)
# training and testing
X_train, X_test, y_train, y_test = train_test_split(X,y, shuffle = False, test_size = 0.33)
# then once I will have the predictions
y_pred = scalery.inverse_transform(y_pred)

ONE SCALER FOR XY MATRIX

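import numpy as np
Xy = np.hstack((X, y))
scaler = MinMaxScaler(feature_range = (0,1))
Xy = scaler.fit_transform(Xy)
# then separate training and test (a single array yields two pieces, not four)
Xy_train, Xy_test = train_test_split(Xy, shuffle = False, test_size = 0.33)
# split the columns back into inputs (first m columns) and targets
m = X.shape[1]
X_train, y_train = Xy_train[:, :m], Xy_train[:, m:]
X_test, y_test = Xy_test[:, :m], Xy_test[:, m:]
# once I will have the predictions
# (the scaler was fit on m + p columns, so y_pred must be padded before inverting)
y_pred = scaler.inverse_transform(np.hstack((X_test, y_pred)))[:, m:]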

Which one of these three options is the correct one? Or are they all wrong?

Best Answer

Use a single scaler, fit on the train set only, and reuse that fitted scaler to transform the test set. It's best to pretend that you are in production and don't actually have the test dataset. If you fit a separate scaler on data that includes the test set, you are using information you shouldn't have.
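
A minimal sketch of that advice, reading it as "never fit anything on the test split", which corresponds to the first option in the question. The synthetic data and the LinearRegression model are placeholders chosen only so the snippet runs; they are not part of the question or the answer. Whether the target needs scaling at all depends on the model, so the y scaler is kept here only to mirror the question's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, purely illustrative: n=100 samples, m=3 inputs, p=2 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(100, 2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=False, test_size=0.33
)

# Fit each scaler on the training split only.
scaler_X = MinMaxScaler(feature_range=(0, 1))
scaler_y = MinMaxScaler(feature_range=(0, 1))
X_train_s = scaler_X.fit_transform(X_train)
y_train_s = scaler_y.fit_transform(y_train)

# Reuse the already-fitted input scaler on the test split (transform, no fit).
X_test_s = scaler_X.transform(X_test)

# Placeholder model, trained entirely in scaled units.
model = LinearRegression().fit(X_train_s, y_train_s)

# Predictions come out in scaled units; map them back to the original
# units before comparing them with the untouched y_test.
y_pred = scaler_y.inverse_transform(model.predict(X_test_s))
```

Because the minimum and maximum come from the training split alone, scaled test values can fall slightly outside [0, 1]. That is expected, and it is the same situation the model would face with unseen data in production.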
