Use a single scaler, fitted on the training set. It is best to pretend that you are in production and do not actually have the test dataset: if you fit a separate scaler on the test set, you are using information you would not have.
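As a minimal sketch of that pattern (synthetic data, illustrative names): fit the scaler once on the training set, then reuse the same fitted object on both sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# illustrative data; any train/test split works the same way
rng = np.random.RandomState(0)
x_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
x_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

scaler = StandardScaler().fit(x_train)   # fit on the training set only
x_train_s = scaler.transform(x_train)    # then transform both sets
x_test_s = scaler.transform(x_test)      # with the same fitted scaler

# the training columns are exactly standardised; the test columns only
# approximately so, because they were scaled with the training statistics
print(np.allclose(x_train_s.mean(axis=0), 0.0))
```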
My guess is that you have accidentally transformed y_train somewhere in the code you have not posted. This is because the following reproducible snippet works:
import math
import numpy as np
import pandas as pd
from sklearn import preprocessing

# UCI Airfoil Self-Noise dataset (adjust the path to your local copy)
dat = pd.read_csv("/home/steffen/workspaces/airfoil/airfoil_self_noise.dat",
                  sep="\t", low_memory=False, header=None)
apply_scaler = True

# split into train 2/3 and test 1/3
rng = np.random.RandomState(42)
n_rows = dat.shape[0]
n_train = math.floor(0.66 * n_rows)
permutated_indices = rng.permutation(n_rows)
train_dat = dat.loc[permutated_indices[:n_train], :]
test_dat = dat.loc[permutated_indices[n_train:], :]

# separate the response variable (last column) from the predictor variables
x_train = train_dat.iloc[:, 1:-1]
y_train = train_dat.iloc[:, -1].to_numpy()[:, np.newaxis]
x_test = test_dat.iloc[:, 1:-1]
y_test = test_dat.iloc[:, -1].to_numpy()[:, np.newaxis]

# train
# fit the scaler to the training predictors only and apply it afterwards
scaler = preprocessing.StandardScaler().fit(x_train)
if apply_scaler:
    x_train = pd.DataFrame(scaler.transform(x_train))
# add a constant one for the intercept parameter
x_train = pd.concat([pd.DataFrame(np.ones(shape=(x_train.shape[0], 1)),
                                  index=x_train.index), x_train], axis=1).to_numpy()

# fit parameters of linear regression using batch gradient descent
# Hands-On Machine Learning with Scikit-Learn & Tensorflow, page 115
eta = 0.1  # learning rate
n_iterations = 1000
m = x_train.shape[0]
theta = rng.randn(x_train.shape[1], 1)
for iteration in range(n_iterations):
    gradients = (2 / m) * x_train.T.dot(x_train.dot(theta) - y_train)
    theta = theta - eta * gradients

# to apply the fitted parameters, first transform the test data in the same way
# apply the scaler fitted on the training set
if apply_scaler:
    x_test = pd.DataFrame(scaler.transform(x_test))
# add a constant one for the intercept parameter
x_test = pd.concat([pd.DataFrame(np.ones(shape=(x_test.shape[0], 1)),
                                 index=x_test.index), x_test], axis=1).to_numpy()

# apply the fitted parameters
y_predict = x_test.dot(theta)

# compare output
out = np.column_stack((y_test, y_predict))
print(pd.DataFrame(out).head())

# root mean squared error
print("error %f" % np.sqrt(np.power(y_test - y_predict, 2).mean()))
This produces the following output:
0 1
0 120.573 127.108268
1 127.220 123.492931
2 113.045 122.393120
3 119.606 122.570836
4 131.971 127.270743
error 6.175637
which is fine.
It is interesting to see that, for learning rate 0.1, this simple batch gradient descent implementation fails to converge if no normalization is performed (apply_scaler=False, eta=0.1), while scikit-learn's LinearRegression implementation still finds a solution. Reducing the learning rate dramatically (eta=0.0001) restores convergence.
This is one example where gradient descent is limited, as discussed here: Do we need gradient descent to find the coefficients of a linear regression model.
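The scale sensitivity can be reproduced on synthetic data (not the airfoil set; the sizes and scales below are illustrative): with one feature a thousand times larger than the others, the same update rule explodes at eta=0.1 but stays stable with a much smaller step size.

```python
import numpy as np

np.seterr(all="ignore")  # the divergent run below overflows deliberately
rng = np.random.RandomState(42)

# two features on very different scales, mimicking unscaled data
X = np.column_stack([np.ones(200),
                     rng.uniform(0, 1, 200),
                     rng.uniform(0, 1000, 200)])
y = X.dot(np.array([[1.0], [2.0], [0.003]])) + rng.normal(scale=0.1, size=(200, 1))

def batch_gd(X, y, eta, n_iterations=1000):
    """Plain batch gradient descent, same update rule as above."""
    m = X.shape[0]
    theta = rng.randn(X.shape[1], 1)
    for _ in range(n_iterations):
        theta = theta - eta * (2 / m) * X.T.dot(X.dot(theta) - y)
    return theta

# the large-scale feature makes the eta=0.1 updates explode to non-finite values
print(np.isfinite(batch_gd(X, y, eta=0.1)).all())
# a drastically smaller step size keeps the iteration stable
print(np.isfinite(batch_gd(X, y, eta=1e-6)).all())
```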
Best Answer
If you're not trying to generalise on new data, then you don't need to.
If you are trying to generalise to new data, and if your algorithm has no hyper-parameters (i.e. settings you can tweak), then you don't need to.
If you are trying to generalise to new data, and (as is usual), you have hyper-parameters to tune, then you need to.
For example, if you were using regularised linear regression (a.k.a. "ridge" regression), then you would need to have some way of choosing the regularisation parameter, such that it will be valid when testing on new data, rather than just fitting the "training" data perfectly.
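A short sketch of that workflow, using scikit-learn's RidgeCV on synthetic data (the candidate alphas and data sizes here are illustrative): the regularisation strength is chosen by cross-validation within the training set, and the test set is touched only once, for the final evaluation.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# synthetic regression data with a known linear signal
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 10))
y = X.dot(rng.normal(size=10)) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha is selected by cross-validation on the training data only
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_train, y_train)
print(model.alpha_)                    # the chosen regularisation strength
print(model.score(X_test, y_test))     # held-out R^2, computed exactly once
```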