Solved – feature scaling giving reduced output (linear regression using gradient descent)

gradient-descent, machine-learning, python, regression, scikit-learn

I am implementing linear regression using the gradient descent algorithm in Python. Both the closed-form solution and gradient descent (without feature scaling) were giving satisfactory results. However, the moment I started using feature scaling (the StandardScaler class in sklearn's preprocessing module), things started to look confusing.

I am following "Hands-On Machine Learning with Scikit-Learn & TensorFlow" by Aurélien Géron as well as the tutorial available at http://scikit-learn.org/stable/modules/preprocessing.html

In the above references, it is clearly stated that

  1. feature scaling is done when some of the features in the dataset have much larger values than others
  2. the scaler is fitted on the training data (x_train) and then applied to the test data (x_test) as well, so that the test data is scaled the same way as the training data
  3. nowhere is it mentioned that the outputs need to be scaled as well; that is why I have left y_train and y_test unchanged (see the small sketch after this list)
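
A minimal sketch of this pattern, on toy numbers only (the variable names here are placeholders, not the ones from my actual code below):

    import numpy as np
    from sklearn import preprocessing

    # toy feature matrix, just for illustration (not the airfoil data)
    x_train_toy = np.array([[ 800.0, 0.3048],
                            [1000.0, 0.2286],
                            [1250.0, 0.1524]])
    x_test_toy = np.array([[1600.0, 0.1016]])

    # fit the scaler on the training features only ...
    scaler = preprocessing.StandardScaler().fit(x_train_toy)
    x_train_scaled = scaler.transform(x_train_toy)
    # ... and reuse the same means/standard deviations for the test features
    x_test_scaled = scaler.transform(x_test_toy)
    # y_train and y_test are left untouched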

Now, taking care of the above points, when I build a linear regression model, the predicted values (y_predict) are far smaller than the true values (y_test). It looks as if y_predict has been reduced by some factor. Am I missing something?

The dataset that I am using is available at http://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat

The first few lines of this dataset are as follows:

    800 0   0.3048  71.3    0.00266337  126.201
   1000 0   0.3048  71.3    0.00266337  125.201
   1250 0   0.3048  71.3    0.00266337  125.951
   1600 0   0.3048  71.3    0.00266337  127.591
   2000 0   0.3048  71.3    0.00266337  127.461
   2500 0   0.3048  71.3    0.00266337  125.571

where the last column is the output value. Clearly, the features differ widely in the values they take (the first feature is in the thousands while the third feature is only around 0.3). So, in my understanding, feature scaling is applicable here.
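
As far as I understand, StandardScaler standardizes each column separately to zero mean and unit variance, i.e. z = (x - mean) / std, so a column in the thousands and a column around 0.3 both end up within a few units of zero. A quick check on the first column of the rows above (not part of my model code):

    import numpy as np

    # first feature (frequency) from the rows shown above
    freq = np.array([800.0, 1000.0, 1250.0, 1600.0, 2000.0, 2500.0])

    # per-column standardization, as StandardScaler does (population std, ddof=0)
    z = (freq - freq.mean()) / freq.std()
    print(z)  # values now lie within a couple of units of zero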

Now, when I build the model and compare y_test with y_predict, I get significant differences. The first few comparisons are as follows:

    y_test     y_predicted
    123.965       1.730859
    124.835       2.659574
    125.625       0.581208
    123.807       0.218661
    127.127       3.279522
    122.724      -3.943073
    126.160       4.236322
As can be seen, y_predicted is significantly smaller than y_test.

I am sharing my code snippets in case they help:

scaler = preprocessing.StandardScaler().fit(x_trainn)
x_train = scaler.transform(x_trainn)  # x_trainn is the unscaled version of the training data
x_test = scaler.transform(x_testt)    # x_testt is the unscaled test data

for iteration in range(n_iterations):
    gradients = (2/m) * x_train.T.dot(x_train.dot(theta) - y_train)
    theta = theta - eta*gradients


y_predict = x_test.dot(theta)

out = np.column_stack((y_test, y_predict))
print(pd.DataFrame(out))

Best Answer

My guess is that you have accidentally transformed y_train somewhere in the code you have not posted. This is because the following reproducible snippet works:

import numpy as np
import pandas as pd
import math
from sklearn import preprocessing

dat = pd.read_csv("/home/steffen/workspaces/airfoil/airfoil_self_noise.dat",sep="\t",low_memory=False,header=None)

apply_scaler = True

# split into train 2/3 and test 1/3
rng = np.random.RandomState(42)

n_rows = dat.shape[0]
n_train = math.floor(0.66*n_rows)

permutated_indices = rng.permutation(n_rows)

train_dat = dat.loc[permutated_indices[:n_train],:]
test_dat =  dat.loc[permutated_indices[n_train:],:]

# separate the response variable (last column) from the predictor variables
x_train = train_dat.iloc[:,1:-1]
y_train = train_dat.iloc[:,-1].values[:, np.newaxis]

x_test = test_dat.iloc[:,1:-1]
y_test = test_dat.iloc[:,-1].values[:, np.newaxis]

# train
# fit the scaler to predictor variables and apply it afterwards
scaler = preprocessing.StandardScaler().fit(x_train)

if apply_scaler:
    x_train = pd.DataFrame(scaler.transform(x_train))

# add constant one for the intercept parameter
x_train = pd.concat([pd.DataFrame(np.ones(shape=(x_train.shape[0],1)),index=x_train.index),x_train],axis=1)

# fit parameters of linear regression using batch gradient descent
# Hands-On Machine Learning with Scikit-Learn & Tensorflow, page 115
eta = 0.1 # learning rate
n_iterations = 1000
m = x_train.shape[0]
theta = rng.randn(x_train.shape[1],1)

for iteration in range(n_iterations):
    gradients = (2 / m) * x_train.T.dot(x_train.dot(theta) - y_train)
    theta = theta - eta * gradients

# to apply the fitted parameters, first we have to transform the test-data in the same way
# apply scaler
if apply_scaler:
    x_test = pd.DataFrame(scaler.transform(x_test))

# add constant one for the intercept parameter
x_test = pd.concat([pd.DataFrame(np.ones(shape=(x_test.shape[0],1)),index=x_test.index),x_test],axis=1)

# apply fitted parameters
y_predict = x_test.dot(theta)

# compare output
out = np.column_stack((y_test, y_predict))
print(pd.DataFrame(out).head())
# root mean squared error
print("error %f" % np.sqrt(np.power(y_test - y_predict, 2).mean()))

This leads to the following output:

         0           1
0  120.573  127.108268
1  127.220  123.492931
2  113.045  122.393120
3  119.606  122.570836
4  131.971  127.270743
error 6.175637

which is fine.
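
To see how numbers like the ones you posted could arise: if y_train is standardized as well, the model is fitted against targets that lie within a few units of zero, so the raw predictions come out on that same scale, which is exactly what your y_predicted column looks like. A minimal sketch of that effect, and of how such predictions would have to be mapped back (y_scaler is a name I introduce here, it does not appear in your code):

    import numpy as np
    from sklearn import preprocessing

    # toy targets, roughly on the scale of the airfoil output column
    y_train_toy = np.array([[126.2], [125.2], [125.9], [127.6]])

    y_scaler = preprocessing.StandardScaler().fit(y_train_toy)   # fitted on training targets only
    y_scaled = y_scaler.transform(y_train_toy)
    print(y_scaled.ravel())         # small values near zero, like the posted y_predicted

    # a model trained against y_scaled predicts on that scale, so its predictions
    # would have to be mapped back before comparing with the unscaled y_test
    print(y_scaler.inverse_transform(y_scaled).ravel())          # back to the original scale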

It is interesting to see that, for learning rate 0.1, this simple batch gradient descent implementation fails to converge if no normalization is performed (apply_scaler=False, eta=0.1), while scikit-learn's LinearRegression implementation still finds a solution. Reducing the learning rate dramatically (eta=0.0001) leads to convergence again.
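
As a cross-check (not part of the run above, and assuming the script above has already defined x_train, y_train, x_test and y_test, with numpy imported as np), one could fit scikit-learn's LinearRegression on the same matrices and compare the error:

    from sklearn.linear_model import LinearRegression

    # x_train already contains a column of ones, so the built-in intercept is switched off
    lin_reg = LinearRegression(fit_intercept=False)
    lin_reg.fit(x_train, y_train)

    y_predict_sk = lin_reg.predict(x_test)
    print("sklearn error %f" % np.sqrt(np.power(y_test - y_predict_sk, 2).mean()))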

This is one example of where gradient descent is limited, as discussed here: Do we need gradient descent to find the coefficients of a linear regression model.
