Use a single scaler, fitted on the training set. It is best to pretend that you are in production and do not actually have the test dataset: if you fit a separate scaler on the test set, you are using information you would not have.
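As a minimal sketch of that pattern (synthetic data, illustrative names): fit the scaler once on the training set, then reuse the same fitted object on both sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# illustrative data; any train/test split works the same way
rng = np.random.RandomState(0)
x_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
x_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

scaler = StandardScaler().fit(x_train)   # fit on the training set only
x_train_s = scaler.transform(x_train)    # then transform both sets
x_test_s = scaler.transform(x_test)      # with the same fitted scaler

# the training columns are exactly standardised; the test columns only
# approximately so, because they were scaled with the training statistics
print(np.allclose(x_train_s.mean(axis=0), 0.0))
```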
My guess is that you have accidentally transformed y_train somewhere in the code you have not posted. This is because the following reproducible snippet works:
import math
import numpy as np
import pandas as pd
from sklearn import preprocessing

# UCI Airfoil Self-Noise dataset (adjust the path to your local copy)
dat = pd.read_csv("/home/steffen/workspaces/airfoil/airfoil_self_noise.dat",
                  sep="\t", low_memory=False, header=None)
apply_scaler = True

# split into train 2/3 and test 1/3
rng = np.random.RandomState(42)
n_rows = dat.shape[0]
n_train = math.floor(0.66 * n_rows)
permutated_indices = rng.permutation(n_rows)
train_dat = dat.loc[permutated_indices[:n_train], :]
test_dat = dat.loc[permutated_indices[n_train:], :]

# separate the response variable (last column) from the predictor variables
x_train = train_dat.iloc[:, 1:-1]
y_train = train_dat.iloc[:, -1].to_numpy()[:, np.newaxis]
x_test = test_dat.iloc[:, 1:-1]
y_test = test_dat.iloc[:, -1].to_numpy()[:, np.newaxis]

# train
# fit the scaler to the training predictors only and apply it afterwards
scaler = preprocessing.StandardScaler().fit(x_train)
if apply_scaler:
    x_train = pd.DataFrame(scaler.transform(x_train))
# add a constant one for the intercept parameter
x_train = pd.concat([pd.DataFrame(np.ones(shape=(x_train.shape[0], 1)),
                                  index=x_train.index), x_train], axis=1).to_numpy()

# fit parameters of linear regression using batch gradient descent
# Hands-On Machine Learning with Scikit-Learn & Tensorflow, page 115
eta = 0.1  # learning rate
n_iterations = 1000
m = x_train.shape[0]
theta = rng.randn(x_train.shape[1], 1)
for iteration in range(n_iterations):
    gradients = (2 / m) * x_train.T.dot(x_train.dot(theta) - y_train)
    theta = theta - eta * gradients

# to apply the fitted parameters, first transform the test data in the same way
# apply the scaler fitted on the training set
if apply_scaler:
    x_test = pd.DataFrame(scaler.transform(x_test))
# add a constant one for the intercept parameter
x_test = pd.concat([pd.DataFrame(np.ones(shape=(x_test.shape[0], 1)),
                                 index=x_test.index), x_test], axis=1).to_numpy()

# apply the fitted parameters
y_predict = x_test.dot(theta)

# compare output
out = np.column_stack((y_test, y_predict))
print(pd.DataFrame(out).head())

# root mean squared error
print("error %f" % np.sqrt(np.power(y_test - y_predict, 2).mean()))
This produces the following output:
0 1
0 120.573 127.108268
1 127.220 123.492931
2 113.045 122.393120
3 119.606 122.570836
4 131.971 127.270743
error 6.175637
which is fine.
It is interesting to see that, for learning rate 0.1, this simple batch gradient descent implementation fails to converge if no normalization is performed (apply_scaler=False, eta=0.1), while scikit-learn's LinearRegression implementation still finds a solution. Reducing the learning rate dramatically (eta=0.0001) restores convergence.
This is one example where gradient descent is limited, as discussed here: Do we need gradient descent to find the coefficients of a linear regression model.
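The scale sensitivity can be reproduced on synthetic data (not the airfoil set; the sizes and scales below are illustrative): with one feature a thousand times larger than the others, the same update rule explodes at eta=0.1 but stays stable with a much smaller step size.

```python
import numpy as np

np.seterr(all="ignore")  # the divergent run below overflows deliberately
rng = np.random.RandomState(42)

# two features on very different scales, mimicking unscaled data
X = np.column_stack([np.ones(200),
                     rng.uniform(0, 1, 200),
                     rng.uniform(0, 1000, 200)])
y = X.dot(np.array([[1.0], [2.0], [0.003]])) + rng.normal(scale=0.1, size=(200, 1))

def batch_gd(X, y, eta, n_iterations=1000):
    """Plain batch gradient descent, same update rule as above."""
    m = X.shape[0]
    theta = rng.randn(X.shape[1], 1)
    for _ in range(n_iterations):
        theta = theta - eta * (2 / m) * X.T.dot(X.dot(theta) - y)
    return theta

# the large-scale feature makes the eta=0.1 updates explode to non-finite values
print(np.isfinite(batch_gd(X, y, eta=0.1)).all())
# a drastically smaller step size keeps the iteration stable
print(np.isfinite(batch_gd(X, y, eta=1e-6)).all())
```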
Best Answer
If you're not trying to generalise on new data, then you don't need to.
If you are trying to generalise to new data, and if your algorithm has no hyper-parameters (i.e. settings you can tweak), then you don't need to.
If you are trying to generalise to new data, and (as is usual), you have hyper-parameters to tune, then you need to.
For example, if you were using regularised linear regression (a.k.a. "ridge" regression), then you would need to have some way of choosing the regularisation parameter, such that it will be valid when testing on new data, rather than just fitting the "training" data perfectly.
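A short sketch of that workflow, using scikit-learn's RidgeCV on synthetic data (the candidate alphas and data sizes here are illustrative): the regularisation strength is chosen by cross-validation within the training set, and the test set is touched only once, for the final evaluation.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# synthetic regression data with a known linear signal
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 10))
y = X.dot(rng.normal(size=10)) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha is selected by cross-validation on the training data only
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_train, y_train)
print(model.alpha_)                    # the chosen regularisation strength
print(model.score(X_test, y_test))     # held-out R^2, computed exactly once
```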