Model Performance – Will a Model Always Score Better on Training Dataset Than Test Dataset?

regression

I am learning LinearRegression (specifically in sklearn, Python's scikit-learn library). We make models, fit them on a training dataset, and then score them against both the training and test datasets:

from sklearn.linear_model import LinearRegression

# X_train, X_test, y_train, y_test come from an earlier train_test_split
model = LinearRegression()
model.fit(X_train, y_train)
score_on_train = model.score(X_train, y_train)
score_on_test = model.score(X_test, y_test)

My class materials say:

the model should always perform better on the training set than the testing set. This is because the model was trained on the training data and not on the testing data. Intuitively, the model should perform better on data that it has seen before versus data it has not seen.

But this is not true for my datasets: the model does not perform better on the training data.

model.score(...) on the training dataset was lower than on the test dataset, i.e. score_on_train < score_on_test.

But I am tempted to believe this "Intuitively…" explanation.

Is it always true that a model will perform better on its training data than on some test data?
Why or why not? Maybe the text I quoted is trying to describe a different phenomenon.

EDIT

So far, responses suggest the model should perform better on the training data most of the time. But I tried this suggestion: "Try different train/test splits and see if the problem persists." When I run ~1000 trials, each on 1000 observations simulated with make_regression, the training data scores higher in only ~50% of cases; hardly most of the time.

Am I doing something wrong? How can I avoid "information leaking"? (A Pipeline variant of my scaling step is sketched at the end of this edit.)

import math
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression

results=[]
# ~1000 trials
for i in range(1,1000):

    #In each trial, generate 1000 random observations
    X, y = make_regression(n_features=1, n_samples=1000, noise = 4, random_state=i)
    y=y.reshape(-1, 1) 
    #split observations into training and testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=i, train_size=0.8)

    #Scale... (am I doing this properly?)
    X_scaler = StandardScaler().fit(X_train)
    y_scaler = StandardScaler().fit(y_train)


    X_train_scaled = X_scaler.transform(X_train)
    X_test_scaled = X_scaler.transform(X_test)
    y_train_scaled = y_scaler.transform(y_train)
    y_test_scaled = y_scaler.transform(y_test)

    mdl = LinearRegression()

    #Train the model to the training data
    mdl.fit(X_train_scaled, y_train_scaled)

    #But score the model on the training data, *and the test data*
    results.append((
        # mdl.score computes the R-squared coefficient, so these two lines are
        # equivalent to mdl.score(X_train_scaled, y_train_scaled) and
        # mdl.score(X_test_scaled, y_test_scaled)
        r2_score(y_train_scaled, mdl.predict(X_train_scaled)),
        r2_score(y_test_scaled, mdl.predict(X_test_scaled)),

        # https://stackoverflow.com/a/18623635/1175496
        math.sqrt(mean_squared_error(y_train_scaled, mdl.predict(X_train_scaled))),
        math.sqrt(mean_squared_error(y_test_scaled, mdl.predict(X_test_scaled)))
    ))

train_vs_test_df = pd.DataFrame(results,  columns=('r2__train', 'r2__test', 'rmse__train', 'rmse__test'))

# Count how often the training score "wins". Higher R-squared is better;
# lower RMSE is better, so the second column counts how often the
# training RMSE is merely higher than the test RMSE.
train_vs_test_df['r2__winner_is_train'] = train_vs_test_df['r2__train'] > train_vs_test_df['r2__test']
train_vs_test_df['rmse__train_is_higher'] = train_vs_test_df['rmse__train'] > train_vs_test_df['rmse__test']
train_vs_test_df.head(10)

The first 10 trials already show that the training score "wins" only about half the time.

And when I count how many times the training data scored better, I get (497, 505):

(
train_vs_test_df['r2__winner_is_train'].sum(),
train_vs_test_df['rmse__train_is_higher'].sum()
)

… the training data scores a higher R-squared in only 497 of the 999 trials!
And the training data has a higher RMSE in 505 trials, meaning its RMSE is better (lower) in only 494 trials. In other words, roughly half! (This is very different from "always" / "almost always", which is what I was led to believe.)

When I change the parameters above (the proportion used for training vs. testing, the sample size, the random_state, …), the training data still performs better only about half the time.
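For reference, here is a minimal sketch of the Pipeline variant mentioned above: it keeps the scaling inside a Pipeline so it is only ever fit on the training split, which, as far as I understand, is the usual way to keep test information from leaking into preprocessing. (I left y unscaled here; for a refit LinearRegression, rescaling the target does not change the R-squared score.)

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

X, y = make_regression(n_features=1, n_samples=1000, noise=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.8)

# The scaler inside the pipeline is fit only when the pipeline itself is fit,
# i.e. only on X_train, so no test-set statistics leak into the preprocessing.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
print(pipe.score(X_train, y_train), pipe.score(X_test, y_test))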

Best Answer

If your training data is a very good representation of your sample space, then there will be little difference in performance measures between the training and test data. With enough coverage of the sample space, your test data is well represented in the training set and looks very much like something the model has "seen before". Numerically, your RMSE estimates on the training and test data look very close; I'd be interested to check whether there is any significant difference between them. It's a coin flip whether training or test looks better by RMSE, which indicates that your training data is a very good representation of the test data.
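One quick way to run that check (a sketch, assuming the train_vs_test_df from your edit is available): look at the per-trial differences in RMSE, and optionally a paired t-test on them.

from scipy import stats

# Per-trial difference between training RMSE and test RMSE
diff = train_vs_test_df['rmse__train'] - train_vs_test_df['rmse__test']
print(diff.describe())  # the mean difference should be very close to zero

# Paired t-test: is the mean train-minus-test RMSE difference distinguishable from zero?
print(stats.ttest_rel(train_vs_test_df['rmse__train'],
                      train_vs_test_df['rmse__test']))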

Looking at the model you're fitting, it's not too hard to see why this is the case. You're building a regression model to predict an output from just a single input feature. Even with noise, it's very easy to find a linear model that fits well, especially when given 800 data points to train on. When you go to the test set, there's nothing there that wasn't adequately represented in the training data, and the model is simple enough that overfitting isn't really an issue. For this simple case, your training and test data are essentially equivalent, which is why it's a 50-50 chance of which one performs better.
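To see the contrast, here is a small sketch (departing from your setup) where overfitting is essentially guaranteed: with more features than training samples, ordinary least squares reproduces the training data almost exactly, so the training score beats the test score in virtually every trial.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

wins = 0
n_trials = 200
for i in range(n_trials):
    # 80 samples, 50 features: the 40-observation training split has more
    # features than observations, so the linear model overfits badly
    X, y = make_regression(n_features=50, n_samples=80, noise=20, random_state=i)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=i, train_size=0.5)
    mdl = LinearRegression().fit(X_tr, y_tr)
    wins += mdl.score(X_tr, y_tr) > mdl.score(X_te, y_te)

print(wins / n_trials)  # close to 1.0: training R-squared wins almost every time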
