Model Performance – Will a Model Always Score Better on Training Dataset Than Test Dataset?

regression

I am learning LinearRegression (specifically in sklearn, Python's scikit-learn library). We make models, fit them on a training dataset, and then score them against both the training and test datasets:

from sklearn.linear_model import LinearRegression

# X_train, X_test, y_train, y_test come from an earlier train_test_split
model = LinearRegression()
model.fit(X_train, y_train)
score_on_train = model.score(X_train, y_train)
score_on_test = model.score(X_test, y_test)

My class materials say:

the model should always perform better on the training set than the testing set. This is because the model was trained on the training data and not on the testing data. Intuitively, the model should perform better on data that it has seen before versus data it has not seen.

But this is not true for my datasets: the model does not perform better on the training data.

model.score(...) on the training dataset was lower than on the test dataset, i.e. score_on_train < score_on_test.

But I am tempted to believe this "Intuitively…" explanation.

Is it always true that a model will perform better on its training data than on some test data?
Why or why not? Maybe the text I quoted is trying to describe a different phenomenon.

EDIT

So far, responses suggest the model should perform better on the training data most of the time. But I tried this suggestion: "Try different train/test splits and see if the problem persists." When I run ~1000 trials, each on 1000 observations simulated with make_regression, the training data scores higher in only ~50% of cases; hardly most of the time.

Am I doing something wrong? How can I avoid "information leaking"? (A Pipeline variant of my scaling step is sketched at the end of this edit.)

import math
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression

results=[]
# ~1000 trials
for i in range(1,1000):

    #In each trial, generate 1000 random observations
    X, y = make_regression(n_features=1, n_samples=1000, noise = 4, random_state=i)
    y=y.reshape(-1, 1) 
    #split observations into training and testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=i, train_size=0.8)

    #Scale... (am I doing this properly?)
    X_scaler = StandardScaler().fit(X_train)
    y_scaler = StandardScaler().fit(y_train)


    X_train_scaled = X_scaler.transform(X_train)
    X_test_scaled = X_scaler.transform(X_test)
    y_train_scaled = y_scaler.transform(y_train)
    y_test_scaled = y_scaler.transform(y_test)

    mdl = LinearRegression()

    #Train the model to the training data
    mdl.fit(X_train_scaled, y_train_scaled)

    #But score the model on the training data, *and the test data*
    results.append((
        # mdl.score computes the R-squared coefficient, so these two lines are
        # equivalent to mdl.score(X_train_scaled, y_train_scaled) and
        # mdl.score(X_test_scaled, y_test_scaled)
        r2_score(y_train_scaled, mdl.predict(X_train_scaled)),
        r2_score(y_test_scaled, mdl.predict(X_test_scaled)),

        # https://stackoverflow.com/a/18623635/1175496
        math.sqrt(mean_squared_error(y_train_scaled, mdl.predict(X_train_scaled))),
        math.sqrt(mean_squared_error(y_test_scaled, mdl.predict(X_test_scaled)))
    ))

train_vs_test_df = pd.DataFrame(results,  columns=('r2__train', 'r2__test', 'rmse__train', 'rmse__test'))

# Count how often the training score "wins". Higher R-squared is better;
# lower RMSE is better, so the second column counts how often the
# training RMSE is merely higher than the test RMSE.
train_vs_test_df['r2__winner_is_train'] = train_vs_test_df['r2__train'] > train_vs_test_df['r2__test']
train_vs_test_df['rmse__train_is_higher'] = train_vs_test_df['rmse__train'] > train_vs_test_df['rmse__test']
train_vs_test_df.head(10)

The first 10 trials already show that the training score "wins" only about half the time.

And when I count how many times the training data scored better, I get (497, 505):

(
train_vs_test_df['r2__winner_is_train'].sum(),
train_vs_test_df['rmse__train_is_higher'].sum()
)

… the training data scores a higher R-squared in only 497 of the 999 trials!
And the training data has a higher RMSE in 505 trials, meaning its RMSE is better (lower) in only 494 trials. In other words, roughly half! (This is very different from "always" / "almost always", which is what I was led to believe.)

When I change the parameters above (the proportion used for training vs. testing, the sample size, the random_state, …), the training data still performs better only about half the time.
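For reference, here is a minimal sketch of the Pipeline variant mentioned above: it keeps the scaling inside a Pipeline so it is only ever fit on the training split, which, as far as I understand, is the usual way to keep test information from leaking into preprocessing. (I left y unscaled here; for a refit LinearRegression, rescaling the target does not change the R-squared score.)

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

X, y = make_regression(n_features=1, n_samples=1000, noise=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.8)

# The scaler inside the pipeline is fit only when the pipeline itself is fit,
# i.e. only on X_train, so no test-set statistics leak into the preprocessing.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
print(pipe.score(X_train, y_train), pipe.score(X_test, y_test))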

Best Answer

If your training data is a very good representation of your sample space, then there will be little difference in performance measures between the training and test data. With enough coverage of the sample space, your test data is well represented in the training set and looks very much like something the model has "seen before". Numerically, your RMSE estimates on the training and test data look very close; I'd be interested to check whether there is any significant difference between them. It's a coin flip whether training or test looks better by RMSE, which indicates that your training data is a very good representation of the test data.
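One quick way to run that check (a sketch, assuming the train_vs_test_df from your edit is available): look at the per-trial differences in RMSE, and optionally a paired t-test on them.

from scipy import stats

# Per-trial difference between training RMSE and test RMSE
diff = train_vs_test_df['rmse__train'] - train_vs_test_df['rmse__test']
print(diff.describe())  # the mean difference should be very close to zero

# Paired t-test: is the mean train-minus-test RMSE difference distinguishable from zero?
print(stats.ttest_rel(train_vs_test_df['rmse__train'],
                      train_vs_test_df['rmse__test']))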

Looking at the model you're fitting, it's not too hard to see why this is the case. You're building a regression model to predict an output from just a single input feature. Even with noise, it's very easy to find a linear model that fits well, especially when given 800 data points to train on. When you go to the test set, there's nothing there that wasn't adequately represented in the training data, and the model is simple enough that overfitting isn't really an issue. For this simple case, your training and test data are essentially equivalent, which is why it's a 50-50 chance of which one performs better.
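To see the contrast, here is a small sketch (departing from your setup) where overfitting is essentially guaranteed: with more features than training samples, ordinary least squares reproduces the training data almost exactly, so the training score beats the test score in virtually every trial.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

wins = 0
n_trials = 200
for i in range(n_trials):
    # 80 samples, 50 features: the 40-observation training split has more
    # features than observations, so the linear model overfits badly
    X, y = make_regression(n_features=50, n_samples=80, noise=20, random_state=i)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=i, train_size=0.5)
    mdl = LinearRegression().fit(X_tr, y_tr)
    wins += mdl.score(X_tr, y_tr) > mdl.score(X_te, y_te)

print(wins / n_trials)  # close to 1.0: training R-squared wins almost every time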
