Solved – Why use a train/test split with linear regression

machine-learning, python, regression, scikit-learn

I am using linear regression to fit a y = mx + b line through my data, and I just want to know how good a fit my best-fit line is. So I thought I would just call clf.score(X_train, y_train) on the points I've already used to train my algorithm; I just want to see how my line compares to the mean-of-y baseline. Do I need to split my data into train and test sets and then run it, or should I just test on my training data, because the line can't deviate from it anyway? And why?

Best Answer

If you're not trying to generalise to new data, then you don't need to.

If you are trying to generalise to new data, and if your algorithm has no hyper-parameters (i.e. settings you can tweak), then you don't need to.
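To illustrate that second case, here's a minimal sketch (with made-up synthetic data) of plain, hyper-parameter-free linear regression: scoring on the training data itself is fine if all you want is the R² of the fitted line against the mean-of-y baseline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, size=50)

clf = LinearRegression().fit(X, y)

# R^2 on the same data: 1.0 is a perfect fit, 0.0 means the line
# does no better than always predicting the mean of y.
print(clf.score(X, y))
```

Note that this number describes how well the line fits *these* points; it says nothing about how the model would do on points it hasn't seen.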

If you are trying to generalise to new data and, as is usual, you have hyper-parameters to tune, then you need to.

For example, if you were using regularised linear regression (a.k.a. "ridge" regression), then you would need some way of choosing the regularisation parameter such that it remains valid when testing on new data, rather than just fitting the "training" data perfectly.
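A sketch of that ridge case, again with made-up data: scikit-learn's `RidgeCV` chooses the regularisation strength by cross-validation on the training portion, and the held-out test set then gives an honest estimate of how the chosen model generalises.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic data for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, size=100)

# Hold out a test set so the tuned model is judged on unseen points
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# RidgeCV picks the best alpha from the candidates via cross-validation
# on the training data only -- the test set plays no part in tuning.
reg = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_train, y_train)

print(reg.alpha_)                    # the chosen regularisation strength
print(reg.score(X_test, y_test))     # R^2 on data the model never saw
```

If you scored on the training data here instead, you would systematically favour weaker regularisation, because less regularisation always fits the training points at least as well.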