Solved – How to evaluate results of linear regression

linear model, regression

I have a linear regression problem. In short, I have a dataset that I divided into two subsets: one is used to fit the linear regression (the training subset), the other is used to evaluate it (the evaluation subset). My question is: how do I evaluate the result of this linear regression after applying it to the evaluation subset of the data?

Here are the details:

On the training subset, I fit a linear regression $y = ax + b$, where $y$ is the ground truth (also known as the target) and $x$ is the independent variable. From this I obtain $a$ and $b$ ($x$ and $y$ are given in the training subset).
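For concreteness, a minimal sketch of this fit (in Python with NumPy, using made-up example numbers rather than my real data) might look like this:

```python
import numpy as np

# Made-up training subset; in the real problem x and y come from the data.
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Ordinary least squares for y = a*x + b, in closed form:
#   a = cov(x, y) / var(x),   b = mean(y) - a * mean(x)
x_mean, y_mean = x_train.mean(), y_train.mean()
a = np.sum((x_train - x_mean) * (y_train - y_mean)) / np.sum((x_train - x_mean) ** 2)
b = y_mean - a * x_mean
print(f"a = {a:.3f}, b = {b:.3f}")
```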

Now, using the $a$ and $b$ found from the training subset, I apply them to the evaluation subset to get $y' = ax' + b$. In other words, these $y'$ are the predictions of the linear regression given $x'$. In addition to $y'$, I also have the true $y$ from the evaluation set. How do I evaluate my result, i.e., how much $y'$ differs from $y$? Is there a general mathematical model to do that? It needs to be some sort of mathematical model or formula. I can think of different ways to do it (one simple example is sketched below), but they are all rather ad hoc or simplistic, and since this is for scientific work, anything that sounds ad hoc unfortunately cannot be used.
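For example, the kind of simple comparison I could already do looks like this (a minimal sketch in Python with NumPy, using made-up evaluation numbers and placeholder values for the coefficients $a$ and $b$ that would come from the training fit):

```python
import numpy as np

# Placeholder coefficients; in practice these are the a and b estimated
# on the training subset.
a, b = 2.0, 0.1

# Made-up evaluation subset.
x_eval = np.array([1.5, 2.5, 3.5, 4.5])
y_eval = np.array([3.2, 5.0, 7.1, 8.9])   # true targets y

y_pred = a * x_eval + b                   # predictions y' = a*x' + b
residuals = y_eval - y_pred

mse  = np.mean(residuals ** 2)            # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error
mae  = np.mean(np.abs(residuals))         # mean absolute error
r2   = 1.0 - np.sum(residuals ** 2) / np.sum((y_eval - y_eval.mean()) ** 2)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
```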

Any ideas?

Best Answer

I'd agree with @Octern that one rarely sees people using train/test splits (or even things like cross-validation) for linear models. Overfitting is (almost) certainly not an issue with a very simple model like this one.

If you want to get a sense of your model's "quality", you may want to report confidence intervals (or their Bayesian equivalents) around your regression coefficients. There are several ways to do this. If you know, or can assume, that your errors are normally distributed, there is a simple formula (and most popular data analysis packages will give you these values). Another popular alternative is to compute them through resampling (e.g., bootstrapping or jackknifing), which makes fewer assumptions about the distribution of the errors. In either case, I'd use the complete data set for the computation.