Solved – Linear model – Understanding performance on training and test sets

cross-validation, linear model, regression

I have a small normalized data set: 30 observations and 18 predictors. All are continuous and some of the predictors are correlated. I ran linear regression on it using Weka. The model automatically dropped some collinear predictors, reducing the number of predictors from 18 to 7. I then plotted the model's predictions for the target variable against the actual target values (training + test set). The correlation coefficient was 0.92.
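(For context, here is a minimal sketch of the general idea behind dropping collinear predictors. Weka's LinearRegression has its own collinearity-elimination procedure, so this is only an illustration of the concept on synthetic data, not Weka's algorithm.)

```python
# Rough sketch: greedily drop any predictor whose pairwise correlation
# with an already-kept predictor exceeds a threshold. This is NOT Weka's
# procedure, just an illustration of collinearity elimination.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 18))                    # 30 observations, 18 predictors
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=30)   # make column 1 collinear with column 0

def drop_collinear(X, threshold=0.95):
    """Keep a column only if its |correlation| with every kept column is below threshold."""
    kept = []
    for j in range(X.shape[1]):
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < threshold for k in kept):
            kept.append(j)
    return kept

print(drop_collinear(X))   # column 1 is dropped, all other columns survive
```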

[Figure: predicted vs. actual target variable]

The mean squared error on the test set was 12% when I split the data so that 10% of it (3 instances) was used as the test set. The error increases as I enlarge the test set, which is expected given how small the data set is.
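(To show how unstable such a tiny test set is, here is a small sketch on synthetic stand-in data with the same shape as mine, 30 observations and 18 predictors: refitting with different random splits makes the test MSE swing wildly while the training correlation stays high.)

```python
# Why a 3-point test set is so noisy: refit with different random splits
# and watch the test MSE swing while the training fit stays strong.
# Synthetic stand-in data (30 x 18); the original data set is not available.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(30, 18))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=30)

for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=3, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    test_mse = np.mean((model.predict(X_te) - y_te) ** 2)
    train_corr = np.corrcoef(model.predict(X_tr), y_tr)[0, 1]
    print(f"seed={seed}  train corr={train_corr:.2f}  test MSE={test_mse:.2f}")
```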

QUESTIONS:

  • A quick look at the figure above suggests that the model should perform well on most of the test points too, since the points are not very scattered and the model is linear. Why does this graph look so deceptively good, when just changing the random seed in Weka, i.e. holding out a different set of 3 test points with the same model, drops the correlation coefficient to 0.33 and raises the mean squared error to 82%?

  • In some cases the correlation was still high (above 0.80) but the mean error was over 100%! I believed that a linear model cannot possibly overfit the data, since it is just a straight line trying to fit the data points (correct me if I am wrong); overfitting is usually an issue with high-order polynomials. It cannot be underfitting either, since it performs quite well on the training set here, as the high correlation indicates. So what is it doing?

(The figures below are not from the data above; they are for illustration purposes only.)
[Figures: y ~ x fits illustrating under- and over-fitting with polynomials]

Many thanks for any help!

Best Answer

I'm afraid you're incorrect: a linear model certainly can over-fit the data. You have 30 observations and 18 predictors, which is fewer than 2 observations per predictor!

The classic rule of thumb is one predictor for every ten observations (or, for logistic regression, one predictor per ten events). By that rule, 18 predictors would call for roughly 180 observations; with only 30 observations, you could justify about 3 predictors.

I'm afraid the graphs you've included are confusing the issue. They depict a y ~ x (one-predictor) model that is over-fit by generating polynomial terms, i.e. expanding the x predictor into x^2, x^3, and so on.

Look at the linear equation you're actually estimating: it is large, and it looks like the one under the over-fit model on the right, except that in your case it is y ~ x + t + g + f + .... You're over-fitting in a different way. Rather than taking x and generating x, x^2, x^3, ... to "over-parameterize" the model, you are simply using all 18 raw predictors.
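(To see that both routes land in the same place, here is a minimal sketch on synthetic data; the shapes roughly mirror your setup, but none of it is your actual data. A degree-18 polynomial in one predictor and 18 raw predictors both fit the training set almost perfectly and fall apart on held-out points.)

```python
# Two routes to the same over-fit: expanding one predictor into polynomial
# terms, or simply using many raw predictors. With ~1.4 training observations
# per coefficient either way, the training R^2 is near-perfect but the test
# R^2 collapses. Synthetic data for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

def train_test_r2(X, y, n_train=25):
    model = LinearRegression().fit(X[:n_train], y[:n_train])
    return model.score(X[:n_train], y[:n_train]), model.score(X[n_train:], y[n_train:])

n = 30
x = rng.uniform(-1, 1, size=(n, 1))
y = 2 * x[:, 0] + rng.normal(scale=0.3, size=n)

# Route 1: one predictor expanded to x, x^2, ..., x^18
X_poly = PolynomialFeatures(degree=18, include_bias=False).fit_transform(x)
print("polynomial (train R^2, test R^2):", train_test_r2(X_poly, y))

# Route 2: 18 raw predictors (only the first carries any signal)
X_wide = np.hstack([x, rng.normal(size=(n, 17))])
print("18 raw predictors (train R^2, test R^2):", train_test_r2(X_wide, y))
```

Either way the model has far too many coefficients for the data, which is exactly why your results change so drastically from one random split to the next.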

Hope that makes some sense.