Solved – How to validate the multiple linear regression model

Tags: cross-validation, multiple regression, regression, validation

I have split my data into two parts. I used 80% of the data to build a multiple linear regression model, and now I want to test it on the remaining 20%. What tools does Minitab offer for this check?

Edit: I think I can use the PRESS statistic. The PRESS for the 80% data was 6000, and using this model I calculated a PRESS of 1000 for the 20% data. Should I compare 6000 with 1000, or 6000 with 5 times 1000, i.e. 5000?

Best Answer

Note that the predicted residual sum of squares, PRESS, is obtained by jackknifing the sample (leaving out each observation in turn, refitting, & predicting the omitted observation), so there's no sense in calculating it separately for training & test sets. Calculate it for a model fitted to the whole sample (& compare it to the residual sum of squares, RSS, to assess the amount of over-fitting). For ordinary least-squares regression there's an analytic solution:

$$\mathrm{PRESS}=\sum_i \left(\frac{e_i}{1-h_{ii}}\right)^2$$

where $e_i$ is the $i$th residual & $h_{ii}$ its leverage—from the diagonal of the hat matrix

$$H=X(X^\mathrm{T}X)^{-1}X^\mathrm{T}$$

(where $X$ is the design matrix).
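
As a rough illustration of the formula above, here is a minimal NumPy sketch that computes PRESS from the residuals & leverages and checks it against the brute-force version that refits the model with each observation left out. The function names (`press_statistic`, `press_by_refitting`) and the simulated data are assumptions made for this example; they are not part of Minitab or of the original question.

```python
import numpy as np

def press_statistic(X, y):
    """PRESS for OLS via the leverage shortcut.

    X is the design matrix and must already contain the intercept column.
    """
    # For full-column-rank X, H = Q Q^T with Q from the thin QR decomposition,
    # so the leverages h_ii are the row sums of Q**2.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                      # ordinary residuals
    return np.sum((e / (1.0 - h))**2)

def press_by_refitting(X, y):
    """Same quantity, computed the slow way: leave one out, refit, predict."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta)**2
    return total

# Simulated data, purely illustrative
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

press = press_statistic(X, y)
press_loo = press_by_refitting(X, y)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta)**2)

print(f"PRESS (leverage formula): {press:.3f}")
print(f"PRESS (refitting):        {press_loo:.3f}")
print(f"RSS:                      {rss:.3f}")
```

The two PRESS values should agree up to floating-point error, and PRESS is never smaller than RSS because every residual is divided by a factor between 0 and 1. The size of the gap between the two is what indicates over-fitting.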

In general, cross-validation & bootstrap validation are preferable to splitting a sample into training & test sets: you don't lose precision in the coefficient estimates, as you do when fitting on a smaller training set, & the performance measure is less variable than one computed on a single test set. How much preferable depends on sample size: the smaller the sample, the more a single split costs you.
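
To make the cross-validation alternative concrete, here is a small sketch of k-fold cross-validation for an ordinary least-squares fit, again in plain NumPy. The helper `kfold_mse`, the choice of 5 folds, and the simulated data are assumptions for illustration only. With `k` equal to the number of observations it reduces to leave-one-out, whose mean squared prediction error is PRESS divided by the sample size.

```python
import numpy as np

def kfold_mse(X, y, k, rng):
    """Mean squared prediction error from k-fold cross-validation of an OLS fit.

    X must already contain the intercept column.
    """
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)   # random, near-equal folds
    sq_err = 0.0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Fit on the other k-1 folds, predict the held-out fold
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        sq_err += np.sum((y[test_idx] - X[test_idx] @ beta)**2)
    return sq_err / n   # every observation is predicted exactly once

# Simulated data, purely illustrative
rng = np.random.default_rng(1)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(size=n)

cv5 = kfold_mse(X, y, k=5, rng=rng)
loo = kfold_mse(X, y, k=n, rng=rng)   # leave-one-out, i.e. PRESS / n

print(f"5-fold CV MSE:      {cv5:.3f}")
print(f"Leave-one-out MSE:  {loo:.3f}")
```

Every observation is used for both fitting & testing, which is where the gain over a single 80/20 split comes from.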
