Solved – How to interpret the results from cross-validation using the cv.lm function in R

cross-validation, linear model, predictive-models, r, regression

I've fitted a linear model and now want to use it to generate predictions. First, though, I want to test the model's predictive accuracy. So far I've fitted the model and run a 5-fold cross-validation with the following code and sample output:

library(DAAG)                                           # cv.lm() comes from the DAAG package
reg <- lm(logWet.weight ~ logAverageBL, data = mtross)
cv.lm(mtross, reg, m = 5)                               # 5-fold cross-validation

Analysis of Variance Table

Response: logWet.weight
             Df Sum Sq Mean Sq F value              Pr(>F)    
logAverageBL  1  10.42   10.42     808 <0.0000000000000002 ***
Residuals    38   0.49    0.01                                

fold 1 
Observations in test set: 8 
                   2       3    9     11      15      19     34
logAverageBL  1.6911  1.1949 1.44  1.083  1.1236  1.2682 1.4668
cvpred        1.0956 -0.3033 0.39 -0.619 -0.5042 -0.0968 0.4631
logWet.weight 1.1144 -0.3861 0.82 -0.678 -0.5993 -0.1207 0.5074
CV residual   0.0189 -0.0828 0.43 -0.059 -0.0951 -0.0239 0.0442
                   36
logAverageBL   1.4804
cvpred         0.5015
logWet.weight  0.4718
CV residual   -0.0297

Sum of squares = 0.21    Mean square = 0.03    n = 8 

fold 2 .....


....Overall (Sum over all 8 folds) 
    ms 
0.0133

There seem to be many questions here about cross-validation, but I don't understand which information in this output is important. If the overall ms is similar to the ms of each fold, does that suggest that the predictive accuracy is good? I understand that the overall ms (the cross-validated mean square) is the error of my predictions, but I don't know what would indicate an acceptable amount of error.

Best Answer

An acceptable amount of error depends on many things, such as the scale of the response (its total sum of squares). To relate the MSE (mean squared error) to something tangible, it is usually compared to the cross-validated MSE of other versions or variations of the model.
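As a rough, hedged sketch of that idea (not part of the original answer): one simple comparison is against an intercept-only model, whose prediction error is roughly the variance of the response. The names mtross and logWet.weight are taken from the question, and the snippet assumes cv.lm() returns the data with a cvpred column, as the printed output suggests.

library(DAAG)

# Cross-validate the single-predictor model without printing fold-by-fold output
cv.out <- cv.lm(mtross, logWet.weight ~ logAverageBL, m = 5, printit = FALSE)

cv.mse  <- mean((cv.out$logWet.weight - cv.out$cvpred)^2)  # overall CV mean square
null.ms <- var(mtross$logWet.weight)                       # ~ error of an intercept-only model

cv.mse / null.ms  # a small ratio means the model predicts much better than the mean alone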

From what I can gather, you first built your model, ended up with one final model, and then cross-validated it. If your aim is to build a model for prediction, however, it is customary to use cross-validation as part of the model-building process itself.

For example, take the full model, with all variables included, and see what cross-validated MSE that gives. Repeat this for every combination of variables (including the null model, with just an intercept) and cross-validate each of those models. That should give you a list of models and their MSEs. Now you have an idea of what range of MSE to expect and which model offers the lowest prediction error. A minimal sketch of this procedure is given below.
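The sketch below assumes the data live in a data frame called mtross, as in the question; with only one predictor there are just two candidate models, and any further predictors would be added to the list the same way. It loops over candidate formulas, cross-validates each with cv.lm(), and collects the resulting mean squares for comparison.

library(DAAG)

# Candidate models: null model (intercept only) plus the predictor of interest
formulas <- list(
  null = logWet.weight ~ 1,
  bl   = logWet.weight ~ logAverageBL
)

cv.mse <- sapply(formulas, function(f) {
  out <- cv.lm(mtross, f, m = 5, printit = FALSE, plotit = FALSE)
  mean((out$logWet.weight - out$cvpred)^2)   # overall cross-validated mean square
})

cv.mse  # the model with the lowest CV mean square predicts best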