Solved – How to interpret the results from cross-validation using the cv.lm function in R

cross-validation, linear model, predictive-models, r, regression

I've fitted a linear model and now want to use it to generate predictions. First, though, I want to test the model's predictive accuracy. So far I've fitted the model and run a 5-fold cross-validation with the following code and sample output:

library(DAAG)                                           # cv.lm() comes from the DAAG package
reg <- lm(logWet.weight ~ logAverageBL, data = mtross)
cv.lm(mtross, reg, m = 5)                               # 5-fold cross-validation

Analysis of Variance Table

Response: logWet.weight
             Df Sum Sq Mean Sq F value              Pr(>F)    
logAverageBL  1  10.42   10.42     808 <0.0000000000000002 ***
Residuals    38   0.49    0.01                                

fold 1 
Observations in test set: 8 
                   2       3    9     11      15      19     34
logAverageBL  1.6911  1.1949 1.44  1.083  1.1236  1.2682 1.4668
cvpred        1.0956 -0.3033 0.39 -0.619 -0.5042 -0.0968 0.4631
logWet.weight 1.1144 -0.3861 0.82 -0.678 -0.5993 -0.1207 0.5074
CV residual   0.0189 -0.0828 0.43 -0.059 -0.0951 -0.0239 0.0442
                   36
logAverageBL   1.4804
cvpred         0.5015
logWet.weight  0.4718
CV residual   -0.0297

Sum of squares = 0.21    Mean square = 0.03    n = 8 

fold 2 .....


....Overall (Sum over all 8 folds) 
    ms 
0.0133

There seem to be many questions here about cross-validation, but I don't understand which information in this output is important. If the overall ms is similar to the ms of each fold, does that suggest that the predictive accuracy is good? I understand that the overall ms (the cross-validated mean square) is the error of my predictions, but I don't know what would indicate an acceptable amount of error.

Best Answer

An acceptable amount of error depends on many things, such as the scale of the response (its total sum of squares). To relate the MSE (mean squared error) to something tangible, it is usually compared to the cross-validated MSE of other versions or variations of the model.
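As a rough, hedged sketch of that idea (not part of the original answer): one simple comparison is against an intercept-only model, whose prediction error is roughly the variance of the response. The names mtross and logWet.weight are taken from the question, and the snippet assumes cv.lm() returns the data with a cvpred column, as the printed output suggests.

library(DAAG)

# Cross-validate the single-predictor model without printing fold-by-fold output
cv.out <- cv.lm(mtross, logWet.weight ~ logAverageBL, m = 5, printit = FALSE)

cv.mse  <- mean((cv.out$logWet.weight - cv.out$cvpred)^2)  # overall CV mean square
null.ms <- var(mtross$logWet.weight)                       # ~ error of an intercept-only model

cv.mse / null.ms  # a small ratio means the model predicts much better than the mean alone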

From what I can gather, you first built your model, ended up with one final model, and then cross-validated it. If your aim is to build a model for prediction, however, it is customary to use cross-validation as part of the model-building process itself.

For example, take the full model, with all variables included, and see what cross-validated MSE that gives. Repeat this for every combination of variables (including the null model, with just an intercept) and cross-validate each of those models. That should give you a list of models and their MSEs. Now you have an idea of what range of MSE to expect and which model offers the lowest prediction error. A minimal sketch of this procedure is given below.
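The sketch below assumes the data live in a data frame called mtross, as in the question; with only one predictor there are just two candidate models, and any further predictors would be added to the list the same way. It loops over candidate formulas, cross-validates each with cv.lm(), and collects the resulting mean squares for comparison.

library(DAAG)

# Candidate models: null model (intercept only) plus the predictor of interest
formulas <- list(
  null = logWet.weight ~ 1,
  bl   = logWet.weight ~ logAverageBL
)

cv.mse <- sapply(formulas, function(f) {
  out <- cv.lm(mtross, f, m = 5, printit = FALSE, plotit = FALSE)
  mean((out$logWet.weight - out$cvpred)^2)   # overall cross-validated mean square
})

cv.mse  # the model with the lowest CV mean square predicts best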