Solved – pls: %variance explained for training, CV, and LOO identical for all components

partial least squares regression

My question is more theoretical, but I'll walk you through how I got there.

I fit a PLS regression model on the training set (n=22, 8 variables) and performed 10-fold and LOO CV (no external test set):

library(pls)

# Fit on all data, then refit with 10-fold and leave-one-out cross-validation
train   <- plsr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = mydata, scale = TRUE, validation = "none")
tenfold <- plsr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = mydata, scale = TRUE, validation = "CV")
loo     <- plsr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = mydata, scale = TRUE, validation = "LOO")

Calling explvar() on each of the above models gives the % explained variance per component, e.g.

           Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8
training     42     12     12     15    7.7    1.7    9.3   0.43
tenfold      42     12     12     15    7.7    1.7    9.3   0.43
loo          42     12     12     15    7.7    1.7    9.3   0.43
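A table like this can be assembled, and the unrounded values compared, with something along these lines (using the model objects defined above):

rbind(training = explvar(train),
      tenfold  = explvar(tenfold),
      loo      = explvar(loo))

# compare without rounding
identical(explvar(train), explvar(tenfold))
identical(explvar(train), explvar(loo))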

Does it make sense that the % explained variance is identical (I checked the unrounded values too) for training, tenfold, and loo? Or is it just because my dataset is so small that 10-fold and LOO CV are almost the same (test sets of 2 and 1 samples per fold, respectively), so this is to be expected? But then, why the similarity with the training-set model?

Best Answer

You cannot get a single explained variance per component from LOO or k-fold cross-validation, because a PLS model has to be built from scratch each time a sample is left out (LOO) or each time n/k samples are left out (k-fold CV). For example, with 22 samples and LOO CV there are 22 different PLS models and therefore 22 different sets of explained variances per component.
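A minimal sketch of that point, assuming a data frame mydata with columns Y and X1–X8 as in the question: fit one model per left-out sample and collect explvar() from each fold; the rows will generally differ.

library(pls)

# One PLS model per left-out sample; each fold has its own explained variances
fold_explvar <- t(sapply(seq_len(nrow(mydata)), function(i) {
  fit <- plsr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8,
              data = mydata[-i, ], scale = TRUE, validation = "none")
  explvar(fit)
}))
fold_explvar                   # 22 rows (one per fold), generally all different
apply(fold_explvar, 2, range)  # spread of % explained variance across folds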

So it is very likely that the explained variances you are seeing all come from the PLS model fitted on the complete data set (your training set). Specifying a validation argument only adds another object, containing the cross-validation errors, to the resulting model's structure. In other words, which type of CV you choose, or whether you enable CV at all, does not change the model you obtain; you just get additional information that helps with, for example, choosing the number of components.
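You can check this directly on your own objects; the fitted parts are identical, and the CV run only adds a validation component (a sketch using standard pls accessors):

# The fitted model is the same with or without CV; only $validation is added
all.equal(coef(train, ncomp = 8), coef(loo, ncomp = 8))  # identical regression coefficients
all.equal(explvar(train), explvar(loo))                  # identical % explained variance
names(loo$validation)                                    # the extra CV information
RMSEP(loo, estimate = "CV")   # cross-validated RMSEP, e.g. for choosing ncomp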
