Solved – pls: %variance explained for training, CV, and LOO identical for all components

partial least squares regression

My question is more theoretical, but I'll walk you through how I got there.

I fit a PLS regression model on the training set (n=22, 8 variables) and performed 10-fold and LOO CV (no external test set):

library(pls)

# Fit on all data, then refit with 10-fold and leave-one-out cross-validation
train   <- plsr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = mydata, scale = TRUE, validation = "none")
tenfold <- plsr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = mydata, scale = TRUE, validation = "CV")
loo     <- plsr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = mydata, scale = TRUE, validation = "LOO")

Calling explvar() on each of the above models gives the % explained variance per component, e.g.

           Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8
training     42     12     12     15    7.7    1.7    9.3   0.43
tenfold      42     12     12     15    7.7    1.7    9.3   0.43
loo          42     12     12     15    7.7    1.7    9.3   0.43
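A table like this can be assembled, and the unrounded values compared, with something along these lines (using the model objects defined above):

rbind(training = explvar(train),
      tenfold  = explvar(tenfold),
      loo      = explvar(loo))

# compare without rounding
identical(explvar(train), explvar(tenfold))
identical(explvar(train), explvar(loo))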

Does it make sense that the % explained variance is identical (I checked the unrounded values too) for training, tenfold, and loo? Or is it just because my dataset is so small that 10-fold and LOO CV are almost the same (test sets of 2 and 1 samples per fold, respectively), so this is to be expected? But then, why the similarity with the training-set model?

Best Answer

You cannot get a single explained variance per component from LOO or k-fold cross-validation, because a PLS model has to be built from scratch each time a sample is left out (LOO) or each time n/k samples are left out (k-fold CV). For example, with 22 samples and LOO CV there are 22 different PLS models and therefore 22 different sets of explained variances per component.
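A minimal sketch of that point, assuming a data frame mydata with columns Y and X1–X8 as in the question: fit one model per left-out sample and collect explvar() from each fold; the rows will generally differ.

library(pls)

# One PLS model per left-out sample; each fold has its own explained variances
fold_explvar <- t(sapply(seq_len(nrow(mydata)), function(i) {
  fit <- plsr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8,
              data = mydata[-i, ], scale = TRUE, validation = "none")
  explvar(fit)
}))
fold_explvar                   # 22 rows (one per fold), generally all different
apply(fold_explvar, 2, range)  # spread of % explained variance across folds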

So it is very likely that the explained variances you are seeing all come from the PLS model fitted on the complete data set (your training set). Specifying a validation argument only adds another object, containing the cross-validation errors, to the resulting model's structure. In other words, which type of CV you choose, or whether you enable CV at all, does not change the model you obtain; you just get additional information that helps with, for example, choosing the number of components.
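You can check this directly on your own objects; the fitted parts are identical, and the CV run only adds a validation component (a sketch using standard pls accessors):

# The fitted model is the same with or without CV; only $validation is added
all.equal(coef(train, ncomp = 8), coef(loo, ncomp = 8))  # identical regression coefficients
all.equal(explvar(train), explvar(loo))                  # identical % explained variance
names(loo$validation)                                    # the extra CV information
RMSEP(loo, estimate = "CV")   # cross-validated RMSEP, e.g. for choosing ncomp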
