Solved – Choosing the number of components in PLS – without a clear minimum in RMSEP

Tags: cross-validation, partial least squares

I use the plsr function in R with the oscorespls algorithm to analyse my datasets. The datasets are characterized by a relatively small number of observations (22), one response variable, and varying numbers of predictors (from 4 upwards). However, it is difficult to choose the number of components to use for interpreting the results. I use leave-one-out cross-validation, but the resulting RMSEP curve has several local minima, and the results vary a lot depending on the chosen number of components.

Hence, does anyone know how to make this choice in a principled, scientific way?
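For reference, a minimal sketch of the call I use (the data frame mydata is a placeholder for my data; the response is named SEC, as in the output below):

library(pls)

# fit a PLS model with the orthogonal scores (NIPALS) algorithm and
# leave-one-out cross-validation
fit <- plsr(SEC ~ ., ncomp = 9, data = mydata,
            method = "oscorespls", validation = "LOO")
summary(fit)      # prints the RMSEP and % variance tables below
plot(RMSEP(fit))  # RMSEP against the number of components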

These are the training results and RMSEP for one of the analyses:

VALIDATION: RMSEP
Cross-validated using 22 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
CV           1.024   0.9789    1.012    1.033   0.9624   0.8659   0.8088   0.9736   0.9478   0.9790
adjCV        1.024   0.9767    1.007    1.025   0.9491   0.8554   0.8006   0.9600   0.9330   0.9631

TRAINING: % variance explained
     1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
X      82.17    92.58    96.20    97.21    99.49    99.99   100.00   100.00   100.00
SEC    22.28    38.36    55.75    74.67    77.40    77.65    80.72    87.86    88.72

Best Answer

Welcome to Cross Validated!

Approach 1

Have a look at sections 7.10 and 7.11 of The Elements of Statistical Learning.

The basic idea is to calculate the uncertainty of the cross-validation results for the different numbers of latent variables. That gives you an idea of which differences you cannot trust to be real. A common operationalization is the one-standard-error rule: choose the most parsimonious model whose error is within one standard error of the minimum.
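A rough sketch of that rule in R, using the LOO predictions that plsr stores (fit is the model object from the question; keep in mind that standard errors computed from LOO residuals are themselves only approximate):

# squared LOO prediction errors per observation and number of components
y    <- model.response(model.frame(fit))     # observed response
pred <- fit$validation$pred                  # LOO predictions: n x 1 x ncomp array
se2  <- (pred[, 1, ] - y)^2                  # n x ncomp matrix of squared errors
mse  <- colMeans(se2)
se   <- apply(se2, 2, sd) / sqrt(nrow(se2))  # standard error of each MSE
# smallest number of components within 1 SE of the minimum:
min(which(mse <= min(mse) + se[which.min(mse)]))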

Do not forget that choosing the number of latent variables from validation results is a data-driven model optimization, so you need an outer validation loop to measure the predictive performance of the model you obtain that way (see the nested cross-validation sketch below).
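A minimal sketch of such an outer loop (nested cross-validation; the fold count and object names are placeholders):

outer_folds <- sample(rep(1:5, length.out = nrow(mydata)))
outer_rmsep <- sapply(1:5, function(k) {
  train <- mydata[outer_folds != k, ]
  test  <- mydata[outer_folds == k, ]
  f <- plsr(SEC ~ ., ncomp = 9, data = train,
            method = "oscorespls", validation = "LOO")
  # inner loop: choose ncomp from the training-set CV curve only
  nc <- which.min(RMSEP(f, estimate = "CV", intercept = FALSE)$val)
  p  <- predict(f, newdata = test, ncomp = nc)
  sqrt(mean((p - test$SEC)^2))
})
mean(outer_rmsep)  # performance estimate that accounts for the selection step

Note that the ncomp chosen in the inner loop may differ between outer folds; that variability is itself informative about how stable your model selection is.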

I'd also suggest switching from LOO cross-validation to iterated/repeated $k$-fold cross-validation or some variant of out-of-bootstrap validation (see the book and the answers on that topic here on Cross Validated).
You can also directly bootstrap the curve of RMSE as a function of the number of latent variables.
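For instance, repeated random 5-fold cross-validation can be pieced together from the pls package's crossval function (a sketch; the repetition loop and the plotting are my own choices, not built in):

# repeat random 5-fold CV 20 times and overlay the RMSEP curves
rmsep_reps <- replicate(20, {
  cv <- crossval(fit, segments = 5, segment.type = "random")
  drop(RMSEP(cv, estimate = "CV", intercept = FALSE)$val)
})
matplot(rmsep_reps, type = "l", lty = 1, col = "grey",
        xlab = "# latent variables", ylab = "RMSEP")
lines(rowMeans(rmsep_reps), lwd = 2)  # average curve across repetitions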

Approach 2

Here's a second approach that works very well for certain types of data. I work with spectroscopic data: good spectra show high correlation between neighbouring measurement channels, so they look smooth in a parallel-coordinate plot. For such data I look at the X loadings. As with PCA loadings, the higher PLS X loadings are usually noisier than the first ones, so I decide on the number of latent variables by looking at how noisy the loadings are. For the data I deal with, this usually leads to far fewer latent variables than the RMSECV (at least without uncertainty calculations) suggests.
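With the pls package, inspecting the loadings is a one-liner (again using the fit object from the question):

# overlay the X loadings of the first few latent variables;
# loadings that look like noise suggest the component mainly fits noise
plot(fit, plottype = "loadings", comps = 1:5, legendpos = "topright")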

Rule of Thumb

A rule of thumb I learned as a student, when I first developed PLS models for industry: decide on a number of PLS latent variables the way you learned in lectures (e.g. from the RMSE minimum, without uncertainty calculations), then use 2 or 3 latent variables fewer than that suggests.
My experience is that this rule of thumb worked not only for the UV/Vis data I had there, but also for other spectroscopic techniques.

Also, I find it very helpful to sit down and think about the application: what influencing factors do you expect, and to how many components would they correspond? Again, this is not applicable to all kinds of problems and applications, but where you can take this approach, it gives a reasonable starting point.


edit: references for approach 2

I know of papers where we did it that way (for PCA rather than PLS), but IIRC we never showed the chosen loadings alongside some noisy loadings we did not choose, and we did not discuss the criterion in detail.
