Solved – Choosing the number of PLSR components

model selection, partial least squares, regression

I am trying to choose how many components to retain in my PLSR. My total variance explained for the response variable is only about 30%, and the first 2 components explain 99% of this. Intuitively I know I should just run the model with the first 2 components, but is there a more rigorous method I can apply? Does the Kaiser-Guttman or scree test apply here since the PLSR computes singular values rather than eigenvalues?

Best Answer

Use cross-validation.

A quick example: for a given number of components (in PLS they are usually called latent variables, unlike the principal components of PCA), remove the first sample from your data and build a PLS model with that many components on the remaining samples. Use that model to predict the removed sample. Then put the sample back, remove the next one, predict it, and so on. This type of cross-validation is called leave-one-out cross-validation.

At the end you have a predicted value and a real (measured) value for every sample. From these, calculate an overall error, based either on the difference between predicted and real values or on the square of that difference (for example, the root mean square error).
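One common way to condense these differences into a single number per model is the root mean square error of cross-validation. The symbols below are conventional notation rather than part of the original answer: n is the number of samples, k the number of components, y_i the real value of sample i, and \hat{y}_{i}^{(k)} the prediction for sample i from a k-component model fitted without sample i.

$$
\mathrm{PRESS}(k) = \sum_{i=1}^{n} \left( y_i - \hat{y}_{i}^{(k)} \right)^2,
\qquad
\mathrm{RMSECV}(k) = \sqrt{\frac{\mathrm{PRESS}(k)}{n}}
$$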

Repeat this for each number of components. You can then draw a graph with the number of components on the x-axis and the corresponding error on the y-axis.
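The leave-one-out procedure described above can be sketched in plain Python. This is a minimal illustration, assuming a single response variable and a textbook NIPALS-style PLS1 fit; the function names (pls1_fit, pls1_predict_path, loocv_rmse) and the synthetic data are hypothetical and not from the original answer. Because PLS components are nested, one fit per left-out sample is enough to get predictions for 1, 2, ..., max_components components.

```python
import math
import random


def pls1_fit(X, y, n_components):
    """Fit single-response PLS (NIPALS style) on plain lists of floats."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered X residual
    f = [v - y_mean for v in y]                                # centered y residual
    comps = []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with the y residual.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        if tt < 1e-12:
            break
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # X loadings
        q = sum(f[i] * t[i] for i in range(n)) / tt                         # y loading
        # Deflate both residuals before extracting the next component.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps


def pls1_predict_path(model, x, n_components):
    """Return predictions for one sample using 1, 2, ..., n_components components."""
    x_mean, y_mean, comps = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat, path = y_mean, []
    for w, p, q in comps:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
        path.append(y_hat)
    # If the fit stopped early, extra components change nothing.
    path += [y_hat] * (n_components - len(path))
    return path


def loocv_rmse(X, y, max_components):
    """Leave-one-out RMSE for each number of components from 1 to max_components."""
    n = len(X)
    sq_err = [0.0] * max_components
    for i in range(n):
        model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], max_components)
        for k, pred in enumerate(pls1_predict_path(model, X[i], max_components)):
            sq_err[k] += (y[i] - pred) ** 2
    return [math.sqrt(s / n) for s in sq_err]


# Synthetic data (an assumption for illustration): two hidden factors drive X and y.
random.seed(0)
loadings = [(1, 0), (1, -1), (0, 1), (1, 1), (2, 1), (-1, 2)]
X, y = [], []
for _ in range(30):
    a, b = random.gauss(0, 1), random.gauss(0, 1)
    X.append([la * a + lb * b + 0.1 * random.gauss(0, 1) for la, lb in loadings])
    y.append(2 * a - b + 0.3 * random.gauss(0, 1))

rmse = loocv_rmse(X, y, max_components=5)
for k, err in enumerate(rmse, start=1):
    print(f"{k} components: leave-one-out RMSE = {err:.3f}")
```

The printed values are the points of the error-versus-components graph. In practice a library implementation would normally replace the hand-written fit, for instance scikit-learn's PLSRegression combined with LeaveOneOut and cross_val_predict; the pure-Python version is used here only to keep the sketch dependency-free.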

[Figure: number of latent variables (LV) vs. RMS error]

As a rule of thumb, you should NOT simply select the number of components (latent variables) that yields the lowest error, because that model may overfit. Instead, choose the point beyond which the drop in error is no longer significant. Another common case is that the error starts to increase at some point; in that case, choose the number of components right before the increase.
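This rule of thumb can be written as a small helper. It is only a sketch: the function name choose_n_components and the 5% relative-drop threshold are assumptions made for illustration, since the answer does not say how large a drop counts as significant.

```python
def choose_n_components(errors, min_rel_drop=0.05):
    """Pick a number of components from cross-validated errors.

    errors[k - 1] is the error with k components. The search stops at the
    first k where adding one more component either increases the error or
    reduces it by less than min_rel_drop (a hypothetical 5% default).
    """
    for k in range(1, len(errors)):
        prev, nxt = errors[k - 1], errors[k]
        # The first test also covers prev == 0, so the division is safe.
        if nxt >= prev or (prev - nxt) / prev < min_rel_drop:
            return k
    return len(errors)  # the error kept dropping meaningfully throughout


# Example with made-up error values: big drop, then a plateau, then a rise.
print(choose_n_components([1.0, 0.6, 0.58, 0.57, 0.62]))
```

A fixed threshold is a blunt stand-in for looking at the graph, so it is worth checking the chosen value against the plot rather than trusting it blindly.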

Reference for graph: Brereton, Richard G. Applied chemometrics for scientists. John Wiley & Sons, 2007.
