Solved – Interpret results of PLS regression coefficients

partial least squares, python, regression coefficients, scikit-learn

I have performed PLS regression with the sklearn library (Python 2.7) on three types of soil (one PLS model per soil type) and plotted the regression coefficients. In the rightmost plot of the picture, however, the bars look a bit bizarre: one band is positive and the next is negative. Does this mean something, or is it a valid result?
(The black curve is an average spectrum; you can ignore it.)

[Figure: regression coefficient plots for the three soil-type PLS models]

Best Answer

(This answer is preliminary for now, pending the OP's explanation of what exactly the data is.)

Well-acquired spectra with good spectroscopic resolution are typically expected to be smooth, continuous functions of the wavelength.
Thus, when doing a regression for a given analyte, one usually* expects one wavelength channel to be almost as good a predictor as its neighboring channel, and in fact that their coefficients should be almost the same. In other words, we expect high correlation between neighboring wavelength channels/variates (that is the smooth spectrum we have) and also high correlation between neighboring coefficients: the coefficient pattern is expected to be smooth and continuous as well.

Coefficient patterns that are visibly noisy, i.e. fluctuate (randomly) between positive and negative from one channel/variate/wavelength to the next, mean that the model is trying to pick up tiny differences between neighboring wavelength channels. Typically, this is a sign of overfitting. An accompanying symptom is usually that the absolute values of the coefficients become very large.
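
As a rough first check for both symptoms, here is a minimal sketch using sklearn's PLSRegression; the random X and y are only stand-ins for your spectra and reference values, and n_components = 10 is an assumed setting:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression

# Stand-in data: replace with your spectra (n_samples, n_channels)
# and reference values (n_samples,).
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 200))
y = rng.normal(size=50)

pls = PLSRegression(n_components=10).fit(X, y)  # n_components: assumption
coefs = pls.coef_.ravel()  # one coefficient per wavelength channel

# Symptom 1: frequent sign flips between neighboring channels
print("sign changes between neighbors:", np.sum(np.diff(np.sign(coefs)) != 0))
# Symptom 2: very large absolute coefficient values
print("max |coefficient|:", np.abs(coefs).max())

plt.plot(coefs)
plt.xlabel("wavelength channel")
plt.ylabel("PLS regression coefficient")
plt.show()
```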

From left to right I'd get more and more suspicious of overfitting, though for the vibrational spectra I work with, I'd say that even the leftmost model would need some closer inspection with respect to overfitting.

Things to do:

  • Check the stability of your PLS models. For PLS, this can easily be done directly, since the coefficients $\mathbf B$ in $\mathbf Y_c = \mathbf X_c \mathbf B$ (not the loadings!) of, e.g., the surrogate models trained during cross validation should be equal, or at least very similar; see the first sketch after this list.

  • Compare the quality of your spectra (noise level), the number of available spectra, and the "ease" of the regression problem (is the analyte/property in question plainly visible in the spectra?) with the chosen number of latent variables.
    Reducing the number of latent variables corresponds to stronger regularization and thus to less overfitting (the classical bias-variance tradeoff); see the LV scan sketched below.

  • PLS provides regularization and as such already helps a lot in stabilizing the regression. However, it cannot work miracles. It may help to trade some spectral resolution for signal-to-noise ratio in your spectra. This can be done in a way that also lowers the number of wavelength channels/variates, which in turn additionally helps the PLS estimation due to the lower dimensionality; see the binning sketch below.
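
For the stability check in the first point, the surrogate models can be refit on each cross-validation training set and their coefficient vectors compared. A minimal sketch, again with stand-in data and an assumed number of components:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

# Stand-in data: replace with your spectra and reference values.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 200))
y = rng.normal(size=50)

fold_coefs = []
for train_idx, _ in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    surrogate = PLSRegression(n_components=10).fit(X[train_idx], y[train_idx])
    fold_coefs.append(surrogate.coef_.ravel())
fold_coefs = np.array(fold_coefs)  # shape: (n_folds, n_channels)

# For a stable model the surrogate coefficients should nearly coincide;
# a spread comparable to the coefficients themselves is a warning sign.
spread = fold_coefs.std(axis=0)
scale = np.abs(fold_coefs.mean(axis=0))
print("median relative spread:", np.median(spread / (scale + 1e-12)))
```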

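For the second point, the number of latent variables can be chosen by scanning it against cross-validated prediction error, for instance like this (stand-in data again; the LV range and cv=10 are assumptions):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

# Stand-in data: replace with your spectra and reference values.
rng = np.random.RandomState(1)
X = rng.normal(size=(50, 200))
y = rng.normal(size=50)

# Fewer latent variables = stronger regularization.  Pick the smallest
# number of LVs whose cross-validated RMSE is (close to) the minimum.
for n_lv in range(1, 16):
    y_cv = cross_val_predict(PLSRegression(n_components=n_lv), X, y, cv=10)
    rmse = np.sqrt(np.mean((y - y_cv.ravel()) ** 2))
    print("LV = %2d   RMSECV = %.4f" % (n_lv, rmse))
```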

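And for the last point, one simple way to trade spectral resolution for signal-to-noise ratio is to average blocks of neighboring channels; bin_spectra below is a hypothetical helper sketching this:

```python
import numpy as np

def bin_spectra(X, width=4):
    """Average blocks of `width` neighboring wavelength channels.

    Trades spectral resolution for signal-to-noise ratio and reduces the
    number of variates the PLS model has to estimate.  `width` is a
    hypothetical setting; choose it so the relevant bands stay resolved.
    """
    X = np.asarray(X)
    n_keep = X.shape[1] - X.shape[1] % width  # drop an incomplete last bin
    return X[:, :n_keep].reshape(X.shape[0], -1, width).mean(axis=2)

# Example: 200 channels binned by 4 -> 50 channels, roughly 2x better SNR
# for uncorrelated noise.
X = np.random.RandomState(0).normal(size=(50, 200))
print(bin_spectra(X, width=4).shape)  # (50, 50)
```
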
* There are some exceptions, notably if you are after an analyte with a very narrow signal that is covered by only a single wavelength channel. But this would correspond to a coefficient pattern where a single channel sticks out, not the noisy pattern you have in the rightmost plot.

Or, of course, if the spectra are acquired at low resolution, so that neighboring wavelength channels are not (highly) correlated.