Solved – Is it appropriate to perform feature selection before running Partial Least Squares (PLS) regression?

feature-selection, partial-least-squares, r, regression

I'm doing a series of PLS analyses to test the contribution of a set of environmental variables to the invertebrate community of a river.

  • I am introducing 23 environmental variables into the model (2 of them are dummy variables)
  • The invertebrate community enters the PLS analysis as a biological metric (e.g. a diversity index)

My question is whether it is necessary to do an a priori selection of the variables before running the PLS analysis. I have read that PLS is unaffected by collinearity in the data, but couldn't multicollinear variables or dummy variables still affect the robustness of the analysis?

As you can see, I am not an expert in statistics! So, should I do this variable selection by, for example, performing an RDA constraining the biological community (species) to the environmental variables?

Best Answer

That depends, not only for PLS but also for almost any chemometric/machine learning method.

There is always a risk that a model gives a high "weight" to a variable that actually has no correlation with the response. In other words, what you are measuring may have nothing to do with what you are observing, yet the two may behave together by chance.

Thus, more samples mean less risk. Looking at the correlation of each variable with the response may help in some cases, yet I often find it useless. Moreover, which combination of variables yields the best model is a hard question to answer exactly, since the number of models (one for each combination of variables) grows very fast as the number of features increases; with your 23 variables there are already 2^23 - 1 ≈ 8.4 million non-empty subsets.

To get the intuition, you may try introducing a few completely random variables into your data set. Interestingly, PLS will probably assign some weight to them as well.
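A minimal sketch of that check in R, using the pls package; the objects env (a numeric matrix or data frame of your environmental variables, dummies coded 0/1) and diversity (the biological metric) are placeholders for your own data:

    ## Append a few purely random columns and fit a PLS regression.
    library(pls)

    set.seed(42)
    n     <- nrow(env)
    noise <- matrix(rnorm(n * 3), ncol = 3,
                    dimnames = list(NULL, paste0("random", 1:3)))

    dat <- data.frame(diversity = diversity,
                      X = I(cbind(as.matrix(env), noise)))

    fit <- plsr(diversity ~ X, ncomp = 5, data = dat, validation = "CV")

    ## The random columns (last 3 rows) typically receive non-zero loading
    ## weights and regression coefficients, even though they carry no information.
    round(tail(loading.weights(fit)[, 1:3], 3), 3)
    round(tail(drop(coef(fit, ncomp = 3)), 3), 3)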

So yes, prior to PLS, you may try feature selection (a simple filter-style sketch is shown below). You may also use, for instance, a genetic algorithm that employs PLS itself during the feature selection. It all comes down to your needs, your data, and your understanding of the data.
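For illustration only, here is a very simple filter-style pre-selection in R (not the genetic-algorithm wrapper itself): keep the variables whose univariate correlation with the response exceeds some cut-off, then fit PLS on the reduced set. The cut-off of 0.2 is arbitrary, and env/diversity are again placeholders:

    ## Filter-style pre-selection: univariate correlation with the response.
    library(pls)

    X    <- as.matrix(env)                    # assumed numeric, dummies coded 0/1
    r    <- abs(cor(X, diversity))            # |correlation| of each variable with the response
    keep <- colnames(X)[r > 0.2]              # arbitrary, purely illustrative cut-off

    dat_sel <- data.frame(diversity = diversity,
                          X = I(X[, keep, drop = FALSE]))
    fit_sel <- plsr(diversity ~ X, ncomp = 5, data = dat_sel, validation = "CV")

    summary(fit_sel)                          # compare the CV error with the full model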

Edit: In light of the discussion in the comments I would like to give further information. In my opinion, feature selection is a tricky business in general.

Removing some variables based on prior knowledge can be useful. If, for example, one variable mostly contains measurement noise, removing it might improve your model.

Use of feature selection algorithms before PLS can be useful too. The problem is that cross-validation (CV) errors and autoprediction errors can be deceiving. I have seen many times that, as a feature selection algorithm selects fewer and fewer variables, both the CV errors and the autoprediction errors decrease. In reality, the models were overfitting most of the time.

Comparing autoprediction, cross-validation and independent validation set predictions among the models is the way to go for me. If the algorithm, such as the genetic algorithm mentioned in the comments, does not allow the calculation of comparable CV errors, then experimenting on that type of data is necessary before blindly relying on it.
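A minimal sketch of that comparison with the pls package, using a single random split into a training part and an independent validation part (the split proportion, the number of components, and the object names env/diversity are only illustrative):

    ## Compare autoprediction, cross-validation and independent-test errors.
    library(pls)

    set.seed(1)
    idx   <- sample(nrow(env), size = round(0.75 * nrow(env)))   # hold out ~25 % of the rows
    train <- data.frame(diversity = diversity[idx],
                        X = I(as.matrix(env)[idx, , drop = FALSE]))
    test  <- data.frame(diversity = diversity[-idx],
                        X = I(as.matrix(env)[-idx, , drop = FALSE]))

    fit <- plsr(diversity ~ X, ncomp = 5, data = train, validation = "CV")

    RMSEP(fit, estimate = c("train", "CV"))                # autoprediction vs cross-validation error

    pred <- drop(predict(fit, newdata = test, ncomp = 3))  # choose ncomp from the CV curve
    sqrt(mean((test$diversity - pred)^2))                  # independent validation RMSEP

If the three numbers diverge strongly (low autoprediction and CV errors but a high independent validation error), the model or the feature selection step is probably overfitting.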