Solved – Variable selection using cross-validated PLS model when permutation test shows lack of significance

cross-validation, discriminant analysis, feature selection, partial least squares, permutation-test

I understand that a permutation test on a PLS model can help detect overfitting. Usually, if the p-value is greater than a criterion, say 0.05, it means that the model is overfitting, and I would not use the PLS model for prediction.

If I am interested in variable selection with PLS (using Variable Importance in Projection (VIP) scores), can I still use the PLS model for that purpose even though the permutation test indicates overfitting?

My thought is that even though my PLS model is overfitting, I can at least select variables that are relevant to the dependent variable.
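For reference, VIP is not built into scikit-learn, but it can be computed from a fitted model's weights, scores, and $y$-loadings. A minimal sketch following the standard VIP formula (the helper name `vip_scores` is my own):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls: PLSRegression) -> np.ndarray:
    """Variable Importance in Projection for each predictor of a fitted model."""
    t = pls.x_scores_    # (n_samples, k) latent scores
    w = pls.x_weights_   # (n_features, k) X-weights
    q = pls.y_loadings_  # (n_targets, k) y-loadings
    p, k = w.shape
    ssy = np.sum(t ** 2, axis=0) * np.sum(q ** 2, axis=0)  # SS of y explained per component
    wnorm = (w ** 2) / np.sum(w ** 2, axis=0)              # normalized squared weights
    return np.sqrt(p * (wnorm @ ssy) / ssy.sum())          # VIP_j; > 1 is the usual cutoff
```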

I have one binary dependent variable (control vs. treatment) and 1000 independent variables, with a sample size of 25 per group. I use cross-validation to choose the number of PLS-DA components, but the subsequent permutation test shows a lack of significant predictive power. The permutation test I am referring to is described in SzymaƄska et al. (2012), "Double-check: validation of diagnostic statistics for PLS-DA models in metabolomics studies." (In a permutation test, the labels of the samples are randomly permuted and a new classification model is calculated.)
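To make this setup concrete, here is a minimal sketch of the component-selection step in Python, treating PLS-DA as scikit-learn's `PLSRegression` on a 0/1-coded label. The data are random placeholders with the dimensions described above, and the grid of component numbers is illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# placeholder data with the question's dimensions: 2 x 25 samples, 1000 variables
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1000))
y = np.repeat([0, 1], 25)  # control vs. treatment

# choose the number of PLS components by cross-validation
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(PLSRegression(),
                      param_grid={"n_components": range(1, 11)},
                      scoring="r2", cv=inner_cv)
search.fit(X, y)
print(search.best_params_)
```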

Best Answer

For the benefit of other readers I will briefly explain what the permutation test is in this context.

In this specific example there is a binary dependent variable $y$, a large number of independent variables $X$ with $n\ll p$, and a specific algorithm (PLS-DA) to predict $y$ from $X$ with one hyperparameter (the number $k$ of PLS components). To find the optimal value of $k$, one uses cross-validation. To estimate the performance of this whole model-building procedure, one uses nested cross-validation.

Imagine that nested cross-validation is done and the performance is, say, a $43\%$ error rate or $R_\mathrm{CV}^2=0.2$. Question: are these numbers significantly different from chance? Is $43\%$ significantly less than $50\%$? Is $0.2$ significantly larger than $0.0$? Answer: shuffle the labels and repeat the entire procedure; do this many times to obtain the null distribution of performance; then compare your actual performance with this null distribution to get the $p$-value.
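In scikit-learn terms, this whole procedure maps onto `permutation_test_score`. A minimal sketch, reusing the illustrative `X`, `y`, and component grid from the question; passing a `GridSearchCV` object as the estimator means the component selection is repeated inside every permutation, i.e. the entire nested procedure is re-run:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     permutation_test_score)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
model = GridSearchCV(PLSRegression(), {"n_components": range(1, 11)},
                     scoring="r2", cv=inner_cv)

# re-fits `model` on label-shuffled data to build the null distribution;
# the p-value is the fraction of shuffles scoring at least as well as the real fit
score, null_scores, pvalue = permutation_test_score(
    model, X, y, scoring="r2", cv=outer_cv,
    n_permutations=200, random_state=0)
print(f"R2_CV = {score:.2f}, p = {pvalue:.3f}")
```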

What does an insignificant $p$-value mean here?

Saying that it means "the model is overfitting" is a bit of a mischaracterization. The model includes regularization, and cross-validation is performed to choose the optimal value of the regularization parameter (here, $k$). So the model is about as non-overfitting as it could possibly be. What, then, is going on?

  1. It might be that the regularization is too weak and the model still overfits. One can try another model altogether that allows stronger regularization (see the suggestion below).
  2. It might be that there is little or no overfitting, but $y$ is not related to $X$ (at least not linearly), so the model cannot predict anything. You got $43\%$ rather than $50\%$ due to finite sampling and random fluctuations.
  3. It might be that the model works fine but has little predictive power, and you do not have enough data to be sure that this predictive power is non-zero.

As you cannot be sure that #3 is the case (that is precisely the outcome of the permutation test), you have to consider that #2 is a real possibility. In other words: the model is absolutely useless, or at least you have failed to show otherwise.

Using the absolutely useless model for variable selection

There is clearly no sense in doing that. You might as well select variables at random, or cherry-pick the variables that are most correlated with $y$, as sketched below.
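For comparison, that trivial correlation baseline takes only a couple of lines (a sketch, reusing the illustrative `X` and `y` from the question):

```python
import numpy as np

# rank variables by absolute Pearson correlation with the 0/1 label
corr = np.abs(np.corrcoef(X.T, y)[-1, :-1])  # last row = correlations with y
top = np.argsort(corr)[::-1][:20]            # indices of the 20 most correlated variables
```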


I should add that PLS, PLS-DA, and related approaches are popular in some applied fields but practically unknown in mainstream statistics and machine learning. The most standard approach to binary classification with variable selection would be logistic regression with elastic net (ridge + lasso) regularization. It might be worth considering.
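A minimal sketch of that alternative in scikit-learn (the regularization grid is illustrative; `saga` is the solver there that supports the elastic net penalty):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# l1_ratio mixes ridge (0.0) and lasso (1.0); C is the inverse penalty strength
enet = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga", max_iter=10_000))
grid = {"logisticregression__C": [0.01, 0.1, 1, 10],
        "logisticregression__l1_ratio": [0.1, 0.5, 0.9]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(enet, grid, scoring="roc_auc", cv=cv).fit(X, y)

# the lasso part zeroes out coefficients, so the surviving ones are the
# "selected" variables
coefs = search.best_estimator_[-1].coef_.ravel()
selected = coefs.nonzero()[0]
```

With only 50 samples and 1000 variables, the selected set will typically be unstable across resamples, which is worth checking before interpreting it.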

Disclaimer: I have no practical experience with PLS or PLS-DA models.
