Solved – the best strategy to train and validate classification using PLS-[classifier] in the caret package

Tags: r, caret, classification, cross-validation, train

I have 4 clusters (see plot below) extracted from data on N=218 medical samples measured on P=11 genes/predictors, obtained as follows: first, a PCA was validated to have 2 important PCs that explained 75% of the variance; then different clustering algorithms, distances, and linkages (the latter in the hierarchical approach only) were compared, and the majority supported the presence of 4 distinct clusters. Taking the scientific hypothesis into consideration, the clusters from the $K$-means algorithm were found the most plausible and also the most balanced: class #1 n=12, class #2 n=21, class #3 n=79, and class #4 n=106.

Projecting the observations onto the plane of the first two component scores reveals the scatter plot below, with each cluster color-coded.
[Scatter plot of the observations on the PC1-PC2 score plane, color-coded by cluster]
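For concreteness, a minimal sketch of the PCA-plus-k-means step along the lines described above. `gene_data` is a hypothetical stand-in for the 218 x 11 data set, the sketch clusters the 2 PC scores, and the exact choices (scaling, nstart, whether to cluster scores or raw variables) may differ from the actual analysis:

```r
set.seed(1)

## PCA on the standardized predictors; check that ~75% of the variance sits in PC1-PC2
pca <- prcomp(gene_data, center = TRUE, scale. = TRUE)
summary(pca)
pc_scores <- pca$x[, 1:2]

## k-means on the first two PC scores, 4 clusters as described above
km <- kmeans(pc_scores, centers = 4, nstart = 25)
cluster_class <- factor(paste0("C", km$cluster))   # 4-level class label used later on
table(cluster_class)

## color-coded score plot, as in the figure
plot(pc_scores, col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "K-means clusters on the PC1-PC2 scores")
```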

The aim is to find a globally optimal classifier in R after applying PLS to the data.

Knowing that these 4 clusters are actually the product of latent PCA components, it was natural to think of PCR as a next step to predict the classes, but that approach turned out to be sub-optimal for two reasons: first, the results are not probabilities (0-1); second, PCR does not relate the components well to the classes as the outcome variable. As many know, this is better handled by PLS-DA plus a softmax to obtain class probabilities (0-1).

However, many reports confirm the superiority of using LDA as a second step on the PLS scores, provided that the standardization parameters (mean, sd) of the training set are also applied to the holdout test set, and that the PLS projection estimated on the training set is used to compute the scores of the test set; these projected scores then form the actual holdout test set on which the classifier is validated.

From a methodological point of view, this path is fraught with dangers and subtle errors when one is uninformed or misinformed about the tools being used.

The caret package is unique of its kind: it provides a consistent infrastructure to train and validate an array of different models by making use of the de facto standard R packages for each, and thus promotes itself as a road map to validated modeling that leverages R's rich libraries. As the heart is to the blood vessels, so is caret to the other packages, in my opinion. That being said, careless playing with the heart could cost you dearly, and might lead to a stand-still, or a model-arrest, of your data. R is free and there are many free books out there, but buying the caret book, Applied Predictive Modeling, paid off. The help files and the companion website (very appealing, by the way) are great resources, but they are no substitute for the text of the book, IMHO. Still, I could not find in the book a direct answer to the two-step PLS-[classifier] method, among other questions. The potential of caret is immense, thanks to Max Kuhn and his colleagues, and that is primarily what encouraged me to post this question.

Back to the example above and the methodology I wish to follow:
Data splitting:

Training set (77%, n=168) for 10-fold cross-validation: tuning (model-specific parameters, feature selection over the P=11 predictors, and costs to deal with the imbalanced clusters). In each CV iteration this gives roughly n=150 observations for fitting the model with a given candidate set of parameters and n=17 for evaluating those parameters (I will call the n=17 part the CV-test fold to avoid confusion later on). With 5 repeats, this makes 10 folds x 5 repeats = 50 training folds (n≈150 each) and 50 CV-test folds (n≈17 each). Holdout test set (23%, n=50).
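To fix notation for the questions below, a minimal caret sketch of this split and resampling setup, reusing the hypothetical gene_data / cluster_class objects from the sketch above (class levels should be valid R names, e.g. "C1".."C4", if class probabilities are wanted):

```r
library(caret)

set.seed(1)

## stratified 77% / 23% split into training and holdout-test sets
in_train  <- createDataPartition(cluster_class, p = 0.77, list = FALSE)
train_x   <- gene_data[in_train, ]
train_y   <- cluster_class[in_train]
holdout_x <- gene_data[-in_train, ]
holdout_y <- cluster_class[-in_train]

## 10-fold CV repeated 5 times = 50 resamples (the 50 CV-training / CV-test folds above)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     classProbs = TRUE)
```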

Q1 I know that one can do parameter tuning and feature selection at the same time (i.e., in parallel), but how does one evaluate the costs/weights if one would like to use cost-sensitive models (SVM, CART, C5.0) on the PLS scores to counteract the class imbalance?

Q2 What is the alternative when reserving a separate data set for cost evaluation (i.e., an evaluation set), as recommended, is not possible given the small sample size? Can one tune model parameters, feature selection, and imbalance costs all three at the same time? If not, what is the best practice in this case?

Q3 Given the small sample size, is bootstrapping preferred to CV? If so, how would it be implemented to do exhaustive tuning, as above, on the PLS scores?

Q4 Given the imbalance above, is there a way to ensure that each CV training fold includes a minimum number of the rare class(es) in order to get good estimates on the CV-test fold? Is there an argument one can pass to guarantee the presence of the small classes each time the folds are generated?

PLS special notes

This is the approach I have in mind (please correct me if I am missing something along the way):

In each CV iteration, each CV-training fold should yield its own PLS projection matrix, which is then used in the second step to obtain the PLS scores of the corresponding CV-test fold, inheriting the same standardization parameters (mean, sd). That is, two things are inherited from the CV-training fold and applied to the CV-test fold: the PLS projection and the standardization parameters (mean and sd). In the example here, this returns 50 performance values, from which one hopes to identify the best value of the parameter in question. One complication, though: there should be an argument to specify the desired number of PLS components to retain and to use when computing the scores of each CV-test fold (ideally pre-defined in a previous tuning step). My expectation is that, after deciding on the best model, there should be a way to obtain the PLS projection matrix for the whole training set (i.e., n=168), along with its (mean, sd), to apply them to the holdout-test set and validate the best model. So in total there would be 50 different PLS projection matrices, means, and sds from the CV step, plus 1 extra for the whole training set. Am I right?
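To make the two inherited pieces explicit, here is a minimal manual sketch of one such CV iteration using the pls and MASS packages directly (simulated stand-in data, ncomp fixed at 2 purely for illustration); this is what I hope caret can do for me automatically over all 50 resamples:

```r
library(pls)    # plsr() and predict(..., type = "scores")
library(MASS)   # lda()

set.seed(1)
## simulated stand-ins (hypothetical) for the n=168 training set with 11 predictors
X <- matrix(rnorm(168 * 11), nrow = 168,
            dimnames = list(NULL, paste0("gene", 1:11)))
y <- factor(sample(paste0("C", 1:4), 168, replace = TRUE))

## one resample: ~n=151 CV-training fold, n=17 CV-test fold
test_idx  <- sample(nrow(X), 17)
X_cvtrain <- X[-test_idx, ];  y_cvtrain <- y[-test_idx]
X_cvtest  <- X[ test_idx, ];  y_cvtest  <- y[ test_idx]

## (1) standardization parameters estimated on the CV-training fold only ...
ctr <- colMeans(X_cvtrain)
scl <- apply(X_cvtrain, 2, sd)
Z_cvtrain <- scale(X_cvtrain, center = ctr, scale = scl)
Z_cvtest  <- scale(X_cvtest,  center = ctr, scale = scl)   # ... and inherited here

## (2) PLS projection estimated on the CV-training fold (dummy-coded classes as Y)
ncomp   <- 2                                       # fixed here purely for illustration
pls_dat <- data.frame(Y = I(model.matrix(~ y_cvtrain - 1)), X = I(Z_cvtrain))
pls_fit <- plsr(Y ~ X, ncomp = ncomp, data = pls_dat)
S_cvtrain <- scores(pls_fit)[, 1:ncomp, drop = FALSE]
S_cvtest  <- predict(pls_fit, newdata = data.frame(X = I(Z_cvtest)), type = "scores")

## (3) second-step LDA on the scores; evaluate on the projected CV-test fold
lda_fit <- lda(as.matrix(S_cvtrain), grouping = y_cvtrain)
mean(predict(lda_fit, as.matrix(S_cvtest))$class == y_cvtest)   # accuracy of this fold

## repeat for all 50 resamples, then once more on the full n=168 training set;
## that final projection, mean and sd are the ones applied to the n=50 holdout-test set
```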
Feature selection in this method entails two things: selection in the predictor space and selection in the PLS component space.

Q5 How does one perform these two selections (predictors and PLS components) in caret? Feature selection here differs from the usual case because we deal with scores, rather than the original observations, when determining the best predictors with which to construct the PLS components.

Note: Once one is happy with the final model, it is recommended to refit it on the whole data set (n=218) to obtain the final estimates with the least uncertainty.

A similar procedure is implemented in the caret::train() function, which accepts a preProc argument to specify the desired pre-processing of the data (most methods are mentioned in the help system, but I could not find PLS among them; ideally there would also be an argument to specify the desired number of components, similar to PCA pre-processing). I am aware that passing the pre-processing parameters on to the holdout test set and the CV-test folds can only be done by calling predict() on the train object (i.e., predict.train()), as opposed to calling the generic predict() on the $finalModel, which will not inherit the pre-processing parameters.
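For illustration, a sketch of that distinction, reusing the hypothetical objects from the splitting sketch above; since PLS is not offered as a preProc method, PCA pre-processing stands in here, and, if I understand the interface correctly, the number of retained components can be fixed via preProcOptions in trainControl:

```r
## PCA pre-processing re-estimated inside every resample; 2 components kept
fit <- train(x = train_x, y = train_y,
             method    = "lda",
             preProc   = c("center", "scale", "pca"),
             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5,
                                      preProcOptions = list(pcaComp = 2)))

## correct: predict() on the train object (predict.train) re-applies the
## training-set centering, scaling and PCA rotation to the new data
pred_ok <- predict(fit, newdata = holdout_x)

## not correct: the finalModel was fit on the PC scores (PC1, PC2), so calling
## predict() on it with the raw 11 predictors errors or gives meaningless output
# pred_bad <- predict(fit$finalModel, newdata = holdout_x)
```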

Q6 How does one implement this strategy (if I have described it correctly) to train and validate the two-step PLS-[classifier] methodology, using the PLS score subspace instead of the raw observations, within the caret infrastructure?

Thanks in advance.

Best Answer

I don't have time right now to answer all your questions, but here's a start:

  • Yes, you can optimize whatever hyperparameters you need to optimize on the same optimization set. The important thing is to make sure that the test set used for the final measurement of prediction performance is kept independent of all training data; the optimization set is part of the training data.

  • Search terms to look up: nested (or double) cross validation, nested resampling, out-of-bootstrap validation.

  • Model selection, as in comparing a number of models and then keeping the "best", is a data-driven optimization. It belongs in the optimization stage, and in order to get the performance of the chosen model, you need independent test data!

  • As you already say, your sample size is quite small (though I'd be happy to have that many patients, the more so as I have far more variates...), so it is probably better to restrict yourself and not spend samples on optimization, the more so as you probably cannot do any meaningful model comparison with that sample size anyway. Optimization (as in "pick the best") usually boils down to a massive multiple-comparison situation.

  • Bootstrap vs. cross validation: the bootstrap is preferred in some disciplines, cross validation in others. For my type of data, iterated k-fold cross validation and out-of-bootstrap had very similar total error:
    Beleites, C. et al.: Variance reduction in estimating classification error using sparse datasets, Chemom Intell Lab Syst, 79, 91-100 (2005).
    See also Kim, J.-H.: Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap, Computational Statistics & Data Analysis, 53, 3735-3745 (2009). DOI: 10.1016/j.csda.2009.04.009, for similar findings.

  • Comparing different PLS surrogate models generated during resampling validation:

    • Centering: you don't have to mean-center your data; that is just a default. An alternative that would be more stable across the surrogate models: if among your classes you have, e.g., a control group, the mean of the controls would make a nice center that is also biologically more meaningful.
    • The same goes for scaling: variance scaling is again just a default procedure. If you can derive a meaningful scaling from external knowledge, that is fine, and it could possibly even be held constant across all surrogate models.
    • Even if you stay with mean centering and variance scaling: you can compare the mean and scaling vectors of the different surrogate models. If they are not reasonably stable, that is a strong indicator that your sample size is too small to derive a meaningful predictive model.
    • Once the center and scale are (approximately) constant, and if your dummy coding of the dependent variable stays the same, then (unlike the PCA loadings and the PLS X loadings) the PLS coefficients should be stable as well. LDA or PLS-LDA final coefficients have some degrees of freedom (flipping and rotation) that do not affect the prediction, so you can and should align the surrogate models accordingly to avoid seeing meaningless variation. See Beleites, C. et al.: Raman spectroscopic grading of astrocytoma tissues: using soft reference information, Anal Bioanal Chem, 400, 2801-2816 (2011). DOI: 10.1007/s00216-011-4985-4, for more explanations and an example.
  • Q4: such constraints are known as stratification (see the short sketch at the end of this answer).

  • As for caret and PLS pre-processing: maybe a glance at the code for PCA pre-processing allows you to define a PLS pre-processing step? Otherwise, you'll probably have to set up custom PLS-xxx models; a rough sketch of one such model follows.
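To illustrate that last point, a rough, untested sketch of such a custom PLS-LDA model, following caret's custom-model list interface as far as I recall it. All names here (plsLDA and its internals) are made up; the centering, scaling, and PLS projection are re-estimated inside every resample, and ncomp is the only tuning parameter:

```r
library(caret)

plsLDA <- list(
  label      = "PLS-LDA (sketch)",
  library    = c("pls", "MASS"),
  type       = "Classification",
  parameters = data.frame(parameter = "ncomp", class = "numeric", label = "#Components"),
  grid = function(x, y, len = NULL, search = "grid") {
    data.frame(ncomp = seq_len(min(if (is.null(len)) 3 else len, ncol(x))))
  },
  fit = function(x, y, wts, param, lev, last, classProbs, ...) {
    ## centering/scaling and PLS projection estimated on this resample's training data only
    x   <- as.matrix(x)
    ctr <- colMeans(x)
    scl <- apply(x, 2, sd)
    z   <- scale(x, center = ctr, scale = scl)
    dat <- data.frame(Y = I(model.matrix(~ y - 1)), X = I(z))
    p   <- pls::plsr(Y ~ X, ncomp = param$ncomp, data = dat)
    sc  <- pls::scores(p)[, 1:param$ncomp, drop = FALSE]
    ## second step: LDA on the PLS scores
    l   <- MASS::lda(as.matrix(sc), grouping = y)
    list(center = ctr, scale = scl, pls = p, lda = l, levels = lev)
  },
  predict = function(modelFit, newdata, submodels = NULL) {
    ## inherit center/scale and PLS projection from the training fold
    z  <- scale(as.matrix(newdata), center = modelFit$center, scale = modelFit$scale)
    sc <- predict(modelFit$pls, newdata = data.frame(X = I(z)), type = "scores")
    predict(modelFit$lda, as.matrix(sc))$class
  },
  prob = function(modelFit, newdata, submodels = NULL) {
    z  <- scale(as.matrix(newdata), center = modelFit$center, scale = modelFit$scale)
    sc <- predict(modelFit$pls, newdata = data.frame(X = I(z)), type = "scores")
    predict(modelFit$lda, as.matrix(sc))$posterior
  },
  sort   = function(x) x[order(x$ncomp), ],
  levels = function(x) x$levels
)

## usage sketch: tune ncomp over the 50 repeated-CV resamples of the training set
# fit <- train(x = train_x, y = train_y, method = plsLDA,
#              tuneGrid  = data.frame(ncomp = 1:4),
#              trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5))
```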
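And on the stratification point (Q4): as far as I remember, caret's createDataPartition(), createFolds(), and createMultiFolds() sample within the levels of a factor outcome, so something along these lines should keep the small classes present in every training fold (train_y is the hypothetical training outcome from the question's sketch):

```r
set.seed(3)
## stratified training folds for 10-fold CV repeated 5 times
folds <- createMultiFolds(train_y, k = 10, times = 5)
sapply(folds, function(i) table(train_y[i]))   # class counts per CV-training fold

## hand the folds to train() so every candidate model sees the same stratified resamples
ctrl_strat <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                           index = folds)
```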