I have 4 clusters (see plot below) extracted from medical-sample data (N=218) measured on 11 genes/predictors (P=11) by the following method: first, a PCA was run and validated to have 2 important PCs explaining 75% of the variance; then different clustering algorithms, distance measures, and linkages (the latter in the hierarchical approaches only) were compared, and the majority supported the presence of 4 distinct clusters. Taking the scientific hypothesis into consideration, the clusters produced by the $K$-means algorithm were found the most plausible and were also the most balanced: class #1 n=12, class #2 n=21, class #3 n=79, and class #4 n=106.
Projecting the observations onto the plane of the component-1/component-2 scores revealed the scatter plot below, with each cluster color-coded.
The aim is to find a globally optimal classifier in R, after applying PLS to the data.
Knowing that these 4 clusters were actually the product of latent PCA components, it was natural to think of PCR as the next step to predict the classes, but that approach turned out to be sub-optimal for two reasons: first, its results are not probabilities (0-1); second, it does not handle the classes well as the outcome variable. As many know, this is better solved with the PLS-DA method plus a softmax to obtain class probabilities (0-1).
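As a minimal sketch of the PLS-DA + softmax idea, caret ships a `plsda()` wrapper around the pls package; here `X` (the n x 11 predictor matrix) and `cluster` (the 4-level class factor) are assumed to already exist:

```r
## Hedged sketch: PLS-DA with softmax class probabilities via
## caret::plsda. `X` and `cluster` are assumed to exist already.
library(caret)

fit <- plsda(X, cluster, ncomp = 2, probMethod = "softmax")

pred  <- predict(fit, newdata = X)                 # predicted classes
probs <- predict(fit, newdata = X, type = "prob")  # probabilities in [0, 1]
```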
However, many reports confirm the superiority of using LDA as a second step on the PLS scores, provided that the standardization parameters (mean, sd) of the training set are reused on the holdout-test set, and that the PLS projection fitted on the training set is likewise applied to the test set to obtain its scores; these projected scores then form the actual holdout-test set on which to validate the classifier in question.
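To make that inheritance explicit, here is a hand-rolled sketch of the two-step PLS &rarr; LDA idea (my own reading of the workflow, not a canned function); `Xtrain`, `Xtest` (numeric matrices) and `ytrain` (factor) are assumed to exist:

```r
## Sketch: standardize the test set with the *training* mean/sd,
## project it with the training PLS loadings, then classify the
## scores with LDA.
library(pls)   # plsr()
library(MASS)  # lda()

mu  <- colMeans(Xtrain)
sds <- apply(Xtrain, 2, sd)
Ztr <- scale(Xtrain, center = mu, scale = sds)
Zte <- scale(Xtest,  center = mu, scale = sds)  # training parameters reused

k    <- 2                            # number of PLS components kept
Ytr  <- model.matrix(~ ytrain - 1)   # dummy-coded class matrix
pfit <- plsr(Ytr ~ Ztr, ncomp = k)
Str  <- unclass(pfit$scores)[, 1:k, drop = FALSE]
Ste  <- Zte %*% pfit$projection[, 1:k, drop = FALSE]  # test-set scores

lfit <- lda(Str, grouping = ytrain)
pred <- predict(lfit, newdata = Ste)$class
```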
From a methodology point of view, this path is fraught with dangers and subtle errors when one is uninformed or misinformed about the tools being used.
The caret package is unique of its kind given the consistent infrastructure it provides to train and validate an array of different models, making use of the de facto standard R package for each; caret thus promotes itself as a road map to validated modeling, leveraging R's rich libraries. As the heart is to the blood vessels, so is caret to the other packages, in my opinion. That being said, careless playing with the heart could cost you dearly and might also lead to a stand-still, a model-arrest of your data. R is free and many free books are out there, but buying the one caret book, Applied Predictive Modeling, paid off. The help files and the companion website (very appealing, by the way) are great resources, but they won't substitute for the text of the book, IMHO. However, I couldn't find in the book a direct answer to the two-step PLS-classifier method, among others. The potential of caret is immense, thanks to Max Kuhn and his colleagues, and that is primarily what encouraged me to post this question.
Back to the example above and the intended methodology:
Data splitting:
Training set (77%, n=168) for 10-fold cross-validation: tuning (model-specific parameters, feature selection over P=11, and costs to deal with the imbalanced clusters). Each CV iteration would use roughly n=150 for fitting the model with the candidate parameters and n=17 for evaluating them (I will call these n=17 the CV-test folds to avoid confusion later on). With 5 repetitions, this makes 10 folds x 5 repeats = 50 CV-training folds (n=150 each) and 50 CV-test folds (n=17 each).
Holdout-test set (23%, n=50).
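As a sketch of that split in caret (assuming a data frame `df` holding the 218 samples with a class factor `cluster`):

```r
## createDataPartition() samples within each class, so the 77/23
## split keeps the class proportions of the 4 clusters.
library(caret)
set.seed(1)

idx   <- createDataPartition(df$cluster, p = 0.77, list = FALSE)
train <- df[idx, ]    # ~168 samples, used for repeated 10-fold CV
test  <- df[-idx, ]   # ~50 samples, held out until the very end

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     classProbs = TRUE)   # 10 folds x 5 repeats = 50 fits
```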
Q1 I know that one can tune parameters and do feature selection at the same time (i.e., in parallel), but how does one evaluate the costs/weights when evaluating cost-sensitive models (SVM, CART, C5.0) on the PLS scores to counteract the class imbalance?
Q2 What is the alternative approach when reserving a separate data set for cost evaluation (i.e., an evaluation set), as recommended, is not possible given the small sample size in this case? Can one tune model parameters, feature selection, and imbalance costs all three at the same time? If not, what is the best practice in this case?
Q3 Given the small sample size, is bootstrapping preferred to CV? If yes, how would it be implemented to do exhaustive tuning, as above, on the PLS scores?
Q4 Given the imbalance above, is there a way to ensure that each CV-training fold includes a minimum number of the hard (rare) class(es), in order to get a good estimate on the CV-test fold? Is there an argument to pass that guarantees the presence of the small classes each time the folds are generated?
PLS special notes
This is the approach in my mind (please correct me if I am missing something along the way):
In each CV iteration, there should be a unique PLS projection matrix estimated on the CV-training fold, to be used in the second step of computing PLS scores for the respective CV-test fold, which inherits the same standardization parameters (mean, sd). So two things are inherited, the PLS projection and the standardization parameters (mean and sd), and both are applied to the CV-test folds; this way, in the example here, 50 values would be returned, hopefully converging on the best parameter in question. One complication, though: there should be an argument to specify the desired number of PLS components to retain and to use when calculating the scores of each CV-test fold (better pre-defined in a previous tuning step, maybe). My expectation is that, after deciding on the best model, there should be a way to get the PLS projection matrix of the whole training set (i.e., n=168), along with its (mean, sd), to apply to the holdout-test set to validate the best model. So in total there would be 50 different PLS projection matrices, means, and sds from the CV step, plus 1 extra from the whole training set. Am I right?
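The per-fold bookkeeping I have in mind could be made explicit like this (a hand-rolled sketch, not caret's internal mechanism; `X` a matrix and `y` a factor are assumed to exist):

```r
## Each CV iteration re-estimates mean/sd and the PLS projection on
## its CV-training fold and applies both to the matching CV-test fold.
library(caret)   # createMultiFolds()
library(pls)     # plsr()

folds <- createMultiFolds(y, k = 10, times = 5)  # 50 training index sets
k <- 2                                           # PLS components to keep
for (tr in folds) {
  mu  <- colMeans(X[tr, ])
  sds <- apply(X[tr, ], 2, sd)
  Ztr <- scale(X[tr, ],  center = mu, scale = sds)
  Zte <- scale(X[-tr, ], center = mu, scale = sds)  # inherited parameters
  Ytr <- model.matrix(~ y[tr] - 1)
  pfit <- plsr(Ytr ~ Ztr, ncomp = k)
  Ste  <- Zte %*% pfit$projection[, 1:k, drop = FALSE]  # fold-specific scores
  ## ... fit the classifier on pfit$scores, evaluate it on Ste ...
}
```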
Feature selection in this method would entail two things: the predictor space and the PLS-component space.
Q5 How does one perform these two selections (predictors and PLS components) in caret? Feature selection here differs from the usual case, because we deal with scores rather than the observations themselves when determining the best predictors with which to construct the PLS components.
Note: Once one is happy with the best final model, it would be recommended to refit it on the whole data set (n=218) to get the final estimates with the least uncertainty.
A similar procedure is implemented in the caret::train() function, which can be fed a preProc argument specifying the desired pre-processing of the data (most methods are mentioned in the help system, but I couldn't find PLS among them; better still would be an argument to specify the desired number of components, similar to the PCA pre-processing). I am aware that inheriting the pre-processing parameters by the holdout-test set and the CV-test folds can only be done through the predict.train() function, as opposed to calling the generic predict() on the $finalModel, which would not inherit the pre-processing parameters.
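A minimal illustration of that distinction (assuming data frames `train` and `test` with a class factor `cluster`):

```r
## predict() on the train object re-applies the stored pre-processing
## to new data; predict() on $finalModel would see the raw predictors.
library(caret)

fit <- train(cluster ~ ., data = train, method = "pls",
             preProcess = c("center", "scale"), tuneLength = 5,
             trControl = trainControl(method = "repeatedcv",
                                      number = 10, repeats = 5))

p_ok <- predict(fit, newdata = test)   # pre-processing inherited
## predict(fit$finalModel, test) would skip centering/scaling -- don't
```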
Q6 How does one implement this strategy (if correctly described) to train and validate the two-step PLS-[classifier] methodology on the PLS-scores subspace instead of the observations, making use of the caret infrastructure?
Thanks in advance.
Best Answer
I don't have time right now to answer all your questions, but here's a start:
Yes, you can optimize whatever hyperparameters you have with the same optimization set. The important thing is to make sure that the test set used for the final measurement of prediction performance is kept independent of all kinds of training data, and that the optimization set is part of the training data.
See nested or double cross-validation / out-of-bootstrap validation for search terms.
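A skeleton of that nested scheme in R (my sketch: caret's train() does all the tuning in the inner loop, and the outer folds only measure performance; `X` and `y` are assumed to exist):

```r
## Outer loop: performance measurement only. Inner loop: train()
## tunes ncomp by repeated CV on the outer-training data.
library(caret)

outer <- createFolds(y, k = 5, returnTrain = TRUE)
errs <- sapply(outer, function(tr) {
  fit <- train(x = X[tr, ], y = y[tr], method = "pls",
               preProcess = c("center", "scale"), tuneLength = 5,
               trControl = trainControl(method = "repeatedcv",
                                        number = 10, repeats = 5))
  mean(predict(fit, X[-tr, ]) != y[-tr])  # outer fold untouched by tuning
})
mean(errs)   # nested-CV estimate of the prediction error
```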
Model selection, as in comparing a number of models and then keeping the "best", is a data-driven optimization. It belongs in the optimization stage, and in order to get the performance of the chosen model you need independent test data!
As you already say that your sample size is quite small (though I'd be happy to have that many patients, the more so as I have far more variates...), it is probably better not to spend samples on optimization at all, the more so as you probably cannot do any meaningful model comparison at that sample size anyway. And optimization (as in "pick the best") usually boils down to a massive multiple-comparison situation.
Bootstrap vs. cross-validation: the bootstrap is preferred by some disciplines, cross-validation by others. For my type of data, iterated k-fold cross-validation and out-of-bootstrap had very similar total error:
Beleites, C. et al.: Variance reduction in estimating classification error using sparse datasets, Chemom Intell Lab Syst, 79, 91 - 100 (2005).
See also Kim, J.-H.: Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap, Computational Statistics & Data Analysis, 53, 3735 - 3745 (2009). DOI: 10.1016/j.csda.2009.04.009 for similar findings.
Comparing the different PLS surrogate models generated during resampling validation is another topic worth searching for.
Q4: such constraints are known as stratification.
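In caret, the fold generators already stratify on the outcome, which may be all that is needed here (sketch, assuming a class factor `y`):

```r
## createFolds()/createDataPartition() sample within the levels of y,
## so each fold keeps roughly the class proportions; with n = 12 in
## the rarest class, 10-fold CV still leaves only ~1 such case per
## test fold.
library(caret)

folds <- createFolds(y, k = 10)          # stratified by y
sapply(folds, function(i) table(y[i]))   # inspect per-fold class counts
```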
As for caret and PLS preprocessing: maybe a glance at the code for the PCA pre-processing will allow you to define a PLS preprocessing? Otherwise, you'll probably have to set up custom PLS-xxx models.