Solved – Number of principal components when preprocessing using PCA in caret package in R

Tags: caret, cross-validation, machine learning, pca, r

I am using the caret package in R to train binary SVM classifiers. To reduce the number of features, I preprocess with PCA using the built-in option preProc=c("pca") when calling train(). Here are my questions:

  1. How does caret select principal components?
  2. Is there a fixed number of principal components that is selected?
  3. Are principal components selected by some amount of explained variance (e.g. 80%)?
  4. How can I set the number of principal components used for classification?
  5. (I understand that PCA should be part of the outer cross-validation to allow reliable prediction estimates.) Should PCA also be implemented in the inner cross-validation cycle (parameter estimation)?
  6. How does caret implement PCA in the cross-validation?

Best Answer

By default, caret keeps as many components as are needed to capture 95% of the cumulative variance, so the number is not fixed and depends on the data.
You can change this cutoff with the thresh parameter of preProcess().

# Example
preProcess(training, method = "pca", thresh = 0.8)
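
To make the cutoff concrete, here is a minimal sketch in base R of how a cumulative-variance threshold maps to a number of components. It is an illustration only, not caret's internal code: the iris data and the helper n_comp() are assumptions made for the example. The data are centered and scaled first because, per the preProcess documentation, caret centers and scales when PCA is requested.

```r
# Illustration only: base R, not caret's internal implementation.
x <- iris[, 1:4]                      # stand-in numeric predictors (assumption)

# PCA on centered and scaled predictors
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Hypothetical helper: smallest k whose cumulative variance reaches the cutoff
n_comp <- function(thresh) which(cum_var >= thresh)[1]

print(round(cum_var, 3))
print(n_comp(0.80))
print(n_comp(0.95))
```

A lower thresh can only keep the same number of components or fewer, never more.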

Alternatively, you can request a fixed number of components with the pcaComp parameter.

# Example
preProcess(training, method = "pca", pcaComp = 7)
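
As a quick check that a fixed component count is honored, the following sketch fits the transformation and applies it with predict(). The iris predictors stand in for your own training data (an assumption), and the variable names pp and scores are chosen just for this example.

```r
library(caret)

x <- iris[, 1:4]                          # stand-in predictors (assumption)

# Estimate the PCA transformation, keeping exactly 2 components
pp <- preProcess(x, method = "pca", pcaComp = 2)

# Apply it; the result holds the component scores instead of the raw columns
scores <- predict(pp, x)

print(pp$numComp)        # number of components retained
print(colnames(scores))  # names of the score columns
```

The same pp object should then be applied to any test set with predict(pp, newdata), so that the test data are projected with loadings estimated on the training data only.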

If you set both parameters, pcaComp takes precedence over thresh.
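
Since the question calls train() rather than preProcess() directly, these options can be passed through the preProcOptions argument of trainControl(). The sketch below assumes a two-class subset of iris as stand-in data and the "svmLinear" method, which needs the kernlab package installed. As far as I understand the caret documentation, train() re-estimates the preprocessing inside each resample, so the PCA is part of the cross-validation rather than being computed once on all of the data.

```r
library(caret)

set.seed(1)

# Two-class stand-in data (assumption): drop one iris species
dat <- droplevels(subset(iris, Species != "setosa"))

# PCA options are forwarded to preProcess() via preProcOptions
ctrl <- trainControl(
  method = "cv",
  number = 5,
  preProcOptions = list(thresh = 0.8)   # or list(pcaComp = 2) for a fixed count
)

fit <- train(
  Species ~ .,
  data = dat,
  method = "svmLinear",                 # linear SVM, requires kernlab
  preProcess = "pca",
  trControl = ctrl
)

# Components kept by the preprocessing fitted on the full training set
print(fit$preProcess$numComp)
```

The count stored in fit$preProcess refers to the final model. With thresh, individual resamples may retain a different number of components, whereas pcaComp pins the count in every resample.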

Please see: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/preProcess