Solved – Caret: PCA preprocessing and partitioning train and test data

My training and test data are in two distinct .csv files:

credit <- read.csv('/Users/dbl/Downloads/Loans Question/RequiredAttributesWithLoanStatusAdjusted.csv')

creditTesting <- read.csv('/Users/dbl/Downloads/Loans Question/NewLoansReducedFields.csv')

credit$LoanStatus <- as.factor(credit$LoanStatus)

I'm using caret's preProcess = c("center", "scale", "pca") option in the training phase.

logitBoostFit <- train(LoanStatus ~ ., data = credit, method = "LogitBoost", family = binomial,
    preProcess = c("center", "scale", "pca"), trControl = ctrl)

How is PCA applied to the test data in the predict phase?

logitBoostClasses <- predict(logitBoostFit, newdata = creditTesting)

If I concatenate both files into one and then partition that file into training and testing sets, so that the scaling, normalization, and mapping of categorical to numerical values are consistent, don't I violate the principle of separating training from test data?

Best Answer

You can keep them in two separate data sets. train stores the PCA loadings computed from the training set (along with the training means and standard deviations) and applies them to the new samples when you call predict. In other words, the new samples are centered, scaled, and projected using the training-set information; nothing is recomputed from the test data.
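To make the idea concrete, here is a minimal sketch of the same principle using base R's prcomp rather than caret itself. It illustrates the behavior described above and is not caret's internal code. The built-in iris measurements stand in for the loan data, and the names trainX, testX, testScores, and manual are made up for this example.

```r
# Illustration only: base R's prcomp on the built-in iris data,
# used here as a stand-in for the loan data in the question.
set.seed(1)
idx    <- sample(nrow(iris), 100)
trainX <- iris[idx, 1:4]    # hypothetical training predictors
testX  <- iris[-idx, 1:4]   # hypothetical test predictors

# Estimate the means, standard deviations, and loadings on the training rows only.
pca <- prcomp(trainX, center = TRUE, scale. = TRUE)

# predict() reuses the stored training statistics to project the new rows.
testScores <- predict(pca, newdata = testX)

# The same projection written out by hand: center and scale the test rows
# with the training means and standard deviations, then multiply by the training loadings.
manual <- scale(as.matrix(testX), center = pca$center, scale = pca$scale) %*% pca$rotation

# Compare the two results.
all.equal(unname(testScores), unname(manual))
```

The two projections should agree, which shows that no statistic is re-estimated from the test rows. With caret, calling predict on the fitted train object with newdata = creditTesting plays the role of the predict call in this sketch, so the two .csv files never need to be merged.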

Max