Solved – caret's preProcess function

I've been introduced to the caret package for data analysis, and I'm a little confused about one of the operations performed by preProcess.

Say I have a data frame df with two columns, a and b. Column a is continuous and column b is binary (1 or 0).

Now, using caret's preProcess function, I can choose to center and scale the data:

preProc <- preProcess(df, method=c('center','scale')) 
dfTransformed <- predict(preProc, df)

It's my understanding that only continuous columns (in this case a) need to be scaled, but with this preProcess setting both a and b get scaled. This probably isn't a problem, since the binary column still takes only two possible values; it's just that those values are no longer 0 and 1.

My question is: is there a reason why these binary columns get scaled? Is there a benefit I'm not aware of, or am I right in assuming that binary columns should not be transformed and that I should run the preprocessing only on a subset of my data containing the continuous variables?

I've seen related questions but I feel some of the answers lack an explanation.

Best Answer

The problem is that your binary column is stored as a numeric column, not as a factor. See what happens with the example below:

df <- data.frame(a = c(1.223, 10.2, 5.24, 3.31, 8.4, 11.54),
                 b = c(1, 0, 0, 0 , 1, 1))
preProcess(df, method=c('center','scale'))

Created from 6 samples and 2 variables

Pre-processing:
  - centered (2)
  - ignored (0)
  - scaled (2)

Both columns have been centered and scaled.
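To make concrete what centering and scaling does to a numeric 0/1 column, here is a minimal sketch using base R's scale() on the same toy data. It assumes that "center" and "scale" mean subtracting the column mean and dividing by the sample standard deviation; the name scaled is only illustrative.

```r
# Same toy data as above, with b stored as numeric 0/1.
df <- data.frame(a = c(1.223, 10.2, 5.24, 3.31, 8.4, 11.54),
                 b = c(1, 0, 0, 0, 1, 1))

# Center and scale every numeric column:
# subtract the column mean, then divide by the column standard deviation.
scaled <- as.data.frame(scale(df, center = TRUE, scale = TRUE))

# b still has exactly two distinct values, but they are no longer 0 and 1.
unique(scaled$b)

# Each column now has mean ~0 and standard deviation 1.
colMeans(scaled)
sapply(scaled, sd)
```

The binary column keeps its information (two distinct values), but it is shifted and rescaled like any other numeric column.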

Now with b as a factor:

df$b <- as.factor(df$b)
preProcess(df, method=c('center','scale'))

Created from 6 samples and 2 variables

Pre-processing:
  - centered (1)
  - ignored (1)
  - scaled (1)

You can see that one variable is now ignored during preprocessing: the factor. preProcess treats factors as non-numeric predictors and leaves them out of the centering and scaling.
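As a quick check, the fitted object can be applied with predict(), as in the question, to confirm that only a is changed. This is a sketch on the same toy data and assumes the caret package is installed.

```r
library(caret)

# Toy data with b stored as a factor rather than a numeric column.
df <- data.frame(a = c(1.223, 10.2, 5.24, 3.31, 8.4, 11.54),
                 b = factor(c(1, 0, 0, 0, 1, 1)))

# Estimate the centering and scaling parameters, then apply them.
preProc <- preProcess(df, method = c("center", "scale"))
dfTransformed <- predict(preProc, df)

# a is standardized; b comes back as the same factor with levels "0" and "1".
str(dfTransformed)
```

If the binary column is needed as a number later (for example, by a model that does not accept factors), it would have to be converted back after preprocessing.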