Solved – Carets preProcess function

caretr

I've been introduced to the Caret package for performing analysis and I'm a little confused about one of the operations performed by preProcess.

Say I have a dataset df which consists of two columns a & b. a is a continuous column and b is a binary 1,0.

Now, using carets pre-process function, I can choose to center & scale the data:

preProc <- preProcess(df, method=c('center','scale')) 
dfTransformed <- predict(preProc, df)

It's my understanding that only continuous columns (in this case a) need to be scaled, but using this preProcess setting, both a and b get scaled. This probably isn't a problem as the binary column still only takes a possible two values, its just the values are no longer [0,1].

My question is, is there a reason why these binaries get scaled? Is there a benefit that I lack knowledge of, or, am I right in assuming that the binary columns should not get transformed and I should only perform the preprocessing on a subset of my data containing only my continuous variables.

I've seen related questions but I feel some of the answers lack an explanation.

Best Answer

The problem lies in the fact that your binary column is not a factor, but a numeric column. See what happens with the example below:

df <- data.frame(a = c(1.223, 10.2, 5.24, 3.31, 8.4, 11.54),
                 b = c(1, 0, 0, 0 , 1, 1))
preProcess(df, method=c('center','scale'))

Created from 6 samples and 2 variables

Pre-processing:
  - centered (2)
  - ignored (0)
  - scaled (2)

Both columns have been centered and scaled.

Now with b as a factor:

df$b <- as.factor(df$b)
preProcess(df, method=c('center','scale'))
Created from 6 samples and 2 variables

Pre-processing:
  - centered (1)
  - ignored (1)
  - scaled (1)

You can see that now there is one variable ignored during preprocessing. The factor is now ignored. Factors are being treated as a non-numeric predictor and these will be ignored by the preProcess function.

Related Solutions

Dimensionality Reduction – Applying SVD or PCA on Large Sparse Matrices

First of all, you really do want to center the data. If not, the geometric interpretation of PCA shows that the first principal component will be close to the vector of means and all subsequent PCs will be orthogonal to it, which will prevent them from approximating any PCs that happen to be close to that first vector. We can hope that most of the later PCs will be approximately correct, but the value of that is questionable when it's likely the first several PCs--the most important ones--will be quite wrong.

So, what to do? PCA proceeds by means of a singular value decomposition of the matrix $X$. The essential information will be contained in $X X'$, which in this case is a $10000$ by $10000$ matrix: that may be manageable. Its computation involves about 50 million calculations of dot products of one column with the next.

Consider any two columns, then, $Y$ and $Z$ (each one of them is a $500000$-vector; let this dimension be $n$). Let their means be $m_Y$ and $m_Z$, respectively. What you want to compute is, writing $\mathbf{1}$ for the $n$-vector of $1$'s,

$$(Y - m_Y\mathbf{1}) \cdot (Z - m_Z\mathbf{1}) = Y\cdot Z - m_Z\mathbf{1}\cdot Y - m_Y\mathbf{1}.Z + m_Z m_Y \mathbf{1}\cdot \mathbf{1}\\ = Y\cdot Z -n (m_Ym_Z),$$

because $m_Y = \mathbf{1}\cdot Y / n$ and $m_Z = \mathbf{1}\cdot Z/n$.

This allows you to use sparse matrix techniques to compute $X X'$, whose entries provide the values of $Y\cdot Z$, and then adjust its coefficients based on the $10000$ column means. The adjustment shouldn't hurt, because it seems unlikely $X X'$ will be very sparse.

Example

The following R code demonstrates this approach. It uses a stub, get.col, which in practice might read one column of $X$ at a time from an external data source, thereby reducing the amount of RAM required (at some cost in computation speed, of course). It computes PCA in two ways: via SVD applied to the preceding construction and directly using prcomp. It then compares the output of the two approaches. The computation time is about 50 seconds for 100 columns and scales approximately quadratically: be prepared to wait when performing SVD on a 10K by 10K matrix!

m <- 500000 # Will be 500,000
n <- 100    # will be 10,000
library("Matrix")
x <- as(matrix(pmax(0,rnorm(m*n, mean=-2)), nrow=m), "sparseMatrix")
#
# Compute centered version of x'x by having at most two columns
# of x in memory at any time.
#
get.col <- function(i) x[,i] # Emulates reading a column
system.time({
  xt.x <- matrix(numeric(), n, n)
  x.means <- rep(numeric(), n)
  for (i in 1:n) {
    i.col <- get.col(i)
    x.means[i] <- mean(i.col)
    xt.x[i,i] <- sum(i.col * i.col)
    if (i < n) {
      for (j in (i+1):n) {
        j.col <- get.col(j)
        xt.x[i,j] <- xt.x[j,i] <- sum(j.col * i.col)
      }    
    }
  }
  xt.x <- (xt.x - m * outer(x.means, x.means, `*`)) / (m-1)
  svd.0 <- svd(xt.x / m)
}
)
system.time(pca <- prcomp(x, center=TRUE))
#
# Checks: all should be essentially zero.
#
max(abs(pca$center - x.means))
max(abs(xt.x - cov(x)))
max(abs(abs(svd.0$v / pca$rotation) - 1)) # (This is an unstable calculation.)

Solved – Search for interactions using carets rfe function

Try using model.matrix function to construct the interactions, e.g. model.matrix(Y~.^2, data), and then pass the resulting matrix to rfe. You might want to strip off the intercept first.

Best Answer

Related Solutions

Dimensionality Reduction – Applying SVD or PCA on Large Sparse Matrices

Example

Solved – Search for interactions using carets rfe function

Related Question