Solved – use PCA with mixed and sparse data types

clusteringdimensionality reductionpcasparse

I am trying to reduce the dimensionality of a data set of about 100'000 rows and 1'000 columns, in order to cluster the individual observations with k-means. I tried PCA with rescaling (i.e., subtracting the mean and dividing by the standard deviation), but I am not sure this approach makes much sense, because

The majority of the variables are not normally distributed (i.e., they follow exponential distributions, or other skewed distributions)
Many variables are 0/1 flags, and most of them are very sparse (i.e., 99.9% of the data is 0, and 0.1% is 1)
There are many outliers, and it is not always clear if the best way to remove an outlier is by removing the corresponding column or row

Is there a better way to reduce the dimensionality than PCA? I also tried linear mapping each variable in the interval [0,1] instead of mean subtraction/sigma division rescaling, and I even tried substituting some variables with the corresponding deciles, but then again I don't know if the combination PCA+kmeans is the best way to perform the clustering in this case.

Best Answer

There is a paper PCA on a DataFrame that seems trying to solve this problem. The technique used here is called collectively Generalized Low Rank Models (PCA and Sparse PCA are examples of this family of methods).

If you are familiar with Python/R you can try to use GLRMs from H2O library. They can handle both categorical and continuous data in single row.

Related Solutions

Solved – How to know when to stop reducing dimensions with PCA

Note that ridge penalisation/regularisation is basically doing model selection using pca. Although it does it smoothly by shrinking along each principal component axis rather than discretely by dropping small.variance pcs. Note that because you are doing pca on different subsets of variables this would roughly correspond to having different regularising parameters for each group rather than one for all the betas. The elements of statistical learning explains ridging quite well and provides some comparisons.

Dimensionality Reduction – Applying SVD or PCA on Large Sparse Matrices

First of all, you really do want to center the data. If not, the geometric interpretation of PCA shows that the first principal component will be close to the vector of means and all subsequent PCs will be orthogonal to it, which will prevent them from approximating any PCs that happen to be close to that first vector. We can hope that most of the later PCs will be approximately correct, but the value of that is questionable when it's likely the first several PCs--the most important ones--will be quite wrong.

So, what to do? PCA proceeds by means of a singular value decomposition of the matrix $X$. The essential information will be contained in $X X'$, which in this case is a $10000$ by $10000$ matrix: that may be manageable. Its computation involves about 50 million calculations of dot products of one column with the next.

Consider any two columns, then, $Y$ and $Z$ (each one of them is a $500000$-vector; let this dimension be $n$). Let their means be $m_Y$ and $m_Z$, respectively. What you want to compute is, writing $\mathbf{1}$ for the $n$-vector of $1$'s,

$$(Y - m_Y\mathbf{1}) \cdot (Z - m_Z\mathbf{1}) = Y\cdot Z - m_Z\mathbf{1}\cdot Y - m_Y\mathbf{1}.Z + m_Z m_Y \mathbf{1}\cdot \mathbf{1}\\ = Y\cdot Z -n (m_Ym_Z),$$

because $m_Y = \mathbf{1}\cdot Y / n$ and $m_Z = \mathbf{1}\cdot Z/n$.

This allows you to use sparse matrix techniques to compute $X X'$, whose entries provide the values of $Y\cdot Z$, and then adjust its coefficients based on the $10000$ column means. The adjustment shouldn't hurt, because it seems unlikely $X X'$ will be very sparse.

Example

The following R code demonstrates this approach. It uses a stub, get.col, which in practice might read one column of $X$ at a time from an external data source, thereby reducing the amount of RAM required (at some cost in computation speed, of course). It computes PCA in two ways: via SVD applied to the preceding construction and directly using prcomp. It then compares the output of the two approaches. The computation time is about 50 seconds for 100 columns and scales approximately quadratically: be prepared to wait when performing SVD on a 10K by 10K matrix!

m <- 500000 # Will be 500,000
n <- 100    # will be 10,000
library("Matrix")
x <- as(matrix(pmax(0,rnorm(m*n, mean=-2)), nrow=m), "sparseMatrix")
#
# Compute centered version of x'x by having at most two columns
# of x in memory at any time.
#
get.col <- function(i) x[,i] # Emulates reading a column
system.time({
  xt.x <- matrix(numeric(), n, n)
  x.means <- rep(numeric(), n)
  for (i in 1:n) {
    i.col <- get.col(i)
    x.means[i] <- mean(i.col)
    xt.x[i,i] <- sum(i.col * i.col)
    if (i < n) {
      for (j in (i+1):n) {
        j.col <- get.col(j)
        xt.x[i,j] <- xt.x[j,i] <- sum(j.col * i.col)
      }    
    }
  }
  xt.x <- (xt.x - m * outer(x.means, x.means, `*`)) / (m-1)
  svd.0 <- svd(xt.x / m)
}
)
system.time(pca <- prcomp(x, center=TRUE))
#
# Checks: all should be essentially zero.
#
max(abs(pca$center - x.means))
max(abs(xt.x - cov(x)))
max(abs(abs(svd.0$v / pca$rotation) - 1)) # (This is an unstable calculation.)

Best Answer

Related Solutions

Solved – How to know when to stop reducing dimensions with PCA

Dimensionality Reduction – Applying SVD or PCA on Large Sparse Matrices

Example

Related Question