Solved – Are (a) multicollinearity and/or (b) binary variables an issue for DBSCAN? If so, how can one correct for these issues?

binary-data, clustering, dbscan, mixed-type-data, multicollinearity

I have read some related questions, such as: Why are mixed data a problem for euclidean-based clustering algorithms?, What data structure to use for my cluster analysis or what cluster analysis to use for my data?, K-Means Clustering with Dummy Variables, k means with binary variables, How to use both binary and continuous variables together in clustering?, etc.

I can't find a consensus, or many posted references, on the following questions.

I am using the DBSCAN algorithm to perform cluster analyses. This requires numeric data. I have categorical variables, such as race, that I turn into numeric data by creating dummy-coded binary variables. (I also normalize all continuous variables to the unit interval.) When the categorical variable has only two levels, this creates perfect correlation between the two dummies.

For instance, take gender. Making dummies from this variable creates gender_male and gender_female. These two are perfectly correlated, meaning: $\text{Pr}(\text{F} = 1| \text{M} = 0) = 1$, which also implies $\text{Pr}(\text{F} = 0| \text{M} = 1) = 1$.
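The redundancy described above can be checked directly. This is a minimal sketch with simulated data (the variable names are made up for illustration): dummy-coding a two-level factor produces two columns where one is exactly 1 minus the other, so their correlation is exactly -1.

```r
# Simulated data (assumed for illustration): a two-level factor
set.seed(1839)
gender <- factor(sample(c("male", "female"), 100, replace = TRUE))

# Dummy-coding both levels yields perfectly redundant columns
gender_male   <- as.numeric(gender == "male")
gender_female <- as.numeric(gender == "female")

# gender_female is identically 1 - gender_male, so the correlation is -1:
# knowing one column fully determines the other
cor(gender_male, gender_female)  # -1
```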

In the data I have been using, I generally get one cluster and a number of outliers. I had three questions regarding this:

  1. It seems like the typical k-means clustering cannot be used with binary variables. Is there an authoritative reference on this, laying out the reasons why?

  2. DBSCAN uses the same distance measures as k-means does—it just assigns points to clusters differently. Does this mean that binary variables (i.e., either zero or one) are not useful for DBSCAN, either? If so, how could I best represent categorical data (e.g., gender, race, political affiliation, whether or not people play video games) in clustering algorithms?

  3. Does having highly related (or perfectly correlated) variables in the clustering algorithm throw off DBSCAN? I know why it is an issue in something like ordinary least squares, but those concerns do not apply to the algorithm being used here. It doesn't appear to change anything, as the results I get below don't shift when I drop the perfectly redundant variable:


library(dbscan)
## generating data -------------------------
set.seed(1839)
n <- 5000
x <- rbeta(n, .4, .4)      # continuous variable in [0, 1]
y <- rbinom(n, 1, x)       # binary variable
z <- ifelse(y == 1, 0, 1)  # perfectly redundant with y
dat <- data.frame(x, y, z)

## selecting eps ---------------------------
kNNdistplot(dat, k = 9)
abline(h = .0045) # at the elbow

## clustering ------------------------------
# with the redundant column
set.seed(1839)
c1 <- dbscan(dat, eps = .0045, minPts = 9)

# without the redundant column
set.seed(1839)
c2 <- dbscan(dat[, -3], eps = .0045, minPts = 9)

## equivalent? -----------------------------
all.equal(c1$cluster, c2$cluster)

Best Answer

You can find plenty of discussion on k-means with binary variables here on the Stack Exchange network.

Binary variables are less of a problem for DBSCAN because it does not minimize a least-squares objective and does not compute means; it can be used with arbitrary dissimilarities, such as the Jaccard distance for binary variables (it can even be used with similarities, non-metric measures, kernels, etc.).
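As a sketch of that flexibility (with simulated binary data and illustrative, not recommended, values for eps and minPts): `dist(..., method = "binary")` computes the Jaccard (asymmetric binary) distance, and `dbscan()` from the dbscan package accepts a precomputed `dist` object directly, so no Euclidean assumption is needed.

```r
library(dbscan)

# Simulated binary data (assumed for illustration): 200 rows, 5 binary features
set.seed(1839)
bin <- matrix(rbinom(200 * 5, 1, 0.5), ncol = 5)

# Jaccard distance between rows, in [0, 1]
d <- dist(bin, method = "binary")

# DBSCAN on the precomputed distances; eps = 0.4 and minPts = 5 are arbitrary
cl <- dbscan(d, eps = 0.4, minPts = 5)
table(cl$cluster)  # cluster 0 is noise
```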

But that doesn't mean the data will just work. If you have binary variables, there probably is no good choice of epsilon, because there won't be any small distance values. Consider the Hamming distance: if epsilon is at least 1 (i.e., it allows one variable to differ), you probably already get everything connected. That is due to a lack of "resolution" in the data itself, and it is probably not something the algorithm or the distance measure can solve.
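The resolution problem can be seen numerically. In this sketch (simulated binary data), pairwise Hamming distances take only a handful of integer values, so the k-NN distance plot has no smooth "elbow" to read an epsilon from: any threshold below 1 isolates points, and any threshold of 1 or more tends to connect large parts of the data at once.

```r
# Simulated binary data (assumed for illustration): 50 rows, 4 binary features
set.seed(1839)
bin <- matrix(rbinom(50 * 4, 1, 0.5), ncol = 4)

# On 0/1 data, the Manhattan distance equals the Hamming distance
ham <- as.matrix(dist(bin, method = "manhattan"))

# The distinct distances are a subset of {0, 1, 2, 3, 4} -- integers only,
# with nothing in between to set epsilon against
sort(unique(as.vector(ham)))
```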

Correlated binary attributes like the male/female dummy pair put double weight on that one underlying attribute. That is probably not desirable.
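The double weighting is easy to verify. In this sketch (hypothetical dummy columns), keeping both redundant gender dummies makes a male/female pair differ in two coordinates instead of one, doubling the squared Euclidean distance contributed by gender.

```r
# Hypothetical points differing only in gender, with both dummy columns kept
male   <- c(gender_male = 1, gender_female = 0)
female <- c(gender_male = 0, gender_female = 1)

sum((male - female)^2)   # 2: both redundant columns contribute
(male[1] - female[1])^2  # 1: keeping a single dummy contributes once
```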

There are many, many extensions of DBSCAN, and I'm pretty sure there is one for categorical data. I know there is PreDeCon for axis-aligned subspaces, and something like 4C for correlated continuous attributes. So you will have to do some research on what has been published. Don't assume that a software package contains all that there is.