Solved – Principal component analysis and DBScan

Tags: dbscan, dimensionality reduction, pca

My data has 30 dimensions and 150 observations.
I want to cluster the data with DBScan.
Is there a difference between:
1. Performing a PCA and clustering all 30 principal components or
2. Just clustering the raw data?

DBScan works fast on my small dataset. Should I reduce the dimensions anyway? Is dimension reduction only about speed?

Best Answer

The curse of dimensionality has many different facets, and one of them is that finding relevant neighbours becomes hard in high-dimensional domains. That is due to both qualitative and quantitative reasons.

Qualitatively, it becomes difficult to define a reasonable metric of similarity. As different dimensions are not necessarily comparable, what constitutes an appropriate "distance" becomes ill-defined.
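As a quick illustration with made-up numbers: when features live on very different scales (say height in metres and income in euros), one of them dominates the Euclidean distance, and standardising the columns changes which points look "close" to each other:

height_m   <- c(1.60, 1.62, 1.90)
income_eur <- c(20000, 80000, 21000)
X <- cbind(height_m, income_eur)
dist(X)          # distances are driven almost entirely by income
dist(scale(X))   # after standardising, height differences matter again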

Quantitatively, even when a distance is well defined, the expected Euclidean distance between any two points grows with the number of dimensions. In a high-dimensional problem the data are sparse, and as a result local neighbourhoods contain very few points. That is the reason why high-dimensional density estimation is a notoriously hard problem. As a rule of thumb, density estimation tends to become quite difficult above 4-5 dimensions; 30 is definite overkill.
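A small simulation makes the sparsity concrete (the radius of 0.5 is arbitrary): the fraction of 150 uniform points in $[0,1]^d$ that fall within Euclidean distance 0.5 of the centre of the cube collapses towards zero as $d$ grows:

set.seed(11)
sapply(c(2, 5, 10, 20, 30), function(myd) {
  Z <- matrix(runif(150 * myd), ncol = myd)      # 150 points in [0,1]^myd
  distToCentre <- sqrt(colSums((t(Z) - 0.5)^2))  # distance of each point to the centre
  mean(distToCentre < 0.5)                       # fraction inside the 0.5-ball
})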

Returning to DBSCAN: in DBSCAN we try to define regions of "high density" through the eps-neighbourhood and minPts parameters. This task is prone to producing spurious results when the number of dimensions in the sample is high. Even if we assume a relatively well-behaved sample uniformly distributed in $[0,1]^d$, the average distance between points grows proportionally to $\sqrt{d}$. See this thread too; it deals with Gaussian data in a similar "curse of dimensionality" situation.
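A quick back-of-the-envelope calculation shows where the $\sqrt{d}$ rate comes from: for independent $X, Y \sim U[0,1]^d$, each coordinate contributes $E[(X_i - Y_i)^2] = \operatorname{Var}(X_i - Y_i) = \tfrac{1}{12} + \tfrac{1}{12} = \tfrac{1}{6}$, so $E[\|X - Y\|^2] = d/6$ and, by concentration, $\|X - Y\| \approx \sqrt{d/6} \approx 0.41\sqrt{d}$ for large $d$. At $d = 25$ this already gives $\sqrt{25/6} \approx 2.04$, which matches the simulation below.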

I attach a quick R script showing how the expected distance between points in $U[0,1]^d$ increases with $d$. Notice that the individual features of our multi-dimensional sample $Z$ only take values in $[0,1]$, yet the expected distance quickly jumps to values above $2$ once $d$ exceeds $25$:

set.seed(11)
N <- 1234                                  # number of points per dimensionality
d <- round(seq(2, 333, length.out = 33))   # dimensionalities to test

# takes roughly 20-30 seconds to run
averageDist <- sapply(d, function(myd) {
  set.seed(myd)
  Z <- matrix(runif(N * myd), ncol = myd)  # N uniform points in [0,1]^myd
  mean(dist(Z))                            # average pairwise Euclidean distance
})

cor(averageDist, sqrt(d))
# 0.9999903

par(mar = c(5, 5, 4, 2))
plot(averageDist, x = d, panel.first = grid(),
     ylab = expression("Expected distance between points in U[0,1]"^{d}),
     type = "l", xlab = "# of dimensions (d)")

[Plot: average pairwise distance between points in U[0,1]^d against the number of dimensions d, growing like sqrt(d)]

So, yes, it is advisable to reduce the dimensionality of a sample before using a clustering algorithm like DBSCAN that relies on a notion of sample density. Performing PCA or another dimensionality-reduction algorithm (e.g. NNMF, if applicable) would be a good idea. Dimensionality reduction is not merely a "speed issue" but also a methodological one, with repercussions for both the relevance of the analysis and its feasibility.
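For completeness, here is a minimal sketch of that workflow, assuming the dbscan package is available; the random X merely stands in for your 150 x 30 data matrix, the 90% variance cut-off is arbitrary, and eps/minPts are placeholder values you would tune (e.g. with a k-distance plot):

library(dbscan)

set.seed(11)
X <- matrix(rnorm(150 * 30), nrow = 150)   # stand-in for the real 150 x 30 data

pca <- prcomp(X, center = TRUE, scale. = TRUE)
cumVar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cumVar >= 0.90)[1]              # keep enough PCs for ~90% of the variance
scores <- pca$x[, 1:k, drop = FALSE]

kNNdistplot(scores, k = 5)                 # pick eps near the "elbow" of this curve
clusters <- dbscan(scores, eps = 2, minPts = 5)
table(clusters$cluster)                    # cluster 0 collects the noise points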