Solved – Unsupervised dimensionality reduction for mixed data types

binary data, dimensionality reduction, mixed type data, r

I have a data set with about 50K rows and 100 columns; you can think of each row as representing one restaurant.

My goal is to calculate dissimilarities between all the restaurants using Gower's coefficient.

Of those 100 columns (features), a few are numeric or nominal. The problem is that the other columns (about 90) are very sparse binary data (1/0).

I do think that those 90 binary columns can be reduced to a smaller number of columns, so that the computation time drops significantly. But I don't know what method to use to reduce such a large amount of binary data.

Can anyone give me some suggestions?

It would be most helpful if you could also provide some references and R code.

Best Answer

I do think that those 90 columns of binary data can be reduced to some smaller number of columns, so that the computational time can be reduced significantly

This assumption seems unfounded to me. Since the computational time of calculating all pairwise dissimilarities scales as O(n^2) in the number of rows, the effect of dimensionality reduction will be barely noticeable. If the full computation takes two days, saving 2-3 hours won't change much.
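To make the scaling concrete (the numbers follow directly from your n = 50K): the number of pairs is fixed by the row count, and dropping columns only shrinks the constant per-pair cost.

```r
n <- 50000
pairs <- n * (n - 1) / 2  # number of pairwise dissimilarities
pairs                     # ~1.25 billion
# The per-pair cost is linear in the number of columns p, so reducing p
# changes only that constant factor -- the 1.25e9 pairs remain either way.
```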

What I have in mind is:

  1. Do you really need all pairwise dissimilarities? Often one actually doesn't — for example, when only each restaurant's nearest neighbours matter — and in that case a spatial index can be used instead.

  2. If you do need them all: do you need to update them often? Consider computing the dissimilarity matrix once and keeping it. Adding a single item then requires only n new dissimilarities, which takes little time.
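To illustrate point 2: comparing one new restaurant against the existing n rows costs O(n·p), not O(n^2). A minimal base-R sketch of Gower dissimilarity for one new row against a data frame (the function name is mine, and for simplicity numeric ranges are taken from the existing data only, with all variables equally weighted; `types` marks each column as numeric or categorical):

```r
gower_to_all <- function(new_row, data, types) {
  # types: "num" for numeric columns, "cat" for nominal/binary columns
  contrib <- sapply(seq_along(types), function(j) {
    if (types[j] == "num") {
      rng <- diff(range(data[[j]]))
      if (rng == 0) rep(0, nrow(data))
      else abs(data[[j]] - new_row[[j]]) / rng  # range-scaled difference
    } else {
      as.numeric(data[[j]] != new_row[[j]])     # simple matching
    }
  })
  rowMeans(contrib)  # average contribution over the variables
}

# hypothetical usage: dissimilarity of one newcomer to three restaurants
restaurants <- data.frame(price = c(12, 30, 45),
                          rating = c(3, 5, 4),
                          vegan  = c(0, 1, 0))
newcomer <- data.frame(price = 20, rating = 4, vegan = 1)
gower_to_all(newcomer, restaurants, types = c("num", "num", "cat"))
```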

In any case, you should try to reformulate the problem: whether you have 10 or 100 variables will make little difference to the feasibility of your current approach.
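Since R code was asked for: the Gower computation itself is available via `daisy()` in the `cluster` package (shipped with R; see `?daisy` for references). A toy sketch with made-up restaurant columns, declaring the sparse 0/1 columns as asymmetric binary so that shared zeros are not counted as similarity:

```r
library(cluster)  # 'cluster' is a recommended package bundled with R

set.seed(1)
# toy mixed data: 2 numeric, 1 nominal, 3 sparse binary columns (names invented)
df <- data.frame(
  price    = runif(10, 5, 50),
  seats    = sample(20:200, 10),
  cuisine  = factor(sample(c("thai", "pizza", "sushi"), 10, TRUE)),
  has_bar  = sample(0:1, 10, TRUE),
  vegan    = sample(0:1, 10, TRUE),
  delivery = sample(0:1, 10, TRUE)
)
# asymm: a shared 1 counts as a match, a shared 0 is ignored for that variable
d <- daisy(df, metric = "gower",
           type = list(asymm = c("has_bar", "vegan", "delivery")))
summary(d)
```

`as.matrix(d)` gives the full dissimilarity matrix if you need to index into it.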