Solved – Unsupervised dimensionality reduction for mixed data types

binary data, dimensionality reduction, mixed type data, r

I have a data set with about 50K rows and 100 columns; you can think of each row as representing one restaurant.

My goal is to calculate dissimilarities between all the restaurants using Gower's coefficient.

Of those 100 columns (features), a few are numeric or nominal. The problem is that the other columns (about 90) are very sparse binary data (1/0).

I do think that those 90 binary columns can be reduced to a smaller number of columns, so that the computation time drops significantly. But I don't know what method to use to reduce such a large amount of binary data.

Can anyone give me some suggestions?

It would be most helpful if you could also provide some references and R code.

Best Answer

I do think that those 90 columns of binary data can be reduced to some smaller number of columns, so that the computational time can be reduced significantly

This assumption seems unfounded to me. Since the computational time of calculating all pairwise dissimilarities scales as O(n^2) in the number of rows, the effect of dimensionality reduction will be barely noticeable. If the full computation takes two days, saving 2-3 hours won't change much.
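To make the scaling concrete (the numbers follow directly from your n = 50K): the number of pairs is fixed by the row count, and dropping columns only shrinks the constant per-pair cost.

```r
n <- 50000
pairs <- n * (n - 1) / 2  # number of pairwise dissimilarities
pairs                     # ~1.25 billion
# The per-pair cost is linear in the number of columns p, so reducing p
# changes only that constant factor -- the 1.25e9 pairs remain either way.
```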

What I have in mind is:

  1. Do you really need all pairwise dissimilarities? Often one actually doesn't — for example, when only each restaurant's nearest neighbours matter — and in that case a spatial index can be used instead.

  2. If you do need them all: do you need to update them often? Consider computing the dissimilarity matrix once and keeping it. Adding a single item then requires only n new dissimilarities, which takes little time.
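To illustrate point 2: comparing one new restaurant against the existing n rows costs O(n·p), not O(n^2). A minimal base-R sketch of Gower dissimilarity for one new row against a data frame (the function name is mine, and for simplicity numeric ranges are taken from the existing data only, with all variables equally weighted; `types` marks each column as numeric or categorical):

```r
gower_to_all <- function(new_row, data, types) {
  # types: "num" for numeric columns, "cat" for nominal/binary columns
  contrib <- sapply(seq_along(types), function(j) {
    if (types[j] == "num") {
      rng <- diff(range(data[[j]]))
      if (rng == 0) rep(0, nrow(data))
      else abs(data[[j]] - new_row[[j]]) / rng  # range-scaled difference
    } else {
      as.numeric(data[[j]] != new_row[[j]])     # simple matching
    }
  })
  rowMeans(contrib)  # average contribution over the variables
}

# hypothetical usage: dissimilarity of one newcomer to three restaurants
restaurants <- data.frame(price = c(12, 30, 45),
                          rating = c(3, 5, 4),
                          vegan  = c(0, 1, 0))
newcomer <- data.frame(price = 20, rating = 4, vegan = 1)
gower_to_all(newcomer, restaurants, types = c("num", "num", "cat"))
```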

In any case, you should try to reformulate the problem: whether you have 10 or 100 variables will make little difference to the feasibility of your current approach.
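Since R code was asked for: the Gower computation itself is available via `daisy()` in the `cluster` package (shipped with R; see `?daisy` for references). A toy sketch with made-up restaurant columns, declaring the sparse 0/1 columns as asymmetric binary so that shared zeros are not counted as similarity:

```r
library(cluster)  # 'cluster' is a recommended package bundled with R

set.seed(1)
# toy mixed data: 2 numeric, 1 nominal, 3 sparse binary columns (names invented)
df <- data.frame(
  price    = runif(10, 5, 50),
  seats    = sample(20:200, 10),
  cuisine  = factor(sample(c("thai", "pizza", "sushi"), 10, TRUE)),
  has_bar  = sample(0:1, 10, TRUE),
  vegan    = sample(0:1, 10, TRUE),
  delivery = sample(0:1, 10, TRUE)
)
# asymm: a shared 1 counts as a match, a shared 0 is ignored for that variable
d <- daisy(df, metric = "gower",
           type = list(asymm = c("has_bar", "vegan", "delivery")))
summary(d)
```

`as.matrix(d)` gives the full dissimilarity matrix if you need to index into it.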