Solved – Clustering data with Fourier series representation

Tags: clustering, data-transformation, feature-engineering, fourier-transform, gaussian-mixture-distribution

We are analyzing temporal behavioral patterns across many users, and we want to cluster the users in order to understand "natural types of behavior". Our idea is to represent each user's data (672 time bins per user) using the discrete Fourier transform (DFT). The behavior of each user is then decomposed into a combination of basic periodic signal types.

We will then apply a Gaussian mixture model (GMM) to cluster the users based on their coefficients for the different signals in the Fourier representation. Some of these basic signals will be very small, while others will be quite large. I am more interested in finding large differences in behavior than small ones; however, if I just use the coefficients directly, the GMM has no way of distinguishing the big signals from the small ones.

How do I extract this information from the Fourier transform and feed it into the GMM so that clustering is driven primarily by large differences in the behavioral signals? My first thought is to scale the coefficients by their amplitude, or something similar. However, perhaps the theory behind the Fourier transform suggests a better way?
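As a minimal sketch of the amplitude idea floated above (the user counts, data, and number of clusters here are all hypothetical), one could take the one-sided DFT of each user's series and use the coefficient magnitudes as the feature matrix, so that large-amplitude behavioral signals dominate the distances the GMM sees:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 users, 672 time bins each
# (e.g. one week at 15-minute resolution).
n_users, n_bins = 100, 672
X = rng.normal(size=(n_users, n_bins))

# One-sided DFT: for a real-valued signal the spectrum is symmetric,
# so rfft returns only n_bins // 2 + 1 complex coefficients per user.
coeffs = np.fft.rfft(X, axis=1)

# Use amplitudes rather than raw complex coefficients: components with
# large amplitude (big behavioral signals) then dominate Euclidean
# distances, which is what the question asks for.
amplitudes = np.abs(coeffs)

# `amplitudes` is the feature matrix one would hand to a GMM, e.g.
# sklearn.mixture.GaussianMixture(n_components=k).fit(amplitudes)
print(amplitudes.shape)  # (100, 337)
```

Whether to keep the amplitudes raw or log-scale them is a modeling choice: raw amplitudes emphasize the largest signals most strongly, which matches the stated goal.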

Best Answer

Interesting. I've been working on a similar question recently. First, as is well known, you don't need the full periodogram: for a real-valued signal the spectrum is symmetric, so only the frequencies up to the Nyquist frequency (the first N/2 bins) carry information. Then, if you rank that set by decreasing amplitude, you can select the subset of frequencies that best represents the periodicities. I'm not aware of any theoretical motivation for selecting a cutoff, but a judgement-based heuristic would work.
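The ranking step can be sketched as follows; the signal below is synthetic (two planted periodicities plus noise, standing in for a user's 672-bin series), and the cutoff `k` is exactly the judgement-based heuristic mentioned above:

```python
import numpy as np

rng = np.random.default_rng(1)
n_bins = 672
t = np.arange(n_bins)

# Synthetic series with two planted periodicities: 7 cycles and
# 96 cycles over the window (roughly "weekly" and "daily" patterns
# at 15-minute bins), plus noise.
x = (2.0 * np.sin(2 * np.pi * 7 * t / n_bins)
     + 0.5 * np.sin(2 * np.pi * 96 * t / n_bins)
     + 0.2 * rng.normal(size=n_bins))

# One-sided spectrum: bins above N/2 mirror those below for a real signal.
amp = np.abs(np.fft.rfft(x))

# Rank frequency bins by decreasing amplitude and keep the top k
# (k is a heuristic cutoff, not theoretically motivated).
k = 5
top = np.argsort(amp)[::-1][:k]
print(top)
```

With a clean signal like this, the planted frequencies 7 and 96 come out on top; on real behavioral data the ranking is noisier and the choice of `k` matters more.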

My approach was a little different, expanding on the FFT. There is likely to be some autocorrelation in the time series as well, so why not fold in a measure to capture that, e.g., the Durbin-Watson statistic or related metrics? There can also be issues with trend, cointegration, and unit roots; tests such as the Augmented Dickey-Fuller or Phillips-Perron could express those relationships. Finally, Rob Hyndman put out a paper on clustering time series a few years ago that decomposed each series into features such as the moments of its distribution and clustered on those metrics.
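Two of the per-series features mentioned above are easy to sketch with plain numpy; the Augmented Dickey-Fuller test would come from a library such as `statsmodels` (`adfuller`) and is omitted here. The function names and the moment set are my own illustration, not a specific recipe from any paper:

```python
import numpy as np

def durbin_watson(x):
    """Durbin-Watson statistic on a demeaned series: values near 2 mean
    little lag-1 autocorrelation, values near 0 mean strong positive
    autocorrelation."""
    e = x - x.mean()
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def moment_features(x):
    """Mean, standard deviation, skewness, and excess kurtosis -- one
    possible distribution-moment feature set for clustering."""
    mu, sd = x.mean(), x.std()
    z = (x - mu) / sd
    return np.array([mu, sd, np.mean(z ** 3), np.mean(z ** 4) - 3.0])

rng = np.random.default_rng(2)
x = np.cumsum(rng.normal(size=672))   # random walk: strongly autocorrelated
w = rng.normal(size=672)              # white noise: no autocorrelation

print(durbin_watson(x))               # close to 0
print(durbin_watson(w))               # close to 2
print(moment_features(w).round(2))
```

Concatenating such features with a truncated amplitude spectrum gives one feature vector per user that captures both periodic and non-periodic structure.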

Anyway, those are some thoughts...
