Solved – How can the Mini Batch K-Means partial_fit method be useful for stream clustering?

clustering, k-means, machine learning, online-algorithms, scikit learn

Currently, I'm studying advances in cluster analysis, specifically stream clustering. I ended up assessing Mini Batch K-Means because of some comments I read on the Internet, like the following one:

Many clustering algorithms can be tweaked to be suitable for stream clustering. I don't know of many implementations in scikit-learn that do it out of the box other than MiniBatchKMeans and Birch, which both have a partial_fit method allowing you to stream data through in incremental updates.

I'm familiar with online-offline stream clustering algorithms that use micro clusters to sum up the information, processing every element of the data set once.

Now, regarding the quotation, how can the partial_fit method be useful for streams? Or for 'stream simulation' with time series data, at least. After reading this example, it seemed to me that the whole MiniBatchKMeans procedure of selecting random batches over several iterations is repeated every time you call partial_fit, and I do not understand:

  • how the final labeling is done. I mean, how can you get the final label for each element after calling partial_fit with many subsets of elements? As far as I can tell, you can only get the final centroids at the end, via mbk.cluster_centers_.

  • how it is useful for streams. I think every element may be processed more than once if it is randomly chosen to be part of a batch in more than one iteration of a given partial_fit call.

Any help would be appreciated. Thanks in advance 🙂

Best Answer

In stream clustering you assume cluster centers move over time.

You aren't much interested in the "final" location (because the stream never ends), but only in their current locations.

If you use Mini Batch K-Means on a stream, you'd feed each batch of points from the stream into Mini Batch K-Means once, as if it were a new random sample. You don't process any data repeatedly; instead, you assume the stream has enough samples to provide redundancy.
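As a rough illustration of the above, here is a minimal sketch using scikit-learn's MiniBatchKMeans. The stream_batches generator, the drifting two-blob data, and the choice to label each batch with predict right after updating are all assumptions made for this example; they are not part of the question or of any particular streaming setup. Each batch is passed to partial_fit exactly once, and the "current" centers are read from cluster_centers_ at any time.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

def stream_batches(n_batches=50, batch_size=100):
    """Hypothetical stream: two Gaussian blobs whose centers drift slowly."""
    for t in range(n_batches):
        drift = 0.05 * t
        centers = np.array([[0.0 + drift, 0.0], [5.0 + drift, 5.0]])
        which = rng.integers(0, 2, size=batch_size)
        yield centers[which] + rng.normal(scale=0.5, size=(batch_size, 2))

# n_init is set explicitly only to avoid version-dependent default warnings.
mbk = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)

labels_per_batch = []
for batch in stream_batches():
    # One incremental update per batch; the batch is never revisited.
    mbk.partial_fit(batch)
    # Assumption for this sketch: label points against the current centers
    # as they arrive, since there is no "final" pass over a stream.
    labels_per_batch.append(mbk.predict(batch))

# Centers as of the most recent batch.
current_centers = mbk.cluster_centers_
```

In this sketch a point's label reflects the centers at the time it arrived, not some final model. Also note that, as far as I know, MiniBatchKMeans updates each center as a running mean weighted by how many samples it has absorbed so far, so it reacts more slowly to drift the longer the stream runs.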
