Solved – Clustering sequences of data with different lengths

clustering, k-means, MATLAB, sequential-pattern-mining

I need to cluster sequences of data that have different lengths.

I am using Matlab and my first question is related to the method.

Is KMeans sufficient to achieve this?

In k-means I have to use the following command to cluster a set of data stored in a matrix A:

 [IDX1,E] = kmeans(A,5);

So, my second question has to do with the fact that I don't know how to create the matrix for my case.

My data have the following format:

1 15 1 1 13 14;
1 1 1 1 12 1 7 11 9 11 7 11 7 11 7 4 7 7 14 15 9 2;
13 1 13 15 13 2 9 2 9 2 2 2 2 2 2 2;
1 2 9 1 6 10 6 1 6 10 14 3 10;

Assume that each row belongs to a different user.
What I need is to find clusters of similar behaviour/sequences. Do you know if I can proceed with KMeans and, if so, how to create the matrix?

Best Answer

One way to do it (among many others) is to treat each element of your sequence as a word. In other words, if you treat each list as a sentence, then you can extract n-grams.

from nltk import ngrams  # only needed if you later extract n-grams yourself

a = [1, 15, 1, 1, 13, 14]
b = [1, 1, 1, 1, 12, 1, 7, 11, 9, 11, 7, 11, 7, 11, 7, 4, 7, 7, 14, 15, 9, 2]
c = [13, 1, 13, 15, 13, 2, 9, 2, 9, 2, 2, 2, 2, 2, 2, 2]
d = [1, 2, 9, 1, 6, 10, 6, 1, 6, 10, 14, 3, 10]

# turn each sequence into a "sentence" such as 'x1,x15,x1,x1,x13,x14'
bb = [','.join('x' + str(e) for e in seq) for seq in (a, b, c, d)]

I added the x because CountVectorizer's default tokenizer ignores single-character tokens. Let's do a word count; alternatively, you can go ahead with n-grams as well (see the scikit-learn documentation).

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(bb)
X.toarray()
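If the x prefix feels like a hack, you can instead widen CountVectorizer's token_pattern so that single-character tokens survive. A minimal sketch of this variant, assuming the same four sequences (the token_pattern choice is my suggestion, not part of the original answer):

```python
from sklearn.feature_extraction.text import CountVectorizer

# the four user sequences from the question
sequences = [
    [1, 15, 1, 1, 13, 14],
    [1, 1, 1, 1, 12, 1, 7, 11, 9, 11, 7, 11, 7, 11, 7, 4, 7, 7, 14, 15, 9, 2],
    [13, 1, 13, 15, 13, 2, 9, 2, 9, 2, 2, 2, 2, 2, 2, 2],
    [1, 2, 9, 1, 6, 10, 6, 1, 6, 10, 14, 3, 10],
]
docs = [' '.join(str(e) for e in seq) for seq in sequences]

# The default token_pattern r'(?u)\b\w\w+\b' requires two or more characters,
# which is why bare single digits vanish; widening it keeps them, so no 'x'
# prefix is needed (hypothetical variant, not from the original answer).
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(docs)
```

Either way you end up with the same 4 x 13 count matrix, just with numeric column names instead of x-prefixed ones.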

The output looks like this:

array([[3, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [5, 0, 4, 1, 0, 1, 1, 1, 0, 1, 0, 6, 2],
       [1, 0, 0, 0, 3, 0, 1, 9, 0, 0, 0, 0, 2],
       [3, 3, 0, 0, 0, 1, 0, 1, 1, 0, 3, 0, 1]])

Basically, the columns correspond to the words, which are

print(vectorizer.get_feature_names())  # in scikit-learn >= 1.0, use get_feature_names_out()

['x1', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x2', 'x3', 'x4', 'x6', 'x7', 'x9']

and rows are your samples.
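Plain word counts discard the order of events. The n-gram route mentioned above keeps some of it: with ngram_range, CountVectorizer also counts adjacent pairs, so 'x7 x11' and 'x11 x7' become distinct features. A sketch on the same four "sentences":

```python
from sklearn.feature_extraction.text import CountVectorizer

bb = [
    'x1,x15,x1,x1,x13,x14',
    'x1,x1,x1,x1,x12,x1,x7,x11,x9,x11,x7,x11,x7,x11,x7,x4,x7,x7,x14,x15,x9,x2',
    'x13,x1,x13,x15,x13,x2,x9,x2,x9,x2,x2,x2,x2,x2,x2,x2',
    'x1,x2,x9,x1,x6,x10,x6,x1,x6,x10,x14,x3,x10',
]

# ngram_range=(1, 2) counts single words and adjacent pairs, so the feature
# matrix also reflects which symbol follows which (some ordering survives)
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(bb)
```

The resulting matrix is wider (one column per unigram and per observed bigram) but feeds into clustering exactly like the unigram counts.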

Now that you have a feature matrix, you can go ahead and cluster it, for example with k-means:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_

which results in

array([0, 1, 0, 0], dtype=int32)
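One caveat: the sequences have quite different lengths, so the longer ones dominate the raw counts. A hedged variation is to L2-normalise each row first, so k-means compares symbol proportions rather than absolute frequencies (normalize and the n_init parameter are standard scikit-learn tools, but applying them here is my suggestion, not part of the original answer):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

bb = [
    'x1,x15,x1,x1,x13,x14',
    'x1,x1,x1,x1,x12,x1,x7,x11,x9,x11,x7,x11,x7,x11,x7,x4,x7,x7,x14,x15,x9,x2',
    'x13,x1,x13,x15,x13,x2,x9,x2,x9,x2,x2,x2,x2,x2,x2,x2',
    'x1,x2,x9,x1,x6,x10,x6,x1,x6,x10,x14,x3,x10',
]
X = CountVectorizer().fit_transform(bb)

# Longer sequences produce larger raw counts; L2-normalising each row makes
# k-means cluster on the *proportions* of symbols instead.
Xn = normalize(X)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(Xn)
labels = kmeans.labels_
```

Whether normalisation helps depends on whether "how often" or "what fraction of the time" better captures similar behaviour for your users.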