I need to cluster sequences of data that have different lengths.
I am using Matlab and my first question is related to the method.
Is KMeans sufficient to achieve this?
In KMeans, I have to use the following command to cluster a set of data stored in a matrix A:
[IDX1,E] = kmeans(A,5);
So, my second question has to do with the fact that I don't know how to create the matrix for my case.
My data have the following format:
1 15 1 1 13 14;
1 1 1 1 12 1 7 11 9 11 7 11 7 11 7 4 7 7 14 15 9 2;
13 1 13 15 13 2 9 2 9 2 2 2 2 2 2 2;
1 2 9 1 6 10 6 1 6 10 14 3 10;
Assume that each row belongs to a different user.
What I need is to find clusters of similar behaviour/sequences. Do you know if I can proceed with KMeans and if so, how to create the matrix?
Best Answer
One way to do it (among many others) is to treat each element of your sequence as a word. In other words, if you treat each list as a sentence, you can extract word counts or n-grams from it.
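A minimal sketch of that conversion in Python, using the four sequences from the question. The helper name `to_sentence` and its `prefix` argument are illustrative choices, not part of any library.

```python
# The four example sequences from the question, one per user.
sequences = [
    [1, 15, 1, 1, 13, 14],
    [1, 1, 1, 1, 12, 1, 7, 11, 9, 11, 7, 11, 7, 11, 7, 4, 7, 7, 14, 15, 9, 2],
    [13, 1, 13, 15, 13, 2, 9, 2, 9, 2, 2, 2, 2, 2, 2, 2],
    [1, 2, 9, 1, 6, 10, 6, 1, 6, 10, 14, 3, 10],
]


def to_sentence(seq, prefix="x"):
    """Join a sequence into one string, turning each element into a 'word'."""
    return " ".join(f"{prefix}{value}" for value in seq)


# One "sentence" per user; sequence length no longer matters at this point.
sentences = [to_sentence(seq) for seq in sequences]
print(sentences[0])  # x1 x15 x1 x1 x13 x14
```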
I added the x prefix because it seems CountVectorizer neglects single numbers/letters.
Let's do a word count; alternatively, you can go ahead with n-grams as well (read the sklearn documentation). The output is a matrix of word counts.
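The word-count step could be sketched as follows, assuming scikit-learn is installed. The variable names are illustrative, and the default `CountVectorizer` settings are used; passing `ngram_range=(1, 2)` would add bigrams to the features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The question's sequences, already written as "sentences" of x-prefixed words.
sentences = [
    "x1 x15 x1 x1 x13 x14",
    "x1 x1 x1 x1 x12 x1 x7 x11 x9 x11 x7 x11 x7 x11 x7 x4 x7 x7 x14 x15 x9 x2",
    "x13 x1 x13 x15 x13 x2 x9 x2 x9 x2 x2 x2 x2 x2 x2 x2",
    "x1 x2 x9 x1 x6 x10 x6 x1 x6 x10 x14 x3 x10",
]

# The default token pattern only keeps tokens of two or more characters,
# which is why a bare "1" would be dropped and "x1" is kept.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)  # sparse matrix: samples x words

# Column order of X: vocabulary_ maps each word to its column index.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(feature_names)
print(X.toarray())
```

Every row has the same number of columns regardless of how long the original sequence was, which is what makes the matrix usable for clustering.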
Basically, the columns correspond to the words (the distinct x-prefixed elements) and the rows are your samples.
Now that you have a feature matrix, you can go ahead and do clustering, for example with k-means, which results in one cluster label per sample.
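A hedged sketch of the clustering step with scikit-learn's `KMeans`. The choice of 2 clusters is an assumption made only because the example has four samples (the question's `kmeans(A,5)` would need at least five rows), and `random_state` and `n_init` are set just to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Same four "sentences" built from the question's sequences.
sentences = [
    "x1 x15 x1 x1 x13 x14",
    "x1 x1 x1 x1 x12 x1 x7 x11 x9 x11 x7 x11 x7 x11 x7 x4 x7 x7 x14 x15 x9 x2",
    "x13 x1 x13 x15 x13 x2 x9 x2 x9 x2 x2 x2 x2 x2 x2 x2",
    "x1 x2 x9 x1 x6 x10 x6 x1 x6 x10 x14 x3 x10",
]

# Fixed-width feature matrix: one row per user, one column per word.
X = CountVectorizer().fit_transform(sentences)

# Assumed settings: 2 clusters, fixed seed, 10 restarts.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# One integer label per user, in the same order as `sentences`.
for user_index, label in enumerate(labels):
    print(f"user {user_index}: cluster {label}")
```

Raw counts favour long sequences, so normalising the rows (for example with `TfidfVectorizer` in place of `CountVectorizer`) may be worth trying before clustering.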