Solved – Combine Clustering and classification

classification, clustering

I have a database of receipts from a grocery store. I would like to find classes of similar customers based on their receipts, and then assign each person to one of these classes after they finish shopping.

Let us assume that the question of how to do the clustering and the classification in this particular case is not important. Let us further assume that we have no expert knowledge with which to label the clusters that the clustering finds. Finally, assume that the clustering is a long-running task because of the amount of data, so it is not run very often.

So I would run a cluster analysis on the data. The result is a number of clusters that are 'unlabelled' (they have no meaningful names, but they are distinguishable), each containing customers with similar characteristics. I would then train a classifier on the elements of each cluster, using the cluster assignment as the class label. I can later use this classifier to assign a person who, e.g., has just finished shopping to one of the classes.
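To make the idea concrete, here is a minimal sketch of the pipeline in plain Python. Everything in it is an assumption for illustration only: the two customer features (average basket value and share of fresh produce), the synthetic data, and the choice of k-means plus a k-nearest-neighbour classifier. Any other clustering or classification method could be swapped in.

```python
import random
from collections import Counter
from math import dist


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)

    def nearest(p):
        return min(range(k), key=lambda c: dist(p, centroids[c]))

    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [nearest(p) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, [nearest(p) for p in points]


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


# Hypothetical features per customer: (average basket value, share of fresh produce).
rng = random.Random(42)
budget = [(rng.gauss(20, 3), rng.gauss(0.2, 0.05)) for _ in range(50)]
premium = [(rng.gauss(80, 3), rng.gauss(0.6, 0.05)) for _ in range(50)]
customers = budget + premium

# Step 1 (expensive, run rarely): cluster all known customers.
centroids, labels = kmeans(customers, k=2)

# Step 2 (cheap, run per customer): classify a new shopper using the
# cluster assignments as class labels.
new_class = knn_predict(customers, labels, (78.0, 0.55))
print("new customer assigned to cluster", new_class)
```

The cluster numbers themselves carry no meaning; they only identify which group of existing customers the new shopper resembles.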

Do you think this approach makes sense, or is there something I am overlooking when combining clustering and classification?

Best Answer

From the training point of view, you can do this. However, I'd still call the result a (predictive) clustering model rather than a classification.

There are two points you need to keep in mind.

  • Training a classifier on cluster analysis results means that by design the "classes" can be separated.

  • Validation measures the performance of the classifier's predictions. To obtain a useful measure, you need cases whose class labels are independent of your model. You will not really have these, because the cluster analysis guarantees that your cases can be separated (you generate the classes as part of your model).

However, if this isn't important, or if you have a more general (and independent) way of validating, e.g. comparing predicted buying behaviour with observed buying behaviour, IMHO you are fine.
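The validation caveat can be illustrated with a small experiment. This is only a sketch under assumptions of my own: uniform random points stand in for data with no real customer groups, k-means with k = 3 is the clustering, and a 1-nearest-neighbour rule is the classifier. Because the labels come from the clustering itself, agreement on held-out points tends to be high even though there is nothing real to discover.

```python
import random
from math import dist


def kmeans_labels(points, k, iters=30, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)

    def nearest(p):
        return min(range(k), key=lambda c: dist(p, centroids[c]))

    for _ in range(iters):
        labels = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return [nearest(p) for p in points]


rng = random.Random(1)
# Uniform noise: there are no real groups in this data.
points = [(rng.random(), rng.random()) for _ in range(300)]
labels = kmeans_labels(points, k=3)

# Hold out a third of the points, but note that their "true" labels
# still come from the same clustering that produced the training labels.
train_x, train_y = points[:200], labels[:200]
test_x, test_y = points[200:], labels[200:]


def predict_1nn(x):
    """Label of the single closest training point."""
    i = min(range(len(train_x)), key=lambda j: dist(train_x[j], x))
    return train_y[i]


accuracy = sum(predict_1nn(x) == y for x, y in zip(test_x, test_y)) / len(test_x)
print(f"held-out agreement with cluster labels: {accuracy:.2f}")
```

A high score here only says that the classifier reproduces the cluster boundaries. It says nothing about whether the clusters correspond to meaningful customer types, which is why an independent check such as observed buying behaviour is needed.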
