Solved – Clustering Algorithm for labeled data

clumpingclusteringk-meansmachine learningsupervised learning

This is more of a theoretical/solving an argument sort of question.

Assuming I have a bunch of data point with 11 features I consider relevant about each point and 2 "labels": one is a boolean label ( 0 or 1), one is a continuous "label" (thought I'm not sure the word label really applies here).

What I want to do is basically: Find a way to predict both of these labels as accurately as possible. Find the features which influence these labels the most (even if by finding said features I'm unable to find a strong equation for actually predicting the labels).

Also, of note would be the fact that, although I care about both labels, predicting the second would automatically mean I predict the first and predicting the first is enough to "solve" the problem in most cases.
With me until now? Ok, perfect.

Someone suggested to me that this would be a perfect scenario to apply clustering algorithms, but I'm not quite sure why. I was under the impression that clustering, generally speaking, is used to label unlabeled data (I have labeled data here) and for actually making predictions on unlabeled data a regression (be it with a SVM, a NN or w/e) or classification type algorithm is what can be used.

The only situation in which I could see clustering being used here is to turn the continuous label into a fixed number of labels.

Is there another situation in which a clustering algorithm can be applied to this sort of problem? Are there ways to use clustering algorithms to predict already existing labels or figure out if there is even a relation at all between a given label and a feature? Can clustering algorithms be used to filter out irrelevant features?

Best Answer

Clustering algorithms will always perform much much worse compared to classification methods.

If you have labels, use classification or regression instead of clustering!

The reason is simple:

The clustering does not know which problem to solve, and there may well be more than one. You have no control over it solving anything related to your task.

For example, you might want to predict of a user will buy the product (classification) or the amount of money/time he spends on your site (regression). A clustering algorithm, since it does no know what you are interested in, may consider it more important to cluster your users by the weekday the, first visited your site... which may be a very good clustering, yet essentially useless for your business.