Solved – Is it necessary to split data in clustering like in supervised learning

clustering, unsupervised learning

I'm learning clustering analysis and one book I read says the clustering model should be applied to a disjoint data set to examine the consistency of the model.

I think in clustering analysis we don't need to split the data into train and test sets like in supervised learning since without labels there is nothing to "train".

So what is the possible meaning of this "consistency"? How is it evaluated? Is this disjoint data set really necessary?

Thank you!

Edit: There isn't really a broader context. The text talks about how to select the optimal number of clusters and then mentions this. I don't think this consistency is about the number of clusters…

Best Answer

A train-test split is used to detect overfitting in supervised machine learning.

In unsupervised clustering there are no labels to evaluate against, and thus you cannot overfit in this way.

You can, however, overfit in different ways, e.g. by choosing an unsupervised evaluation criterion that measures a quantity that your clustering procedure also optimizes. You then don't get the best result; you just prefer the algorithm that is most closely related to your evaluation procedure. Don't use such measures to compare different algorithms.
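To make that last point concrete, here is a minimal sketch, assuming scikit-learn is available. The two-moons data set, the choice of single-linkage clustering as the competitor, and the helper `within_cluster_sse` are all picked for illustration and are not from the question. K-means minimizes the within-cluster sum of squares, so scoring both algorithms by that same quantity favors k-means regardless of which partition is more meaningful.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score


def within_cluster_sse(X, labels):
    """Sum of squared distances from each point to its own cluster mean."""
    return sum(
        ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
        for k in np.unique(labels)
    )


# Synthetic data; the generating labels are kept only to illustrate the point.
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sl_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# Criterion that k-means itself optimizes: it will look at least as good here.
km_sse = within_cluster_sse(X, km_labels)
sl_sse = within_cluster_sse(X, sl_labels)
print("SSE  k-means:", km_sse, " single-linkage:", sl_sse)

# Agreement with the generating labels (normally unavailable in clustering).
print("ARI  k-means:", adjusted_rand_score(true_labels, km_labels),
      " single-linkage:", adjusted_rand_score(true_labels, sl_labels))
```

A lower SSE for k-means here says only that k-means is good at minimizing SSE, not that its clusters are the better ones.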

Related Solutions

Solved – Supervised or Unsupervised Clustering

K-means is "unsupervised" by definition: it does not take the labels into account.

You however performed a "supervised initialization".

So I'd call this an unsupervised algorithm that has been initialized in a supervised manner.

And no, I don't think it makes a lot of sense to do it this way.
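For reference, the setup described above might look like the following sketch. It assumes scikit-learn, synthetic data from `make_blobs`, and that "supervised initialization" means using per-class means as the starting centroids; the original question's exact procedure is not shown here, so treat all of that as an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with known labels (illustrative only).
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# "Supervised" step: the labels are used once, to compute starting centroids.
classes = np.unique(y)
init_centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])

# Unsupervised step: k-means never sees y; it only refines the centroids.
km = KMeans(n_clusters=len(classes), init=init_centroids, n_init=1).fit(X)

print(km.cluster_centers_)
```

Passing an array as `init` together with `n_init=1` makes k-means start from exactly those centroids instead of drawing its own.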

Solved – Understanding the difference between Supervised and unsupervised learning

Let's look at a simple example of trying to predict housing prices. Assume we have a dataset that looks like

Cost | Sq Ft | N bedrooms
100K | 1,800 | 4
120K | 1,300 | 3
220K | 2,200 | 5

In the case of supervised learning we would know the cost (these are our y labels) and we would use our set of features (Sq ft and N bedrooms) to build a model to predict the housing cost. The formula would look like

Cost ~ Sq Ft + N bedrooms
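As a rough sketch of what fitting that formula means, here is ordinary least squares with NumPy on the three rows above. The intercept column and the hypothetical new house are assumptions added for illustration. With only three rows and three coefficients the fit is exact, so this shows the mechanics rather than a useful model.

```python
import numpy as np

# Features from the toy table: square footage and number of bedrooms.
X = np.array([[1800, 4], [1300, 3], [2200, 5]], dtype=float)
# Labels: cost in thousands of dollars.
y = np.array([100, 120, 220], dtype=float)

# Add an intercept column: cost = b0 + b1 * sq_ft + b2 * n_bedrooms.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Fitted costs for the training rows, and a prediction for a hypothetical house.
fitted = A @ coef
new_house = np.array([1.0, 2000, 4])  # intercept term, sq ft, bedrooms
print("fitted:", fitted)
print("predicted cost (K):", new_house @ coef)
```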

Now in unsupervised learning we would not know the cost of the house, but we would still know the features. We would therefore fit a model that tries to group similar houses together. For an example of this, look at k-means clustering (http://scikit-learn.org/stable/modules/clustering.html#clustering)
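A minimal sketch of the unsupervised version, assuming scikit-learn's `KMeans` and an arbitrary choice of two clusters: only the features from the toy table are used and the cost column is never touched. The features are left unscaled for brevity, so square footage dominates the distances. In practice you would usually standardize them first.

```python
import numpy as np
from sklearn.cluster import KMeans

# Only the features: square footage and number of bedrooms. No cost column.
X = np.array([[1800, 4], [1300, 3], [2200, 5]], dtype=float)

# Two clusters is an arbitrary choice for this tiny example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(X)

# One cluster index per house; the index values themselves carry no meaning.
print(groups)
```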

This is a great, free book that covers this very nicely (http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf)

Each type of learning method may have a set of parameters, which are called model parameters. The training phase is used to find the set of parameters that generalizes from your data the best. That book also gives very nice information on the different learning methods and their parameters.

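For example, in the learning algorithm called SVM (with an RBF kernel) there is a term that looks like $\exp(-\gamma\|x-x'\|^{2})$, where $x$ and $x'$ are two data points. In this example the $\gamma$ parameter is what we try to optimize using the training data (in practice it is usually tuned by cross-validation rather than learned directly).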
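To show what $\gamma$ does, here is a small sketch that assumes the term above is the usual RBF kernel $\exp(-\gamma\|x-x'\|^{2})$ between two points $x$ and $x'$. The function name `rbf_kernel_value` and the sample points are made up for illustration.

```python
import numpy as np


def rbf_kernel_value(x, x_prime, gamma):
    """RBF kernel exp(-gamma * ||x - x'||^2) between two points."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


a = [1.0, 2.0]
b = [2.0, 4.0]

# Larger gamma makes similarity fall off faster with distance.
for gamma in (0.01, 0.1, 1.0):
    print(gamma, rbf_kernel_value(a, b, gamma))
```

A point compared with itself always gives 1, and the value shrinks toward 0 as the points move apart or as $\gamma$ grows.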
