Solved – Use clustering to create labels of unlabeled data and then classify a test set (available or not in the clustering)

accuracy, classification, clustering, labeling, semi-supervised-learning

Let's say that I use Dynamic Time Warping (DTW) along with K-Medoids to cluster unlabeled time series. Running the clustering for several cluster counts $k_i$, $i = 1, \dots, m$, yields several clustering solutions, each of which provides 'ground truth' labels for all the instances.
UPDATE: My goal is to build a predictive model for new time-series instances, which are unlabeled a priori. Initially the data are completely unlabeled. The clustering aims to produce a robust cluster labeling, while the classification is intended to predict the cluster membership of new data.

Classification after clustering:
A. – Is it correct to split this dataset into a training set and a test set for classification purposes, build several classification models on the training set, and measure the overall accuracy by applying these models to the test set (using the "ground truth" labels)?
– Or should the test set be left out of the clustering? In that case, can I create its labels for the classification by assigning each test instance the class label of the closest cluster center, since these centers are derived from clustering the training set only?
– In other words, is the classification biased by test-set labels that are created during a clustering process in which the test set participates in shaping all the pairwise distances, and consequently the clustering decision boundaries?

B. – If so, is a good classification accuracy an indicator of an appropriate clustering into $k_j$ clusters?

C. – Is a deep learning approach more appropriate here? The few labeled data could be the cluster centers or some time-series profiles selected and verified by a domain expert.

Best Answer

If you used k-medoids, you do not need to train a classifier afterwards: every new object should be assigned to the nearest medoid...
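
To illustrate the nearest-medoid rule, here is a minimal sketch in Python. It assumes the medoids have already been obtained from the clustering step; the function names `dtw_distance` and `assign_to_nearest_medoid` and the toy medoids are hypothetical, chosen only for illustration. The DTW shown is the plain dynamic-programming version with an absolute-difference cost and no warping window, which may differ from the variant used in the original clustering.

```python
import numpy as np


def dtw_distance(a, b):
    """Plain DTW between two 1-D sequences (absolute-difference cost, no window)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # step in a only
                                 cost[i, j - 1],      # step in b only
                                 cost[i - 1, j - 1])  # step in both
    return cost[n, m]


def assign_to_nearest_medoid(series, medoids):
    """Return the index of the medoid with the smallest DTW distance to `series`."""
    distances = [dtw_distance(series, medoid) for medoid in medoids]
    return int(np.argmin(distances))


# Hypothetical medoids, standing in for the output of a k-medoids run.
medoids = [
    [0.0, 0.0, 1.0, 2.0, 1.0, 0.0],  # cluster 0: a single bump
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # cluster 1: a flat profile
]

# A new, unlabeled series: the same bump, shifted and stretched in time.
new_series = [0.0, 1.0, 2.0, 2.0, 1.0, 0.0, 0.0]
label = assign_to_nearest_medoid(new_series, medoids)
print(label)  # 0
```

Because the medoids are computed from the clustered data only, a held-out series labeled this way has no influence on the pairwise distances that shaped the clusters.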

Deep learning typically requires massive amounts of labeled training data!
