For my work, I am using the multi-label datasets from this webpage. A few of the datasets listed on the page (e.g., bibtex) have nominal attributes, i.e., the attribute values are 0 and 1.
My questions are given below:
- Is it valid to run the K-means clustering algorithm on these nominal datasets to get meaningful centers and target labels?
- Otherwise, to run K-means (ignoring the target labels), I need to convert the nominal dataset into a numerical one. What is the standard procedure for doing this? I could normalize each instance, but that only gives every nonzero attribute of an instance the same real value.
- I would also like to reduce the dimensionality of a nominal dataset such as rcv1v2. How do I go about it? I could use any feature selection technique, but each one requires an optimization criterion. In my case, I need to compare the results of different algorithms on this dataset, and those algorithms have different optimization criteria, so I do not know which criterion to choose. Is there any technique for selecting the top features?
Best Answer
Although you can formally run K-means clustering on nominal data after converting the nominal variables into dummy variables, this is regarded as an inadequate approach. To use K-means meaningfully, all variables must be at the scale (interval or ratio) level.
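For reference, the dummy-variable conversion mentioned above could look roughly like the sketch below. The helper name `to_dummies` and the toy rows are illustrative assumptions, not part of any library. Note that an attribute already coded as 0/1 is effectively its own dummy variable, so for data like bibtex this step changes little.

```python
def to_dummies(rows):
    """One-hot encode rows of nominal values (hypothetical helper).

    Returns (labels, encoded), where labels[i] is the
    (variable index, category) pair behind dummy column i.
    """
    n_vars = len(rows[0])
    # Distinct categories of each variable, in sorted order.
    categories = [sorted({row[j] for row in rows}) for j in range(n_vars)]
    labels = [(j, c) for j in range(n_vars) for c in categories[j]]
    # One 0/1 column per (variable, category) pair.
    encoded = [[1 if row[j] == c else 0 for (j, c) in labels] for row in rows]
    return labels, encoded


# Toy data: two nominal variables (colour, answer).
rows = [["red", "yes"], ["blue", "no"], ["red", "no"]]
labels, encoded = to_dummies(rows)
for row in encoded:
    print(row)
```

The result is numeric in form only: each column is still a 0/1 indicator, which is why the answer treats K-means on such columns with caution.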
One way to quantify a set of nominal variables is to apply multiple correspondence analysis (MCA). It can be seen as a dimension-reduction technique, like PCA, but for nominal data. You could use the resulting quantifications (the coordinates) as the input to K-means, if you like.
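A minimal sketch of that pipeline is given below, assuming NumPy is available. MCA is computed here as a correspondence analysis of the indicator matrix via an SVD, and the row principal coordinates are then clustered with a bare-bones Lloyd's K-means. The function names `mca_row_coordinates` and `kmeans`, the toy matrix, and the choice of two components and two clusters are all illustrative assumptions; a dedicated, tested implementation would normally be preferable for real datasets.

```python
import numpy as np


def mca_row_coordinates(X, n_components=2):
    """Row principal coordinates from MCA of a 0/1 data matrix X."""
    X = np.asarray(X, dtype=float)
    # Indicator matrix: one column per category ("is 1", "is 0") of each variable.
    Z = np.hstack([X, 1.0 - X])
    Z = Z[:, Z.sum(axis=0) > 0]      # drop categories that never occur
    P = Z / Z.sum()                  # correspondence matrix
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    # Row principal coordinates: D_r^{-1/2} U diag(s).
    F = (U * s) / np.sqrt(r)[:, None]
    return F[:, :n_components]


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initial centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; an empty cluster keeps its previous center.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Toy binary data: 6 instances, 4 nominal (0/1) attributes.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])
coords = mca_row_coordinates(X, n_components=2)
labels, centers = kmeans(coords, k=2)
print(coords)
print(labels)
```

The coordinates are continuous, so the cluster centers returned by K-means live in the MCA space rather than in the original 0/1 attribute space. How many components to keep is a separate decision that this sketch does not address.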