Solved – How to convert nominal dataset into numerical dataset

clustering, dataset, feature selection

For my work, I'm using the multilabel datasets from this webpage. A few of the datasets listed on the page (e.g., bibtex) have nominal attributes, i.e., the attribute values are 0 and 1.

My questions are given below.

  1. Is it valid to run the K-means clustering algorithm on these nominal datasets to get meaningful centers and target labels?

  2. Otherwise, to run K-means (forget about the target labels), I need to convert the nominal dataset into a numerical dataset. What is the standard procedure for doing that? I can normalize each instance, but that just gives me one real number, repeated with the same value across the instance.

  3. I would also like to reduce the dimension of a nominal dataset such as rcv1v2. How do I go about it? I could use any feature selection technique, but each requires an optimization criterion. In my case, I need to compare the results of different algorithms on this dataset, and they have different optimization criteria, so I can't decide which criterion to choose. Is there a technique for simply selecting the top features?

Best Answer

Although you can formally run K-means clustering on nominal data after converting the nominal variables into dummy variables, this is regarded as an inadequate approach. K-means works with means and Euclidean distances, so to use it meaningfully, all variables must be at scale (interval or ratio) level.
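To make the dummy-variable point concrete, here is a minimal sketch in plain Python. The helper names `dummy_code` and `centroid` and the toy rows are made up for illustration and are not from the datasets in the question. It dummy-codes two nominal variables and then takes the mean of the coded rows, which is what K-means would use as a cluster center.

```python
def dummy_code(rows):
    """One-hot (dummy) encode rows of nominal values.

    Returns (column_names, encoded_rows); columns are named 'index=value'.
    """
    n_cols = len(rows[0])
    # Sorted distinct categories observed for each nominal variable.
    categories = [sorted({row[j] for row in rows}) for j in range(n_cols)]
    names = [f"{j}={c}" for j in range(n_cols) for c in categories[j]]
    encoded = [
        [1 if row[j] == c else 0 for j in range(n_cols) for c in categories[j]]
        for row in rows
    ]
    return names, encoded


def centroid(points):
    """Arithmetic mean of equal-length vectors, i.e. a K-means center."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


# Hypothetical toy data: two nominal variables per instance.
rows = [("red", "yes"), ("blue", "yes"), ("red", "no"), ("green", "yes")]
names, X = dummy_code(rows)
centre = centroid(X)

print(names)   # ['0=blue', '0=green', '0=red', '1=no', '1=yes']
print(centre)  # [0.25, 0.25, 0.5, 0.25, 0.75]
```

The center is a vector of category proportions, not a valid combination of categories, which is one way to see why the result is hard to interpret as a "typical" nominal instance.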

One way to quantify a set of nominal variables is to apply multiple correspondence analysis (MCA). It can be seen as a dimension-reduction technique like PCA, but for nominal data. You could use the resulting quantifications (the coordinates) as the input to K-means, if you like.
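As a rough sketch of how that quantification could be computed, the snippet below runs correspondence analysis on an indicator (dummy) matrix with NumPy, which is one common formulation of multiple correspondence analysis. The function name `mca_row_coordinates` and the toy matrix `Z` are assumptions for illustration. It assumes every category column occurs at least once, and that a binary 0/1 attribute is coded as two indicator columns, one per value.

```python
import numpy as np


def mca_row_coordinates(Z, n_components=2):
    """Row principal coordinates from correspondence analysis of an
    indicator matrix Z (rows = instances, columns = categories)."""
    Z = np.asarray(Z, dtype=float)
    P = Z / Z.sum()                 # correspondence matrix
    r = P.sum(axis=1)               # row masses
    c = P.sum(axis=0)               # column masses (must all be > 0)
    expected = np.outer(r, c)
    # Standardized residuals from the independence model.
    S = (P - expected) / np.sqrt(expected)
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the rows; squared singular values are
    # the principal inertias of the corresponding dimensions.
    F = (U * sing) / np.sqrt(r)[:, None]
    return F[:, :n_components], sing[:n_components] ** 2


# Hypothetical indicator matrix: 2 nominal variables with 3 + 2 categories.
Z = [
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1],   # same categories as the first instance
]
coords, inertia = mca_row_coordinates(Z, n_components=2)
```

The rows of `coords` are numeric, so they can be passed to any K-means implementation, for example `sklearn.cluster.KMeans(n_clusters=k).fit(coords)` if scikit-learn is available. The number of components to keep is a judgment call; the `inertia` values indicate how much each dimension contributes.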