Solved – Building training set from unlabeled data

classification · data mining · dataset · train

I want to build a classifier using naive Bayes. I know that naive Bayes is a supervised learning method that requires labeled data for the training set. However, I only have unlabeled data.
Is there any method for assigning labels to unlabeled training data, or do I have to label it manually?
I want to label my data with two classes.
My data consists entirely of continuous variables and has three attributes that determine the class.

Edit:

I've found a useful Java program that uses the MVE for multivariate outlier detection at this link: http://www.kimvdlinde.com/professional/mve.html

I used that program to label my data, which is all continuous. What kind of method can I use to label the data if it instead has three categorical attributes? Is it still possible to label it manually?

Best Answer

If the data were only one- or two-dimensional, it would be fairly easy to assign class labels by visual inspection. In three or more dimensions, however, more advanced methods are required. One way to frame your problem is as outlier detection: if one of your classes accounts for more than 50% of the observations, you can treat it as the majority group and the second class as outliers. To make that concept concrete, here is a definition:

An outlier is an observation which does not come from the distribution of the majority of the data.

With this in mind, you can try an outlier detection method. These come from the field of robust statistics and tend to have some nice statistical properties that many machine learning algorithms lack. They tend to be very good at detecting the majority group, because that is precisely what they were designed to do.

A number of these are implemented in R. Since your data is continuous, options include the MCD, SDE, and MVE, implemented in the rrcov package as CovMcd, CovSde, and CovMve respectively. Another option is FastPCS, implemented in the FastPCS package. The paper on FastPCS might be a good introduction to how these methods work. In particular, it explains the parameter $\alpha$ that these methods use: the fraction of the dataset you expect to belong to the majority class. To be safe, you can always set $\alpha = 0.5$ and plot the distances of the observations (all of these algorithms output a distance for each observation after reweighting). If you see a big jump in the distances between two groups in the data, and some observations from the first clump are still flagged as outliers, you can probably increase $\alpha$ a bit to include them; it may be that $\alpha = 0.5$ is a bit too conservative.
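If you would rather stay in Python, here is a minimal sketch of the same idea using scikit-learn's MinCovDet (an MCD implementation), where `support_fraction` plays the role of $\alpha$. The data here is made up purely for illustration: a majority group plus a small cluster of outliers in three dimensions.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
# Hypothetical unlabeled data: 3 continuous attributes,
# a majority group (180 points) plus 20 far-away points.
majority = rng.normal(0, 1, size=(180, 3))
outliers = rng.normal(6, 1, size=(20, 3))
X = np.vstack([majority, outliers])

# support_fraction ~ alpha: the fraction of points assumed to
# belong to the majority group. 0.5 is the safe, conservative choice.
mcd = MinCovDet(support_fraction=0.5, random_state=0).fit(X)
d2 = mcd.mahalanobis(X)  # squared robust distances

# Sorting the distances makes the "big jump" between the majority
# group and the outliers easy to spot before picking a final alpha.
print(np.sort(d2)[-25:])
```

Plotting `np.sort(d2)` (or just inspecting the largest values, as above) is the Python analogue of the distance plot described for the R packages.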

Make sure to follow the examples in each package's documentation so that you run the methods correctly.

In the end, these methods will separate the "good" data from the "outliers," and you can assign your two class labels to these groups.
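Turning the robust distances into labels can be sketched as follows (again in Python with scikit-learn, as an assumed stand-in for the R packages). Under approximate normality, the squared robust distances of the majority group roughly follow a chi-squared distribution with $p = 3$ degrees of freedom, so its 97.5% quantile is a common cutoff:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
# Hypothetical unlabeled data: majority group plus a smaller second group.
X = np.vstack([rng.normal(0, 1, size=(180, 3)),
               rng.normal(6, 1, size=(20, 3))])

mcd = MinCovDet(support_fraction=0.5, random_state=0).fit(X)
d2 = mcd.mahalanobis(X)  # squared robust distances

# Common cutoff: 97.5% quantile of chi-squared with p = 3 df,
# since the data has three continuous attributes.
cutoff = chi2.ppf(0.975, df=3)
labels = np.where(d2 > cutoff, "outlier", "majority")
print(np.unique(labels, return_counts=True))
```

The resulting `labels` array is the training-set labeling you were after; you can rename the two groups to whatever your two classes are before feeding them to naive Bayes.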
