I'm currently taking a data mining class, and for one of our projects we're required to predict the class labels of an unknown data set by first building a classifier on a training data set that already provides the class label.
We're only required to get an accuracy of 80% to get a full mark on the assignment. I have already achieved this using the J48 Decision Tree algorithm (acc=84.08%).
There is also an ongoing competition on who can get the highest accuracy (determined by a Judge system we can't see).
I have two questions:
- How can I use an ensemble method to do this?
- Is there a way to optimize the parameters for each classifier?
import java.io.*;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.*;
import weka.classifiers.trees.*;

public class CompClassifier {

    public static FileOutputStream Output;
    public static PrintStream file;

    public static void main(String[] args) throws Exception {
        // Load training data
        Instances training_data = new Instances(
                new FileReader("/Users//Weka/training.arff"));

        // Load test data
        Instances test_data = new Instances(
                new FileReader("/Users//Weka/unknown.arff"));

        // Replace missing values in the training data
        ReplaceMissingValues replace = new ReplaceMissingValues();
        replace.setInputFormat(training_data);
        Instances training_data_filter1 = Filter.useFilter(training_data, replace);

        // Normalize the training data
        Normalize norm = new Normalize();
        norm.setInputFormat(training_data_filter1);
        Instances processed_training_data = Filter.useFilter(training_data_filter1, norm);

        // Set the class attribute (last attribute) of the training data
        processed_training_data.setClassIndex(processed_training_data.numAttributes() - 1);

        // Open the output file for the predictions
        Output = new FileOutputStream("/Users//Desktop/CLASSIFICATION/test.txt");
        file = new PrintStream(Output);

        // Build the classifier
        J48 tree = new J48();
        tree.buildClassifier(processed_training_data);

        // Filter the test data with the SAME initialized filters. Do not call
        // setInputFormat again here: that re-initializes the filters on the
        // test data, so training and test data end up on different scales.
        Instances test_data_filter1 = Filter.useFilter(test_data, replace);
        Instances processed_test_data = Filter.useFilter(test_data_filter1, norm);

        // Set the class attribute of the pre-processed test data
        processed_test_data.setClassIndex(processed_test_data.numAttributes() - 1);

        // Predict and write out the class label of each test instance
        for (int i = 0; i < processed_test_data.numInstances(); i++) {
            weka.core.Instance currentInst = processed_test_data.instance(i);
            int predictedClass = (int) tree.classifyInstance(currentInst);
            System.out.println(predictedClass);
            file.println("O" + predictedClass);
        }
    }
}
Best Answer
An easy way to build an ensemble is to use a random forest. Weka ships a random forest implementation, and since tree-based models are already performing well on your data, it's worth trying out.
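As a sketch (assuming Weka 3.x on the classpath and the training file from the question, with the class as the last attribute), the random forest can be dropped in where J48 was:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestSketch {
    public static void main(String[] args) throws Exception {
        // Load the training data; class label is the last attribute
        Instances data = DataSource.read("/Users//Weka/training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest forest = new RandomForest();
        // Number of trees in the forest; recent Weka versions expose this as
        // setNumIterations (older releases call it setNumTrees)
        forest.setNumIterations(100);

        // Estimate accuracy with 10-fold cross-validation before submitting
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(forest, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Train on the full training set for the actual predictions
        forest.buildClassifier(data);
    }
}
```

Cross-validating first gives you a local accuracy estimate, so you don't burn submissions to the judge while tuning.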
You could also build your own ensemble by training multiple (say 50 or 100) J48 decision trees and using them to "vote" on the classification of each object. For example, if 60 trees say a given observation belongs to class "A", and 40 say it belongs to class "B", you classify the object as class "A."
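Weka ships a Vote meta-classifier that implements this combination scheme, so you don't have to write the voting loop yourself. A minimal sketch (the choice of base classifiers here is just an example):

```java
import weka.classifiers.Classifier;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class VoteSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/Users//Weka/training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Combine several base learners into one ensemble
        Vote vote = new Vote();
        vote.setClassifiers(new Classifier[] { new J48(), new REPTree(), new IBk(3) });
        // Vote defaults to averaging class probabilities; switch to plain
        // majority voting as described above
        vote.setCombinationRule(
                new SelectedTag(Vote.MAJORITY_VOTING_RULE, Vote.TAGS_RULES));
        vote.buildClassifier(data);
    }
}
```

Mixing different kinds of base learners (a tree, a nearest-neighbour classifier, etc.) often helps more than voting over identical models, because their errors are less correlated.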
You can further improve such an ensemble by training each tree on a random sub-sample of the training data. This is called "bagging," and the random sub-samples are usually created with replacement.
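Weka's Bagging meta-classifier implements exactly this; a sketch wrapping J48 as the base learner:

```java
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggingSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/Users//Weka/training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48());   // base learner to bag
        bagger.setNumIterations(50);       // number of bootstrap samples / trees
        bagger.setBagSizePercent(100);     // each bootstrap sample is as large
                                           // as the training set (drawn with
                                           // replacement)
        bagger.buildClassifier(data);
    }
}
```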
Finally, you can additionally give each tree a random subset of variables from the training set. This is called a "random forest." While your professor will probably be impressed if you write your own random forest algorithm, it's probably best to use an existing implementation.
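As for your second question, Weka's CVParameterSelection meta-classifier searches a grid of parameter values by cross-validation and keeps the best setting. A sketch tuning J48 (each parameter string is "<flag> <min> <max> <steps>"):

```java
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class TuningSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/Users//Weka/training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        CVParameterSelection ps = new CVParameterSelection();
        ps.setClassifier(new J48());
        // Try J48's pruning confidence -C from 0.05 to 0.5 in 10 steps
        ps.addCVParameter("C 0.05 0.5 10");
        // Try the minimum instances per leaf -M from 1 to 10 in 10 steps
        ps.addCVParameter("M 1 10 10");
        ps.buildClassifier(data);

        System.out.println("Best options: "
                + Utils.joinOptions(ps.getBestClassifierOptions()));
    }
}
```

The same wrapper works around any base classifier that exposes command-line options, including the ensemble methods above.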