Solved – Weka java API: Attribute Selection and Cross Validation

classificationfeature selectionjavaweka

Is there a way to perform Attirbute selection(aka feature selection) (regardless of method) only for the training dataset before passing data for Cross Validation ?

I currently think that the only possbile way to perform this using the Weka API is through a meta>>AttributeSelectedClassifier. However I am not yet sure whether this method Performs first Attributes Selection in the whole dataset (without taking into account the crossvalidation folds) and then classification, thus possibly introducing bias into the cross validation evaluation result.

Any Ideas???

Best Answer

Suppose that you want to evaluate a {feature selector+classifier} metaclassifier using 5 CV.

As far as I know, the meta>>AttributeSelectedClassifier is treated like any other classifier. That is, it is trained on 4/5 of data and tested on 1/5 of the data. This means that the feature selector runs on training data identifying the best features. Then, the reduced feature set is fed to the classifier and an actual schema gets generated. When testing the meta, the feature selection part of the metaclassifier just selects only those features previously determined as good. The result is fed to the learned schema and a prediction will be generated.

So, afaik, using AttributeSelectedClassifier is the right way to evaluate your schema.

Word of advice, I would throw in here, the data normalization/standardization, missing value inference, and/or any other meta parameter search for the actual classifier. So you will end up having an actual classifier wrapped up in several "metaclassifiers"

Related Solutions

Solved – Implementing the 0.632+ bootstrap method using the Weka Java API

I think I have managed to rewrite the Bootstrap 0.632+ from R to Java using the Weka Java API. The original R function can be found in the bootpred method inside the bootstrap package (link).

As you can see from the source code, I have used the corrected $\hat{R'}$ and $\hat{Err}^{(1)'}$ with the final (corrected) equation (32) from the original article.

However, despite correcting for abnormal values, I sometimes still get negative error rate, which is of course impossible and therefore invalid. Also, I have noticed that the difference between the 0.632 and the 0.632+ method is minimal, if any.

If someone finds any errors in my source code, I would be really grateful if you could point them out.

public class Bootstrap632plus extends AbstractPerformance {

    private final int repeats;
    private double Err632;
    private double resub;

    public Bootstrap632plus(Instances instances, int repeats) {
        super(instances);
        this.repeats = repeats;
    }

    public double getErr632() {
        return Err632;
    }

    public double getResub() {
        return resub;
    }

    @Override
    public double getErrorRate(final MachineLearningAlgorithm machineLearningAlgorithm, final Random seed) throws Exception {

        // First component
        double err = predictionError(machineLearningAlgorithm);
        this.resub = err;

        // Error rates
        List<Double> errorRates = Collections.synchronizedList(new ArrayList<>());

        // GAMA related stuff
        final int numClasses = instances.numClasses();
        AtomicIntegerArray p_l = new AtomicIntegerArray(numClasses);
        AtomicIntegerArray q_l = new AtomicIntegerArray(numClasses);

        // Bootstrap iterations
        seed.ints(repeats).parallel().forEach(randomSeed -> {

            // Get error rate
            Evaluation evaluation = bootstrapIteration(machineLearningAlgorithm, randomSeed);
            errorRates.add(evaluation.errorRate());

            /*
             GAMA VARIABLE

                Confusion matrix:
                 - first dimension (rows): real distribution for first class
                 - second dimension (columns): predicted distribution for first class

                p_l = observed proportions of responses where y_i equals l
                    - sum by l-th row (first dimension)

                q_l = observer proportions of predicted responses where y_i equals l
                    - sum by l-th column (second dimension)

                GAMA = SUM_by_l(p_l * (1 - q_l))

             */
            double[][] confusionMatrix = evaluation.confusionMatrix();

            for(int l = 0; l < numClasses; l++) {

                int p_tmp = 0, q_tmp = 0;

                for(int n = 0; n < numClasses; n++) {

                    // Sum for l-th class
                    p_tmp += confusionMatrix[l][n];
                    q_tmp += confusionMatrix[n][l];

                }

                // Add data for l-th class
                p_l.addAndGet(l, p_tmp);
                q_l.addAndGet(l, q_tmp);

            }

        });

        // Second component
        double Err1 = errorRates.stream().mapToDouble(i -> i).average().orElse(0);

        // Plain 0.632 bootstrap
        Err632 = .368*err + .632*Err1;

        // GAMA
        final double observations = instances.size() * repeats;
        double gama = 0;
        for(int l = 0; l < numClasses; l++) {

            // Normalize numbers -> divide by number of all observations (repeats * dataset size)
            gama += ((double)p_l.get(l) / observations) * (1 - ((double)q_l.get(l) / observations));

        }

        // Relative overfitting rate (R)
        double R = (Err1 - err) / (gama - err);

        // Modified variables (according to original journal article)
        double Err1_ = Double.min(Err1, gama);
        double R_ = R;

        // R can fall out of [0, 1] -> set it to 0
        if(!(Err1 > err && gama > err)) {
            R_ = 0;
        }

        // The 0.632+ bootstrap (as used in original article)
        double Err632plus = Err632 + (Err1_ - err) * (.368 * .632 * R_) / (1 - .368 * R_);

        return Err632plus;

    }

    /**
     * Prediction error: first component of the 0.632+ bootstrap.
     * Train the classifier on the whole dataset and then also test it on the whole dataset.
     * 
     * @param machineLearningAlgorithm Specified machine learning algorithm
     * @return prediction error [0, 1]
     * @throws Exception exception
     */
    private double predictionError(final MachineLearningAlgorithm machineLearningAlgorithm) throws Exception {

        // Train
        Classifier classifier = ClassifierFactory.instantiate(machineLearningAlgorithm);
        classifier.buildClassifier(instances);

        // Test
        Evaluation evaluation = new Evaluation(instances);
        evaluation.evaluateModel(classifier, instances);

        // Return error rate
        return evaluation.errorRate();

    }

    /**
     * One iteration of the Leave-one-out Bootstrap Cross-Validation.
     * @return
     * @throws Exception 
     */
    private Evaluation bootstrapIteration(final MachineLearningAlgorithm machineLearningAlgorithm, final int randomSeed) {

        try {

            final int SIZE = instances.size();
            final Random r = new Random(randomSeed);

            // Custom sampling (100%, with replacement)
            List<Instance> TRAIN = new ArrayList<>(SIZE); // Empty list (add one-by-one)
            List<Instance> TEST = new ArrayList<>(instances); // Full (remove one-by-one)

            for(int i = 0; i < SIZE; i++) {

                // Random select instance
                Instance instance = instances.get(r.nextInt(SIZE));

                // Add to TRAIN, remove from TEST
                TRAIN.add(instance);
                TEST.remove(instance);

            }

            // Train
            Instances trainSet = new Instances(instances, TRAIN.size());
            trainSet.addAll(TRAIN);

            Classifier classifier = ClassifierFactory.instantiate(machineLearningAlgorithm);
            classifier.buildClassifier(trainSet);

            // Test set
            Instances testSet = new Instances(instances, TEST.size());
            testSet.addAll(TEST);

            // Test
            Evaluation evaluation = new Evaluation(instances);
            evaluation.evaluateModel(classifier, testSet);

            // Return the evaluation (for further processing)
            return evaluation;

        } catch(Exception e) {

            throw new RuntimeException(e);

        }

    }

}

Solved – Cross Validation with Preprocessing (Normalization, Discretization, Feature Selection)

Doing preprocessing out of the cross validation loop is especially bad if feature selection is performed (esp when you have large feature size) but not so much for data normalization, as by scaling either by 1 or 100, these numbers already has a predetermined meaning that there's nothing that the model can cheat and learn about the left-out set.

If you have a problem about this, it reflects more about a programming defect than a mathematical problem. A work around to just to first make the lower and upper bound for your bin incorporate all your data. Yet I don't think packages nowadays have this problem.

Best Answer

Related Solutions

Solved – Implementing the 0.632+ bootstrap method using the Weka Java API

Solved – Cross Validation with Preprocessing (Normalization, Discretization, Feature Selection)

Related Question