If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure has also been used to choose the features, and this is what biases the performance analysis.
Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and writing down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible, and use that feature as our predictor. If we then cross-validate, we will get an expected error rate lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5, as the target data are simply random. If instead you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
The key idea is that cross-validation is a way of estimating the generalization performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).
HTH
Here is some MATLAB code that performs a Monte Carlo simulation of this setup, with 56 features and 259 cases to match your example. The output it gives is:
Biased estimator: erate = 0.429210 (0.397683 - 0.451737)
Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)
The biased estimator is the one where feature selection is performed prior to cross-validation; the unbiased estimator is the one where feature selection is performed independently in each fold of the cross-validation. This suggests that the bias can be quite severe, depending on the nature of the learning task.
NF    = 56;     % number of candidate features
NC    = 259;    % number of cases
NFOLD = 10;     % number of cross-validation folds
NMC   = 1e+4;   % number of Monte Carlo replications

% perform Monte Carlo simulation of the biased estimator
% (feature selection on all of the data, then cross-validation)
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1) >= 0;    % random binary target ("coin flips")
   x = randn(NC,NF) >= 0;   % random binary features, independent of y
   % perform feature selection on all of the data
   err = mean(repmat(y,1,NF) ~= x);   % error rate of each feature as a predictor
   [err,idx] = min(err);              % keep the best-matching feature
   % perform cross-validation using the pre-selected feature
   partition = mod(1:NC, NFOLD)+1;    % assign each case to one of the folds
   y_xval = zeros(size(y));
   for j=1:NFOLD
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);    % progress display
   drawnow;
end
erate = sort(erate);
% mean error rate and its 2.5% - 97.5% quantiles
fprintf(1, ' Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
% perform Monte Carlo simulation of the unbiased estimator
% (feature selection repeated independently within each fold)
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1) >= 0;
   x = randn(NC,NF) >= 0;
   % perform cross-validation
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      % perform feature selection on the training folds only
      err = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
      [err,idx] = min(err);
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);    % progress display
   drawnow;
end
erate = sort(erate);
% mean error rate and its 2.5% - 97.5% quantiles
fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
Best Answer
Yes, using the outputs of one model as the inputs of another is possible, and this idea is used, to varying extents, in several established approaches.
If you stack models on top of each other, then at prediction time the model at position $n$ in the chain essentially performs preprocessing/feature derivation for the model at position $n+1$. One advantage of this is that additional preprocessing and feature derivation conceptually increases the power of the whole setup (let's just assume this for simplicity), so more complex problems can be solved with such chains. But it also makes training more complex. The question will usually be: how do you train this chain of models? All at the same time? One after another? If supervised, which target variable do you use for the models that are not at the last position in the chain? A minimal sketch of such a chain is given right below.
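Here is a minimal MATLAB sketch of a two-model chain, trained one model after another with the same target for both stages (one of the options above). All data and variable names are illustrative assumptions, not part of the original question; model 1 is an ordinary least-squares fit whose class prediction is appended to the inputs of model 2:

% illustrative data: binary target depending on the first two features
N = 200;                                            % number of cases
X = randn(N,5);                                     % input features
y = double(sum(X(:,1:2),2) + 0.1*randn(N,1) > 0);   % binary target

w1 = X \ y;                      % model 1: linear least-squares fit to y
z  = double(X * w1 >= 0.5);      % model 1's class prediction = derived feature

X2 = [X z];                      % model 2 sees the inputs plus model 1's output
w2 = X2 \ y;                     % model 2: trained on the augmented inputs

yhat = double(X2 * w2 >= 0.5);   % prediction of the whole chain
fprintf('training error of the chain: %f\n', mean(yhat ~= y));

Training both models jointly, or giving the first model a different intermediate target, would be the alternatives raised by the questions above.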
Such thoughts lead to, e.g., what we currently see in deep learning: one core concept of deep neural networks (deep nets) is that there are many layers, where each layer performs an additional feature preprocessing/feature derivation step that is embedded right into the model (and therefore increases the model's power). Conceptually, all layers could be trained together, which would amount to plain end-to-end supervised learning. But in practice, complexity issues usually prevent this. This is why some deep nets train certain layers individually, some train parts in a supervised manner, some in an unsupervised manner, and most mix these concepts. Consider, e.g., the learning process and information flow in deep belief networks, convolutional neural nets, or deep autoencoders. My guess is that these might get pretty close to what you had in mind when asking your question.
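To make the "each layer derives features for the next" idea concrete, here is a minimal forward-pass sketch in MATLAB; the weights are random placeholders rather than trained values, so this only illustrates the information flow, not a learning procedure:

x  = randn(10,1);                    % input vector
W1 = randn(8,10); b1 = zeros(8,1);   % layer 1 parameters (untrained placeholders)
W2 = randn(4,8);  b2 = zeros(4,1);   % layer 2 parameters (untrained placeholders)
w3 = randn(4,1);                     % output layer parameters (untrained)

h1   = tanh(W1*x + b1);          % layer 1: first feature derivation
h2   = tanh(W2*h1 + b2);         % layer 2: features derived from layer 1's features
yhat = 1/(1 + exp(-w3'*h2));     % output layer: prediction from the derived features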
Sidenote: you might also want to look into the concept of boosting if you are not familiar with it yet. Boosting does not "chain" any models, but it uses the training error of one model to influence the training of subsequent models, which has turned out to be very effective in the past. Boosting in a nutshell: it assigns a weight to each sample at the start of training. After a model is trained and evaluated, samples that were classified wrongly get their weights increased, while samples that were classified correctly get their weights decreased. The subsequent model uses these new weights in training and therefore puts more emphasis on the samples that were previously misclassified. This process is repeated, and in this way each model is influenced by its predecessor and focuses on the part of the data that is "currently difficult to classify". A minimal sketch of this weighting scheme follows below.
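As a concrete illustration of that weighting scheme, here is a minimal MATLAB sketch of AdaBoost with one-dimensional decision stumps as the weak learners; the data, the number of rounds, and the stump search are illustrative choices of mine, not part of the answer above:

N = 100; T = 10;                % number of samples and boosting rounds
x = randn(N,1);                 % one-dimensional inputs
y = sign(x + 0.5*randn(N,1));   % noisy labels in {-1,+1}
y(y==0) = 1;

w = ones(N,1)/N;                % uniform sample weights at the start
alpha = zeros(T,1); thresh = zeros(T,1); polarity = zeros(T,1);
for t=1:T
   % find the stump h(x) = polarity*sign(x - thresh) with lowest weighted error
   best = inf;
   for s=[-1 1]
      for c=x'                  % candidate thresholds at the data points
         pred = s*sign(x - c); pred(pred==0) = s;
         e = sum(w(pred ~= y));
         if e < best
            best = e; thresh(t) = c; polarity(t) = s;
         end
      end
   end
   e = max(best, eps);             % guard against a perfect stump
   alpha(t) = 0.5*log((1 - e)/e);  % vote of this weak learner
   pred = polarity(t)*sign(x - thresh(t)); pred(pred==0) = polarity(t);
   % increase the weights of misclassified samples, decrease the others
   w = w .* exp(-alpha(t)*y.*pred);
   w = w/sum(w);                   % renormalise the weights
end

% final prediction: weighted vote of all stumps
F = zeros(N,1);
for t=1:T
   pred = polarity(t)*sign(x - thresh(t)); pred(pred==0) = polarity(t);
   F = F + alpha(t)*pred;
end
fprintf('boosted training error: %f\n', mean(sign(F) ~= y));

Note how the weight update exp(-alpha*y*pred) increases a sample's weight exactly when the stump's prediction disagrees with its label, matching the description above.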