Solved – Training multiple models for classification using the same dataset

classification, machine learning

For my classification problem, I am trying to classify objects as Good or Bad. I have been able to build a good first classification step that separates the data into two groups using an SVM.

After tuning the SVM's parameters on a training/holdout split (75% training, 25% holdout), I obtained the following results on the holdout set: Group 1 (classified as Bad by the model) consisted of 99% Bad objects, while Group 2 (classified as Good) consisted of about 45% Good objects and 55% Bad objects. I verified the model's performance with 5-fold CV and found it to be stable, with fairly consistent misclassification rates across folds.
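For concreteness, here is a minimal sketch of this setup in scikit-learn, assuming the data lives in a feature matrix `X` and label vector `y` (hypothetical names); the tuning grid shown is just a placeholder for whatever grid was actually used:

```python
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.svm import SVC

# 75% training / 25% holdout split, as described above.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Tune the SVM on the training portion only; the grid is a placeholder.
grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
                    cv=5)
grid.fit(X_train, y_train)
svm1 = grid.best_estimator_

# Stability check with 5-fold CV on the training data.
print("CV accuracy per fold:", cross_val_score(svm1, X_train, y_train, cv=5))

# First look at holdout performance.
print("Holdout accuracy:", svm1.score(X_hold, y_hold))
```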

Now, I want to pass these objects through another round of classification by training a second model (which may or may not be an SVM) on my Group 2 of maybe-good/maybe-bad objects, to try to classify this second group correctly now that I have gotten rid of the obviously bad objects.

I had a couple of thoughts, but am unsure of how to proceed.

(1) My first idea was to use the classified objects from the holdout set to train another model, and I was able to train such a model on the holdout results. The problem is that this uses less than 25% of the original data, and I am worried about overfitting on such a small subset.

(2) My second idea was to gather the results of the 5-fold CV to create another dataset. My reasoning is that since the data is partitioned into 5 parts, and each part is classified into two groups by a model trained on the other 4 parts, I could aggregate the out-of-fold predictions of the 5 parts to obtain a classified version of my original dataset and continue from there.
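A sketch of this idea, reusing the hypothetical names from the snippet above: `cross_val_predict` returns exactly these out-of-fold predictions, from which the maybe-good group can be extracted (assuming the Good class is literally labelled `"Good"`):

```python
from sklearn.model_selection import cross_val_predict

# Each object is labelled by a model trained on the other 4 folds.
oof_pred = cross_val_predict(svm1, X_train, y_train, cv=5)

# The objects the first stage calls Good become the stage-two dataset.
mask = oof_pred == "Good"
X_stage2, y_stage2 = X_train[mask], y_train[mask]
```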

The only problem is, I have a sinking feeling that neither method is sound. Could anyone shed some light on possible next steps?

EDIT

Sorry, my question was badly worded. Let me try to clarify what I am trying to do. It can be thought of as a tree:

  • Let me call the original dataset Node 0.
  • I used classification method 1 to split Node 0 into Node 1 and Node 2.
    • Node 1 has a low misclassification rate (it consists mostly of Bad objects)
    • Node 2 has a high misclassification rate (a roughly even mix of Good and Bad objects)
  • I now want to use classification method 2 to split Node 2 into Nodes 3 and 4

The "classification method" can be anything (LDA, QDA, SVM, CART, Random Forest, etc). So I guess what I am trying to achieve here is a "classification" tree (not CART), where each node is subjected to a different classification method to obtain an overall high "class purity". Basically, I want to use a mix of different classification methods to obtain reasonable results.

My problem lies in the loss of training data after the first split: I run out of usable data once it has been through "classification method 1", which was an SVM in my case.

Best Answer

Just to make sure that we are on the same page: I take it from your description that you have a supervised learning problem where you know the Good/Bad status of your objects, and where you have a feature vector for each object that you want to use to classify it as either Good or Bad. Training an SVM then yields a classifier which, on the holdout data, makes almost no false Bad predictions but 55% false Good predictions.

I have not personally worked with problems showing such a huge difference in error rates between the two groups. It suggests to me that the feature distributions of the two groups overlap, but that the distribution for the Bad group is more spread out, like two Gaussian distributions with almost the same mean but a larger variance for the Bad objects. If that is the case, I would imagine that it will be difficult, if not impossible, to improve much on the error rate for the Good predictions. There may be other explanations that I am not aware of.
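To illustrate this hypothesis (a toy simulation under the stated Gaussian assumption, not something derived from your actual data): when the two classes share a mean, even a sensible symmetric decision rule leaves a large share of Bad objects inside the Good region.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
good = rng.normal(loc=0.0, scale=1.0, size=n)  # Good: tight around the mean
bad = rng.normal(loc=0.0, scale=3.0, size=n)   # Bad: same mean, more spread out

# Call everything within +/- t of the shared mean "Good".
t = 1.5
false_good = np.mean(np.abs(bad) < t)  # Bad objects landing in the Good region
print(f"Share of Bad objects classified Good: {false_good:.2f}")  # ~0.38
```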

Having said that, I think it is a sensible strategy to combine classification procedures in a hierarchical way, as you suggest: first one classifier splits the full training set into two groups, then other classifiers split each of those groups into two groups, and so on. In fact, that is what classification trees do, though typically with very simple splits at each step. I see no formal problem in training whatever model you like on the part of the training data that the SVM classifies as Good. You don't need the holdout data for this; in fact, you shouldn't use it if you need it for assessing the model.
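A sketch of such a cascade, reusing the hypothetical names from the snippets above and picking a random forest as an arbitrary second-stage classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stage 1: the tuned SVM, fit on the full training set.
svm1.fit(X_train, y_train)

# Stage 2: trained only on the training objects that stage 1 classifies as Good.
mask = svm1.predict(X_train) == "Good"
stage2 = RandomForestClassifier(random_state=0).fit(X_train[mask], y_train[mask])

def cascade_predict(X_new):
    """Route objects through stage 1; re-classify its 'Good' group with stage 2."""
    pred = svm1.predict(X_new)
    good = pred == "Good"
    if good.any():
        pred[good] = stage2.predict(X_new[good])
    return pred

# The holdout is touched only once, to assess the combined classifier.
print("Cascade holdout accuracy:", np.mean(cascade_predict(X_hold) == y_hold))
```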

Your second suggestion is closely related to simply using the group classified as Good in your training data to train the second model. I don't see any particular reason to use CV-based classifications to obtain this group. Just remember that, if you are going to use CV for assessment, the entire training procedure must be carried out within each fold.
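Concretely, "the entire training procedure" means both the SVM tuning and the second-stage fit are redone inside every fold; a sketch under the same hypothetical names:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

accs = []
for tr, te in StratifiedKFold(n_splits=5).split(X_train, y_train):
    # Redo stage-1 tuning from scratch on this fold's training part.
    g = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)
    s1 = g.fit(X_train[tr], y_train[tr]).best_estimator_

    # Redo stage 2 on the objects this fold's stage 1 calls Good.
    m = s1.predict(X_train[tr]) == "Good"
    s2 = RandomForestClassifier(random_state=0).fit(X_train[tr][m], y_train[tr][m])

    # Evaluate the full cascade on the held-out fold.
    pred = s1.predict(X_train[te])
    good = pred == "Good"
    if good.any():
        pred[good] = s2.predict(X_train[te][good])
    accs.append(np.mean(pred == y_train[te]))

print("Per-fold accuracy of the whole procedure:", accs)
```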

My suggestion is to first get a better understanding of what the feature distributions look like in the two groups, using low-dimensional projections and exploratory visualizations. That might shed some light on why the error rate for the Good classifications is so large.
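For example (a minimal sketch, assuming numeric features and the hypothetical names above), a two-component PCA projection coloured by class is a quick first look:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the training features onto the first two principal components.
Z = PCA(n_components=2).fit_transform(X_train)
for label, colour in [("Good", "tab:blue"), ("Bad", "tab:red")]:
    m = y_train == label
    plt.scatter(Z[m, 0], Z[m, 1], s=5, alpha=0.4, c=colour, label=label)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.legend()
plt.show()
```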
