Solved – “Semi-supervised learning” – is this overfitting?

boosting, machine learning, overfitting, random forest, semi-supervised-learning

I was reading the report of the winning solution of a Kaggle competition (Malware Classification); the report can be found in this forum post. It was a classification problem (nine classes, scored by logarithmic loss) with 10,000 elements in the train set and 10,000 elements in the test set.

During the competition, the models were evaluated against 30% of the test set. Another important element is that the models were performing very well (close to 100% accuracy).

The authors used the following technique:

Another important technique we came up with is semi-supervised learning. We
first generate pseudo-labels for the test set by taking the
maximum-probability class from our best model. Then we predict the test set
again in a cross-validation fashion, using both the train data and the test
data. For example, the test set is split into 4 parts: A, B, C, and D. We use
the entire training data, plus test data A, B, and C with their pseudo-labels,
together as the new training set, and we predict test part D.

The same method is used to predict A, B, and C. This approach, invented
by Xiaozhou, works surprisingly well: it reduces local cross-validation
loss, public LB loss, and private LB loss. The best semi-supervised
model achieves 0.0023 in private LB log loss, which is the best score
over all of our solutions.
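To make sure I follow the procedure, here is a minimal sketch of it in scikit-learn style. This is my own reconstruction, not the authors' code: GradientBoostingClassifier just stands in for whatever model they actually used, and X_train, y_train, X_test are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold


def pseudo_label_cv_predict(X_train, y_train, X_test, n_splits=4):
    """Re-predict the test set using the train data plus pseudo-labeled test folds."""
    n_classes = len(np.unique(y_train))

    # Step 1: pseudo-labels = the most probable class from the "best" model.
    base = GradientBoostingClassifier().fit(X_train, y_train)
    pseudo = base.predict(X_test)

    # Step 2: split the test set into 4 parts; to predict part D, train on
    # the full training data plus parts A, B, C with their pseudo-labels.
    test_proba = np.zeros((len(X_test), n_classes))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for rest_idx, held_out_idx in kf.split(X_test):
        X_aug = np.vstack([X_train, X_test[rest_idx]])
        y_aug = np.concatenate([y_train, pseudo[rest_idx]])
        model = GradientBoostingClassifier().fit(X_aug, y_aug)
        test_proba[held_out_idx] = model.predict_proba(X_test[held_out_idx])
    return test_proba
```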

I really don't see how this can improve the results. Is it because 30% of the test set was "leaked" and this was a way to use that information?

Or is there a theoretical reason why it works?

Best Answer

It doesn't appear to be overfitting. Intuitively, overfitting implies training to the quirks (noise) of the training set and therefore doing worse on a held-out test set that does not share those quirks. If I understand what happened, they did not do unexpectedly poorly on held-out test data, which empirically rules out overfitting. (They have another issue, which I'll mention at the end, but it's not overfitting.)

So you are correct that it takes advantage of the available (30%?) test data. The question is: how?

If the available test data has labels associated with it, you could simply lump it into your training data; more training data generally yields better results, in an obvious way. No real accomplishment there.

Note that the labels wouldn't have to be explicitly listed if you have access to an accuracy score. You could simply climb the accuracy gradient by repeatedly submitting predictions and watching the score, which is what people have done in the past with poorly designed competitions.
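As a toy illustration of that kind of probing (purely hypothetical; submit_and_get_score stands in for a competition's scoring oracle, and I'm not suggesting the winners did anything like this):

```python
import numpy as np


def probe_labels(initial_preds, n_classes, submit_and_get_score):
    """Hill-climb the leaderboard: flip one prediction at a time and keep any
    flip that improves the returned accuracy. Absurdly submission-hungry, but
    it shows why score feedback on a fixed test set is effectively a label leak."""
    preds = np.asarray(initial_preds).copy()
    best = submit_and_get_score(preds)
    for i in range(len(preds)):
        for c in range(n_classes):
            trial = preds.copy()
            trial[i] = c
            score = submit_and_get_score(trial)
            if score > best:
                best, preds = score, trial
    return preds
```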

Given that the available test data does not have labels associated with it -- directly or indirectly -- there are at least two other possibilities:

First, this could be an indirect boosting method, where you focus on the cases in which your predictions from the training data alone disagree with your predictions once the pseudo-labeled test data is included.
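In code, "focusing on disagreements" could be as simple as the following (again, my speculation about the mechanism, not something the report spells out; preds_base and preds_aug are hypothetical prediction arrays):

```python
import numpy as np

# preds_base: test-set predictions from a model trained on the training data only
# preds_aug:  predictions from a model that also saw the pseudo-labeled test folds
disagreements = np.flatnonzero(preds_base != preds_aug)
# A boosting-flavoured scheme would then concentrate on these cases,
# e.g. by upweighting them or blending the two models only where they differ.
```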

Second, it could be straightforward semi-supervised learning. Intuitively: you could be using the density of the unlabeled data to help shape the classification boundaries of a supervised method. See the illustration in the Wikipedia article on semi-supervised learning (https://en.wikipedia.org/wiki/Semi-supervised_learning#/media/File:Example_of_unlabeled_data_in_semisupervised_learning.png) for a picture of this.
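For a concrete, if simplified, version of that idea: scikit-learn's SelfTrainingClassifier wraps a supervised base model and lets unlabeled points take part in fitting. This is only a sketch of the general technique, not the winners' pipeline, and it reuses the assumed X_train / y_train / X_test names from above with integer-coded labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Unlabeled test points are marked with -1; their position in feature space
# still influences where the wrapped classifier ends up drawing its boundaries.
X_all = np.vstack([X_train, X_test])
y_all = np.concatenate([y_train, -np.ones(len(X_test), dtype=int)])

semi = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
semi.fit(X_all, y_all)
test_proba = semi.predict_proba(X_test)
```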

BUT this doesn't mean that there isn't a trick here. And that trick comes from the definition of training and test data. In principle, training data represents data that you could have in hand when you are ready to deploy your model. And test data represents future data that will come into your system once it's operational.

In that case, training on test data is a leak from the future, where you are taking advantage of data you would not have seen yet. This is a major issue in the real world, where some variables may not exist until after the fact (say after an investigation is done) or may be updated at a later date.

So they are meta-gaming here: what they did is legitimate within the rules of the competition, because they were given access to some of the test data. But it's not legitimate in the real world, where the true test is how well it does in the future, on new data.
