It doesn't appear to be overfitting. Intuitively, overfitting means fitting the quirks (noise) of the training set and therefore doing worse on a held-out test set that does not share those quirks. If I understand what happened, they did not do unexpectedly poorly on held-out test data, so that empirically rules out overfitting. (They have another issue, which I'll mention at the end, but it's not overfitting.)
So you are correct that it takes advantage of the available (30%?) test data. The question is: how?
If the available test data has labels associated with it, you could simply lump it in with your training data, enlarging the training set, which in general would yield better results in an obvious way. No real accomplishment there.
Note that the labels wouldn't have to be explicitly listed if you have access to an accuracy score. You could simply climb the accuracy gradient by repeatedly submitting predictions and observing the returned score, which is what people have done in the past with poorly designed competitions.
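To make the probing idea concrete, here is a minimal, purely illustrative sketch: `submit` and `hidden_labels` are hypothetical stand-ins for a competition's scoring endpoint and its secret labels, not anything from the actual competition.

```python
# Minimal sketch of "probing" hidden test labels through an accuracy score.
# `hidden_labels` and `submit()` are hypothetical stand-ins for a
# competition's secret labels and its scoring endpoint.
import numpy as np

rng = np.random.default_rng(0)
hidden_labels = rng.integers(0, 2, size=20)      # secret binary labels

def submit(predictions):
    """Pretend leaderboard: returns accuracy against the hidden labels."""
    return np.mean(predictions == hidden_labels)

# Start from an arbitrary guess and flip one entry per "submission".
guess = np.zeros(20, dtype=int)
for i in range(20):
    base = submit(guess)
    flipped = guess.copy()
    flipped[i] = 1 - flipped[i]
    if submit(flipped) > base:                   # accuracy improved -> keep the flip
        guess = flipped

print("recovered all labels:", np.array_equal(guess, hidden_labels))
```

Each label costs only a couple of queries this way, which is why well-designed competitions limit submissions and keep a hidden final test set.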
Given that the available test data does not have labels associated with it -- directly or indirectly -- there are at least two other possibilities:
First, this could be a form of indirect boosting, in which you focus on the cases where predictions made from the training data alone disagree with predictions made once the pseudo-labeled test data is included.
Second, it could be straightforward semi-supervised learning. Intuitively, you could be using the density of the unlabeled data to help shape the classification boundaries of a supervised method. See the illustration (https://en.wikipedia.org/wiki/Semi-supervised_learning#/media/File:Example_of_unlabeled_data_in_semisupervised_learning.png) in the Wikipedia article on semi-supervised learning for a clear picture.
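For instance, a simple pseudo-labeling (self-training) sketch along those lines using scikit-learn might look like this; the dataset, split sizes, and the 0.9 confidence threshold are arbitrary illustrative choices, not anything we know about their pipeline.

```python
# A minimal self-training (pseudo-labeling) sketch with scikit-learn.
# The split sizes and the 0.9 confidence threshold are arbitrary choices
# for illustration only.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
# Pretend only a small labeled training set exists; the rest plays the
# role of unlabeled "test" data whose labels we never look at.
X_train, X_unlab, y_train, _ = train_test_split(X, y, train_size=50, random_state=0)

clf = SVC(probability=True, random_state=0).fit(X_train, y_train)

# Pseudo-label the unlabeled points the model is most confident about,
# add them to the training set, and refit.
proba = clf.predict_proba(X_unlab)
confident = proba.max(axis=1) > 0.9
X_aug = np.vstack([X_train, X_unlab[confident]])
y_aug = np.concatenate([y_train, clf.predict(X_unlab[confident])])
clf_semi = SVC(probability=True, random_state=0).fit(X_aug, y_aug)
# clf_semi's boundary is now shaped partly by the density of the unlabeled data.
```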
BUT this doesn't mean that there isn't a trick here. And that trick comes from the definition of training and test data. In principle, training data represents data that you could have in hand when you are ready to deploy your model. And test data represents future data that will come into your system once it's operational.
In that case, training on test data is a leak from the future, where you are taking advantage of data you would not have seen yet. This is a major issue in the real world, where some variables may not exist until after the fact (say after an investigation is done) or may be updated at a later date.
So they are meta-gaming here: what they did is legitimate within the rules of the competition, because they were given access to some of the test data. But it's not legitimate in the real world, where the true test is how well it does in the future, on new data.
Great question. Anything can be good or bad, useful or not, based on what your goals are (and perhaps on the nature of your situation). For the most part, these methods are designed to satisfy different goals.
- Statistical tests, like the $t$-test, allow you to test scientific hypotheses. They are often used for other purposes (because people just aren't familiar with other tools), but generally shouldn't be. If you have an a priori hypothesis that the two groups have different means on a normally distributed variable, then the $t$-test lets you test that hypothesis and control your long-run type I error rate (although you won't know whether you made a type I error in this particular case).
- Classifiers in machine learning, like an SVM, are designed to classify patterns as belonging to one of a known set of classes. The typical situation is that you have some known instances, and you want to train the classifier on them so that it can provide the most accurate classifications in the future, when you will have other patterns whose true class is unknown. The emphasis here is on out-of-sample accuracy; you are not testing any hypothesis. Certainly you hope that the distribution of the predictor variables / features differs between the classes, because otherwise no future classification help will be possible, but you are not trying to assess your belief that the means of Y differ by X. You want to correctly guess X in the future when Y is known.
- Unsupervised learning algorithms, like clustering, are designed to detect or impose structure on a dataset. There are many possible reasons you might want to do this. Sometimes you might expect that there are true, latent groupings in a dataset and want to see whether the results of clustering seem sensible and usable for your purposes. In other cases, you might want to impose a structure on a dataset to enable data reduction. Either way, you are not trying to test a hypothesis about anything, nor are you hoping to be able to accurately predict anything in the future. (A short sketch contrasting all three approaches on the same toy data follows this list.)
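To make the contrast concrete, here is a minimal sketch that runs all three kinds of analysis on the same toy two-group data; the dataset and all settings are arbitrary illustrations.

```python
# Three different goals applied to the same toy data: hypothesis testing,
# out-of-sample classification, and clustering. All settings are arbitrary.
import numpy as np
from scipy import stats
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
group_b = rng.normal(loc=1.5, scale=1.0, size=(100, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 100 + [1] * 100)

# (a) Hypothesis test: do the group means differ on the first variable?
t, p = stats.ttest_ind(group_a[:, 0], group_b[:, 0])
print(f"t-test: t = {t:.2f}, p = {p:.3g}")

# (b) Classification: how well can we predict the class of *future* cases?
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
acc = SVC().fit(X_tr, y_tr).score(X_te, y_te)
print(f"SVM out-of-sample accuracy: {acc:.2f}")

# (c) Clustering: detect/impose structure without ever using the labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```

Note that only (b) is judged by how well it predicts cases it has never seen; (a) is judged by its long-run error-rate guarantees, and (c) is judged by whether the structure it finds is sensible and useful.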
With this in mind, let's address your questions:
- The three methods differ fundamentally in the goals they serve.
- b and c could be useful in scientific arguments; it depends on the nature of the arguments in question. By far the most common type of research in science is centered on testing hypotheses. However, forming predictive models or detecting latent patterns are also possible, legitimate goals.
- You would not typically try to get 'significance' from methods b or c.
- Assuming the features are categorical in nature (which I gather is what you have in mind), you can still test hypotheses using a factorial ANOVA. In machine learning there is a subtopic of multi-label classification (a minimal sketch follows this list). There are also methods for multiple-membership / overlapping clusters, but these are less common and constitute a much less tractable problem. For an overview of the topic, see Krumpleman, C. S. (2010). Overlapping clustering. Dissertation, UT Austin, Electrical and Computer Engineering.
- Generally speaking, all three types of methods have greater difficulty as the numbers of cases across the categories diverge (i.e., as the classes become more unbalanced).
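Regarding the multi-label case mentioned above, a minimal scikit-learn sketch might look like this; the generated dataset and the logistic-regression base learner are arbitrary choices for illustration.

```python
# A minimal multi-label classification sketch: cases can belong to several
# categories at once. Dataset and settings are arbitrary illustrations.
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

X, Y = make_multilabel_classification(n_samples=300, n_classes=4, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=0)

# One binary classifier per label; a sample may receive several labels at once.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
print("subset accuracy:", clf.score(X_te, Y_te))
```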
It doesn't make much sense to split the dataset for unsupervised learning, since you don't have labels with which to automatically calculate the accuracy/effectiveness of your model.
One way of getting a sense of how well your model is doing is to manually inspect the samples it detects. For example, say you detected 50 samples that fall away from the majority of your data; manually check those 50 to see what percentage are true positives. That gives you a feel for how good your model is. Then, based on your prior knowledge of roughly how many positive cases should be in your dataset, you can estimate how many positive cases are not captured by your current model. This lets you calculate a rough sensitivity and specificity for your model.
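As a worked example of that arithmetic, here is a short sketch; all the numbers are hypothetical, standing in for the 50 flagged samples, your manual check of them, and your prior guess about the total number of positives.

```python
# Rough sensitivity/specificity from a manual check plus a prior guess.
# All figures below are hypothetical placeholders.
n_total = 10_000          # size of the dataset (assumed)
n_flagged = 50            # samples the unsupervised model flagged
n_flagged_positive = 35   # of those, how many the manual check confirms as positive
n_expected_positive = 60  # rough prior belief about total positives in the data

true_pos = n_flagged_positive
false_pos = n_flagged - n_flagged_positive
false_neg = n_expected_positive - n_flagged_positive   # positives the model missed
true_neg = n_total - n_expected_positive - false_pos

sensitivity = true_pos / (true_pos + false_neg)
specificity = true_neg / (true_neg + false_pos)
print(f"sensitivity ~ {sensitivity:.2f}, specificity ~ {specificity:.3f}")
```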