Machine Learning – Distinguishing Between Two Groups: Hypothesis Test vs. Classification vs. Clustering

hypothesis-testing · machine-learning · supervised-learning · t-test · unsupervised-learning

Assume I have two groups of data, labeled A and B (each containing e.g. 200 samples and 1 feature), and I want to know whether they are different. I could:

  • a) perform a statistical test (e.g. t-test) to see if they are statistically different.

  • b) use supervised machine learning (e.g. a support vector classifier or a random forest classifier). I can train this on part of my data and validate it on the rest. If the machine-learning algorithm then classifies the held-out data correctly, I can be confident that the samples are distinguishable.

  • c) use an unsupervised algorithm (e.g. K-Means) and let it divide all the data into two clusters. I can then check whether these clusters agree with my labels, A and B. (A runnable sketch of all three approaches follows this list.)
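For concreteness, here is a minimal sketch of all three approaches on simulated data matching the setup above (two groups of 200 samples, 1 feature). The group means, and the use of scipy and scikit-learn, are my own assumptions for illustration:

```python
# Sketch of approaches (a)-(c): two simulated groups of 200 samples,
# 1 feature each; the group means (0.0 vs 0.5) are arbitrary assumptions.
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)  # group A
b = rng.normal(0.5, 1.0, 200)  # group B

# (a) Statistical test: two-sample t-test on the group means.
result = stats.ttest_ind(a, b)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")

# (b) Supervised learning: train on half the data, evaluate on the rest.
X = np.concatenate([a, b]).reshape(-1, 1)
y = np.array([0] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = SVC().fit(X_tr, y_tr)
print("out-of-sample accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# (c) Unsupervised learning: cluster into two groups, compare to the labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("agreement with labels (adjusted Rand index):",
      adjusted_rand_score(y, clusters))
```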

My questions are:

  1. In what ways are these three approaches overlapping or mutually exclusive?
  2. Are b) and c) useful for any scientific arguments?
  3. How could I get a “significance” for the difference between samples A and B out of methods b) and c)?
  4. What would change if the data had multiple features rather than 1 feature?
  5. What happens if the groups contain different numbers of samples, e.g. 100 vs. 300?

Best Answer

Great question. Anything can be good or bad, useful or not, based on what your goals are (and perhaps on the nature of your situation). For the most part, these methods are designed to satisfy different goals.

  • Statistical tests, like the $t$-test, allow you to test scientific hypotheses. They are often used for other purposes (because people just aren't familiar with other tools), but generally shouldn't be. If you have an a priori hypothesis that the two groups have different means on a normally distributed variable, then the $t$-test will let you test that hypothesis and control your long-run type I error rate (although you won't know whether you made a type I error in this particular case). The simulation after this list illustrates this long-run guarantee.
  • Classifiers in machine learning, like an SVM, are designed to classify patterns as belonging to one of a known set of classes. The typical situation is that you have some known instances, and you want to train the classifier on them so that it can provide the most accurate classifications in the future for other patterns whose true class is unknown. The emphasis here is on out-of-sample accuracy; you are not testing any hypothesis. Certainly you hope that the distributions of the predictor variables / features differ between the classes, because otherwise no future classification help will be possible, but you are not trying to assess your belief that the means of Y differ by X. You want to correctly guess X in the future when Y is known.
  • Unsupervised learning algorithms, like clustering, are designed to detect or impose structure on a dataset. There are many possible reasons you might want to do this. Sometimes you might expect that there are true, latent groupings in a dataset and want to see if the results of clustering will seem sensible and usable for your purposes. In other cases, you might want to impose a structure on a dataset to enable data reduction. Either way, you are not trying to test a hypothesis about anything, nor are you hoping to be able to accurately predict anything in the future.
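To make the first point concrete, here is a small simulation (my own illustration, with an assumed null of equal means) of what controlling the long-run type I error rate means: when the null hypothesis is true, a $t$-test at $\alpha = 0.05$ rejects in roughly 5% of repeated experiments:

```python
# Simulation: long-run type I error rate of the t-test when the null
# hypothesis is true (both groups drawn from the same distribution).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, alpha = 10_000, 0.05
rejections = 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, 200)
    b = rng.normal(0.0, 1.0, 200)  # same mean: H0 is true
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1
print("empirical type I error rate:", rejections / n_sims)  # ~0.05
```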

With this in mind, let's address your questions:

  1. The three methods differ fundamentally in the goals they serve.
  2. b and c could be useful in scientific arguments; it depends on the nature of the arguments in question. By far the most common type of research in science is centered on testing hypotheses. However, forming predictive models or detecting latent patterns are also possible, legitimate goals.
  3. You would not typically try to get 'significance' from methods b or c.
  4. Assuming the features are categorical in nature (which I gather is what you have in mind), you can still test hypotheses using a factorial ANOVA (a brief sketch follows this list). In machine learning there is a subtopic for multi-label classification. There are also methods for multiple-membership / overlapping clusters, but these are less common and constitute a much less tractable problem. For an overview of the topic, see Krumpelman, C. S. (2010). Overlapping clustering. Doctoral dissertation, UT Austin, Electrical and Computer Engineering.
  5. Generally speaking, all three types of methods have greater difficulty as the numbers of cases in the categories diverge.
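As an illustration of the factorial ANOVA mentioned in point 4, here is a sketch using statsmodels; the two crossed factors, their names, and the simulated effect sizes are all assumptions of mine, not part of the original question:

```python
# Sketch of a factorial ANOVA (point 4): two crossed categorical
# factors with a simulated outcome. Factor names and the small group
# effect are assumptions made for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "group":   rng.choice(["A", "B"], n),
    "feature": rng.choice(["low", "high"], n),
})
# Outcome with a small effect of group plus noise.
df["y"] = (df["group"] == "B") * 0.5 + rng.normal(0.0, 1.0, n)

model = smf.ols("y ~ C(group) * C(feature)", data=df).fit()
print(anova_lm(model, typ=2))  # main effects and the interaction
```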