Solved – Unsupervised Learning: Train/Test Split

isolation-forest, machine-learning, random-forest, scikit-learn, unsupervised-learning

I have one conceptual question.

In unsupervised learning I have no labels. Should an anomaly detection model (Isolation Forest, autoencoders, distance-based methods, etc.) still be fit on training data and evaluated on a held-out test set (a train/test split), just like the common supervised practice of creating data folds?

In supervised learning this helps in many ways, most notably in detecting overfitting.

Or does it not matter in unsupervised learning, so that I can train on my entire available dataset, since there are no labels or metrics to check the goodness of fit?
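For concreteness, here is a minimal sketch of the second option: fitting an Isolation Forest on the whole unlabeled dataset with no train/test split. The synthetic data, the `contamination` value, and all other parameter choices are illustrative assumptions, not taken from the question.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic unlabeled data (assumed for illustration):
# 500 inliers around the origin plus 10 injected outliers far away.
inliers = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X = np.vstack([inliers, outliers])

# No train/test split: fit on everything available.
# contamination=0.02 is a guess at the anomaly fraction, not a learned value.
model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(X)  # -1 = flagged as anomaly, 1 = normal
print("flagged as anomalies:", int((labels == -1).sum()))
```

With no labels, nothing in this snippet can tell you whether the flagged points are *true* anomalies; that is exactly the gap the answer below addresses.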

Best Answer

It doesn't make much sense to split the dataset for unsupervised learning, since you don't have labels with which to automatically compute the accuracy or effectiveness of your model.

One way of getting a sense of how well your model is doing is to inspect the samples it flags. For example, say your unsupervised model detects 50 samples that fall away from the majority of your data: manually review those 50 and count the percentage that are true positives. That gives you a feel for the model's precision. Then, based on your prior knowledge of roughly how many positive cases should exist in your dataset, you can estimate how many positives your current model fails to capture. Together these allow you to compute a rough sensitivity and specificity for your model.
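The estimate described above is simple arithmetic; a sketch with entirely hypothetical counts (the dataset size, the 50 flagged samples, the 40 confirmed positives, and the prior of 60 expected positives are all made-up illustrative numbers):

```python
# Hypothetical numbers for the manual-review estimate (all assumed):
n_total = 10_000           # size of the full dataset
n_flagged = 50             # samples the model marked as anomalies
n_true_in_flagged = 40     # positives confirmed by manually reviewing those 50
n_expected_positives = 60  # rough prior: positives believed to exist overall

precision = n_true_in_flagged / n_flagged               # confirmed / flagged
sensitivity = n_true_in_flagged / n_expected_positives  # captured / expected
false_positives = n_flagged - n_true_in_flagged
true_negatives = (n_total - n_expected_positives) - false_positives
specificity = true_negatives / (n_total - n_expected_positives)

print(f"precision={precision:.2f}, "
      f"sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.4f}")
```

Note these are only rough figures: they inherit all the uncertainty in your prior guess of how many positives the dataset actually contains.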