I'm struggling to understand the concept of splitting data for unsupervised anomaly/outlier detection. You can find all approaches here. I found some papers and implementations that did not split the data when analysing or evaluating their algorithms:
- Unsupervised Anomaly Detection: XBOS/HBOS/IForest
- Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm
I also found some other implementations that did split the data in their analysis:
- Testing Isolation Forest on Non-Financial Data – 2 KDDCUP99 Sets
- Is train/test-Split in unsupervised learning of neural network necessary?
The questions are:
- Why should I split or not split my datasets when I apply unsupervised anomaly/outlier detection?
- What is the advantage or disadvantage of both approaches?
- Should the detection results (e.g., AUC, AP, or the number of false positives) be the same whether I split the data (e.g., 70% train / 30% test) or use the full dataset?
Generally, unsupervised outlier detection algorithms (e.g., IsolationForest, HBOS) predict outliers based on outlier scores computed over unlabeled data.
Suppose I split the data (e.g., 70% train / 30% test, with stratification). There is still a possibility of missing outliers that exist in the training set, because the reported results reflect only the test-set observations (there is no guarantee). On the other hand, the final evaluation might not be fair; please see this post.
In my case, I want to apply some algorithms to well-known outlier detection benchmark datasets without using the label/target column. The labels are available but, somewhat confusingly, they are not used for fitting. They serve only to validate and plot the results afterwards, so that I can compare different detection models with my own algorithm.
Please see the Python code below, which separates out the labels stored in the column name_target:
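from sklearn.model_selection import train_test_split

# Separate the features from the label column (labels are kept for evaluation only)
X, y = df.loc[:, df.columns != name_target], df[name_target]
seed = 120
test_size = 0.3
# Stratify on y so that both splits keep the same outlier ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=seed, stratify=y
)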
Best Answer
The splitting of datasets is used to give an estimate of generalized performance, and is used for predictive models - models that are designed to take new datapoints and output new predictions for them. Predictive models can be made using supervised learning (most common for classification and regression), unsupervised learning (common for anomaly detection) or combinations of unsupervised and supervised learning (semi-supervised, self-supervised etc).
So, whenever you are making a predictive model - use train/validation/test splits. And also note that even if applying unsupervised learning, it is extremely useful to have labeled validation and test sets - because the labels are key to most meaningful performance metrics.
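As a rough illustration of the predictive setting, here is a minimal sketch assuming scikit-learn and a synthetic dataset (the data, the 5% outlier rate, and the seeds are made up for the example, not taken from the question). The detector is fitted on the training split without labels, and the held-out labels are used only to compute AUC and AP on the test split:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 950 inliers, 50 outliers
rng = np.random.default_rng(120)
X_in = rng.normal(0.0, 1.0, size=(950, 2))
X_out = rng.uniform(-8.0, 8.0, size=(50, 2))
X = np.vstack([X_in, X_out])
y = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]  # 1 = outlier

# Stratify so that both splits contain the same share of outliers
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=120, stratify=y
)

# Unsupervised fit: y_train is never passed to the model
model = IsolationForest(random_state=120).fit(X_train)

# score_samples returns lower values for more abnormal points,
# so negate it to get a score where higher means "more anomalous"
scores_test = -model.score_samples(X_test)

# Labels are used only here, to measure performance on unseen data
auc = roc_auc_score(y_test, scores_test)
ap = average_precision_score(y_test, scores_test)
print(f"test AUC = {auc:.3f}, test AP = {ap:.3f}")
```

The numbers from this sketch estimate how the fitted detector would behave on new data points, which is what the split is for.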
Note: Sometimes unsupervised learning methods are used for models that are not designed to be predictive. Common examples are (a) clustering used for Exploratory Data Analysis, or (b) outlier detection used for cleaning a particular dataset. In these cases we only care how the model works on the given dataset, not on new data, so splitting the dataset is not needed.
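For the non-predictive case, a minimal sketch of dataset cleaning might look like the following. It assumes scikit-learn, synthetic data, and a contamination value of 0.05 chosen purely for illustration. The model is fitted and applied on the same data, with no split:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic dataset to be cleaned (an assumption for illustration)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(950, 2)),   # bulk of the data
    rng.uniform(-8.0, 8.0, size=(50, 2)),  # scattered points
])

# Fit and predict on the same data: no train/test split is involved.
# contamination=0.05 is an assumed share of points to flag.
detector = IsolationForest(contamination=0.05, random_state=0)
flags = detector.fit_predict(X)  # +1 = inlier, -1 = flagged as outlier

# Keep only the points that were not flagged
X_clean = X[flags == 1]
n_flagged = int((flags == -1).sum())
print(f"flagged {n_flagged} of {len(X)} points; {len(X_clean)} kept")
```

Here nothing is claimed about new data; the only question is which points of this particular dataset get flagged.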