What is the best way to automatically select features for anomaly detection?
I normally treat anomaly detection as a problem where the features are selected by human experts: what matters is the output range (as in "abnormal input – abnormal output"), so even with many features you can often arrive at a much smaller subset by combining them.
However, assuming that in the general case a feature list can be huge, automated feature selection may sometimes be preferable. As far as I can see, there are some attempts:
- "Automated feature selection for Anomaly Detection" (pdf) which generalizes Support Vector Data Description
- "A Fast Host-Based Intrusion Detection System Using Rough Set Theory" (no pdf available?) which, I guess, uses Rough Set Theory
- "Learning Rules for Anomaly Detection of Hostile Network Traffic" (pdf, video) which uses statistical approach
So now I wonder if anyone can tell – assuming anomaly detection and a really big (hundreds?) feature set:
- Do those huge feature sets make sense at all? Shouldn't we just reduce the feature set to, say, a few dozen features and be done with it?
- If huge feature sets do make sense, which of the approaches above would give better predictions, and why? Is there anything not listed that is much better?
- Why should they give better results compared to, say, dimensionality reduction or feature construction via clustering/ranking/etc.?
Best Answer
One practical approach (at least in the supervised-learning case) is to include all possibly relevant features and use a (generalized) linear model (logistic regression, linear SVM, etc.) with regularization (L1 and/or L2). There are open-source tools (e.g. Vowpal Wabbit) that can deal with trillions of example/feature combinations for these types of models, so scalability is not an issue (besides, one can always use sub-sampling). The regularization helps with feature selection: L1 in particular drives the coefficients of uninformative features to exactly zero.
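To make the idea concrete, here is a minimal sketch of L1 regularization acting as implicit feature selection, using scikit-learn on synthetic data (the data, feature count, and regularization strength are illustrative assumptions, not part of the answer above):

```python
# Sketch: L1-regularized logistic regression as feature selection.
# The dataset is synthetic: 100 features, only the first 5 informative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 1000, 100
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:5] = [2.0, -1.5, 1.0, 0.8, -0.7]   # only 5 features matter
y = (X @ w_true + rng.normal(scale=0.5, size=n) > 0).astype(int)

# A strong L1 penalty (small C) drives irrelevant coefficients to exactly zero,
# so the surviving nonzero coefficients are the "selected" features.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])
print(f"{len(selected)} of {p} features kept:", selected)
```

With a weaker penalty (larger `C`) more features survive; sweeping `C` (e.g. via cross-validation) trades sparsity against predictive accuracy.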