Solved – Can a decision tree be used for fraud detection in this way?


I have a large dataset with more than 100 variables, including a target variable. A small portion of the cases have target = 1, and these are due to fraud or other errors. I want to identify these target = 1 cases, i.e. the fraud or error cases.
Assuming most cases are good, I intend to use a decision tree to classify the cases. Consider a leaf with a high percentage of target = 0, e.g. 95% of its cases have target = 0 and 5% have target = 1. I think the 5% of cases with target = 1 in this leaf are classification errors, comprising random error plus error caused by fraudulent behavior, so they can be considered suspected fraud cases.
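For concreteness, here is a minimal sketch of what I have in mind, using scikit-learn on simulated data (the 5% threshold, tree depth, and all names are placeholders, not choices from my real project):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Simulated stand-in for the real data: many features, rare positives.
X, y = make_classification(n_samples=10_000, n_features=100,
                           weights=[0.95, 0.05], random_state=0)

clf = DecisionTreeClassifier(max_depth=6, random_state=0)
clf.fit(X, y)

# Map each observation to its leaf, then compute the positive rate per leaf.
leaf_ids = clf.apply(X)
pos_rate = {leaf: y[leaf_ids == leaf].mean() for leaf in np.unique(leaf_ids)}

# Flag positives sitting in overwhelmingly negative leaves. The 5% cutoff
# is just the figure from the example above, not a recommendation.
suspects = [i for i in range(len(y))
            if y[i] == 1 and pos_rate[leaf_ids[i]] <= 0.05]
print(f"{len(suspects)} suspected fraud/error cases flagged")
```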

Does this make sense?

As for the decision tree leaves where the percentages of target = 1 and target = 0 are close, e.g. 43% vs. 57%, what can I do?

Based on @Zhubarb's suggestion, I made the following changes to the model and its interpretation:

  1. Set a minimum leaf size, e.g. at least 100 observations per leaf.
  2. My goal is to identify fraud at the organization level instead of the individual observation level. Considering the prior probability problem that Zhubarb mentioned, I now interpret the decision tree model in a new way: take a leaf with 30% target = 1 and 70% target = 0, for example, and treat 30% as the expected positive percentage for the individuals in this leaf. Then look at each organization: if an organization's positive percentage within this leaf is much higher than 30%, it can be considered a suspect organization. In this way, every leaf can be used, regardless of whether its positive percentage is high or low (see the sketch below).
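A rough sketch of this organization-level reading, again on simulated data; the organization column and the 2x flagging factor are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 10))          # stand-in features
y = rng.binomial(1, 0.1, size=n)      # stand-in target
org = rng.integers(0, 50, size=n)     # hypothetical organization ids

# Point 1 above: enforce a minimum leaf size of 100 observations.
clf = DecisionTreeClassifier(min_samples_leaf=100, random_state=0)
clf.fit(X, y)

df = pd.DataFrame({"leaf": clf.apply(X), "org": org, "y": y})

# Expected positive rate per leaf vs. each organization's rate in that leaf.
leaf_rate = df.groupby("leaf")["y"].transform("mean")
org_rate = df.groupby(["leaf", "org"])["y"].transform("mean")

# Flag organizations whose in-leaf positive rate is far above the leaf's
# baseline; the 2x factor is an arbitrary illustration.
flags = (df.assign(leaf_rate=leaf_rate, org_rate=org_rate)
           .loc[lambda d: d["org_rate"] > 2 * d["leaf_rate"],
                ["leaf", "org"]]
           .drop_duplicates())
print(flags.head())
```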

Any comments are appreciated!

Best Answer

Just a couple of remarks that may be helpful:

As far as I know, decision trees are not traditionally used for anomaly detection. Support Vector Machines, Artificial Neural Networks, Gaussian Mixture Models, and Bayesian Networks are the machine learning methodologies more commonly used for this purpose. You can have a look at this paper for further reading.
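To illustrate one of these alternatives, here is a minimal sketch of a Gaussian Mixture Model approach: fit the mixture to the presumed-normal (negative) cases and flag observations with low likelihood under it. The component count and percentile cutoff are arbitrary choices for demonstration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

X, y = make_classification(n_samples=5_000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# Model the presumed-normal (negative) cases only.
gmm = GaussianMixture(n_components=5, random_state=0)
gmm.fit(X[y == 0])

# Score everything; observations with very low log-likelihood under the
# 'normal' model are anomaly candidates. The 1st-percentile cutoff is
# an arbitrary choice for illustration.
log_lik = gmm.score_samples(X)
threshold = np.percentile(log_lik, 1)
anomalies = np.where(log_lik < threshold)[0]
print(f"{len(anomalies)} anomaly candidates")
```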

What you describe can be used to highlight the 'unlikely' cases in the leaf nodes. However, bear in mind that, depending on your feature dimensionality and training data size, you may end up with leaf nodes that contain very few observations, e.g. 2 positives and 0 negatives. In that case, it is debatable whether it would be wise to label a 'negative' observation that has the variable combination of that particular leaf node as 'potentially fraudulent'.

Similarly, as you point out, when labelling is based on the dominant class of a leaf node and the class frequencies are close, e.g. 43% positive and 57% negative, it may not make much sense to infer that any observation in that leaf is an anomaly (e.g. fraudulent).

Furthermore, the prior distribution of your class labels should also inform your decisions. For instance, if 90% of your observations are initially labelled as negative, any inferences you make from the posterior distributions should take this inherent bias into account (not only when detecting anomalies, but also when evaluating the performance of your classifier in the first place).
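As a toy illustration of this last point (numbers invented): with a 10% base rate of positives, a leaf showing 30% positives is a 3x enrichment over the prior, while the raw 30% figure alone tells you little:

```python
# Toy numbers: base rate vs. a leaf's positive rate. The 'lift' (their
# ratio) is one simple way to factor the prior into the interpretation.
base_rate = 0.10   # 10% positives overall (the prior)
leaf_rate = 0.30   # 30% positives in some leaf
lift = leaf_rate / base_rate
print(f"lift = {lift:.1f}x over the base rate")  # lift = 3.0x

# The same prior matters for evaluation: always predicting 'negative'
# is already 90% accurate here, so raw accuracy is uninformative.
```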