Solved – How to improve a Fraud Classification Model

fraud-detection, logistic, machine learning, normalization

I built a classification model (Logistic Regression) to classify transactions as Fraud or Not Fraud.
The data comes from online CNP (Card Not Present) transactions. After choosing some parameters that seemed related to fraud, I tested the model.
For this I used a training set of 225,000 examples and a test set of 75,000.

I conducted two different tests. In the first one I used 7 parameters and obtained 96% overall classification accuracy. The problem is that the number of Fraud cases is much lower than the number of Not Fraud cases, so on the Fraud cases I got an accuracy of only 11%, while on the Not Fraud cases it was 90-something percent.

In the second test I included more parameters, for a total of 15.
With the same training and test set sizes, the overall accuracy dropped to 92%, but accuracy on the Fraud cases improved to 30%, with Not Fraud still around 90%.

I would like to keep the overall classification accuracy around 90% and improve the accuracy on Fraud cases to something like 65-75%, but I can't find any more parameters that seem relevant to add to the model, and beyond that no more ideas come to mind.
Can someone please give me some hints or ideas on what to try next to achieve these goals?

I also have another doubt. Because the values of the parameters I am using span a very wide range, I applied Feature Scaling and Mean Normalization to them. I have 300,000 example samples (training set: 225,000; test set: 75,000). My question is: should I calculate each column's average and max - min separately for each set in order to rescale the values, or should I calculate the average and max - min from the whole sample (all 300,000 examples)?

Best Answer

What proportion of your 225,000 training samples are cases of fraud? I suspect very few. This will cause problems unless you take care in how you build a classifier from the logistic regression.

Given the issue you've described, I assume you are classifying using a probability cut-off of $0.5$ from the logistic regression. You either need to choose a more appropriate cut-off, weight the fraud samples (a weight of $19$ would be appropriate given you have $19$ times as many not-fraud cases in your data set), or discard many of the not-fraud cases so you have a balanced data set.
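For concreteness, here is a minimal sketch of the weighting and cut-off options with scikit-learn. The synthetic data, the 19:1 weighting, and the 0.1 cut-off are only placeholders standing in for your real transactions; the cut-off in particular should be tuned on a validation set, not copied from here.

```python
# Minimal sketch: class weighting vs. a lower probability cut-off.
# The synthetic data only stands in for the real CNP transaction features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Imbalanced toy data: ~5% positives, roughly the 19:1 ratio assumed above.
X, y = make_classification(n_samples=30_000, n_features=15,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Option 1: weight the fraud class (19:1) instead of moving the cut-off.
weighted = LogisticRegression(class_weight={0: 1, 1: 19}, max_iter=1000)
weighted.fit(X_train, y_train)
print(classification_report(y_test, weighted.predict(X_test)))

# Option 2: keep the unweighted model but classify with a cut-off below 0.5.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
fraud_prob = plain.predict_proba(X_test)[:, 1]
y_pred = (fraud_prob >= 0.1).astype(int)  # 0.1 is illustrative; tune on a validation set
print(classification_report(y_test, y_pred))
```

Note that `class_weight="balanced"` would compute the weights from the observed class frequencies automatically, which is safer if the 19:1 guess is off.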

As for your second question, to properly assess the out-of-sample performance I would calculate the average/min/max of each column using your training set only. As an alternative to scaling, you might consider binning the relevant variables.
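As a small illustration of that point (the array shapes and random values below are only placeholders for your real features), the normalization statistics are computed once on the training set and then reused, unchanged, on the test set:

```python
import numpy as np

# Toy stand-ins for the 225,000 / 75,000 feature matrices; only the scaling logic matters here.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=100.0, scale=50.0, size=(2250, 15))
X_test = rng.normal(loc=100.0, scale=50.0, size=(750, 15))

# Mean normalization, (x - mean) / (max - min), using statistics from the training set only.
col_mean = X_train.mean(axis=0)
col_range = X_train.max(axis=0) - X_train.min(axis=0)

X_train_scaled = (X_train - col_mean) / col_range
X_test_scaled = (X_test - col_mean) / col_range  # reuse training-set statistics; do not recompute on the test set
```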