Solved – Detect outliers (anomalies) in salary data

I have a dataset with over 10,000 employees and I'm interested in performing some tests to identify salary anomalies. For example, paychecks for some employees might look like this:

Employee 1 - Jan:
Salary Symbol    Salary Code    Amount
Regular Sal      400            $1000
Health care      401            $300
Dental           402            $100
Tax              403            -$100
Net Pay          404            $1300

Employee 2 - Jan:
Salary Symbol    Salary Code    Amount
Regular Sal      400            $1500
Health care      401            $200
Dental           402            $200
Tax              403            -$200
Net Pay          404            $1700

Employee 1 - Feb:
Salary Symbol    Salary Code    Amount
Regular Sal      400            $1500
Health care      401            $200
Dental           402            $1000
Tax              403            -$200
Net Pay          404            $2500

I have paycheck data, broken down by symbol, for all 10,000 employees from January to December. I'm interested in finding (Employee, Symbol) pairs that present an anomaly.

For example, in the made-up paychecks above, (Employee 1, Dental (402)) might be an anomaly.

I can look at the mean, but I thought it might be better to compare variances: take the variance of a Salary Symbol across all employees, compare it with the variance of the same symbol for a specific employee, and test how different they are using a parametric or non-parametric version of Levene's test.

For example, for the "Dental" Symbol:

Dental - Average for the population (calculated without Employee 1):

Symbol  Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Dental  500   340   500   340   500   340   500   340   500   340   500   340

Dental - For Employee 1:

Symbol  Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Dental  100   1000  300   243   523   240   542   543   131   334   543   124

The idea is to test whether the "Dental" symbol for Employee 1 behaves like the "Dental" symbol for the population, and of course to repeat that test for every other employee.

Would the chi-squared test be appropriate?
What do you think is a better way?
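The variance comparison described above can be sketched as follows. This is only an illustration, not a recommended procedure: it computes the median-centred variant of Levene's statistic (often called Brown-Forsythe) in plain Python for the two "Dental" rows shown above. The function name `brown_forsythe_w` and the constant `F_CRIT_1_22` are made up for this sketch. Note also that the population row holds monthly averages rather than individual paychecks, so its spread understates the spread between individual employees, and that repeating the test for 10,000 employees and several symbols will produce many chance "hits".

```python
from statistics import mean, median

def brown_forsythe_w(*groups):
    """Median-centred Levene (Brown-Forsythe) statistic W for k groups.

    Raises ZeroDivisionError if every group has identical absolute
    deviations from its own median (no within-group spread).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each value from its own group's median
    z = [[abs(x - median(g)) for x in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    # Between-group and within-group sums of squares of the deviations
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# "Dental" rows from the question (Jan..Dec)
population_avg = [500, 340] * 6
employee_1 = [100, 1000, 300, 243, 523, 240, 542, 543, 131, 334, 543, 124]

w = brown_forsythe_w(population_avg, employee_1)

# Approximate upper 5% point of F(1, 22), taken from standard F tables
F_CRIT_1_22 = 4.30
print(f"W = {w:.2f}, exceeds 5% critical value: {w > F_CRIT_1_22}")
```

Under the null hypothesis of equal spread, W is compared with an F distribution on (k - 1, N - k) degrees of freedom, which is (1, 22) here. A large W only says the spreads differ; it does not say why.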

Best Answer

"Outlier" is not a well-defined term (there are in fact many different definitions), so the question as asked does not have a single answer. An appropriate answer has to fit the use you will make of the results.

You have said in a comment that the use is "to report possible fraud to management," so you need to think carefully about what a case of possible fraud might look like. You also need to think about the consequences of "false positives": if you reported a case as possible fraud but there was in fact no fraud, what harm (i.e., serious "side effects") might follow?

It sounds like you are looking for some handy formula. Sorry, but this is a case where you really need to think about what is appropriate for this particular use. It's not a problem with a ready-made standard solution.

It might be worth looking up the literature on fraud detection to get some ideas. Even then, they are just ideas: you still need to think carefully about whether they fit your situation, and about the possible consequences of identifying "possible fraud" when there is no fraud. The literature on medical testing would also be good to study (concepts such as sensitivity, specificity, positive predictive value, and the receiver operating characteristic curve).
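To make the false-positive concern concrete, here is a minimal sketch of how positive predictive value follows from sensitivity, specificity, and the base rate of fraud. All three input numbers are invented for illustration and are not estimates for any real payroll; the function name is likewise just a label for this example.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(fraud | flagged), by Bayes' rule."""
    true_pos = sensitivity * prevalence               # flagged and fraudulent
    false_pos = (1 - specificity) * (1 - prevalence)  # flagged but honest
    return true_pos / (true_pos + false_pos)

# Hypothetical figures: 0.5% of 10,000 employees commit fraud,
# the test catches 90% of them and wrongly flags 5% of honest employees.
n_employees = 10_000
prevalence = 0.005
sensitivity = 0.90
specificity = 0.95

ppv = positive_predictive_value(sensitivity, specificity, prevalence)
expected_false_alarms = n_employees * (1 - prevalence) * (1 - specificity)

print(f"PPV = {ppv:.3f}")
print(f"Expected honest employees flagged: {expected_false_alarms:.0f}")
```

With these made-up rates, fewer than one in ten flagged employees would actually be a fraud case, even though the test looks accurate on paper. That is why the cost of a false accusation has to be weighed before choosing a threshold.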
