I have a dataset with over 10,000 employees and I'm interested in performing some tests to identify salary anomalies. For example, paychecks for some employees might look like this:
Employee 1 - Jan:
Salary Symbol   Salary Code   Amount
Regular Sal     400            $1000
Health care     401             $300
Dental          402             $100
Tax             403            -$100
Net Pay         404            $1300
Employee 2 - Jan:
Salary Symbol   Salary Code   Amount
Regular Sal     400            $1500
Health care     401             $200
Dental          402             $200
Tax             403            -$200
Net Pay         404            $1700
Employee 1 - Feb:
Salary Symbol   Salary Code   Amount
Regular Sal     400            $1500
Health care     401             $200
Dental          402            $1000
Tax             403            -$200
Net Pay         404            $2500
I have paycheck data, broken down by symbol, for 10,000 employees from January to December. I'm interested in finding (Employee, Symbol) pairs that present an anomaly.
For example, in the paychecks I made up, (Employee 1, Dental (402)) might be an anomaly.
I could look at the mean, but I thought it might be better to compare the variance of a Salary Symbol across all employees with the variance of that symbol for a specific employee, and to test how different they are using a parametric or non-parametric version of Levene's test.
For example, for the "Dental" Symbol:
Dental - average for the population (calculated without Employee 1):
Symbol   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Dental   500   340   500   340   500   340   500   340   500   340   500   340
Dental - for Employee 1:
Symbol   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Dental   100  1000   300   243   523   240   542   543   131   334   543   124
The idea is to test whether the "Dental" symbol for Employee 1 behaves like the "Dental" symbol for the population, and of course to repeat that test for all other employees.
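To make the variance comparison concrete, here is a minimal sketch of the median-centred variant of Levene's test (often called the Brown-Forsythe test), applied to the two made-up "Dental" rows. The function name `brown_forsythe_w` is my own for illustration, and only the test statistic is computed; a p-value would come from an F distribution with (k - 1, N - k) degrees of freedom, which `scipy.stats.levene` with `center='median'` reports directly. Note the big assumption: the population row holds monthly averages, which are smoother than any individual's series, so this comparison is illustrative rather than a sound test.

```python
from statistics import mean, median


def brown_forsythe_w(*groups):
    """Median-centred Levene (Brown-Forsythe) statistic W for k groups.

    Raises ZeroDivisionError if every group has identical absolute
    deviations, because the within-group spread is then zero.
    """
    # Absolute deviation of each observation from its own group's median.
    deviations = []
    for group in groups:
        centre = median(group)
        deviations.append([abs(x - centre) for x in group])

    k = len(deviations)
    n_total = sum(len(d) for d in deviations)
    group_means = [mean(d) for d in deviations]
    grand_mean = sum(sum(d) for d in deviations) / n_total

    # Spread of the group mean deviations around the grand mean deviation.
    between = sum(
        len(d) * (m - grand_mean) ** 2 for d, m in zip(deviations, group_means)
    )
    # Spread of the individual deviations around their own group mean.
    within = sum(
        (v - m) ** 2 for d, m in zip(deviations, group_means) for v in d
    )
    return (n_total - k) / (k - 1) * between / within


# The two made-up "Dental" rows from the question.
population_avg = [500, 340, 500, 340, 500, 340, 500, 340, 500, 340, 500, 340]
employee_1 = [100, 1000, 300, 243, 523, 240, 542, 543, 131, 334, 543, 124]

w = brown_forsythe_w(population_avg, employee_1)
print(f"Brown-Forsythe W = {w:.2f}")
```

Running this for every (Employee, Symbol) pair means a very large number of tests, so some pairs would be flagged by chance alone.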
- Would the chi-squared test be appropriate?
- What do you think is a better way?
Best Answer
"Outlier" is not a well-defined term (there are in fact many different definitions), so the question as asked does not have a single answer. However, an appropriate answer needs to be one that fits the use you will make of the results.
You have said in a comment that the use is "to report possible fraud to management," so you need to think carefully about what a case of possible fraud might look like. You also need to think about the consequences of "false positives": if you reported a case as possible fraud but there was in fact no fraud, what might the deleterious consequences (i.e., serious "side effects") of that be?
It sounds like you are looking for some handy formula. Sorry, this is a case where you really need to think about what is appropriate for this particular use. It's not a problem with a ready-made standard solution.
However, it might be worth looking up the literature on fraud detection to get some ideas. But even if you find some ideas there, they are just ideas: you still need to think carefully about whether or not they fit your situation, and about the possible consequences of identifying "possible fraud" when there is no fraud. The literature on medical testing would also be good to study (concepts such as sensitivity, specificity, positive predictive value, and the receiver operating characteristic (ROC) curve).
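To illustrate why the false-positive question matters, here is a small sketch of how sensitivity, specificity, and positive predictive value are computed from the four outcome counts of a screening rule. The function name `screening_metrics` and all of the counts are hypothetical, invented purely for illustration; they are not derived from the payroll data in the question.

```python
def screening_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, positive predictive value).

    tp: flagged cases that really are fraud
    fp: flagged cases that are not fraud
    fn: fraud cases the rule missed
    tn: non-fraud cases correctly left unflagged
    """
    sensitivity = tp / (tp + fn)  # share of real fraud that gets flagged
    specificity = tn / (tn + fp)  # share of honest cases left alone
    ppv = tp / (tp + fp)          # share of flagged cases that are real
    return sensitivity, specificity, ppv


# Hypothetical counts for 10,000 employees, assuming 50 real fraud cases.
sens, spec, ppv = screening_metrics(tp=45, fp=500, fn=5, tn=9450)
print(f"sensitivity = {sens:.3f}")
print(f"specificity = {spec:.3f}")
print(f"PPV         = {ppv:.3f}")
```

With these made-up counts the rule catches 90% of the fraud and clears about 95% of honest employees, yet only 45 of the 545 flagged cases (roughly 8%) are real. When fraud is rare, most reports to management would be false alarms even for a rule that looks accurate.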