I am working on a project where I need to detect and classify data that doesn't follow the required standards and parameters. I have past data that I will train my model with. I know which records are wrong and why, so I have labeled each one with an isWrong column set to either 0 or 1.
I know this is a supervised machine learning project, more specifically, a binary classification problem.
I decided to first work with dummy data before going into the real data to correctly understand the functionality of these algorithms.
I have the following data set:
Name Age Office# SSN isWrong

Name: String
Age: int
Office#: int
SSN: int
isWrong: int
A record is wrong if the person's Age is over 60 or under 18. A record is also wrong if the Office# is negative or greater than 2.
Either one of those conditions causes the record to be labeled as wrong.
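For reference, the labeling rule itself is deterministic, so it can be written down directly as a plain check. This is just a sketch (the method and parameter names are my own, chosen to match the schema above), useful as a ground-truth oracle when inspecting the model's predictions:

```java
// Sketch of the deterministic labeling rule described above.
// Parameter names (age, office) are assumptions matching the schema.
public class RecordRule {
    static int isWrong(int age, int office) {
        boolean badAge = age < 18 || age > 60;       // outside [18, 60]
        boolean badOffice = office < 0 || office > 2; // outside [0, 2]
        return (badAge || badOffice) ? 1 : 0;
    }

    public static void main(String[] args) {
        // (age, office) pairs taken from the sample data in this post
        System.out.println(isWrong(32, 1)); // F1: valid record
        System.out.println(isWrong(61, 2)); // F2: age over 60
        System.out.println(isWrong(34, 4)); // F11: office greater than 2
        System.out.println(isWrong(16, 2)); // F23: age under 18
    }
}
```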
I've tried both Logistic Regression and Naïve Bayes. I trained my model on this data set and then tested it on another data set that follows the same schema (without the isWrong column) and the same pattern. However, when I print the prediction column, the model catches at most three out of ten bad records. In other words, I don't get the predictions I wanted: the model fails to classify some wrong data as wrong, assigning a label of 0 to records that are obviously wrong.
Why is this? Is there something I need to specify before going into my training and testing part?
For training I've done something as simple as this:
// Spark ML (Java): the estimator also expects a "features" vector column,
// typically assembled from the numeric columns with a VectorAssembler.
LogisticRegression lr = new LogisticRegression().setLabelCol("isWrong");
LogisticRegressionModel model = lr.fit(dataset);
EDIT: Here is a sample of the dummy example data (top 25 rows):
Name Office Age SSN isAnomaly
F1 1 32 42376794 0
F2 2 61 72353599 1
F3 1 35 49001025 0
F4 0 25 25613860 0
F5 1 36 85720367 0
F6 2 52 85725161 0
F7 2 48 78016430 0
F8 2 53 89762357 0
F9 2 51 84866475 0
F10 0 24 24062706 0
F11 4 34 46233821 1
F12 0 21 17686431 0
F13 2 51 85547952 0
F14 0 27 32159403 0
F15 1 37 52685792 0
F16 1 35 47836229 0
F17 1 67 52770341 1
F18 0 21 17104127 0
F19 1 41 63042591 0
F20 1 36 50374422 0
F21 0 25 26303816 0
F22 0 22 19645563 0
F23 2 16 69963996 1
F24 0 62 19884506 1
F25 0 26 28498270 0
Best Answer
For (standard) logistic regression, there is an assumption that the true decision boundary is linear. This isn't the case with your data, because the chance of being "wrong" does not strictly increase with Age or with Office#, since you have upper and lower limits on both.

Using logistic regression after adding squared terms should improve your classifier (approximating a square decision boundary with an ellipse), though a better model would be something like nearest neighbors or decision trees/random forests; the latter would also quickly rule out the other predictors making a difference. (Incidentally, the predictors are not independent, as naive Bayes assumes, since SSN should be highly correlated with Age.)
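To see why squared terms help: the rule "18 ≤ Age ≤ 60" defines an interval, and a linear function of Age alone cannot be positive inside an interval yet negative on both sides of it, while a quadratic can. A small plain-Java illustration follows; the coefficients here are hand-picked to show the shape of the boundary, not learned by any fitting procedure:

```java
// Illustration: with an Age^2 feature, a score that is linear in
// (age, age^2) can carve out the interval [18, 60]:
//   score(age) = -(age - 18)(age - 60) = -age^2 + 78*age - 1080
// Coefficients are hand-picked for the demo, not fitted.
public class QuadraticBoundary {
    static double score(double age) {
        double w1 = 78.0;    // weight on the age feature
        double w2 = -1.0;    // weight on the age^2 feature
        double b = -1080.0;  // intercept
        return w2 * age * age + w1 * age + b;
    }

    static boolean predictedOk(double age) {
        return score(age) > 0; // positive strictly inside (18, 60)
    }

    public static void main(String[] args) {
        System.out.println(predictedOk(30)); // inside the interval
        System.out.println(predictedOk(17)); // below the lower limit
        System.out.println(predictedOk(61)); // above the upper limit
    }
}
```

A logistic regression given both age and age² as features can learn weights of exactly this shape, which is why adding the squared term lets a linear-in-features model approximate the interval rule.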