Solved – Problem with classification machine learning: prediction is always wrong

classification, machine learning, pattern recognition, spark-mllib

I am working on a project where I need to detect and classify records that don't follow the required standards and parameters. I have historical data to train my model with. I know which records are wrong and why, so I have labeled each one with an isWrong column set to either 0 or 1.
I know this is a supervised machine learning project, more specifically, a binary classification problem.

To properly understand how these algorithms behave, I decided to work with dummy data first before moving on to the real data.
I have the following data set:

Name    Age    Office#   SSN   isWrong

Name: String
Age: int
Office: int
SSN: int
isWrong: int

A record is wrong if the person's Age is over 60 or under 18. A record is also wrong if the Office# is negative or greater than 2.
Either condition alone is enough for the record to be labeled as wrong.
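
A rule this crisp can also be expressed directly as a Spark SQL column expression. Here is a sketch using the Java API, assuming the data is already loaded as a Dataset&lt;Row&gt; named dataset:

    import static org.apache.spark.sql.functions.*;

    // Label a record 1 if any rule is violated, 0 otherwise.
    dataset = dataset.withColumn("isWrong",
            when(col("Age").lt(18).or(col("Age").gt(60))
                    .or(col("Office").lt(0)).or(col("Office").gt(2)), 1)
                    .otherwise(0));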

I've tried both Logistic Regression and Naïve Bayes. I trained my model on this data set and then tested it on another data set that follows the same schema (without the isWrong column) and the same pattern. However, when I print the prediction column, at most three out of ten wrong records are flagged. In other words, I don't get the predictions I wanted: the model fails to classify some wrong records as wrong, giving a label of 0 to obviously bad rows.

Why is this? Is there something I need to specify before going into my training and testing part?

For training I've done something as simple as this:

LogisticRegression lr = new LogisticRegression().setLabelCol("isWrong");
LogisticRegressionModel model = lr.fit(dataset); // fit() returns the trained model
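
In full, the training step in Spark MLlib usually also needs the predictors assembled into a single vector column. A minimal end-to-end sketch in the Java API, assuming dataset holds the labeled training data and testData (a hypothetical name) holds the unlabeled data to score:

    import org.apache.spark.ml.classification.LogisticRegression;
    import org.apache.spark.ml.classification.LogisticRegressionModel;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // MLlib estimators read predictors from a single vector column.
    VectorAssembler assembler = new VectorAssembler()
            .setInputCols(new String[] {"Age", "Office", "SSN"})
            .setOutputCol("features");
    Dataset<Row> train = assembler.transform(dataset);

    LogisticRegression lr = new LogisticRegression()
            .setLabelCol("isWrong")
            .setFeaturesCol("features");
    LogisticRegressionModel model = lr.fit(train);

    // The test data must go through the same assembler before scoring.
    Dataset<Row> scored = model.transform(assembler.transform(testData));
    scored.select("prediction").show();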

EDIT: Here is a sample of the dummy example data (top 25 rows):

Name    Office  Age    SSN     isWrong
  F1        1   32  42376794    0
  F2        2   61  72353599    1
  F3        1   35  49001025    0
  F4        0   25  25613860    0
  F5        1   36  85720367    0
  F6        2   52  85725161    0
  F7        2   48  78016430    0
  F8        2   53  89762357    0
  F9        2   51  84866475    0
 F10        0   24  24062706    0
 F11        4   34  46233821    1
 F12        0   21  17686431    0
 F13        2   51  85547952    0
 F14        0   27  32159403    0
 F15        1   37  52685792    0
 F16        1   35  47836229    0
 F17        1   67  52770341    1
 F18        0   21  17104127    0
 F19        1   41  63042591    0
 F20        1   36  50374422    0
 F21        0   25  26303816    0
 F22        0   22  19645563    0
 F23        2   16  69963996    1
 F24        0   62  19884506    1
 F25        0   26  28498270    0

Best Answer

Standard logistic regression assumes that the true decision boundary is linear in the features. That isn't the case with your data: the probability of being "wrong" does not increase monotonically with Age or with Office#, because both have an upper and a lower limit.

Adding squared terms and then using logistic regression should improve your classifier, since the quadratic terms let it approximate the rectangular decision boundary with an ellipse. A better model, though, would be something like nearest neighbors or decision trees/random forests; the tree-based models would also quickly rule out the other predictors as making any difference. (Incidentally, the predictors are not conditionally independent, as naive Bayes assumes: SSN appears to be highly correlated with Age in your sample.)
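
To make both suggestions concrete, here is a sketch in the Spark MLlib Java API. Column names follow the question's schema; train is assumed to be the labeled training Dataset&lt;Row&gt; with its raw columns:

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.ml.classification.DecisionTreeClassifier;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Option 1: add squared terms so logistic regression can trace an
    // elliptical boundary around the valid Age/Office box.
    Dataset<Row> withSquares = train
            .withColumn("AgeSq", col("Age").multiply(col("Age")))
            .withColumn("OfficeSq", col("Office").multiply(col("Office")));
    VectorAssembler assembler = new VectorAssembler()
            .setInputCols(new String[] {"Age", "AgeSq", "Office", "OfficeSq"})
            .setOutputCol("features");
    Dataset<Row> quadTrain = assembler.transform(withSquares);

    // Option 2: a decision tree learns the axis-aligned thresholds
    // (Age < 18, Age > 60, Office > 2, ...) directly; no squared terms needed.
    DecisionTreeClassifier dt = new DecisionTreeClassifier()
            .setLabelCol("isWrong")
            .setFeaturesCol("features");
    System.out.println(dt.fit(quadTrain).toDebugString());

Printing the fitted tree's debug string is a quick way to see whether it recovered the Age and Office# thresholds and ignored SSN.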
