I have a dataset with 1000 rows and an identifier variable called person_id.
Why is it commonly advised to remove identifier variables from an ML model?
I ask because in my dataset the same person can occur multiple times, meaning that the same person has consulted the doctor multiple times. So we have multiple observations of that person.
So, my questions are as follows:
a) Why should we remove identifier variables from an ML model? If we should not, how useful are they?
b) If we remove the identifier variable, how will I link the predicted outcome with the person_id?
Let's say my algorithm predicts that a specific instance is positive. If I remove the person_id column, how can I know which person (say, with id = blabla) this prediction is for, since we would have removed their id?
c) Even if an identifier column is completely unique (like a serial number), should we still remove it from the model?
Best Answer
Quick answer: Keep it for identification, but don't pass it into the ML model.
(c & a) We should. As I see it, we train a model to generalize across all persons. If you treat person_id as a categorical feature, then because no two persons share the same person_id, one person_id says nothing about any other person, so it has no power to generalize. If you treat it as a continuous feature, you will have a hard time justifying the ordering it implies, i.e. why person A should be followed by person B rather than person C.
(b) Just don't give it to the ML model; keep it somewhere else :)
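One way to picture point (b) is the minimal sketch below, in plain Python. The records, the feature names (age, num_visits) and the predict function are all made up for illustration; predict is only a stand-in for whatever trained model you actually use. The idea is simply that the identifier is set aside before the features reach the model, and row order is used to line predictions back up with ids.

```python
# Hypothetical records: person_id identifies the patient, the rest are features.
rows = [
    {"person_id": "P001", "age": 64, "num_visits": 5},
    {"person_id": "P002", "age": 31, "num_visits": 1},
    {"person_id": "P001", "age": 65, "num_visits": 6},  # same person, later visit
]

FEATURES = ["age", "num_visits"]  # person_id is deliberately absent

# Keep the identifiers aside and build feature vectors without them.
ids = [row["person_id"] for row in rows]
X = [[row[name] for name in FEATURES] for row in rows]


def predict(features):
    """Stand-in for a trained model's predict(); a made-up rule, not a real classifier."""
    return [1 if age > 60 and visits >= 5 else 0 for age, visits in features]


predictions = predict(X)

# Row order is preserved, so the i-th prediction belongs to the i-th identifier.
results = list(zip(ids, predictions))
for person_id, label in results:
    print(person_id, "positive" if label == 1 else "negative")
```

If the data lives in a pandas DataFrame, the same pattern would presumably be to build the features with df.drop(columns=["person_id"]) and keep df["person_id"] to attach to the predictions afterwards.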