Machine Learning – Usefulness of Identifier Variables in Model Building

classification, hypothesis testing, machine learning, neural networks

I have a dataset with 1000 rows and I have an identifier variable called person_id.

Why are we told to remove identifier variables from an ML model?

I ask because in my dataset the same person can occur multiple times, meaning the same person has consulted the doctor several times, so we have multiple observations of that person.

So, my questions are as follows:

a) Why should we remove identifier variables from an ML model? If we keep them, how useful are they?

b) If we remove the identifier variable, how will I link the predicted outcome with the person_id?

Let's say my algorithm predicts that a specific instance is positive. If I have removed the person_id column, how can I know which person (with id = blabla) the prediction belongs to?

c) Even if an identifier column is completely unique (like a serial number), should we still remove it from the model?

Best Answer

Quick answer: Keep it for identification, but don't pass it into the ML model.

(a and c) Yes, we should. We train a model so that it generalizes across all of your persons. If you treat person_id as a categorical feature, no two persons share the same value, so one person's ID says nothing about another person and the feature cannot help the model generalize. If you treat it as a continuous feature, you will have a hard time justifying the ordering it imposes, i.e. why person A should be followed by person B rather than person C.
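To make the categorical case concrete, here is a minimal sketch in plain Python. The ID values and the helper names `fit_one_hot` and `one_hot` are made up for illustration. It one-hot encodes the IDs seen in training and then encodes a person the model has never seen. Under this particular encoding the new person gets an all-zero row, so the ID columns carry no information about them.

```python
def fit_one_hot(ids):
    """Map each distinct training ID to a column index."""
    return {pid: col for col, pid in enumerate(sorted(set(ids)))}


def one_hot(pid, vocab):
    """Encode one ID as a one-hot row; IDs not seen in training get all zeros."""
    row = [0] * len(vocab)
    if pid in vocab:
        row[vocab[pid]] = 1
    return row


# Hypothetical person_id values; "p002" visited the doctor twice.
train_ids = ["p001", "p002", "p002", "p003"]
vocab = fit_one_hot(train_ids)

seen = one_hot("p002", vocab)    # a person present in the training data
unseen = one_hot("p999", vocab)  # a new person at prediction time

print(seen)    # [0, 1, 0]
print(unseen)  # [0, 0, 0]
```

Whatever weights a model learns for these columns describe only the individuals it was trained on, which is the lack of generalization described above.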

(b) Just don't give it to the ML model. Keep it somewhere else, in the same row order as the features, so each prediction can be matched back to its person_id :)
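One way to do this, sketched in plain Python: split the IDs off before the model sees the data, then zip them back onto the predictions. The column names `age` and `systolic_bp`, the sample records, and the `predict` function are all hypothetical stand-ins, not part of the original question. Any real model that returns one prediction per input row, in order, would slot in the same way.

```python
# Hypothetical records: one row per consultation, so person_id can repeat.
records = [
    {"person_id": "p001", "age": 34, "systolic_bp": 118},
    {"person_id": "p002", "age": 61, "systolic_bp": 152},
    {"person_id": "p002", "age": 61, "systolic_bp": 139},
]
FEATURES = ["age", "systolic_bp"]  # person_id is deliberately left out

# Keep the identifiers aside, in the same order as the feature rows.
ids = [r["person_id"] for r in records]
X = [[r[f] for f in FEATURES] for r in records]


def predict(rows):
    """Stand-in for a trained model: flags a row when systolic_bp >= 140."""
    return [int(row[1] >= 140) for row in rows]


preds = predict(X)

# Row order is unchanged, so predictions line up with the saved IDs.
results = [{"person_id": pid, "prediction": p} for pid, p in zip(ids, preds)]
for r in results:
    print(r)
```

The model only ever receives `X`. The link back to each person relies on the rows not being reordered between splitting off `ids` and calling `predict`.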
