Solved – Why would somebody use a hash function to create a test/train split instead of a random seed?

Tags: hash, hypothesis testing, machine learning, train

I'm going through some ML training material from Google (I can't post a link because I'm getting the material through my company).

In the part about how to extract data and split it into train and test sets, they're using a hash function on one of the data fields to provide a deterministic and repeatable test/train split instead of a random one.

But can't the same thing be accomplished with a random.seed function?

Moreover, using a hash function would mean we can no longer use the field on which the hash was generated (which might be potentially useful for a model) or it might be inserting some unknown bias into the model?

What advantage does using a hash function have over using random seed?

Best Answer

But can't the same thing be accomplished with a random.seed function? ... What advantage does using a hash function have over using random seed?

Sampling is less straightforward when the entire dataset does not fit in memory. In the context of a DBMS, this article suggests that using RAND() with a seed may not be reproducible when writing SQL. The database executes queries in parallel and does not guarantee the order in which rows are returned (unless you add an ORDER BY clause, which might be expensive), so the same seed can pair different random numbers with the same row on different runs. The author of the article gets around this problem by hashing one of the date fields in each row.
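To make the ordering problem concrete, here is a minimal Python sketch (not the article's SQL). The row layout, the `id` and `date` field names, and the 80/20 ratio are all assumptions made up for illustration. It compares a seeded RNG that draws one number per row in arrival order against a split derived from a hash of one field.

```python
import hashlib
import random


def seeded_split(rows, seed=42, test_fraction=0.2):
    """Draw one number per row from a seeded RNG, in arrival order."""
    rng = random.Random(seed)
    return {
        row["id"]: "test" if rng.random() < test_fraction else "train"
        for row in rows
    }


def hashed_split(rows, key="date", test_fraction=0.2):
    """Derive the split from a hash of one field; arrival order is irrelevant."""
    assignment = {}
    for row in rows:
        digest = hashlib.sha256(str(row[key]).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
        assignment[row["id"]] = "test" if bucket < test_fraction * 100 else "train"
    return assignment


# Hypothetical rows; "id" and "date" are illustrative field names.
rows = [{"id": i, "date": f"2020-01-{i % 28 + 1:02d}"} for i in range(1000)]

# Simulate a database returning the same rows in a different order.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

print(hashed_split(rows) == hashed_split(shuffled))  # True
print(seeded_split(rows) == seeded_split(shuffled))
```

The hashed assignment depends only on each row's own content, so it survives reordering. The seeded assignment depends on the position at which a row happens to arrive, so the second comparison will almost certainly print False. Note that in this sketch all rows sharing a `date` value land in the same split, because the hash is a function of that field alone.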

One other plausible use case would be when dealing with files. If I have a huge directory of images that I want to use for training/testing, it might be easier to work with a hash of the filename rather than trying to maintain a reproducible ordering of the files.
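As a rough sketch of the file case, the function below maps a filename to a split using a SHA-1 digest of its base name. The function name `split_for`, the 10/10/80 percentages, and the example paths are assumptions chosen for illustration, not something taken from the training material.

```python
import hashlib
from pathlib import Path


def split_for(filename, validation_pct=10, test_pct=10):
    """Map a filename to 'validation', 'test' or 'train' via a hash of its base name."""
    name = Path(filename).name  # ignore the directory part of the path
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    if bucket < validation_pct:
        return "validation"
    if bucket < validation_pct + test_pct:
        return "test"
    return "train"


# Hypothetical paths; the same base name always gets the same split.
for path in ["images/cat_001.jpg", "backup/cat_001.jpg", "images/dog_417.jpg"]:
    print(path, split_for(path))
```

Because the assignment depends only on the name, files can be added, removed, or listed in any order without moving existing files between splits, and no index of assignments has to be stored.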

Moreover, using a hash function would mean we can no longer use the field on which the hash was generated (which might be potentially useful for a model) or it might be inserting some unknown bias into the model?

Computing the hash of a field is not the same as computing the hash and then overwriting the original value. The hash would just be computed in some other memory block and used to assign the item to the train/test/validation set, the same way generating a random number does not overwrite any data.
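A small sketch of that point, with a made-up record and a hypothetical `assign_split` helper: the hash is computed from a copy of the field's value, and the record itself is only read.

```python
import hashlib


def assign_split(record, key, test_pct=20):
    """Return 'test' or 'train' from a hash of record[key]; the record is not modified."""
    digest = hashlib.sha256(str(record[key]).encode("utf-8")).hexdigest()
    return "test" if int(digest, 16) % 100 < test_pct else "train"


# Hypothetical record; the field names are illustrative only.
record = {"date": "2021-03-14", "carrier": "AA", "delay_minutes": 12}
before = dict(record)

split = assign_split(record, key="date")

print(split)
print(record == before)  # True: "date" is still there to use as a feature
```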

With respect to introducing bias, I found this question on the cryptography site which attempts to address the statistical properties of SHA-1 mod n.
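For a quick empirical look (not a proof, and not a substitute for the analysis in that question), one can hash many distinct strings and count how they fall into n buckets. The input strings `row-0`, `row-1`, ... and the choice of n = 10 are arbitrary assumptions.

```python
import hashlib
from collections import Counter


def sha1_bucket(value, n):
    """Interpret the 160-bit SHA-1 digest of value as an integer and reduce it mod n."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n


n = 10
counts = Counter(sha1_bucket(f"row-{i}", n) for i in range(100_000))

# With a roughly uniform hash, each bucket should hold close to 10,000 items.
for bucket in range(n):
    print(bucket, counts[bucket])
```

Since the digest has 160 bits, the extra skew from taking it modulo a small n is negligible. Uniform buckets do not rule out bias from the choice of field, though: if the hashed field has only a few distinct values, the split sizes are set by how many rows share each value rather than by the hash.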
