Solved – Why would somebody use a hash function for creating a test/train split instead of a random seed?


I'm going through some ML training material from Google (I can't post a link because I'm getting the material through my company).

In the part about how to extract data and split it into train and test sets, they're using a hash function on one of the data fields to provide a deterministic and repeatable test/train split instead of a random one.

But can't the same thing be accomplished with a random.seed function?

Moreover, using a hash function would mean we can no longer use the field on which the hash was generated (which might be potentially useful for a model) or it might be inserting some unknown bias into the model?

What advantage does using a hash function have over using random seed?

Best Answer

"But can't the same thing be accomplished with a random.seed function? ... What advantage does using a hash function have over using random seed?"

Sampling is less straightforward when you can't fit the entire dataset in memory. In the context of a DBMS, this article suggests that using RAND() with a seed may not give a reproducible split when writing SQL. Because the engine is multithreaded, it does not guarantee the order in which rows are returned (unless you add an ORDER BY clause, which might be expensive), so the same sequence of random numbers can end up attached to different rows on each run. The author of the article gets around this problem by hashing one of the date fields in each row.
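A minimal Python sketch of the difference, not the article's SQL. It assumes rows are dictionaries with a hypothetical date field and an 80/20 split; the helper names assign_split and seeded_split are illustrative only. The hash-based assignment depends only on the field's value, while the seeded assignment depends on the order in which rows arrive.

```python
import hashlib
import random


def assign_split(key: str, test_percent: int = 20) -> str:
    """Map a key to 'train' or 'test' using a stable hash of its value."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    return "test" if bucket < test_percent else "train"


def seeded_split(ordered_rows, seed=42, test_fraction=0.2):
    """Assign splits by drawing from a seeded generator in row order."""
    rng = random.Random(seed)
    return {
        row["date"]: ("test" if rng.random() < test_fraction else "train")
        for row in ordered_rows
    }


# Hypothetical rows; only the 'date' field drives the hash-based split.
rows = [{"date": f"2020-01-{day:02d}", "value": day} for day in range(1, 29)]

# Simulate the database returning the same rows in a different order.
shuffled = rows[:]
random.Random(1).shuffle(shuffled)

hash_split = {row["date"]: assign_split(row["date"]) for row in rows}
hash_split_shuffled = {row["date"]: assign_split(row["date"]) for row in shuffled}

# The hash-based assignment is identical regardless of row order.
print("hash split stable:", hash_split == hash_split_shuffled)

# The seeded assignment is tied to row order, so this comparison can fail.
print("seeded split stable:", seeded_split(rows) == seeded_split(shuffled))
```

Note that the sketch uses hashlib rather than Python's built-in hash(), because hash() on strings is salted per process by default and so is not repeatable across runs. Also note that every row sharing the same date lands in the same split, which may or may not be what you want.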

One other plausible use case would be when dealing with files. If I have a huge directory of images that I want to use for training/testing, it might be easier to work with a hash of the filename rather than trying to maintain a reproducible ordering of the files.
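As a rough sketch of the file case, assuming an 80/10/10 split and hypothetical file names (the function split_for_file is made up for this example), the split can be derived from each file name alone, with no need to list or sort the directory in a fixed order:

```python
import hashlib
from pathlib import PurePath


def split_for_file(path: str, validation_percent: int = 10, test_percent: int = 10) -> str:
    """Pick a split from the file name alone, ignoring the directory."""
    name = PurePath(path).name
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    if bucket < test_percent:
        return "test"
    if bucket < test_percent + validation_percent:
        return "validation"
    return "train"


# Hypothetical file names; in practice these could come from os.scandir().
paths = [f"images/cat_{i:04d}.jpg" for i in range(1000)]

counts = {"train": 0, "validation": 0, "test": 0}
for p in paths:
    counts[split_for_file(p)] += 1

print(counts)
```

One consequence of this design is that adding new images later does not change the split of the files already assigned, since each decision depends only on that file's own name.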

"Moreover, using a hash function would mean we can no longer use the field on which the hash was generated (which might be potentially useful for a model) or it might be inserting some unknown bias into the model?"

Computing the hash of a field is not the same as computing the hash and then overwriting the original value. The hash would just be computed in some other memory block and used to assign the item to the train/test/validation set, the same way generating a random number does not overwrite any data.

With respect to introducing bias, I found this question on the cryptography site which attempts to address the statistical properties of SHA-1 mod n.
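For a quick empirical sanity check (not a proof, and not taken from the linked question), the sketch below buckets SHA-1 digests of some made-up keys modulo n and prints the observed share of each bucket. The key format, n = 10 and the number of keys are all assumptions chosen for illustration. The last lines compute the exact modulo bias that an ideal, uniformly distributed 160-bit hash would have.

```python
import hashlib
from collections import Counter

n = 10              # number of buckets (assumed for illustration)
num_keys = 100_000  # number of synthetic keys to hash

counts = Counter(
    int.from_bytes(hashlib.sha1(f"row-{i}".encode("utf-8")).digest(), "big") % n
    for i in range(num_keys)
)

# Observed share of keys per bucket; each should be close to 1 / n.
for bucket in range(n):
    print(bucket, counts[bucket] / num_keys)

# Modulo bias of an ideal 160-bit hash: each bucket receives either
# floor(2**160 / n) or ceil(2**160 / n) of the possible outputs.
space = 2 ** 160
low, remainder = divmod(space, n)
print("buckets that receive one extra output:", remainder)
print("possible outputs per bucket: at least", low)
```

Under the uniformity assumption, any two bucket probabilities differ by at most 1 / 2**160, which is far below anything a train/test split could detect.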

Related Solutions

Solved – Feature selection + classification in Caret

You should be able to accomplish everything you want with the sbf function instead. I originally assumed it worked the way you describe, but the functionality provided by sbf is apparently closer to a superset of what's available in train.

For example, something like this sounds like what you're getting at:

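library(caret)  # provides sbf(), sbfControl(), trainControl() and caretSBF

# `d` is assumed to be a data frame with an outcome column named `response`
fit <- sbf(
  form = response ~ .,
  data = d, method = "glmnet",
  tuneGrid = expand.grid(.alpha = .01, .lambda = .1),  # one fixed parameter combination
  preProc = c("center", "scale"),
  trControl = trainControl(method = "none"),  # no inner resampling
  sbfControl = sbfControl(functions = caretSBF, method = 'cv', number = 10)  # 10 outer CV folds
)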

This would run 10 outer folds and fit a single glmnet model in each, using only the selected feature subset. You could also give trControl some number of CV folds and pass a parameter grid to tune the model on inner folds.

Solved – Why use a train/test split with linear regression

If you're not trying to generalise to new data, then you don't need a train/test split.

If you are trying to generalise to new data, and if your algorithm has no hyper-parameters (i.e. settings you can tweak), then you don't need to.

If you are trying to generalise to new data, and (as is usual), you have hyper-parameters to tune, then you need to.

For example, if you were using regularised linear regression (a.k.a. "ridge" regression), then you would need to have some way of choosing the regularisation parameter, such that it will be valid when testing on new data, rather than just fitting the "training" data perfectly.
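To make the ridge example concrete, here is a small pure-Python sketch. Everything in it is an assumption for illustration: the synthetic data (y = 2x plus noise), a single feature with no intercept, the 40/20 split, and the candidate values of the regularisation parameter. It fits the closed-form ridge slope on the training part and picks the parameter by error on the held-out part.

```python
import random

rng = random.Random(0)

# Synthetic data (an assumption for illustration): y = 2x + noise.
data = []
for _ in range(60):
    x = rng.uniform(-1, 1)
    data.append((x, 2.0 * x + rng.gauss(0, 0.5)))

train, validation = data[:40], data[40:]


def fit_ridge(points, lam):
    """Closed-form ridge slope for one feature and no intercept."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)


def mse(points, w):
    """Mean squared error of the model y = w * x on the given points."""
    return sum((y - w * x) ** 2 for x, y in points) / len(points)


candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
results = []
for lam in candidates:
    w = fit_ridge(train, lam)
    results.append((mse(validation, w), mse(train, w), lam))

for val_err, train_err, lam in results:
    print(f"lambda={lam:>6}: train MSE={train_err:.3f}, validation MSE={val_err:.3f}")

# Choose the parameter with the lowest held-out error.
best_lambda = min(results)[2]
print("selected lambda:", best_lambda)
```

The training error can only grow as the parameter increases, so judging by training error alone would always select zero regularisation. That is why the choice has to be made on data the fit has not seen.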
