I need a hash function that is easily computable in parallel(think of MapReduce framework), so such hash function must be "additive". I.e. given such function f
, another function g
must exist, having the property g(f(X),f(Y))=f(X||Y)
, where ||
denotes concatenation of strings X
and Y
.
Do such hash functions exist?
Best Answer
If you pick $g$ first, you can take your condition as a recursive definition of $f$. i.e.
If you disallow the null string, the only requirement on $g$ is that it is associative:
$$ g(a, g(b,c)) = g(g(a,b), c) $$
If you allow the null string, then you also require there be an element $z$ such that $g(a,z) = a = g(z, a)$, and you define $f() = z$.
Some simple examples include
$$g(x,y) = x+y$$
and
$$ g((m, x), (n, y)) = (m+n, 2^m x + y) $$
The intent of the second example is for a hash function like
where computing the hash of a concatenation requires knowledge of the length of the string. So I've paired the length with the hash. Or, if you prefer, it's for the hash function
You might be better served by thinking harder about the context of your problem; e.g. if the things you need to take the hash of are reliably distributed across different nodes in a consistent fashion, then you can simply be the hash function to be any function that combines the values of any hash function computed on the individual nodes. Because if you got the same string again, it would be distributed in the same way, and you'd compute the same hash function the next time you tried.