A footnote on collision-resistance

Yes, this sounds ridiculous. But in formal terms, it’s true. Unkeyed public hash functions like SHA512 can’t be truly collision-resistant, at least not according to the standard definitions we use to define the concept.

I know I sound like a crank, but me explain. In (paraphrased) formal terms, collision-resistance is usually defined something like this:

There exists no efficient algorithm that can output a ‘collision’ in the hash function.

Usually we’re more specific about what ‘efficient’ means in this context (polynomial time), but it’s not so important for this discussion. What is important is that this definition is awfully hard to apply to an unkeyed, public hash function like SHA512. Allow me to explain.

Fact #1: we know that there exist collisions in SHA512. We don’t have to find them, their existence is a direct consequence of the pigeonhole principle.

Fact #2: if there are collisions, then we know that something like the following algorithm exists:

/* This algorithm outputs a collision in SHA512 */
#define FIRST “Zip-a-Dee-Doo-Dah299288asdhaskjdhak…” // a few digits omitted
#define SECOND “I hate dill pickles920423034294328…” // due to lack of space
   printf(“Here’s a collision in SHA512:\n%s\n%s\n”, FIRST, SECOND);

Now obviously I’ve left out a few details in the above code listing. What I am saying is that it exists. Guaranteed, somewhere in the universe.
When we say ‘SHA512 is collision resistant’, what we’re actually saying is that no human being knows the algorithm, at least so far as we know. It turns out that expressing such concepts in mathematical notation is tricky. Try codifying this:

Xiaoyun Wang’s lab at Shandong University ran out of Nespresso pods last month and so (pending a restocking) we’re still mostly certain that SHA512 is collision-resistant, hmmkay?   

It’s hard!

This may seem like mountains out of a molehill, but it’s a real issue. In crypto papers we deal with it formally in ways that have nothing to do with the way we use hash functions in real life — typically insisting that hash functions have keys. This kills off the argument above, because any algorithm would have to store a collision for every possible key of the hash function, something that’s just not practical for an efficient algorithm to do.

Let me guess, nobody ever told you that your hash functions need keys to be collision-resistant? That’s right. Because nobody uses them in practice. And we all agree never to talk about it.

None of this is germane to the main discussion of indifferentiability proofs (which get around this in a more ridiculous way), but it goes to show how far divorced our formal analysis is from our practice when it comes to hash functions.