Adler32 vs CRC32 for Asset ID


Hi everybody,
I often see CRC32 used for Asset IDs, but you can also use its competitor, Adler32.
Why is CRC32 preferred over Adler32?
Thanks!


This page goes into detail:

https://en.wikipedia.org/wiki/Adler-32

But, of particular note for usage for an Asset ID is:

"Adler-32 has a weakness for short messages with few hundred bytes, because the checksums for these messages have a poor coverage of the 32 available bits."

If you are generating a value from something like a file name (a string that is not very long), that means you have a higher chance of an Asset ID collision using Adler32 versus CRC32.

Neither algorithm is that great as a hash function. The main difference is that Adler32 was included in the zlib RFC as part of Mark Adler's zlib design and implementation (he is the author of Adler32). He made the Adler32 algorithm as an alternative to something like CRC32 because it was faster at the time. I believe the intention was mostly to assert correctness of the input stream bits, as opposed to generating unique IDs, so the priority was a low performance impact during decoding of PNG rows (DEFLATE decompression).

Personally I think Adler32 is superior since it's simpler. Most CRC functions use big lookup tables and other nonsense, while Adler32 can be copy-pasted straight from the zlib RFC, which is great since you can trust that it works.
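For reference, a minimal, unoptimized Adler-32 in the spirit of RFC 1950 fits in about a dozen lines (the function name and signature here are just for illustration):

```cpp
#include <cstddef>
#include <cstdint>

// Minimal, unoptimized Adler-32 per the RFC 1950 definition.
// 65521 is the largest prime smaller than 2^16.
uint32_t adler32(const uint8_t* data, size_t len) {
    const uint32_t MOD_ADLER = 65521;
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; ++i) {
        a = (a + data[i]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    return (b << 16) | a;
}
```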

However, I don't think Adler32 is very useful nowadays, since the main utility of these kinds of functions is to verify the bits of a stream, and for networking protocols, decryption routines already do exactly this by verifying the encrypted bits plus Associated Data (the AD in AEAD) against the associated MAC bytes. I suppose Adler32 can still be useful to verify contents coming from disk, but that utility is really drowned out by decryption.

From an error-detection/data validation standpoint, I agree with everything Randy posted above.  But for the purposes of a hashing algorithm, CRC32 seems better suited.

As always, YMMV, but I have found CRC32 to be a fairly adequate hashing solution.  Not perfect by any means, but I (personally and on various big studio game projects) have used it for generating string IDs as well as Asset IDs, and have found it satisfactory.  In these contexts, the primary concern is uniqueness/hash name collision, and occurrences of this have been pretty rare in my experience.  However, they do happen, so you need a mechanism for detecting them, and they need to be addressed (when it matters*), especially when resolving asset lookups.

On my current project, using CRC32 as the hashing function for both string IDs and asset IDs, I've had around 10 cases of hash collision spanning the ~78k asset files as well as generated string IDs.  Due to the way I bucket my assets (by type), the particular asset ID name collisions I do have don't matter (no collisions within the same asset type).  Likewise, the string ID collisions are also not an issue, as they don't happen within the subset of values I am comparing against in the various contexts.

These days, for the purposes of a string/ID/asset hash, a 64-bit hash seems to be getting some traction.

Fwiw, for asset IDs, I use the FNV-1a hash of the filename.
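A minimal 64-bit FNV-1a over a filename string looks something like this (the constants are the standard FNV-1a offset basis and prime; the function name is just illustrative):

```cpp
#include <cstdint>

// Minimal 64-bit FNV-1a over a NUL-terminated filename string.
uint64_t fnv1a_64(const char* s) {
    uint64_t hash = 14695981039346656037ULL; // FNV offset basis
    while (*s) {
        hash ^= static_cast<uint8_t>(*s++);
        hash *= 1099511628211ULL;             // FNV prime
    }
    return hash;
}
```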

There are plenty of other 32-bit hash functions too, like MurmurHash, xxHash, etc.

Any 32-bit hash is going to have a high chance of collision, and should therefore be avoided.  With as few as 9300 assets, there is already a 1% probability of a collision, assuming an ideal hash function.

12 hours ago, a light breeze said:

Any 32-bit hash is going to have a high chance of collision, and should therefore be avoided.  With as few as 9300 assets, there is already a 1% probability of a collision, assuming an ideal hash function.

This isn't a big issue as long as you check for that situation. In your asset compiler, keep a dictionary of all the asset names and hashes. If a collision is detected (unlikely, per the numbers above), then change the project-wide hash seed/salt that's used for hashing asset IDs and repeat.
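A hedged sketch of that build-time check, assuming a seeded hash such as a seeded 32-bit FNV-1a (the names hash32 and findCollisionFreeSeed are hypothetical):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Seeded 32-bit FNV-1a, standing in for whatever hash the project actually uses.
uint32_t hash32(const std::string& name, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;
    for (unsigned char c : name) { h ^= c; h *= 16777619u; }
    return h;
}

// Asset-compiler step: find a project-wide seed under which no two asset names collide.
uint32_t findCollisionFreeSeed(const std::vector<std::string>& assetNames) {
    for (uint32_t seed = 0; ; ++seed) {
        std::unordered_map<uint32_t, std::string> seen;
        bool collision = false;
        for (const std::string& name : assetNames) {
            auto [it, inserted] = seen.emplace(hash32(name, seed), name);
            if (!inserted && it->second != name) { collision = true; break; }
        }
        if (!collision) return seed; // bake this seed into the build
    }
}
```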

We are also using FNV in our application to generate IDs for several classes, assets, built-in components, etc., and have never had any hash collision issues.

If you're hashing large quantities of data and need good performance, xxHash is a good option. For small quantities of data, something simple like FNV should be good enough.

Note that thanks to the birthday paradox, once you've generated about 77,000 32-bit hashes (with a well-behaved hash function), there is about a 50% chance that at least two of them collide! This means you want 64-bit hashes if you think you might exceed a few thousand items and you can't handle collisions easily.
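As a quick sanity check on those numbers, the usual birthday approximation (the chance of at least one collision among n b-bit hashes is roughly 1 - exp(-n^2 / 2^(b+1))) reproduces both figures quoted in this thread; a throwaway snippet, assuming nothing beyond the standard library:

```cpp
#include <cmath>
#include <cstdio>

// Birthday-bound approximation: chance of at least one collision
// among n uniformly distributed b-bit hash values.
double collisionProbability(double n, double bits) {
    return 1.0 - std::exp(-(n * n) / (2.0 * std::pow(2.0, bits)));
}

int main() {
    std::printf("%.4f\n", collisionProbability(9300.0, 32));   // ~0.01, i.e. 1%
    std::printf("%.4f\n", collisionProbability(77000.0, 32));  // ~0.50, i.e. 50%
    std::printf("%.2e\n", collisionProbability(77000.0, 64));  // negligible with 64 bits
    return 0;
}
```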

There's some more discussion on various hash algorithms at https://aras-p.info/blog/2016/08/02/Hash-Functions-all-the-way-down/

Using hashes for asset IDs is conceptually a bit weird. Most hash functions don't go out of their way to prevent collisions, and most algorithms that use hashes are designed to be able to resolve collisions (typically by comparing the actual unhashed data).

You may be better off using a hash function that actually increases the odds of collisions, so that you test the collision recovery procedure on a regular basis. Otherwise you risk ending up in the position of having to implement recovery after you discover that things have gone horribly wrong... and that's always good fun.
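One hypothetical way to exercise that recovery path is to throw away most of the hash bits in a test build so collisions are guaranteed; std::hash below is only a stand-in for whatever asset hash the project actually uses:

```cpp
#include <cstdint>
#include <functional>
#include <string>

// In a stress-test build, keep only 8 bits of the ID so collisions occur constantly
// and the collision-recovery code path gets exercised regularly.
uint32_t assetId(const std::string& name) {
    uint32_t id = static_cast<uint32_t>(std::hash<std::string>{}(name));
#ifdef ASSET_ID_STRESS_COLLISIONS
    id &= 0xFFu; // only 256 possible IDs
#endif
    return id;
}
```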

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

