Hash functions for a table index

Started by
29 comments, last by WaterCrane 15 years, 4 months ago
I'm still developing my archive format. For those who remember my "Initialisation Vectors" topic, I made a mistake in saying it was general-purpose; it is actually designed to be fast to read at the cost of being slow to write, suiting it for archival purposes or to store game data that seldom changes. I might be being a little too ambitious, but I am hoping to make it as an open alternative to Blizzard's Mo'PaQ (MPQ) format. Anyway, back-story aside, I am using hashes as a means to locate a file's index within the archive's file table. Initially, I used SHA-1 to calculate the hashes in order to minimise collisions to negligible levels, but I realised that this algorithm is far too slow for indexing purposes (it's potentially faster to just do an exhaustive search). I have kept the option of using SHA-1 as a more secure version of the archive (filenames do not have to be stored), but I am seeking a hashing algorithm that is much faster for cases where filename concealment is not a major concern. Due to the design of the file table, optimal hash sizes are 32-bit, 64-bit, 160-bit or 192-bit; personally, I consider 32-bit too collision-prone. At present, I have my sights set on a Cyclic Redundancy Check algorithm, CRC-64-ECMA-182. It isn't designed for indexing, I know, so I was wondering if anyone more enlightened on the subject of hashing could give me some guidance.
Advertisement
I use 32bit FNV-1a personally (there's also 64/128bit versions if you do get too many collisions) because it hasn't given me any collisions yet and when I tested it on a dictionary, it had almost perfect distribution over the 32-bit range of numbers.
If you are worried about collisions, just do what a normal hash table would do and store files which map to the same hash value in a linked list (or something) and do a simple linear search on those. The vast majority of filenames will hash to different values, but you'll still be covered for the rare case of a hash collision.
In my experience, CRCs work great as hash functions. I think it's also what Donald Knuth advices.

Thank you for the responses everybody. The reason why I have to be careful with collisions is that the filename itself is not stored in the "File Hash Table" (FHT for short, as it is so named), so if two filenames collide to give the exact same hash, it is impossible to distinguish between them.

The archive format uses the leftmost 32 bits of the hash (on a little-endian system) modulo the size of the table to compute the index; for small tables there are bound to be clashes, but these can be quickly noticed because the computed hash will not match the one stored at the derived index, even if the first 32 bits are identical. An open addressing system is employed in the case of a clash, so the details are simply stored in the next available slot.
Here are some good hash sites, people have done comparisions with the crc32 algorithim.

http://burtleburtle.net/bob/hash/doobs.html
http://www.azillionmonkeys.com/qed/hash.html

Good Luck!

-ddn
Quote:Original post by WaterCrane
Thank you for the responses everybody. The reason why I have to be careful with collisions is that the filename itself is not stored in the "File Hash Table" (FHT for short, as it is so named), so if two filenames collide to give the exact same hash, it is impossible to distinguish between them.


It's impossible to create a hash function that can hash arbitraryily-long strings into a fixed bit pattern without collision. It's called the pigeon-hole principle.

So no matter what hashing algorithm you choose, there will always be the possibilty of a collisions. The question is, what do you do about it? Is the any particular reason you don't want to store the file name in the archive? Maybe a simple obfuscation routine can hide the names from prying eyes?
Personally, I'd use a cryptographic hash created using XTEA with the Davies-Meyer algorithm for converting a block cipher into a secure hash. Using XTea means it should be fast and the algorithm is very simple. Using XTea would give you a 64-bit hash, but if you wanted only 32 bits, you could just xor the two values together.
with XTea(Key, Data),
Hash' = Hash xor XTea(Text, Hash);

Make sure you use the Merkle-Damgard construction to finalize the hash value after iterating the above algorithm.
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
Quote:Original post by Codeka
Quote:Original post by WaterCrane
Thank you for the responses everybody. The reason why I have to be careful with collisions is that the filename itself is not stored in the "File Hash Table" (FHT for short, as it is so named), so if two filenames collide to give the exact same hash, it is impossible to distinguish between them.


It's impossible to create a hash function that can hash arbitraryily-long strings into a fixed bit pattern without collision. It's called the pigeon-hole principle.

So no matter what hashing algorithm you choose, there will always be the possibilty of a collisions. The question is, what do you do about it? Is the any particular reason you don't want to store the file name in the archive? Maybe a simple obfuscation routine can hide the names from prying eyes?


The choice of hidden filenames is just an option. Since the name is hashed you don't actually need to store the filename in the archive if you already know its name - it is more of a space-saving feature than anything else, but security is also an option.

Just to clarify... the full hash is stored in the FHT, but only the first 32 bits of it are actually used to calculate the index, hence the archive can store up to 2^32 individual files, although I admit it is unlikely that even large archives will store more than maybe a few thousand files; nevertheless, it is the reason why I cannot use a 32-bit hash, because a full collision will cause undesired results. With a 64-bit or larger hash, the risk of collision is reduced to far more negligible levels, and the use of open addressing will mean that the archive can still work even if the first 32 bits of two hashes collide (but not the rest of the hash).

I like the suggestions that are coming in; thanks guys. I'm considering running some benchmarks on a number of hash algorithms. Cryptographic security is not an issue in this case, just good speed and distribution across the range.
Quote:Original post by WaterCrane
[...]I like the suggestions that are coming in; thanks guys. I'm considering running some benchmarks on a number of hash algorithms. Cryptographic security is not an issue in this case, just good speed and distribution across the range.
I didn't mention it because of security concerns, but simply because using that algorithm is very simple to understand and implement, it should be quick with XTEA, and because it is supposed to be secure, it should have all the properties you want in a hash function.

Another possibility if you're working with file names that can be taken from a limited character set (say lowercase a-z, 0-9, and a single delimiter such as ' ' or '-') then you can just store the filename directly in a few bits. With the example character set, there are 26+10+1=37 characters, and log37(2^64) = 12.285, which means you could store a 12-character file name in 64 bits and still have 1 bit left for a flag of some kind. The way to store the file names in this manner is a 12-digit base-37 number. For example 'filename.ext' would be f*37^11 + i*37^10 + l*37^9 + e*37^8 + ... x*37^1 + t*37^0 where f would be 5 ('f'-'a') and i would be 8 ('i'-'a') and so on. You can actually use up to a 40-symbol alphabet and still fit 12 characters in 64 bits (or 6 characters in 32 bits).

A final possibility, if your archives never need to be modified (and are entirely fixed once they're created from a set of files that is entirely known at creation time) is using a perfect hash generator that will create a unique hash algorithm for specifically the names used in the archive. You'd need to figure out a way to store the hash algorithm in the archive itself (obviously not in a file in the archive - it'd have to be a header etc), but it would ensure that no collisions happen.
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk

This topic is closed to new replies.

Advertisement