Hash functions for a table index

This topic is 3293 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I'm still developing my archive format. For those who remember my "Initialisation Vectors" topic, I made a mistake in saying it was general-purpose; it is actually designed to be fast to read at the cost of being slow to write, suiting it for archival purposes or for storing game data that seldom changes. I might be being a little too ambitious, but I am hoping to make it an open alternative to Blizzard's Mo'PaQ (MPQ) format.

Anyway, back-story aside, I am using hashes as a means to locate a file's index within the archive's file table. Initially, I used SHA-1 to calculate the hashes in order to reduce collisions to negligible levels, but I realised that this algorithm is far too slow for indexing purposes (it's potentially faster to just do an exhaustive search). I have kept the option of using SHA-1 for a more secure version of the archive (filenames do not have to be stored), but I am seeking a hashing algorithm that is much faster for cases where filename concealment is not a major concern.

Due to the design of the file table, the optimal hash sizes are 32-bit, 64-bit, 160-bit or 192-bit; personally, I consider 32-bit too collision-prone. At present, I have my sights set on a Cyclic Redundancy Check algorithm, CRC-64/ECMA-182. It isn't designed for indexing, I know, so I was wondering if anyone more enlightened on the subject of hashing could give me some guidance.

If you are worried about collisions, just do what a normal hash table would do and store files which map to the same hash value in a linked list (or something) and do a simple linear search on those. The vast majority of filenames will hash to different values, but you'll still be covered for the rare case of a hash collision.
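As a sketch of that approach (in Python, with CRC32 standing in for whatever hash the format settles on), a chained table keeps every (name, payload) pair that maps to the same bucket and resolves clashes with a short linear scan:

```python
import zlib

class ChainedTable:
    """Hash table using separate chaining, as described above."""
    def __init__(self, size):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, name):
        # CRC32 stands in for the archive's real hash (an assumption)
        return self.buckets[zlib.crc32(name.encode()) % len(self.buckets)]

    def insert(self, name, payload):
        self._bucket(name).append((name, payload))

    def find(self, name):
        # Linear search within the bucket covers the rare collision case
        for key, payload in self._bucket(name):
            if key == name:
                return payload
        return None
```

Note this requires storing the full name alongside each entry, which is exactly what the thread goes on to debate.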

Thank you for the responses everybody. The reason why I have to be careful with collisions is that the filename itself is not stored in the "File Hash Table" (FHT for short, as it is so named), so if two filenames collide to give the exact same hash, it is impossible to distinguish between them.

The archive format uses the leftmost 32 bits of the hash (on a little-endian system) modulo the size of the table to compute the index; for small tables there are bound to be clashes, but these can be quickly noticed because the computed hash will not match the one stored at the derived index, even if the first 32 bits are identical. An open addressing system is employed in the case of a clash, so the details are simply stored in the next available slot.
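A minimal sketch of that index derivation (Python; SHA-1 stands in for whichever hash the archive uses, and reading the first four digest bytes as a little-endian integer is an assumption about what "leftmost 32 bits on a little-endian system" means):

```python
import hashlib

def table_index(name: str, table_size: int) -> int:
    """Derive a table index from the first 32 bits of the filename hash."""
    digest = hashlib.sha1(name.encode()).digest()
    # First four digest bytes, interpreted little-endian (an assumption)
    first32 = int.from_bytes(digest[:4], "little")
    return first32 % table_size
```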

Here are some good hash sites; people have done comparisons with the CRC32 algorithm.

http://burtleburtle.net/bob/hash/doobs.html
http://www.azillionmonkeys.com/qed/hash.html

Good Luck!

-ddn

Quote:
Original post by WaterCrane
Thank you for the responses everybody. The reason why I have to be careful with collisions is that the filename itself is not stored in the "File Hash Table" (FHT for short, as it is so named), so if two filenames collide to give the exact same hash, it is impossible to distinguish between them.


It's impossible to create a hash function that can hash arbitrarily long strings into a fixed bit pattern without collisions. It's called the pigeonhole principle.

So no matter what hashing algorithm you choose, there will always be the possibility of a collision. The question is, what do you do about it? Is there any particular reason you don't want to store the file name in the archive? Maybe a simple obfuscation routine could hide the names from prying eyes?

Personally, I'd use a cryptographic hash built from XTEA with the Davies-Meyer construction for converting a block cipher into a secure hash. Using XTEA means it should be fast, and the algorithm is very simple. XTEA would give you a 64-bit hash, but if you wanted only 32 bits, you could just XOR the two halves together.

With XTEA(Key, Data), each iteration computes:
Hash' = Hash xor XTEA(Text, Hash);

Make sure you apply Merkle-Damgård strengthening (appending the message length) to finalise the hash value after iterating the above algorithm.
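A sketch of that construction in Python (the IV, zero-padding and little-endian packing are all assumptions a real format would have to pin down): each 16-byte chunk of the filename becomes the XTEA key, the running 64-bit state is the plaintext, and the ciphertext is XORed back into the state (Davies-Meyer), with the message length appended first (Merkle-Damgård strengthening):

```python
M32 = 0xFFFFFFFF

def xtea_encrypt(v0, v1, key, rounds=32):
    """Standard XTEA block encryption: 64-bit block, 128-bit key."""
    s, delta = 0, 0x9E3779B9
    for _ in range(rounds):
        v0 = (v0 + (((((v1 << 4) ^ (v1 >> 5)) + v1) & M32) ^ ((s + key[s & 3]) & M32))) & M32
        s = (s + delta) & M32
        v1 = (v1 + (((((v0 << 4) ^ (v0 >> 5)) + v0) & M32) ^ ((s + key[(s >> 11) & 3]) & M32))) & M32
    return v0, v1

def dm_xtea_hash(name: bytes) -> int:
    """64-bit hash: Davies-Meyer over XTEA, message blocks used as keys."""
    # Merkle-Damgard strengthening: zero-pad, then append the length
    msg = name + b"\x00" * (-len(name) % 16)
    msg += len(name).to_bytes(8, "little") + b"\x00" * 8
    h0, h1 = 0x67452301, 0xEFCDAB89   # arbitrary IV (an assumption)
    for i in range(0, len(msg), 16):
        key = [int.from_bytes(msg[i + 4*j : i + 4*j + 4], "little") for j in range(4)]
        e0, e1 = xtea_encrypt(h0, h1, key)
        h0, h1 = h0 ^ e0, h1 ^ e1     # Davies-Meyer feed-forward
    return (h1 << 32) | h0
```

The feed-forward step is what makes the compression function one-way: recovering the previous state would require inverting the cipher without knowing its output.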

Quote:
Original post by Codeka
Quote:
Original post by WaterCrane
Thank you for the responses everybody. The reason why I have to be careful with collisions is that the filename itself is not stored in the "File Hash Table" (FHT for short, as it is so named), so if two filenames collide to give the exact same hash, it is impossible to distinguish between them.


It's impossible to create a hash function that can hash arbitrarily long strings into a fixed bit pattern without collisions. It's called the pigeonhole principle.

So no matter what hashing algorithm you choose, there will always be the possibility of a collision. The question is, what do you do about it? Is there any particular reason you don't want to store the file name in the archive? Maybe a simple obfuscation routine could hide the names from prying eyes?


The choice of hidden filenames is just an option. Since the name is hashed, you don't actually need to store the filename in the archive if you already know it; this is more of a space-saving feature than anything else, but security is also an option.

Just to clarify: the full hash is stored in the FHT, but only the first 32 bits of it are used to calculate the index, hence the archive can store up to 2^32 individual files (although I admit it is unlikely that even large archives will store more than a few thousand). Nevertheless, this is the reason why I cannot use a 32-bit hash: a full collision would cause undesired results. With a 64-bit or larger hash, the risk of collision is reduced to far more negligible levels, and open addressing means the archive still works even if the first 32 bits of two hashes collide (but not the rest of the hash).

I like the suggestions that are coming in; thanks guys. I'm considering running some benchmarks on a number of hash algorithms. Cryptographic security is not an issue in this case, just good speed and distribution across the range.

Quote:
Original post by WaterCrane
[...]I like the suggestions that are coming in; thanks guys. I'm considering running some benchmarks on a number of hash algorithms. Cryptographic security is not an issue in this case, just good speed and distribution across the range.
I didn't mention it because of security concerns, but simply because that algorithm is very simple to understand and implement; it should be quick with XTEA, and because it is designed to be secure, it should have all the properties you want in a hash function.

Another possibility, if you're working with file names drawn from a limited character set (say lowercase a-z, 0-9, and a single delimiter such as ' ' or '-'), is to store the filename directly in a few bits. With the example character set there are 26+10+1 = 37 characters, and log_37(2^64) = 12.285, which means you could store a 12-character file name in 64 bits and still have 1 bit left over for a flag of some kind. The way to store file names in this manner is as a 12-digit base-37 number. For example, 'filename-ext' would be f*37^11 + i*37^10 + l*37^9 + e*37^8 + ... + x*37^1 + t*37^0, where f is 5 ('f'-'a'), i is 8 ('i'-'a'), and so on. You can actually use up to a 40-symbol alphabet and still fit 12 characters in 64 bits (or 6 characters in 32 bits).
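A sketch of that packing scheme in Python (the 37-symbol alphabet with '-' as the delimiter is one possible choice): the name is treated as a 12-digit base-37 number, which fits in 63 bits because 37^12 < 2^63:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"   # 37 symbols (assumed set)

def pack_name(name: str) -> int:
    """Pack a name of up to 12 symbols into 63 bits via Horner's rule."""
    assert len(name) <= 12 and all(c in ALPHABET for c in name)
    value = 0
    for ch in name:
        value = value * 37 + ALPHABET.index(ch)
    return value

def unpack_name(value: int, length: int) -> str:
    """Invert pack_name, given the original name length."""
    out = []
    for _ in range(length):
        value, digit = divmod(value, 37)
        out.append(ALPHABET[digit])
    return "".join(reversed(out))
```

The length must be carried separately (or the name padded to 12 symbols), since leading 'a' digits are zeros in base 37.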

A final possibility, if your archives never need to be modified (and are entirely fixed once they're created from a set of files that is entirely known at creation time), is to use a perfect hash generator that creates a hash algorithm specifically for the names used in the archive. You'd need to figure out a way to store the hash algorithm in the archive itself (obviously not in a file in the archive; it'd have to be in a header or similar), but it would ensure that no collisions happen.

Hashes are not isomorphic to the data they represent (unlike an encryption cipher). This is always the case (irrespective of the size of the hash) unless the set of discrete values to be hashed is known ahead of time and the hash function itself can be modified to yield a unique hash value for the input set. Additionally, it's often the case that a client may still generate spurious input outside the set which produces collisions, meaning that an explicit test of the underlying data is still required in order to exclude false-positive matches.

If you require encryption, then encrypt first, using whatever (proven) cipher system you want, then hash and store. If you need compression, then compress first, then hash and store.

I would seriously reconsider a more conventional hash table approach, using chaining or similar to handle collisions. The approach of storing only a hash (as you suggest for filenames) and discarding the underlying data is closely related to a Bloom filter.

Edit: my rotten spelling

Encoding the filename itself into the hash is a good idea, but not suitable in this case, because the filenames also contain directory paths and hence can be arbitrarily long. I was under the impression that storing data in the next free slot (open addressing) was a standard method of handling collisions in a hash table. That said, I do like the idea of a perfect hash function, because the archives are designed to be read-only after they are compiled; the main uses I can foresee are storing game data (which would only be updated through patches in most cases) and data archiving / backup, which should never change on the fly.

To make a perfect hash function, would appending a preset salt to the filenames do the trick? That is, randomising and testing salt values during compilation until one works (or giving up after a set number of salts fail to yield a perfect hash, otherwise we risk an infinite loop or at least a very long operation)?
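That trial-and-error search can be sketched like this (Python; CRC32 stands in for the archive's real hash, and the 4-character hex salt format is an arbitrary choice):

```python
import zlib

def find_perfect_salt(names, table_size, max_tries=1 << 16):
    """Search for a salt making hash(name + salt) % table_size
    collision-free over the given names; None if the budget runs out."""
    for trial in range(max_tries):
        salt = format(trial, "04x")   # 4-character salt (assumption)
        indices = {zlib.crc32((n + salt).encode()) % table_size
                   for n in names}
        if len(indices) == len(names):
            return salt               # perfect for this exact name set
    return None                       # give up rather than loop forever
```

Capping the number of trials is the safeguard against the "very long operation" worry: the caller can fall back to open addressing when None is returned.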

The archive format does support solid compression and encryption, and if they are enabled, everything bar the archive header is affected, including the File Hash Table, meaning the contents cannot be derived without the correct decryption key. Encryption is the last operation performed when writing (and decryption the first operation when reading), because trying to compress a stream of ciphertext is not going to yield good results!

Maybe I'm missing something, but how does this work for the user?

std::fstream character_data(????)


How do I, in code, specify which file I want? Or how do files cross-reference each other (mesh->texture)? And since hashes are used as identities, how do I know that the file I opened is really the file I wanted and not one returned because of a collision?

You specify which file you would like to open, and the API (or one you program yourself) takes the hash of the filename and then calculates the index from the first 32 bits of that hash. With the index to hand, it jumps to that point in the File Hash Table, which contains details on where to find the file data, metadata, and a full copy of the filename hash, along with a few flags. If the hash stored there does not match the one calculated, it is a collision; in that case it increments the index (modulo the size of the table), checking successive entries until it either finds a hash that matches the calculated one (in which case it has found the file), finds an empty slot (the file does not exist), or wraps around to the index it started at (the file does not exist).
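That lookup procedure can be sketched in Python (entries are (full_hash, payload) tuples, None marks an empty slot, and deriving the index from the high 32 bits is an assumption about "first"):

```python
def fht_insert(table, full_hash, payload):
    """Place an entry using linear probing (open addressing).
    Assumes at least one empty slot remains."""
    n = len(table)
    i = (full_hash >> 32) % n
    while table[i] is not None:
        i = (i + 1) % n
    table[i] = (full_hash, payload)

def fht_lookup(table, full_hash):
    n = len(table)
    start = i = (full_hash >> 32) % n   # index from the first 32 bits
    while True:
        entry = table[i]
        if entry is None:
            return None                 # empty slot: file does not exist
        if entry[0] == full_hash:
            return entry[1]             # full hash matches: found it
        i = (i + 1) % n                 # 32-bit clash: probe the next slot
        if i == start:
            return None                 # wrapped around: file does not exist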

Once I freeze the format and get an API working, I'll release a specification of the format under the GNU Free Documentation License.

Quote:
Original post by WaterCrane
You specify which file you would like to open, and the API (or one you program yourself) takes the hash of the filename and then calculates the index from the first 32 bits of that hash. With the index to hand, it jumps to that point inside the File Hash Table, which contains details on where to find the file data, metadata, and a full copy of the filename hash, along with a few flags. If the hash stored there does not match the one calculated, then it is a collision; if that occurs, it increments the index (modulo the size of the table), checking the next entries until it either finds a hash that matches the calculated one (in which case it has found the file)


I don't get it. If you have a collision, how do you know which is the correct file? Somewhere you have to save the file name in an un-hashed form. A hash table requires the ability to test for equality of elements in addition to the ability to generate a hash for each element (unless the hash is perfect, but that depends on the file names you happen to have in this particular archive and it means you have to store the hash function in the archive in some form).

Also, surely it's more important to "hide" the contents of the file rather than the file's name?

If I were you, I'd take all this hashing stuff out and let the user add a layer of encryption on top if they want it.

Ah, I see... in the case of a full collision, it does break the archive, which is why I prefer at least 64 bits to be safe. For a 32-bit hash, there is about a 50% chance of a full collision at 81,920 entries; for a 64-bit hash, the 50% mark is at 5,368,709,120 entries. Being greater than 2^32, the maximum number of entries allowed in the File Hash Table, I consider this relatively safe. (Following the principles of the birthday attack.)
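The standard birthday-paradox approximation puts the 50% point at n ≈ sqrt(2 ln 2 · 2^bits), which gives roughly 77,000 entries for 32 bits and 5.1 billion for 64 bits, in the same ballpark as the figures above:

```python
import math

def fifty_percent_entries(bits: int) -> float:
    """Entries at which the chance of a full collision among
    uniformly random hashes reaches about 50%."""
    return math.sqrt(2 * math.log(2)) * math.sqrt(2.0 ** bits)
```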

Individual files can be encrypted if you so desire in order to disguise their contents, although I imagine "solid encryption" will be employed more often, which encrypts the File Hash Table as well.

Hope this clears up some of my ambiguities.

Yeah. I've seen a few archive formats in games where they don't store anything but the hash. Some work around the collision problem by storing the shortest unambiguous path string for any files that collide, thus ensuring there is something to test against, but still allowing 99% of the entries to have no strings in the file.

You could also fail to pack on collision.

I've seen formats where you hash on both ends, and to get around the chance of collisions there is only one archive per resource type (levels, textures, sounds). So in code I would call archives.find(0x0000F00D).Open(0xF00DCA5E) to get my files out of the archive, the hash just being a standard CRC32 of the relative paths in all cases. This CAN be cumbersome, but the benefits are worth it. It is crazy fast to find things, and the memory footprint of both the .exe's data segment and the resource's file table is greatly reduced compared to a full string-storage version. (And that is a few hundred KB of strings that could be better spent on text the player will actually see; since portable devices aren't the awesome 4+ GB memory pool a PC is, this can mean a lot.)

A little update.

I programmed CRC-64 and FNV-1a (64-bit) algorithms in Turbo Delphi and ran some benchmarks; CRC-64 was twice as fast as FNV-1a, but that was mostly because Delphi lacks an unsigned 64-bit integer type and I had to use a workaround in the latter algorithm. Therefore, I did the most insane thing possible... I reprogrammed them both in assembly language, having never used assembly language on an Intel processor before! (Of course, I do know enough CPU theory to have a head start.)

My assembler code for FNV-1a was about 20% faster than the Pascal version, but the CRC-64 assembler was slower than the Pascal, which meant I had a lot to learn if I was to beat the Delphi compiler at optimisation! However, with a little more research, experimentation, and advice from other programmers, the CRC-64 assembly is now faster than the Pascal, and the FNV-1a faster still.

My benchmark tests aren't concrete, but I did my best to compensate for the thousands of CPU cycles stolen by the Windows API timing routines, and by other processes, by only checking the clocks every 100 runs and averaging over 10 seconds for each algorithm in turn. I only hashed the same 10-character string over and over again, though, since retrieving strings from a list box was skewing the statistics too much. For comparison, I tested SHA-1, SHA-256 and MD5 too (in Pascal). The figures are, approximately, the number of times the hash was calculated in one second.

CRC-64 (Pascal) - 7,767,560 Hz
CRC-64 (Assembler) - 8,629,410 Hz

64-bit FNV-1a (Pascal) - 3,811,310 Hz
64-bit FNV-1a (Assembler) - 9,104,370 Hz

SHA-1 - 619,540 Hz
SHA-256 - 620,560 Hz
MD5 - 620,100 Hz

I haven't implemented XTEA yet, but to be honest I cannot see it being as fast as CRC and FNV.

Long story short, I am leaning towards FNV-1a as the hash function used for table lookups; I have yet to test it for collision resistance, but I haven't found anything that shows it has bad distribution. Thank you Hodgman for introducing me to it.

ADDENDUM: My assembly language code is very likely to be less than optimal, so there's a chance that CRC-64 could actually be faster; the "scores" are very close after all!
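For reference, here is a minimal 64-bit FNV-1a in Python, checked against two published test vectors (the empty string hashes to the offset basis by definition); the Delphi/assembly versions above should agree with these values:

```python
FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3
MASK64 = 0xFFFFFFFFFFFFFFFF

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR in each byte, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h
```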

Quote:
Original post by WaterCrane
[...]To make a perfect hash function, would appending a preset salt to the filenames do the trick, randomising and testing salt values until one works during compilation (or just giving up after a set number of salts fail to yield a perfect hash, otherwise we risk an infinite loop or otherwise a very long operation)?[...]
That works as long as you don't have any total collisions. As an example of perfect hashing, check out Perfect Hashing. There is also NIST on minimal perfect hashing, which has some links to implementations.
Quote:
Original post by WaterCrane
[...]and ran some benchmarks[...]
I'm quite curious to see your benchmark code.

[Edited by - Extrarius on November 17, 2008 12:39:28 PM]

The test code for each algorithm is the following (I'm not sure how to tabulate code on this forum to make it more readable - sorry!):


Count := 0;
QueryPerformanceCounter(Start);

repeat
  for Index := 1 to 100 do
    gjuCRC64(Input, InputLen, @Hash);
  Inc(Count, 100);
  QueryPerformanceCounter(Current);
until Current >= Start + Freq;

lblCRC64A.Caption := IntToStr(Count);


... replacing "gjuCRC64" and the label object "lblCRC64A" for each algorithm. Freq is the result returned by QueryPerformanceFrequency multiplied by the number of seconds to run the test for. Several of these tests run one after the other. As you can see, it is very quick-and-dirty and not exactly scientifically sound (for instance, I should really subtract 50 from the final value of Count).

I have, however, run into another possible problem with perfect hashes that I overlooked. If my statistics serve me correctly... take N as the number of entries in the table, and assume the hash function result, modulo the number of entries, has an equal probability 1/N of landing anywhere in the range 0...N-1. For N runs of the hash function, there is a 1/(N^N) chance of the hash returning a particular sequence of indices between 0 and N-1. For a sequence with no duplicate values (no collisions), there are N! such sequences. Therefore, the probability of an unbiased hash being perfect for an N-sized table is N!/(N^N)...

N = 1, P = 1 ... N = 2, P = 0.5 ... N = 3, P = 0.22222 ... N = 4, P = 0.09375 ... N = 5, P = 0.0384 ... N = 6, P = 0.01543 ... N = 7, P = 0.00612

Each time N increases by 1, the probability shrinks by a factor that starts at one half and approaches 1/e (about 0.37), so it becomes prohibitively small for large N.
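The figures above follow directly from the formula; a two-line check (Python):

```python
from math import factorial

def perfect_prob(n: int) -> float:
    """Probability an unbiased hash is perfect for an n-entry table."""
    return factorial(n) / n ** n
```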

What this means is that a collision in the File Hash Table (FHT) is quite likely even with a salt (if the entire hash collides, an error is returned). Still, it is not a major problem, because it can be remedied by making the FHT larger (leaving empty spaces in the table once all of the files have been entered), or by simply accepting that a collision may be inevitable. Nevertheless, I thought it was something I should share.


ADDENDUM: I'm checking my code for errors, but I suspect that the result of 64-bit FNV-1a favours even numbers over odd numbers or vice versa. For an odd table size (i.e. Hash mod OddNumber), I can generally find a salt fairly fast, whereas for an even size I often fail to find a suitable salt even after trying 2^24 (~16 million) different values... either that or it could be a freak coincidence!

ADDENDUM DUO: Just to check that my implementation of FNV-1a is not faulty, can anyone supply some known test vectors for this hash?

ADDENDUM TRIO: Update. It is starting to look like a weakness in FNV-1a... because the same salt is appended to the end of each filename (the salt, if present, currently ranges from 1 to 4 characters between ASCII codes 32 and 126), it seems to cause the hash to favour odd numbers or even numbers. When I increment the salt little-endian (for a 4-character salt, the 4th character from the end of the concatenated string), the hash result predictably alternates between odd and even in some cases (not all the time though... it depends on the filenames I supply). Can anyone else confirm this?

[Edited by - WaterCrane on November 25, 2008 12:59:04 AM]

If you want to ensure an unbiased hash, then go for Zobrist hashing. The Zobrist technique can also produce a hash of any arbitrary bit length. It is quite versatile, although it has a memory overhead linear in the length of the strings being hashed.
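A sketch of Zobrist hashing for strings (Python; the fixed seed and maximum length are assumptions): the table of random values, one 64-bit entry per (position, byte) pair, is the linear memory overhead mentioned above.

```python
import random

MAX_LEN = 256                      # longest supported name (assumption)
_rng = random.Random(0xC0FFEE)     # fixed seed so every reader of the
                                   # archive derives the same table
ZTABLE = [[_rng.getrandbits(64) for _ in range(256)]
          for _ in range(MAX_LEN)]

def zobrist_hash(name: bytes) -> int:
    """XOR together the table entry for each (position, byte) pair."""
    assert len(name) <= MAX_LEN
    h = 0
    for pos, byte in enumerate(name):
        h ^= ZTABLE[pos][byte]
    return h
```

Because each (position, byte) value is independently uniform, every output bit of the XOR is unbiased, which is exactly the property being sought.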

I'm astonished MurmurHash hasn't been mentioned yet!
http://murmurhash.googlepages.com/

Unless you need cryptographic security I haven't seen anything to beat it. (And of course, if you needed cryptographic security you wouldn't be using a 32-bit hash!)

Quote:
Original post by Martin
I'm astonished MurmurHash hasn't been mentioned yet!
http://murmurhash.googlepages.com/


WaterCrane, if you add MurmurHash to your profiling tests I would be very interested in seeing the results.

I'll certainly give it a test - gives me a chance to practise some assembly language too!

I am having similar problems on bias with CRC-64 as well, in that the last few characters cause the result to favour a certain parity. Though I can easily manipulate the hash or input in some way to produce results that are less biased, I would much rather find a fast algorithm that does not suffer from biasing, mainly so the archive's mechanics are as simple as possible and can easily be replicated in third-party implementations.

My own implementation of the archive format is programmed in Turbo Delphi, making use of inline assembler for the hash algorithms. I'm not an expert with C++ yet, but I'll start translating the MurmurHash routines.

I've had nothing but good results from DJB2.
template< typename C >
unsigned long GetDjb2HashCode( const C* string )
{
    unsigned long hash = 5381;      // unsigned avoids signed-overflow UB
    unsigned long c;

    while( (c = *string++) != 0 )   // parenthesised assignment-as-condition
    {
        hash = hash * 33 + c;       // the classic djb2 step
    }

    return hash;
}

