Hash functions for a table index

29 comments, last by WaterCrane 15 years, 4 months ago
Hashes are not isomorphic to the data they represent (unlike an encryption cipher). This is always the case, irrespective of the size of the hash, unless the set of discrete values to be hashed is known ahead of time and the hash function itself can be modified to yield a unique hash value for the input set. Additionally, a client may still generate spurious input outside that set which produces collisions, meaning that an explicit test of the underlying data is still required to exclude false positive matches.

If you require encryption, then encrypt first, using whatever (proven) cipher system you want, then hash and store. If you need compression, then compress first, then hash and store.

I would seriously reconsider a more conventional hash table approach, using chaining or the like to handle collisions. The approach of storing only a hash (as you suggest for filenames) and completely ignoring the underlying data is known as a Bloom filter.

Edit: my rotten spelling
Encoding the filename itself into the hash is a nice idea, but not suitable in this case because the filenames also contain directory paths and hence can be arbitrarily long. I was under the impression that storing data in the next free slot (open addressing) was a standard method of handling collisions in a hash table. That said, I do like the idea of a perfect hash function, because the archives are designed to be read-only once compiled; the main uses I can foresee are storing game data (which would usually only be updated through patches) and data archiving / backup, which should never change on the fly.

To make a perfect hash function, would appending a preset salt to the filenames do the trick - randomising and testing salt values until one works during compilation (or giving up after a set number of salts fail to yield a perfect hash, since otherwise we risk an infinite loop or a very long operation)?
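
Concretely, the search I have in mind looks something like this. It is only a sketch: FNV1a64 stands in for whichever 64-bit hash I settle on, and a simple deterministic salt sequence stands in for random 1-4 character salts:

function FindPerfectSalt(const Names: array of AnsiString;
                         TableSize: Cardinal; MaxAttempts: Integer;
                         out Salt: AnsiString): Boolean;
var
  Attempt, I: Integer;
  Slot: Cardinal;
  Used: array of Boolean;
  Clash: Boolean;
begin
  Result := False;
  for Attempt := 1 to MaxAttempts do
  begin
    Salt := AnsiString(IntToStr(Attempt)); { stand-in for a random 1-4 char salt }
    Used := nil;
    SetLength(Used, TableSize);            { reset to all False }
    Clash := False;
    for I := 0 to High(Names) do
    begin
      Slot := Cardinal(FNV1a64(Names[I] + Salt)) mod TableSize;
      if Used[Slot] then
      begin
        Clash := True; { two names landed in the same slot; try another salt }
        Break;
      end;
      Used[Slot] := True;
    end;
    if not Clash then
    begin
      Result := True;  { this salt is perfect for this file set }
      Exit;
    end;
  end;
  { fell through: give up rather than loop forever }
end;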

The archive format does support solid compression and encryption, and if they are enabled, everything bar the archive header is affected, including the File Hash Table, meaning the contents cannot be derived without the correct decryption key. Encryption is the last operation performed when writing (and decryption the first operation when reading), because trying to compress a stream of ciphertext is not going to yield good results!
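
Roughly, the write path is as follows (Compress, Encrypt and the other names here are placeholders, not the real routines; reading reverses the order):

var
  Payload: AnsiString;
begin
  Payload := Compress(FileData);        { placeholder compressor; runs on plain data }
  Payload := Encrypt(Payload, UserKey); { placeholder cipher; always the last step }
  ArchiveStream.WriteBuffer(Payload[1], Length(Payload));
end;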
Maybe I'm missing something, but how does this work for the user?

std::fstream character_data(????)


How do I, in code, specify which file I want? Or how do files cross-reference each other (mesh->texture)? And since hashes are used as identities, how do I know that the file I opened is really the file I wanted and not one caused by a collision?
You specify which file you would like to open, and the API (or one you program yourself) takes the hash of the filename and then calculates the index from the first 32 bits of that hash. With the index to hand, it jumps to that point inside the File Hash Table, which contains details on where to find the file data, metadata, and a full copy of the filename hash, along with a few flags. If the hash stored there does not match the one calculated, then it is a collision; if that occurs, it increments the index (modulo the size of the table), checking the next entries until it either finds a hash that matches the calculated one (in which case it has found the file), it finds an empty slot (file does not exist), or the index wraps around to the value it started at (file does not exist).
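
In Delphi-style code, the lookup is something like this (the record layout and all names here are illustrative only, not the final format):

type
  TFHTEntry = record
    FileHash: UInt64;    { full copy of the filename hash }
    DataOffset: Int64;   { where the file data and metadata live }
    Flags: Cardinal;
    InUse: Boolean;      { an unused slot terminates the probe }
  end;

function FindEntry(const FHT: array of TFHTEntry; Hash: UInt64;
                   out Entry: TFHTEntry): Boolean;
var
  Start, Index: Cardinal;
begin
  Result := False;
  Start := Cardinal(Hash) mod Cardinal(Length(FHT)); { index from 32 bits of the hash }
  Index := Start;
  repeat
    if not FHT[Index].InUse then
      Exit;                              { empty slot: file does not exist }
    if FHT[Index].FileHash = Hash then
    begin
      Entry := FHT[Index];               { full hash matches: found it }
      Result := True;
      Exit;
    end;
    Index := (Index + 1) mod Cardinal(Length(FHT)); { collision: probe the next slot }
  until Index = Start;                   { wrapped around: file does not exist }
end;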

Once I freeze the format and get an API working, I'll release a specification of the format under the GNU Free Documentation License.
Quote:Original post by WaterCrane
You specify which file you would like to open, and the API (or one you program yourself) takes the hash of the filename and then calculates the index from the first 32 bits of that hash. With the index to hand, it jumps to that point inside the File Hash Table, which contains details on where to find the file data, metadata, and a full copy of the filename hash, along with a few flags. If the hash stored there does not match the one calculated, then it is a collision; if that occurs, it increments the index (modulo the size of the table), checking the next entries until it either finds a hash that matches the calculated one (in which case it has found the file)


I don't get it. If you have a collision, how do you know which is the correct file? Somewhere you have to save the file name in an un-hashed form. A hash table requires the ability to test for equality of elements in addition to the ability to generate a hash for each element (unless the hash is perfect, but that depends on the file names you happen to have in this particular archive and it means you have to store the hash function in the archive in some form).

Also, surely it's more important to "hide" the contents of the file rather than the file's name?

If I were you, I'd take all this hashing stuff out and let the user add a layer of encryption on top if they want it.
Ah, I see... in the case of a full collision, it does break the archive, which is why I prefer at least 64 bits to be safe. For a 32-bit hash, there is about a 50% chance of a full collision at 81,920 entries; for a 64-bit hash, the 50% mark is at 5,368,709,120 entries. Since that exceeds 2^32, the maximum number of entries allowed in the File Hash Table, I consider it relatively safe. (Following the principles of the birthday attack.)
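
(For anyone checking the arithmetic: those figures come from the usual birthday-bound rule of thumb, n ≈ 1.25 * sqrt(2^b) for a b-bit hash, giving 1.25 * 2^16 = 81,920 for 32 bits and 1.25 * 2^32 = 5,368,709,120 for 64 bits.)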

Individual files can be encrypted if you so desire in order to disguise their contents, although I imagine "solid encryption" will be employed more often, which encrypts the File Hash Table as well.

Hope this clears up some of my ambiguities.
Yeah. I've seen a few archive formats in games where they don't store anything but the hash. Some work around the collision problem by storing the shortest unambiguous path string for any files that collide, thus ensuring there is something to test against while still allowing 99% of the entries to carry no strings in the file.

You could also fail to pack on collision.

I've seen formats where you hash on both ends, and to get around the chance of collisions there is only one archive per resource type (levels, textures, sounds). So in code I would call archives.find(0x0000F00D).Open(0xF00DCA5E) to get my files out of the archive, the hash just being a standard CRC32 of the relative path in all cases. This CAN be cumbersome, but the benefits are worth it: it is crazy fast to find things, and the memory footprint of both the .exe's data segment and the resource's file table is greatly reduced compared with a full-string-storage version. (That is a few hundred kilobytes of strings that could be better spent on text the player will actually see; since portable devices aren't the 4+ GB memory pool a PC is, this can mean a lot.)
A little update.

I programmed CRC-64 and FNV-1a (64-bit) algorithms in Turbo Delphi and ran some benchmarks; CRC-64 was twice as fast as FNV-1a, but that was mostly because Delphi lacks an unsigned 64-bit integer type and I had to use a workaround in the latter algorithm. Therefore, I performed the most insane thing possible... I reprogrammed them both in assembly language, having never used assembly language with an Intel processor before! (Of course, I do know enough CPU theory to have a head start)
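
For reference, textbook 64-bit FNV-1a looks like this in Delphi-style Pascal. This is a sketch of the published algorithm rather than my exact code, and it assumes a compiler with a UInt64 type; with only Int64 available, disabling overflow checks gives the same bit pattern, since the multiply wraps modulo 2^64 either way:

const
  FNV64_OFFSET_BASIS: UInt64 = $CBF29CE484222325;
  FNV64_PRIME: UInt64 = $00000100000001B3;

function FNV1a64(const S: AnsiString): UInt64;
var
  I: Integer;
begin
  Result := FNV64_OFFSET_BASIS;
  for I := 1 to Length(S) do
  begin
    Result := Result xor Byte(S[I]); { 1a variant: xor the byte in first... }
    Result := Result * FNV64_PRIME;  { ...then multiply, wrapping modulo 2^64 }
  end;
end;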

My assembler code for FNV-1a was about 20% faster than the Pascal version, but the CRC-64 assembler was slower than Pascal, which meant I had a lot to learn if I was to beat the Delphi compiler at optimisation! However, with a little more research, experimentation, and advice from other programmers, the CRC-64 assembly language is now faster than Pascal, and the FNV-1a even faster still.

My benchmark tests aren't concrete, but I did my best to compensate for the thousands of CPU cycles stolen by the Windows API timing routines, and by other processes, by only checking the clock every 100 runs and averaging over 10 seconds for each algorithm in turn. I only hashed the same 10-character string over and over again, though, since retrieving strings from a list box was skewing the statistics too much. For comparison, I also tested SHA-1, SHA-256 and MD5 (in Pascal). The figures are, approximately, the number of times the hash was calculated in one second.

CRC-64 (Pascal) - 7,767,560 Hz
CRC-64 (Assembler) - 8,629,410 Hz

64-bit FNV-1a (Pascal) - 3,811,310 Hz
64-bit FNV-1a (Assembler) - 9,104,370 Hz

SHA-1 - 619,540 Hz
SHA-256 - 620,560 Hz
MD5 - 620,100 Hz

I haven't implemented XTEA yet, but to be honest I cannot see it being as fast as CRC and FNV.

Long story short, I am leaning towards FNV-1a as the hash function used for table lookups; I have yet to test it for collision resistance, but I haven't found anything that shows it has bad distribution. Thank you Hodgman for introducing me to it.

ADDENDUM: My assembly language code is very likely to be less than optimal, so there's a chance that CRC-64 could actually be faster; the "scores" are very close after all!
Quote:Original post by WaterCrane
[...]To make a perfect hash function, would appending a preset salt to the filenames do the trick, randomising and testing salt values until one works during compilation (or just giving up after a set number of salts fail to yield a perfect hash, otherwise we risk an infinite loop or otherwise a very long operation)?[...]
That works as long as you don't have any total collisions. As an example of perfect hashing, check out Perfect Hashing. There is also NIST on minimal perfect hashing, which has some links to implementations.
Quote:Original post by WaterCrane
[...]and ran some benchmarks[...]
I'm quite curious to see your benchmark code.

[Edited by - Extrarius on November 17, 2008 12:39:28 PM]
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
The test code for each algorithm is the following (I'm not sure how to tabulate code on this forum to make it more readable - sorry!):


var
  Start, Current, Freq: Int64;
  Count, Index: Integer;
  { Input, InputLen, Hash and Freq are set up beforehand (see note below) }
begin
  Count := 0;
  QueryPerformanceCounter(Start);

  repeat
    { hash in batches of 100 so the timer is not queried on every iteration }
    for Index := 1 to 100 do
      gjuCRC64(Input, InputLen, @Hash);
    Inc(Count, 100);
    QueryPerformanceCounter(Current);
  until Current >= Start + Freq;

  lblCRC64A.Caption := IntToStr(Count);
end;


... replacing "gjuCRC64" and the label object "lblCRC64A" for each algorithm. Freq is the result returned by QueryPerformanceFrequency multiplied by the number of seconds to run the test for. Several of these tests run one after the other. As you can see, it is very quick-and-dirty and not exactly scientifically sound (for instance, I should really subtract 50 from the final value of Count, since the loop overshoots the deadline by half a batch on average).

I have, however, run into another possible problem with perfect hashes that I overlooked. If my statistics serve me correctly: take N as the number of entries in the table, and assume the hash function result, modulo the number of entries, has an equal probability 1/N of landing anywhere in the range 0...N-1. For N runs of the hash function, there is a 1/(N^N) chance of the hash returning any particular sequence of indices between 0 and N-1, and there are N! sequences with no duplicate values (i.e. no collisions). Therefore, the probability of an unbiased hash being perfect for an N-sized table is N!/(N^N)...

N = 1, P = 1 ... N = 2, P = 0.5 ... N = 3, P = 0.22222 ... N = 4, P = 0.09375 ... N = 5, P = 0.0384 ... N = 6, P = 0.01543 ... N = 7, P = 0.00612

The probability shrinks by a factor of roughly 0.4 each time N is increased by 1 (tending towards 1/e ≈ 0.37 for large N), so it becomes prohibitively small very quickly.
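
(To make the decay rate explicit: by Stirling's approximation, N!/(N^N) ≈ sqrt(2*pi*N) * e^(-N), and the ratio P(N+1)/P(N) = (1 + 1/N)^(-N), which is 4/9 ≈ 0.44 at N = 2 and tends towards 1/e ≈ 0.37, matching the figures above.)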

What this means is that a collision in the File Hash Table (FHT) is quite likely even with a salt (if the entire hash collides, an error is returned). It is not a major problem, because it can be remedied by making the FHT larger (leaving empty spaces in the table once all of the files have been entered), or by simply accepting that a collision may be inevitable. Still, I thought it was something I should share.


ADDENDUM: I'm checking my code for errors, but I suspect that the result of 64-bit FNV-1a favours even numbers over odd numbers, or vice versa. With an odd number of test strings (i.e. Hash mod OddNumber) I can generally find a salt fairly quickly, whereas with an even number I often fail to find a suitable salt even after trying 2^24 (~16 million) different values... either that or it could be a freak coincidence!

ADDENDUM DUO: Just to check that my implementation of FNV-1a is not faulty, can anyone supply some known test vectors for this hash?

ADDENDUM TRIO: Update. It is starting to look like a weakness in FNV-1a... because the same salt is appended to the end of each filename (the salt, if present, currently ranges from 1 to 4 characters between ASCII codes 32 and 126), it seems to cause the hash to favour odd numbers or even numbers. When I increment the salt little-endian (for a 4-character salt, the 4th character from the end of the concatenated string), the hash result predictably alternates between odd and even in some cases (not all the time though... it depends on the filenames I supply). Can anyone else confirm this?
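
(In fact, the arithmetic predicts this, wherever the incremented character sits in the string: the FNV prime is odd, so the multiply preserves the bottom bit, and xor is bit-local, which means the bottom bit of an FNV-1a hash is just the offset basis's bottom bit xor'd with the bottom bits of every input byte. Consecutive character codes alternate their bottom bit, so incrementing one salt character flips the parity of the hash, and of Hash mod any even table size, on every step; with an even table, each filename's slot parity is pinned by its byte parities, which no shared salt can repair. A quick check, reusing the FNV1a64 sketch from earlier - the path is arbitrary:)

var
  C: AnsiChar;
begin
  { the printed bit alternates 0,1,0,1,... as the salt character increments }
  for C := 'A' to 'H' do
    WriteLn('salt <', C, '>  hash low bit = ', FNV1a64('gfx/stone.bmp' + C) and 1);
end;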

[Edited by - WaterCrane on November 25, 2008 12:59:04 AM]
