Sign in to follow this  
_Sigma

Hash Table Tutorial

Recommended Posts

_Sigma    792
I've looked around(here, google), but I've yet to find a really good, really indepth tutorial about creating your own hash table class. Any good links? =) Cheers //edit. In C++ would be great too =)

Share this post


Link to post
Share on other sites
iMalc    2466
Quote:
Original post by SiCrane
Did you see this one when you were searching gamedev?

Yeah that tutorial is okay, but you could simply forget prime numbers altogether and non-power-of-two sized hash tables and use a CRC function.
It gives about the best hash you can get for practically any datatype you could be using, AND you can then use a power-of-two for the hash table size, avoiding that horrible slow '%' operation by using an '&' instead. I have found this to work exceptionally well in the real world, beating other commercial programs.

Also, if your table size reaches the capacity, then you simply create a new one of double the size. Conversely, if the size goes below 1/4 of the capacity, you may like to shrink it.

Share this post


Link to post
Share on other sites
_Sigma    792
Quote:
Original post by SiCrane
Did you see this one when you were searching gamedev?


I did. I found it lacking in its explication tho..
@iMalc... Eh? Any tutorials on this method?

Share this post


Link to post
Share on other sites
_Sigma    792
What does funnel mean?
Ok, CRC does seem to be rather nice. Do you know of any tutorials describing how to impliment this?

Share this post


Link to post
Share on other sites
hplus0603    11356

To get the hash of some block of data:


unsigned int hash( unsigned char const * base, size_t size ) {
return crc32( base, size );
}


To get the table index, given the hash:


unsigned int table_index( unsigned int h ) {
assert( !(table_size&(table_size-1)) );
return h & (table_size-1);
}


This works for all non-0 powers of 2 for table_size. It also works for most kinds of data that you want to hash. If you don't have a crc32 function laying around (you'll find it in zip, zlib, etc), then you can use a linear congruential rng instead:


// note: I haven't tuned the specific constants used, they are
// just examples. You probably want to tweak them so you get a
// good spread in the lower bits given similar inputs.

unsigned int hash( unsigned char const * base, size_t size ) {
unsigned int v = 0x54321;
while( size > 0 ) {
v = v*0xffeeddcb + *base + 0x938265;
++base;
--size;
}
return v;
}

Share this post


Link to post
Share on other sites
T1Oracle    100
Quote:
Original post by SiCranehorrible slow '%' operation

Why would that be "horrible slow"? Something like...

// example of c=a%b
c=a;
while (c>b)
c-=b;





Has to be fast on a modern processor. I would figure something like that would be optimized considering its frequency of use in software.

Share this post


Link to post
Share on other sites
smart_idiot    1298
Loops imply conditionals, and conditionals are slow. Multiply by several hundred thousand iterations for something like 563223%3, and thats pathetic.

Modulus is considered slow because it entails division, which is one of the slower operations. A more realistic example might look like this:


template <typename T> T modulus(T a, T b)
{
return a-b*static_cast<int>(a/b);
}



Share this post


Link to post
Share on other sites
SiCrane    11839
Quote:
Original post by T1Oracle
Quote:
Original post by SiCranehorrible slow '%' operation

Why would that be "horrible slow"? Something like...

Hey now, if you're going to quote something, at least attribute it to the right guy. :P

Share this post


Link to post
Share on other sites
_Sigma    792
Quote:
Original post by hplus0603

To get the hash of some block of data:


unsigned int hash( unsigned char const * base, size_t size ) {
return crc32( base, size );
}


To get the table index, given the hash:


unsigned int table_index( unsigned int h ) {
assert( !(table_size&(table_size-1)) );
return h & (table_size-1);
}


This works for all non-0 powers of 2 for table_size. It also works for most kinds of data that you want to hash. If you don't have a crc32 function laying around (you'll find it in zip, zlib, etc), then you can use a linear congruential rng instead:


// note: I haven't tuned the specific constants used, they are
// just examples. You probably want to tweak them so you get a
// good spread in the lower bits given similar inputs.

unsigned int hash( unsigned char const * base, size_t size ) {
unsigned int v = 0x54321;
while( size > 0 ) {
v = v*0xffeeddcb + *base + 0x938265;
++base;
--size;
}
return v;
}

Ohhhhkkkayayyy.....You lost me :P hahaha. Could you explain that a little more indepth, such as what to do with collison in thos cases, as well as what exactly each function does, and kinda step by step as to how I would impliment this? I raided my uni's library, and got "Itroduction to data structures and algorithms with C++ by Glenn W. Rowe (c) 1997 So i'll read that so I atleast know some of the terms :P Has anyone else read this book? Is it good or?

Cheers! =)

Share this post


Link to post
Share on other sites
iMalc    2466
What you want to use is the method called 'seperate chaining' what this means is that you never rehash, instead you simply append the item to a list an a position given by the hash function and table mask.

The data structure is a simple array where each element is a pointer to a linked list. This is better than rehashing because it allows you to easily remove the item later.

Here's a CRC implementation for you.

Here and here are some visualisation applets of hash tables, for you.

This is a tutorial about hash tables for java, although it's definately still worth a read.

There is a ton of information out there about hash-tables, keep googling.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster   
Guest Anonymous Poster
Quote:
Original post by T1Oracle
Quote:
Original post by SiCranehorrible slow '%' operation

Why would that be "horrible slow"? Something like...
*** Source Snippet Removed ***
Has to be fast on a modern processor. I would figure something like that would be optimized considering its frequency of use in software.
What you wrote is usually *much* slower than '%', considering that the hash value can be on range 0 to (2^32-1) and it is "reduced" on a tiny range like 0-1023. You might have to loop 1 million times in your "fast" loop to get the same result '%' does in one or two instructions.

I don't think '%' is horribly slow either. Computing the CRC or some other hash for a string probably takes so much longer that using '%' doesn't really matter. But I haven't done benchmarks, could very well be wrong (at least for small strings).

Share this post


Link to post
Share on other sites
T1Oracle    100
Quote:
Original post by smart_idiot
Loops imply conditionals, and conditionals are slow. Multiply by several hundred thousand iterations for something like 563223%3, and thats pathetic.

Modulus is considered slow because it entails division, which is one of the slower operations. A more realistic example might look like this:

*** Source Snippet Removed ***


But on modern CPU's SIMD operations allow for multiple Floating Point operations in only 2 cycles. How could any float operation be faster than an integer operation?
Quote:
3DNow!™Technology ManualAll 3DNow! operations
have an execution latency of two clock cycles and are fully
pipelined.

Share this post


Link to post
Share on other sites
nmi    978
Quote:
Original post by _Sigma
Could you explain that a little more indepth, such as what to do with collison in thos cases, as well as what exactly each function does, and kinda step by step as to how I would impliment this?


Your hashtable maps keys to values, so it is a function t:K->V, where K is the set of possible keys, and V is the set of possible values.
Your hash function maps each key to a much smaller hash, so it is a function h:K->H, where H is the set of all possible hash values.
Since K can contain much more keys than you have hash values of H available, multiple keys will have the same hash value. In other words, those keys collide.

To get around this, different techniques exist.
Internal hashing will find another place in the table itself, for instance by rehashing.
External hashing will not store the data itself in the table, but just a link to the data. If multiple entries can share the same hash, the link will be implemented as a linked list in most cases. Other options would be a tree or a second hash table (with a different hashing algorithm).


Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this