If you find this article contains errors or problems rendering it unreadable (missing images or files, mangled code, improper text formatting, etc) please contact the editor so corrections can be made. Thank you for helping us improve this resource |
Contents
Meet the Hash Table...
Hash tables are data structures that are used when you are managing a large amount of data and need to be able find an item quickly. They are used a lot in compilers, for example, to quickly find
symbol table entries for identifiers in the source code. They could also be used in a text game to look items up in the player's inventory. Any program which may need to look data up by name could
use a hash table.
How Does It Work
The table part is just an ordinary array, it is the hash that we are interested in. The hash is a function that transforms a string (it could be other things, too; but this article
is just about strings) into an integer. If the size of the table is N, then the integer will be in the range 0 to N-1. The integer is used as an index into the array. Thus, in essence,
the string itself indexes the array, although there is an intermediate step.
Although in theory this is great and all, there are always a few snags (of course). First, it is obvious that if the table is a fixed size there will be a limit on the number of entries. Second,
how does the hash avoid sending two or more strings to the same index?
Unfortunately, the solution to the second problem is beyond the scope of this article. It's called perfect hashing, and can be found in an advanced book on data structures. We will have to
compromise and find a way to make it work, even when two or more strings do go to the same index. When this happens it is called collision.
In this article, we'll look at one way to solve these problems and implement a hash table. The method is called chaining, which solves both the size problem and the collision problem.
Another often used method is called rehashing or open addressing. However, chaining is more versitile and easier to code, so it makes a better introduction.
"WhatNeedKnow">
What Do I Need to Know?
Just C for all the code. It's pretty basic stuff, a little pointer arithmetic, but not much. A handy guide to the C library functions might help if you don't have them memorized. You will need to
know about linked lists, so check out a tutorial on that first, if you need to. Lastly, a little knowledge of number theory might make the math more interesting, but you don't really need to know it
(you can skip the math entirely and still write the code with no problem, for sure).
Chaining
Chaining resolves collision by using a linked list. (If you don't know about linked lists, check out a tutorial on them before going on.) We create a linked list at each entry of the table. When a
string hashes to an index, we stick it on the list. Note that this solves both collision and size.
First, we'll create a structure for an entry. It will simply contain a string field and a next pointer:
typedef struct ENTRY_t /* hash table entry */ { char name [SIG_CHARS + 1]; struct ENTRY_t *next; /* your specific fields here */ } ENTRY;
Any other fields can be filled in ad hoc (or in a derived class, in C++). You can make SIG_CHARS (number of significant characters) whatever you want. You could also make the strings'
length dynamic.
We'll have an array of pointers to these entries (the heads of the lists) which will comprise the table. The size of the array will be a prime number. For those who don't know, a prime number is
one that can't be divided evenly by any other number except one (the number 1). 2, 3, 5, 7, 11 and 13 are the first few prime numbers. We'll need a bigger one, though, so let's go with, say, 383. You
can choose the size to fit your needs, however some sizes work better than others (I'll come back to this).
#define PRIME 383 ENTRY *table [PRIME]; /* each element is a linked list */
Now, we have an array of linked lists, and this is our table. The next step is to write the hash. This is a function that will, in our case, transform any string into an integer in the range 0 to
382. This is usually done in two steps: convert the string into an integer; get the integer into the correct range.
The second part is easy. There are several ways to do it, but the simplest is to just take the remainder of the integer when divided by the size of the table (in other words, the number mod
PRIME).
As for the first part, there are many, many ways to accomplish this. For example, you could just add up the ASCII values of the characters in the string. Even better, we should make the hash
dependant on the order of the characters. A standard function looks something like this:
unsigned int Hash (const char *name) { unsigned int hash = 0; /* transform the string into an integer */ while (*name != '\0') { hash = hash * MULTIPLIER + *name; ++ name; } /* reduce the integer to the correct range */ return hash % PRIME; }
The function adds up the ASCII values of the string, but it weighs them according to their location in the string. Many different values have been tried for MULTIPLIER. For example, if
MULTIPLIER is 5, the name "bat" adds up to 5^{2} * 98 + 5^{1} * 97 + 5^{0} * 116 = 3051. The remainder is then taken to give 3051 % 383 = 370. According to the Dragon Book
(Aho et al. 1986), 65599 works well for MULTIPLIER. We don't need to worry about overflow, because even if that happens it will still produce a valid integer. MULTIPLIER is not arbitrary,
however. If we made MULTIPLIER the same as PRIME, we would be returning the remainder of only the last character of the string (assuming overflow did not occur)! In theory, MULTIPLIER should not have
factors in common with the size of the table. However, in a later section we'll see that even this apparently doesn't stop it from performing well in some cases (hint: overflow)!
So, now the rest is easy. When we get a string, we will hash it and add it to the list at the correct index. Our funtion will be called Enter, it will take a string and return a newly created
ENTRY* (note that no check is done to see if the name is already in the table):
ENTRY *Enter (const char *name) { unsigned int index = Hash (name); ENTRY *node = malloc (sizeof (ENTRY)); if (node == NULL) return NULL; /* initialize the node (copy the name into it and anything else) */ strncpy (node->name, name, SIG_CHARS); node->name [SIG_CHARS] = '\0'; /* put the new node at the head of the list at table [index] */ node->next = table [index]; table [index] = node; return node; }
The new node is put at the head of the list. Note that even though only SIG_CHARS characters are stored in the entry, the hash actually sees more than that. So if two different strings have the
same first 10 characters, and SIG_CHARS is 10, they may still hash to different indexes - which is good.
Now, we can write functions for lookup, deletion and whatever else, but all the hashing stuff will be the same. To look a string up, just get the index from the hash function and compare the
string with the name of each node in the list at that index. Either the string is found and you return a pointer to the node, or it isn't there and you return NULL. As long as you are up on your
linked lists, you shouldn't have any problems.
How do I Pick MULTIPLIER and the Size of the Table
Let's first introduce some terminology here (I am introducing my own terms, they may or may not be used by other people). Call the first hash - changing the string to an integer - the
primary hash and the second hash - taking the remainder of the division by the size of the table - the secondary hash. The primary hash deals with MULTIPLIER and the secondary hash
deals with the size of the table.
Let's look at the secondary hash first. Before, I said that the size (call it SIZE from now on) should be prime. But why? Now I have a gripe because I looked for this in a number of books and
sites and couldn't find any good mathematical reasons. They all said something about patterns, but that was it. So I ended up figuring it out myself, after some preliminary head scratching, and now I
bring it to you. Skip the next three paragraphs if you just don't care about the math.
Call the result from the primary hash H. The secondary hash is then H % SIZE. Then what makes H % SIZE good or bad. Well, observe what happens if H and SIZE have a common factor, say F. So H = F *
H' and SIZE = F * SIZE', where H' and SIZE' are what's left when F is divided out. Looking at (F * H') % (F * SIZE'), we note that the F distributes across the mod operator. That is (F * H') %
(F * SIZE') = F * (H' % SIZE').
If you want proof, it's a simple use of the division algorithm. We are looking for (ab) mod (ac), which is r in the equation ab = acq + r (r is in the range 0 to ac - 1). Since r = ab - acq, r is
divisible by a. So let r' = r / a. Now, dividing out a we get b = cq + r'. If we can show that r' is in the range 0 to c - 1, then r' will be b mod c. But if r' is greater than or equal to c then r
is greater than or equal to ac (by multiplying both sides by a), and this is a contradiction. Hence, r' = b mod c and ar' = r = a (b mod c) = (ab) mod (ac), the far right side being our original
equation.
So what? OK, say that SIZE is divisible by 2. Then any time H is divisible by 2, H % SIZE will be divisible by 2. Using a negative argument, any time H is not divisible by 2, H % SIZE won't be
either. In other words, as far as the secondary hash is concerned, even numbers hash to even indexes and odd numbers hash to odd indexes. If SIZE was also divisible by 3, say, then multiples of 3
would hash to multiples of 3 and non-multiples of 3 would hash to non-multiples of 3. It still doesn't seem like a big deal because we would expect half the numbers to be even and the other half to
be odd. Unfortunately, this is very unlikely. Now we see what those other articles meant when they talked about patterns! The sample set is more likely to be biased (especially, the smaller it is)
and the problem is the the secondary hash is perpetuating this instead of mixing it up. You could think of the secondary hash as being too lazy. Therefore, it is best, in general, to
use a prime number for the size, because it only has one factor - itself.
I've seen two seemingly contradictory views on which prime should be chosen for the size of the table. One says that it should be as near to a power of 2 as possible, and one says just the
opposite - it should be as far away from a power of 2 as possible. I'm inclined to go with the second option because it appears to be more widely claimed.
What about MULTIPLIER? The only thing to be said in general, is that it shouldn't have any factors in common with SIZE, as we've already seen. Again, prime numbers are a popular choice (they're so
cool, aren't they!) This brings me to my next point which I cannot stress enough, so I'm giving it its own section.
Testing the Hash
Before writing an apparently great hash function and sticking it in your project, take the time to run sample sets of data through it. Make sure the sample set is similar to what would really be
used. Observe the distribution of the hash into the table. Try out different values of MULTIPLIER to get the distribution even.
A simple way to measure the distribution in the table is to sum the squares of the number of names hashed to the same index. The following code implements a hash testing program:
#include <stdio.h> #include <math.h> #include <assert.h> #define PRIME 23 unsigned int table [PRIME] = {0}; unsigned int sampSize = 0; unsigned int multiplier; unsigned int Hash (const char *name) { unsigned int hash = 0; while (*name != '\0') { hash = hash * multiplier + *name; ++ name; } return hash % PRIME; } void Enter (const char *name) { /* count how many names hash to each index */ ++ table [Hash (name)]; } void TestSample (void) { char name [64]; FILE *file = fopen ("sample.txt", "r"); assert (file); /* read MULTIPLIER off the first line of the file */ fscanf (file, "%u", &multiplier); /* enter names from the file until it reaches the end */ while (fscanf (file, "%s", name) != EOF) { Enter (name); ++ sampSize; /* keep a total of words "in" the table */ } fclose (file); } void TallyResults (void) { double score; unsigned int index, sum, q = sampSize / PRIME, r = sampSize % PRIME; /* add up the squares of the number of words hashed to each index */ for (index = sum = 0; index < PRIME; ++ index) sum += table [index] * table [index]; /* compute a relative score and print it */ score = (double) sum / (double) (q * (PRIME * q + 2 * r) + r); printf ("score = %f\n", score); } int main (void) { TestSample (); TallyResults (); return 0; }
You'll notice that very little is going on here. We aren't even adding anything to the table. We're just counting how many times each index is hashed to. TestSample opens a file called sample.txt.
The file begins with an string-integer which will be used as MULTIPLIER and is followed by any number of strings. It reads in the strings and hashes them one by one. Enter just increments a running
total of each index.
When we get down to TallyResults, we add up the square of each index count. We square them so that bigger values (more strings hashing to the same place) are penalized more than smaller ones.
After getting the total, we divide by the funny looking expression above the printf statement. This expression is the lowest (best) possible total we can have, when all the strings are distributed as
evenly as possible. It's not hard to derive but not really important to do so. Just take a look at what q and r are and I'm sure you can figure it out. Dividing by this normalizes the result
so that the best score is always 1.0. The bigger the score, the worse the hash.
Any other statistical info that might be helpful can be gathered, too. The longest list and the number of empty lists for example, might be nice to know.
You'll also notice that I made the size of the table much smaller. This is just to illustrate my point with an example. I ran the following set of strings through the program:
sword mace axe arrow shield bag stone key skull jar bottle fairy potion water spoon book spear dagger katana helmet chain halberd pipe hat eyeofnewt soup wolfbane soap instantcoffee bugspray flint bones orb gold silver wine bread
Then I played around with MULTIPLIER for a bit. The following table summarizes the results:
MULTIPLIER | 1024 | 5 | 1000000007 | 65599 | 1048576 | 49157 | 46 |
score | 1.185 | 1.246 | 1.369 | 1.523 | 1.615 | 1.646 | 1.800 |
A little bit surprisingly, 1024 (a power of 2) did best; but 1048576 = 1024^{2} didn't do too great. Neither did 512 or 2048 (not shown), by the way. The small prime 5 did fairly well
while the very large prime 1000000007 did a tid bit worse. The prime 65599 close to a power of 2 takes us to the poor side. 49157, the prime far away from a power of 2 is no good either. The worst,
46 = PRIME * 2, was tried just to see how well it performed (this will only take the remainder of the last character unless overflow occurs). I must admit, I was quite surprised to see that 12167 =
23^{3} (not shown) scored a 1.215, though! This was because the primary hash was overflowing very quickly. Even short words like "axe" caused overflow. It just goes to show that you've gotta
test these things!
This doesn't mean that 1024 is always the best, or even the best in this case (just the best one I found in about ten minutes of playing with it). It depends on the names being hashed as well as
the size of the table. Even the size of an unsigned int affects it (maybe an unsigned __int32 would be better). Just make sure it works with a large sample set similar to what would really be used.
If you are writing a C compiler, go through a few dozen C files and pick out a few hundred of the most popular identifiers, etc.
Anyways, you see how this goes. Make sure you test your data because how good a MULTIPLIER is depends a lot on the sample set. If you are a real perfectionist, write some kind of AI to choose a
MULTIPLIER for you. I have been tinkering with a Genetic Algorithm to do just that.
What Other Hash Functions Are There?
Anything that turns a string into an integer is a primary hash function, and anything that reduces that integer into the correct range is a secondary hash function. I've seen a handful of other
secondary hash functions as well, which can be found in a good data structures book. Primary hash functions are easier to invent, so we'll focus on those.
One way to spice it up is to use a bitwise exclusive or (xor) instead of an addition in the above hash function:
while (*name != '\0') { hash = hash * multiplier ^ *name; ++ name; }
A totally different type of hash function comes, again, from the Dragon Book (Aho et al. 1986):
int Hash (const char *name) { unsigned int g, h = 0; while (*name) { h = (h << 4) + *name; if ((g = h & 0xF0000000) != 0) { h ^= g >> 24; h ^= g; } ++ name; } return h % PRIME; }
I can't tell you exactly how this monstrosity works, but according to the tests they ran it through, it seems to. It was written by P.J. Weinberger. I've used it before, just for fun really, with
good results. It's obviously a whole different breed than the simple weighted addition method.
Basically, when writing your own, be sure to test it. Very little practical advice can be given in general. The whole point of writing a primary hash is to destroy patterns in the strings. As you
might guess, this is easier said than done, especially if you don't know what strings will be going into your table. Most of the time, you will want to use all the characters of the strings and make
their order count. I like the add-and-weight strategy used in this article because it is simple, fast and gives you a lot of versatility because you have so many MULTIPLIERs to choose from.
If you are interested, universal hashing is a method which can amortize (make level over time) the efficiency of the table by using a "random" hash. It's not exactly random, which is
why I put it in quotes. This method can be found in a good data structures book like Introduction to Algorithms, 2nd ed. (Cormen et al. 2001)
Anything else about Chaining
Using a circular linked list instead of a regular one is a neat idea. When a name is looked up, the list can be rotated so that the name will be at the head of the list next time. This
works well if a name is likely to be used several times close to each other, as in a C program. Also, if you want to delete things from the list by node, you might find a doubly linked list to be
more suited for that.
Thanks for reading my article! ;-) Please comment at gaulerthegoat@yahoo.com.
Works Cited
- Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. >Compilers: Principles , Techniques, and Tools. ("DragonBook") Reading: Addison Wesley Longman, 1986.
- Corman, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, 2nd ed.Cambridge: The MIT Press, 2001.