Hashcodes

Started by
8 comments, last by Gyzmo 23 years, 11 months ago
I am currently coding some container classes and I can''t think of a good way to generate a hashcode for a string. What is a good way to do this? Gyzmo
Gyzmo=============================="Well is the world standard" - The Alchemist"Not in Canada!" - jonnyfish
Advertisement
the sum of the ascii codes of the individual characters modulus size of table

--
Float like a butterfly, bite like a crocodile.

--Float like a butterfly, bite like a crocodile.
What is a hashcode used for? Since it involves strings, I probably want to add it to my string libarary. Could you give an example of its use?

Wait a sec... would it tell you the most used character in the string? I am not in the mood to do the math to prove or disprove that...

--------------------


You are not a real programmer until you end all your sentences with semicolons;

Yanroy@usa.com

Visit the ROAD Programming Website for more programming help.

--------------------

You are not a real programmer until you end all your sentences with semicolons; (c) 2000 ROAD Programming
You are unique. Just like everybody else.
"Mechanical engineers design weapons; civil engineers design targets."
"Sensitivity is adjustable, so you can set it to detect elephants and other small creatures." -- Product Description for a vibration sensor

Yanroy@usa.com

A hashcode is a way to deterministically assigning a number to a given piece of data so it can be indexed in a hashtable.
This is used in dictionaries for instance, where it''s used to look up words in a long list of entries.

A hash-table is a combination of an array, and linked lists. The array is indexed by the list-elements hash-codes, and the list contains the actual data. That way, you have separate little "boxes" for data with the same hashcode, making it faster to use than a plain linked-list, but not very much more complex.


#pragma DWIM // Do What I Mean!
~ Mad Keith ~
**I use Software Mode**
It's only funny 'till someone gets hurt.And then it's just hilarious.Unless it's you.
A Hashtable isn''t necessarily a combination of an array and linked lists. That''s only the case for open hashing. Closed hashing schemes only need an array. (But suffer from other disadvantages.)

The sum of the ascii codes of the individual characters is not necessarily a good way to calculate hash values. It tends to clump, strings with the same length to similar hash values. A better method is to compute h mod m, where h is obtained iteratively by starting with 0, multiplying the old value by a and adding in the next character. a = 65599 is a good value, as is any largish prime.

Another nice method comes from P.J. Weinberger''s C compiler:
int hashpjw(char * s) {  char *p;  unsigned int h = 0, g;  for (p = s; *p != ''\0''; p++) {    h = (h << 4) + *p;    if (g = h & 0xf0000000) {      h = h ^ (g >> 24);      h = h ^ g;    }  }} 

Though it''s effectiveness goes down significantly as hash table size approaches 2^24. For most reasonable hash table sizes (say one trillion entries and below), it performs quite well.
what''s wrong with similar hash values? as long as the collisions are kept to a minimum.. and with closed hashing, collision resolution for lookups is pretty fast..

-goltrpoat



--
Float like a butterfly, bite like a crocodile.

--Float like a butterfly, bite like a crocodile.
I think he was just trying to say that your hashcode should lead to a more or less even distribution across the different hash buckets.

And I completely forgot about closed hashing! Damn was it that long ago that I took that class? :-)


#pragma DWIM // Do What I Mean!
~ Mad Keith ~
**I use Software Mode**
It's only funny 'till someone gets hurt.And then it's just hilarious.Unless it's you.
quote:Original post by goltrpoat
what''s wrong with similar hash values? as long as the collisions are kept to a minimum.. and with closed hashing, collision resolution for lookups is pretty fast..

Um, similar hash values lead to increased collisions.

Let''s say I have a program that generates code for me, and it uses stylized symbol names so that it won''t collide with my symbols. (ex: lex, yacc). So it creates symbols like _Vt_XXXX where XXXX is a four digit number. So one symbol might look like _Vt_3333 and another might look like _Vt_1335. Which hash to the same location when just adding the ASCII values. So does _Vt_0345, _Vt_0237, _Vt_0048, _Vt_0129, etc, etc, etc. And _Vt_0346 will hash right next to all those values, so we can throw out linear probing right off the bat.

And also the relative variance on the hashcodes goes down as the number of character''s increase. That is to say that there''s a relatively greater chance that two strings will hash to the same value the longer they are. This is counter-intuitive to the nature of hashing. Longer strings give more hashing information, so should lead to more different hash values. Practically, this is bad because the longer the strings are, the more computation is necessary to resolve that the two strings are different, even though they hash to the same value.
The old-school hash for strings was to form a polynomial from the letters:
c a t = c * 26^2, + a * 26 + t

Of course, it's much easier to multiply by 32 in a binary system (it becomes a left-shift by 5), and you can use that funky math rule to find the sum this way:
char *sp = string;unsigned long hash = 0;while (sp){  hash <<= 5;  hash += *sp++;}  


Step into a CArray of CStrings and you'll see that this is the way Mickeysoft still does it.

Edited by - Stoffel on 5/1/00 3:47:17 PM
There is also another very quick and efficent string hashing algorithm that I used in a recent dictionary program that I had to write for my CS70 class. Here is the well commented code.

/* * Hash a string using the modified CRC method.  The basic idea of * this function is that the hash value is rotated left by 5 bits and * then the next character is exclusive-or''ed in to the hash value. * The implementation is complicated by the lact of a rotate operation * in C++. * * This hash function does not work well with table sizes that are a * power of two. */unsigned int hashStringCRC(		// Hash a string    const char*		key)		// Key to be hashed    {    unsigned int hashValue = 0;    while (*key != ''\0'')	{	/*	 * The following expression could be done in one line, but it	 * would be really nasty, and a modern compiler ought to	 * generate the same code whether it''s one line or several.	 * So we''ll break it up to make it easier to read.	 *	 * First, we shift the value left to make room for bits from	 * the new key character.	 */	unsigned int leftShiftedValue = hashValue << CRC_HASH_SHIFT;	/*	 * Shifting left lost the top bits, so we have to extract and	 * position them separately with a right shift.  If we were	 * writing in assembly, we could do all of this in a single	 * rotate instruction, but C++ doesn''t give us access to that	 * machine operation so we have to do it the hard way.	 */	unsigned int rightShiftedValue =	  hashValue >> (WORD_WIDTH - CRC_HASH_SHIFT);	/*	 * Put the shifted values together, and then XOR them with the	 * next key character (stepping past it in the process).	 */	hashValue = (leftShiftedValue / rightShiftedValue) ^ *key++;	}    return hashValue;    } 


Check out the GPI project today!

This topic is closed to new replies.

Advertisement