Jump to content
  • Advertisement

Archived

This topic is now archived and is closed to further replies.

Gyzmo

Hashcodes

This topic is 6933 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I am currently coding some container classes and I can''t think of a good way to generate a hashcode for a string. What is a good way to do this? Gyzmo

Share this post


Link to post
Share on other sites
Advertisement
the sum of the ascii codes of the individual characters modulus size of table

--
Float like a butterfly, bite like a crocodile.

Share this post


Link to post
Share on other sites
What is a hashcode used for? Since it involves strings, I probably want to add it to my string libarary. Could you give an example of its use?

Wait a sec... would it tell you the most used character in the string? I am not in the mood to do the math to prove or disprove that...

--------------------


You are not a real programmer until you end all your sentences with semicolons;

Yanroy@usa.com

Visit the ROAD Programming Website for more programming help.

Share this post


Link to post
Share on other sites
A hashcode is a way to deterministically assigning a number to a given piece of data so it can be indexed in a hashtable.
This is used in dictionaries for instance, where it''s used to look up words in a long list of entries.

A hash-table is a combination of an array, and linked lists. The array is indexed by the list-elements hash-codes, and the list contains the actual data. That way, you have separate little "boxes" for data with the same hashcode, making it faster to use than a plain linked-list, but not very much more complex.


#pragma DWIM // Do What I Mean!
~ Mad Keith ~
**I use Software Mode**

Share this post


Link to post
Share on other sites
A Hashtable isn''t necessarily a combination of an array and linked lists. That''s only the case for open hashing. Closed hashing schemes only need an array. (But suffer from other disadvantages.)

The sum of the ascii codes of the individual characters is not necessarily a good way to calculate hash values. It tends to clump, strings with the same length to similar hash values. A better method is to compute h mod m, where h is obtained iteratively by starting with 0, multiplying the old value by a and adding in the next character. a = 65599 is a good value, as is any largish prime.

Another nice method comes from P.J. Weinberger''s C compiler:

int hashpjw(char * s) {
char *p;
unsigned int h = 0, g;
for (p = s; *p != ''\0''; p++) {
h = (h << 4) + *p;
if (g = h & 0xf0000000) {
h = h ^ (g >> 24);
h = h ^ g;
}
}
}

Though it''s effectiveness goes down significantly as hash table size approaches 2^24. For most reasonable hash table sizes (say one trillion entries and below), it performs quite well.

Share this post


Link to post
Share on other sites
what''s wrong with similar hash values? as long as the collisions are kept to a minimum.. and with closed hashing, collision resolution for lookups is pretty fast..

-goltrpoat



--
Float like a butterfly, bite like a crocodile.

Share this post


Link to post
Share on other sites
I think he was just trying to say that your hashcode should lead to a more or less even distribution across the different hash buckets.

And I completely forgot about closed hashing! Damn was it that long ago that I took that class? :-)


#pragma DWIM // Do What I Mean!
~ Mad Keith ~
**I use Software Mode**

Share this post


Link to post
Share on other sites
quote:
Original post by goltrpoat
what''s wrong with similar hash values? as long as the collisions are kept to a minimum.. and with closed hashing, collision resolution for lookups is pretty fast..


Um, similar hash values lead to increased collisions.

Let''s say I have a program that generates code for me, and it uses stylized symbol names so that it won''t collide with my symbols. (ex: lex, yacc). So it creates symbols like _Vt_XXXX where XXXX is a four digit number. So one symbol might look like _Vt_3333 and another might look like _Vt_1335. Which hash to the same location when just adding the ASCII values. So does _Vt_0345, _Vt_0237, _Vt_0048, _Vt_0129, etc, etc, etc. And _Vt_0346 will hash right next to all those values, so we can throw out linear probing right off the bat.

And also the relative variance on the hashcodes goes down as the number of character''s increase. That is to say that there''s a relatively greater chance that two strings will hash to the same value the longer they are. This is counter-intuitive to the nature of hashing. Longer strings give more hashing information, so should lead to more different hash values. Practically, this is bad because the longer the strings are, the more computation is necessary to resolve that the two strings are different, even though they hash to the same value.

Share this post


Link to post
Share on other sites
The old-school hash for strings was to form a polynomial from the letters:
c a t = c * 26^2, + a * 26 + t

Of course, it's much easier to multiply by 32 in a binary system (it becomes a left-shift by 5), and you can use that funky math rule to find the sum this way:

char *sp = string;
unsigned long hash = 0;
while (sp)
{
hash <<= 5;
hash += *sp++;
}


Step into a CArray of CStrings and you'll see that this is the way Mickeysoft still does it.

Edited by - Stoffel on 5/1/00 3:47:17 PM

Share this post


Link to post
Share on other sites
There is also another very quick and efficent string hashing algorithm that I used in a recent dictionary program that I had to write for my CS70 class. Here is the well commented code.


/*
* Hash a string using the modified CRC method. The basic idea of
* this function is that the hash value is rotated left by 5 bits and
* then the next character is exclusive-or''ed in to the hash value.
* The implementation is complicated by the lact of a rotate operation
* in C++.
*
* This hash function does not work well with table sizes that are a
* power of two.
*/
unsigned int hashStringCRC( // Hash a string
const char* key) // Key to be hashed
{
unsigned int hashValue = 0;
while (*key != ''\0'')
{
/*
* The following expression could be done in one line, but it
* would be really nasty, and a modern compiler ought to
* generate the same code whether it''s one line or several.
* So we''ll break it up to make it easier to read.
*
* First, we shift the value left to make room for bits from
* the new key character.
*/
unsigned int leftShiftedValue = hashValue << CRC_HASH_SHIFT;
/*
* Shifting left lost the top bits, so we have to extract and
* position them separately with a right shift. If we were
* writing in assembly, we could do all of this in a single
* rotate instruction, but C++ doesn''t give us access to that
* machine operation so we have to do it the hard way.
*/
unsigned int rightShiftedValue =
hashValue >> (WORD_WIDTH - CRC_HASH_SHIFT);
/*
* Put the shifted values together, and then XOR them with the
* next key character (stepping past it in the process).
*/
hashValue = (leftShiftedValue / rightShiftedValue) ^ *key++;
}
return hashValue;
}


Check out the GPI project today!

Share this post


Link to post
Share on other sites

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!