fast hash map

Started by
33 comments, last by polygon7 16 years, 3 months ago
Strings are the interesting case when it comes to containers indexed by them.
Depending on various factors, all sorts of different data structures can come out fastest.
For example, if the strings are quite long in length and don't tend to have common prefixes (unlike directory paths for example) then a binary search can be significantly faster due to the fact that the string compares stop upon reaching the first non-matching char, and even repeating that log(n) times can be quicker than hashing a long string.

Another really fast method for longer strings is to use multi-way tries.

Hashing is the obvious choice when the strings aren't very long though, and there are more than about 10 of them. You need a good and fast hash though.
I always recommend and use a CRC function, with which you practically can't go wrong. It's blazingly fast, very few lines of code, and produces extremely good results.[cool]
Check out this page for more info.

Why is it that your hash function ONLY looks at the last character in the string? That's absolutely awful![sick]

[Edited by - iMalc on January 5, 2008 10:00:47 PM]
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Advertisement
Quote:now i get 12ms vs 13.3ms... mine is faster... but my benchmarks aren't very accurate... anyone willing to do a better test?

what more can i do to make it faster?


The reason you're faster right now is because unlike hash_map, you're not hashing entire string, but only one character.

Your current improvement comes from simple constant factor.

There is no better test. Hash maps are very sensitive to type and amount of data. While it's somewhat possible to theoretically analyze them, it's mostly about hash functions.

Quote:Another really fast method for longer strings is to use multi-way tries


This would be "fastest" possible way if strings must truly be arbitrary. If I absolutely had no choice, this is what I'd go with. If strings however are known in advance, a 1-to-1 mapping using ints will result in only a few cycles per lookup, or, depending on compiler, might be completely evaluated in advance.

Quote:Why is it that your hash function ONLY looks at the last character in the string? That's absolutely awful!


Might not be. Depending on actual strings, if they're actual words, scanning from back will likely yield unique/diverse values after reading less characters than scanning from start.

But as always, it's impossible to say without actual dataset.
Quote:Original post by Antheus
Quote:Why is it that your hash function ONLY looks at the last character in the string? That's absolutely awful!


Might not be. Depending on actual strings, if they're actual words, scanning from back will likely yield unique/diverse values after reading less characters than scanning from start.

But as always, it's impossible to say without actual dataset.
No really it is that awful! I know there's a loop, but he's not scanning the string as such. The code literally only calculates a hash based on reading the very last character on the string repeatedly.
For starters he'll never exceed 256 different hash values, though with text he'll most likely never get anywhere near half that value.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Quote:Original post by iMalc
Strings are the interesting case when it comes to containers indexed by them.
Depending on various factors, all sorts of different data structures can come out fastest.
For example, if the strings are quite long in length and don't tend to have common prefixes (unlike directory paths for example) then a binary search can be significantly faster due to the fact that the string compares stop upon reaching the first non-matching char, and even repeating that log(n) times can be quicker than hashing a long string.

Another really fast method for longer strings is to use multi-way tries.

Hashing is the obvious choice when the strings aren't very long though, and there are more than about 10 of them. You need a good and fast hash though.
I always recommend and use a CRC function, with which you practically can't go wrong. It's blazingly fast, very few lines of code, and produces extremely good results.[cool]
Check out this page for more info.

Why is it that your hash function ONLY looks at the last character in the string? That's absolutely awful![sick]


its not... its a setting u can set "Spread" to a larger value and it will take more characters...

in what way would CRC be better?

soo... is it good or bad?
Quote:Original post by Dragon_Strike
Quote:Original post by iMalc
Strings are the interesting case when it comes to containers indexed by them.
Depending on various factors, all sorts of different data structures can come out fastest.
For example, if the strings are quite long in length and don't tend to have common prefixes (unlike directory paths for example) then a binary search can be significantly faster due to the fact that the string compares stop upon reaching the first non-matching char, and even repeating that log(n) times can be quicker than hashing a long string.

Another really fast method for longer strings is to use multi-way tries.

Hashing is the obvious choice when the strings aren't very long though, and there are more than about 10 of them. You need a good and fast hash though.
I always recommend and use a CRC function, with which you practically can't go wrong. It's blazingly fast, very few lines of code, and produces extremely good results.[cool]
Check out this page for more info.

Why is it that your hash function ONLY looks at the last character in the string? That's absolutely awful![sick]


its not... its a setting u can set "Spread" to a larger value and it will take more characters...

No it won't. In the GetIndex() function, you only ever take the last character.
NextWar: The Quest for Earth available now for Windows Phone 7.
I am not sure you understand the bit shift operator. I can't see why you are using it the way you are.
Quote:Original post by Dragon_Strike
im experimenting with my own implementation of a hash map... im not worried about memory size and the key will always be a string... insertion speed is also not imoprtant... whats important for me is the "find" performance


In that case, you might be interested in a trie instead of a hash table, as a trie provides better performance for searching strings than binary search trees or imperfect hash tables.
Quote:
No it won't. In the GetIndex() function, you only ever take the last character.


yea... i had a typo there.. fixed now...

index += int(name[(name.size()-1)>>n]);
Quote:Original post by Dragon_Strike
Quote:
No it won't. In the GetIndex() function, you only ever take the last character.


yea... i had a typo there.. fixed now...

index += int(name[(name.size()-1)>>n]);


Tell me what you think that line does. In detail please, with an example string perhaps.
spread = 3

01234
abcde:

[(5-1)] = e
[(5-1)/2] = c
[(5-1)/4] = b

index = int(e)+int(c)+int(b)

01234
abfde:

[(5-1)] = e
[(5-1)/2] = f
[(5-1)/4] = b

index = int(e)+int(f)+int(b)

index(abcde) != index(abfde)

This topic is closed to new replies.

Advertisement