fast hash map

This topic is 3695 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

Recommended Posts

im experimenting with my own implementation of a hash map... im not worried about memory size and the key will always be a string... insertion speed is also not imoprtant... whats important for me is the "find" performance so i thought that instead of creating one big list and do linear or binary search... i create several smaller list and generate an index from the key for the list... this way i can split for example an 200 item list into 10 list with 20 items each... then all i have to do is to calculate the list index and to a linear search for 20 items instead of 200... something like this
#define CMIN 65
#define CMAX 122
#define CRANGE (CMAX-CMIN)

namespace DX
{
template <typename Data>
class fast_hash_map
{
private:
typedef std::pair<std::string, Data> Hash;
std::vector<Hash> m_Hash_Map[CRANGE+2]; // create a list for each character, set first and last as wildcard holders

public:

Data insert(std::string name, Data value)
{
m_Hash_Map[GetIndex(name)].push_back(Hash(name, value));

return value;
}

Data find(std::string key)
{
std::vector<Hash>* CurHashMap = &m_Hash_Map[GetIndex(key)];

for (UINT n = 0; n < CurHashMap->size(); n++)
{
if (CurHashMap->at(n).first == key)
return CurHashMap->at(n).second;
}
return 0;
}

inline int GetIndex(std::string name)
{
int rangeindex = int((int(name.at(name.size()-1)>>1) +
int(name.at(name.size()-1)>>2))>>1; // Take two characters when creating index to spread

return max(min(rangeindex-CMIN+1, CRANGE),0); // make sure index is within range, if not place in first or last
}
};
}


however when i compare the performance against the standard extstd::hash_map... i get results where my implementaion is slower... why is that?

Share on other sites
map's find performance is O(lgN), your class performance is O(N) divided by 10. For small values of N, your class is faster, but as N increases, your performance will be slower and slower. Read about big O notation and complexity analysis.

Share on other sites
First of all, std::hash_map is usually implemented as a balanced tree (e.g. a red-black-tree) which has a worst-case complexity of something around O(log(n)) for the find operation. Your implementation however has a worst-case complexity of O(n).

Your code is basically an implementation of a chaining hash table. In the best case, these hash-tables can outperform balanced trees by quite a bit. However, this all depends on how well your hashing function works (i.e. how many collisions it will produce for a given set of keys).

Also, are you testing the performance in debug or release mode? Where do you lose performance? Is it because the algorithm itself is slow or is it because a lot of copying is going on.

Also note that the at() method is bounds-checked and will therefore be slower than the subscript operator. Since the way you access the vector is safe to begin with, the extra bounds-checking is unnecessary.

Share on other sites
its in release mode... ill try using subscripts

Share on other sites
Quote:
 Original post by Harry HuntFirst of all, std::hash_map is usually implemented as a balanced tree (e.g. a red-black-tree) which has a worst-case complexity of something around O(log(n)) for the find operation. Your implementation however has a worst-case complexity of O(n).

Say what? The hash part of the type name indicates it's a hash table; that's the entire point. If you want a balanced tree, you can go with std::map.

Share on other sites
Assuming one map has 200 elements.

Log(n) will take 7.6 tries on average to find the element.
Splitting the item into buckets of 20 will take 10 tries on average + selection of bucket, but that's O(1).
Hash_map should perform considerably better than either of those.

Also, hash_map was written and designed by professional software engineers. If that performance is inadequate, the only problem is your hash function, not the structure.

Data find(std::string key)
You make a redundant copy of string each time you call find.
You also return a copy of Data on each call.

Data insert(std::string name, Data value)
First you make a copy of both string and Data, then GetIndex makes a copy of Data, then Data is copied again on insertion, then Data is copied again when returned from function.

Inlining may remove some of those copies, but still, that's redundancy that's completely unacceptable for such structures. Pass everything by reference, strings by const reference.

Share on other sites

Testing on a bunch of strings (only lower and uppercase a-z), it seems to continously return negatives for rangeindex-CMIN+1, which makes it always output 0, making your search a linear one with no "sorting" done.

Share on other sites
why would it give me 0?

the char -> int conversion starts at 65 = A

soo i subtract 65 from every conversion...

how do u calculate how many tries it takes? i mean log(200) != 7.6

EDIT::

thx for all the suggestions!

its just a bit faster than extstd::hash_map in my application with 30 items

13.6ms vs 13.2ms

Share on other sites
Quote:
 Original post by Dragon_Strikewhy would it give me 0?the char -> int conversion starts at 65 = Asoo i subtract 65 from every conversion...

rangeIndex has a *far* more complex operation than subtracting 65 from the source.

A sample hash is:
int GetIndex( const std::string &str ){   return str.front() % CRANGE;}

Given your class (fixed so it would compile for me), with a print statement in the insert() function:
int main(){    DX::fast_hash_map<std::string> map;    for( char i = 'a' ; i <= 'z' ; ++i )    {        std::string str;        str += i;        map.insert(str,str);        str[0] = toupper(i);        map.insert(str,str);    }}

It consistently outputs:
Quote:
 inserting into bucket: 0

Share on other sites
ok...

this is how it looks now then... performance varies depending on "Range" and "Spread"

        // By Dragon_Strike	template <typename Data, int Range, int Spread>	class fast_hash_map	{	private:		typedef std::pair<std::string, Data> Hash;		std::vector<Hash> m_Hash_Map[Range]; // create a list for each character, set first and last as wildcard holders	public:		inline Data insert(const std::string& name, const Data value)		{			m_Hash_Map[GetIndex(name)].push_back(Hash(name, value));					return value;		}		inline Data* find(const std::string& key)		{			int index = GetIndex(key);			for (UINT n = 0; n < m_Hash_Map[index].size(); n++)			{				if (m_Hash_Map[index][n].first == key)					return &m_Hash_Map[index][n].second;			}			return 0;		}		inline int GetIndex(const std::string& name)		{						 int index = 0;			 for (int n = 0; n < Spread; n++)			 {				index += int(name[(name.size()-1)>>n]);			 }			 return index % Range;		}	};

EDIT::

now i get 12ms vs 13.3ms... mine is faster... but my benchmarks aren't very accurate... anyone willing to do a better test?

what more can i do to make it faster?

EDIT2::

the way i see it... the larger i set the range and find a good spread the faster it gets... in best case every item gets its own index... which gives one index calculation and one search... that should be faster than O(log(n))??

[Edited by - Dragon_Strike on January 6, 2008 10:28:36 AM]

Share on other sites
Strings are the interesting case when it comes to containers indexed by them.
Depending on various factors, all sorts of different data structures can come out fastest.
For example, if the strings are quite long in length and don't tend to have common prefixes (unlike directory paths for example) then a binary search can be significantly faster due to the fact that the string compares stop upon reaching the first non-matching char, and even repeating that log(n) times can be quicker than hashing a long string.

Another really fast method for longer strings is to use multi-way tries.

Hashing is the obvious choice when the strings aren't very long though, and there are more than about 10 of them. You need a good and fast hash though.
I always recommend and use a CRC function, with which you practically can't go wrong. It's blazingly fast, very few lines of code, and produces extremely good results.[cool]

Why is it that your hash function ONLY looks at the last character in the string? That's absolutely awful![sick]

[Edited by - iMalc on January 5, 2008 10:00:47 PM]

Share on other sites
Quote:
 now i get 12ms vs 13.3ms... mine is faster... but my benchmarks aren't very accurate... anyone willing to do a better test?what more can i do to make it faster?

The reason you're faster right now is because unlike hash_map, you're not hashing entire string, but only one character.

Your current improvement comes from simple constant factor.

There is no better test. Hash maps are very sensitive to type and amount of data. While it's somewhat possible to theoretically analyze them, it's mostly about hash functions.

Quote:
 Another really fast method for longer strings is to use multi-way tries

This would be "fastest" possible way if strings must truly be arbitrary. If I absolutely had no choice, this is what I'd go with. If strings however are known in advance, a 1-to-1 mapping using ints will result in only a few cycles per lookup, or, depending on compiler, might be completely evaluated in advance.

Quote:
 Why is it that your hash function ONLY looks at the last character in the string? That's absolutely awful!

Might not be. Depending on actual strings, if they're actual words, scanning from back will likely yield unique/diverse values after reading less characters than scanning from start.

But as always, it's impossible to say without actual dataset.

Share on other sites
Quote:
Original post by Antheus
Quote:
 Why is it that your hash function ONLY looks at the last character in the string? That's absolutely awful!

Might not be. Depending on actual strings, if they're actual words, scanning from back will likely yield unique/diverse values after reading less characters than scanning from start.

But as always, it's impossible to say without actual dataset.
No really it is that awful! I know there's a loop, but he's not scanning the string as such. The code literally only calculates a hash based on reading the very last character on the string repeatedly.
For starters he'll never exceed 256 different hash values, though with text he'll most likely never get anywhere near half that value.

Share on other sites
Quote:
 Original post by iMalcStrings are the interesting case when it comes to containers indexed by them.Depending on various factors, all sorts of different data structures can come out fastest.For example, if the strings are quite long in length and don't tend to have common prefixes (unlike directory paths for example) then a binary search can be significantly faster due to the fact that the string compares stop upon reaching the first non-matching char, and even repeating that log(n) times can be quicker than hashing a long string.Another really fast method for longer strings is to use multi-way tries.Hashing is the obvious choice when the strings aren't very long though, and there are more than about 10 of them. You need a good and fast hash though.I always recommend and use a CRC function, with which you practically can't go wrong. It's blazingly fast, very few lines of code, and produces extremely good results.[cool]Check out this page for more info.Why is it that your hash function ONLY looks at the last character in the string? That's absolutely awful![sick]

its not... its a setting u can set "Spread" to a larger value and it will take more characters...

in what way would CRC be better?

soo... is it good or bad?

Share on other sites
Quote:
Original post by Dragon_Strike
Quote:
 Original post by iMalcStrings are the interesting case when it comes to containers indexed by them.Depending on various factors, all sorts of different data structures can come out fastest.For example, if the strings are quite long in length and don't tend to have common prefixes (unlike directory paths for example) then a binary search can be significantly faster due to the fact that the string compares stop upon reaching the first non-matching char, and even repeating that log(n) times can be quicker than hashing a long string.Another really fast method for longer strings is to use multi-way tries.Hashing is the obvious choice when the strings aren't very long though, and there are more than about 10 of them. You need a good and fast hash though.I always recommend and use a CRC function, with which you practically can't go wrong. It's blazingly fast, very few lines of code, and produces extremely good results.[cool]Check out this page for more info.Why is it that your hash function ONLY looks at the last character in the string? That's absolutely awful![sick]

its not... its a setting u can set "Spread" to a larger value and it will take more characters...

No it won't. In the GetIndex() function, you only ever take the last character.

Share on other sites
I am not sure you understand the bit shift operator. I can't see why you are using it the way you are.

Share on other sites
Quote:
 Original post by Dragon_Strikeim experimenting with my own implementation of a hash map... im not worried about memory size and the key will always be a string... insertion speed is also not imoprtant... whats important for me is the "find" performance

In that case, you might be interested in a trie instead of a hash table, as a trie provides better performance for searching strings than binary search trees or imperfect hash tables.

Share on other sites
Quote:
 No it won't. In the GetIndex() function, you only ever take the last character.

yea... i had a typo there.. fixed now...

index += int(name[(name.size()-1)>>n]);

Share on other sites
Quote:
Original post by Dragon_Strike
Quote:
 No it won't. In the GetIndex() function, you only ever take the last character.

yea... i had a typo there.. fixed now...

index += int(name[(name.size()-1)>>n]);

Tell me what you think that line does. In detail please, with an example string perhaps.

Share on other sites

01234
abcde:

[(5-1)] = e
[(5-1)/2] = c
[(5-1)/4] = b

index = int(e)+int(c)+int(b)

01234
abfde:

[(5-1)] = e
[(5-1)/2] = f
[(5-1)/4] = b

index = int(e)+int(f)+int(b)

index(abcde) != index(abfde)

Share on other sites
Okay, now that you've edited your post which makes it look as if I'm wrong when I wasn't. I'm going to post here what what it had looked like:
		inline int GetIndex(const std::string& name)		{						 int index = 0;			 for (int n = 0; n < Spread; n++)			 {				index += int(name[(name.size()-1)])>>n;			 }			 return index % Range;		}
In future, please don't confuse people by editing your posts in that manner. The thread should make sense to others reading it in future.

What you actually have at this moment is rather unique. An attempt at log(k) hashing time where k is the string length. Are the strings really so long?
Collisions normally slow hashed containers down more than the hashing itself.

Share on other sites
My apologies. I was still thinking of your earlier attempt "int(name.at(name.size()-1)>>1)" when I wrote that. I see now it picks different characters.

Still it seems convoluted to me. Given that we want to pick N characters of a string S of length L, I would use:
int value = 0;int skip = L / N;for( int i = 0 ; i < L ; i += skip ){   value += S;}return value % Range;

Which chooses at characters at regular intervals. I think your one will be biased for large values of N (AKA Spread). I.E. with a large spread value you eventually start dividing the strings length by something close to it, resulting in the first two or so characters repeating a lot.

I'm not even sure such a method is particularly good for hashing.

Share on other sites
Quote:
 Okay, now that you've edited your post which makes it look as if I'm wrong when I [/source]In future, please don't confuse people by editing your posts in that manner. The thread should make sense to others reading it in future.What you actually have at this moment is rather unique. An attempt at log(k) hashing time where k is the string length. Are the strings really so long?Collisions normally slow hashed containers down more than the hashing itself.

my apologies... i wrote "yea... i had a typo there.. fixed now..."...

i didnt want to repost everything again so i just edited the post...

im sry... rly didnt mean to make it look like your wrong..

Share on other sites
Quote:
 Original post by rip-offMy apologies. I was still thinking of your earlier attempt "int(name.at(name.size()-1)>>1)" when I wrote that. I see now it picks different characters.Still it seems convoluted to me. Given that we want to pick N characters of a string S of length L, I would use:*** Source Snippet Removed ***Which chooses at characters at regular intervals. I think your one will be biased for large values of N (AKA Spread). I.E. with a large spread value you eventually start dividing the strings length by something close to it, resulting in the first two or so characters repeating a lot.I'm not even sure such a method is particularly good for hashing.

yea your right... i was only thinking of rather small values for "spread" (1-3)... the reason i wanted to use shift is because its only one instruction... in comparison to a division...

EDIT:: now that i think about it should be enough to do that instruction once... which makes things alot faster

			 int index = int(name[0]);			 for (int n = skip; n < name.size(); n+=skip)			 {				index += int(name[n]);			 }			 return index % Range;

EDIT2::

and the mod function i could replace with ANDing a mask? and setting range = pow(2,n)

return (index & mask);

EDIT3::

anything more u can think of?

	template <typename Data, int BitRange, int Skip>	class fast_hash_map	{	private:		typedef std::pair<std::string, Data> Hash;		typedef std::vector<Hash> HashMap;		boost::scoped_array<HashMap> m_Hash_Map; // create a list for each character	public:		int mask;				fast_hash_map()		{						int Range = int(powf(2.0f, BitRange));			m_Hash_Map.reset(new HashMap[Range]);			mask = Range-1;		}		inline Data insert(const std::string& name, const Data value)		{			m_Hash_Map[GetIndex(name)].push_back(Hash(name, value));					return value;		}		inline Data* find(const std::string& key)		{			int index = GetIndex(key);			for (UINT n = 0; n < m_Hash_Map[index].size(); n++)			{				if (m_Hash_Map[index][n].first == key)					return &m_Hash_Map[index][n].second;			}			return NULL;		}		inline int GetIndex(const std::string& name)		{						 int index = int(name[0]);			 for (UINT n = Skip; n < name.size(); n+=Skip)			 {				index += int(name[n]);			 }			 return (index & mask);		}	};

[Edited by - Dragon_Strike on January 6, 2008 1:31:20 PM]

Share on other sites
Have you tried using extstd::hash_map with your new hash function? Usually hash tables have a way for you to specify the hash function they use. It seenms like most of your efforts are going into making a hash function that works better for your application rather than improving the hash table structure itself. Using custom hash functions is common since the effectiveness of a hash function can depend on the data it is expected to hash. A general purpose string hashing function tends to be wasteful when the keys consist entirely of long strings that only differ in a few places, for example.