fast hash map

Started by
33 comments, last by polygon7 16 years, 3 months ago
I'm experimenting with my own implementation of a hash map... I'm not worried about memory size, and the key will always be a string... insertion speed is also not important... what's important for me is the "find" performance. So I thought that instead of creating one big list and doing a linear or binary search, I'd create several smaller lists and generate an index from the key to pick a list... this way I can split, for example, a 200-item list into 10 lists with 20 items each... then all I have to do is calculate the list index and do a linear search over 20 items instead of 200... something like this:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

#define CMIN 65  // 'A'
#define CMAX 122 // 'z'
#define CRANGE (CMAX-CMIN)

namespace DX
{
	template <typename Data>
	class fast_hash_map
	{
	private:
		typedef std::pair<std::string, Data> Hash;
		std::vector<Hash> m_Hash_Map[CRANGE+2]; // create a list for each character, set first and last as wildcard holders

	public:

		Data insert(std::string name, Data value)
		{
			m_Hash_Map[GetIndex(name)].push_back(Hash(name, value));		

			return value;
		}

		Data find(std::string key)
		{
			std::vector<Hash>* CurHashMap = &m_Hash_Map[GetIndex(key)];

			for (std::size_t n = 0; n < CurHashMap->size(); n++)
			{
				if (CurHashMap->at(n).first == key)
					return CurHashMap->at(n).second;
			}
			return 0;
		}

		inline int GetIndex(std::string name)
		{
			int last = int(name.at(name.size()-1));
			int rangeindex = ((last >> 1) + (last >> 2)) >> 1; // mix two shifted copies of the last character to spread the index

			return std::max(std::min(rangeindex-CMIN+1, CRANGE), 0); // clamp so out-of-range keys land in the first or last bucket
		}
	};
}

However, when I compare the performance against the standard stdext::hash_map... I get results where my implementation is slower... why is that?
map's find performance is O(log N); your class's performance is O(N) divided by 10. For small values of N, your class is faster, but as N increases, your performance will get slower and slower. Read about big-O notation and complexity analysis.
2+2=5, for big values of 2!
First of all, std::hash_map is usually implemented as a balanced tree (e.g. a red-black-tree) which has a worst-case complexity of something around O(log(n)) for the find operation. Your implementation however has a worst-case complexity of O(n).

Your code is basically an implementation of a chaining hash table. In the best case, these hash-tables can outperform balanced trees by quite a bit. However, this all depends on how well your hashing function works (i.e. how many collisions it will produce for a given set of keys).

Also, are you testing the performance in debug or release mode? Where do you lose performance? Is it because the algorithm itself is slow, or is it because a lot of copying is going on?

Also note that the at() method is bounds-checked and will therefore be slower than the subscript operator. Since the way you access the vector is safe to begin with, the extra bounds-checking is unnecessary.
It's in release mode... I'll try using subscripts.
Quote:Original post by Harry Hunt
First of all, std::hash_map is usually implemented as a balanced tree (e.g. a red-black-tree) which has a worst-case complexity of something around O(log(n)) for the find operation. Your implementation however has a worst-case complexity of O(n).

Say what? The hash part of the type name indicates it's a hash table; that's the entire point. If you want a balanced tree, you can go with std::map.
Assuming one map has 200 elements.

Log(n) — that is, log2(200) ≈ 7.6 — tries on average to find the element.
Splitting the items into buckets of 20 will take 10 tries on average, plus selection of the bucket, but that's O(1).
Hash_map should perform considerably better than either of those.

Also, hash_map was written and designed by professional software engineers. If that performance is inadequate, the only problem is your hash function, not the structure.


Data find(std::string key)
You make a redundant copy of string each time you call find.
You also return a copy of Data on each call.

Data insert(std::string name, Data value)
First you make a copy of both the string and Data, then GetIndex makes another copy of the string, then Data is copied again on insertion, then Data is copied again when returned from the function.

Inlining may remove some of those copies, but still, that's redundancy that's completely unacceptable for such structures. Pass everything by reference, strings by const reference.
Your hash seems bad too.

Testing on a bunch of strings (only lowercase and uppercase a-z), it seems to continuously return negatives for rangeindex-CMIN+1, which makes it always output 0, turning your search into a linear one with no "sorting" done.
Why would it give me 0?

The char -> int conversion starts at 65 = 'A',

so I subtract 65 from every conversion...

How do you calculate how many tries it takes? I mean, log(200) != 7.6

EDIT::

Thanks for all the suggestions!

It's just a bit faster than stdext::hash_map in my application with 30 items:

13.6ms vs 13.2ms
Quote:Original post by Dragon_Strike
why would it give me 0?

the char -> int conversion starts at 65 = A

soo i subtract 65 from every conversion...


rangeIndex has a *far* more complex operation than subtracting 65 from the source.

A sample hash is:
int GetIndex( const std::string &str )
{
    return str.front() % CRANGE;
}


Given your class (fixed so it would compile for me), with a print statement in the insert() function:
int main()
{
    DX::fast_hash_map<std::string> map;
    for( char i = 'a' ; i <= 'z' ; ++i )
    {
        std::string str;
        str += i;
        map.insert(str, str);
        str[0] = toupper(i);
        map.insert(str, str);
    }
}


It consistently outputs:
Quote:
inserting into bucket: 0
OK...

This is how it looks now. Performance varies depending on Range and Spread:

	// By Dragon_Strike
	template <typename Data, int Range, int Spread>
	class fast_hash_map
	{
	private:
		typedef std::pair<std::string, Data> Hash;
		std::vector<Hash> m_Hash_Map[Range]; // one vector per bucket

	public:
		inline Data insert(const std::string& name, const Data value)
		{
			m_Hash_Map[GetIndex(name)].push_back(Hash(name, value));
			return value;
		}

		inline Data* find(const std::string& key)
		{
			int index = GetIndex(key);
			for (std::size_t n = 0; n < m_Hash_Map[index].size(); n++)
			{
				if (m_Hash_Map[index][n].first == key)
					return &m_Hash_Map[index][n].second;
			}
			return 0;
		}

		inline int GetIndex(const std::string& name)
		{
			int index = 0;
			for (int n = 0; n < Spread; n++)
			{
				index += int(name[(name.size()-1)>>n]);
			}
			return index % Range;
		}
	};


EDIT::

Now I get 12ms vs 13.3ms... mine is faster... but my benchmarks aren't very accurate... anyone willing to do a better test?

What more can I do to make it faster?

EDIT2::

The way I see it, the larger I set the range (given a good spread), the faster it gets... in the best case every item gets its own index, which means one index calculation and one comparison... that should be faster than O(log(n))??

[Edited by - Dragon_Strike on January 6, 2008 10:28:36 AM]

This topic is closed to new replies.
