Sign in to follow this  
bpoint

[C++] resolving collisions with hash_map

Recommended Posts

bpoint    464
This is my first time working with stdext::hash_map, so please be gentle. :) Basically, I'm writing some code to remove duplicate vertices from a vertex buffer. Before adding a vertex to the buffer, I first check to see if it exists in the hash_map or not:
int MyIndexer::locateVertex(const MyVec &vec)
{
	// locate the vertex in our hash map
	stdext::hash_map<MyVec, int>::iterator iter = mVertexMap.find(vec);

	// if iterator points to end of map, then the vertex was not found
	if (iter == mVertexMap.end())
		return INDEX_NOT_FOUND;

	// check if returned index matches the data we are looking for
	if (mVtxBuf[iter->second] != vec)
	{
		LOG("hash map collision");

		// ...add resolve code here...
	}

	// return index to vertex
	return iter->second;
}
The hash function I'm using to process the actual vertex data uses stdext::_Hash_value() -- the same function used for hashing strings. It seems to do quite a good job, considering one data set I'm dealing with which has 5.9 million vertices only came up with 137 collisions. Now, to my actual question: how do I actually resolve a collision? I thought that perhaps the iterator returned by find() would give me a set of pairs that all had the same key, but that didn't seem to be it. At worst case, I could simply do a linear search through the entire vertex buffer from start to end -- and with that only happening 137 times, it doesn't seem like it would be too bad -- but I'm sure there has to a proper way of doing it. I've scoured Google, read through the (poor) MSDN documentation, and searched GameDev's forums as much as I could, but I simply cannot figure out what to do from here. Any help would be greatly appreciated!

Share this post


Link to post
Share on other sites
bubu LV    1436
find() gives you iterator to exactly your element (or end() if none is found). Collisions are already resolved inside find.
hash_map requires you to define not only hashing operator, but also == operator. That's why find() can find exactly your element - first with hashing, and then with == operator to resolve collisions.

Share this post


Link to post
Share on other sites
bpoint    464
If what you say is true, then the if() check on the vertex data should never fail -- but it does, and that's how I managed to count (what I thought was) the number of collisions.

As a test, I stuck a breakpoint on MyVec::operator== before calling find() and stepped over it in the debugger, but my operator==() function seemed to never be called from hash_map::find().

So... am I doing something completely wrong here?

Share this post


Link to post
Share on other sites
bpoint    464
I added some more debug info to my function, and now I'm absolutely sure hash_map::find() is not returning an iterator to the proper data on a collision:
	// check if returned index matches the data we are looking for
const MyVec &vbufvec = mVtxBuf[iter->second];
if (vbufvec != vec)
{
LOG("hash map collision: {%.2f,%.2f,%.2f [%08Xh]} / {%.2f,%.2f,%.2f [%08Xh]}", vbufvec.x, vbufvec.y, vbufvec.z, (size_t)(vbufvec), vec.x, vec.y, vec.z, (size_t)(vec));
}



This is what I'm getting from the logger:
[MyIndexer::locateVertex]: hash map collision: {4784.40,0.00,-3639.60 [D9CFAF33h]} / {4720.20,0.00,-3102.60 [D9CFAF33h]}


The vertex data is different, but the hash value for them is exactly the same -- this is definitely a collision.

I'm using MSVC2005's implementation. Is there a bug here, perhaps?

Share this post


Link to post
Share on other sites
dmail    116
Quote:

Basically, I'm writing some code to remove duplicate vertices from a vertex buffer. Before adding a vertex to the buffer, I first check to see if it exists in the hash_map or not:

There is a flaw in this that you may not have taken into consideration, which is floating point comparison. The compare operation is probably done using something like
if(!(a < b) && !(b <a)) found
but floating point comparison is not exact, in addition it is considered bad practice for a type to have a comparison/less than operation just for the sake of a container. Is there a meaningful less than operation for a vector?

Share this post


Link to post
Share on other sites
Sneftel    1788
Quote:
Original post by dmail
in addition it is considered bad practice for a type to have a comparison/less than operation just for the sake of a container.
Maybe. If he cares about that, he can define his own Traits type, but for purposes of discussion, operator< is probably clearest.
Quote:
Is there a meaningful less than operation for a vector?
If you're using it simply to define a partial order over values, yes. It looks like:

if(a.x < b.x) return true;
if(a.x > b.x) return false;
if(a.y < b.y) return true;
if(a.y > b.y) return false;
if(a.z < b.z) return true;
if(a.z > b.z) return false;
return false;

Share this post


Link to post
Share on other sites
dmail    116
Quote:
Original post by Sneftel
Quote:
Original post by dmail
in addition it is considered bad practice for a type to have a comparison/less than operation just for the sake of a container.
Maybe. If he cares about that, he can define his own Traits type, but for purposes of discussion, operator< is probably clearest.
Quote:
Is there a meaningful less than operation for a vector?
If you're using it simply to define a partial order over values, yes. It looks like:

if(a.x < b.x) return true;
if(a.x > b.x) return false;
if(a.y < b.y) return true;
if(a.y > b.y) return false;
if(a.z < b.z) return true;
if(a.z > b.z) return false;
return false;


Sneftel, yes I know how to create a sorting based on the individual components of the vector and if I were to write one it would not look like what you suggest for the point which you have skipped; floating point comparison i.e. the precision of floating points.

Share this post


Link to post
Share on other sites
alvaro    21246
There is nothing wrong with comparing floating-point numbers, except perhaps if you run into NaNs.

Maybe the OP can try to prune the program down to a 10 or 20 line version that reproduce the problem, so we can be sure we have seen all the code we need...

Share this post


Link to post
Share on other sites
jyk    2094
Quote:
Original post by dmail
Quote:
Original post by Sneftel

if(a.x < b.x) return true;
if(a.x > b.x) return false;
if(a.y < b.y) return true;
if(a.y > b.y) return false;
if(a.z < b.z) return true;
if(a.z > b.z) return false;
return false;


Sneftel, yes I know how to create a sorting based on the individual components of the vector and if I were to write one it would not look like what you suggest for the point which you have skipped; floating point comparison i.e. the precision of floating points.
The code that Sneftel posted is fine. What is required in this case is a strict weak ordering, which is exactly what the code implements.

Share this post


Link to post
Share on other sites
implicit    504
Quote:
Original post by jyk
Quote:
Original post by dmail
Sneftel, yes I know how to create a sorting based on the individual components of the vector and if I were to write one it would not look like what you suggest for the point which you have skipped; floating point comparison i.e. the precision of floating points.
The code that Sneftel posted is fine. What is required in this case is a strict weak ordering, which is exactly what the code implements.
Unfortunately you can't rely on anything like that when dealing with floating point numbers.
The classic counterexample is the x87 FPU which internally performs calculations on 80-bit registers, and typically only stores truncated 32-bit floats or 64-bit doubles in memory. This means that a < b and b < a can both be true, if the optimizer decides to spill the a or b register to memory between the comparisons.
And then there's NaNs of course, which always compare false. Not that you're likely to encounter any in this example but you can design some pretty nifty exploits by forcing one into, say, an algorithm which uses sentinels.

Share this post


Link to post
Share on other sites
dmail    116
Quote:
Original post by jyk
Quote:
Original post by dmail
Quote:
Original post by Sneftel

if(a.x < b.x) return true;
if(a.x > b.x) return false;
if(a.y < b.y) return true;
if(a.y > b.y) return false;
if(a.z < b.z) return true;
if(a.z > b.z) return false;
return false;


Sneftel, yes I know how to create a sorting based on the individual components of the vector and if I were to write one it would not look like what you suggest for the point which you have skipped; floating point comparison i.e. the precision of floating points.
The code that Sneftel posted is fine. What is required in this case is a strict weak ordering, which is exactly what the code implements.


Am I missing something here?
The hash map when finding/looking for the hash will compare the values to see if they are equal when there is collision in the hash. For two values if they are the same requires them to be equal, this maybe done using something like the logic I posted earlier using just the less than operator.

Quote:
msdn
For a hash_map m, if two elements e1(k1, d1) and e2(k2, d2) are objects of type value_type, where k1 and k2 are their keys of type key_type and d1 and d2 are their data of type mapped_type, then m.value_comp( )(e1, e2) is equivalent to m.key_comp( ) (k1, k2). A stored object defines the member function

bool operator(value_type& _Left, value_type& _Right);

which returns true if the key value of _Left precedes and is not equal to the key value of _Right in the sort order.

Share this post


Link to post
Share on other sites
jyk    2094
Quote:
Original post by implicit
Quote:
Original post by jyk
Quote:
Original post by dmail
Sneftel, yes I know how to create a sorting based on the individual components of the vector and if I were to write one it would not look like what you suggest for the point which you have skipped; floating point comparison i.e. the precision of floating points.
The code that Sneftel posted is fine. What is required in this case is a strict weak ordering, which is exactly what the code implements.
Unfortunately you can't rely on anything like that when dealing with floating point numbers.
The classic counterexample is the x87 FPU which internally performs calculations on 80-bit registers, and typically only stores truncated 32-bit floats or 64-bit doubles in memory. This means that a < b and b < a can both be true, if the optimizer decides to spill the a or b register to memory between the comparisons.
And then there's NaNs of course, which always compare false. Not that you're likely to encounter any in this example but you can design some pretty nifty exploits by forcing one into, say, an algorithm which uses sentinels.
That's an interesting point about the less-than comparison. I'd assumed dmail was talking about using tolerances in the comparison (which isn't necessary or desirable in this case), but I may have been wrong about that.

Share this post


Link to post
Share on other sites
bubu LV    1436
Quote:
Original post by dmail
Am I missing something here?


Quoting your quote:
Quote:
returns true if the key value of _Left precedes and is not equal to the key value of _Right in the sort order.

Share this post


Link to post
Share on other sites
Sneftel    1788
Quote:
Original post by dmail
Sneftel, yes I know how to create a sorting based on the individual components of the vector and if I were to write one it would not look like what you suggest for the point which you have skipped; floating point comparison i.e. the precision of floating points.
Read the original post. He's looking to remove duplicated vertices, not welding vertices based on proximity. Floating point precision is not a relevant topic.
Quote:
Original post by implicit
The classic counterexample is the x87 FPU which internally performs calculations on 80-bit registers, and typically only stores truncated 32-bit floats or 64-bit doubles in memory. This means that a < b and b < a can both be true, if the optimizer decides to spill the a or b register to memory between the comparisons.
IEEE-754 defines rounding in a way that doesn't allow this to happen... two values can go from unequal to equal when one or both of them is truncated, but they can't switch from a<b to a>b. In any case, this isn't an issue that arises in the context of containers, since the data is truncated for writing to memory.

Share this post


Link to post
Share on other sites
implicit    504
Quote:
Original post by Sneftel
Quote:
Original post by implicit
The classic counterexample is the x87 FPU which internally performs calculations on 80-bit registers, and typically only stores truncated 32-bit floats or 64-bit doubles in memory. This means that a < b and b < a can both be true, if the optimizer decides to spill the a or b register to memory between the comparisons.
IEEE-754 defines rounding in a way that doesn't allow this to happen... two values can go from unequal to equal when one or both of them is truncated, but they can't switch from a<b to a>b. In any case, this isn't an issue that arises in the context of containers, since the data is truncated for writing to memory.
Are you sure? I'm not saying you're wrong, it just seems like an effect that would be hard to achieve when only one of the variables is truncated.
Say you've got two 80-bit floats differing by only a small amount (so they'll both be truncated down to the same 32-bit value), then spill the higher of the pair to memory, reload it into an 80-bit register at a later stage, and finally perform a comparison against the original version of the other value.
At any rate only the data actually inside of the container will have to be stored to memory, the search keys might well survive intact inside of registers.

Share this post


Link to post
Share on other sites
Sneftel    1788
Quote:
Original post by implicit
Say you've got two 80-bit floats differing by only a small amount (so they'll both be truncated down to the same 32-bit value), then spill the higher of the pair to memory, reload it into an 80-bit register at a later stage, and finally perform a comparison against the original version of the other value.

Oh, I see what you are saying. I was thinking of both values ending up as the truncated length, when one or both of them had started longer.
Quote:
At any rate only the data actually inside of the container will have to be stored to memory, the search keys might well survive intact inside of registers.
The data in question is already only 32 bits. It came out of a float variable. There is no floating point arithmetic going on here.

[Edited by - Sneftel on April 30, 2009 3:03:43 PM]

Share this post


Link to post
Share on other sites
chairthrower    440

Floating point values are traditionally not regarded as good candidates for hash tables. What is the source of your vertex data ? If it is the result of a computation then you need to be very careful that the compuatation always gives a deterministic result or else use another data structure that allows you to control for round-off and imprecision in the least significant bits. For example an intersection test on two segments to produce a new vertex will yield a slightly different binary result (and therefore produce different hash values) depending on the ordering of calculation. The order of calculation here is determined by the ordering of each vertex in each segment and the order of segment argument pushing to the segment test function.

By contrast, if the vertex source is coming straight from deserializing a file then you should be ok. I recently wrote a polygon to polyline routine to eliminate duplicate adjacent segments in vectorial data using unordered_map which works fine - but I limit it to to unprocessed data with exactly represented values.

As a side note, I don't believe that a comparator less< T> test is a good choice for a hash data structure to resolve collisions. I expect that the designers of the hash_map were thinking about leaving open the possibility of using some type of balanced tree for improved bucket searching, but really the focus should really be on reducing collisions and then using simple traversal and match. The tr1 unordered hash maps use equality testing predicates (ie equal_to< T> ) rather than comparators which imho is a better choice.

Share this post


Link to post
Share on other sites
bpoint    464
Quote:
Original post by bubu LV
Sorry, my mistake. It needs < operator, not == operator.

Aaah, that was it -- I didn't have an operator < for MyVec. I added it and hash_map::find() always returns a proper result now.

Sneftel, I hope you don't mind that I swiped your code. ;)

Share this post


Link to post
Share on other sites
ddlox    168
Quote: "This is my first time working with stdext::hash_map, so please be gentle. :)"

Just out of curiosity, why did you decide to use a hash map?

(I usually use a std::map to do the same kind of search)
I know what a hash map is (buckets and all that...), but I'm just being curious, maybe there's something I could learn from your decision-making on this choice.

Cheers.

Share this post


Link to post
Share on other sites
alvaro    21246
Quote:
Original post by ddlox
Quote: "This is my first time working with stdext::hash_map, so please be gentle. :)"

Just out of curiosity, why did you decide to use a hash map?

(I usually use a std::map to do the same kind of search)
I know what a hash map is (buckets and all that...), but I'm just being curious, maybe there's something I could learn from your decision-making on this choice.

Cheers.

Looking for an element or inserting it in a map takes O(log(n)) time, where n is the number of elements in the container. This results in O(N*log(N)) time to eliminate duplicates in a list of N elements. A hash_map can do those operations in amortized O(1) time, so the resulting algorithm takes O(N).

You should definitely give it a try.

Share this post


Link to post
Share on other sites
DevFred    840
Quote:
Original post by Sneftel

if(a.x < b.x) return true;
if(a.x > b.x) return false;
if(a.y < b.y) return true;
if(a.y > b.y) return false;
if(a.z < b.z) return true;
if(a.z > b.z) return false;
return false;

Isn't the last if kinda pointless?

Share this post


Link to post
Share on other sites
Sneftel    1788
Quote:
Original post by DevFred
Quote:
Original post by Sneftel

if(a.x < b.x) return true;
if(a.x > b.x) return false;
if(a.y < b.y) return true;
if(a.y > b.y) return false;
if(a.z < b.z) return true;
if(a.z > b.z) return false;
return false;

Isn't the last if kinda pointless?

Yes, but it makes the structure clearer, and it'll be elided by the compiler when optimizing anyway.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this