Jump to content
  • Advertisement
Sign in to follow this  
noatom

Dealing with collision in hash tables

This topic is 1151 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

So, Im a bit confused about some of the methods used to handle collisions, when a has function outputs the same result for 2 keys, and you want to store that result.

 

Say for example:

 

key: rat -> hash(rat) = 13

key: tar -> hash(tar) = 13

 

(those are pure examples, I came up with the values on my own)

 

Some methods used to handle this situation, when trying to store the results in a hash table:

 

If multiple items are assigned the same hash code, we “chain” them together. Each position in the hash table serves as a bucket that is able to store multiple data items.

 

Both rat and tar 's values would be stored in bucket 13.

So, if I had some values I wanted to store for each of them, unrelated like: rat's value = 9 and tar's value = 51

Bucket 13 would contain: 9, 51

 

But my question is, if I wanted to retrieve back the value, and I hash "tar", I get 13, how do I know WHICH value in that bucket corresponds to my key?! How would I know it's not 9, but 51?

 

When the position assigned by the hash function is occupied, find another open position. • Example: key “wasp” has a hash code of 22, but it's value ends up in position 23 , because position 22 is occupied.

 

Again, how would I be able to ever retrieve back the value for wasp? If I hash "wasp", get 22, and retrieve the value found in element no 22, should I use that value? Or should I travel +5 elements down? 

 

If I hash rat first, I get 13, and the element in the table is not used, I store the value 9 there.

If I hash tar, I get 13, but now the element is used, so I just increment until I find a free position, 14,15,16 if 16 is free, I'll the value 51 there.

 

But how would I ever know how to retrieve back the value for tar? I hash tar, get 13, and retrieve the value at element no 13, but it turns out it's actually rat's value, 9! not tar's 51.

Edited by noatom

Share this post


Link to post
Share on other sites
Advertisement

Remember that a hashtable stores the values AND the keys so the lookup routine can simply compare the key that was stored with the key being sought. If the keys are equal then the correct value has been found, otherwise it's not a match. 

Share this post


Link to post
Share on other sites

In the case of hash collections, you cannot just pull the value N from the hash table, and know you have what you want. Instead, each entry in the table becomes a "list" (in various forms), and the key associated with the value is also stored in the table.

When you find hash value N, you get the list, and compare the key you provide with the key that is stored, for each element in the list, until you find a match, or you have tried all elements of the list.

 

The "list" here can take various forms. Some hash tables do a single linked list, other tables use "the next free entry" in the table. Thus, if I want to store a new value at position N, I check if that position is free. If it is, store the key+value, else try entry N+1, and so on.

Retrieving is the other way around. If you want an entry, you start at the index position N. If the key matches, you're done. If it doesn't match, you try entry N+1, and so on, until you hit an empty entry. At that point you know the entry is not in the table. (If it was in the table, it would have been added before the free entry.)

 

A lot of research has been done on this problem. See also https://en.wikipedia.org/wiki/Hash_table

 

Also, there exists "perfect hashing". If you have a known set of key values that you want to store, you can compute a hashing function such that collisions never happen. This is eg used in recognizing keyword in compilers.

Edited by Alberth

Share this post


Link to post
Share on other sites

You could store a tree at each hash location that has multiple entries. A sorted tree like an r-b tree would mean less iterations to find any item by key name...

Share this post


Link to post
Share on other sites

You could store a tree at each hash location that has multiple entries. A sorted tree like an r-b tree would mean less iterations to find any item by key name...

I would expect that someone has tried that already. It's a valid solution, but any extension to deal with collisions moves you further away from the O(1) access speed that you like to have. On the other hand, if you have a collision, you need to have something to deal with it.

 

I experienced that for small enough collections of values, the setup and memory costs of more complex data structures may cause more trouble than the gains in reducing the number of iterations. Sequentially inspecting the next hash table entry works well with CPU caches, which is going to be hard to beat with a binary tree of say 3 elements. If the number of elements is (much) bigger, there is probably something wrong with your hashing, which you should fix instead.

Edited by Alberth

Share this post


Link to post
Share on other sites

 

You could store a tree at each hash location that has multiple entries. A sorted tree like an r-b tree would mean less iterations to find any item by key name...

I would expect that someone has tried that already. It's a valid solution, but any extension to deal with collisions moves you further away from the O(1) access speed that you like to have. On the other hand, if you have a collision, you need to have something to deal with it.

 

I experienced that for small enough collections of values, the setup and memory costs of more complex data structures may cause more trouble than the gains in reducing the number of iterations. Sequentially inspecting the next hash table entry works well with CPU caches, which is going to be hard to beat with a binary tree of say 3 elements. If the number of elements is (much) bigger, there is probably something wrong with your hashing, which you should fix instead.

 

 

 

Not exactly. You have a few options with hashing.

To avoid collisions, you can dynamically extend the data structure, and let the program modify it's own hashing. But you still run into the same problem if you do not have enough buckets to compensate. You can maintain the O(1) access speed... but you're wasting a lot of space in the process... and time.

So... that leaves you with the Closed and Open hashing. Both of which has a O(N) access on collision depending on implementation. 

Share this post


Link to post
Share on other sites
There's the other option: just don't have collisions.

In many cases this is quite achievable, actually. In games you'll see a few of these pretty often:

(1) The hash table can be baked, such as the lookup table in a packed asset file, so you can generate an actual perfect hash. You can store a hash seed with the table to accomplish this. When baking the table, first try to hash all the items and then if anything conflicts just increment the seed and repeat. At load time, when looking up an asset in the table just use the stored seed when computing the hash.

(2) For in-game values controlled purely by the game developers, just error on collision and make the developers pick a different name. The odds of a collision with a good high-quality hash function are very very low. In the highly unlikely case that a developer manages to find a name that causes a collision it's trivial to pick a slight variation of the name that doesn't collide with anything.

(3) For programmitically-chosen names, force uniqueness. For instance, in an engine that has the idea of object blueprints and instances of those blueprints (e.g., a blueprint named "orc" and an instance named "orc#1"), just keep regenerating new names until there's no collision. If the generated instance name "orc#1" collides with something, just automatically rename it to "orc#2" and so on until it doesn't collide with anything.

If you have user-supplied values, you can't necessarily do any of those, though.

Simple linear probing with chaining works very well for speed. If unknown agents can supply names (e.g., over the Internet), you _really_ want to use the rb-tree for your buckets approach! Otherwise, a remote attacker could potentially pick values that hash into a chain and turn your supposedly O(1) table into an O(N) linked list and hence execute a DOS attack. Using the rb-tree means that your data structure is still O(1) in the best/average case and only O(logN) in the worst case. This was a pretty major issue for a lot of languages used for Web programming in the last decade, and as games become more and more online, it's important for games too.

Share this post


Link to post
Share on other sites

There's also the option of cuckoo hashing, with constant time worst case lookup performance at the expense of worst-case insertion performance, memory use and possibly speed: use two hash functions to compute two places for a key (usually addresses in two separate half-size arrays), look for it in both places when searching (the key cannot be anywhere else), place it in either designated slot when inserting (if empty), displace and insert in its other designed place the stored value if both places are occupied by other keys. This chain of reinsertions can also fail if a displaced value wants the place of the newly inserted key, requiring a rebuild with different hash functions or a different array size. Many obvious variants, like using more hash functions or small fixed-size buckets, can reduce the probability of expensive reorganizations and allow higher load.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!