Hash Table Element Lookup

Started by
6 comments, last by SeanMiddleditch 8 years, 10 months ago

Hey all,

I'm using a hash table to load/remove/retrieve my resources and as of now what I am doing is I have an array of linked lists. I have a hash function that generates an index to place the resource in and it gets added to the linked list in that index. To retreive a resource I call the hash function again to find the index of the element and then find the resource thats key value corresponds with what was used in the hash function. The key value happens to be a string. Now my question is should I also hash the key that's paired to the resource so it would be comparing integers instead of strings? Obviously integer comparison is much faster, but that introduces two more problems. Is the integer comparision still faster than the string comparision considering that it'll have to be hashed again? And the second problem, if two strings generate the same hash, two resources in the linked list will share a key. Obviously that second problem occuring can be minimized by having a hash function that results in mininum collisions, but in the rare case that they do collide, that would be an issue.

Sorry for any typos, btw. It's late :P

Advertisement

If you did that, you would also have to compare the actual values at the end (due to possibility of collisions).

So either you:

A. Compare each key in list until correct one is found

B. Do A, except add a hash as part of the key to more quickly reject nonmatching keys

Given that:

1. You probably wont have very many keys in the list

2. The keys will probaby be rejected pretty quick anyways (for random strings in the map, most can be rejected after a single char-char or string size comparison!)

3. You have to do the full comparison anyways at least once

Storing the hash and using that doesnt seem to add much when it comes to optimizing the search for the correct element in the list. Perhaps if all your strings tend to begin with the same initial characters and theres many of those characters (so you can use the hash to 'early out' from the costly equivalence check).

But, there is another good reason to store the hash with the key-value pair. If your key is expensive to hash (depends on key and hash func), resizing the hash table will be very expensive since all keys have to be hashed to place them in the newly resized table. If you store the hash, no rehashing needs to be done. This wont be very useful if the table doesnt need to be resized, or if the keys are not expensive to hash as mentioned.

o3o

Thank you for the reply!

My hash function is pretty simple. All it does is multiply each ascii value of each char by a 16 bit prime number and add them up then take the modulo of that value by the table size. Not very complex, but it works well for the application.

I wasn't really looking to completely optimize it at this point, just see the differences between the types of comparisons. You answered that. Thank you.

I'm using a hash table to load/remove/retrieve my resources and as of now what I am doing is I have an array of linked lists. I have a hash function that generates an index to place the resource in and it gets added to the linked list in that index. To retreive a resource I call the hash function again to find the index of the element and then find the resource thats key value corresponds with what was used in the hash function


You are almost literally describing the exact algorithm used by `std::unordered_map`. Just use that and stop worrying about it.

Your conceptual problem is just a fundamental one with hash tables. There are sometimes collisions. Bad hash functions (...like the one you're using) can exasperate those problems and make them far worse.

The only way to avoid collisions in a hash table is to use a perfect hash function. There is no universal perfect hash function. You generally have to find one for a given data set, which is not an operation you can generally perform during runtime of a game. Some games when baking their resources for the game will generate a perfect hash function for that fixed set of resources and then uses that hash function to generate a very efficient index in the game's pack file(s), while other games will just roll the dice and use a high-fidelity hash and hope that there's no collisions (possibly spitting out a content error if there are). Yet other games generate unique fixed-size keys for resources (i.e. UUIDs) and then use those in all their tables and references.

Also, when you're walking a linked list of items, thinking that using an integer key is going to make a huge different is a bit pointless. Yes, comparing integers is more efficient, but if you're already doing something inefficient... comparing small strings in a std::string will be much faster in many cases than following a single linked list pointer (std::string typically employs the small-buffer optimization, so a std::string can contain small strings with no external pointers, so comparing small string keys requires no extra memory accesses while following a linked list pointer is quite likely going to cause a cache miss + memory fetch + pipeline stall.)

Now, if you replaced your mediocre hash table with a more efficient open-addressing hash table (with a sensible probing strategy) then you will possibly find that comparing keys is one of your bottlenecks. Switching to comparing the hashes while following chains could be a big help. But more likely you'll find that either your probing strategy or your memory layout is a bigger cost (e.g., do you store keys and values interleaved in memory? that'll make probing result in more cache misses...).

Ultimately, though, just profile your code and find out what's slow for sure rather than guessing randomly. If you want decent performance without lots of analysis work then use the STL containers like std::unordered_map instead of writing your own and then only replace them if/when they become a bottleneck.

Sean Middleditch – Game Systems Engineer – Join my team!

Now, if you replaced your mediocre hash table with a more efficient open-addressing hash table (with a sensible probing strategy) then you will possibly find that comparing keys is one of your bottlenecks.

What would be a sensible probing strategy?

What would be a sensible probing strategy?


Pretty much any of the ones outlined on that Wikipedia page. Just don't try to invent your own (much like inventing your own hash, it's just a bad idea). smile.png

I favor linear probing, based on benchmarks I've seen (it's simple and it works amazingly well on real-life game datasets).

If your hash table isn't overloaded and you have a good hash function, though, you won't be probing all that often anyway, so perhaps I overstated the importance of the probing strategy.

Sean Middleditch – Game Systems Engineer – Join my team!

Did you compare linear probing with coalesced hashing? Just out of curiosity since I am currently reading about hash tables.

Since you mention the Wikipedia site. They offer a constant time deletion method compared to the mark DELETE approach I am finding everywhere else. Did you ever compare those? I am trying to get a reference for that method and bought the cited book, but I couldn't find the described method. If you accidentally know a good reference for deletion using open addressing that would be awesome! smile.png

Did you compare linear probing with coalesced hashing? Just out of curiosity since I am currently reading about hash tables.


I've never tried it. From my understand, it's just bloat (extra link pointers, etc.) that's just attempting to avoid a problem you shouldn't have (clustering).

Since you mention the Wikipedia site. They offer a constant time deletion method compared to the mark DELETE approach I am finding everywhere else. Did you ever compare those?


I've never measured a performance problem with deletions going the simple route, so no.

Again, you shouldn't have bad clustering. A lot of those algorithms are meant for dealing with millions of entries in smaller amounts of memory rather than the situations typical to games (thousands of entries and you have some room for keeping less loaded for your important hotpath tables).


My rough suggestion is to just copy what Lua does for their tables: a simple open-addressed linearly-probed hash table, simple forward-link (index-based) chains, immediate deletes. Build up from there. Or if you aren't using these in hotpaths in your engine, just use std::unordered_map until you have reason not to. You get more bang for your buck by paying attention to CPU cache access patterns and having a fast but high-quality hash function (fnv1a being the low bar) than you do anything else. At least in my experience. Different use cases than I'm used to may be different, of course, so _always_ profile and test to be sure you aren't getting bad advice. ;)

Sean Middleditch – Game Systems Engineer – Join my team!

This topic is closed to new replies.

Advertisement