Help with String Hashing

Started by
14 comments, last by dougbinks 11 years, 4 months ago
Hello,

I am trying to implement a user-friendly way to label and identify my in-game assets and entities. I have been using an enumeration system where each class of entity or asset will have an enumeration entry, making the program easier to understand. For me, this works. However, I have read about string hashing and how this can be used to give designers and level editors (i.e. myself in the future) the ability to search for and create entities and assets using strings, but still maintains the run-time performance of integer identification.

However, I have not yet been able to find a suitable example of how (and when) hashing could be used in a video-game case to provide this functionality. I assume that the hashing would have to be done at compile time to make the system as fast as integer identification, but then, how would that enable the level editor and other offline tools to use strings?

If someone could provide a concise example of how string hashing works with regard to asset and entity identification I would be greatly appreciative.

Thank you as always.

- Dave Ottley

I wonder as I wander...

http://www.davesgameoflife.com

Advertisement
Basically, these kinds of systems boil down to:int foo = TurnStringIntoIdentifier("foo");
int bar = TurnStringIntoIdentifier("bar");
int other = TurnStringIntoIdentifier(string("f") + "oo");
bool isFoo = other == foo;
bool isBar = other == bar;
It doesn't have to be done at compile time, just as long as you're not constantly calling [font=courier new,courier,monospace]TurnStringIntoIdentifier[/font]. As long as you cache the result in an integer variable, then the code's gonna be faster than using string comparisons everywhere.

Aside from hashing, another way to make one of these systems is via string interning, which boils down to something like:container<string> strings;
int TurnStringIntoIdentifier( string input )
{
if strings does not contain input then
add input to strings
endif
return index of input within strings
}


For a basic version using hashing, the [font=courier new,courier,monospace]TurnStringIntoIdentifier[/font] is just a hash function. I often use the Fnv-1a hash, but there's plenty to choose from.
The problem with this basic version is that it's possible for two strings to return the same hash, which means when you compare the integers you can get false-positives. e.g. above, [font=courier new,courier,monospace]isFoo[/font] and [font=courier new,courier,monospace]isBar[/font] could both be true!

One way to deal with this is to still perform regular string comparisons, but only when the hash compare succeeds. If the hash comparison is false, you can skip the string comparison.

Another way is to simply assume that this event will never occur cool.png
In my engine, in development builds, whenever I hash a string, I add it to a global map, with the hash as the key and the string as the value. If that key is already present in the map, then I assert that the existing value matches the string that I'm currently hashing. If this assertion fails, it means I've ended up in the unfortunate situation where two strings have generated the same hash. When this occurs, you can change the "salt" constant used in your hashing algorithm until the problem goes away.
In shipping builds, I skip this assertion checking, and assume that no strings will generate the same hash.
You probably don't want a 'real' hash function for this - just associate each string with an integer. For example you could start at zero increment for each string added, putting them as key and value in an std::map. Things get more complicated if you want to remove strings and reuse integers, but you probably won't need to.

Edit: turns out this is called string interning. Thanks Hodgman!
[size="1"]
Hodgeman and MrBastard,

Is there any advantage to a hash function over interning? It would seem that interning is always faster. Is it just the memory usage for the string container?

Best,

Dave

I wonder as I wander...

http://www.davesgameoflife.com

The memory usage isn't that bad - you're not likely to have megabytes of identifiers in your game.

Interestingly, many 'map' structures are 'hash tables' (the C++ one is a 'red black tree', but anyway), so when you check if a string is present in your container, the container my be accelerating that lookup by hashing the string.
So if you implement interning, you might also be using hashing somewhat ;)

As for speed, the cost of generating the hash of a string is relative to its length, whereas looking up a string in a sorted container is relative (logarithmically) to the number of items already in the container and the length of the string. This of course varies from container to container, e.g. Lookup in a hash table will be comparable to the cost of the hash!
In any case, the interning/hashing is ok to be expensive, as long as it results in cheap integer comparisons afterwards.

One issue with interning is that it generates different IDs for the same string depending on the order for which you requested your IDs. This means it's hard to make use of these IDs in save games, for example.
Also, if your interned string collection is created at runtime, then you don't have the option of generating IDs at build time. E.g. My asset names are hashed when the files are created, and assets that link to other assets refer to them by these hashes only. At runtime, filenames don't exist.
If a script wants to load an asset by name, it can run the same hashing algorithm at runtime, and get the same ID that was generated at build time.
If you want to use the hash directly, you of course have to worry about collisions (different strings with the same hash). A decent hash function will result in hashes that are indistinguishable from random numbers. If you use 32-bit hash keys, the probability of getting at least one collision becomes 1% at around 9300 identifiers, and 50% at around 77000 identifiers. If you use 64-bit hash keys, you won't reach probability of 1% until 610 million identifiers, so you are probably safe there.

You can find the numbers here: http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
For when to use hashing, you should use it when you'd need to otherwise do lots of string comparisons. What you do is when you get a string (say an object name) you hash it and store it with the object (in addition to the name, usually). Then, when you need to look up an object by a string name, you can instead hash what you're searching for, then compare the hashes with a single integer comparison. You'll probably get collisions, so once you find a matching hash then you do a string comparison to make sure you have the right one.

In the case where you have 10,000 objects with an average name length of lets say 9 characters, and the worst case scenario of the object you're after being the last one, you could do either an average of 100,000 char-comparisons (including null-terminators at the end of the string) or you could do 10,000 int-comparisons and maybe 1000 char comparisons. ~11,000 < ~100,000, so I know what I'd choose.

The hashing function I've been considering using is the Fowler–Noll–Vo 1a variant, though you can look at some others to see what might best fit your needs.


uint32_t getHash(const char *str)
{
// Implementation of the Fowler–Noll–Vo-1a hash function
uint32_t hash = 0x811C9DC5; // Actually I use a defined constant here, but this is the seed for a 32-bit hash
for(size_t i = 0; str; ++i)
{
// The FNV-1a variation
hash ^= str;
hash *= 0x01000193; // Same for this, this is the 32-bit prime number
}
// Wow that was hard.
return hash;
}

(excuse the lack of indentation, HTML eats "extraneous" whitespace).

I treat it basically like a one-function library, I don't try to understand how it works as long as it does what it says it should. There's a lot of math behind the structure and numbers used and I write games, not hashing functions.
I've used the Dan Bernstein hash for stuff like this in the past, but as pointed out, the number of bits you use will dictate how frequent a collision occurs. If you get too many collisions you lose the O(1) access time and revert to O(n) plus the cost of doing the hash function.

If you use STL, you could try something like this:

std::map< std::string, DWORD > identMap;

This will essentially hash a std::string into a DWORD and vice versa.
Is there any reason not to use std::hash<std::string> ? Is it slower than the other hash functions mentioned above?

Is there any reason not to use std::hash<std::string> ? Is it slower than the other hash functions mentioned above?

std::hash will skip characters of the string if the string is longer then 10 characters, at least the VS2010 implementation does this. Other then that it won't be any slower or faster then your own implementation, the actual implementation looks like FNV anyway.


template<>
class hash<_STD string>
: public unary_function<_STD string, size_t>
{ // hash functor
public:
typedef _STD string _Kty;
size_t operator()(const _Kty& _Keyval) const
{ // hash _Keyval to size_t value by pseudorandomizing transform
size_t _Val = 2166136261U;
size_t _First = 0;
size_t _Last = _Keyval.size();
size_t _Stride = 1 + _Last / 10;
for(; _First < _Last; _First += _Stride)
_Val = 16777619U * _Val ^ (size_t)_Keyval[_First];
return (_Val);
}
};

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

This topic is closed to new replies.

Advertisement