#1 Members - Reputation: 301
Posted 22 November 2012 - 05:54 AM
I am trying to implement a user-friendly way to label and identify my in-game assets and entities. I have been using an enumeration system where each class of entity or asset will have an enumeration entry, making the program easier to understand. For me, this works. However, I have read about string hashing and how this can be used to give designers and level editors (i.e. myself in the future) the ability to search for and create entities and assets using strings, but still maintains the run-time performance of integer identification.
However, I have not yet been able to find a suitable example of how (and when) hashing could be used in a video-game case to provide this functionality. I assume that the hashing would have to be done at compile time to make the system as fast as integer identification, but then, how would that enable the level editor and other offline tools to use strings?
If someone could provide a concise example of how string hashing works with regard to asset and entity identification I would be greatly appreciative.
Thank you as always.
- Dave Ottley
I wonder as I wander...
#2 Moderators - Reputation: 13471
Posted 22 November 2012 - 06:25 AM
int foo = TurnStringIntoIdentifier("foo");
int bar = TurnStringIntoIdentifier("bar");
int other = TurnStringIntoIdentifier(string("f") + "oo");
bool isFoo = other == foo;
bool isBar = other == bar;It doesn't have to be done at compile time, just as long as you're not constantly calling TurnStringIntoIdentifier. As long as you cache the result in an integer variable, then the code's gonna be faster than using string comparisons everywhere.Aside from hashing, another way to make one of these systems is via string interning, which boils down to something like:
container<string> strings;
int TurnStringIntoIdentifier( string input )
{
if strings does not contain input then
add input to strings
endif
return index of input within strings
}For a basic version using hashing, the TurnStringIntoIdentifier is just a hash function. I often use the Fnv-1a hash, but there's plenty to choose from.
The problem with this basic version is that it's possible for two strings to return the same hash, which means when you compare the integers you can get false-positives. e.g. above, isFoo and isBar could both be true!
One way to deal with this is to still perform regular string comparisons, but only when the hash compare succeeds. If the hash comparison is false, you can skip the string comparison.
Another way is to simply assume that this event will never occur
In my engine, in development builds, whenever I hash a string, I add it to a global map, with the hash as the key and the string as the value. If that key is already present in the map, then I assert that the existing value matches the string that I'm currently hashing. If this assertion fails, it means I've ended up in the unfortunate situation where two strings have generated the same hash. When this occurs, you can change the "salt" constant used in your hashing algorithm until the problem goes away.
In shipping builds, I skip this assertion checking, and assume that no strings will generate the same hash.
Edited by Hodgman, 22 November 2012 - 06:29 AM.
#3 Members - Reputation: 1564
Posted 22 November 2012 - 06:26 AM
Edit: turns out this is called string interning. Thanks Hodgman!
#5 Moderators - Reputation: 13471
Posted 22 November 2012 - 08:29 PM
Interestingly, many 'map' structures are 'hash tables' (the C++ one is a 'red black tree', but anyway), so when you check if a string is present in your container, the container my be accelerating that lookup by hashing the string.
So if you implement interning, you might also be using hashing somewhat ;)
As for speed, the cost of generating the hash of a string is relative to its length, whereas looking up a string in a sorted container is relative (logarithmically) to the number of items already in the container and the length of the string. This of course varies from container to container, e.g. Lookup in a hash table will be comparable to the cost of the hash!
In any case, the interning/hashing is ok to be expensive, as long as it results in cheap integer comparisons afterwards.
One issue with interning is that it generates different IDs for the same string depending on the order for which you requested your IDs. This means it's hard to make use of these IDs in save games, for example.
Also, if your interned string collection is created at runtime, then you don't have the option of generating IDs at build time. E.g. My asset names are hashed when the files are created, and assets that link to other assets refer to them by these hashes only. At runtime, filenames don't exist.
If a script wants to load an asset by name, it can run the same hashing algorithm at runtime, and get the same ID that was generated at build time.
#6 Members - Reputation: 5796
Posted 22 November 2012 - 09:09 PM
You can find the numbers here: http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
#7 Members - Reputation: 350
Posted 25 November 2012 - 08:44 PM
In the case where you have 10,000 objects with an average name length of lets say 9 characters, and the worst case scenario of the object you're after being the last one, you could do either an average of 100,000 char-comparisons (including null-terminators at the end of the string) or you could do 10,000 int-comparisons and maybe 1000 char comparisons. ~11,000 < ~100,000, so I know what I'd choose.
The hashing function I've been considering using is the Fowler–Noll–Vo 1a variant, though you can look at some others to see what might best fit your needs.
uint32_t getHash(const char *str)
{
// Implementation of the Fowler–Noll–Vo-1a hash function
uint32_t hash = 0x811C9DC5; // Actually I use a defined constant here, but this is the seed for a 32-bit hash
for(size_t i = 0; str[i]; ++i)
{
// The FNV-1a variation
hash ^= str[i];
hash *= 0x01000193; // Same for this, this is the 32-bit prime number
}
// Wow that was hard.
return hash;
}
(excuse the lack of indentation, HTML eats "extraneous" whitespace).I treat it basically like a one-function library, I don't try to understand how it works as long as it does what it says it should. There's a lot of math behind the structure and numbers used and I write games, not hashing functions.
#8 Members - Reputation: 836
Posted 26 November 2012 - 01:50 AM
If you use STL, you could try something like this:
std::map< std::string, DWORD > identMap;
This will essentially hash a std::string into a DWORD and vice versa.
Edited by Steve_Segreto, 26 November 2012 - 01:51 AM.
#10 Crossbones+ - Reputation: 1142
Posted 26 November 2012 - 03:47 AM
std::hash will skip characters of the string if the string is longer then 10 characters, at least the VS2010 implementation does this. Other then that it won't be any slower or faster then your own implementation, the actual implementation looks like FNV anyway.Is there any reason not to use std::hash<std::string> ? Is it slower than the other hash functions mentioned above?
template<>
class hash<_STD string>
: public unary_function<_STD string, size_t>
{ // hash functor
public:
typedef _STD string _Kty;
size_t operator()(const _Kty& _Keyval) const
{ // hash _Keyval to size_t value by pseudorandomizing transform
size_t _Val = 2166136261U;
size_t _First = 0;
size_t _Last = _Keyval.size();
size_t _Stride = 1 + _Last / 10;
for(; _First < _Last; _First += _Stride)
_Val = 16777619U * _Val ^ (size_t)_Keyval[_First];
return (_Val);
}
};
#11 Members - Reputation: 5796
Posted 26 November 2012 - 09:11 AM
I have used CRC32 in the past as a string hash, and it works great. I think it was suggested in "The Art of Computer Programming".
Edited by Álvaro, 26 November 2012 - 09:13 AM.
#13 Crossbones+ - Reputation: 1142
Posted 26 November 2012 - 01:03 PM
Yeah I saw they fixed it, it's also a FNV1a hash function in VS2012 so there is no reason to write your own anymore.Definitely not a feature of std::hash in general. This was fixed in MSVC 2012.
std::hash will skip characters of the string if the string is longer then 10 characters, at least the VS2010 implementation does this.
#15 Members - Reputation: 1236
Posted 27 November 2012 - 02:51 AM
If by "enumeration" you mean code likeI am trying to implement a user-friendly way to label and identify my in-game assets and entities. I have been using an enumeration system where each class of entity or asset will have an enumeration entry, making the program easier to understand. For me, this works. However, I have read about string hashing and how this can be used to give designers and level editors (i.e. myself in the future) the ability to search for and create entities and assets using strings, but still maintains the run-time performance of integer identification.
... STANDING_SOLDIER_SPRITE= "soldier1.png", CROUCHING_SOLDIER_SPRITE= "soldier2.png", ...or (with objects)
...
/*spritesheet, frame count, mode*/
STANDING_SOLDIER_SPRITE= AnimatedSprite("soldier1.png",5,AnimatedSprite.PING_PONG),
CROUCHING_SOLDIER_SPRITE= AnimatedSprite("soldier2.png",3,AnimatedSprite.ONCE_FORWARD),
...
the inflexibility in case you want to edit levels and make mods is obvious; you need to move almost all references to identifiers of any kind (symbols in code, strings, arbitrary numbers, file positions, etc.) from code to data files.But in a level editor and in data files, human-friendly strings are an optional attribute of assets. When you create an asset, you can give it a unique ID (usually a short string), used to reference the asset from game levels and other assets, and prompt the user to enter a friendly name and possibly tags and arbitrary attributes (which will only be used in debug output and in the level editor).
Hashing has no place in generating unique asset IDs: in a mod-oriented architecture, with multiple archives partly replacing each other's content, you need to use the "standard" names without being clever (given a set of archives, tools can index asset IDs to tell the user what assets are being intentionally or unintentionally shadowed by another with the same ID, whether a referenced asset ID is missing, and whether the added IDs of added assets are actually unique as they should); if instead everything is compiled and consolidated, you have a finite and closed set of asset IDs in use and if you want integers for efficiency reasons you can simply number them in any arbitrary order.
#16 Members - Reputation: 369
Posted 27 November 2012 - 10:52 AM
Edited by dougbinks, 27 November 2012 - 10:53 AM.






