Jump to content

  • Log In with Google      Sign In   
  • Create Account


Help with String Hashing


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

  • You cannot reply to this topic
15 replies to this topic

#1 KingofNoobs   Members   -  Reputation: 301

Like
0Likes
Like

Posted 22 November 2012 - 05:54 AM

Hello,

I am trying to implement a user-friendly way to label and identify my in-game assets and entities. I have been using an enumeration system where each class of entity or asset will have an enumeration entry, making the program easier to understand. For me, this works. However, I have read about string hashing and how this can be used to give designers and level editors (i.e. myself in the future) the ability to search for and create entities and assets using strings, but still maintains the run-time performance of integer identification.

However, I have not yet been able to find a suitable example of how (and when) hashing could be used in a video-game case to provide this functionality. I assume that the hashing would have to be done at compile time to make the system as fast as integer identification, but then, how would that enable the level editor and other offline tools to use strings?

If someone could provide a concise example of how string hashing works with regard to asset and entity identification I would be greatly appreciative.

Thank you as always.

- Dave Ottley

I wonder as I wander...

http://www.davesgameoflife.com


Sponsor:

#2 Hodgman   Moderators   -  Reputation: 28603

Like
3Likes
Like

Posted 22 November 2012 - 06:25 AM

Basically, these kinds of systems boil down to:
int foo = TurnStringIntoIdentifier("foo");
int bar = TurnStringIntoIdentifier("bar");
int other = TurnStringIntoIdentifier(string("f") + "oo");
bool isFoo = other == foo;
bool isBar = other == bar;
It doesn't have to be done at compile time, just as long as you're not constantly calling TurnStringIntoIdentifier. As long as you cache the result in an integer variable, then the code's gonna be faster than using string comparisons everywhere.

Aside from hashing, another way to make one of these systems is via string interning, which boils down to something like:
container<string> strings;
int TurnStringIntoIdentifier( string input )
{
  if strings does not contain input then
	add input to strings
  endif
  return index of input within strings
}

For a basic version using hashing, the TurnStringIntoIdentifier is just a hash function. I often use the Fnv-1a hash, but there's plenty to choose from.
The problem with this basic version is that it's possible for two strings to return the same hash, which means when you compare the integers you can get false-positives. e.g. above, isFoo and isBar could both be true!

One way to deal with this is to still perform regular string comparisons, but only when the hash compare succeeds. If the hash comparison is false, you can skip the string comparison.

Another way is to simply assume that this event will never occur Posted Image
In my engine, in development builds, whenever I hash a string, I add it to a global map, with the hash as the key and the string as the value. If that key is already present in the map, then I assert that the existing value matches the string that I'm currently hashing. If this assertion fails, it means I've ended up in the unfortunate situation where two strings have generated the same hash. When this occurs, you can change the "salt" constant used in your hashing algorithm until the problem goes away.
In shipping builds, I skip this assertion checking, and assume that no strings will generate the same hash.

Edited by Hodgman, 22 November 2012 - 06:29 AM.


#3 mrbastard   Members   -  Reputation: 1573

Like
1Likes
Like

Posted 22 November 2012 - 06:26 AM

You probably don't want a 'real' hash function for this - just associate each string with an integer. For example you could start at zero increment for each string added, putting them as key and value in an std::map. Things get more complicated if you want to remove strings and reuse integers, but you probably won't need to.

Edit: turns out this is called string interning. Thanks Hodgman!


#4 KingofNoobs   Members   -  Reputation: 301

Like
0Likes
Like

Posted 22 November 2012 - 04:36 PM

Hodgeman and MrBastard,

Is there any advantage to a hash function over interning? It would seem that interning is always faster. Is it just the memory usage for the string container?

Best,

Dave

I wonder as I wander...

http://www.davesgameoflife.com


#5 Hodgman   Moderators   -  Reputation: 28603

Like
3Likes
Like

Posted 22 November 2012 - 08:29 PM

The memory usage isn't that bad - you're not likely to have megabytes of identifiers in your game.

Interestingly, many 'map' structures are 'hash tables' (the C++ one is a 'red black tree', but anyway), so when you check if a string is present in your container, the container my be accelerating that lookup by hashing the string.
So if you implement interning, you might also be using hashing somewhat ;)

As for speed, the cost of generating the hash of a string is relative to its length, whereas looking up a string in a sorted container is relative (logarithmically) to the number of items already in the container and the length of the string. This of course varies from container to container, e.g. Lookup in a hash table will be comparable to the cost of the hash!
In any case, the interning/hashing is ok to be expensive, as long as it results in cheap integer comparisons afterwards.

One issue with interning is that it generates different IDs for the same string depending on the order for which you requested your IDs. This means it's hard to make use of these IDs in save games, for example.
Also, if your interned string collection is created at runtime, then you don't have the option of generating IDs at build time. E.g. My asset names are hashed when the files are created, and assets that link to other assets refer to them by these hashes only. At runtime, filenames don't exist.
If a script wants to load an asset by name, it can run the same hashing algorithm at runtime, and get the same ID that was generated at build time.

#6 Álvaro   Crossbones+   -  Reputation: 12471

Like
4Likes
Like

Posted 22 November 2012 - 09:09 PM

If you want to use the hash directly, you of course have to worry about collisions (different strings with the same hash). A decent hash function will result in hashes that are indistinguishable from random numbers. If you use 32-bit hash keys, the probability of getting at least one collision becomes 1% at around 9300 identifiers, and 50% at around 77000 identifiers. If you use 64-bit hash keys, you won't reach probability of 1% until 610 million identifiers, so you are probably safe there.

You can find the numbers here: http://en.wikipedia.org/wiki/Birthday_problem#Probability_table

#7 Firestryke31   Members   -  Reputation: 350

Like
0Likes
Like

Posted 25 November 2012 - 08:44 PM

For when to use hashing, you should use it when you'd need to otherwise do lots of string comparisons. What you do is when you get a string (say an object name) you hash it and store it with the object (in addition to the name, usually). Then, when you need to look up an object by a string name, you can instead hash what you're searching for, then compare the hashes with a single integer comparison. You'll probably get collisions, so once you find a matching hash then you do a string comparison to make sure you have the right one.

In the case where you have 10,000 objects with an average name length of lets say 9 characters, and the worst case scenario of the object you're after being the last one, you could do either an average of 100,000 char-comparisons (including null-terminators at the end of the string) or you could do 10,000 int-comparisons and maybe 1000 char comparisons. ~11,000 < ~100,000, so I know what I'd choose.

The hashing function I've been considering using is the Fowler–Noll–Vo 1a variant, though you can look at some others to see what might best fit your needs.

uint32_t getHash(const char *str)
{
// Implementation of the Fowler–Noll–Vo-1a hash function
uint32_t hash = 0x811C9DC5; // Actually I use a defined constant here, but this is the seed for a 32-bit hash
for(size_t i = 0; str[i]; ++i)
{
  // The FNV-1a variation
  hash ^= str[i];
  hash *= 0x01000193; // Same for this, this is the 32-bit prime number
}
// Wow that was hard.
return hash;
}
(excuse the lack of indentation, HTML eats "extraneous" whitespace).

I treat it basically like a one-function library, I don't try to understand how it works as long as it does what it says it should. There's a lot of math behind the structure and numbers used and I write games, not hashing functions.

#8 Steve_Segreto   Crossbones+   -  Reputation: 1492

Like
0Likes
Like

Posted 26 November 2012 - 01:50 AM

I've used the Dan Bernstein hash for stuff like this in the past, but as pointed out, the number of bits you use will dictate how frequent a collision occurs. If you get too many collisions you lose the O(1) access time and revert to O(n) plus the cost of doing the hash function.

If you use STL, you could try something like this:

std::map< std::string, DWORD > identMap;

This will essentially hash a std::string into a DWORD and vice versa.

Edited by Steve_Segreto, 26 November 2012 - 01:51 AM.


#9 Nanook   Members   -  Reputation: 476

Like
0Likes
Like

Posted 26 November 2012 - 02:08 AM

Is there any reason not to use std::hash<std::string> ? Is it slower than the other hash functions mentioned above?

Edited by Nanook, 26 November 2012 - 02:11 AM.


#10 NightCreature83   Crossbones+   -  Reputation: 2702

Like
0Likes
Like

Posted 26 November 2012 - 03:47 AM

Is there any reason not to use std::hash<std::string> ? Is it slower than the other hash functions mentioned above?

std::hash will skip characters of the string if the string is longer then 10 characters, at least the VS2010 implementation does this. Other then that it won't be any slower or faster then your own implementation, the actual implementation looks like FNV anyway.

template<>
class hash<_STD string>
  : public unary_function<_STD string, size_t>
{ // hash functor
public:
typedef _STD string _Kty;
size_t operator()(const _Kty& _Keyval) const
  { // hash _Keyval to size_t value by pseudorandomizing transform
  size_t _Val = 2166136261U;
  size_t _First = 0;
  size_t _Last = _Keyval.size();
  size_t _Stride = 1 + _Last / 10;
  for(; _First < _Last; _First += _Stride)
   _Val = 16777619U * _Val ^ (size_t)_Keyval[_First];
  return (_Val);
  }
};

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, Mad Max

#11 Álvaro   Crossbones+   -  Reputation: 12471

Like
0Likes
Like

Posted 26 November 2012 - 09:11 AM

That trick of only looking at a maximum of 10 characters seems like a cheat to do better at some benchmark, but it's kind of dangerous for a general string hash. Note that IMG_0001.JPG, IMG_0002.JPG etc. will all collide. You may also think of using a hash to check if a file has changed, but that will fail all the time.

I have used CRC32 in the past as a string hash, and it works great. I think it was suggested in "The Art of Computer Programming".

Edited by Álvaro, 26 November 2012 - 09:13 AM.


#12 SiCrane   Moderators   -  Reputation: 9496

Like
0Likes
Like

Posted 26 November 2012 - 09:24 AM

std::hash will skip characters of the string if the string is longer then 10 characters, at least the VS2010 implementation does this.

Definitely not a feature of std::hash in general. This was fixed in MSVC 2012.

#13 NightCreature83   Crossbones+   -  Reputation: 2702

Like
0Likes
Like

Posted 26 November 2012 - 01:03 PM


std::hash will skip characters of the string if the string is longer then 10 characters, at least the VS2010 implementation does this.

Definitely not a feature of std::hash in general. This was fixed in MSVC 2012.

Yeah I saw they fixed it, it's also a FNV1a hash function in VS2012 so there is no reason to write your own anymore.
Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, Mad Max

#14 Hodgman   Moderators   -  Reputation: 28603

Like
0Likes
Like

Posted 26 November 2012 - 06:10 PM

You might still want to write your own if you're supporting pre C++11 compilers, or if you need to ensure that your hashing algorithm exactly matches some external specification or data-set.

#15 LorenzoGatti   Crossbones+   -  Reputation: 2616

Like
0Likes
Like

Posted 27 November 2012 - 02:51 AM

I am trying to implement a user-friendly way to label and identify my in-game assets and entities. I have been using an enumeration system where each class of entity or asset will have an enumeration entry, making the program easier to understand. For me, this works. However, I have read about string hashing and how this can be used to give designers and level editors (i.e. myself in the future) the ability to search for and create entities and assets using strings, but still maintains the run-time performance of integer identification.

If by "enumeration" you mean code like
...
STANDING_SOLDIER_SPRITE= "soldier1.png",
CROUCHING_SOLDIER_SPRITE= "soldier2.png",
...
or (with objects)
...
/*spritesheet, frame count, mode*/
STANDING_SOLDIER_SPRITE= AnimatedSprite("soldier1.png",5,AnimatedSprite.PING_PONG),
CROUCHING_SOLDIER_SPRITE= AnimatedSprite("soldier2.png",3,AnimatedSprite.ONCE_FORWARD),
...
the inflexibility in case you want to edit levels and make mods is obvious; you need to move almost all references to identifiers of any kind (symbols in code, strings, arbitrary numbers, file positions, etc.) from code to data files.

But in a level editor and in data files, human-friendly strings are an optional attribute of assets. When you create an asset, you can give it a unique ID (usually a short string), used to reference the asset from game levels and other assets, and prompt the user to enter a friendly name and possibly tags and arbitrary attributes (which will only be used in debug output and in the level editor).

Hashing has no place in generating unique asset IDs: in a mod-oriented architecture, with multiple archives partly replacing each other's content, you need to use the "standard" names without being clever (given a set of archives, tools can index asset IDs to tell the user what assets are being intentionally or unintentionally shadowed by another with the same ID, whether a referenced asset ID is missing, and whether the added IDs of added assets are actually unique as they should); if instead everything is compiled and consolidated, you have a finite and closed set of asset IDs in use and if you want integers for efficiency reasons you can simply number them in any arbitrary order.
Produci, consuma, crepa

#16 dougbinks   Members   -  Reputation: 480

Like
0Likes
Like

Posted 27 November 2012 - 10:52 AM

You may want to check out: http://www.altdevblogaday.com/2011/10/27/quasi-compile-time-string-hashing/ for quasi compile time hashing and some further information. Also the murmur hash from Bitsquid's open source foundation may be useful https://bitbucket.org/bitsquid/foundation/src (also see the blog about this here: http://www.altdevblogaday.com/2012/11/01/bitsquid-foundation-library/ ).

Edited by dougbinks, 27 November 2012 - 10:53 AM.





Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.



PARTNERS