Fastest string hash algorithm

Started by
18 comments, last by _Sigma 16 years, 10 months ago
I'm looking for the most lightweight, fastest hash function for hashing strings of approx 10char length. Faster than O(logn) is needed (this is what I'm currently using - Ie: the std::map). I'm using this in a time critical section of my game engine, so speed and small overhead is required. Max I expect to be storing is 20 or 30 strings. I have some ideas on how to do this, I just didn't want to put all the time and effort into this if there is some well know, fast algorithm for this. Cheers
Advertisement
If you're using it in a time-critical area, the best thing to do is to stop using it in a time-critical area. Boil the strings down to identifiers before you ever do time-critical stuff. String comparisons are a poor substitute for an enumeration.

Failing that, if you want maximum possible speed, and in particular if the set of strings is known ahead of time, look into using a perfect hash function.
Quote:If you're using it in a time-critical area, the best thing to do is to stop using it in a time-critical area.

This really isn't an option. This is used in the scripting portion of the engine (using Angel Script) where I need to get at the ID of a function using its function declaration (given in string). So when one calls a function, I need to be able to quickly get at the function ID, as generating it on the fly is very, very slow.

Quote:Boil the strings down to identifiers
What do you mean?
How about not hashing in the first place?

Use a custom string class (or std::string wrapper), that, on creation, or prior to insertion into the map, calculates the hash.

Better yet, provide this wrapper at the point of insertion into the map.

This way you make "hashing" cost the same as an int comparison.

I'm using this aproach for containers with hundreds of items, and the access rates are at around 20 million/second. No other optimization whatsoever. If I wanted, I could probably double or triple that. The lookup performance may be O( something ), but in reality it behaves close to O(1) for up to 100 items.

Keep in mind that keys cannot (or at very least should not) change.

Quote:What do you mean?


class StringKey{public:  StringKey( std::string &s )    : crc( calculateHashOfString(s) )  {}  // lookup the syntax for this, it escapes me right now  bool operator<( const String &other ) {    return crc < other.crc;  }private:  int crc;}


Don't even need to modify your std::map.
Quote:Original post by _Sigma
[...]
Quote:Boil the strings down to identifiers
What do you mean?
He means do the lookup before you get into someplace time critical, and use the ID foreverafter. You should be able to convert all the strings before they're used unless they're dynamically generated, in which case there isn't much (if anything) you can do to improve speed.

One possible speedup would be to put the strings into a single character array and in the same memory block create your own balanced tree. This would significantly improve cache hits, but would also be a lot of extra code for probably little benefit.
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
Quote:Original post by _Sigma
Quote:Boil the strings down to identifiers
What do you mean?

Well, let's take a closer look at your particular situation.

Quote:This is used in the scripting portion of the engine (using Angel Script) where I need to get at the ID of a function using its function declaration (given in string).

Where did the function declaration come from? Why is something storing the declaration instead of the ID in the first place?
Sorry, I'd hoped that an explination of my implimentation wouldn't be needed.

My engine is designed to be rather modular. That is, as long as you expose the same interface, you can drop out the 2D rendered, and drop in a 3D one with little code modifications. Obviously there will be, but you get the idea.

Like I said previous, I am using Angel Script as a way of controling the game.

As some of you may or may not know, to call a functon(that is, call a function that resides in a script) from the client(game engine), you need to obtain a function ID by passing AS the declaration of the function. For example "void foo(int x)" is function ID 46. This can only be obtained at runtime, and is rather expensive.

The current desgin is for control of the higher level concepts to be handled in the script (trigger event "X" when the player moves to tile (x,y). Or "On left arrow, move the sprite left 10px), while the "low level" concepts (rendering, processing input, loading config files, logging) is to be handled in a transparent fashion by the engine. Execution of a single iteration of the game loop looks like:
GetTimeGetInputCallScriptMainProcessAIProcessSoundRenderEndFPS


If we want to maintain the state of variables for each call to CallScriptMain, we need either global varaibles, or the below solution
//scriptvoid main(){StartUP()while(1==1){//processengine.pause()//returns control to the engine by pausing the script.}}
Execution needs to be returned to the engine after each frame, so Engine.Pause() allows us to stop the script execution, process the rendering etc, then on the next call to CallScriptMain, we know that we have a paused script, so we can continue where we left off, and the script is none the wiser.

With this design, we can then register script function callbacks to handle various events. For example, we can register a KEY_DOWN event on pressing the LEFT_ARROW key.
//scriptvoid left(){//move}void main(){Engine.RegisterKeyDown(ARROW_LEFT,"void left()");while(1==1){//...engine.pause();}}

When we poll the keyboard in the main loop, we can call the left() funciton. The first time we call a function, we need to generate the ID then store it. Currently I'm using a std::map<std::pair<std::string,int>> to store this, but I worry that it is a little slow, expecially if we have callbacks for lots of keys.

So for each iteration of the game loop, we need to be looking up the IDs. That being said, we can compute the "void main()" fucntion ID upon initializationg, but the others need to be looked up, as we don't know any of the functions at compile time - only run time.

Therein lies the trouble. My idea was to use a hash table which, with a nice quick hash function, would provide O(1) access into the table.

One idea I had was to use this to produce a DLL containing the hashfunctions based on the script(s), which I could then use from the engine in a dynamic fashion. Seems a little roundabout though. As well, it limits the simplicity of the script, decreasing the quick turnaround for editing them. that is, we are now 'compiling' portions of the script (for lack of a better word) which is potentially going to take longer than a recompile of the engine (maybe...probably not).

It should be noted that this is not only storing callbacks for keys, but rather, the function IDs for any fucntion we want to call that resides in a script.

Cheers
You've got a lot of extraneous information there, but from what you're saying, it sounds like Engine.RegisterKeyDown(ARROW_LEFT,"void left()"); stores "void left()" somewhere, and each time through the game loop it that is resolved to a function ID. So why not just resolve it to a function ID in the first place, and store that instead of the string?
Quote:Original post by Sneftel
You've got a lot of extraneous information there,
Sorry, wasn't sure how much was needed!

Quote:So why not just resolve it to a function ID in the first place, and store that instead of the string?
Because we have no way of knowing what function goes with a given ID. It has to be a declaration-int pair.
Quote:Original post by _Sigma
Quote:So why not just resolve it to a function ID in the first place, and store that instead of the string?
Because we have no way of knowing what function goes with a given ID.

I'm still a little confused by your situation, then. I'll try to give you some general methodology.

It sounds like you're doing some sort of late binding thing. A bunch of objects A1, A2, ... An need to map to a bunch of other objects B1, B2, ... Bm. In your case, the Bs are the functions, and the As are... well, something. Now, for a given i, you have no idea what the j is such that Ai binds to Bj. However, each Ai will produce a "name", and you have a lookup function which maps those names to the proper Bs. So denoting the function which extracts the "B name" which Ai cares about as Name(), and the lookup from a name to a B as Lookup(), you're doing Lookup(Name(Ai)) for each Ai. Every frame. In other words, each frame you are doing a lot of lookups which always resolve the same way. You're doing work over and over again, which is okay, but it's better if that work is lightweight.

First the easy way: instead of doing Lookup(Name(Ai)) every frame, do Lookup(Name(Ai)) when its value first becomes known. Assuming that Lookup() never changes (that is, functions aren't being defined each frame) that cached value means that you only have to do an array lookup.

But wait! What if Bs are being added and deleted? What if a name is decided on by an A BEFORE the corresponding B even exists? This is harder, but still perfectly doable: whenever a B is created, just fix up all the references to it. This is a slow-ish operation if you do it wrong, a reasonably fast operation if you do it right, and you only have to do it when a new B is created. Assuming that happens much less often than the lookups were, this is a significant gain.

This topic is closed to new replies.

Advertisement