Sign in to follow this  
_Sigma

Fastest string hash algorithm

Recommended Posts

I'm looking for the most lightweight, fastest hash function for hashing strings of approx 10char length. Faster than O(logn) is needed (this is what I'm currently using - Ie: the std::map). I'm using this in a time critical section of my game engine, so speed and small overhead is required. Max I expect to be storing is 20 or 30 strings. I have some ideas on how to do this, I just didn't want to put all the time and effort into this if there is some well know, fast algorithm for this. Cheers

Share this post


Link to post
Share on other sites
If you're using it in a time-critical area, the best thing to do is to stop using it in a time-critical area. Boil the strings down to identifiers before you ever do time-critical stuff. String comparisons are a poor substitute for an enumeration.

Failing that, if you want maximum possible speed, and in particular if the set of strings is known ahead of time, look into using a perfect hash function.

Share this post


Link to post
Share on other sites
Quote:
If you're using it in a time-critical area, the best thing to do is to stop using it in a time-critical area.

This really isn't an option. This is used in the scripting portion of the engine (using Angel Script) where I need to get at the ID of a function using its function declaration (given in string). So when one calls a function, I need to be able to quickly get at the function ID, as generating it on the fly is very, very slow.

Quote:
Boil the strings down to identifiers
What do you mean?

Share this post


Link to post
Share on other sites
How about not hashing in the first place?

Use a custom string class (or std::string wrapper), that, on creation, or prior to insertion into the map, calculates the hash.

Better yet, provide this wrapper at the point of insertion into the map.

This way you make "hashing" cost the same as an int comparison.

I'm using this aproach for containers with hundreds of items, and the access rates are at around 20 million/second. No other optimization whatsoever. If I wanted, I could probably double or triple that. The lookup performance may be O( something ), but in reality it behaves close to O(1) for up to 100 items.

Keep in mind that keys cannot (or at very least should not) change.

Quote:
What do you mean?



class StringKey
{
public:
StringKey( std::string &s )
: crc( calculateHashOfString(s) )
{}
// lookup the syntax for this, it escapes me right now
bool operator<( const String &other ) {
return crc < other.crc;
}
private:
int crc;
}




Don't even need to modify your std::map.

Share this post


Link to post
Share on other sites
Quote:
Original post by _Sigma
[...]
Quote:
Boil the strings down to identifiers
What do you mean?
He means do the lookup before you get into someplace time critical, and use the ID foreverafter. You should be able to convert all the strings before they're used unless they're dynamically generated, in which case there isn't much (if anything) you can do to improve speed.

One possible speedup would be to put the strings into a single character array and in the same memory block create your own balanced tree. This would significantly improve cache hits, but would also be a lot of extra code for probably little benefit.

Share this post


Link to post
Share on other sites
Quote:
Original post by _Sigma
Quote:
Boil the strings down to identifiers
What do you mean?

Well, let's take a closer look at your particular situation.

Quote:
This is used in the scripting portion of the engine (using Angel Script) where I need to get at the ID of a function using its function declaration (given in string).

Where did the function declaration come from? Why is something storing the declaration instead of the ID in the first place?

Share this post


Link to post
Share on other sites
Sorry, I'd hoped that an explination of my implimentation wouldn't be needed.

My engine is designed to be rather modular. That is, as long as you expose the same interface, you can drop out the 2D rendered, and drop in a 3D one with little code modifications. Obviously there will be, but you get the idea.

Like I said previous, I am using Angel Script as a way of controling the game.

As some of you may or may not know, to call a functon(that is, call a function that resides in a script) from the client(game engine), you need to obtain a function ID by passing AS the declaration of the function. For example "void foo(int x)" is function ID 46. This can only be obtained at runtime, and is rather expensive.

The current desgin is for control of the higher level concepts to be handled in the script (trigger event "X" when the player moves to tile (x,y). Or "On left arrow, move the sprite left 10px), while the "low level" concepts (rendering, processing input, loading config files, logging) is to be handled in a transparent fashion by the engine. Execution of a single iteration of the game loop looks like:

GetTime
GetInput
CallScriptMain
ProcessAI
ProcessSound
Render
EndFPS


If we want to maintain the state of variables for each call to CallScriptMain, we need either global varaibles, or the below solution

//script
void main()
{
StartUP()
while(1==1)
{

//process
engine.pause()//returns control to the engine by pausing the script.
}
}



Execution needs to be returned to the engine after each frame, so Engine.Pause() allows us to stop the script execution, process the rendering etc, then on the next call to CallScriptMain, we know that we have a paused script, so we can continue where we left off, and the script is none the wiser.

With this design, we can then register script function callbacks to handle various events. For example, we can register a KEY_DOWN event on pressing the LEFT_ARROW key.

//script
void left()
{
//move
}

void main()
{
Engine.RegisterKeyDown(ARROW_LEFT,"void left()");
while(1==1)
{
//...
engine.pause();
}
}




When we poll the keyboard in the main loop, we can call the left() funciton. The first time we call a function, we need to generate the ID then store it. Currently I'm using a std::map<std::pair<std::string,int>> to store this, but I worry that it is a little slow, expecially if we have callbacks for lots of keys.

So for each iteration of the game loop, we need to be looking up the IDs. That being said, we can compute the "void main()" fucntion ID upon initializationg, but the others need to be looked up, as we don't know any of the functions at compile time - only run time.

Therein lies the trouble. My idea was to use a hash table which, with a nice quick hash function, would provide O(1) access into the table.

One idea I had was to use this to produce a DLL containing the hashfunctions based on the script(s), which I could then use from the engine in a dynamic fashion. Seems a little roundabout though. As well, it limits the simplicity of the script, decreasing the quick turnaround for editing them. that is, we are now 'compiling' portions of the script (for lack of a better word) which is potentially going to take longer than a recompile of the engine (maybe...probably not).

It should be noted that this is not only storing callbacks for keys, but rather, the function IDs for any fucntion we want to call that resides in a script.

Cheers

Share this post


Link to post
Share on other sites
You've got a lot of extraneous information there, but from what you're saying, it sounds like Engine.RegisterKeyDown(ARROW_LEFT,"void left()"); stores "void left()" somewhere, and each time through the game loop it that is resolved to a function ID. So why not just resolve it to a function ID in the first place, and store that instead of the string?

Share this post


Link to post
Share on other sites
Quote:
Original post by Sneftel
You've got a lot of extraneous information there,
Sorry, wasn't sure how much was needed!

Quote:
So why not just resolve it to a function ID in the first place, and store that instead of the string?
Because we have no way of knowing what function goes with a given ID. It has to be a declaration-int pair.

Share this post


Link to post
Share on other sites
Quote:
Original post by _Sigma
Quote:
So why not just resolve it to a function ID in the first place, and store that instead of the string?
Because we have no way of knowing what function goes with a given ID.

I'm still a little confused by your situation, then. I'll try to give you some general methodology.

It sounds like you're doing some sort of late binding thing. A bunch of objects A1, A2, ... An need to map to a bunch of other objects B1, B2, ... Bm. In your case, the Bs are the functions, and the As are... well, something. Now, for a given i, you have no idea what the j is such that Ai binds to Bj. However, each Ai will produce a "name", and you have a lookup function which maps those names to the proper Bs. So denoting the function which extracts the "B name" which Ai cares about as Name(), and the lookup from a name to a B as Lookup(), you're doing Lookup(Name(Ai)) for each Ai. Every frame. In other words, each frame you are doing a lot of lookups which always resolve the same way. You're doing work over and over again, which is okay, but it's better if that work is lightweight.

First the easy way: instead of doing Lookup(Name(Ai)) every frame, do Lookup(Name(Ai)) when its value first becomes known. Assuming that Lookup() never changes (that is, functions aren't being defined each frame) that cached value means that you only have to do an array lookup.

But wait! What if Bs are being added and deleted? What if a name is decided on by an A BEFORE the corresponding B even exists? This is harder, but still perfectly doable: whenever a B is created, just fix up all the references to it. This is a slow-ish operation if you do it wrong, a reasonably fast operation if you do it right, and you only have to do it when a new B is created. Assuming that happens much less often than the lookups were, this is a significant gain.

Share this post


Link to post
Share on other sites
Quote:
Original post by _Sigma
[....]
Quote:
So why not just resolve it to a function ID in the first place, and store that instead of the string?
Because we have no way of knowing what function goes with a given ID. It has to be a declaration-int pair.
You obviously know which function goes with a given ID at some point or you wouldn't be able to look it up ever. If you're actually loading and parsing the script inside your time-critical area, that is your problem right there. If not, you can look up the ID when it first becomes available and then you don't need to look it up again.

Share this post


Link to post
Share on other sites
Quote:
but I worry that it is a little slow


So you've not actually profiled you're just optimizing early for no good reason?

Now, as Sneftel said, you have no valid or good reason why the function Engine.RegisterKeyDown() cannot, given the string "void left()" turn that into (say) an int and then store that in std::map<std::pair<int, int> > instead of storing the string and calculating the ID each iteration. I don't think you get what we're saying as I see NO problem here like you're describing if you just look at our solutions.

You may not know it until runtime but who cares. My entire engine is data driven and nothing is known until it starts up. But once you register the left arrow key to be handled by a function you now KNOW what the function name is and can calculate the ID and store it. No need to recalculate this each iteration of the engine.

I think you're just hung up on hashing when none is needed :)

Share this post


Link to post
Share on other sites
Quote:
Original post by Mike2343
Quote:
but I worry that it is a little slow

So you've not actually profiled you're just optimizing early for no good reason?
Well I have, yet I have nothing to compare it too...Therefore I worry that it is not the best solution, and I'm looking for something faster. I'm pulling about 130fps with a debug build, and there is no rendering that is taking place, so something isn't running as fast as it should.

Quote:

Now, as Sneftel said, you have no valid or good reason why the function Engine.RegisterKeyDown() cannot, given the string "void left()" turn that into (say) an int and then store that in std::map<std::pair<int, int> >

Yes this seems like a good sugestion. Does this not require some form of hashing though?

Quote:

and calculating the ID each iteration.
I'm not...

Quote:
If you're actually loading and parsing the script inside your time-critical area, that is your problem right there.

I'm not doing this, because, as you say, its slow.
Quote:

If not, you can look up the ID when it first becomes available and then you don't need to look it up again.

Well not quite. I need to look up the ID whenever a script funciton is called, so computing it is done once, but I need to hold onto it, else I have to recompute it which is very slow...

Quote:
Quote:
Original post by _Sigma

Quote:
So why not just resolve it to a function ID in the first place, and store that instead of the string?

Because we have no way of knowing what function goes with a given ID.


I'm still a little confused by your situation, then.

I miss read your reply. I think that this is a good solution...

For the moment, it sounds like you guys are saying storing a string,int pair is dumb. So i'll move to a int,int pair. Yet, is there an even faster solution? Ontop of that, what is a good function for creating the int for the string?


Share this post


Link to post
Share on other sites
Quote:

I'm pulling about 130fps with a debug build,...so something isn't running as fast as it should.

Metrics taken on unoptimized (e.g., debug) code are worthless, so that's not useful information in the slightest.

Quote:

Well not quite. I need to look up the ID whenever a script funciton is called, so computing it is done once, but I need to hold onto it, else I have to recompute it which is very slow... For the moment, it sounds like you guys are saying storing a string,int pair is dumb. So i'll move to a int,int pair. Yet, is there an even faster solution? Ontop of that, what is a good function for creating the int for the string?

It's not that slow.

It's also possible to get constant-time actual lookup of the data you're after if you take some precautions and appropriately cook the data (you'll have to be willing to accept the use of a proxy object, though, that is, you only use the string when you specify the mapping initially, and the process of creating the mapping produces something like a FunctionIdentifier which is used in all subsequent lookups; this is usually sufficient as there is often no need to look up via the actual string more than once, initially at start up time).

Share this post


Link to post
Share on other sites
Quote:
Original post by _Sigma
I'm pulling about 130fps with a debug build, and there is no rendering that is taking place, so something isn't running as fast as it should.


It's a debug build. It's designed not to run as fast as it should. If you wish to measure the performance of your program, turn optimizations on.

Quote:

Yes this seems like a good sugestion. Does this not require some form of hashing though?


This hashing would occur at script loading time in a dozen different ways (such as incrementing UIDs).

Quote:
Well not quite. I need to look up the ID whenever a script funciton is called, so computing it is done once, but I need to hold onto it, else I have to recompute it which is very slow...


Stop representing your functions with strings, and start representing them witha scriptFunction type, which you can then implement using either a string or an ID.

Quote:
Yet, is there an even faster solution?


Yes, there is an even faster solution. Use an integer index to represent your functions, making comparisons and lookups O(1) with a simple array. Then, use the time you saved (because the solution is extremely simple to implement) doing this to optimize a more critical operation.

Share this post


Link to post
Share on other sites
Quote:
Original post by _Sigma
[...]I miss read your reply. I think that this is a good solution...

For the moment, it sounds like you guys are saying storing a string,int pair is dumb. So i'll move to a int,int pair. Yet, is there an even faster solution? Ontop of that, what is a good function for creating the int for the string?
You don't need a map at all.
First, you associate "somestring" with KEY_LEFT which is some kind of constant.
Next, you make a vector<ID>, size it appropriately, then do KeyFuncs[KEY_LEFT] = Lookup("somestring") and then you have constant time access very easily.

Also, as others have said, using debug mode for anything other than actually debugging is worthless. STL will be always excessively slow in debug mode.

Share this post


Link to post
Share on other sites
Damn, I gotta quite posting when I'm in a bit of a rush. I appologize for not giving more concise replies above.

In regards to the debug build, yes, I understand it is un-optomized, and slow. Its just that it seems *quite* slow? Guess it doesn't matter. I'll rebuild with release and take a look there.

Quote:
You don't need a map at all.
First, you associate "somestring" with KEY_LEFT which is some kind of constant.
Next, you make a vector<ID>, size it appropriately, then do KeyFuncs[KEY_LEFT] = Lookup("somestring") and then you have constant time access very easily.
Yes, this dawned on me a little while ago. I'm glad its a good approach! :) I will do this. Thanks.

Quote:
It's also possible to get constant-time actual lookup of the data you're after if you take some precautions and appropriately cook the data (you'll have to be willing to accept the use of a proxy object, though, that is, you only use the string when you specify the mapping initially, and the process of creating the mapping produces something like a FunctionIdentifier which is used in all subsequent lookups; this is usually sufficient as there is often no need to look up via the actual string more than once, initially at start up time).
Ok, sounds like a good solution. I'll look into this. What would be a good way of making the Identifier from the string?

Cheers

Share this post


Link to post
Share on other sites
Quote:
In regards to the debug build, yes, I understand it is un-optomized, and slow. Its just that it seems *quite* slow? Guess it doesn't matter. I'll rebuild with release and take a look there.


If it were not optimized, it would be great. It contains completely different debug-grade implementation. Which means, every memory allocation is checked, safeguards are inserted before and after memory allocations, all STL concepts use different implementations that perform multiple sanity checks, everything is interlaced with exceptions and exception handlers.

Even if you were to compile this code with release settings, it would be considerably slower. But fool-proof code, along with no optimizations is disaster.

If I consider the performance tests that include STL classes, the performance drop between release and debug mode is around 100-times. This doesn't mean 5 fps in debug, 500 in release, but the SC++L dependant part is insanely slow.

Share this post


Link to post
Share on other sites
Quote:

Ok, sounds like a good solution. I'll look into this. What would be a good way of making the Identifier from the string?

Essentially the way others have suggested. An intermediary object takes the string and the value the string represents, stores the value in a dynamic array (vector, probably) and returns a "handle" object that is a wrapper around the index that the object was stored at.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this