Designing a hashtable

Original post by induction
Hi. I'm designing a hash table implementation for a game that has a voting system. Each vote has a timestamp, a unique number (each voter has their own number), and of course the vote number. Unique numbers are 1-15 digits long. There isn't really a maximum for the vote count, so it could be anything from 1 to 2^32-1.

Here's what I've got now. My hash function takes each digit of the unique number and multiplies it by its 1-based position counted from the right. For example, "51455" becomes 5*5 + 1*4 + 4*3 + 5*2 + 5*1 = 56, so the vote is added at hash table index 56. Each index holds a linked list, and new nodes are appended to the end of the list. Each node has a timestamp field, the unique number, and a table of the vote candidates (12 contestants), since each person can vote at most 120 times (12 contestants, max. 10 votes per contestant).

And then the problem: it's too slow. I tested it with 400k votes and it took a minute. Any ideas for improvements? I guess the problem is in the hash function. Thanks.
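In code, the hash function is roughly this (a simplified sketch; the function name is made up):

#include <string>

// Weighted digit sum: each digit times its 1-based position counted
// from the right, e.g. "51455" -> 5*5 + 1*4 + 4*3 + 5*2 + 5*1 = 56.
unsigned int hash_unique_number(const std::string& id)
{
    unsigned int h = 0;
    for (std::string::size_type i = 0; i < id.size(); ++i)
        h += (id[i] - '0') * (id.size() - i);
    return h;
}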

Quote:
Original post by induction
Any ideas for improvements? I guess the problem is in the hash function.


There's your first problem, all right - guessing.

Did you roll your own hash table, or use an existing implementation?
Why not use a profiler to determine what the actual bottlenecks are?

Depending on language, representation of digits and hashes is critical.

First improvement - use (if not already) an existing hash table. The fact that you mention linked lists (de facto the slowest choice, possibly by a factor of 50 vs. alternate implementations) hints at a homegrown table.
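For example (a sketch; std::unordered_map in current C++, or std::tr1::unordered_map / SGI hash_map on older toolchains - the record layout is made up from your description):

#include <string>
#include <unordered_map>  // std::tr1::unordered_map on older compilers

// Hypothetical record: 12 contestants, at most 10 votes per contestant.
struct VoteRecord {
    long timestamp;
    int  votes[12];
};

// The library chooses the hash function, bucket count and collision
// strategy; no hand-rolled linked lists required.
std::unordered_map<std::string, VoteRecord> votes_by_voter;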

Secondly - calculate each hash once and once only, and store it with the string. Depending on cache and other micro-optimizations, the order of magnitude you should be looking at here is 1 second to populate (after pre-calculating the hashes).

IO and vote parsing could, depending on other factors, take much longer than that.

Quote:
Original post by Antheus
Did you roll your own hash table, or use an existing implementation?

Own.

Quote:
Original post by Antheus
Why not use a profiler to determine what the actual bottlenecks are?

I've never used a profiler before, but I guess I'll have to try one.

Quote:
Original post by Antheus
Depending on language, representation of digits and hashes is critical.

C++.

Quote:
Original post by Antheus
First improvement - use (if not already) an existing hash table. The fact that you mention linked lists (de facto the slowest choice, possibly by a factor of 50 vs. alternate implementations) hints at a homegrown table.

True, I could use an existing hash table. But this is for learning purposes also. :)

Quote:
Original post by Antheus
Secondly - calculate each hash once and once only, and store it with the string. Depending on cache and other micro-optimizations, the order of magnitude you should be looking at here is 1 second to populate (after pre-calculating the hashes).

Thanks, I'll try this.

Quote:
Original post by induction

Own.


Run this:
std::map<int, std::string> foo;

// std::map has no push_back; insert key/value pairs instead
for (int i = 0; i < 400000; ++i) { foo.insert(std::make_pair((i * 17) % 10, std::string())); }

Time how long it takes. Compare to your version.
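For instance, with std::clock (a rough sketch; any timer will do):

#include <cstdio>
#include <ctime>
#include <map>
#include <string>

int main()
{
    std::clock_t start = std::clock();

    std::map<int, std::string> foo;
    for (int i = 0; i < 400000; ++i)
        foo.insert(std::make_pair((i * 17) % 10, std::string()));

    std::printf("populate took %.3f s\n",
                double(std::clock() - start) / CLOCKS_PER_SEC);
    return 0;
}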

Quote:
C++.

struct Entry {
    Entry(..., ...) {
        hash_code = ...; // compute hash_code once, here
    }
    int get_hash_code() const { return hash_code; }
private:
    int hash_code;
};
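Filled in for this thread's data it might look something like the following (a sketch; the field names and the placeholder hash are mine, substitute your own):

#include <string>

struct Entry {
    Entry(const std::string& unique_number, long timestamp)
        : unique_number(unique_number),
          timestamp(timestamp),
          hash_code(compute_hash(unique_number)) // computed exactly once
    {}

    int get_hash_code() const { return hash_code; }

private:
    // Placeholder hash - substitute whatever function you already use.
    static int compute_hash(const std::string& s)
    {
        int h = 0;
        for (std::string::size_type i = 0; i < s.size(); ++i)
            h = h * 31 + s[i];
        return h;
    }

    std::string unique_number;
    long        timestamp;
    int         hash_code; // cached; lookups never re-hash the string
};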


Quote:
True, I could use an existing hash table. But this is for learning purposes also. :)


To learn about hash tables, you will need to study the theory. To learn about hash table implementations, you'll need to study CPU and memory architecture.

Today's implementations have a lot of mileage behind them, and since hash tables are arguably the most important data structure in use today, a lot of PhD- and Google-grade research has gone into them.

It is no longer possible to just whip up something naively and have it outperform existing implementations.

Quote:
Original post by Antheus
Quote:
Original post by induction

Own.


Run this:
std::map<int, std::string> foo;

// std::map has no push_back; insert key/value pairs instead
for (int i = 0; i < 400000; ++i) { foo.insert(std::make_pair((i * 17) % 10, std::string())); }

Time how long it takes. Compare to your version.

Quote:
C++.

struct Entry {
    Entry(..., ...) {
        hash_code = ...; // compute hash_code once, here
    }
    int get_hash_code() const { return hash_code; }
private:
    int hash_code;
};


Yeah, that is a lot faster.

Quote:

To learn about hash tables, you will need to study the theory. To learn about hash table implementations, you'll need to study CPU and memory architecture.

Today's implementations have a lot of mileage behind them, and since hash tables are arguably the most important data structure in use today, a lot of PhD- and Google-grade research has gone into them.

It is no longer possible to just whip up something naively and have it outperform existing implementations.


Guess so. I googled some implementations and they really weren't that trivial. :)

Anyway, I tried the gprof profiler, and it says that 42% of the time is spent in this:

__gnu_cxx::__enable_if<std::__is_char<char>::__value, bool>::__type std::operator==<char>(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)


Does that mean a string comparison, i.e. if (string1 == string2)?
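If so, I suppose the fix follows from the earlier advice: cache the full hash in each node, pick the bucket as hash % table_size, and compare the ints before the strings. A rough sketch of what I mean (node layout simplified):

#include <string>

// The full hash is cached at insert time; with bucket = hash % table_size,
// one bucket can hold nodes with different hashes.
struct Node {
    std::string  unique_number;
    unsigned int hash;
    Node*        next;
};

// The cheap integer test rejects nearly every non-matching node, so
// std::string's operator== only runs on a likely real match.
Node* find_in_bucket(Node* head, unsigned int hash, const std::string& key)
{
    for (Node* n = head; n != 0; n = n->next)
        if (n->hash == hash && n->unique_number == key)
            return n;
    return 0;
}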

