Hashtable size - prime or power of 2?

Quote:Original post by Extrarius
For example, you could construct a "perfect table" but instead of indexing it using table[state] use table[state ^ key] as a very simple "encryption" method, and then you can have unlimited functions and unknown worst case behavior while still theoretically having superior typical case behavior.


I don't have the time right now to investigate, but my first impression is that you aren't getting twice as many good hash bits by using two "encryption" keys.

Anyway, I would argue that the benefits of a "perfect" table are extremely marginal.

It is definitely trivial to guarantee that precisely half of the hash bits are affected by a single change to the input, by filling the Zobrist table with values that have precisely half their bits set.

There are 601,080,390 32-bit values that qualify, and there are algorithms that can both enumerate them in lexicographic order (bit twiddling) and pluck them arbitrarily out of the sequence based on their position within it (walking a depth-32 Pascal's triangle), as sketched below.
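For what it's worth, here is a minimal sketch of both operations (function names are mine, purely illustrative): the classic "next bit permutation" bit-twiddling step for enumerating the values in lexicographic order, and an unranking routine that walks binomial coefficients, i.e. a depth-32 Pascal's triangle, to pluck the k-th such value directly.

```cpp
#include <cstdint>

// Next 32-bit value with the same number of set bits, in increasing order
// (the classic "next bit permutation" hack; __builtin_ctz is a GCC/Clang builtin).
uint32_t next_same_popcount(uint32_t v) {
    uint32_t t = v | (v - 1);  // set all bits below the lowest set bit of v
    return (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
}

// Binomial coefficient C(n, k); fits comfortably in 64 bits for n <= 32.
uint64_t choose(int n, int k) {
    if (k < 0 || k > n) return 0;
    uint64_t c = 1;
    for (int i = 1; i <= k; ++i) c = c * (n - k + i) / i;
    return c;
}

// Unrank: return the rank-th (0-based, lexicographic) 32-bit value with
// exactly 16 set bits, deciding one bit per level of Pascal's triangle.
uint32_t unrank_half_bits(uint64_t rank) {
    uint32_t v = 0;
    int bits_left = 16;
    for (int bit = 31; bit >= 0 && bits_left > 0; --bit) {
        uint64_t with_bit_clear = choose(bit, bits_left); // values keeping this bit 0
        if (rank >= with_bit_clear) {  // rank falls in the "bit set" half
            v |= 1u << bit;
            rank -= with_bit_clear;
            --bits_left;
        }
    }
    return v;
}
```

unrank_half_bits(0) yields 0x0000FFFF, and repeatedly applying next_same_popcount from there walks all 601,080,390 qualifying values in increasing order.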

... Or, you could, you know, just use some form of BST from your language's standard library until you've established that this is the bottleneck. :)
Quote:Original post by Zahlman
... Or, you could, you know, just use some form of BST from your language's standard library until you've established that this is the bottleneck. :)


It is generally easy to produce worst-case behavior from the typical binary search tree implementation.

Usually something as trivial as throwing a sorted list at it for insertion will do the trick. At a minimum you want a guarantee that it's a splay, red-black, or other self-balancing tree.

Even with a self-balancing tree in place, we have not eliminated some obvious worst-case attacks, such as inserting aaaaaaaaaaa1, aaaaaaaaaaa2, aaaaaaaaaaa3, aaaaaaaaaaa4, aaaaaaaaaaa5, ... which forces searches in that area to run full comparisons that cannot early-out. So even though it's O(log n), the per-node constant can be pushed to its worst case.

It will be hard to identify what the worst-case performance of a BST will be unless you actually attack it. You cannot simply profile your BST-using program with typical inputs and decide that there isn't a bottleneck. You have to actively attack it.

Don't use BSTs to implement associative containers.
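A rough sketch of both failure modes, using a deliberately naive hand-rolled BST (the NaiveBST type is mine, purely for illustration): sorted insertion degenerates it into a linked list, and the long shared key prefix inflates the cost of every comparison even in a balanced tree such as std::set.

```cpp
#include <cstdio>
#include <memory>
#include <set>
#include <string>

// Deliberately naive, unbalanced BST, used only to show the degenerate case.
struct NaiveBST {
    struct Node {
        std::string key;
        std::unique_ptr<Node> left, right;
    };
    std::unique_ptr<Node> root;

    void insert(const std::string& key) {
        std::unique_ptr<Node>* cur = &root;
        while (*cur)
            cur = (key < (*cur)->key) ? &(*cur)->left : &(*cur)->right;
        *cur = std::make_unique<Node>();
        (*cur)->key = key;
    }
};

int main() {
    NaiveBST naive;
    std::set<std::string> balanced;

    for (int i = 0; i < 2000; ++i) {
        char buf[32];
        std::snprintf(buf, sizeof buf, "aaaaaaaaaaa%05d", i); // sorted order, shared prefix
        // Sorted insertion turns the naive tree into a linked list (depth n,
        // O(n) lookups); std::set rebalances and keeps depth O(log n).
        // The shared prefix also forces every comparison, in either structure,
        // to scan most of the key before finding a differing character.
        naive.insert(buf);
        balanced.insert(buf);
    }
}
```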
Quote:Original post by Rockoon1
Quote:Original post by Zahlman
... Or, you could, you know, just use some form of BST from your language's standard library until you've established that this is the bottleneck. :)


It is generally easy to produce worst-case behavior from the typical binary search tree implementation.

Usually something as trivial as throwing a sorted list at it for insertion will do the trick. At a minimum you want a guarantee that it's a splay, red-black, or other self-balancing tree.

Even with a self-balancing tree in place, we have not eliminated some obvious worst-case attacks, such as inserting aaaaaaaaaaa1, aaaaaaaaaaa2, aaaaaaaaaaa3, aaaaaaaaaaa4, aaaaaaaaaaa5, ... which forces searches in that area to run full comparisons that cannot early-out. So even though it's O(log n), the per-node constant can be pushed to its worst case.

It will be hard to identify what the worst-case performance of a BST will be unless you actually attack it. You cannot simply profile your BST-using program with typical inputs and decide that there isn't a bottleneck. You have to actively attack it.

Don't use BSTs to implement associative containers.
Zahlman was hinting at the SC++L (Standard C++ Library), which usually uses a red-black tree. These have a variance in depth of at most a factor of two between the shortest and longest root-to-leaf paths, so the worst case is never more than twice the best case. You get log n vs. 2 log n, which are both O(log n).
Even your aaaaaaaaaaa1 etc. example still only adds a constant multiplier to the time taken, which pales in comparison to a change from O(1) to O(n).

"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Quote:Original post by Rockoon1
Quote:Original post by Zahlman
... Or, you could, you know, just use some form of BST from your language's standard library until you've established that this is the bottleneck. :)


It is generally easy to produce worst-case behavior from the typical binary search tree implementation.

Usually something as trivial as throwing a sorted list at it for insertion will do the trick. At a minimum you want a guarantee that it's a splay, red-black, or other self-balancing tree.


Well, I can't vouch for other languages, but C++'s std::set effectively makes such a guarantee (by making guarantees about order of complexity of the various operations). And indeed, it is most commonly (from what I've seen) implemented with a red-black tree.

Quote:
It will be hard to identify what the worst-case performance of a BST will be unless you actually attack it. You cannot simply profile your BST-using program with typical inputs and decide that there isn't a bottleneck. You have to actively attack it.


Is this somehow more true of BSTs than of hashes? I'm not convinced.

Quote:Don't use BSTs to implement associative containers.


Well, I'd like to think that the C++ standards committee had a reason of some sort for including std::map in the 1998 standard and not any kind of hash_map. Feel free to speculate. :)

But anyway, why would I *implement* an associative container when one is already provided for me?

I would argue, BTW, that saying a server can be "attacked" by creating a large number of bogus accounts with a certain pattern of usernames is quite meaningless, because it requires much less intelligence to attack a server by DDoS, so that's what you'll normally see.
Quote:Original post by iMalc
Even your aaaaaaaaaaa1 etc. example still only adds a constant multiplier to the time taken, which pales in comparison to a change from O(1) to O(n).


Yes, I pointed out that it was a constant. But note that this constant can be made, on purpose and trivially, significantly larger than what a profiler will measure with "typical" inputs.

A hash table leveraging multiple good hash functions (such as universal hash functions), on the other hand, cannot be made to degrade to O(n) trivially. Forcing that degradation is hard even when the hash functions are known, and computationally infeasible when they are unknown, and that's the point.
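For reference, a minimal sketch of one such family, a Carter-Wegman style h(x) = ((ax + b) mod p) mod m (class name and constants are mine): because a and b are drawn at random per table instance, any fixed pair of distinct keys collides with probability roughly 1/m, no matter how the keys were chosen.

```cpp
#include <cstdint>
#include <random>

// Carter-Wegman universal hashing: h(x) = ((a*x + b) mod p) mod m, with p
// prime and a, b drawn at random when the table is created. An attacker who
// does not know a and b cannot precompute a set of colliding keys.
class UniversalHash {
public:
    explicit UniversalHash(uint64_t num_buckets) : m_(num_buckets) {
        std::mt19937_64 gen{std::random_device{}()};
        a_ = std::uniform_int_distribution<uint64_t>(1, kPrime - 1)(gen);
        b_ = std::uniform_int_distribution<uint64_t>(0, kPrime - 1)(gen);
    }

    // Keys should lie in [0, p) for the textbook universality bound;
    // 32-bit keys are close enough for illustration.
    uint64_t operator()(uint32_t key) const {
        return ((a_ * key + b_) % kPrime) % m_;  // fits in 64 bits: a_, key < 2^32
    }

private:
    static constexpr uint64_t kPrime = 4294967291ULL;  // largest prime below 2^32
    uint64_t a_, b_, m_;
};
```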

Quote:Original post by Zahlman
Well, I can't vouch for other languages, but C++'s std::set effectively makes such a guarantee (by making guarantees about order of complexity of the various operations). And indeed, it is most commonly (from what I've seen) implemented with a red-black tree.


Hey, I'm not bothered by the fact that it is there. It is the fact that it isn't performant that bothers many people. Yes, it's O(log n) and all that. This wheel is often reinvented with hash tables for a reason.

Quote:
Is this somehow more true of BSTs than of hashes? I'm not convinced.


It is not more "true" - it is more relevant.

Quote:
Well, I'd like to think that the C++ standards committee had a reason of some sort for including std::map in the 1998 standard and not any kind of hash_map. Feel free to speculate. :)


Free speculation: because it's a committee with a track record of not making a decision, even when it's long overdue (standardized threading, anyone?).

Quote:
But anyway, why would I *implement* an associative container when one is already provided for me?


You answered this question with your first post to this thread. You stated precisely when and why you would.

Quote:
I would argue, BTW, that saying a server can be "attacked" by creation of a large number of bogus accounts with a certain pattern of usernames, is quite meaningless, because it requires much less intelligence to attack a server by DDoS, so that's what you'll normally see.


Ah yes, the old "there are worse things" justification.

This same argument can be used to justify flawed implementations that allow stack-busting attacks - I guess they don't need fixing after all.
Heh, I thought this thread had died!

Thanks for all the feedback!
Hm. I would assume that your teacher illustrated his course by describing a simple hashing function, such as a rolling hash: old and inefficient, but simple to explain.

Such a hashing function naturally gives its best results when its output is reduced into a prime-sized range, so he probably meant that the hash table in his example was optimal when the table size was prime (because no memory would be wasted and no skew would occur), not that this applies to all hash tables in general.
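A small sketch of the kind of function ToohrVyk seems to have in mind (names are mine): a simple polynomial hash of the rolling-hash family, where the final reduction modulo a prime-sized table is what compensates for the weak mixing.

```cpp
#include <cstddef>
#include <string>

// Simple polynomial hash of the rolling-hash family, via Horner's rule:
// h = c0*B^(n-1) + c1*B^(n-2) + ... + c(n-1). Cheap, but the mixing is weak.
std::size_t poly_hash(const std::string& s, std::size_t base = 31) {
    std::size_t h = 0;
    for (unsigned char c : s)
        h = h * base + c;
    return h;
}

// Reducing a weak hash like this modulo a prime bucket count tends to spread
// keys more evenly than masking with a power of two, which keeps only the
// low-order bits of the hash.
std::size_t bucket_index(const std::string& s, std::size_t prime_table_size) {
    return poly_hash(s) % prime_table_size;
}
```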

