tanzanite7

birthday "paradox" trouble.


Recommended Posts

Hi. I'm searching for a hash function for thousands of little strings (1-64 bytes ... rarely more) to identify them with enough probability. Unfortunately, 32-bit functions run a bit short because of the so-called birthday "paradox" :( so 64 bits are needed. Google has not helped so far: what it turns up either doesn't seem to fit my needs or I just don't know enough to tell.

Needs: 64-bit, of course. Fast! Designed towards a minimal amount of collisions.

Does not need: security. I'm completely ruling out any kind of bad intentions (it wouldn't make any sense ... and if anyone wants to break it, they are free to do so).

Any pointers to what I could/should use/investigate?
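
For a rough sense of the numbers: with the usual birthday approximation p ≈ 1 - e^(-n(n-1)/2m) for n strings hashed into m possible values, a hypothetical 100,000 strings (the count above is only given as "thousands") collide with roughly 69% probability at 32 bits, but only about 3e-10 at 64 bits. A minimal sketch of that calculation, assuming C++ and the illustrative count:

#include <cmath>
#include <cstdio>

int main() {
    const double n   = 100000.0;                   // assumed string count; the post only says "thousands"
    const double m32 = 4294967296.0;               // 2^32 possible hash values
    const double m64 = 18446744073709551616.0;     // 2^64 possible hash values
    // birthday approximation: p(collision) ~= 1 - exp(-n*(n-1) / (2*m))
    std::printf("32-bit: %.3f\n", 1.0 - std::exp(-n * (n - 1.0) / (2.0 * m32)));  // ~0.69
    std::printf("64-bit: %.3g\n", 1.0 - std::exp(-n * (n - 1.0) / (2.0 * m64)));  // ~2.7e-10
    return 0;
}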

I've used 32-bit FNV (Fowler-Noll-Vo) successfully on dictionaries holding ~300K words (a few collisions were acceptable). You sure you have enough strings to warrant 64-bit hashes, or is it that they tend to be short? If so, you can try the 64-bit extension to FNV.
As to efficiency, it's pretty tight for a process-each-byte-sequentially algo (the usual case); the loop does {load, mul, xor}.
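
For reference, a minimal sketch of the 64-bit variant (this shows the FNV-1a ordering, xor then multiply; the constants are the published 64-bit offset basis and prime, and the function name is just for illustration):

#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a: fold each byte in with xor, then multiply by the FNV prime.
inline std::uint64_t fnv1a_64(const void* data, std::size_t len)
{
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::uint64_t h = 14695981039346656037ULL;   // 64-bit FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= bytes[i];
        h *= 1099511628211ULL;                   // 64-bit FNV prime
    }
    return h;
}

// usage: fnv1a_64(name.data(), name.size()) for a std::string name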

Now Google is talking to me: a search for FNV 64 (http://www.google.com/search?hs=tfO&hl=en&lr=&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&q=FNV+64&btnG=Search) turned up quite a few nice links about FNV (I hadn't ever heard about it before).

thanks :)

>You sure you have enough strings to warrant 64-bit hashes, or is it that they tend to be short? If so, you can try the 64-bit extension to FNV.<

The strings are usually short (around 64 bytes ... some sets are around 256 too, though), but the problem lies in the fact that I can't detect collisions (I don't have the original string around to check - I throw them out of the window pretty fast (too much memory used)). When collisions do happen and I get some false positives it won't be nice, but cities won't burn to the ground either. => 64b

Quote:
Original post by tanzanite7
The strings are usually short (around 64 bytes ... some sets are around 256 too, though), but the problem lies in the fact that I can't detect collisions (I don't have the original string around to check - I throw them out of the window pretty fast (too much memory used)). When collisions do happen and I get some false positives it won't be nice, but cities won't burn to the ground either. => 64b
Then what you want to do is not possible to do perfectly. You have to either keep the original around, or make your hash as big as the largest original item, which would defeat the point. Otherwise you would have to re-read the original whenever you got a collision, which again means you have to store it somewhere.

How much memory are we really talking about here? If you're programming for a PC then surely RAM couldn't possibly be an issue.

There's probably a good chance you could store all of the original strings by utilising various compression techniques. LZW might work well for example.

Glad to :)

Quote:
The strings are usually short (around 64 bytes ... some sets are around 256 too, though), but the problem lies in the fact that I can't detect collisions (I don't have the original string around to check - I throw them out of the window pretty fast (too much memory used)). When collisions do happen and I get some false positives it won't be nice, but cities won't burn to the ground either. => 64b

OK, understood. The reason I could get away with collisions is that this was for a search engine. It indexes some texts, hashing words and immediately discarding the word contents. When searching, it only looks for the hashed value and may get some false positives. However, the engine returns surrounding text, so the text file has to be opened anyway, and it is scanned for all search terms (anything in the search box that doesn't come up in the file was a false positive and is just dropped).
If you cannot at all reconstruct the original data, though (as with my search terms), then this isn't workable.
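
As a rough sketch of that indexing scheme (all names here are invented, and std::unordered_map stands in for whatever the real engine used): index each word only by its 64-bit hash and the file it came from, then confirm hits against the file itself when answering a query, so false positives fall out naturally.

#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// 64-bit FNV-1a, as in the earlier post.
static std::uint64_t hash64(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

struct Index {
    // word hash -> files it appeared in; the word text itself is discarded
    std::unordered_map<std::uint64_t, std::vector<std::string>> postings;

    void add_file(const std::string& path) {
        std::ifstream in(path);
        std::string word;
        while (in >> word) {
            std::vector<std::string>& files = postings[hash64(word)];
            if (files.empty() || files.back() != path)   // crude de-duplication
                files.push_back(path);
        }
    }

    // Look up by hash, then re-scan each candidate file for the literal term;
    // anything that matched only through a hash collision gets dropped here.
    std::vector<std::string> search(const std::string& term) const {
        std::vector<std::string> hits;
        auto it = postings.find(hash64(term));
        if (it == postings.end()) return hits;
        for (const std::string& path : it->second) {
            std::ifstream in(path);
            std::string word;
            while (in >> word) {
                if (word == term) { hits.push_back(path); break; }
            }
        }
        return hits;
    }
};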


Quote:
Then what you want to do is not possible to do perfectly.

Perfect hashing is possible :) Just a matter of how well you know the input space.
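
A minimal illustration of that point, assuming the whole key set is known up front (names invented): brute-force a seed until a seeded FNV-style hash maps every known key to a distinct value, giving a perfect (though not minimal) hash for exactly that set.

#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Seeded FNV-1a-style hash; the seed simply replaces the offset basis.
static std::uint64_t seeded_hash(const std::string& s, std::uint64_t seed) {
    std::uint64_t h = seed;
    for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

// Try seeds until no two keys collide. The winning seed defines a hash that is
// injective (perfect) for exactly these keys, which is only possible because
// the input space is fully known.
static std::uint64_t find_perfect_seed(const std::vector<std::string>& keys) {
    for (std::uint64_t seed = 14695981039346656037ULL; ; ++seed) {
        std::unordered_set<std::uint64_t> seen;
        bool collision = false;
        for (const std::string& k : keys) {
            if (!seen.insert(seeded_hash(k, seed)).second) { collision = true; break; }
        }
        if (!collision) return seed;
    }
}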

Quote:
How much memory are we really talking about here? If you're programming for a PC then surely RAM couldn't possibly be an issue.

Whoa, I would have said: there's never enough memory! (but that's because my thesis is on speeding up IO/file caching, which wants to vacuum up anything left ;) )

Quote:
Original post by Jan Wassenberg
Quote:
Then what you want to do is not possible to do perfectly.

Perfect hashing is possible :) Just a matter of how well you know the input space.
True, but in order to do that you have to be able to look at each string more than once, for the most part.
Quote:
Original post by Jan Wassenberg
Quote:
How much memory are we really talking about here? If you're programming for a PC then surely RAM couldn't possibly be an issue.

Whoa, I would have said: there's never enough memory! (but that's because my thesis is on speeding up IO/file caching, which wants to vacuum up anything left ;) )
Well, he mentioned "thousands of little strings (1-64 bytes ... rarely more)", which obviously won't take up more than 64 KB per thousand strings.
I suppose he could have meant "tens of thousands", or even "hundreds of thousands", but even a million of these is only going to be a few tens of megabytes.

But I get your point also.

Guest Anonymous Poster
I think some explanation is in order.

32-bit/64-bit hash + original strings kept:
*pros:
No false positives.
*cons:
The program actually NEVER needs those strings (human-readable identifiers) - it would waste memory just to deal with exceptional conditions (collisions).
The program has its own memory manager, designed for speed and zero fragmentation - 16 KB is the smallest amount it provides (individual strings would have to be packed into those blocks to get reasonable memory usage).
An extra copy of the string is needed by everything that wants to find a match (A LOT!) - which, given the 16 KB thing, is annoying.

64-bit hash, original strings discarded:
*pros:
Not wasting space on strings I don't need (my program is memory heavy = already using all it can get and relies on the HDD for the rest).
No extra copies of the strings needed.
*cons:
A quite small possibility of a false positive.

Bottom line - I don't like to deal with those little stringy buggers :p
So - I would say it is more like a design choice. (KISS)

:)
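
What the second option boils down to, as a rough sketch (names invented; std::unordered_map stands in for the custom, pool-allocated containers described above): hash the human-readable name once at the boundary, throw the string away, and key everything by the 8-byte id from then on.

#include <cstdint>
#include <string>
#include <unordered_map>

// 64-bit FNV-1a, as discussed earlier in the thread.
static std::uint64_t hash64(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

struct Resource { /* whatever the identifier refers to */ };

class ResourceTable {
public:
    // The name is hashed once and then discarded; only the 8-byte id is stored.
    Resource& insert(const std::string& name) { return table_[hash64(name)]; }

    Resource* find(const std::string& name) { return find(hash64(name)); }
    Resource* find(std::uint64_t id) {
        auto it = table_.find(id);
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::uint64_t, Resource> table_;  // stands in for the 16 KB-block pools
};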
