Jump to content
  • Advertisement
Sign in to follow this  
Floru

String fingerprint algorithms

This topic is 4752 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Hi. I'm looking for string fingerprint algorithms with following requirements: - If fingerprint is different strings are different - If fingerprint is same then strings are same or almost same So I cannot use MD5 etc. because I consider almost same string as same. After searching from Google I found that Rabin's fingerprint algorithm should take care of this problem but I have not found any implementations for it although it's old algorithm (if I'm correct)...

Share this post


Link to post
Share on other sites
Advertisement

Hello Floru,

This looked like an interesting problem so I did a bit of searching around and found the following link that describes Rabin's algorithm:

* "Some applications of Rabin's fingerprinting method" can be downloaded in various formats from the top right of this page. http://citeseer.ist.psu.edu/broder93some.html

* PDF download

After skimming though it (with my basic understanding of the maths) I don't believe Rabin's algorithm guarantees the collisions (identical fingerprints) will only occur from similar strings. It just guarantees the probability of collision is low:

collision probability (approximately) <= n * (m^2) / (2^k)

where:
n = number of unique strings finger printed
m = maximum string length
k = degree of irreducible polynomial (see link)

Given you get to choose k, the collision probability can be quite small (but not zero)

Regards,

Tom

Share this post


Link to post
Share on other sites
Opps, after a bit more reading, I'm seeing lots of the following type of statement:

"We compute the checksums using Rabin s fingerprinting algorithm [3, 21], which has good spectral properties and gives exponentially small probabilistic bounds on the likelihood of a collision."

I guess it should have been obvious to me, k doesn't need to be huge before your collision probability gets very small.

Anyway, the important bit a Java example of rabin's algorithm can be downloaded from sourceforge under the GPL license (or according to its document the Apache license)

* Rabin Hash Project (Java)

Cheer,

Tom

[Edited by - TomH on October 10, 2005 8:18:01 AM]

Share this post


Link to post
Share on other sites
Hi Tom.

Thanks for a very helpful answer. I checked "Some applications of Rabin's fingerprinting method" and it seems ok but maybe a little too mathematical for me. However the Java implementation is very helpful, thanks!

I wonder if they are any other fingerprint algorithms with properties that I described in my first post?

Share this post


Link to post
Share on other sites
Hello Floru,

As far as I can see 'normal' hash function will not fit your second property ( the "or almost same" bit ) Although this does come down to what we mean by "almost the same".

If you're comparing "hello world", "helllo world" and "hello worldabcdefghijhklm" are they almost the same?

I'm defining a 'normal' hash function as any function that:
- given a variable length of data produces a fixed length output (hash)
- given identical input produces the same output.

If we assume the data can be any length then there are an infinite number of different inputs. If the output is of a fixed length then there are a finite number of outputs. This would mean that for some outputs can be produced from multiple inputs*.

*In fact I think some outputs can be produced from an infinite number of inputs.

So given no hash function will completely meet your second property, you should be looking for a function where it is unlikely any of your (different) inputs will produce the same hash. This comes down to how many different strings you will be finger printing. And you should select a hashing algorithm that has a 'hash space' large enough to make this very unlikely. If the hashes are uniformly distributed then from the birthday paradox you can calculate the probability of a collision from n different inputs in a hash space of s.

probability (approx) = 1 - e ^ ( -n.( n - 1) / 2.s )

I think the real advantage from Rabin's algorithm is it's support for any hash space size. (You just use longer or shorter polynomials.)

Without known exactly what your requirements are it's difficult to suggest an alternative, but if you wanted a safe mainstream one the i'd suggest going with SHA-512. It's almost certainly an overkill and it's likely you can find a reference implementation. Not too suprisingly the key space is 2 ^ 512. So you should be unlikely to see collisions until you've run for a long time.

Cheers,

Tom

Share this post


Link to post
Share on other sites
One of the first things you're taught in a linear algebra class is that for any two ordered sets where one's number of elements is smaller than the other, than there is no way to produce a one for one mapping of the elements in the larger set to the elements of the smaller set.

Hashing functions see to it that you can at least spread out the mapping so that like elements in the large set are significantly different in the smaller set. I've never seen this proven, and I would love to see it, that no two words in the english language share the same MD5 or SHA1 hash. The data isn't availiable to run the test, but I'd like to see if any english literature, published between 1600 and 2005, manages to share MD5 or SHA1 hashes. My bet would be that its a no. Just from the angle of likelyhood, 128-bits are more than enough to index every particle in the observed universe.

Now, MD5 and SHA1 do have its limits. They've been moved within the realm of feasible collision producing because of flaws in the math, so that malicious people can abuse them to produce different strings that produce like hashes. That means that its possible to find at least one bizarre string of seemingly random punctuation marks that shares the same hash as Moby Dick.

Share this post


Link to post
Share on other sites
Thanks for your answers Tom and Inmate.

Question about when strings are considered almost the same is a very good one. It's perhaps difficult to explain exactly what I ment, but I mean strings that by human understanding are considered the same. I'm not interested about string containing other string as a substring but instead how much similarity the strings contain. From strings:
1. "hello world"
2. "helllo world"
3. "hello worldabcdefghijhklm"

In my opinion 1 and 2 are almost the same.

I have tried for example PHP's similar_text-function (http://www.php.net/manual/fi/function.similar-text.php) but of course it's not perfect. Sometimes you see that two strings (sentences) have the same information but still you get quite low similarity. But that's ok.

Comparing strings using hashing can be quite good depending of the situation (instead of comparing long strings directly just compare the hash). But before reading about Rabin's algorithm I thought that hashing always tries to mimimize collisions. Like Inmate wrote it's quite difficult to produce a collision using MD5 or SHA1.

Share this post


Link to post
Share on other sites
Quote:
Original post by Floru
Question about when strings are considered almost the same is a very good one. It's perhaps difficult to explain exactly what I ment, but I mean strings that by human understanding are considered the same. I'm not interested about string containing other string as a substring but instead how much similarity the strings contain.

That's not a particularly great definition. Strings that appear virtually identical can mean very different things. These two strings are identical except for a single character, but they have radically different meanings (at least in American English):
"I helped Jack off a horse."
"I helped jack off a horse."

(In addition, they're identical if you ignore case!)

Even going by your original "hello world" example, you don't have any guidelines for when to stop considering them almost the same. For instance, given "hello world" as the original string, which of the following strings should we consider almost the same?
"hEllo world"
"heello world"
"hello world"
"heel world"
"small world"
"it's a small world after all"
"hi, world!"

You need to come up with some concrete rules before you'll be able to proceed properly.

Share this post


Link to post
Share on other sites
Quote:
Original post by kSquared
That's not a particularly great definition. Strings that appear virtually identical can mean very different things. These two strings are identical except for a single character, but they have radically different meanings (at least in American English):
"I helped Jack off a horse."
"I helped jack off a horse."

(In addition, they're identical if you ignore case!)

That is true. My definition is very inaccurate and perhaps my post went a little bit offtopic. The main point was to find information about hashing functions that might create same hash for strings that share something same.

If I ask about hashing functions I might get some examples like MD5 or SHA1. But for me Rabin's algorithm was something new and I was hoping to get some general information about it and perhaps also information about similar algorithms. How exactly the "same" is defined is not so important, in my opinion.

Share this post


Link to post
Share on other sites
Quote:
Original post by Floru
The main point was to find information about hashing functions that might create same hash for strings that share something same.

It's extremely unlikely that an algorithm exists for the general case, given that there is no general case. A solution, therefore, is going to be domain specific. For instance, the Soundex and NYSIIS algorithms were specifically designed to identify similar surnames despite minor spelling differences. Metaphone is similar in that it values pronunciation over spelling but is not merely designed for names and does not recognize certain surname-specific similarities (Johnson and Johansson, for instance). IIRC, Metaphone is what PHP's similarity function uses. Because of this, Soundex and NYSIIS are better for surnames than Metaphone, but not as good otherwise. So is there a particular domain you're interested in?

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!