String fingerprint algorithms

Started by
9 comments, last by Floru 18 years, 6 months ago
Hi. I'm looking for string fingerprint algorithms with following requirements: - If fingerprint is different strings are different - If fingerprint is same then strings are same or almost same So I cannot use MD5 etc. because I consider almost same string as same. After searching from Google I found that Rabin's fingerprint algorithm should take care of this problem but I have not found any implementations for it although it's old algorithm (if I'm correct)...
Advertisement

Hello Floru,

This looked like an interesting problem so I did a bit of searching around and found the following link that describes Rabin's algorithm:

* "Some applications of Rabin's fingerprinting method" can be downloaded in various formats from the top right of this page. http://citeseer.ist.psu.edu/broder93some.html

* PDF download

After skimming though it (with my basic understanding of the maths) I don't believe Rabin's algorithm guarantees the collisions (identical fingerprints) will only occur from similar strings. It just guarantees the probability of collision is low:

collision probability (approximately) <= n * (m^2) / (2^k)

where:
n = number of unique strings finger printed
m = maximum string length
k = degree of irreducible polynomial (see link)

Given you get to choose k, the collision probability can be quite small (but not zero)

Regards,

Tom
Opps, after a bit more reading, I'm seeing lots of the following type of statement:

"We compute the checksums using Rabin s fingerprinting algorithm [3, 21], which has good spectral properties and gives exponentially small probabilistic bounds on the likelihood of a collision."

I guess it should have been obvious to me, k doesn't need to be huge before your collision probability gets very small.

Anyway, the important bit a Java example of rabin's algorithm can be downloaded from sourceforge under the GPL license (or according to its document the Apache license)

* Rabin Hash Project (Java)

Cheer,

Tom

[Edited by - TomH on October 10, 2005 8:18:01 AM]
Hi Tom.

Thanks for a very helpful answer. I checked "Some applications of Rabin's fingerprinting method" and it seems ok but maybe a little too mathematical for me. However the Java implementation is very helpful, thanks!

I wonder if they are any other fingerprint algorithms with properties that I described in my first post?
Hello Floru,

As far as I can see 'normal' hash function will not fit your second property ( the "or almost same" bit ) Although this does come down to what we mean by "almost the same".

If you're comparing "hello world", "helllo world" and "hello worldabcdefghijhklm" are they almost the same?

I'm defining a 'normal' hash function as any function that:
- given a variable length of data produces a fixed length output (hash)
- given identical input produces the same output.

If we assume the data can be any length then there are an infinite number of different inputs. If the output is of a fixed length then there are a finite number of outputs. This would mean that for some outputs can be produced from multiple inputs*.

*In fact I think some outputs can be produced from an infinite number of inputs.

So given no hash function will completely meet your second property, you should be looking for a function where it is unlikely any of your (different) inputs will produce the same hash. This comes down to how many different strings you will be finger printing. And you should select a hashing algorithm that has a 'hash space' large enough to make this very unlikely. If the hashes are uniformly distributed then from the birthday paradox you can calculate the probability of a collision from n different inputs in a hash space of s.

probability (approx) = 1 - e ^ ( -n.( n - 1) / 2.s )

I think the real advantage from Rabin's algorithm is it's support for any hash space size. (You just use longer or shorter polynomials.)

Without known exactly what your requirements are it's difficult to suggest an alternative, but if you wanted a safe mainstream one the i'd suggest going with SHA-512. It's almost certainly an overkill and it's likely you can find a reference implementation. Not too suprisingly the key space is 2 ^ 512. So you should be unlikely to see collisions until you've run for a long time.

Cheers,

Tom
One of the first things you're taught in a linear algebra class is that for any two ordered sets where one's number of elements is smaller than the other, than there is no way to produce a one for one mapping of the elements in the larger set to the elements of the smaller set.

Hashing functions see to it that you can at least spread out the mapping so that like elements in the large set are significantly different in the smaller set. I've never seen this proven, and I would love to see it, that no two words in the english language share the same MD5 or SHA1 hash. The data isn't availiable to run the test, but I'd like to see if any english literature, published between 1600 and 2005, manages to share MD5 or SHA1 hashes. My bet would be that its a no. Just from the angle of likelyhood, 128-bits are more than enough to index every particle in the observed universe.

Now, MD5 and SHA1 do have its limits. They've been moved within the realm of feasible collision producing because of flaws in the math, so that malicious people can abuse them to produce different strings that produce like hashes. That means that its possible to find at least one bizarre string of seemingly random punctuation marks that shares the same hash as Moby Dick.
william bubel
Thanks for your answers Tom and Inmate.

Question about when strings are considered almost the same is a very good one. It's perhaps difficult to explain exactly what I ment, but I mean strings that by human understanding are considered the same. I'm not interested about string containing other string as a substring but instead how much similarity the strings contain. From strings:
1. "hello world"
2. "helllo world"
3. "hello worldabcdefghijhklm"

In my opinion 1 and 2 are almost the same.

I have tried for example PHP's similar_text-function (http://www.php.net/manual/fi/function.similar-text.php) but of course it's not perfect. Sometimes you see that two strings (sentences) have the same information but still you get quite low similarity. But that's ok.

Comparing strings using hashing can be quite good depending of the situation (instead of comparing long strings directly just compare the hash). But before reading about Rabin's algorithm I thought that hashing always tries to mimimize collisions. Like Inmate wrote it's quite difficult to produce a collision using MD5 or SHA1.
Quote:Original post by Floru
Question about when strings are considered almost the same is a very good one. It's perhaps difficult to explain exactly what I ment, but I mean strings that by human understanding are considered the same. I'm not interested about string containing other string as a substring but instead how much similarity the strings contain.

That's not a particularly great definition. Strings that appear virtually identical can mean very different things. These two strings are identical except for a single character, but they have radically different meanings (at least in American English):
"I helped Jack off a horse.""I helped jack off a horse."

(In addition, they're identical if you ignore case!)

Even going by your original "hello world" example, you don't have any guidelines for when to stop considering them almost the same. For instance, given "hello world" as the original string, which of the following strings should we consider almost the same?
"hEllo world""heello world""hello                                            world""heel world""small world""it's a small world after all""hi, world!"

You need to come up with some concrete rules before you'll be able to proceed properly.
- k2"Choose a job you love, and you'll never have to work a day in your life." — Confucius"Logic will get you from A to B. Imagination will get you everywhere." — Albert Einstein"Money is the most egalitarian force in society. It confers power on whoever holds it." — Roger Starr{General Programming Forum FAQ} | {Blog/Journal} | {[email=kkaitan at gmail dot com]e-mail me[/email]} | {excellent webhosting}
Quote:Original post by kSquared
That's not a particularly great definition. Strings that appear virtually identical can mean very different things. These two strings are identical except for a single character, but they have radically different meanings (at least in American English):
"I helped Jack off a horse.""I helped jack off a horse."

(In addition, they're identical if you ignore case!)

That is true. My definition is very inaccurate and perhaps my post went a little bit offtopic. The main point was to find information about hashing functions that might create same hash for strings that share something same.

If I ask about hashing functions I might get some examples like MD5 or SHA1. But for me Rabin's algorithm was something new and I was hoping to get some general information about it and perhaps also information about similar algorithms. How exactly the "same" is defined is not so important, in my opinion.
Quote:Original post by Floru
The main point was to find information about hashing functions that might create same hash for strings that share something same.

It's extremely unlikely that an algorithm exists for the general case, given that there is no general case. A solution, therefore, is going to be domain specific. For instance, the Soundex and NYSIIS algorithms were specifically designed to identify similar surnames despite minor spelling differences. Metaphone is similar in that it values pronunciation over spelling but is not merely designed for names and does not recognize certain surname-specific similarities (Johnson and Johansson, for instance). IIRC, Metaphone is what PHP's similarity function uses. Because of this, Soundex and NYSIIS are better for surnames than Metaphone, but not as good otherwise. So is there a particular domain you're interested in?

This topic is closed to new replies.

Advertisement