# Floru

Member

1. ## String fingerprint algorithms

Thanks for your post, Sneftel. I understand that a general algorithm is unlikely to exist. That's OK, because I'm just studying these algorithms, not searching for one for a specific purpose. Your examples (Soundex, Metaphone, ...) are very good and I will check them. And of course I appreciate the information you gave about these algorithms.
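For reference, here is a minimal sketch of the Soundex idea mentioned above, assuming the common simplified American Soundex rules (first letter plus three digits). The function name `soundex` and the sample names are only illustrative, not taken from the thread.

```python
def soundex(word: str) -> str:
    """Simplified American Soundex: first letter followed by three digits."""
    # Map each consonant group to its Soundex digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return (result + "000")[:4]  # pad or truncate to four characters


print(soundex("Robert"), soundex("Rupert"))  # both names get the same code
```

Names that sound alike end up with the same short code, which is the kind of "collision on purpose" this thread is about, although only for pronunciation, not for general text similarity.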
2. ## String fingerprint algorithms

> Original post by kSquared
>
> That's not a particularly great definition. Strings that appear virtually identical can mean very different things. These two strings are identical except for a single character, but they have radically different meanings (at least in American English): "I helped Jack off a horse." "I helped jack off a horse." (In addition, they're identical if you ignore case!)

That is true. My definition is very inaccurate and perhaps my post went a little bit off-topic. The main point was to find information about hashing functions that might produce the same hash for strings that have something in common. If I ask about hashing functions, I might get examples like MD5 or SHA1. But Rabin's algorithm was something new to me, and I was hoping to get some general information about it and perhaps also information about similar algorithms. How exactly "same" is defined is not so important, in my opinion.
3. ## String fingerprint algorithms

Thanks for your answers, Tom and Inmate. The question about when strings are considered almost the same is a very good one. It's perhaps difficult to explain exactly what I meant, but I mean strings that a human would consider the same. I'm not interested in whether one string contains another as a substring, but in how similar the strings are. Take the strings (1) "hello world", (2) "helllo world" and (3) "hello worldabcdefghijhklm". In my opinion 1 and 2 are almost the same. I have tried, for example, PHP's similar_text function (http://www.php.net/manual/fi/function.similar-text.php), but of course it's not perfect. Sometimes you can see that two strings (sentences) carry the same information, but you still get quite a low similarity. But that's OK. Comparing strings using hashing can be quite good depending on the situation (instead of comparing long strings directly, just compare the hashes). But before reading about Rabin's algorithm I thought that hashing always tries to minimize collisions. As Inmate wrote, it's quite difficult to produce a collision using MD5 or SHA1.
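To make the comparison of the three strings concrete, here is a rough sketch using Python's standard `difflib.SequenceMatcher` as a stand-in. It is a different algorithm from PHP's `similar_text`, so the numbers will not match PHP's; the helper name `similarity` is just an assumption for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


s1 = "hello world"
s2 = "helllo world"
s3 = "hello worldabcdefghijhklm"

print(similarity(s1, s2))  # high: only one extra character
print(similarity(s1, s3))  # lower: a long extra suffix
```

A score like this compares two strings directly, so it needs both strings at hand; a fingerprint, in contrast, is computed from one string alone and compared later.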
4. ## String fingerprint algorithms

Hi Tom. Thanks for a very helpful answer. I checked "Some applications of Rabin's fingerprinting method" and it seems OK, but maybe a little too mathematical for me. However, the Java implementation is very helpful, thanks! I wonder if there are any other fingerprint algorithms with the properties that I described in my first post?
5. ## String fingerprint algorithms

Hi. I'm looking for string fingerprint algorithms with the following requirements:

- If the fingerprints are different, the strings are different
- If the fingerprints are the same, the strings are the same or almost the same

So I cannot use MD5 etc., because I consider almost identical strings to be the same. After searching on Google I found that Rabin's fingerprint algorithm should take care of this problem, but I have not found any implementations of it, although it's an old algorithm (if I'm correct)...
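A minimal sketch of the basic idea behind Rabin's fingerprint, assuming the usual formulation: the input bits are read as a polynomial over GF(2) and reduced modulo an irreducible polynomial. The fixed modulus below (x^64 + x^4 + x^3 + x + 1) and the name `rabin_fingerprint` are choices made for this example only; Rabin's scheme picks the irreducible polynomial at random.

```python
# Irreducible polynomial x^64 + x^4 + x^3 + x + 1 over GF(2), as an integer.
# Fixing it here is an assumption of this sketch; Rabin's method chooses it randomly.
DEGREE = 64
POLY = (1 << DEGREE) | 0x1B


def rabin_fingerprint(data: bytes) -> int:
    """Return the message polynomial modulo POLY (arithmetic over GF(2))."""
    fp = 0
    for byte in data:
        for shift in range(7, -1, -1):
            bit = (byte >> shift) & 1
            fp = (fp << 1) | bit  # multiply by x and add the next coefficient
            if fp >> DEGREE:      # degree reached 64: subtract (XOR) the modulus
                fp ^= POLY
    return fp


print(hex(rabin_fingerprint(b"hello world")))
print(hex(rabin_fingerprint(b"helllo world")))
```

Identical strings always get identical fingerprints, which covers the first requirement. As far as I can tell, a plain fingerprint like this does not by itself make almost identical strings collide; the similarity part would have to come from how the fingerprints are used (for example, fingerprinting many short substrings and comparing the resulting sets).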
6. ## Text analyze

Sorry if a question like this has been posted before. Unfortunately the search is not working currently. I recently studied the Markov chain algorithm and created some "random" text using it. Now I'm interested in text analysis. Are there any algorithms similar to Markov chains? And how could I extract the relevant words from a text (the text without noise words)? Any kind of pointers to text analysis would be greatly appreciated. Some source code would also be nice.
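To show what is meant, here is a small sketch of both ideas: a word-level Markov chain that generates "random" text, and a word count that skips noise words. The tiny `STOP_WORDS` list, the function names and the sample sentence are all assumptions made for illustration; a real noise-word list would be much longer.

```python
import random
import re
from collections import Counter, defaultdict

# Tiny illustrative noise-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "it", "on"}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def build_chain(words):
    """Map each word to the list of words that follow it in the text."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, seed=0):
    """Walk the chain from `start`, picking a random successor at each step."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length and chain.get(out[-1]):
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)


def relevant_words(text, top=3):
    """Return the most frequent words that are not noise words."""
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    return counts.most_common(top)


sample = "the cat sat on the mat and the cat ran"
chain = build_chain(tokenize(sample))
print(generate(chain, "the", 6))
print(relevant_words(sample, top=1))
```

Counting words after removing noise words is only the simplest possible approach to finding relevant words, but it is an easy starting point next to the Markov chain code.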