Generating an ID

Started by
4 comments, last by RuneLancer 18 years, 1 month ago
I'm trying to acheive something at work but can't quite figure out how to do it. Frankly, I'm not even sure where to start looking. A few pointers or some advice would be quite welcomed. I'm essentially trying to generate a sort of identifier I can tag onto a file that will vary very little as the file changes. Kinda like MD5, but not something that would change completely if you were to change an 'A' into a 'B' or insert a few bytes, or whatnot. For instance, with a theorical algorithm, I could take a file containing "The quick brown fox jumped over the lazy dog." and it would output, say, A0 F2 55 1C 00 08 E5 32. Then I could change the file to contain "The quick brown fox jumped over the lazy cat." and I'd end up with A0 F2 55 1C 00 08 E6 4E. For the sake of context, this would more or less be to create some sort of "difference map" where one file could quickly be looked up by this ID in a database and similare files could be acquired by looking for those with a similare ID.
Advertisement
So you want a hash that changes proportionally to the amount of change in the file? You could just sum the data and do a modulus or something like that.

Are you comparing text or binary data?
I haven't read about this kind of stuff, but an easy way to do something like that, would be to take the sum of all characters. If the sum gets larger than your last ID you just start at 0 again. So if we use ASCII to translate the characters to numbers and we take your example:
The quick brown fox jumped over the lazy cat. = 84 + 104 + 101 + 32 + 113 + 117 + 105 + 99 + 107 + 32 + 98 + 114 + 111 + 119 + 110 + 32 + 102 + 111 + 120 + 32 + 106 + 117 + 109 + 112 + 101 + 100 + 32 + 111 + 118 + 101 + 114 + 32 + 116 + 104 + 101 + 32 + 108 + 97 + 122 + 121 + 32 + 99 + 97 + 116 + 46 = 4187.

Now if we change car to dog.
The quick brown fox jumped over the lazy dog. = 84 + 104 + 101 + 32 + 113 + 117 + 105 + 99 + 107 + 32 + 98 + 114 + 111 + 119 + 110 + 32 + 102 + 111 + 120 + 32 + 106 + 117 + 109 + 112 + 101 + 100 + 32 + 111 + 118 + 101 + 114 + 32 + 116 + 104 + 101 + 32 + 108 + 97 + 122 + 121 + 32 + 100 + 111 + 103 + 46 = 4189
There is only 2 in difference because dog and cat is so similar. If we instead try with a rhino.
The quick brown fox jumped over the lazy rhino. = 84 + 104 + 101 + 32 + 113 + 117 + 105 + 99 + 107 + 32 + 98 + 114 + 111 + 119 + 110 + 32 + 102 + 111 + 120 + 32 + 106 + 117 + 109 + 112 + 101 + 100 + 32 + 111 + 118 + 101 + 114 + 32 + 116 + 104 + 101 + 32 + 108 + 97 + 122 + 121 + 32 + 114 + 104 + 105 + 110 + 111 + 46 = 4419
Now there is over 200 in difference (each lowercase letter gives around 100 and rhino is two more letters).

This might not be such a good idea for small files since you will get a very low idea. Maybe if you give characters values like 40036, 40100 etc. then it would be better to use with small files.

EDIT: Damn Scuppy beat me because I just had to write a little program to convert a string to a lot of numbers.
Interesting.

Consider word count, and word comparison where you keep the number of words that are different in the same sequence. The issue is what if the files are opposites?

a b c
c b a

Kuphryn
Read here:
http://en.wikipedia.org/wiki/Levenshtein_distance

Basically you compute how many steps are needed to transform one string into another. So if your files are almost the same, you will end up with a small value. The more changes are applied, the bigger the computed value will be.
I tried summing up the bytes, but the results weren't very satisfactory (particularly for exceedingly large files, such as in the order of a few hundred megs; we don't expect to work with these too often though ;) ) The main problem was with strings like...

"This is a test."
"Tbs hsi ruteh ."

The two strings are very different but sum up the same. Still, it's a good start.

The link nmi posted seems VERY insightful, however. My day's pretty much over but I'll give it a shot at home and see what comes out of it. Thanks :)

This topic is closed to new replies.

Advertisement