Back to General and Gameplay Programming

Generating an ID

General and Gameplay Programming Programming

Started by RuneLancer March 14, 2006 03:10 PM

4 comments, last by RuneLancer 18 years, 1 month ago

RuneLancer

253

Author

March 14, 2006 03:10 PM

I'm trying to acheive something at work but can't quite figure out how to do it. Frankly, I'm not even sure where to start looking. A few pointers or some advice would be quite welcomed. I'm essentially trying to generate a sort of identifier I can tag onto a file that will vary very little as the file changes. Kinda like MD5, but not something that would change completely if you were to change an 'A' into a 'B' or insert a few bytes, or whatnot. For instance, with a theorical algorithm, I could take a file containing "The quick brown fox jumped over the lazy dog." and it would output, say, A0 F2 55 1C 00 08 E5 32. Then I could change the file to contain "The quick brown fox jumped over the lazy cat." and I'd end up with A0 F2 55 1C 00 08 E6 4E. For the sake of context, this would more or less be to create some sort of "difference map" where one file could quickly be looked up by this ID in a database and similare files could be acquired by looking for those with a similare ID.

Scuppy

126

March 14, 2006 03:32 PM

So you want a hash that changes proportionally to the amount of change in the file? You could just sum the data and do a modulus or something like that.

Are you comparing text or binary data?

CTar

1,134

March 14, 2006 03:41 PM

I haven't read about this kind of stuff, but an easy way to do something like that, would be to take the sum of all characters. If the sum gets larger than your last ID you just start at 0 again. So if we use ASCII to translate the characters to numbers and we take your example:
The quick brown fox jumped over the lazy cat. = 84 + 104 + 101 + 32 + 113 + 117 + 105 + 99 + 107 + 32 + 98 + 114 + 111 + 119 + 110 + 32 + 102 + 111 + 120 + 32 + 106 + 117 + 109 + 112 + 101 + 100 + 32 + 111 + 118 + 101 + 114 + 32 + 116 + 104 + 101 + 32 + 108 + 97 + 122 + 121 + 32 + 99 + 97 + 116 + 46 = 4187.

Now if we change car to dog.
The quick brown fox jumped over the lazy dog. = 84 + 104 + 101 + 32 + 113 + 117 + 105 + 99 + 107 + 32 + 98 + 114 + 111 + 119 + 110 + 32 + 102 + 111 + 120 + 32 + 106 + 117 + 109 + 112 + 101 + 100 + 32 + 111 + 118 + 101 + 114 + 32 + 116 + 104 + 101 + 32 + 108 + 97 + 122 + 121 + 32 + 100 + 111 + 103 + 46 = 4189
There is only 2 in difference because dog and cat is so similar. If we instead try with a rhino.
The quick brown fox jumped over the lazy rhino. = 84 + 104 + 101 + 32 + 113 + 117 + 105 + 99 + 107 + 32 + 98 + 114 + 111 + 119 + 110 + 32 + 102 + 111 + 120 + 32 + 106 + 117 + 109 + 112 + 101 + 100 + 32 + 111 + 118 + 101 + 114 + 32 + 116 + 104 + 101 + 32 + 108 + 97 + 122 + 121 + 32 + 114 + 104 + 105 + 110 + 111 + 46 = 4419
Now there is over 200 in difference (each lowercase letter gives around 100 and rhino is two more letters).

This might not be such a good idea for small files since you will get a very low idea. Maybe if you give characters values like 40036, 40100 etc. then it would be better to use with small files.

EDIT: Damn Scuppy beat me because I just had to write a little program to convert a string to a lot of numbers.

kuphryn

210

March 14, 2006 03:46 PM

Interesting.

Consider word count, and word comparison where you keep the number of words that are different in the same sequence. The issue is what if the files are opposites?

a b c
c b a

Kuphryn

nmi

978

March 14, 2006 03:49 PM

Read here:
http://en.wikipedia.org/wiki/Levenshtein_distance

Basically you compute how many steps are needed to transform one string into another. So if your files are almost the same, you will end up with a small value. The more changes are applied, the bigger the computed value will be.

RuneLancer

253

Author

March 14, 2006 03:55 PM

I tried summing up the bytes, but the results weren't very satisfactory (particularly for exceedingly large files, such as in the order of a few hundred megs; we don't expect to work with these too often though ;) ) The main problem was with strings like...

"This is a test."
"Tbs hsi ruteh ."

The two strings are very different but sum up the same. Still, it's a good start.

The link nmi posted seems VERY insightful, however. My day's pretty much over but I'll give it a shot at home and see what comes out of it. Thanks :)

Generating an ID

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Generating an ID

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines