#### Archived

This topic is now archived and is closed to further replies.

# hashing/checksum proof

This topic is 5449 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hi, Is there some way to "prove" the "goodness" of a hashing/checksum algorithm? For example, say I was calculating a checksum in two different ways:
1) byte checksum += byte_buffer[i];
2) byte checksum ^= byte_buffer[i];

Would there be some way of determining which one is "better"? Is the only way to do this by passing the actual data into the functions and seeing how well distributed the resulting checksums are? Then, given that building the entire possible dictionary of data is impossible, whats the best way of manually constructing a subset (in order to ensure you try out all the 'tricky' cases)? edit: added code tags [edited by - aboeing on November 10, 2003 1:36:53 AM]

##### Share on other sites
Well, if you want to do it mathematically:

Find the probability that a random data element will end up on a certain hash value. Basically, every hash value should have an equal value of occurring.

That''s pretty much as far as you can go, I think, without actually testing the data itself. The above method assumes the data will be randomly distributed, but in reality it may not be. Thus you will have to actually test it with a sample of the data you are using.

From here I got this:

The MD5 hash function generates 128-bit results. There are thus 2^128 (approximately 3.402×10^38) possible MD5 hash values. If the MD5 function is a good hash function, the chance of a document having a particular hash value is 2^-128, a value that can be regarded as equivalent to zero for most practical purposes.

So that''s basically what I''m talking about.

##### Share on other sites
Yeah okay, sorry I asked the wrong question,
given that I know I have a certain data pattern, is there some way to prove the "goodness" of the hash/checksum?

So for example, lets say I was to create a hash for english language data, (alphabet&punctuation), then the XOR method i proposed above would not be good, because (assuming im using ASCII) the high bits would never get flipped.

So, I would assume the add method would be better.
.. and soforth..

As for the 128bit point, that doenst make much sense, just because md5 have the ability to generate 2^128 values doesnt mean it will be. There must be some way to prove that one method produce more-random results than another method.. right?

How do you actually go about proving how often a certain hash value will turn up?

##### Share on other sites
The only effective way to test a hashing function is to profile it with actual data. Take a list of english words, run it through, check the results.

A hashing algorithm has no "tricky" cases, because individual "tricky" cases don''t matter, because they''re invoked so seldom. In contrast, the _data_ you use may have tricky cases. This is why it''s important _NOT_ to simply test the entire data set: because that''s not what you''re hashing.

For instance, I know of a scripting language that uses a hashing algorithm that only considers the first 32 bytes of a string. This worked fine for awhile. Then somebody wanted to hash a bunch of strings that had the same first 40 or so characters, and (surprise surprise) things suddenly got very bad. The moral here is to test with as close to the "real" dataset as possible. Within that, a random sampling will be fine.

"Sneftel is correct, if rather vulgar." --Flarelocke

##### Share on other sites
Ok, Thanks, I guess I''ll just have to spend a few hours typing out sample data. =)

##### Share on other sites
Well, your hash function takes a previous hash value and a next byte and produces a next hash value. The probability of a given next hash value is the sum of the probability of the previous hash value times the probability of the next byte for all combinations of the previous hash values and next bytes resulting in that given next hash value. The challenge lies in determining the probability of the next byte. If it is text then if the previous byte was ''q'' then the next byte is most likely ''u'', but if the previous byte was ''t'' then it is less likely to be ''u''. If you carry that far enough you end up with a set of probabilities that uniquely identifies your data and could be used to generate your data. At that point you might as well just run the data through your hashing algorithm and see what you get.

##### Share on other sites
This doesn''t work so well with your data, but the general tset of a hash is that by changing the input by the smallest amount (generally one bit) changes the entire hash. In fact, the best hash proofs involve having these two as far apart (in the Hamilton sense) as possible. You might also want to demonstrate its power as error correction if you intend to use it for checksums.

##### Share on other sites
Okay, kool thanks for those replies.
sadwanmage, are you basically saying a good indication for how good a hash value is, is:

oldhash = hash(data)
newhash = hash(data+smallchange)

if (hamming_distance(oldhash,newhash) == LARGE)
then goodhashalgorithm;
?

LilBudyWizer:
In that case the best hashing system you could come up with is probably something like hufmann compression with a precalculated dictionary (for english)..
the problem being you would still need to build the dictionary :/

I think that this would be a bit too much work though

(my specific application is to generate a unique identifier from around 10-100 character strings)

##### Share on other sites
Yep, you summed it up there pretty well in pseudo-code. I got mixed up between Hamming and Hamilton, but you saw what I meant anyway.
But its slightly more than you said. A good hash has a very large change for a small one, yes, but it is any small change, not just one particular one. Generally this means any of the bits being corrupted.

PS Wouldn''t it be >= LARGE?

##### Share on other sites
As an example to the OP, the problem with XOR is correct, the high bit would never get set.

The adding method isn''t very good either. consider: "ABS" "BAS"

A very very simple hashing algorithm that I think works quite well (for simple purposes) is just ''hash += 5*hash + new_char'' it will take into account the order of letters and the letters themselves.

1. 1
2. 2
3. 3
4. 4
Rutin
15
5. 5

• 14
• 9
• 9
• 10
• 12
• ### Forum Statistics

• Total Topics
632911
• Total Posts
3009182
• ### Who's Online (See full list)

There are no registered users currently online

×