#### Archived

This topic is now archived and is closed to further replies.

# hashing/checksum proof

This topic is 5147 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hi, Is there some way to "prove" the "goodness" of a hashing/checksum algorithm? For example, say I was calculating a checksum in two different ways:
1) byte checksum += byte_buffer[i];
2) byte checksum ^= byte_buffer[i];

Would there be some way of determining which one is "better"? Is the only way to do this by passing the actual data into the functions and seeing how well distributed the resulting checksums are? Then, given that building the entire possible dictionary of data is impossible, whats the best way of manually constructing a subset (in order to ensure you try out all the 'tricky' cases)? edit: added code tags [edited by - aboeing on November 10, 2003 1:36:53 AM]

##### Share on other sites
Well, if you want to do it mathematically:

Find the probability that a random data element will end up on a certain hash value. Basically, every hash value should have an equal value of occurring.

That''s pretty much as far as you can go, I think, without actually testing the data itself. The above method assumes the data will be randomly distributed, but in reality it may not be. Thus you will have to actually test it with a sample of the data you are using.

From here I got this:

The MD5 hash function generates 128-bit results. There are thus 2^128 (approximately 3.402×10^38) possible MD5 hash values. If the MD5 function is a good hash function, the chance of a document having a particular hash value is 2^-128, a value that can be regarded as equivalent to zero for most practical purposes.

So that''s basically what I''m talking about.

##### Share on other sites
Yeah okay, sorry I asked the wrong question,
given that I know I have a certain data pattern, is there some way to prove the "goodness" of the hash/checksum?

So for example, lets say I was to create a hash for english language data, (alphabet&punctuation), then the XOR method i proposed above would not be good, because (assuming im using ASCII) the high bits would never get flipped.

So, I would assume the add method would be better.
.. and soforth..

As for the 128bit point, that doenst make much sense, just because md5 have the ability to generate 2^128 values doesnt mean it will be. There must be some way to prove that one method produce more-random results than another method.. right?

How do you actually go about proving how often a certain hash value will turn up?

##### Share on other sites
The only effective way to test a hashing function is to profile it with actual data. Take a list of english words, run it through, check the results.

A hashing algorithm has no "tricky" cases, because individual "tricky" cases don''t matter, because they''re invoked so seldom. In contrast, the _data_ you use may have tricky cases. This is why it''s important _NOT_ to simply test the entire data set: because that''s not what you''re hashing.

For instance, I know of a scripting language that uses a hashing algorithm that only considers the first 32 bytes of a string. This worked fine for awhile. Then somebody wanted to hash a bunch of strings that had the same first 40 or so characters, and (surprise surprise) things suddenly got very bad. The moral here is to test with as close to the "real" dataset as possible. Within that, a random sampling will be fine.

"Sneftel is correct, if rather vulgar." --Flarelocke

##### Share on other sites
Ok, Thanks, I guess I''ll just have to spend a few hours typing out sample data. =)

##### Share on other sites
Well, your hash function takes a previous hash value and a next byte and produces a next hash value. The probability of a given next hash value is the sum of the probability of the previous hash value times the probability of the next byte for all combinations of the previous hash values and next bytes resulting in that given next hash value. The challenge lies in determining the probability of the next byte. If it is text then if the previous byte was ''q'' then the next byte is most likely ''u'', but if the previous byte was ''t'' then it is less likely to be ''u''. If you carry that far enough you end up with a set of probabilities that uniquely identifies your data and could be used to generate your data. At that point you might as well just run the data through your hashing algorithm and see what you get.

##### Share on other sites
This doesn''t work so well with your data, but the general tset of a hash is that by changing the input by the smallest amount (generally one bit) changes the entire hash. In fact, the best hash proofs involve having these two as far apart (in the Hamilton sense) as possible. You might also want to demonstrate its power as error correction if you intend to use it for checksums.

##### Share on other sites
Okay, kool thanks for those replies.
sadwanmage, are you basically saying a good indication for how good a hash value is, is:

oldhash = hash(data)
newhash = hash(data+smallchange)

if (hamming_distance(oldhash,newhash) == LARGE)
then goodhashalgorithm;
?

LilBudyWizer:
In that case the best hashing system you could come up with is probably something like hufmann compression with a precalculated dictionary (for english)..
the problem being you would still need to build the dictionary :/

I think that this would be a bit too much work though

(my specific application is to generate a unique identifier from around 10-100 character strings)

##### Share on other sites
Yep, you summed it up there pretty well in pseudo-code. I got mixed up between Hamming and Hamilton, but you saw what I meant anyway.
But its slightly more than you said. A good hash has a very large change for a small one, yes, but it is any small change, not just one particular one. Generally this means any of the bits being corrupted.

PS Wouldn''t it be >= LARGE?

##### Share on other sites
As an example to the OP, the problem with XOR is correct, the high bit would never get set.

The adding method isn''t very good either. consider: "ABS" "BAS"

A very very simple hashing algorithm that I think works quite well (for simple purposes) is just ''hash += 5*hash + new_char'' it will take into account the order of letters and the letters themselves.

##### Share on other sites
PS Wouldn''t it be >= LARGE?
Yep. =P

The adding method isn''t very good either. consider: "ABS" "BAS"
Thats a very good point.
The method you propose would probably work very well for generating a single character hashing value, but what I actually had in mind was something like
uint32 x;char *string;uin8 *z=&x;for (i=0;i<strlen(string);i++)z[i&3]+=string[i];//but this is probably better in all cases:z[i&3]+=5*z[i&3] + string[i];

i want to keep the hashing function fast..
thanks for the help guys! ill try out a couple

##### Share on other sites
another thing to consider when judging a hash is the hash range, compared to the data set variation ... for instance, if hashing down to 1 byte codes, the absolute BEST you can do is cause hashing of different sets of data to seperate into the 256 buckets fairly evenly ... so the key for a good algorithm like this is an algorithm that looks good on YOUR DATA (because there is no algorithm that looks good on ALL DATA) ...

it is impossible to create a hash for which some types of data will not hash improperly (meaning bunch up onto a smaller subset of hash results) ...

all hashes can only (easily) be judges by the following - how they hash RANDOM DATA, or how they hash data with a certain shape (aka WORDS, or SORTED DATA, or whatever you are dealing with) ...

the key is NOT always how good the algorithm handles SMALL CHANGES, because sometimes that is not what your data set consists of ... that IS a way to objectively judge between fairly equal hashing algorithms for RANDOM DATA that are already proven to distrubute a large set fairly evenly (once the overall distrobution is detemined, the small change rule is usefull because of tradition, traditionally people were using checksums to find (usually small) errors in saving or transmission ... in these cases small error detection is critical ... but for a given domain, this may not apply)

for instance ... many domain specific compression formats such as MP3 and JPG do NOT change based on 1 bit source changes, because their goal is the big picture become small ... which is the generic goal of all hashing ... but they want similar to stay similar, oposite of hashing used to distribute a set into buckets ...

the real thing I wanted to say is ... an algorithm which turns ABC and CAB into the same thing may be fine for random data, but would NOT be acceptable to most uses dealing with text exclusively ... when dealing with text, it is almost always benificial to have some form of shift in the algorithm, in order to move the subset domain (ASCII does not use all 256 source characters) into a complete destination range ...