hashing/checksum proof

Started by
11 comments, last by aboeing 20 years, 5 months ago
PS Wouldn't it be >= LARGE?
Yep. =P

The adding method isn't very good either. Consider: "ABS" "BAS"
That's a very good point.
The method you propose would probably work very well for generating a single-character hash value, but what I actually had in mind was something like this:
uint32 x;
char *string;
uint8 *z = (uint8 *)&x;
int i;
for (i = 0; i < strlen(string); i++)
    z[i & 3] += string[i];
// but this is probably better in all cases:
z[i & 3] += 5*z[i & 3] + string[i];

I want to keep the hashing function fast.. thanks for the help guys! I'll try out a couple :)
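A minimal compilable sketch of the difference, assuming the "adding method" above means a plain sum of the character codes (the function names and test strings are just for illustration):

#include <stdint.h>
#include <stdio.h>

/* Plain "adding method": order-insensitive, so anagrams collide. */
static uint32_t hash_sum(const char *s)
{
    uint32_t h = 0;
    for (; *s; s++)
        h += (uint8_t)*s;
    return h;
}

/* Multiply-then-add step (same idea as the 5*z + c line above, but on a
   single accumulator): each step folds in the running value, so the
   order of the characters now matters. */
static uint32_t hash_mul(const char *s)
{
    uint32_t h = 0;
    for (; *s; s++)
        h += 5 * h + (uint8_t)*s;   /* equivalent to h = 6*h + c */
    return h;
}

int main(void)
{
    printf("sum: %u %u\n", (unsigned)hash_sum("ABS"), (unsigned)hash_sum("BAS")); /* equal  */
    printf("mul: %u %u\n", (unsigned)hash_mul("ABS"), (unsigned)hash_mul("BAS")); /* differ */
    return 0;
}

The plain sum maps "ABS" and "BAS" to the same value; the multiply-then-add step makes each character's contribution depend on where it appears in the string.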
another thing to consider when judging a hash is the hash range, compared to the data set variation ... for instance, if hashing down to 1-byte codes, the absolute BEST you can do is cause hashing of different sets of data to separate into the 256 buckets fairly evenly ... so the key for a good algorithm like this is an algorithm that looks good on YOUR DATA (because there is no algorithm that looks good on ALL DATA) ...
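A quick way to check that on your own data is to histogram the buckets; here is a rough C sketch, where hash8 and the word list are just placeholders for whatever hash and key set you actually use:

#include <stdint.h>
#include <stdio.h>

/* Placeholder 1-byte hash; swap in the function you want to judge. */
static uint8_t hash8(const char *s)
{
    uint32_t h = 0;
    for (; *s; s++)
        h = 6 * h + (uint8_t)*s;
    return (uint8_t)(h ^ (h >> 8) ^ (h >> 16) ^ (h >> 24));
}

int main(void)
{
    /* Stand-in for YOUR DATA -- in practice, feed in your real key set. */
    const char *sample[] = { "alpha", "beta", "gamma", "delta", "epsilon",
                             "zeta", "eta", "theta", "iota", "kappa" };
    int counts[256] = { 0 };
    int n = (int)(sizeof sample / sizeof sample[0]);

    for (int i = 0; i < n; i++)
        counts[hash8(sample[i])]++;

    /* Print the non-empty buckets; a roughly even spread is the goal. */
    for (int b = 0; b < 256; b++)
        if (counts[b])
            printf("bucket %3d: %d\n", b, counts[b]);
    return 0;
}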

it is impossible to create a hash for which some types of data will not hash improperly (meaning bunch up onto a smaller subset of hash results) ...

all hashes can only (easily) be judged by the following - how they hash RANDOM DATA, or how they hash data with a certain shape (aka WORDS, or SORTED DATA, or whatever you are dealing with) ...

the key is NOT always how well the algorithm handles SMALL CHANGES, because sometimes that is not what your data set consists of ... that IS a way to objectively judge between fairly equal hashing algorithms for RANDOM DATA that are already proven to distribute a large set fairly evenly (once the overall distribution is determined, the small change rule is useful because of tradition; traditionally people were using checksums to find (usually small) errors in saving or transmission ... in these cases small error detection is critical ... but for a given domain, this may not apply)
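A quick sanity check for the small-change behaviour is to flip input bits one at a time and count how many output bits change (rough sketch; the hash and test string here are again just placeholders):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Placeholder 32-bit hash to test. */
static uint32_t hash32(const char *s, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = 6 * h + (uint8_t)s[i];
    return h;
}

/* Number of differing bits between two hashes (ideal avalanche: ~16 of 32). */
static int bit_diff(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    int n = 0;
    while (x) { n += (int)(x & 1); x >>= 1; }
    return n;
}

int main(void)
{
    char buf[] = "checksum";
    size_t len = strlen(buf);
    uint32_t base = hash32(buf, len);

    /* Flip each input bit in turn and see how much the hash changes. */
    for (size_t byte = 0; byte < len; byte++) {
        for (int bit = 0; bit < 8; bit++) {
            buf[byte] = (char)((uint8_t)buf[byte] ^ (1u << bit));
            printf("flip byte %zu bit %d -> %2d output bits changed\n",
                   byte, bit, bit_diff(base, hash32(buf, len)));
            buf[byte] = (char)((uint8_t)buf[byte] ^ (1u << bit)); /* restore */
        }
    }
    return 0;
}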

for instance ... many domain-specific compression formats such as MP3 and JPG do NOT change based on 1-bit source changes, because their goal is to make the big picture small ... which is the generic goal of all hashing ... but they want similar to stay similar, the opposite of hashing used to distribute a set into buckets ...

the real thing I wanted to say is ... an algorithm which turns ABC and CAB into the same thing may be fine for random data, but would NOT be acceptable to most uses dealing with text exclusively ... when dealing with text, it is almost always beneficial to have some form of shift in the algorithm, in order to move the subset domain (ASCII does not use all 256 source characters) into a complete destination range ...
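One well-known example of that kind of shift-and-add mix for text is the classic djb2 string hash (sketch):

#include <stdint.h>
#include <stdio.h>

/* djb2: h = h*33 + c, done with a shift. The multiply/shift spreads the
   limited ASCII byte values across the full 32-bit range instead of
   letting them cluster. */
static uint32_t djb2(const char *s)
{
    uint32_t h = 5381;
    for (; *s; s++)
        h = ((h << 5) + h) + (uint8_t)*s;
    return h;
}

int main(void)
{
    printf("%08x\n", (unsigned)djb2("ABC"));
    printf("%08x\n", (unsigned)djb2("CAB"));   /* anagram of "ABC", different hash */
    return 0;
}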

Having a hash change significantly for a small change practically guarantees all the various other things, like even distribution. (That's not to say it's impossible otherwise, just unlikely.)

I don't believe MP3s and JPGs use hashing algorithms. I think the distinction is that hashes never involve data recoverable purely from the hash.

This topic is closed to new replies.
