# Checksum aggregation


## Recommended Posts

Having a bit of a brain fart; had too many meetings today. I need to take a set of files on a remote (mostly uncontrolled) machine and identify if those files are all the files it's supposed to have, and those files are unmodified. This was going along fine. MD5 the files, xor the hashes, check versus the server version. Unfortunately, requirements have made it such that the file list often contains duplicates. The xor causes the hashes to cancel out, leading to a lot more collisions than is acceptable.

So what's a better way to aggregate unordered sets of hashes? I could unique the list first, but that causes us to fail to detect the duplicates if the number of them is all that changes. I could add in the number of files to the hash, but that seems... dirty. This is probably just one of those things I missed by not studying CS formally and for some reason can't google effectively.
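For what it's worth, the cancellation is easy to demonstrate: XOR is its own inverse, so any digest that appears an even number of times vanishes from the aggregate entirely. A minimal Python sketch (the file contents here are made up):

```python
import hashlib

def xor_aggregate(hashes):
    """XOR 16-byte digests together; the result is order-independent."""
    result = bytes(16)
    for h in hashes:
        result = bytes(x ^ y for x, y in zip(result, h))
    return result

a = hashlib.md5(b"file A contents").digest()
b = hashlib.md5(b"file B contents").digest()

# Having the same file twice cancels out -- the aggregate can't tell
# {A, B, B} from {A}:
print(xor_aggregate([a, b, b]) == xor_aggregate([a]))  # True
```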

##### Share on other sites

Are the files required to be in any particular location, such as relative to some base location? (i.e. are you trying to ensure that you have some directory structure mirrored across the machines?) If so, you could concatenate the relative path+name of each file with its contents before doing the hash. If not, then I'd do what you suggested: make the list unique, and for each unique hash, perform another operation based on the number of files with those identical contents - ideally a rotate left, if the number of duplicates is always less than 128 (or if you can afford the risk of someone having the wrong number of dupes mod 128).
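A sketch of that rotate-left idea in Python - `rotl128` and `aggregate` are hypothetical names, and as noted above the duplicate count wraps mod 128:

```python
import hashlib
from collections import Counter

def rotl128(digest, n):
    """Rotate a 16-byte digest left by n bits (hypothetical helper)."""
    x = int.from_bytes(digest, "big")
    n %= 128
    x = ((x << n) | (x >> (128 - n))) & ((1 << 128) - 1)
    return x.to_bytes(16, "big")

def aggregate(digests):
    """XOR the unique digests, rotating each one left by (count - 1) bits
    so the number of duplicates changes the result instead of cancelling."""
    result = bytes(16)
    for digest, count in Counter(digests).items():
        rotated = rotl128(digest, count - 1)
        result = bytes(x ^ y for x, y in zip(result, rotated))
    return result
```

With this, a set containing a file once and a set containing it twice aggregate to different values, while the result stays independent of traversal order.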

##### Share on other sites

> Are the files required to be in any particular location, such as relative to some base location? (i.e. are you trying to ensure that you have some directory structure mirrored across the machines?) If so, you could concatenate the relative path+name of each file with its contents before doing the hash. If not, then I'd do what you suggested: make the list unique, and for each unique hash, perform another operation based on the number of files with those identical contents - ideally a rotate left, if the number of duplicates is always less than 128 (or if you can afford the risk of someone having the wrong number of dupes mod 128).

The paths of the files are not important, but the thing that duplicates them is unique. That might be one way; unfortunately that means we need to adjust the checksums as we copy the files which seems problematic.

##### Share on other sites

Complicated solution: Sort the hashes, remove duplicates but keep track of the number of times each one appeared, then compute md5 of the resulting text. In a Unix command line:
> md5sum * | perl -ane 'print "$F[0]\n"' | sort | uniq -c | md5sum
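The same computation in Python, for anyone not on a Unix box - the function name is made up, and it takes hex digest strings rather than reading files itself:

```python
import hashlib
from collections import Counter

def aggregate(hex_digests):
    """Roughly what the sort | uniq -c | md5sum pipe computes: count each
    distinct digest, emit the sorted 'count digest' lines, hash the lot."""
    counts = Counter(hex_digests)
    text = "".join(f"{count} {digest}\n" for digest, count in sorted(counts.items()))
    return hashlib.md5(text.encode()).hexdigest()
```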

##### Share on other sites
Assuming I'm understanding your situation (and there's a very good chance that I don't), as you encounter the files add the hashes to a sorted container, and then after traversing the whole tree generate a hash of the sorted container. This should detect that the right number of duplicates exist.
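That approach is only a couple of lines in Python (hypothetical `aggregate`, operating on the raw digests):

```python
import hashlib

def aggregate(digests):
    """Sort the raw digests and hash their concatenation. Sorting makes
    the result independent of traversal order, and keeping duplicates in
    the list makes their count part of what gets hashed."""
    return hashlib.md5(b"".join(sorted(digests))).digest()
```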

##### Share on other sites
No, not exactly. The aggregation is literally

> result := 16 bytes of 0; for each checksum: result := result XOR checksum; return result

I suppose then that putting them into a collection and hashing that would deal with dupes (similar to your description of hashing the ordered collection), but my collection is unordered. I suppose I have enough info to deterministically order the records on both ends... *nod* that would work I think.

##### Share on other sites
What's wrong with my simple solution?

##### Share on other sites

> What's wrong with my simple solution?

Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.

##### Share on other sites

> > What's wrong with my simple solution?
>
> Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.

Hmmm... no. It is my complicated solution that is essentially SiCrane's suggestion. Please, re-read my post.

Why do you think that hashing the string is any worse than hashing the in-memory representation? A decent hash of the string (and MD5 is decent) would work perfectly well.

##### Share on other sites

> > > What's wrong with my simple solution?
> >
> > Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.
>
> Hmmm... no. It is my complicated solution that is essentially SiCrane's suggestion. Please, re-read my post.

Ah. Nothing I can think of offhand, except that the top-most bits might be lost, leading to more collisions than expected. Though I don't suppose they need to be lost, and I suspect the increase wouldn't be significant.

> Why do you think that hashing the string is any worse than hashing the in-memory representation? A decent hash of the string (and MD5 is decent) would work perfectly well.

Because you've got essentially 16 possible values per byte instead of 256... right, but the string is longer. So the hashing would take longer, but we're not doing it that frequently.
