Checksum aggregation

Having a bit of a brain fart; had too many meetings today. I need to take a set of files on a remote (mostly uncontrolled) machine and identify whether those files are all the files it's supposed to have and whether those files are unmodified. This was going along fine: MD5 the files, XOR the hashes, check against the server's version. Unfortunately, requirements have made it such that the file list often contains duplicates. XORing duplicate hashes causes them to cancel out, leading to a lot more collisions than is acceptable.

So what's a better way to aggregate unordered sets of hashes? I could unique the list first, but that causes us to fail to detect the duplicates if the number of them is all that changes. I could add in the number of files to the hash, but that seems... dirty. This is probably just one of those things I missed by not studying CS formally and for some reason can't google effectively.
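
To make the cancellation concrete, here's a minimal Python sketch of the scheme being described (byte strings stand in for file contents; the names are just for illustration):

import hashlib
from functools import reduce

def md5_int(data):
    # Treat the 16-byte MD5 digest as a 128-bit integer.
    return int.from_bytes(hashlib.md5(data).digest(), "big")

def xor_aggregate(contents):
    # XOR all the per-file hashes together (the scheme described above).
    return reduce(lambda acc, c: acc ^ md5_int(c), contents, 0)

# Two copies of the same file cancel each other out, so this aggregates
# to the same value as a set containing only b"other":
assert xor_aggregate([b"same", b"same", b"other"]) == xor_aggregate([b"other"])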
Are the files required to be in any particular location, such as relative to some base location? (ie are you trying to ensure that you have some directory structure mirrored across the machines). If so, you could concatenate the relative path+name of each file with the contents before doing the hash. If not, then I'd do what you suggested - make the list unique, and for each unique hash, perform another operation based on the number of files with those identical contents - ideally do a rotate left if the number of duplicates is always less than 128 (or if you can afford the risk of someone having the wrong number of dupes mod 128).
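
A rough Python sketch of that rotate-left idea, assuming the hashes are handled as 128-bit integers (the mod-128 caveat above still applies):

from collections import Counter

def rotl128(x, n):
    # Rotate a 128-bit value left by n bits.
    n %= 128
    return ((x << n) | (x >> (128 - n))) & ((1 << 128) - 1)

def aggregate_with_counts(hashes):
    # XOR each unique hash after rotating it left by its occurrence count,
    # so N copies of a file no longer cancel out to zero.
    result = 0
    for h, count in Counter(hashes).items():
        result ^= rotl128(h, count)
    return result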

[quote]
Are the files required to be in any particular location, such as relative to some base location? (ie are you trying to ensure that you have some directory structure mirrored across the machines). If so, you could concatenate the relative path+name of each file with the contents before doing the hash. If not, then I'd do what you suggested - make the list unique, and for each unique hash, perform another operation based on the number of files with those identical contents - ideally do a rotate left if the number of duplicates is always less than 128 (or if you can afford the risk of someone having the wrong number of dupes mod 128).
[/quote]

The paths of the files are not important, but the thing that duplicates them is unique. That might be one way; unfortunately, it means we'd need to adjust the checksums as we copy the files, which seems problematic.
Simple solution: You could add the hashes instead of XORing them.

Complicated solution: Sort the hashes, remove duplicates but keep track of the number of times each one appeared, then compute md5 of the resulting text. In a Unix command line:
> md5sum * | perl -ane 'print "$F[0]\n"' | sort | uniq -c | md5sum
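
The simple solution as a short Python sketch, with each digest treated as a 128-bit integer (one reading of the suggestion, not a definitive implementation):

def add_aggregate(hashes):
    # Sum the 128-bit hash values modulo 2**128 instead of XORing them;
    # two copies of a file now contribute 2*h rather than cancelling to zero.
    return sum(hashes) % (1 << 128)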
Assuming I'm understanding your situation (and there's a very good chance that I don't), as you encounter the files add the hashes to a sorted container, and then after traversing the whole tree generate a hash of the sorted container. This should detect that the right number of duplicates exist.
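
One way that could look in Python, assuming the per-file digests are collected as raw 16-byte strings (the function name is hypothetical):

import hashlib

def hash_of_sorted_digests(digests):
    # Sort the raw digests (duplicates kept) and hash the concatenation,
    # so both which files exist and how many copies of each are reflected.
    outer = hashlib.md5()
    for d in sorted(digests):
        outer.update(d)
    return outer.hexdigest()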
No, not exactly. The aggregation is literally


result := 16 bytes of 0
for each checksum: result := result xor checksum
return result


I suppose then that putting them into a collection and hashing that would deal with dupes (similar to your description of hashing the ordered collection), but my collection is unordered. I suppose I have enough info to deterministically order the records on both ends... *nod* That would work, I think.
What's wrong with my simple solution?

[quote]
What's wrong with my simple solution?
[/quote]

Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.

[quote name='Telastyn' timestamp='1310648810' post='4835233']
[quote name='alvaro' timestamp='1310602581' post='4835059']
What's wrong with my simple solution?
[/quote]
Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.
[/quote]

Hmmm... no. It is my complicated solution that is essentially SiCrane's suggestion. Please, re-read my post.

Why do you think that hashing the string is any worse than hashing the in-memory representation? A decent hash of the string (and MD5 is decent) would work perfectly well.

[quote name='alvaro']
[quote name='Telastyn' timestamp='1310648810' post='4835233']
[quote name='alvaro' timestamp='1310602581' post='4835059']
What's wrong with my simple solution?
[/quote]
Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.
[/quote]

Hmmm... no. It is my complicated solution that is essentially SiCrane's suggestion. Please, re-read my post.
[/quote]

Ah. Nothing I can think of offhand, except that the top-most bits might be lost, leading to more collisions than expected. Though I don't suppose they need to be lost, and I suspect the increase wouldn't be significant.


[quote name='alvaro']
Why do you think that hashing the string is any worse than hashing the in-memory representation? A decent hash of the string (and MD5 is decent) would work perfectly well.
[/quote]

Because you've got essentially 16 possible values per byte instead of 256... right, but the string is twice as long, so the same information is there. The hashing would take a bit longer, but we're not doing it that frequently.
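
To illustrate the trade-off in Python (a toy comparison, not the actual file list):

import hashlib

digests = [hashlib.md5(c).digest() for c in (b"a", b"a", b"b")]

# Raw digests: 16 bytes each, 256 possible values per byte.
raw_agg = hashlib.md5(b"".join(sorted(digests))).hexdigest()

# Hex strings: 32 characters each, only 16 possible values per character,
# but the same information overall; MD5 just has twice as much input to process.
hex_agg = hashlib.md5("".join(sorted(d.hex() for d in digests)).encode()).hexdigest()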

