Checksum aggregation


Telastyn    3777
Having a bit of a brain fart; too many meetings today. I need to take a set of files on a remote (mostly uncontrolled) machine and identify whether those files are all the files it's supposed to have, and whether those files are unmodified. This was going along fine: MD5 the files, XOR the hashes, check against the server version. Unfortunately, requirements have made it such that the file list often contains duplicates. The XOR causes duplicate hashes to cancel out, leading to a lot more collisions than is acceptable.

So what's a better way to aggregate unordered sets of hashes? I could unique the list first, but then we'd fail to detect a change when the only thing that changes is the number of duplicates. I could mix the number of files into the hash, but that seems... dirty. This is probably just one of those things I missed by not studying CS formally and for some reason can't google effectively.
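
For context, a minimal Python sketch of the cancellation problem (the helper name and the use of hashlib are illustrative, not from the original post):

[code]
import hashlib

def xor_aggregate(digests):
    """XOR a collection of 16-byte MD5 digests together (order-independent)."""
    result = bytes(16)
    for d in digests:
        result = bytes(a ^ b for a, b in zip(result, d))
    return result

h = hashlib.md5(b"some file contents").digest()
# A duplicated file cancels itself out: the aggregate is all zeroes,
# exactly as if neither copy were present.
assert xor_aggregate([h, h]) == bytes(16)
[/code]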

osmanb    2082
Are the files required to be in any particular location, such as relative to some base location? (i.e., are you trying to ensure that you have some directory structure mirrored across the machines?) If so, you could concatenate the relative path+name of each file with the contents before doing the hash. If not, then I'd do what you suggested: make the list unique, and for each unique hash, perform another operation based on the number of files with those identical contents - ideally a rotate left, if the number of duplicates is always less than 128 (or if you can afford the risk of someone having the wrong number of dupes mod 128).
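
One way to read that second suggestion, sketched in Python (the XOR of the rotated unique hashes, the function names, and the big-endian digest-to-integer conversion are my assumptions):

[code]
from collections import Counter

MASK = (1 << 128) - 1  # treat MD5 digests as 128-bit integers

def rotl128(x, n):
    """Rotate a 128-bit value left by n bits."""
    n %= 128
    return ((x << n) | (x >> (128 - n))) & MASK

def aggregate_unique_rotated(digests):
    result = 0
    for digest, count in Counter(digests).items():
        # Rotate each unique hash by its multiplicity so duplicates no longer cancel.
        result ^= rotl128(int.from_bytes(digest, "big"), count)
    return result.to_bytes(16, "big")
[/code]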

Telastyn    3777
[quote name='osmanb' timestamp='1310586222' post='4834965']
Are the files required to be in any particular location, such as relative to some base location? (i.e., are you trying to ensure that you have some directory structure mirrored across the machines?) If so, you could concatenate the relative path+name of each file with the contents before doing the hash. If not, then I'd do what you suggested: make the list unique, and for each unique hash, perform another operation based on the number of files with those identical contents - ideally a rotate left, if the number of duplicates is always less than 128 (or if you can afford the risk of someone having the wrong number of dupes mod 128).
[/quote]

The paths of the files are not important, but the thing that duplicates them is unique. That might be one way; unfortunately it means we would need to adjust the checksums as we copy the files, which seems problematic.

alvaro    21246
Simple solution: You could add the hashes instead of XORing them.

Complicated solution: Sort the hashes, remove duplicates but keep track of the number of times each one appeared, then compute the MD5 of the resulting text. On a Unix command line:

[code]
md5sum * | perl -ane 'print "$F[0]\n"' | sort | uniq -c | md5sum
[/code]
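
And a Python sketch of the simple solution, addition instead of XOR (modular 128-bit addition of the raw digests; the function name is mine):

[code]
import hashlib

def sum_aggregate(digests):
    """Add MD5 digests as 128-bit integers, wrapping modulo 2**128."""
    total = 0
    for d in digests:
        total = (total + int.from_bytes(d, "big")) % (1 << 128)
    return total.to_bytes(16, "big")

h = hashlib.md5(b"same contents").digest()
# Unlike XOR, two identical digests do not cancel to zero.
assert sum_aggregate([h, h]) != bytes(16)
[/code]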

SiCrane    11839
Assuming I'm understanding your situation (and there's a very good chance that I don't), as you encounter the files, add the hashes to a sorted container, and then after traversing the whole tree, generate a hash of the sorted container. This should detect that the right number of duplicates exists.
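
A minimal Python sketch of this sorted-container approach (assuming the per-file MD5 digests are already available as bytes):

[code]
import hashlib

def sorted_aggregate(digests):
    """Hash the sorted sequence of per-file digests; extra duplicates change the result."""
    md5 = hashlib.md5()
    for d in sorted(digests):  # sorting makes the aggregate independent of traversal order
        md5.update(d)
    return md5.digest()
[/code]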

Telastyn    3777
No, not exactly. The aggregation is literally

[code]
result = 16 bytes of 0;
for each checksum, result := result xor checksum;
return result;
[/code]

I suppose then that putting them into a collection and hashing [i]that[/i] would deal with dupes (similar to your description of hashing the sorted container), but my collection is unordered. Then again, I have enough info to deterministically order the records on both ends... *nod* I think that would work.

Telastyn    3777
[quote name='alvaro' timestamp='1310602581' post='4835059']
What's wrong with my simple solution?
[/quote]

Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.

alvaro    21246
[quote name='Telastyn' timestamp='1310648810' post='4835233']
[quote name='alvaro' timestamp='1310602581' post='4835059']
What's wrong with my simple solution?
[/quote]

Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.
[/quote]

Hmmm... no. It is my complicated solution that is essentially SiCrane's suggestion. Please, re-read my post.

Why do you think that hashing the string is any worse than hashing the in-memory representation? A decent hash of the string (and MD5 is decent) would work perfectly well.

Telastyn    3777
[quote name='alvaro' timestamp='1310649006' post='4835234']
[quote name='Telastyn' timestamp='1310648810' post='4835233']
[quote name='alvaro' timestamp='1310602581' post='4835059']
What's wrong with my simple solution?
[/quote]

Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.
[/quote]

Hmmm... no. It is my complicated solution that is essentially SiCrane's suggestion. Please, re-read my post.
[/quote]

Ah. Nothing I can think of off hand except the top-most bits might be lost leading to more than expected collisions. Though I don't suppose they need to be lost, and I suspect the increase wouldn't be significant.

[quote]
Why do you think that hashing the string is any worse than hashing the in-memory representation? A decent hash of the string (and MD5 is decent) would work perfectly well.
[/quote]

Because you've got essentially 16 possible values per byte instead of 256.... right, but the string is longer. So the hashing would take longer, but we're not doing it that frequently.

alvaro    21246
[quote name='Telastyn' timestamp='1310649644' post='4835237']
[quote name='alvaro' timestamp='1310649006' post='4835234']
[quote name='Telastyn' timestamp='1310648810' post='4835233']
[quote name='alvaro' timestamp='1310602581' post='4835059']
What's wrong with my simple solution?
[/quote]

Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.
[/quote]

Hmmm... no. It is my complicated solution that is essentially SiCrane's suggestion. Please, re-read my post.
[/quote]

Ah. Nothing I can think of off hand except the top-most bits might be lost leading to more than expected collisions. Though I don't suppose they need to be lost, and I suspect the increase wouldn't be significant.
[/quote]
Yes, I thought of that. If you want to be obsessive about it, you can add the carry flag after each sum, and I think that fixes the problem.
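
One reading of "add the carry flag after each sum" is end-around-carry addition on 128-bit integers; this Python sketch is my interpretation, not necessarily exactly what was meant:

[code]
MOD = 1 << 128  # MD5 digests interpreted as 128-bit integers

def add_with_carry(a, b):
    """Add two 128-bit values, folding the carry-out back into the low bit."""
    s = a + b
    if s >= MOD:           # a carry out of the top bit occurred
        s = (s - MOD) + 1  # end-around carry: nothing is silently discarded
    return s
[/code]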

[quote]
[quote]
Why do you think that hashing the string is any worse than hashing the in-memory representation? A decent hash of the string (and MD5 is decent) would work perfectly well.
[/quote]

Because you've got essentially 16 possible values per byte instead of 256.... right, but the string is longer. So the hashing would take longer, but we're not doing it that frequently.
[/quote]

Oh, you *are* being obsessive. :) The time spent hashing the hashes will be dwarfed by the time spent computing the hashes from the files, to the point where I can't imagine you'll be able to notice the difference between methods.

Telastyn    3777
Sorry, I've been working on a lot of things with heavy concurrency and high-scalability requirements, where little details matter.

You are of course correct.

