Checksum aggregation

Having a bit of a brain fart; had too many meetings today. I need to take a set of files on a remote (mostly uncontrolled) machine and identify whether those files are all the files it's supposed to have and whether those files are unmodified. This was going along fine: MD5 the files, XOR the hashes, check against the server's version. Unfortunately, requirements have made it such that the file list often contains duplicates. XORing duplicate hashes causes them to cancel out, leading to a lot more collisions than is acceptable.

So what's a better way to aggregate unordered sets of hashes? I could unique the list first, but that causes us to fail to detect the duplicates if the number of them is all that changes. I could add in the number of files to the hash, but that seems... dirty. This is probably just one of those things I missed by not studying CS formally and for some reason can't google effectively.
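
To make the cancellation concrete, here's a minimal Python sketch of the scheme being described (byte strings stand in for file contents; the names are just for illustration):

import hashlib
from functools import reduce

def md5_int(data):
    # Treat the 16-byte MD5 digest as a 128-bit integer.
    return int.from_bytes(hashlib.md5(data).digest(), "big")

def xor_aggregate(contents):
    # XOR all the per-file hashes together (the scheme described above).
    return reduce(lambda acc, c: acc ^ md5_int(c), contents, 0)

# Two copies of the same file cancel each other out, so this aggregates
# to the same value as a set containing only b"other":
assert xor_aggregate([b"same", b"same", b"other"]) == xor_aggregate([b"other"])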
Are the files required to be in any particular location, such as relative to some base location? (ie are you trying to ensure that you have some directory structure mirrored across the machines). If so, you could concatenate the relative path+name of each file with the contents before doing the hash. If not, then I'd do what you suggested - make the list unique, and for each unique hash, perform another operation based on the number of files with those identical contents - ideally do a rotate left if the number of duplicates is always less than 128 (or if you can afford the risk of someone having the wrong number of dupes mod 128).
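
A rough Python sketch of that rotate-left idea, assuming the hashes are handled as 128-bit integers (the mod-128 caveat above still applies):

from collections import Counter

def rotl128(x, n):
    # Rotate a 128-bit value left by n bits.
    n %= 128
    return ((x << n) | (x >> (128 - n))) & ((1 << 128) - 1)

def aggregate_with_counts(hashes):
    # XOR each unique hash after rotating it left by its occurrence count,
    # so N copies of a file no longer cancel out to zero.
    result = 0
    for h, count in Counter(hashes).items():
        result ^= rotl128(h, count)
    return result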

[quote]
Are the files required to be in any particular location, such as relative to some base location? (ie are you trying to ensure that you have some directory structure mirrored across the machines). If so, you could concatenate the relative path+name of each file with the contents before doing the hash. If not, then I'd do what you suggested - make the list unique, and for each unique hash, perform another operation based on the number of files with those identical contents - ideally do a rotate left if the number of duplicates is always less than 128 (or if you can afford the risk of someone having the wrong number of dupes mod 128).
[/quote]

The paths of the files are not important, but the thing that duplicates them is unique. That might be one way; unfortunately, it means we'd need to adjust the checksums as we copy the files, which seems problematic.
Simple solution: You could add the hashes instead of XORing them.

Complicated solution: Sort the hashes, remove duplicates but keep track of the number of times each one appeared, then compute md5 of the resulting text. In a Unix command line:
> md5sum * | perl -ane 'print "$F[0]\n"' | sort | uniq -c | md5sum
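
The simple solution as a short Python sketch, with each digest treated as a 128-bit integer (one reading of the suggestion, not a definitive implementation):

def add_aggregate(hashes):
    # Sum the 128-bit hash values modulo 2**128 instead of XORing them;
    # two copies of a file now contribute 2*h rather than cancelling to zero.
    return sum(hashes) % (1 << 128)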
Assuming I'm understanding your situation (and there's a very good chance that I don't), as you encounter the files add the hashes to a sorted container, and then after traversing the whole tree generate a hash of the sorted container. This should detect that the right number of duplicates exist.
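
One way that could look in Python, assuming the per-file digests are collected as raw 16-byte strings (the function name is hypothetical):

import hashlib

def hash_of_sorted_digests(digests):
    # Sort the raw digests (duplicates kept) and hash the concatenation,
    # so both which files exist and how many copies of each are reflected.
    outer = hashlib.md5()
    for d in sorted(digests):
        outer.update(d)
    return outer.hexdigest()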
No, not exactly. The aggregation is literally


result := 16 bytes of 0
for each checksum: result := result xor checksum
return result


I suppose then that putting them into a collection and hashing that would deal with dupes (similar to your description of hashing the ordered collection), but my collection is unordered. I suppose I have enough info to deterministically order the records on both ends... *nod* That would work, I think.
What's wrong with my simple solution?

[quote]
What's wrong with my simple solution?
[/quote]

Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.

[quote name='Telastyn' timestamp='1310648810' post='4835233']
[quote name='alvaro' timestamp='1310602581' post='4835059']
What's wrong with my simple solution?
[/quote]
Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.
[/quote]

Hmmm... no. It is my complicated solution that is essentially SiCrane's suggestion. Please, re-read my post.

Why do you think that hashing the string is any worse than hashing the in-memory representation? A decent hash of the string (and MD5 is decent) would work perfectly well.

[quote name='alvaro']
[quote name='Telastyn' timestamp='1310648810' post='4835233']
[quote name='alvaro' timestamp='1310602581' post='4835059']
What's wrong with my simple solution?
[/quote]
Nothing really, but it is essentially SiCrane's suggestion (since removing dupes makes us have to track extra data) and I suspect that hashing the string representation of concatenated hashes is weaker than hashing their in-memory representation.
[/quote]

Hmmm... no. It is my complicated solution that is essentially SiCrane's suggestion. Please, re-read my post.
[/quote]

Ah. Nothing I can think of offhand, except that the top-most bits might be lost, leading to more collisions than expected. Though I don't suppose they need to be lost, and I suspect the increase wouldn't be significant.


[quote name='alvaro']
Why do you think that hashing the string is any worse than hashing the in-memory representation? A decent hash of the string (and MD5 is decent) would work perfectly well.
[/quote]

Because you've got essentially 16 possible values per byte instead of 256... right, but the string is twice as long, so the same information is there. The hashing would take a bit longer, but we're not doing it that frequently.
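
To illustrate the trade-off in Python (a toy comparison, not the actual file list):

import hashlib

digests = [hashlib.md5(c).digest() for c in (b"a", b"a", b"b")]

# Raw digests: 16 bytes each, 256 possible values per byte.
raw_agg = hashlib.md5(b"".join(sorted(digests))).hexdigest()

# Hex strings: 32 characters each, only 16 possible values per character,
# but the same information overall; MD5 just has twice as much input to process.
hex_agg = hashlib.md5("".join(sorted(d.hex() for d in digests)).encode()).hexdigest()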

