**2**

# Can't reverse new compression algorithm

###
#1
Members - Reputation: **259**

Posted 26 April 2011 - 11:33 AM

Basically what I do is this:

Given raw data: D

I compute the MD5 hash: H = MD5(D)

At this point I've compressed by a factor of Size(D) / Size(H) (which for any data larger than 128-bits is significant)

My problem is uncompressing the data.

Currently what I'm doing is as follows: (keep in mind i'm working with bit values here)

G = {0}[trillion]

while (MD5(G) != H) G++;

I know this has the problem of unintentionally hashing the preceding zeros but the main issue I'm running into is runtime.

So far 20-30 bit lengths have run in feasible time but when I tried to run this with a 32-bit guess, it took several minutes to verify.

My MD5 implementation may be inefficient but it's also possible that this algorithm suffers from a runtime of 2^n.

Any tips? Or is this simply an unfeasible compression algorithm?

###
#2
Senior Moderators - Reputation: **6584**

Posted 26 April 2011 - 11:46 AM

Hashes are non-unique. MD5 has been broken for ages now and it's trivial to generate two sets of data which hash to the exact same value.

Hashing is not compression.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

ScapeCode - Blog | SlimDX

###
#3
Moderators - Reputation: **9373**

Posted 26 April 2011 - 11:46 AM

1) Hashes are designed to be extremely difficult to reverse from the outset.

2) Multiple, completely different data patterns can produce the same output hash. (Think about it.) So even if you find something that matches your hash, odds are you still have the wrong results.

###
#4
Moderators - Reputation: **27971**

Posted 26 April 2011 - 12:09 PM

Correct, it is not a feasible compression algorithm.Any tips? Or is this simply an unfeasible compression algorithm?

Check out my book, Game Development with Unity, aimed at beginners who want to build fun games fast.

Also check out my personal website at bryanwagstaff.com, where I write about assorted stuff.

###
#5
Members - Reputation: **259**

Posted 26 April 2011 - 01:45 PM

I'm going to assume that this isn't a troll and that you're just ignorant of the behavior of hashing functions.

Hashes are non-unique. MD5 has been broken for ages now and it's trivial to generate two sets of data which hash to the exact same value.

Hashing is not compression.

Ok.. but if the hash produces a 128-bit digest, that's 2^128 possible hash outputs. I would think the chances of two bits of data producing the same hash would be 1 in 2^128.. which is better than the safety guarantee of a a redundant storage drive.

I'm aware it's

*possible*but to say (as a later post said) that odds are in favor of me having the wrong result seems illogical.

Amount of possibilities (unhashed) / Amount of possibilities (hashed) is the ratio for seeing the likelihood of "guessing" the right data to give me my initial data.

And it's not like a hard drive will ever store 2^128 bits.. And even if it could, then you could just generate a longer hash.

What is it that doesn't work about this? Or what am I seeing wrong?

###
#6
Senior Moderators - Reputation: **1781**

Posted 26 April 2011 - 01:51 PM

Amount of possibilities (unhashed) / Amount of possibilities (hashed) is the ratio for seeing the likelihood of "guessing" the right data to give me my initial data.

Correct. So consider storing a twenty-byte file. That's 2^160 possibilities (unhashed) / 2^128 possibilities (hashed) = 0.00000002328% chance of getting the right one.

###
#8
Members - Reputation: **300**

Posted 26 April 2011 - 02:50 PM

What is the purpose of this? are you doing it just for the sake of doing it, trying somehtign new? or is there a specific problem your trying to address? I believe some of the most efficient compression algorithms are those that can exploit specific data rather than general/raw data (like the compression for mp3s just cuts off inaudiable frequencies).

###
#9
Members - Reputation: **552**

Posted 26 April 2011 - 05:37 PM

As others have said, MD5 is designed specifically to be a one-way function. Hash functions are meant to be used for things like identifying, indexing and comparing large data blocks, authentication, error checking (checksums), and so on. This is not a compression algorithm, it only maps data of arbitrary size to a (theoretically) unique value, but with severe limitations.

True data compression relies on the fact that the number of bits required to represent the output of an arbitrary input symbol is governed by its entropy. Simply speaking, data compression will perform statistical analysis on the input, and assign the smallest number of bits for the most frequently occurring symbols, and more bits for the least frequent symbols. In a nutshell, the algorithm converts information with potentially low entropy to a new representation that has high entropy.

###
#10
Moderators - Reputation: **1802**

Posted 26 April 2011 - 10:12 PM

Obligatory read for these kinds of posts: The pigeonhole principle.Or what am I seeing wrong?

Especially this section:

The pigeonhole principle arises in computer science. For example, collisions are inevitable in a hash table because the number of possible keys exceeds the number of indices in the array. No hashing algorithm, no matter how clever, can avoid these collisions. This principle also proves that any general-purpose lossless compression algorithm that makes at least one input file smaller will make some other input file larger. (Otherwise, two files would be compressed to the same smaller file and restoring them would be ambiguous.)

###
#11
Members - Reputation: **259**

Posted 27 April 2011 - 06:44 AM

What is the purpose of this? are you doing it just for the sake of doing it, trying somehtign new? or is there a specific problem your trying to address?

I just think I have a feasible idea and want to explore it. Maybe get my name out there. I've always been more into the theoretical side of things.

A lot of what you are all telling me seems to revolve around the idea that there are going to be collisions. I get this. The input data is of arbitrary size. The output is of a fixed size. And what matters is that fixed size will usually be smaller than the input size.

But:

These collisions are still a 1 in 2^128 chance right? So I guess what I'm saying is, if I augment G to a value that hashes correctly, aren't my chances (2^128 - 1) / 2^128 that I've got the right input data? I'd bet on those odds.

I get what one poster said about entropy. For example, replacing lots of instances of 000000 in your data with 1. (just a random example) But as the tradeoff, replacing instances of 1 with 00. Stuff like that. Kind of like the claims about the Dvorak keyboard. Or more to the point: http://en.wikipedia.org/wiki/Huffman_coding

I get that. But I think this is different. Instead of trying to play with the data itself and looking for trends or useless parts of the data.. it relies on statistics. Most hashing algorithms generate digests pretty uniformly. So context sensitive data or really similar data are going to be mapped to wildly different digests. So I still don't see the problem. We're still looking at a 1 in 2^128 that my correctly hashing G isn't equal to D.

I understand with all your concerns. Fully. I agree that theoretically they are holes. They make it so this isn't a perfect compression algorithm. But it seems like in the lifetime of the universe, we won't reconstruct our data incorrectly. So what doesn't seem feasible about an algorithm that compresses terabytes of data (almost losslessly, 2^128-1 / 2^128 losslessy in fact) into 128 bits. Or petabytes or exabytes.

Not to mention we get verification of data integrity by simply hashing it at any time. But this is already done pretty regularly.

###
#12
Moderators - Reputation: **37634**

Posted 27 April 2011 - 06:47 AM

No. Let's say the hash function is a 0/1 based on whether the number is odd or even -- There's a 50% chance of a collision, and by knowing the hash, you've discarded 50% of possible inputs.These collisions are still a 1 in 2^128 chance right? So I guess what I'm saying is, if I augment G to a value that hashes correctly, aren't my chances (2^128 - 1) / 2^128 that I've got the right input data? I'd bet on those odds.

If the input is a 4-byte unsigned integer (range: 4 billion), then you've got a 4bil * 0.5 chance of guessing the right answer. Or, 1 in 2 billion. As the input size goes up, your chance of a correct guess goes down, dramatically.

Sneftel pointed this out earlier:

Let;s say the 128bit hash is perfect.Correct. So consider storing a twenty-byte file. That's 2^160 possibilities (unhashed) / 2^128 possibilities (hashed) = 0.00000002328% chance of getting the right one.Amount of possibilities (unhashed) / Amount of possibilities (hashed) is the ratio for seeing the likelihood of "guessing" the right data to give me my initial data.

With a 128bit input or less, you'll guess correctly.

With a 256bit input, you'll have a 1 in 2 chance.

With a 128 byte input, you'll have worse than a 1 in 10^269 chance.

With a 1KB input, you've got worse than 1 in 10^2427 chance.

With a 1MB input, it becomes worse than 1 in 10^2525184.

With a 1GB input, it become worse than 1 in 10^(10^9.4)

In comparison, there's approximately 10^80 atoms in the universe.

No, the problem is that itI understand with all your concerns. Fully. I agree that theoretically they are holes. They make it so this isn't a perfect compression algorithm. But it seems like in the lifetime of the universe, we won't reconstruct our data incorrectly.

**will**take you the lifetime of the universe to reconstruct the correct data.

To even reconstruct a 128byte file correctly from it's 128bit hash, you've got to make on average 1/4th of a googol ^ 2.7 guesses before striking it lucky (assuming you can distinguish a correct guess from an incorrect one).

###
#13
Members - Reputation: **259**

Posted 27 April 2011 - 08:44 AM

Or is the collision the serious issue?

I would think a human could easily observe the data to see if it was some random data that just happened to collide with the correct data. Aka... a video file vs. a random sequence of bits and bites.

Therefore giving some context to the data would be the solution. Compress per file:

record: MD5(file), context

example:

record: cbadbefef18985985 etc., MP3 File

Then when you hit a collision, it'll know because it can't be interpreted as an MP3 file.

###
#14
Members - Reputation: **1970**

Posted 27 April 2011 - 09:00 AM

Is it that I need to be assigning my "Guess" string randomly instead of incrementing by 1? I can see how incrementing from 0 to 2^x would take x guesses. Making it so if you have a terabite file, it would take 8 trillion guesses (8 bits/bite * 1 trillion bites)

Or is the collision the serious issue?

I would think a human could easily observe the data to see if it was some random data that just happened to collide with the correct data. Aka... a video file vs. a random sequence of bits and bites.

See [Md5 - Collision_vulnerabilities] - They show an example where only 4 bytes (same lengths still) are changed in a string of text and yet they both produce the same hash. In a case like this, without already having the correct file available to confirm it, how could you possibly guess which version was the original (assuming you even got it back to one of the results in the first place, which as many have said before, probability says you will never do). Even if as you said a human was looking at it, I don't think they would be able to know.

I'll also point you to another similar post a while back - [Gamedev - odd compression idea]. Different "compression" method, same concept. You'll see very similar posts then as what is being posted now (from both poster and those who responded).

###
#15
Members - Reputation: **259**

Posted 27 April 2011 - 09:06 AM

Is it that I need to be assigning my "Guess" string randomly instead of incrementing by 1? I can see how incrementing from 0 to 2^x would take x guesses. Making it so if you have a terabite file, it would take 8 trillion guesses (8 bits/bite * 1 trillion bites)

Or is the collision the serious issue?

I would think a human could easily observe the data to see if it was some random data that just happened to collide with the correct data. Aka... a video file vs. a random sequence of bits and bites.

See [Md5 - Collision_vulnerabilities] - They show an example where only 4 bytes (same lengths still) are changed in a string of text and yet they both produce the same hash. In a case like this, without already having the correct file available to confirm it, how could you possibly guess which version was the original (assuming you even got it back to one of the results in the first place, which as many have said before, probability says you will never do). Even if as you said a human was looking at it, I don't think they would be able to know.

I'll also point you to another similar post a while back - [Gamedev - odd compression idea]. Different "compression" method, same concept. You'll see very similar posts then as what is being posted now (from both poster and those who responded).

What about SHA-1? Does this have that same phenomenon where a small change will produce the same hash?

SHA-1 is actually the compression algorithm I'm using. I was sticking to MD5 cuz I assumed no one here knew about SHA.

###
#16
Members - Reputation: **902**

Posted 27 April 2011 - 09:15 AM

Is it that I need to be assigning my "Guess" string randomly instead of incrementing by 1? I can see how incrementing from 0 to 2^x would take x guesses. Making it so if you have a terabite file, it would take 8 trillion guesses (8 bits/bite * 1 trillion bites)

There are two distinct problems with your approach:

- Generating all possible files quickly grows unfeasible - there are b^n possible sequences where b is the number of choices per element (256 for a byte) and n is the number of elements in the sequence.

Yet again, this grows very fast. There are 4'200 million possible four-byte files (256^4). There's 18 million million millions possible 8-byte files (256^8). - Collisions. If you manage to generate a file that matches the hash before the heat death of the universe, it may be a false positive. You can reduce this chance by making the hash larger or using multiple hashes, but you cannot guarantee anything.

###
#17
Members - Reputation: **1970**

Posted 27 April 2011 - 09:25 AM

What about SHA-1? Does this have that same phenomenon where a small change will produce the same hash?

SHA-1 is actually the compression algorithm I'm using. I was sticking to MD5 cuz I assumed no one here knew about SHA.

SHA-1 has effectively been replaced by SHA-2 (several different size varieties), so you haven't done a lot of research involving modern hashing techniques to begin with, but the examples still stands. Data is lost during the hashing. The bigger the file, the more data lost. The only way to confirm that you "guessed" right is to have the original uncompressed file in the first place (which would have defeated goal of the exercise).

Read up on concepts like prime number factorization. There is a case where you technically have all of the data at hand, unlike your example, and it still can take years of "guessing" to compute. 2 prime numbers that were multiplied together, resulting in a number that only has 4 possible whole divisors: itself, 1, and the 2 primes that were multiplied. It might seem simple until you realize were are talking about numbers are typically between 512 and 1024-bit numbers, e.g. - 1230186684530117755130494958384962720772853569595334792197322452151726400507263657518745202199786469389956474942774063845925192557326303453731548268507917026122142913461670429214311602221240479274737794080665351419597459856902143413. That is a base 10 representation of a 768-bit number. As I said before, you have to find the 2 numbers that were multiplied to make that (excluding 1 and itself). A team managed to do it

The effort took almost 2000 2.2GHz-Opteron-CPU years according to the submitters, just short of 3 years of calendar time.

What you are talking about is doing something similar, without all of the information needed to do it.

###
#19
Members - Reputation: **259**

Posted 27 April 2011 - 09:43 AM

Are you trolling us?

I was sticking to MD5 cuz I assumed no one here knew about SHA.

No I realize now it's a much better known hashing algorithm than I thought.

Probably the biggest deal for me is finding out that you can have small changes in the source resulting in the same hash.

That being the case makes me wonder if a better hashing algorithm that avoided this could be used for what I'm trying.

I actually know very little about the mechanics of the hashing algorithm itself. And I couldn't find much online. I'll keep looking.

One of the things that brought me onto this idea was mention of the fact that small changes produced totally different hashes. And of course I understood collisions had to occur but I assumed something like (small change) -> (same hash) would only occur those 1/2^128 types of times. Not due to fundamental flaws in the hashing algorithm's uniformity.

I would really like to see if there would be a way to work on a hash algorithm that was not focused on nonreversability but more focused on uniformity to avoid the (small change) -> (same hash) problem.

###
#20
Members - Reputation: **1970**

Posted 27 April 2011 - 09:51 AM

I actually know very little about the mechanics of the hashing algorithm itself. And I couldn't find much online. I'll keep looking.

If you can't find anything, then you are obviously trolling, because you aren't trying very hard. Wikipedia alone has tons of information as well as pseudocode implementations of both md5, sha-1, and sha-2 (all of which I have used to implement these in C++), along with links to the FIPS or RFC documents which explain the concepts and inner workings of them all.

You have asked several times "Is this infeasible", to which every time the answer has been "Yes, it is infeasible", but it is obviously falling off deaf ears.