# Can't reverse new compression algorithm

## Recommended Posts

sooner123    269
I'm working on a new type of compression algorithm and have run into some difficulties.

Basically what I do is this:

Given raw data: D
I compute the MD5 hash: H = MD5(D)
At this point I've compressed by a factor of Size(D) / Size(H) (which for any data larger than 128 bits is significant)

My problem is uncompressing the data.

Currently what I'm doing is as follows (keep in mind I'm working with bit values here):

G = {0}[trillion]

while (MD5(G) != H) G++;

I know this has the problem of unintentionally hashing the preceding zeros but the main issue I'm running into is runtime.

So far 20-30 bit lengths have run in feasible time but when I tried to run this with a 32-bit guess, it took several minutes to verify.

My MD5 implementation may be inefficient but it's also possible that this algorithm suffers from a runtime of 2^n.

Any tips? Or is this simply an unfeasible compression algorithm?
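For what it's worth, the scheme described above can be sketched in a few lines of Python (a toy reconstruction, not the poster's actual code; it assumes the guess is interpreted as a fixed-width big-endian integer):

```python
import hashlib

def compress(data):
    """The 'compression' step: always 16 bytes out, regardless of input size."""
    return hashlib.md5(data).digest()

def decompress(digest, n_bytes):
    """Brute-force preimage search over every possible n_bytes-long input.
    Cost: up to 256**n_bytes MD5 evaluations -- exponential in the input size,
    which is why a 32-bit guess already takes minutes."""
    for i in range(256 ** n_bytes):
        guess = i.to_bytes(n_bytes, "big")
        if hashlib.md5(guess).digest() == digest:
            return guess
    return None
```

Even at millions of hashes per second, each extra input byte multiplies the search time by 256.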

##### Share on other sites
Washu    7829
I'm going to assume that this isn't a troll and that you're just ignorant of the behavior of hashing functions.

Hashes are non-unique. MD5 has been broken for ages now and it's trivial to generate two sets of data which hash to the exact same value.

Hashing is not compression.

##### Share on other sites
Promit    13246
O...kay...

1) Hashes are designed to be extremely difficult to reverse from the outset.
2) Multiple, completely different data patterns can produce the same output hash. (Think about it.) So even if you find something that matches your hash, odds are you still have the wrong results.
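A quick way to see point 2 in action is to shrink the hash: truncate MD5 to two bytes and a pair of different inputs with the same digest turns up almost immediately (a toy sketch; the function names are made up here, and finding full 128-bit collisions requires the dedicated MD5 attacks, not a naive search):

```python
import hashlib

def truncated_md5(data, n=2):
    """First n bytes of the MD5 digest -- a deliberately tiny hash so a
    collision can be found in milliseconds by brute force."""
    return hashlib.md5(data).digest()[:n]

def find_collision():
    """Hash successive integers until two different inputs share a digest."""
    seen = {}
    i = 0
    while True:
        msg = str(i).encode()
        h = truncated_md5(msg)
        if h in seen:
            return seen[h], msg
        seen[h] = msg
        i += 1

a, b = find_collision()  # two distinct inputs, identical truncated hash
```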

##### Share on other sites
frob    44920
[quote name='sooner123' timestamp='1303839219' post='4803158']
Any tips? Or is this simply an unfeasible compression algorithm?
[/quote]
Correct, it is not a feasible compression algorithm.

##### Share on other sites
sooner123    269
[quote name='Washu' timestamp='1303839988' post='4803160']
I'm going to assume that this isn't a troll and that you're just ignorant of the behavior of hashing functions.

Hashes are non-unique. MD5 has been broken for ages now and it's trivial to generate two sets of data which hash to the exact same value.

Hashing is not compression.
[/quote]

Ok.. but if the hash produces a 128-bit digest, that's 2^128 possible hash outputs. I would think the chances of two pieces of data producing the same hash would be 1 in 2^128.. which is better than the safety guarantee of a redundant storage drive.

I'm aware it's [i]possible[/i] but to say (as a later post said) that odds are in favor of me having the wrong result seems illogical.

Amount of possibilities (unhashed) / Amount of possibilities (hashed) is the ratio for seeing the likelihood of "guessing" the right data to give me my initial data.

And it's not like a hard drive will ever store 2^128 bits.. And even if it could, then you could just generate a longer hash.

##### Share on other sites
Sneftel    1788
[quote]Amount of possibilities (unhashed) / Amount of possibilities (hashed) is the ratio for seeing the likelihood of "guessing" the right data to give me my initial data.[/quote]
Correct. So consider storing a twenty-byte file. That's 2^160 possibilities (unhashed) / 2^128 possibilities (hashed) = 0.00000002328% chance of getting the right one.
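A quick sanity check of that arithmetic in Python:

```python
# Chance of guessing the original 20-byte (160-bit) file from its 128-bit hash.
unhashed = 2 ** 160  # number of possible 20-byte files
hashed = 2 ** 128    # number of possible 128-bit digests

chance_percent = hashed / unhashed * 100  # about 0.00000002328%
inputs_per_digest = unhashed // hashed    # 2**32 files share each digest
```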

##### Share on other sites
rip-off    10976
Research information entropy.

In general, this is a bad approach as hash functions are designed to be extremely difficult to reverse (which is essentially your algorithm).

##### Share on other sites
Nanoha    2682
Since the result of the hash is smaller than the input data, you've lost some of that information. Although it might be a compression of sorts, it's lossy; what's lost cannot be recovered. If you're expecting certain data back (say an int between 0 and 5 but you get back 6) then you could perhaps use that knowledge to correct your "decompression", but it's pretty much just guessing and you're better off working the other way around (using knowledge of the data to improve compression).

What is the purpose of this? Are you doing it just for the sake of doing it, trying something new? Or is there a specific problem you're trying to address? I believe some of the most efficient compression algorithms are those that can exploit specific data rather than general/raw data (like the compression for mp3s just cuts off inaudible frequencies).

##### Share on other sites
Tachikoma    575
This is a perfect example of using the wrong tool for the job.

As others have said, MD5 is designed specifically to be a [url="https://secure.wikimedia.org/wikipedia/en/wiki/One-way_function"]one-way function[/url]. Hash functions are meant to be used for things like identifying, indexing and comparing large data blocks, authentication, error checking (checksums), and so on. This is not a compression algorithm, it only maps data of arbitrary size to a (theoretically) unique value, but with severe limitations.

True data compression relies on the fact that the number of bits required to represent the output of an arbitrary input symbol is governed by its [url="https://secure.wikimedia.org/wikipedia/en/wiki/Information_theory"]entropy[/url]. Simply speaking, data compression will perform statistical analysis on the input, and assign the smallest number of bits for the most frequently occurring symbols, and more bits for the least frequent symbols. In a nutshell, the algorithm converts information with potentially low entropy to a new representation that has high entropy.
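The entropy bound described above can be computed directly; this is a minimal sketch of the Shannon entropy of a byte string (the average bits of information per symbol, and the lower bound on what any lossless code can achieve for that distribution):

```python
from collections import Counter
from math import log2

def shannon_entropy(data):
    """Shannon entropy in bits per symbol: -sum(p * log2(p)) over symbol
    frequencies. Low entropy means a lossless coder has room to compress."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A highly repetitive message carries little information per symbol...
low = shannon_entropy(b"aaaaaaab")
# ...while a uniform spread of all 256 byte values leaves nothing to squeeze.
high = shannon_entropy(bytes(range(256)))
```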

##### Share on other sites
Yann L    1802
[quote name='sooner123' timestamp='1303847102' post='4803199']
Or what am I seeing wrong?
[/quote]
Obligatory read for these kinds of posts: [url="http://en.wikipedia.org/wiki/Pigeonhole_principle"]The pigeonhole principle[/url].

Especially this section:
[quote]
The pigeonhole principle arises in [url="http://en.wikipedia.org/wiki/Computer_science"]computer science[/url]. For example, collisions are inevitable in a [url="http://en.wikipedia.org/wiki/Hash_table"]hash table[/url] because the number of possible keys exceeds the number of indices in the array. No hashing algorithm, no matter how clever, can avoid these collisions. This principle also proves that any general-purpose [url="http://en.wikipedia.org/wiki/Lossless_data_compression"]lossless compression algorithm[/url] that makes at least one input file smaller will make some other input file larger. (Otherwise, two files would be compressed to the same smaller file and restoring them would be ambiguous.)
[/quote]
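A minimal illustration of the principle: map all eight 3-bit inputs through any 2-bit "hash" and at least one output bucket must receive two inputs, no matter how the function is chosen.

```python
# 8 possible 3-bit inputs, only 4 possible 2-bit outputs: a collision
# is forced by counting alone (the pigeonhole principle).
def two_bit_hash(x):
    return x % 4  # any function into {0, 1, 2, 3} would do

buckets = {}
for x in range(8):
    buckets.setdefault(two_bit_hash(x), []).append(x)

# Every bucket with more than one entry is an unavoidable collision.
collision_buckets = [v for v in buckets.values() if len(v) > 1]
```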

##### Share on other sites
sooner123    269
[quote name='Nanoha' timestamp='1303851014' post='4803224']
What is the purpose of this? are you doing it just for the sake of doing it, trying somehtign new? or is there a specific problem your trying to address?
[/quote]

I just think I have a feasible idea and want to explore it. Maybe get my name out there. I've always been more into the theoretical side of things.

A lot of what you are all telling me seems to revolve around the idea that there are going to be collisions. I get this. The input data is of arbitrary size. The output is of a fixed size. And what matters is that fixed size will usually be smaller than the input size.

But:

These collisions are still a 1 in 2^128 chance right? So I guess what I'm saying is, if I augment G to a value that hashes correctly, aren't my chances (2^128 - 1) / 2^128 that I've got the right input data? I'd bet on those odds.

I get what one poster said about entropy. For example, replacing lots of instances of 000000 in your data with 1. (just a random example) But as the tradeoff, replacing instances of 1 with 00. Stuff like that. Kind of like the claims about the Dvorak keyboard. Or more to the point: [url="http://en.wikipedia.org/wiki/Huffman_coding"]http://en.wikipedia.org/wiki/Huffman_coding[/url]
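The Huffman coding linked above can be sketched compactly (a toy implementation for illustration; real coders build an explicit tree and handle serialization of the code table):

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a Huffman code: frequent symbols get short bit strings, rare
    ones get long bit strings. Each heap entry is (frequency, tiebreak,
    {symbol: code_so_far}); merging two entries prefixes '0'/'1'."""
    freq = Counter(data)
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate single-symbol input
        _, _, only = heap[0]
        return {s: "0" for s in only}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_code("aaaabbc")  # 'a' is most frequent, so its code is shortest
```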

I get that. But I think this is different. Instead of trying to play with the data itself and looking for trends or useless parts of the data.. it relies on statistics. Most hashing algorithms generate digests pretty uniformly. So context sensitive data or really similar data are going to be mapped to wildly different digests. So I still don't see the problem. We're still looking at a 1 in 2^128 that my correctly hashing G isn't equal to D.

I understand all your concerns. Fully. I agree that theoretically there are holes. They make it so this isn't a perfect compression algorithm. But it seems like in the lifetime of the universe, we won't reconstruct our data incorrectly. So what doesn't seem feasible about an algorithm that compresses terabytes of data (almost losslessly, (2^128 - 1) / 2^128 losslessly in fact) into 128 bits? Or petabytes or exabytes.

Not to mention we get verification of data integrity by simply hashing it at any time. But this is already done pretty regularly.

##### Share on other sites
Hodgman    51237
Every computer science student has this idea when they learn about hashing, it's not a new thought... and then they learn about compression.

[quote name='sooner123' timestamp='1303908243' post='4803516']These collisions are still a 1 in 2^128 chance right? So I guess what I'm saying is, if I augment G to a value that hashes correctly, aren't my chances (2^128 - 1) / 2^128 that I've got the right input data? I'd bet on those odds.[/quote]
No. Let's say the hash function is a 0/1 based on whether the number is odd or even -- there's a 50% chance of a collision, and by knowing the hash, you've only discarded 50% of possible inputs.

If the input is a 4-byte unsigned integer (range: 4 billion), then you've got a 1 in (4 billion * 0.5) chance of guessing the right answer. Or, 1 in 2 billion. As the input size goes up, your chance of a correct guess goes down, dramatically.

Sneftel pointed this out earlier:
[quote name='Sneftel' timestamp='1303847493' post='4803201'][quote]Amount of possibilities (unhashed) / Amount of possibilities (hashed) is the ratio for seeing the likelihood of "guessing" the right data to give me my initial data.[/quote]Correct. So consider storing a twenty-byte file. That's 2^160 possibilities (unhashed) / 2^128 possibilities (hashed) = 0.00000002328% chance of getting the right one.[/quote]

Let's say the 128-bit hash is perfect.
With a 128-bit input or less, you'll guess correctly.
With a 129-bit input, you'll have a 1 in 2 chance.
With a 256-bit input, worse than a 1 in 10^38 chance.
With a 128-byte input, you'll have worse than a 1 in 10^269 chance.
With a 1KB input, you've got worse than a 1 in 10^2427 chance.
With a 1MB input, it becomes worse than 1 in 10^2525184.
With a 1GB input, it becomes worse than 1 in 10^(10^9.4).
In comparison, there are approximately 10^80 atoms in the universe.

[quote]I understand all your concerns. Fully. I agree that theoretically there are holes. They make it so this isn't a perfect compression algorithm. But it seems like in the lifetime of the universe, we won't reconstruct our data incorrectly.[/quote]
No, the problem is that it [b]will[/b] take you the lifetime of the universe to reconstruct the correct data.

To even reconstruct a 128-byte file correctly from its 128-bit hash, you've got to make on average 1/4th of a googol^2.7 guesses before striking it lucky (assuming you can distinguish a correct guess from an incorrect one).
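Those odds can be reproduced in a couple of lines (a sketch; it just computes how many distinct inputs share each 128-bit digest on average, as a power of ten):

```python
from math import log10

HASH_BITS = 128

def log10_inputs_per_digest(input_bits):
    """log10 of the average number of distinct inputs colliding on each
    128-bit digest -- i.e. the 'worse than 1 in 10^x' odds of guessing
    the right preimage."""
    return (input_bits - HASH_BITS) * log10(2)

odds_128_bytes = log10_inputs_per_digest(128 * 8)        # ~10^269
odds_1kb = log10_inputs_per_digest(1024 * 8)             # ~10^2427
odds_1mb = log10_inputs_per_digest(1024 * 1024 * 8)      # ~10^2525184
```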

##### Share on other sites
sooner123    269
Is it that I need to be assigning my "Guess" string randomly instead of incrementing by 1? I can see how incrementing from 0 to 2^x would take x guesses. Making it so if you have a terabyte file, it would take 8 trillion guesses (8 bits/byte * 1 trillion bytes)

Or is the collision the serious issue?

I would think a human could easily observe the data to see if it was some random data that just happened to collide with the correct data. Aka... a video file vs. a random sequence of bits and bytes.

Therefore giving some context to the data would be the solution. Compress per file:

record: MD5(file), context

example:

Then when you hit a collision, it'll know because it can't be interpreted as an MP3 file.

##### Share on other sites
Rattrap    3385
[quote name='sooner123' timestamp='1303915496' post='4803558']
Is it that I need to be assigning my "Guess" string randomly instead of incrementing by 1? I can see how incrementing from 0 to 2^x would take x guesses. Making it so if you have a terabyte file, it would take 8 trillion guesses (8 bits/byte * 1 trillion bytes)

Or is the collision the serious issue?

I would think a human could easily observe the data to see if it was some random data that just happened to collide with the correct data. Aka... a video file vs. a random sequence of bits and bytes.
[/quote]

See [[url="http://en.wikipedia.org/wiki/Md5#Collision_vulnerabilities"]Md5 - Collision_vulnerabilities[/url]] - They show an example where only 4 bytes (same lengths still) are changed in a string of text and yet they both produce the same hash. In a case like this, without already having the correct file available to confirm it, how could you possibly guess which version was the original (assuming you even got it back to one of the results in the first place, which as many have said before, probability says you will never do). Even if as you said a human was looking at it, I don't think they would be able to know.

I'll also point you to another similar post a while back - [[url="http://www.gamedev.net/topic/572410-odd-compression-idea/page__p__572410__fromsearch__1#entry572410"]Gamedev - odd compression idea[/url]]. Different "compression" method, same concept. You'll see very similar posts then as what is being posted now (from both poster and those who responded).

##### Share on other sites
sooner123    269
[quote name='Rattrap' timestamp='1303916408' post='4803565']
[quote name='sooner123' timestamp='1303915496' post='4803558']
Is it that I need to be assigning my "Guess" string randomly instead of incrementing by 1? I can see how incrementing from 0 to 2^x would take x guesses. Making it so if you have a terabyte file, it would take 8 trillion guesses (8 bits/byte * 1 trillion bytes)

Or is the collision the serious issue?

I would think a human could easily observe the data to see if it was some random data that just happened to collide with the correct data. Aka... a video file vs. a random sequence of bits and bytes.
[/quote]

See [[url="http://en.wikipedia.org/wiki/Md5#Collision_vulnerabilities"]Md5 - Collision_vulnerabilities[/url]] - They show an example where only 4 bytes (same lengths still) are changed in a string of text and yet they both produce the same hash. In a case like this, without already having the correct file available to confirm it, how could you possibly guess which version was the original (assuming you even got it back to one of the results in the first place, which as many have said before, probability says you will never do). Even if as you said a human was looking at it, I don't think they would be able to know.

I'll also point you to another similar post a while back - [[url="http://www.gamedev.net/topic/572410-odd-compression-idea/page__p__572410__fromsearch__1#entry572410"]Gamedev - odd compression idea[/url]]. Different "compression" method, same concept. You'll see very similar posts then as what is being posted now (from both poster and those who responded).
[/quote]

What about SHA-1? Does this have that same phenomenon where a small change will produce the same hash?

SHA-1 is actually the compression algorithm I'm using. I was sticking to MD5 cuz I assumed no one here knew about SHA.

##### Share on other sites
Zao    971
[quote name='sooner123' timestamp='1303915496' post='4803558']
Is it that I need to be assigning my "Guess" string randomly instead of incrementing by 1? I can see how incrementing from 0 to 2^x would take x guesses. Making it so if you have a terabyte file, it would take 8 trillion guesses (8 bits/byte * 1 trillion bytes)
[/quote]

There are two distinct problems with your approach:
[list=1][*]Generating all possible files quickly grows unfeasible - there are b^n possible sequences, where b is the number of choices per element (256 for a byte) and n is the number of elements in the sequence. This grows very fast: there are about 4,300 million possible four-byte files (256^4) and about 18 million million million possible 8-byte files (256^8).[*]Collisions. If you manage to generate a file that matches the hash before the heat death of the universe, it may be a false positive. You can reduce this chance by making the hash larger or using multiple hashes, but you cannot guarantee anything.[/list]Your method grows unfeasible already far below the size of the hash, so there's _no_ space gain at all, just a massive waste of time.
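Checking those counts:

```python
# Counting possible files of n bytes: 256 ** n outgrows the 2**128
# digest space long before files get interestingly large.
four_byte_files = 256 ** 4    # about 4,300 million
eight_byte_files = 256 ** 8   # about 1.8 * 10**19
digests = 2 ** 128

# A 16-byte file space exactly equals the digest space; 17 bytes exceeds it,
# so any "gain" is gone before the input is even two digests long.
```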

##### Share on other sites
Rattrap    3385
[quote name='sooner123' timestamp='1303916791' post='4803567']
What about SHA-1? Does this have that same phenomenon where a small change will produce the same hash?

SHA-1 is actually the compression algorithm I'm using. I was sticking to MD5 cuz I assumed no one here knew about SHA.
[/quote]

SHA-1 has effectively been replaced by SHA-2 (several different size varieties), so you haven't done a lot of research involving modern hashing techniques to begin with, but the example still stands. Data is lost during the hashing. The bigger the file, the more data lost. The only way to confirm that you "guessed" right is to have the original uncompressed file in the first place (which would defeat the goal of the exercise).

Read up on concepts like prime number factorization. There is a case where you technically have all of the data at hand, unlike your example, and it can still take years of "guessing" to compute: 2 prime numbers that were multiplied together, resulting in a number that has only 4 possible whole divisors: itself, 1, and the 2 primes that were multiplied. It might seem simple until you realize we are talking about numbers that are typically between 512 and 1024 bits, e.g. 1230186684530117755130494958384962720772853569595334792197322452151726400507263657518745202199786469389956474942774063845925192557326303453731548268507917026122142913461670429214311602221240479274737794080665351419597459856902143413. That is a base-10 representation of a 768-bit number. As I said before, you have to find the 2 numbers that were multiplied to make that (excluding 1 and itself). A team managed to do it:

[quote]The effort took almost 2000 2.2GHz-Opteron-CPU years according to the submitters, just short of 3 years of calendar time.[/quote]

What you are talking about is doing something similar, without all of the information needed to do it.
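As a toy illustration of the factoring problem (trial division only, on a made-up small semiprime; the RSA-768 result above needed the number field sieve, not anything like this):

```python
def factor_semiprime(n):
    """Trial division up to sqrt(n). Fine for toy numbers, but the work
    grows roughly with sqrt(n) -- hopeless at 768-bit sizes."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # n is prime (or 1)

p, q = factor_semiprime(101 * 103)  # recovers the two prime factors
```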

##### Share on other sites
rip-off    10976
[quote]
I was sticking to MD5 cuz I assumed no one here knew about SHA.
[/quote]
Are you trolling us?

##### Share on other sites
sooner123    269
[quote name='rip-off' timestamp='1303918139' post='4803573']
[quote]
I was sticking to MD5 cuz I assumed no one here knew about SHA.
[/quote]
Are you trolling us?
[/quote]

No, I realize now it's a much better known hashing algorithm than I thought.

Probably the biggest deal for me is finding out that you can have small changes in the source resulting in the same hash.

That being the case makes me wonder if a better hashing algorithm that avoided this could be used for what I'm trying.

I actually know very little about the mechanics of the hashing algorithm itself. And I couldn't find much online. I'll keep looking.

One of the things that brought me onto this idea was mention of the fact that small changes produced totally different hashes. And of course I understood collisions had to occur, but I assumed something like (small change) -> (same hash) would only occur with that 1 in 2^128 sort of probability, not due to fundamental flaws in the hashing algorithm's uniformity.

I would really like to see if there would be a way to work on a hash algorithm that was not focused on nonreversibility but more focused on uniformity to avoid the (small change) -> (same hash) problem.

##### Share on other sites
Rattrap    3385
[quote name='sooner123' timestamp='1303918985' post='4803576']
I actually know very little about the mechanics of the hashing algorithm itself. And I couldn't find much online. I'll keep looking.
[/quote]

If you can't find anything, then you are obviously trolling, because you aren't trying very hard. Wikipedia alone has tons of information as well as pseudocode implementations of md5, sha-1, and sha-2 (all of which I have used to implement these in C++), along with links to the FIPS or RFC documents which explain the concepts and inner workings of them all.

You have asked several times "Is this infeasible", to which every time the answer has been "Yes, it is infeasible", but it is obviously falling on deaf ears.

##### Share on other sites
sooner123    269
[quote name='Rattrap' timestamp='1303919516' post='4803579']
[quote name='sooner123' timestamp='1303918985' post='4803576']
I actually know very little about the mechanics of the hashing algorithm itself. And I couldn't find much online. I'll keep looking.
[/quote]

If you can't find anything, then you are obviously trolling, because you aren't trying very hard. Wikipedia alone has tons of information as well as pseudocode implementations of md5, sha-1, and sha-2 (all of which I have used to implement these in C++), along with links to the FIPS or RFC documents which explain the concepts and inner workings of them all.

You have asked several times "Is this infeasible", to which every time the answer has been "Yes, it is infeasible", but it is obviously falling on deaf ears.
[/quote]

Sorry to waste your time then. It was an idea. Nothing more.

(have a lot to learn) != (trolling) imo.

It's hard to know the right places to look and what to look for when you are just getting into this stuff.

Even so, [url="http://en.wikipedia.org/wiki/MD5#Pseudocode"]http://en.wikipedia.org/wiki/MD5#Pseudocode[/url] is pretty damn cryptic to me.

I won't lie though. All the links and references so far in this thread have taught me a lot.

I'm still not convinced things can't be done the way I'm suggesting. But I can see that MD5 is not the way to do it.

##### Share on other sites
KulSeran    3267
[quote name='sooner123' timestamp='1303918985' post='4803576']
I would really like to see if there would be a way to work on a hash algorithm that was not focused on nonreversibility but more focused on uniformity to avoid the (small change) -> (same hash) problem.
[/quote]
That wouldn't change anything. All a hash does is mix the bits of the source data to attempt to come up with a "unique" number that identifies that data, but in doing so, it throws away information. In throwing away information, you ensure that your hash isn't really unique. It is just hard to make a file that has the same hash. Hard, but not impossible. If you have a 4-byte perfect hash and hash a 5-byte string, there are 256 possible 4-byte hashes depending on the value of the 5th byte. Without knowing what the 5th byte's value is, you can't figure out the value of the first 4 bytes by using the hash function. Now, compound that problem when you add more bytes.

To give an example from school, we were shown how you can take a C++ "hello world" program and make it say "goodbye world" but have the same CRC32 hash value. Simple: replace the string, and toss a /* garbage comment */ at the end of the file. Fill the comment with bits till it comes up with the same hash. Without a visual inspection by a human, you'd never know the file had been tampered with. You can compile it to verify that it works, and the same CRC32 hides the fact that you did tamper with it. You can do the same thing for most files. You can edit a JPG image, and then fill the comments with garbage till the hash is the same. With a bitmap you could make some big edits, then twiddle around with the least significant bits of every pixel, as the human eye isn't going to notice the differences. Now you have an image and a shopped image that both have the same SHA-256 hash. A good shop is hard to tell from the original, and they have the same hash, so which image is the one you were looking for?
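That comment-stuffing trick can be demonstrated with a deliberately shrunken checksum so the search finishes instantly (a toy sketch: `zlib.crc32` truncated to 16 bits; the same brute-force idea works against full CRC32, just more slowly, and CRC is not cryptographic either way):

```python
import zlib

def crc16(data):
    """CRC32 cut down to 16 bits -- tiny on purpose, so forging a match
    takes at most 65,536 tries instead of ~4 billion."""
    return zlib.crc32(data) & 0xFFFF

original = b'print("hello world")'
target = crc16(original)

# Tamper with the message, then pad a trailing comment until the
# checksum matches the original's again.
tampered = b'print("goodbye world")  # '
suffix = 0
while crc16(tampered + str(suffix).encode()) != target:
    suffix += 1
forged = tampered + str(suffix).encode()  # different content, same checksum
```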

##### Share on other sites
sooner123    269
Wow. I really feel like a complete idiot.

Everything you guys have been saying was going right over my head.

Not sure why but something just clicked.

If I had 256-bit data and hashed it to a 128-bit hash, I'd have to check about 2^128 guesses just to find a collision with even probability (assuming the hash function has a uniform range)

And even then.. there would be on average about 2^128 collisions ( 2^(256 - 128) )
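In numbers, that realization checks out:

```python
inputs_256bit = 2 ** 256   # possible 256-bit messages
digests_128bit = 2 ** 128  # possible 128-bit hashes

# Average number of distinct 256-bit inputs colliding on each digest:
# 2**(256 - 128) = 2**128, so a matching guess is almost never the original.
collisions_per_digest = inputs_256bit // digests_128bit
```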

edit: I think the "how long this will take to run even when implemented efficiently" part of my brain simply turned off.
edit: and also.. no matter how uniform the hash function is.. obviously there should still (statistically speaking) be tons of instances of (small change) -> (same hash), rendering this utterly infeasible. Edited by sooner123