Can't reverse new compression algorithm

Started by
21 comments, last by sooner123 12 years, 12 months ago
I'm working on a new type of compression algorithm and have run into some difficulties.

Basically what I do is this:

Given raw data: D
I compute the MD5 hash: H = MD5(D)
At this point I've compressed by a factor of Size(D) / Size(H) (which for any data larger than 128 bits is significant)

My problem is uncompressing the data.

Currently what I'm doing is as follows (keep in mind I'm working with bit values here):

G = {0}[trillion]

while (MD5(G) != H) G++;

I know this has the problem of unintentionally hashing the preceding zeros, but the main issue I'm running into is runtime.

So far, 20- to 30-bit lengths have run in feasible time, but when I tried this with a 32-bit guess, it took several minutes to verify.

My MD5 implementation may be inefficient, but it's also possible that this algorithm inherently suffers from a runtime of 2^n.
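
Here is a minimal sketch of that loop in Python with hashlib, just to pin down what I mean (the 16-bit test value is arbitrary):

import hashlib

def compress(data):
    # "Compression" by hashing -- this throws information away.
    return hashlib.md5(data).digest()

def decompress_bruteforce(h, n_bits):
    # Try every n-bit value until one hashes to h: worst case 2**n_bits MD5 calls.
    n_bytes = (n_bits + 7) // 8
    for candidate in range(2 ** n_bits):
        guess = candidate.to_bytes(n_bytes, "big")
        if hashlib.md5(guess).digest() == h:
            return guess
    raise ValueError("no n-bit preimage found")

# 16 bits finishes instantly; 32 bits is already ~4 billion MD5 calls,
# and every extra bit of input doubles the work.
# Note: even when this finds something, nothing guarantees it is the original data.
original = (0xBEEF).to_bytes(2, "big")
print(decompress_bruteforce(compress(original), 16) == original)   # True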

Any tips? Or is this simply an unfeasible compression algorithm?
I'm going to assume that this isn't a troll and that you're just ignorant of the behavior of hashing functions.

Hashes are non-unique. MD5 has been broken for ages now and it's trivial to generate two sets of data which hash to the exact same value.

Hashing is not compression.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

O...kay...

1) Hashes are designed to be extremely difficult to reverse from the outset.
2) Multiple, completely different data patterns can produce the same output hash. (Think about it.) So even if you find something that matches your hash, odds are you still have the wrong results.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.

Any tips? Or is this simply an unfeasible compression algorithm?

Correct, it is not a feasible compression algorithm.

I'm going to assume that this isn't a troll and that you're just ignorant of the behavior of hashing functions.

Hashes are non-unique. MD5 has been broken for ages now and it's trivial to generate two sets of data which hash to the exact same value.

Hashing is not compression.


Ok... but if the hash produces a 128-bit digest, that's 2^128 possible hash outputs. I would think the chance of two pieces of data producing the same hash would be 1 in 2^128... which is better than the safety guarantee of a redundant storage drive.

I'm aware it's possible, but to say (as a later post said) that the odds are in favor of me having the wrong result seems illogical.

Amount of possibilities (unhashed) / Amount of possibilities (hashed) is the ratio for seeing the likelihood of "guessing" the right data to give me my initial data.

And it's not like a hard drive will ever store 2^128 bits.. And even if it could, then you could just generate a longer hash.

What is it that doesn't work about this? Or what am I seeing wrong?
Amount of possibilities (unhashed) / Amount of possibilities (hashed) is the ratio for seeing the likelihood of "guessing" the right data to give me my initial data.

Correct. So consider storing a twenty-byte file. That's 2^160 possibilities (unhashed) against only 2^128 possibilities (hashed), i.e. a 2^128 / 2^160 = 2^-32, or about a 0.00000002328% chance of getting the right one.
Research information entropy.

In general, this is a bad approach: hash functions are designed to be extremely difficult to reverse, and reversing one is essentially what your algorithm has to do.
Since the result of the hash is smaller than the input data, you've lost some of that information. Although it might be a compression of sorts, it's lossy, and what's lost cannot be recovered. If you're expecting certain data back (say, an int between 0 and 5, but you get back 6), then you could perhaps use that knowledge to correct your "decompression", but it's pretty much just guessing, and you're better off working the other way around (using knowledge of the data to improve compression).
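
A small sketch of the "lossy" point: truncate MD5 to 8 bits (purely to make collisions easy to find; the inputs here are made up) and two different inputs quickly map to the same value, so the original can't be recovered from the digest alone. Full MD5 has the same property, just with astronomically many preimages per digest once inputs are larger than 128 bits.

import hashlib

def tiny_hash(data):
    # First byte of the MD5 digest -- a deliberately tiny 8-bit "hash".
    return hashlib.md5(data).digest()[0]

seen = {}
for i in range(10000):
    data = ("message-%d" % i).encode()
    h = tiny_hash(data)
    if h in seen:
        print("collision:", seen[h], "and", data, "both hash to", h)
        break
    seen[h] = data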

What is the purpose of this? Are you doing it just for the sake of doing it, trying something new? Or is there a specific problem you're trying to address? I believe some of the most efficient compression algorithms are those that can exploit specific data rather than general/raw data (like the compression for MP3s, which just cuts off inaudible frequencies).

Interested in Fractals? Check out my App, Fractal Scout, free on the Google Play store.

This is a perfect example of using the wrong tool for the job.

As others have said, MD5 is designed specifically to be a one-way function. Hash functions are meant to be used for things like identifying, indexing and comparing large data blocks, authentication, error checking (checksums), and so on. Hashing is not a compression algorithm; it only maps data of arbitrary size to a fixed-size value that is unique in practice, not in theory, and that comes with severe limitations.
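
For contrast, a sketch of the kind of job a digest is actually suited to: detecting duplicate or corrupted blocks, not reconstructing them (the blocks below are made-up examples, and MD5 is only acceptable for non-adversarial checks like this, given that it's cryptographically broken):

import hashlib

def digest(block: bytes) -> str:
    return hashlib.md5(block).hexdigest()

# Deduplication: the digest identifies a block, it does not replace it.
store = {}                                   # digest -> original block (still stored!)
for block in (b"hello world", b"hello world", b"another block"):
    store.setdefault(digest(block), block)
print(len(store))                            # 2 -- the duplicate was spotted by its digest

# Integrity check: a changed byte changes the digest.
print(digest(b"hello w0rld") == digest(b"hello world"))   # False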

True data compression relies on the fact that the number of bits required to represent a symbol is governed by its entropy. Simply speaking, a compressor performs statistical analysis on the input and assigns the fewest bits to the most frequently occurring symbols and more bits to the least frequent ones. In a nutshell, it converts information with potentially low entropy per bit into a new representation whose bits carry close to maximal entropy.
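
A rough illustration of that (Python's zlib standing in for a real compressor; the entropy figure is an estimate over the byte distribution only):

import math, os, zlib
from collections import Counter

def entropy_bits_per_byte(data):
    # Shannon entropy of the byte distribution (maximum is 8 bits per byte).
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

samples = {"repetitive": b"abababababab" * 1000, "random": os.urandom(12000)}
for name, data in samples.items():
    ratio = len(zlib.compress(data)) / len(data)
    print("%-10s %.2f bits/byte, compressed to %.1f%% of original" %
          (name, entropy_bits_per_byte(data), 100 * ratio))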
Latest project: Sideways Racing on the iPad

Or what am I seeing wrong?

Obligatory read for these kinds of posts: The pigeonhole principle.

Especially this section:

The pigeonhole principle arises in computer science. For example, collisions are inevitable in a hash table because the number of possible keys exceeds the number of indices in the array. No hashing algorithm, no matter how clever, can avoid these collisions. This principle also proves that any general-purpose lossless compression algorithm that makes at least one input file smaller will make some other input file larger. (Otherwise, two files would be compressed to the same smaller file and restoring them would be ambiguous.)
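The counting argument behind that last sentence is easy to check directly: for any length n there are fewer bit strings shorter than n than there are strings of length n, so no lossless scheme can shrink them all.

# Count bit strings of length n versus all strictly shorter bit strings.
n = 16
length_n = 2 ** n
shorter  = sum(2 ** k for k in range(n))   # = 2**n - 1

print(length_n)   # 65536
print(shorter)    # 65535 -- one short of giving every input its own smaller output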

This topic is closed to new replies.
