Public Group

# Can't reverse new compression algorithm

This topic is 2785 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

I'm working on a new type of compression algorithm and have run into some difficulties.

Basically what I do is this:

Given raw data: D
I compute the MD5 hash: H = MD5(D)
At this point I've compressed by a factor of Size(D) / Size(H) (which for any data larger than 128-bits is significant)

My problem is uncompressing the data.

Currently what I'm doing is as follows: (keep in mind i'm working with bit values here)

G = {0}[trillion]

while (MD5(G) != H) G++;

I know this has the problem of unintentionally hashing the preceding zeros but the main issue I'm running into is runtime.

So far 20-30 bit lengths have run in feasible time but when I tried to run this with a 32-bit guess, it took several minutes to verify.

My MD5 implementation may be inefficient but it's also possible that this algorithm suffers from a runtime of 2^n.

Any tips? Or is this simply an unfeasible compression algorithm?

##### Share on other sites
I'm going to assume that this isn't a troll and that you're just ignorant of the behavior of hashing functions.

Hashes are non-unique. MD5 has been broken for ages now and it's trivial to generate two sets of data which hash to the exact same value.

Hashing is not compression.

##### Share on other sites
O...kay...

1) Hashes are designed to be extremely difficult to reverse from the outset.
2) Multiple, completely different data patterns can produce the same output hash. (Think about it.) So even if you find something that matches your hash, odds are you still have the wrong results.

##### Share on other sites

Any tips? Or is this simply an unfeasible compression algorithm?

Correct, it is not a feasible compression algorithm.

##### Share on other sites

I'm going to assume that this isn't a troll and that you're just ignorant of the behavior of hashing functions.

Hashes are non-unique. MD5 has been broken for ages now and it's trivial to generate two sets of data which hash to the exact same value.

Hashing is not compression.

Ok.. but if the hash produces a 128-bit digest, that's 2^128 possible hash outputs. I would think the chances of two bits of data producing the same hash would be 1 in 2^128.. which is better than the safety guarantee of a a redundant storage drive.

I'm aware it's possible but to say (as a later post said) that odds are in favor of me having the wrong result seems illogical.

Amount of possibilities (unhashed) / Amount of possibilities (hashed) is the ratio for seeing the likelihood of "guessing" the right data to give me my initial data.

And it's not like a hard drive will ever store 2^128 bits.. And even if it could, then you could just generate a longer hash.

##### Share on other sites
[color=#1C2837][size=2]Amount of possibilities (unhashed) / Amount of possibilities (hashed) is the ratio for seeing the likelihood of "guessing" the right data to give me my initial data.[/quote]
[color=#1C2837][size=2]Correct. So consider storing a twenty-byte file. That's 2^160 possibilities (unhashed) / 2^128 possibilities (hashed) = 0.00000002328% chance of getting the right one.
[color="#1C2837"]

##### Share on other sites
Research information entropy.

In general, this is a bad approach as hash functions are designed to be extremely difficult to reverse (which is essentially your algorithm).

##### Share on other sites
Since the result of the hash is smaller than the input data, you've lost some of that information. Although it might be a compression of sorts its lossy, whats lost cannot be recovered. If your expecting certain data back (say an int between 0 and 5 but you get back 6) then you could perhaps use that knowledge to correct your "decompression" but its pretty much just guessing and your better off working the other way around (using knowledge of the data to improve compression).

What is the purpose of this? are you doing it just for the sake of doing it, trying somehtign new? or is there a specific problem your trying to address? I believe some of the most efficient compression algorithms are those that can exploit specific data rather than general/raw data (like the compression for mp3s just cuts off inaudiable frequencies).

##### Share on other sites
This is a perfect example of using the wrong tool for the wrong job.

As others have said, MD5 is designed specifically to be a one-way function. Hash functions are meant to be used for things like identifying, indexing and comparing large data blocks, authentication, error checking (checksums), and so on. This is not a compression algorithm, it only maps data of arbitrary size to a (theoretically) unique value, but with severe limitations.

True data compression relies on the fact that the number of bits required to represent the output of an arbitrary input symbol is governed by its entropy. Simply speaking, data compression will perform statistical analysis on the input, and assign the smallest number of bits for the most frequently occurring symbols, and more bits for the least frequent symbols. In a nutshell, the algorithm converts information with potentially low entropy to a new representation that has high entropy.

##### Share on other sites

Or what am I seeing wrong?

Obligatory read for these kinds of posts: The pigeonhole principle.

Especially this section:

The pigeonhole principle arises in computer science. For example, collisions are inevitable in a hash table because the number of possible keys exceeds the number of indices in the array. No hashing algorithm, no matter how clever, can avoid these collisions. This principle also proves that any general-purpose lossless compression algorithm that makes at least one input file smaller will make some other input file larger. (Otherwise, two files would be compressed to the same smaller file and restoring them would be ambiguous.)
[/quote]

1. 1
2. 2
3. 3
Rutin
15
4. 4
khawk
14
5. 5
frob
12

• 9
• 11
• 11
• 23
• 12
• ### Forum Statistics

• Total Topics
633662
• Total Posts
3013228
×