# Compression algorithm?

## Recommended Posts

luke2    100
I was randomly brainstorming a while ago and came up with this untested/unverified algorithm. A file of data is essentially a very, very big number, e.g. 109250092351. You can find a way to represent that very big number in little space, like 1.09x10^n (although that isn't an accurate idea).

I was thinking of adding up the total number for the file and then representing it with the expression 2^n +- a, where 2^n is the nearest power of two and a is the difference between that power of two and the actual value, which, when added together, gives the original 'number' of the file. This all depends on the idea that no number can be represented in more than one way in binary, or decimal. A large data file will probably require several '2^n +- a' groups, but it should be a trivial 16 bytes (8 bytes per int?) for every (I couldn't find a calculator that returned anything other than an overflow error) bytes.

Do any of you notice any glaring mistakes in my logic? Since I (a mere high schooler) thought this up randomly, in a few minutes, there is no way I could have 'discovered' a particularly good algorithm. Btw, I am aware of the occasional 'Hey, I've just made this 1337 compression algorithm, but I can't show it to you.'
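To make the idea concrete, here is a rough Python sketch of the scheme with a single group. The `encode`/`decode` helpers and the bit accounting are illustrative assumptions, not a tested implementation:

```python
def encode(value):
    n = value.bit_length() - 1                 # floor(log2(value))
    if value - (1 << n) > (1 << (n + 1)) - value:
        n += 1                                 # round to the *nearest* power of two
    a = value - (1 << n)                       # signed difference, may be negative
    return n, a

def decode(n, a):
    return (1 << n) + a

value = 109250092351
n, a = encode(value)
assert decode(n, a) == value                   # the representation is reversible

# Compare the bits needed to store the original number vs. the (n, a) pair:
orig_bits = value.bit_length()
pair_bits = n.bit_length() + abs(a).bit_length() + 1   # +1 for the sign of a
print(orig_bits, pair_bits)                    # prints: 37 42
```

Note that the pair comes out *larger* than the original, which is what the replies below are getting at: the difference a is almost as big as the number itself.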

##### Share on other sites
tendifo    146
I don't get it... you're trying to compress a single number like 109250092351? That number could easily be represented by a long long int, which is only 8 bytes long, so not that bad.

The problem with your idea is that the spacing between increments of 2^n becomes larger and larger, so the value of a will have to become larger and larger to reach the number being compressed.

I hope I understand what you're saying.

##### Share on other sites
Sneftel    1788
Most of us have "discovered" basically this algorithm at one time or another, although you've added a few obfuscating details. [smile] In your case, consider how many 16-bit groups it will take to store your average n-bit "big number". (Consider how many significant digits you can represent with each group.) If that number of groups is less than n/16, you've won! (You haven't won. Still, it's a good mental exercise.)

##### Share on other sites
Basiror    241
Probably one of the best compression algorithms is Huffman coding.

In order to encode some data with maximum compression, note that the lower bound on the average code length is the entropy H(P):
H(P) = - SUM_over_i( p_i * log_2(p_i) )

e.g.:
you have 3 symbols with probabilities:
p0 = 0.25
p1 = 0.275
p2 = 0.475

so your entropy is
H(P) = - (p0*log_2(p0) + p1*log_2(p1) + p2*log_2(p2))
= - (0.25*(-2) + 0.275*(-1.8625) + 0.475*(-1.074))
= 1.5223375

So you need at least 1.5223375 bits per symbol on average
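As a sanity check on that figure, a couple of lines of Python (the probabilities are the ones from the example above):

```python
import math

# Entropy H(P) = -sum(p_i * log2(p_i)) for the three example probabilities
probs = [0.25, 0.275, 0.475]
H = -sum(p * math.log2(p) for p in probs)
print(H)  # ~1.5223 bits per symbol
```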

This is achieved by building a Huffman tree.

You have the above 3 symbols; put them in a list, then repeatedly merge the 2 symbols with the least probability ...

so the list

p0
p1
p2

becomes:

p0p1
p2

now you have only 2 nodes left:

the lower probabilities go on the left side, the higher ones on the right side of a binary tree node (left = 0, right = 1):

```
      root
     0/  \1
    p2    *
        0/ \1
       p0   p1
```

p2 = BINARY(0)
p1 = BINARY(11)
p0 = BINARY(10)

This is an optimal prefix code, meaning no codeword is a prefix of another, so you can never decode the wrong symbol out of the binary stream

so encoding the symbols, e.g. s0 s1 s2 s2 s1 s0, gives:

10|11|0|0|11|10

To decode it you need to:

a) read the probabilities delivered in the header of the compressed chunk
b) reconstruct the Huffman tree
c) decode:
read bit 1: '1' -> (p0 | p1)
read bit 2: '0' -> (p0)
read bit 3: '1' -> (p0 | p1)
read bit 4: '1' -> (p1)
read bit 5: '0' -> (p2)
read bit 6: '0' -> (p2)
read bit 7: '1' -> (p0 | p1)
read bit 8: '1' -> (p1)
read bit 9: '1' -> (p0 | p1)
read bit 10: '0' -> (p0)

Here you go; it's simple, it's one of the most efficient lossless compression algorithms using variable-length encoding, it's already implemented in zlib, and it's the one that gets closest to the entropy H(P) (Shannon's source coding theorem).

I don't know if there are any better algorithms available; I personally don't know any, maybe some improved Huffman variants.

Hope that helps.
If you study computer science, you will discuss this topic a lot in the theory lectures.
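For the curious, the whole construction above fits in a few lines of Python. This is a minimal sketch, not zlib's implementation; the exact 0/1 assignment depends on tie-breaking, so only the code lengths are guaranteed to match the example:

```python
import heapq

def huffman_codes(freqs):
    # Heap entries are (probability, tiebreak, tree), where a tree is either
    # a symbol or a (left, right) pair of subtrees.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)    # the two least-probable subtrees...
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, count, (t1, t2)))   # ...get merged
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")    # left child = 0
            walk(tree[1], prefix + "1")    # right child = 1
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"s0": 0.25, "s1": 0.275, "s2": 0.475})
print(codes)  # s2 gets a 1-bit code, s0 and s1 get 2-bit codes
```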

##### Share on other sites
Zahlman    1682
Real compression relies on the fact that useful files generally aren't really random. We have to look for interesting patterns in the data, and accept that we will actually take a loss on the vast, vast majority of theoretically possible "files" (sets of data) - just not the ones we're interested in. :)

The idea is, there are 2^N possible files of length N bits, but only 2^N - 1 files of length less than N bits. So even if we only had to compress files of one exact length, we couldn't possibly win on *all* of them. And since we need to compress files of all lengths (OK, in practice we only have cases where N is a multiple of 8, but we can also only *emit* files with that restriction, and you can make a similar argument using base 256 instead of base 2), we can show that the only way to avoid *losing* on some cases is to break even on *every* case - i.e. do nothing at all.

In fact, a really optimized implementation of the proposed scheme actually turns out to do nothing, more or less. :)
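The counting argument is easy to verify mechanically; a tiny Python check (N = 8 is an arbitrary choice):

```python
# There are 2**N bit-strings of length exactly N, but only 2**N - 1 shorter
# ones (lengths 0 through N-1, including the empty string): one too few to
# give every length-N file a unique shorter "compressed" form.
N = 8
files_of_length_N = 2 ** N
shorter_files = sum(2 ** k for k in range(N))
print(files_of_length_N, shorter_files)  # prints: 256 255
```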

##### Share on other sites
Guest Anonymous Poster
Not everyone thinks compression should shrink the data; some people use the definition that talks about shrinking data with high redundancy and expanding data with low redundancy.
While there are only so many strings of length n, there are quite a lot more strings of length n + a, a > 0. Some compression algorithms could just use knowledge that would already be present on the target computer, so they might get away with shrinking any file.

luke2, you should just play with it and use a little programming to see the result of your algorithm; you'd see the problem more clearly.

Raghar

##### Share on other sites
Extrarius    1412
Quote:
 Original post by Anonymous Poster: Not everyone thinks compression should shrink the data; some people use the definition that talks about shrinking data with high redundancy and expanding data with low redundancy. While there are only so many strings of length n, there are quite a lot more strings of length n + a, a > 0. Some compression algorithms could just use knowledge that would already be present on the target computer, so they might get away with shrinking any file. luke2, you should just play with it and use a little programming to see the result of your algorithm; you'd see the problem more clearly. Raghar
You have 4 different files that contain exactly 2 bits. There are exactly two possible files with a length of exactly one bit. Compress all 4 files to 1 bit in length such that you can decompress each result to the appropriate original file. (Hint: 2 bits hold more information than 1 bit, and you can't compress 4 different files into 2 different files without losing information)

##### Share on other sites
kvp    196
The idea taken to its limits is called arithmetic coding. There are quite a few books about it, but it's basically a way to represent a file with a single very large number that requires less space than the original file. Huffman coding is just a special case with power-of-two components, and it's usually limited by the arbitrary 8-bits-per-byte rule. Arithmetic coding doesn't have these limits. (AFAIK IBM had the patent for it.)
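A toy illustration of the interval-narrowing idea in Python, using exact fractions (real arithmetic coders use fixed-precision integers with renormalization; the alphabet and probabilities here are made up for the example):

```python
from fractions import Fraction

probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}

def cumulative(probs):
    # Assign each symbol a sub-interval [low, high) of [0, 1)
    ranges, low = {}, Fraction(0)
    for sym, p in probs.items():
        ranges[sym] = (low, low + p)
        low += p
    return ranges

def encode(message):
    ranges = cumulative(probs)
    low, width = Fraction(0), Fraction(1)
    for sym in message:                 # narrow [low, low+width) per symbol
        s_low, s_high = ranges[sym]
        low, width = low + width * s_low, width * (s_high - s_low)
    return low                          # any number in the final interval works

def decode(code, length):
    ranges = cumulative(probs)
    out = []
    for _ in range(length):
        for sym, (s_low, s_high) in ranges.items():
            if s_low <= code < s_high:
                out.append(sym)
                code = (code - s_low) / (s_high - s_low)   # rescale, repeat
                break
    return "".join(out)

msg = "abacab"
assert decode(encode(msg), len(msg)) == msg
```

The entire message ends up as a single fraction, i.e. one (very precise) number, which is exactly the "file as a big number" idea taken seriously.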

##### Share on other sites
Bregma    9199
Quote:
 Original post by luke2: I was randomly brainstorming a while ago, and came up with this untested/unverified algorithm.
I mourn the loss of the days when every schoolchild had to learn the use of a slide rule. I have one sitting on the desk beside me, but more as a toy (as is the deer skull with full rack, the ukulele, and the dozen Chinese horses -- gotta personalize your cubicle somehow). You had to learn some very important numeracy skills to use a slide rule correctly.
Quote:
 A file of data is essentially a very very big number.

Yes. This is one of the fundamentals of computability theory, something usually addressed in a third-year university-level math course. Not only is every file essentially a huge integer, but so is every program, and in fact every state in every program. Google for Alonzo Church if you want more theory on the linear transformability of such spaces. A man named Kurt Goedel, a buddy of Albert Einstein, published a wonderfully elegant proof taking advantage of this idea to show that no system of logic is both consistent and complete, and a perfect illustration of that is known as the halting problem (also quite googleable).

So, you've independently hit on a very advanced concept that has a lot of useful implications. Keep it up. The programming world could use more good brains. Save yourself now: don't learn Perl or even try to read any Perl code. It would be a tragedy to lose a good mind at such a young age.
Quote:
 You can find a way to represent that very big number in little space. 1.09x10^n (although that isn't an accurate idea)

Whoa, wait a minute. Here's where the slide rule knowledge comes in. You have not represented the integer 109250092351 by 1.09x10^11. You've lost 9 significant digits. A file that's, say, 1 megabyte in size has about 2.5 million significant decimal digits.
Quote:
 I was thinking that by adding up the total number for the file, and then representing it with the equation 2^n +- a

No, your compression function would need to be m * 2^n + a. Wouldn't the number of bits used to store m, n, and a be greater than the number of bits used to store m * 2^n + a? Let's see.

Let's say, for the sake of simplicity (and I'm assuming that you were already doing this), m is 1. The original file would be stored in n bits. So a would require n-1 bits to store, and n would require log n bits to store. So the total number of bits to store the original n-bit file using your compression algorithm would be (n-1) + log n bits. For any useful file, log n > 1, so your proposed compression algorithm, as I understand it, will always result in more storage space than the original file.

Unless either I'm misunderstanding your algorithm or you're willing to truncate your file to save space. In which case, you could just truncate the file to save space.
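The arithmetic above is easy to check numerically; a short Python sketch (the file sizes are arbitrary examples):

```python
import math

# Storing the (n, a) pair for an n-bit file takes roughly (n - 1) + log2(n)
# bits, which exceeds n whenever log2(n) > 1, i.e. for any file over 2 bits.
results = {n: (n - 1) + math.ceil(math.log2(n)) for n in [8, 64, 1024, 8 * 2**20]}
for n, total in results.items():
    print(n, total, total > n)   # the "compressed" form is always bigger
```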

--smw

##### Share on other sites
alunharford    122
Quote:
 Original post by luke2: Do any of you notice any glaring mistakes in my logic? Since I (a mere high schooler) thought this up randomly, in a few minutes, there is no way I could have 'discovered' a particularly good algorithm.

Unfortunately so.
Let's call the content of your file n. So it's log n bits in length (rounded up to the nearest integer).

Now you take your compression algorithm and calculate the nearest power of two. Storing that exponent takes log (log n) bits.

You then have to store the difference, which is a number up to n/2. So the number of bits you need is log (n/2) = (log n) - 1.

So the total number of bits you need to store is approximately:
(log n) - 1 + log (log n)
That's bigger than log n for any significant file (I think the one case where it's smaller, n = 4, occurs because I didn't factor rounding into my calculations; maybe I've overlooked something small, but the idea is certainly correct).

Now, you may ask what happens if the difference is small? Can we use a variable-length field? Unfortunately, that doesn't help. The problem is that the probability of being able to save 1 bit with a variable-length field is 1/2, the probability of being able to save 2 bits is 1/4, 3 bits is 1/8, 4 bits is 1/16, ..., 16 bits is 1/65536, 32 bits is 1/4294967296. Since we have to store a length field, the probability of being lucky enough for it not to use more space is *very* small.
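That expected saving can be computed directly; a quick Python check (truncating the infinite sum at 63 terms, which is more than enough):

```python
# Expected saving from a variable-length difference field, per the argument
# above: save k bits with probability 2**-k, so E[saving] = sum(k * 2**-k) = 2.
expected_saving = sum(k * 2**-k for k in range(1, 64))
print(expected_saving)  # ~2 bits, far less than a length field would cost
```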