Public Group

# Binary search as compression method

This topic is 3341 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

I had an odd idea of trying to use a binary search as a compression algorithm. The basic idea is that given say a 32 bit integer you can find it by using a binary search in at most, 33 iterations. You can cut this down to 31 iterations and use 2 bits to specify which of the four remaining integers is the correct one. So need to know information is: 1. how many iterations to take 2. what 'section' of the possible domain the number is in 3. which end of the section the number is closest to 4. and of the final 4 what one is correct. Item 4 requires 2 bits, item 3 requires 1 bit, item 2 requires 5 bits and item 1 requires 5 bits for a total of 13 bits. An example: say you have the integer 2055 The section is equal to the least number of bits required to store the integer, in this case 12. This puts us in the section 4096-2048. 2055 is closer to 2048, so we mark down 0 for item number 3. Now we know that the integer lies between 3072 and 2048. Doing a binary search, we can get this down to four remaining possibilities in 8 iterations. The four possibilities are 2056(00), 2055(01), 2054(10) and 2053(11). Therefor we need to store 0(binary digit), 01(2 binary digits), section number 12, and 8 iterations. Am I messing something up, or is this a viable algorithmic compression scheme?

##### Share on other sites
It's similar to arithmetic coding.

Quote:
 Therefor we need to store 0(binary digit), 01(2 binary digits), section number 12, and 8 iterations.

Sadly, this information will take, on average, 32 bits to store (for 32 bit integers of equal probabilities).

In your case it would be simpler to just encode the needed bits directly. Integer can be 32 bits long (encoded in 5 bits), the remainder are the actual bits of the integer. So 5 bits + 12 bits could encode 2055 (and up to 4095) in 17 bits.

Also, integers can be considered a form of a binary tree. Consider the following:
              |       +------+-------+       |              | +-----+-----+  +-----+-----+ |           |  |           | 0           1  2           3
Parse the bits of an int one by one. If the bit is 0, move left. If the bit is 1, move right.

If the tree is leaf, print the number and stop.

2 in binary is 10. In above tree, you'd first move right, then left, then print two.

##### Share on other sites
You have to encode the binary search as bits.

The problem comes when needing to tell the difference between 0001 and 01. It requires extra information, basically the bits you are elminating.

I went through this a while back. tho it was in the range of 8 bits. I found I could use at most 7 bits (basically the path you took through the binary tree) to encode most of the numbers, yah there's edge cases, but I figured id handle those separately.

I figured if I could eliminate 1 bit per iteration of encoding, it be a pretty damn good recursive compression alg.

It just doesnt work in the end.

What I was trying sounds along the lines of what you're trying.

##### Share on other sites
If you want to make a compression algorithm, make sure you understand the pigeonhole principle.

From wikipedia: Given two natural numbers n and m with n > m, if n items are put into m pigeonholes, then at least one pigeonhole must contain more than one item. This principle proves that any general-purpose lossless compression algorithm that makes at least one input file smaller will make some other input file larger. Otherwise, two files would be compressed to the same smaller file and restoring them would be ambiguous.

##### Share on other sites
First make sure you understand what Wiggin posted above. Then notice that it doesn't mean that compression is impossible. This is how lossless compression always works: Of all the possible inputs, some are more probable than others, so if we encode the more probable ones with shorter representations and the less probable ones with longer representations, we can make the average representation length smaller than the original length of the message.

In other words, you need to understand something about what things happen often and what things almost never happen in order to be able to compress (express common things with few bits).

[Edited by - alvaro on October 26, 2009 2:32:40 PM]

1. 1
2. 2
Rutin
19
3. 3
khawk
18
4. 4
5. 5

• 9
• 12
• 16
• 26
• 10
• ### Forum Statistics

• Total Topics
633769
• Total Posts
3013759
×