A Thought On Compression: A Small Theory



Ok, so I was pretty bored today after stumbling a bit on my rendering engine and had a thought. It hurt :)

Pure Text Compression
=====================

And what a powerful way of doing it this would be... So, I was thinking along the lines of a binary tree to compress a sentence. Suppose you have the sentence:

Dave went to the market and bought a sheep.

Now suppose each word becomes a bottom node on a single binary tree, so:

      .       .        .        .       .
     / \     / \      / \      / \     / \
  Dave went to the market and bought a sheep. (nil)

If the compressor had a list of the most common words, say the top 1000, and also a list of say the top 100 word pairs (in this case something like 'to the'), each word or word pair could be replaced by a single ASCII char or a pair of them (depending on the index within the list), each with a possible value of 0-255. Therefore this allows a lot of scope for the list lengths, since so many different entries can be represented by either one or two chars.

Ok, so compression. So far I have talked a bit about a simple conversion from the actual word to some sort of integer representing its position within a fixed list. If the word does not exist within the list, it could simply be left as it is and the rest of the file converted around it.

Right, so to further compress this file: once all the individual words are converted to integers, you could look for the pairs of words I mentioned earlier. If you come across a pair of words that exists in the paired-word list, you could simply move back up the binary tree and replace the node above with the integer that represents this word pair. So assuming 'to the' is in the paired-words list, you might have:

   (nil)    124      (nil)    (nil)   (nil)
     / \     / \      / \      / \     / \
  Dave went to the market and bought a sheep. (nil)

Ok, so as you can see, the 'to the' pairing might well be item 124 in the paired-words list, so it would be placed in the tree like so.
I would then expect that the 'to' and 'the' nodes would be deleted, and of course by this stage we would actually be working entirely with numbers for efficiency, so the tree might now look something like:

   (nil)    124      (nil)    (nil)   (nil)
     / \     / \      / \      / \     / \
    23  54  3   1   12   24  67   45  55 (nil)

Now, in order to store this representation you would have to keep track of which 'level', or generation, within the binary tree each number is stored on. Otherwise each number could be on any level and you wouldn't know how to convert the numbers back to words, because you wouldn't know how many stages of compression they went through.

Ok, so that's basically it. I am aware it most likely has a lot of flaws, but it was only a thought. I'm sure some useful compressor could be derived from this theory, and I already have an example on the way, but it's a fair way off. I have no doubt that someone else has thought of this. Any comments?

ace
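The scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: the tiny word and pair lists here stand in for the "top 1000 words / top 100 pairs" dictionaries, and the function names are invented for the example.

```python
# Illustrative word/word-pair dictionary coder (a sketch, not a real codec).
# These lists stand in for the shared "top 1000 words / top 100 pairs".
COMMON_WORDS = ["the", "and", "to", "a", "went", "market", "bought"]
COMMON_PAIRS = [("to", "the")]

def encode(sentence):
    tokens = sentence.split()
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMMON_PAIRS:
            # One code stands for the whole pair, like the '124' node above.
            out.append(("P", COMMON_PAIRS.index(pair)))
            i += 2
        elif tokens[i] in COMMON_WORDS:
            out.append(("W", COMMON_WORDS.index(tokens[i])))
            i += 1
        else:
            # Unknown words are left as-is, as the post suggests.
            out.append(("L", tokens[i]))
            i += 1
    return out

def decode(codes):
    words = []
    for kind, value in codes:
        if kind == "P":
            words.extend(COMMON_PAIRS[value])
        elif kind == "W":
            words.append(COMMON_WORDS[value])
        else:
            words.append(value)
    return " ".join(words)

sentence = "Dave went to the market and bought a sheep."
assert decode(encode(sentence)) == sentence
```

Here 'Dave' and 'sheep.' fall outside the word list and pass through as literals, while 'to the' collapses into a single pair code — the same move as replacing two leaf nodes with one parent node in the tree.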

You might be interested in Huffman coding which is used in some form in common compression algorithms. It's a similar idea to what you're thinking but it works for any type of data.
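To make the pointer concrete, here is a minimal Huffman-coding sketch (the function and its structure are my own illustration, not from any particular library): frequent symbols end up with short bit strings, rare symbols with long ones.

```python
# Minimal Huffman code construction using a priority queue of subtrees.
import heapq
from collections import Counter

def huffman_codes(text):
    # Each heap entry is [frequency, tie-breaker, {symbol: code-so-far}].
    heap = [[freq, i, {sym: ""}]
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        # Merging two subtrees prefixes '0' to one side's codes, '1' to the other's.
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], count, merged])
        count += 1
    return heap[0][2]

codes = huffman_codes("Dave went to the market and bought a sheep.")
# Frequent symbols (like ' ') get codes no longer than rare ones (like 'D'),
# and no code is a prefix of another, so the bit stream decodes unambiguously.
```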

PPM essentially does this, but on the character level, thus making it generic enough to be used for any binary data.
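PPM's core idea can be sketched without the arithmetic coder that a real implementation needs: keep counts of which character follows each context, so likely next characters can be given short codes. This is only the modelling half, written as an assumption-laden illustration (the order-1 model here omits PPM's escape/fallback machinery).

```python
# Order-k context model: the statistical heart of PPM, minus the entropy coder.
from collections import Counter, defaultdict

def build_model(text, order=1):
    ctx_counts = defaultdict(Counter)
    for i in range(len(text) - order):
        # Record which character followed each length-`order` context.
        ctx_counts[text[i:i + order]][text[i + order]] += 1
    return ctx_counts

model = build_model("Dave went to the market and bought a sheep.")
# After 't' this text has seen ' ' three times, 'o' once, and 'h' once;
# a real PPM coder assigns shorter codes to the more probable successors.
```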

ASCII chars are a huge waste of space. Look at Huffman encoding; it's pretty simple.

Yeah, it's basically Huffman and Lempel-Ziv type compression, and it's already been done (in fact most general compression libraries can do it). The only thing of note here is the attempt to reduce the file size further via a common "dictionary", instead of having a file-specific dictionary in the header. On the one hand this means very small files can be compressed more efficiently; however, those files are typically of a trivial size to begin with. Larger files could potentially be made a few bytes smaller, but the essentially trivial gains would be offset by the fact that you need a standard "dictionary" for every language, and in fact, to remain remotely efficient, for every general topic (the file header would specify which dictionary was being used). You would likely have to download updates for your decompressor just to keep up to date.

Where this kind of thing could be used is something like an IM client or in-game chat, where text message packets are large in comparison to data packets, but are still too small to fit a full compression header. Of course, you could also use an adaptive Huffman/Lempel-Ziv to the same or better effect.
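The shared-dictionary variant this reply describes actually exists in zlib as a "preset dictionary": both sides agree on the dictionary out of band, so short messages compress well without shipping any dictionary data. A sketch, with the dictionary contents an invented assumption:

```python
# zlib preset-dictionary compression for short chat-style messages.
import zlib

# Agreed out of band by both ends; the contents here are illustrative.
PRESET = b"went to the market and bought attack heal defend says hello"

def compress(msg):
    c = zlib.compressobj(zdict=PRESET)
    return c.compress(msg) + c.flush()

def decompress(blob):
    d = zlib.decompressobj(zdict=PRESET)
    return d.decompress(blob) + d.flush()

msg = b"Dave went to the market and bought a sheep."
packed = compress(msg)
assert decompress(packed) == msg
```

Because long runs of the message already appear in the preset dictionary, they compress to short back-references, which is exactly the win the word-pair list was after.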
