data compression

Started by Crispy; 7 comments, last by Crispy 21 years, 6 months ago
Hi, the Huffman compression algorithm is cool, but it has certain "artefacts" when it comes to compressing certain kinds of data - namely, it sometimes compresses the source buffer into a buffer that is bigger than the original (like when the source buffer contains a lot of different-valued bytes with near-equal frequencies). I'm just starting out on the subject, so I was wondering if there are any neat tricks with which to normalize the compression results (such as selectively compressing data, compressing only nibble-sized chunks, etc - are these tricks worthwhile?). I'd appreciate some guidance at this point. Furthermore, I've read that the Huffman algorithm is one of the best (if not the best) in town. Can anyone hazard an educated guess on the subject? Additionally, what are other effective lossless compression methods/algorithms worth having a look at? Thanks, Crispy
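To put rough numbers on the worst case I mean: if all 256 byte values occur with roughly equal frequency, the Huffman tree comes out nearly balanced, so every code ends up about 8 bits long - the "compressed" data is the same size as the input, and once the code table has to be stored alongside it, the output is actually bigger. So I'm guessing "selectively compressing data" boils down to checking the distribution (or just the output size) first and storing a block raw whenever coding wouldn't win anything?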
"Literally, it means that Bob is everything you can think of, but not dead; i.e., Bob is a purple-spotted, yellow-striped bumblebee/dragon/pterodactyl hybrid with a voracious addiction to Twix candy bars, but not dead."- kSquared
The metric for how good compression can get is the Shannon limit. Read up on Shannon, who just passed away this last year.
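In case it helps, here's a rough, untested sketch (plain C++, byte frequencies only) of that limit for a buffer: the order-0 entropy H = -sum p_i * log2(p_i), in bits per byte. Multiply it by the buffer length and you get a ballpark for the best any symbol-by-symbol coder can do on that data.

#include <cmath>
#include <cstddef>

// Order-0 entropy of the buffer, in bits per byte.
// A result near 8.0 means the byte values are close to uniform,
// so a plain Huffman pass will not gain anything.
double entropyBitsPerByte(const unsigned char* buf, std::size_t len)
{
    if (len == 0) return 0.0;

    std::size_t count[256] = { 0 };
    for (std::size_t i = 0; i < len; ++i)
        ++count[buf[i]];

    double bits = 0.0;
    for (int s = 0; s < 256; ++s)
    {
        if (count[s] == 0) continue;
        double p = (double)count[s] / (double)len;
        bits -= p * std::log(p) / std::log(2.0);   // -p * log2(p)
    }
    return bits;
}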

Huffman is the best compression possible under the constraint that each input symbol maps to its own whole-bit output codeword. It's nothing to sneeze at, but it is a limitation.
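The core of building such a code is just repeatedly merging the two least frequent nodes. Here's a rough, untested sketch (plain C++; it only produces code lengths -- a real coder would also emit canonical codes and store the table for the decoder):

#include <cstddef>
#include <queue>
#include <vector>

struct Node
{
    std::size_t freq;
    int         symbol;   // -1 marks an internal node
    Node*       left;
    Node*       right;
};

struct ByFreq
{
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

Node* makeNode(std::size_t freq, int symbol, Node* left, Node* right)
{
    Node* n = new Node;
    n->freq = freq; n->symbol = symbol; n->left = left; n->right = right;
    return n;
}

// Walk the finished tree and record each leaf's depth as its code length.
void assignLengths(const Node* n, int depth, int* codeLen)
{
    if (n->symbol >= 0) { codeLen[n->symbol] = (depth > 0 ? depth : 1); return; }
    assignLengths(n->left,  depth + 1, codeLen);
    assignLengths(n->right, depth + 1, codeLen);
}

// freq and codeLen are both arrays of 256 entries.
void huffmanCodeLengths(const std::size_t* freq, int* codeLen)
{
    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (int s = 0; s < 256; ++s)
        if (freq[s] > 0)
            pq.push(makeNode(freq[s], s, 0, 0));

    while (pq.size() > 1)
    {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(makeNode(a->freq + b->freq, -1, a, b));   // merge the two least frequent
    }
    if (!pq.empty())
        assignLengths(pq.top(), 0, codeLen);
    // (the tree is leaked -- fine for a throwaway sketch)
}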

Arithmetic codes and codebook (Ziv-Lempel family) compression schemes are ways to represent groups of input symbols with groups of output symbols.
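A tiny made-up example of the arithmetic idea: with P(A)=0.7 and P(B)=0.3, give A the interval [0, 0.7) and B the interval [0.7, 1). Encoding "AB" narrows [0, 1) first to [0, 0.7) for the A, then to the top 30% of that, [0.49, 0.7), for the B; any number inside [0.49, 0.7) identifies the whole message. That's how the coder gets to spend fractions of a bit per symbol instead of whole codewords.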

For now, something you've sort of hinted at is truncated Huffman coding--symbol probability families can fall under a single Huffman "code", and can trigger a switch to a different coding. This is useful if you have a large set but, say, 1/10th of the symbols have a very small probability--just group them all under a single 'key' symbol, and when it's detected you switch to a different encoding method.

So, things to google for:
shannon
ziv lempel
arithmetic code
truncated huffman

Have fun--there's lots to learn in this field.
quote:
For now, something you've sort of hinted at is truncated Huffman coding--symbol probability families can fall under a single Huffman "code", and can trigger a switch to a different coding. This is useful if you have a large set but, say, 1/10th of the symbols have a very small probability--just group them all under a single 'key' symbol, and when it's detected you switch to a different encoding method.


I read the following about Truncated Huffman Coding:

http://www.cs.technion.ac.il/Labs/Isl/Project/Projects_done/VisionClasses/DIP/Lossless_Compression/node10.html

It sounds like you pick a threshold for Huffman-style encoding, and the remaining symbols you mark with a special code and just put them in literally or with another encoding scheme? The line I don't really understand from the above URL is:

quote:
The J-K less probable source symbols are assigned the Huffman code of that hypothetical symbol concatenated with natural binary code of length $\log_{2}(J-K)$


Too bad the forums don't have a TeX interpreter. Anyway, I don't see what they mean by the "natural binary code of length log2(J-K)". Why would the number of symbols not above a certain frequency threshold determine what goes after the dummy symbol?
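My best guess, with made-up numbers: say there are J = 96 source symbols and only the K = 32 most probable get real Huffman codes, plus one hypothetical "escape" symbol. The remaining J - K = 64 rare symbols all share the escape symbol's Huffman code, and to say which of the 64 it actually was you tack a plain fixed-length index onto the end -- and telling 64 things apart takes log2(64) = 6 bits. So the count of leftover symbols is exactly what fixes how many bits follow the dummy symbol... at least that's how I read it.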
See my sig for the zlib library; from their site there ought to be links to more information (and you can dl the free zlib too).


- Magmai Kai Holmlor

"Oh, like you''ve never written buggy code" - Lee

[Look for information | GDNet Start Here | GDNet Search Tool | GDNet FAQ | MSDN RTF[L] | SGI STL Docs | STFW | Asking Smart Questions ]

[Free C++ Libraries | Boost | ACE | Loki | MTL | Blitz++ | wxWindows| Spirit(xBNF)]
[Free C Libraries | zlib ]

- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara
Huffman encoding and arithmetic encoding are pure entropy encoders. That is, all they do is squeeze the redundancy out of a set of bytes, down toward its entropy. This uses only character frequency information, and leaves out any information based on the order of the characters. Of the two, Huffman is simpler, but arithmetic is slightly better as it can share bits between multiple characters.

Dictionary-based encoders, like all of the Lempel-Ziv algorithms, are a different type of encoding. They look for repeated byte sequences and replace them with an index into a dictionary. These algorithms look much more at byte order than at the frequency of individual bytes.
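A quick made-up example: in the string "to be or not to be", the second "to be" can be replaced by a back-reference like (distance 13, length 5) pointing at the first occurrence. That back-reference is the whole trick, and it pays off regardless of how the individual byte values are distributed.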

Typical compression algorithms tend to use two passes. The first is usually a dictionary system. The second is an entropy encoding pass, to further compress the results of the first.
Thanks guys. I'm certainly going to have a look at the arithmetic codes Stoffel mentioned (even sounds interesting). The fact is I need to compress sound and images, which traditionally have a lot of varied, high-entropy info stored in them, which renders my current compression completely useless in many if not most cases...

quote:Original post by Anonymous Poster
I read the following about Truncated Huffman Coding:


Yeah, I read that, too. Took me a while to understand what's written there, especially since a period is missing somewhere in the middle.

quote:Original post by Stoffel
Arithmetic codes and codebook (Ziv-Lempel family) ...


You mean Lempel-Ziv as in (win)zip and stuff?

quote:Original post by Magmai Kai Holmlor
... and you can dl the free zlib too ...


Now where's the fun in that?

Actually - before I set out to explore compression I expected it to be a lot more complicated. What I have in mind is strictly fixed-length, lossless compression algorithms. Hacking JPEG still looks way too far off...

Thanks again,
Crispy
"Literally, it means that Bob is everything you can think of, but not dead; i.e., Bob is a purple-spotted, yellow-striped bumblebee/dragon/pterodactyl hybrid with a voracious addiction to Twix candy bars, but not dead."- kSquared
quote:...fixed-length, lossless compression algorithms


If you can do that, I'll give you big money. It has to be either non-fixed-length or lossy. If you did both, then you could just keep on compressing stuff and compressing stuff until it was 1 byte.

(Speaking of which, I zipped a file (at maximum compression) and then zipped it again and it became smaller... it works because the header uses plain text for all the file names, and there were a heap of files in it)

Trying is the first step towards failure.
quote:Original post by ragonastick
If you can do that, I'll give you big money.


Oops. In that case, could you explain the meaning of fixed-length? I was under the impression that Huffman coding is fixed-length, and I'm pretty sure it is lossless...


quote:
(Speaking of which, I zipped a file (at maximum compression) and then zipped it again and it became smaller... it works because the header uses plain text for all the file names, and there were a heap of files in it)


Multiple levels of compression - why not if speed is not an issue... With large files this should work (and be efficient spacewise) since every level of compression creates a completely new buffer.

Crispy
"Literally, it means that Bob is everything you can think of, but not dead; i.e., Bob is a purple-spotted, yellow-striped bumblebee/dragon/pterodactyl hybrid with a voracious addiction to Twix candy bars, but not dead."- kSquared
You'll usually get the best compression by combining different algorithms. One good combination is to compress the data using LZW or LZC (adaptive dictionary) and then run the output codes through an adaptive Huffman filter, feeding it only the last 8 bits of each code and outputting the rest as-is.
Adaptive Huffman can be slow as hell, though. Compressing a 2 MB file with my first version took over two hours, but after I did some tweaking (many, many hash tables) I got it down to 6 seconds. Regarding the LZW algorithm, well, Unisys still holds the patent on it, but it will expire next year in June or July, I think.
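To make the LZW part concrete, here's a small made-up trace: encoding "ABABABA" with a dictionary that starts out holding only the single characters, you output A (and add AB as code 256), then B (add BA as 257), then 256 for the next "AB" (add ABA as 258), and finally 258 for the closing "ABA" -- four codes for seven characters, with the dictionary building itself from the data as it goes.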

link of the day: www.datacompression.info

