How to choose symbol size for a general-purpose compression method?


I am tasked with implementing three different compression methods: arithmetic coding, PPM and LZW. I then have to compare their performance on different data (image, audio, text, binary executables).

If the compression program receives a file on input and does not know what the contents are, how do I choose the appropriate symbol size? E.g., if the file is text, a single character could be 8 or 16 bits long, and setting the symbol size to 8 or 16 bits could have a large impact on the compression ratio.

Do I simply try various reasonable sizes (various multiples of one byte) and see which one fits the data best?


Typical lossless compression works at the byte level, so the symbol size is always 8 bits. It also doesn't care about the file content, which means it can be hard to be as efficient as a compression method that is content-aware. The benefit, however, is that it will work on any arbitrary file, and in theory it isn't necessary to understand the data to obtain optimal compression anyway. (In practice, of course, knowing the content can help you choose a better algorithm, or dictionary size, or whatever, that gives you an advantage.)
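For what it's worth, here is a minimal sketch of what "byte level" means in practice: read the file as a stream of 8-bit symbols and build the frequency table that an order-0 arithmetic coder would start from. (Python; "input.bin" is a placeholder path.)

```python
from collections import Counter

# Treat the file as a stream of 8-bit symbols, whatever its content.
# This frequency table is the order-0 model an arithmetic coder starts from.
# "input.bin" is a placeholder path.
with open("input.bin", "rb") as f:
    data = f.read()

freq = Counter(data)              # keys are byte values 0..255
total = len(data)
for symbol, count in freq.most_common(10):
    print(f"byte {symbol:3d}: p = {count / total:.4f}")
```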

It's also worth considering that text these days rarely uses a fixed 16-bit encoding. It's usually either a fixed 8-bit encoding (e.g. an ISO-8859 variant) or a variable-length encoding such as UTF-8.

I don't understand why you need a single appropriate symbol size, tbh. True, you need some value to make the algorithm work, but why do you have to make that decision? Isn't a user of the algorithms in a much better position to make this choice?

Obviously, by delaying that choice until usage, you get the same problem back as part of the performance comparison. However, depending on what notion of performance you use, you can pick different values: e.g. compression against memory use, against symbol size, as fast as possible, as small a result file as possible, etc.

I am tasked with implementing three different compression methods: arithmetic coding, PPM and LZW. I then have to compare their performance on different data (image, audio, text, binary executables).

This sounds like a CS homework problem.

* Arithmetic coding is still important today. It is usually considered a starting point, not an ending point.

* PPM is going to give you some interesting results for images. For grins and giggles, try that with an image from a new 50 megapixel digital camera.

* LZW is 35 years old, and was constrained by hardware of the era. It is useful to study since there are many compressors today based on Lempel & Ziv's work in the 1970s, generally with technical names starting LZxxxx. Be aware you are studying a historical version. It has a few properties that are fun to study, like the result of re-running the output through the compressor again. Repeat a few iterations, checking the result size until you've learned a few things; a minimal sketch of that experiment follows below.
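Since the thread is about experiments anyway, here is a minimal LZW compressor sketch in Python for that re-running experiment. It is deliberately naive: codes are kept as plain integers (serialized 4 bytes each rather than bit-packed) and the dictionary grows without bound, which a real implementation would cap or reset. "input.bin" is a placeholder path.

```python
def lzw_compress(data: bytes) -> list:
    """Minimal LZW: returns a list of integer codes. No bit-packing and an
    unbounded dictionary -- fine for experiments, not for production."""
    table = {bytes([i]): i for i in range(256)}   # all single 8-bit symbols
    next_code = 256
    w = b""
    out = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            table[wc] = next_code                 # grow the dictionary
            next_code += 1
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

# Recompression experiment: serialize the codes naively and feed the result
# back through the compressor a few times, watching the size.
data = open("input.bin", "rb").read()             # placeholder path
for pass_no in range(1, 4):
    codes = lzw_compress(data)
    data = b"".join(c.to_bytes(4, "big") for c in codes)
    print(f"pass {pass_no}: {len(codes)} codes, serialized to {len(data)} bytes")
```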

how do I choose the appropriate symbol size? E.g., if the file is text, a single character could be 8 or 16 bits long, and setting the symbol size to 8 or 16 bits could have a large impact on the compression ratio. Do I simply try various reasonable sizes (various multiples of one byte) and see which one fits the data best?

This is one of the better questions for you to answer, and a reason there are so many LZxxx algorithms.

There is no universal "best". Today there are algorithms with dynamic tables, and algorithms that model everything as Markov chains and then compress the statistically-chained data. There are also versions designed specifically for various types of data.

For your project, you might try both 8-bit and 16-bit symbols and compare the results. If you happen to know more about the data file, you might also try 32-bit and 64-bit, since some specific types of data files will be more or less friendly to those widths.
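One way to try the different widths without writing a full coder for each is to compare zeroth-order entropy estimates, which bound what a memoryless coder can achieve at each symbol size. A sketch under that assumption ("input.bin" is a placeholder):

```python
import math
from collections import Counter

def order0_entropy(data: bytes, symbol_bytes: int) -> float:
    """Zeroth-order entropy in bits per symbol when the file is cut into
    fixed-width symbols (the trailing partial symbol is dropped)."""
    symbols = Counter(data[i:i + symbol_bytes]
                      for i in range(0, len(data) - symbol_bytes + 1, symbol_bytes))
    n = sum(symbols.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in symbols.values())

data = open("input.bin", "rb").read()     # placeholder path
for width in (1, 2, 4, 8):                # 8-, 16-, 32-, 64-bit symbols
    h = order0_entropy(data, width)
    # normalize to bits per input byte so different widths are comparable
    print(f"{8 * width:2d}-bit symbols: {h / width:.3f} bits/byte")
```

Wider symbols usually show fewer bits per byte because each symbol soaks up context, but the table of codes you have to store or transmit grows with the width, which is exactly the trade-off being discussed.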

Good luck on your comparison.

Thank you all for the thoughtful comments, they've been very helpful.

@Alberth: For practical reasons it would probably be best to let the user decide as you say, I just wanted to know what some reasonable values are.

@frob: It is indeed a CS homework problem :D. If I use a single byte, there is an upper limit of 2^8 = 256 codes that I would require at most. Theoretically, if I use 64 bits as the symbol size, I would have to (in the worst case) assign 2^64 codes. I guess this is not really a problem if I only assign codes to the 64-bit strings that actually occur in the data (the cardinality of that subset of all possible 64-bit strings will be much smaller). Am I correct in that?
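As a quick sanity check of that reasoning (a Python sketch; "input.bin" is a placeholder): the number of distinct 64-bit symbols that can occur is bounded by the file size divided by 8, nowhere near 2^64.

```python
from collections import Counter

data = open("input.bin", "rb").read()     # placeholder path
blocks = Counter(data[i:i + 8] for i in range(0, len(data) - 7, 8))

print(f"{len(blocks)} distinct 64-bit symbols out of {sum(blocks.values())} total")

# Codes only need to exist for the symbols that actually occur, e.g.:
codebook = {block: code for code, block in enumerate(blocks)}
```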

the cardinality of that subset of all possible 64-bit strings will be much smaller

One reason newer compression algorithms are able to work at all is because computers have far more memory.

In the worst case you have an enormous number of dictionary entries. But if there are frequent patterns at that particular size, it becomes much better. For example, LZMA2 works best with enormous dictionaries, potentially measured in gigabytes, because that allows much longer patterns to take part in the dictionary matching.
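If you want to poke at that effect yourself, Python's standard lzma module exposes the LZMA2 dictionary size. A sketch ("input.bin" is a placeholder; you will only see a difference on files with long-range repetition larger than the smaller dictionaries):

```python
import lzma

data = open("input.bin", "rb").read()           # placeholder path
for dict_size in (1 << 16, 1 << 20, 1 << 24):   # 64 KiB, 1 MiB, 16 MiB
    packed = lzma.compress(
        data,
        format=lzma.FORMAT_XZ,
        filters=[{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}],
    )
    print(f"dict_size {dict_size >> 10:6d} KiB -> {len(packed)} bytes")
```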

That's what I meant when I said it was something you should play with. LZMA was cobbled together without any formal research: Igor Pavlov was just looking at some compression options and threw something together that mixed LZ77, arithmetic coding, and Markov chains, the same things you are probably studying in your class. When it worked well, he shared it with the world inside his 7-Zip compression tool.

@Alberth: For practical reasons it would probably be best to let the user decide as you say, I just wanted to know what some reasonable values are.

That's easy! Build the tool, run performance tests :D and decide based on the results!

(In all fairness, as the author of a tool, it's probably a good idea to give an initial suggestion for such parameter values to the user, perhaps augmented with an informal explanation of what happens when you change the parameters, or of what works better for certain types of files. A user cannot reasonably be expected to understand how to decide such things without a little guidance.)
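For what those performance tests could look like, here is a tiny harness sketch (Python; `my_lzw` and friends are hypothetical placeholders for whichever coders you implemented):

```python
import time

def benchmark(name: str, compress_fn, data: bytes) -> None:
    """Report compression ratio and wall-clock time for one coder."""
    t0 = time.perf_counter()
    packed = compress_fn(data)
    elapsed = time.perf_counter() - t0
    print(f"{name:10s} ratio {len(data) / len(packed):5.2f}   {elapsed:.3f} s")

# Usage sketch, assuming each coder takes bytes and returns bytes:
#   data = open("corpus/book.txt", "rb").read()
#   benchmark("LZW", my_lzw, data)
#   benchmark("PPM", my_ppm, data)
#   benchmark("arith", my_arith, data)
```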

All right, thanks again for the input :).

