data compression


If information cannot be destroyed nor created, then what does data compression really mean?

[attached diagram: nws2mOY.png]

this should be the principal definition of data compression

what do you think?

The diagram does not illustrate data compression. What it illustrates is a channel that transmits bits and another channel that transmits more complicated codes (e.g., an alphabet of some kind that can be expressed in a 6-bit code). You get time compression, but you have not got information compression.

Now, suppose that 101010 is the only possible message. Then it is clearly compressible to one bit. Let's say we use 0 to mean no message and 1 to mean that message. So now we have a stream of 0s and 1s that represents some number of occurrences of 101010. You could also choose to use 1 to mean the message and 0 to mean no more transmission.

In Shannon's treatment of information, he is only speaking of lossless compression in the sense that the essential message can be reconstructed. The compressibility is a measure of how much redundancy there is.

Here's a practical case of some importance.

With a basic pseudo-random number generator, PRNG, there is some seed value and a cyclic procedure that, for each cycle, produces a pseudo-random value of some number of bits (let's say 1 bit for simplicity) and a new seed. There are no other inputs and the procedure itself is fixed. Clearly, the stream of bits produced by the PRNG is compressible. It is in fact compressible to the value of the seed, since given a seed value, one can completely produce the sequence that results from that initial seed. (We always assume the procedure itself is known.)
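To make that concrete, here is a minimal sketch in Python. The generator and its constants are arbitrary illustrative choices (a classic linear congruential recurrence), nothing cryptographic; the point is only that the whole stream is a deterministic function of the seed.

# Minimal illustrative PRNG (a linear congruential generator) that emits one bit per cycle.
# The constants are arbitrary textbook values; the point is that the whole output stream
# is a deterministic function of the initial seed.
def prng_bits(seed, n):
    state = seed
    bits = []
    for _ in range(n):
        state = (1103515245 * state + 12345) % (1 << 31)  # advance the internal state
        bits.append((state >> 16) & 1)                    # emit one pseudo-random bit
    return bits

# Knowing the procedure and the seed is enough to regenerate the stream exactly,
# so the stream "compresses" to the seed (plus a length).
assert prng_bits(seed=42, n=1000) == prng_bits(seed=42, n=1000)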

So, in principle, such a PRNG is not cryptographically secure, in the sense that given enough bits produced from the PRNG, one can discover the value of the seed and reliably predict all further bits. The PRNG sequence is therefore not truly random, because for it to be truly random the chance of predicting the next bit would have to stay at 50-50 forever. As a practical matter, it may take a relatively large number of bits before the next bits become predictable. In serious generation of pseudo-random values in cryptographic settings, there is a limit on how many bits are produced from an initial seed, and then a new seed is chosen by what is hopefully an unpredictable means. That can be rather more complicated and time-consuming than the running of the PRNG, and the security of that step is a big deal.

Finally, archiving compressors such as gzip and the methods used in Zip use a variety of techniques to inspect a message and determine, from frequencies of occurrence, how to recode the message in a new "alphabet", often of variable-length bit sequences, that gives a useful compression result even when the mapping to the new alphabet is also communicated for use by someone wanting to decompress the message. Here the procedure is fixed but the mapping is determined by the data and must be passed along with the data.
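A rough sketch of that frequency-based recoding idea (Huffman coding, which is one ingredient of Deflate; this is a simplified toy, not the real Deflate code construction):

import heapq
from collections import Counter

# Toy Huffman coder: count symbol frequencies, then repeatedly merge the two rarest
# groups so that frequent symbols end up with short bit strings and rare ones with long ones.
def huffman_code(message):
    heap = [[count, i, {sym: ""}] for i, (sym, count) in enumerate(Counter(message).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
    return heap[0][2]

text = "this is an example of frequency-based recoding"
codes = huffman_code(text)
bits = "".join(codes[ch] for ch in text)
# The code table (the "mapping") must be shipped along with the bits so the receiver can decode.
print(len(bits), "bits vs", 8 * len(text), "bits as plain bytes")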

The question about compressibility does not have an abstract answer. Compressibility is always about a given corpus of messages of some form.


There is a pseudo-scientific background to that which assumes a Newtonian universe and the theoretical ability to compute every state from the universe's previous (or future) state.

Seriously?

Yep, seriously. There are people, and not a few of them, who believe in that.

And then of course, apart from the obviously wrong Newtonian stuff, there's the slightly less obvious quantum-theory stuff about no-cloning, no-deleting, no-hiding. Which, if you ask me, even if true, is entirely irrelevant to "information" in the totally-not-quantum sense we talk about, and besides, in my opinion, is bollocks too.

But yeah, seriously. There are people who firmly believe you can't create or destroy information, and it's backed by some quasi-scientific... stuff.

Seems it's been a long time since they last burned a piece of paper or watched a person die.

there is nothing special or magical or philosophical about compression.

rather, take some data, and try to replace it with a more compact representation.

the main difference then is in the types (representations) of input data, and how it is represented in its compressed form.

for example, you can take text and replace any repeating strings with references to those strings.
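very roughly, something like this (a simplified LZ77-style sketch, not any particular format; the window size and minimum match length are just arbitrary small numbers):

# simplified LZ77-style sketch: where a previously seen substring repeats, emit a
# (distance, length) back-reference instead of the literal bytes.
def lz_compress(data, window=4096, min_match=3):
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data) and j + length < i
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append(("ref", best_dist, best_len))  # "copy best_len bytes from best_dist back"
            i += best_len
        else:
            out.append(("lit", data[i]))              # plain literal byte
            i += 1
    return out

def lz_decompress(tokens):
    out = bytearray()
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:
            _, dist, length = t
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)

data = b"repeating strings with references to those repeating strings"
assert lz_decompress(lz_compress(data)) == data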

or, you can take some audio/video and replace it with some approximation (this is a topic in itself), in the process losing the ability to fully recover the original signal (only an approximate form will be retained).

...

there is a tradeoff though in terms of the computational complexity required to encode/decode the various representations, and it's kind of a big game of trying to find the most "bang for the buck" (best compression at the fastest encoding and decoding speeds), sometimes with the balance shifted one way or the other (how critical raw speed is vs how much one can gain by utilizing "slow" algorithms).

for example, real-time video recording may have different requirements than offline / batch encoding, which in turn have different requirements from an archive file, ...

this may in turn affect the implementation even of the same format; for example, you might have something like a Deflate encoder which is designed differently depending on whether it tries to encode very quickly or whether it tries to get good compression.

nothing special here, it is more the difference between, say, using the first possible match (and giving up if no match is seen right then) versus wandering around looking for a better match, ...

then you have different formats designed to try to use different strategies, for example, Deflate will look for repeating strings and also use different numbers of bits depending on how common each character is.

or, JPEG will go and do a bunch of math to basically convert the image pixel-blocks into frequency-based approximations.

or, PNG will go and try to predict each pixel from adjacent pixels, store values for how far off it was for each pixel, and Deflate compress the results.
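the PNG idea in miniature (assuming an 8-bit grayscale image and only the "Sub" filter; real PNG picks between several filters per row):

import zlib

# predict each pixel from its left neighbour ("Sub" filter), keep only the prediction
# error, then Deflate the result. smooth image data turns into lots of small, repetitive
# differences, which Deflate compresses well.
def filter_and_deflate(rows):
    filtered = bytearray()
    for row in rows:
        prev = 0
        for px in row:
            filtered.append((px - prev) & 0xFF)  # difference to the pixel on the left
            prev = px
    return zlib.compress(bytes(filtered))

rows = [[min(x + y, 255) for x in range(64)] for y in range(64)]  # a smooth gradient
print(len(filter_and_deflate(rows)), "bytes vs", 64 * 64, "raw")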

another format might instead just store the pixel-blocks as pairs of colors and use 2 bits per pixel to select between each color, possibly itself followed by Deflate compression.
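something in the spirit of that last one (a DXT1-like scheme, heavily simplified down to grayscale; real block formats have exact interpolation and bit-packing rules):

# encode a 4x4 block (16 grayscale pixels) as two endpoint values plus a 2-bit index
# per pixel selecting one of four levels between the endpoints:
# 2 bytes of endpoints + 16 * 2 bits of indices = 6 bytes per block instead of 16.
def encode_block(block16):
    lo, hi = min(block16), max(block16)
    palette = [lo + (hi - lo) * k // 3 for k in range(4)]
    indices = [min(range(4), key=lambda k: abs(px - palette[k])) for px in block16]
    return lo, hi, indices

def decode_block(lo, hi, indices):
    palette = [lo + (hi - lo) * k // 3 for k in range(4)]
    return [palette[k] for k in indices]   # an approximation of the original block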

...

so, lots of stuff can be done...

I see compression as essentially just rearranging bits, or at least their meaning.

'A' is normally encoded by the bitstring '01000001' (ASCII, padded to an 8-bit byte). What if I choose to represent it by '01010101' instead? It's perfectly possible; I would just have to swap two things around. Is it useful? Not really. But what if I choose to replace the bitstring '01000001' with just '1'? I would need to rearrange a lot more to make it work (i.e. without introducing ambiguities), but it can be done without too much trouble. This way, I can encode data with a lot of A's much more efficiently: every A needs 8x fewer bits.

Compression is about remapping the meaning of bits so that a given piece of data turns out to be represented with fewer bits than before.

Of course, you could choose to encode your entire hard drive with a single bit; it is a valid remapping. The problem is that you need to store the remapping somewhere, and that would just amount to having a hash table that maps '1' to <your entire HD data>. So you lose instead of gain. This means that it is important for compression that the remapping itself can be described procedurally.

The part where one says 'you cannot just create information out of nothing' can be found in the remapping: in order to represent 'A' by '1', one has to make sacrifices elsewhere. In this example, representing A with one bit means every other character requires 9 bits. So from that standpoint there is no net gain in general. But since the remapping can be chosen to fit the data you actually want to reencode, compression is still possible.
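One concrete way to realize that remapping (a toy prefix code, assuming plain ASCII input): 'A' is sent as the single bit '1', every other byte as '0' followed by its 8 bits.

# Toy prefix code: 'A' -> '1', any other byte -> '0' + its 8 bits (9 bits total).
# No codeword is a prefix of another, so the bit stream decodes without ambiguity.
def encode(text):
    out = []
    for ch in text.encode("ascii"):
        out.append("1" if ch == ord("A") else "0" + format(ch, "08b"))
    return "".join(out)

print(len(encode("AAAAAAAAAA")))  # 10 bits instead of 80
print(len(encode("BANANA")))      # 30 bits instead of 48: the A's win more than the others lose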


Yep, seriously. There are people, and not a few of them, who believe in that.

Well, then I'm all for using this for compression. It will nicely show that it doesn't work!

But I would really like to know now, if this is what the OP meant!

Wow.


Data compression is a straightforward mathematical mapping.

You can find alternate encodings in an attempt to squeeze out redundancy. It usually gives a smaller file, but sometimes it doesn't; sometimes running a compression routine results in a larger file. There are many different algorithms out there; the most widely successful ones have the nicknames DEFLATE and COMPRESS, both of which have been used inside zip formats. More complex statistical analysis is done in formats like .7z, but the goal is still to squeeze the redundancy out of an arbitrary input stream.
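A quick way to see both outcomes, using Python's zlib (a DEFLATE implementation):

import os
import zlib

# Redundant input shrinks dramatically; input that is already random actually grows a
# little, because the compressed container still needs headers and block overhead.
redundant = b"abc" * 10000          # 30000 highly repetitive bytes
random_ish = os.urandom(30000)      # 30000 bytes with no exploitable structure

print(len(zlib.compress(redundant)))   # a tiny fraction of 30000
print(len(zlib.compress(random_ish)))  # slightly more than 30000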

You can remove data and replace it with similar, but smaller, data. For example, you can take a 24bpp image and reduce it to 8bpp. You can take a 192kHz audio file and resample it at 44kHz or 22kHz or even less. You can save your large image as a heavily compressed JPEG with lots of compression artifacts, or resize it. These are domain-specific algorithms, and data is lost in the process.
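For instance, one crude way to get from 24bpp down to 8bpp is 3-3-2 quantization (palettization is the more common approach, but this keeps the sketch short):

# Pack a 24-bit RGB pixel into one byte: 3 bits of red, 3 of green, 2 of blue.
# The discarded low bits are gone for good; decoding can only approximate them.
def rgb888_to_rgb332(r, g, b):
    return (r & 0xE0) | ((g & 0xE0) >> 3) | (b >> 6)

def rgb332_to_rgb888(p):
    return (p & 0xE0, (p & 0x1C) << 3, (p & 0x03) << 6)

print(rgb332_to_rgb888(rgb888_to_rgb332(200, 100, 50)))  # (192, 96, 0): close, but not the original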


There is no "metaphysical destruction of information" going on. You can have a 50x50 low quality thumbnail generated from a 21 megapixel photograph, it doesn't involve parallel universes, deities, or invisible trans-physical dimensions. The fact that you can't reverse the process -- turning a tiny thumbnail into a highly detailed high resolution photograph, also has nothing to do with eminent destruction of the universe because of data loss.

All the metaphysical stuff, that has nothing to do with the mathematics of data compression.

What you don't understand is that there is "information" and then there is "space". The two are not the same! If I write "AAAAAAAAAA" it takes 10 bytes of space with a typical ASCII encoding, but how much information is there really in this sequence of bytes? Way less than 10 bytes. What compression does is try to encode information into as little space as possible, possibly losing some information in the process if the gains in storage space are worth it.

There's no magical reconstruction of information that happens here: if we agree on the convention that a number preceding a character means "repeat this character n times", i.e. a compression format, then the strings "AAAAAAAAAA" and "10A" contain exactly the same information. Yet one takes less space than the other, and that's it - you have compression. Better compression algorithms work more intelligently to take advantage of structure in the input data - if there is no structure, no compression is possible (for instance, random or pseudorandom data cannot be compressed).
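That convention, made concrete (a toy run-length coder; it assumes the input contains no digits, which a real format would handle with escaping):

import re

# Every run of a repeated character becomes "<count><character>".
def rle_encode(s):
    return "".join(f"{len(m.group(0))}{m.group(1)}" for m in re.finditer(r"(.)\1*", s))

def rle_decode(s):
    return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", s))

print(rle_encode("AAAAAAAAAA"))   # "10A": 3 characters instead of 10, same information
print(rle_decode("10A"))          # "AAAAAAAAAA"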

The concept of information has been around for a while and was formalized with the development of information theory, most notably by Shannon in the 1940s. I can't believe nobody has brought it up yet, but I suggest you read up on it; it's actually quite interesting and - in my opinion - fundamental knowledge that everyone should learn.


Actually a common way to compress a signal is to transform it from the time domain to the frequency domain, by a Fourier transform. Then you can throw away half of the transform, because for a real signal the positive and negative frequencies are redundant mirror images of each other: for a real signal S(t) with Fourier transform F, F(-f) is the complex conjugate of F(f) (note the change of variables from time t to frequency f). Something like this is used in quadrature amplitude modulation (QAM).


Doesn't the Fourier transform of the real signal contain both real and imaginary components (amplitude and phase) though? Then half of the (complex) transform is still the same size as the (real) input signal, so it's not compression per se as far as I can tell (otherwise you would be able to violate information theory by taking the FT of a uniformly random signal, and "compressing" it to half its size :D ). But yes there are schemes that compress in the frequency domain since when it comes to many types of data, in particular, sound and images, the frequency domain is much more predictable than the spatial domain and is more readily compressed. It's also way easier to quantize a signal without losing too much perceived accuracy (audio, video, ..) by working in the frequency domain rather than in the spatial domain. JPEG does this for instance, as well as MP3, though I don't know all the details.


The Fourier transform does not change the amount of information. For, say, a 1024 point FFT, you input 1024 complex values and you get 1024 complex values, so no change in information.

If you instead input 1024 real values, you still get 1024 complex values out, but 511 of them are redundant due to complex conjugate symmetry. That leaves us with 513 values, of which the first and the last have an imaginary part of zero, so together they represent the same amount of information as just one complex number. In the end, 1024 real values in give you 511 complex values and 2 real values out.

Exactly the same amount of data storage needed for both the input and the output. The amount of unique and redundant information is the same in both the input and the output, but the information is often transformed to a different representation that is more suitable for a later compression stage.
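You can check that bookkeeping with numpy's real FFT (np.fft.rfft stores only the non-redundant half):

import numpy as np

# 1024 real samples -> 513 frequency bins; the first (DC) and last (Nyquist) bins are
# purely real, so the output carries exactly 1024 real numbers, same as the input.
x = np.random.rand(1024)
X = np.fft.rfft(x)

print(len(X))                                    # 513
print(X[0].imag, X[-1].imag)                     # both zero: these two bins are purely real
print(np.allclose(np.fft.irfft(X, n=1024), x))   # True: nothing was lost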

