Delta compressing two byte buffers

Started by oliii
7 comments, last by Zahlman 14 years, 10 months ago
The problem is pretty simple. I have two buffers: buffer a(bytes[], len) and buffer b(bytes[], len). I want to compute the difference between those two buffers, buffer c, and transmit the change (or record it to file, or whatever). Basically, lossless delta-compression:

on the compression side, c = b - a
on the decompression side, b = a + c

The changes between a and b are supposed to be relatively minimal (a and b are basically incremental object serialisations), and a.len would most likely be the same as b.len (although I'd like to support the case when this is not true). Ideally, you'd want to pack together the parts of the object that change the most, and pack together the parts that change the least.

So far, I'm using a run-length-encoder algorithm, encoding alternating runs of changed / unchanged bytes (actually, x number of contiguous bits), as well as the changes themselves of course, plus other small optimisations (notably, the resolution of the VBLE encoding for the run lengths, skipping small unchanged runs, and not compressing at all if the 'compressed' result is bigger than just transmitting b entirely).

I was wondering if there are other, better algorithms to record arbitrary binary differences and compress them in some way. Note that the buffers are in general pretty small (the largest would probably be < 10K, the smallest could be only one bit), definitely not file-sized. This is in the context of network games, or recording replays.
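For the equal-length case, the alternating-runs scheme described above can be sketched roughly like this. This is a minimal illustration, not the poster's actual code: the function names are made up, it works on bytes rather than bits, and run lengths are simply capped at 255 instead of using VBLE:

```python
def delta_encode(a: bytes, b: bytes) -> bytes:
    """Encode b relative to a as alternating run lengths of
    unchanged / changed bytes, with the new bytes appended after
    each changed-run length. Assumes len(a) == len(b)."""
    assert len(a) == len(b)
    diff = bytes(x ^ y for x, y in zip(a, b))
    out = bytearray()
    i = 0
    changed = False  # the stream always starts with an 'unchanged' run
    while i < len(diff):
        run = 0
        start = i
        # Count bytes of the current run type, capped at one length byte.
        while i < len(diff) and (diff[i] != 0) == changed and run < 255:
            run += 1
            i += 1
        out.append(run)
        if changed:
            out.extend(b[start:i])  # transmit the new bytes for changed runs
        changed = not changed
    return bytes(out)

def delta_decode(a: bytes, c: bytes) -> bytes:
    """Rebuild b from the old buffer a and the delta stream c."""
    out = bytearray()
    i = 0       # position in c
    pos = 0     # position in a
    changed = False
    while pos < len(a):
        run = c[i]; i += 1
        if changed:
            out.extend(c[i:i + run]); i += run  # take new bytes from c
        else:
            out.extend(a[pos:pos + run])        # copy unchanged bytes from a
        pos += run
        changed = not changed
    return bytes(out)
```

When the two buffers are identical, the delta collapses to a couple of length bytes, which is the "nothing changed, almost nothing to transmit" case described later in the thread.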

Everything is better with Metal.

I have never heard delta compression explained that way. I have always known it as: a and b are ints; b - a is usually less than 255, so we make c into a byte and transmit that. When b - a is larger than 255, you add a signal byte and extend c to a short (note you are now at three bytes instead of the original 4). If b - a is larger than a short, you use a different signal byte and send the value directly (5 bytes to transmit a 4-byte int). It typically isn't amenable to RLE at the bit level, since you have already eliminated the dead space, though RLE may help at the byte level if you have a large number of nominal changes (0, 1, 2) in a row.
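The escape-byte scheme described here might look something like the following sketch. The escape values and function names are made up for illustration; a real protocol would pick its own signalling:

```python
# Escape values are arbitrary choices for this sketch.
ESC_SHORT = 0xFE   # next 2 bytes are a signed 16-bit delta
ESC_INT = 0xFF     # next 4 bytes are the full signed 32-bit value

def encode_delta(a: int, b: int) -> bytes:
    """Encode b relative to a: 1 byte for small deltas, an escape
    byte plus a 2-byte delta for medium ones, or an escape byte
    plus the full 4-byte value otherwise."""
    d = b - a
    if 0 <= d < ESC_SHORT:                 # small positive delta: 1 byte
        return bytes([d])
    if -0x8000 <= d <= 0x7FFF:             # medium delta: 3 bytes
        return bytes([ESC_SHORT]) + d.to_bytes(2, 'little', signed=True)
    # Delta too large to help: send the whole value (5 bytes).
    return bytes([ESC_INT]) + b.to_bytes(4, 'little', signed=True)

def decode_delta(a: int, data: bytes) -> int:
    tag = data[0]
    if tag == ESC_INT:
        return int.from_bytes(data[1:5], 'little', signed=True)
    if tag == ESC_SHORT:
        return a + int.from_bytes(data[1:3], 'little', signed=True)
    return a + tag
```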

For general lossless encoding, you might try gzip or 7zip; both have source available if this is a learning exercise. If not, just use them directly. The best compression methods are data-aware and would require knowing more about the context of what we are transmitting.
Quote:
The best compression methods are data-aware and would require knowing more about the context of what we are transmitting.


Yeah, I'm trying to remove that layer and work on raw data, as an exercise. I would think this could lead to pretty good compression without prior knowledge of what is actually inside the buffer, or forcing assumptions. In networked games, serialisations of object states are usually fixed in length and content structure, unless you are willing to pass a 'context' (a hash value of a template structure, something like that).

Quote:
I have always known it as: a and b are ints; b - a is usually less than 255, so we make c into a byte and transmit that.


I was speaking in terms of abstract objects: implementing operators '-' and '+' on buffer objects and a delta stream object, if you will.

Typically (from experience), when I serialise an object into a stream / buffer, the changes from one state to another are minimal. That leads to long runs of unchanged bits, followed by smaller runs of modified bits, which need to be transmitted. If there is no change, there is nothing to transmit.

Everything is better with Metal.

I think what you're doing sounds about right. I'd probably be tempted to just XOR the arrays and RLE encode them in some fashion.

Beyond that, I think your biggest bang for the buck is going to come in terms of not sending data - figuring out how often you need to send data, or creating a "low priority" struct for half of the data, or throttling the number of updates. Or, even better, figuring out what data absolutely doesn't need to be sent over the wire, ever.

Going crazy on how much you can compress binary data is probably overkill. Go with something reasonable, and if you have problems sending too much data, start by sending less data, and only then getting exotic with the compression schemes.

If you're compressing that much data (~10K buffers) and your RLE delta buffer ends up anywhere near ~1K, it might be worthwhile to use a standard zip or LZW scheme.

For anything less, I don't know if you can squeeze any compression out of them, given that both schemes have their own overhead.

Some people collate data which increases compression efficiency but also increased latency.

Good Luck!

-ddn
For data where zero bytes are common but don't necessarily come in long runs (or have other positional patterning), Huffman-style compression is indicated. Especially if there is further unevenness in the distribution of byte values (e.g. 60% 0x00, 30% 0x01, 5% 0x02, 3% 0x03, 2% others).
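For a distribution like that, the Huffman code lengths come out very uneven, which is where the savings come from. A minimal sketch that computes just the code lengths from byte frequencies (illustrative only, not a full encoder; names are made up):

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Return the Huffman code length (in bits) for each byte value,
    built from the byte frequencies of `data`."""
    freq = Counter(data)
    if len(freq) == 1:                  # degenerate: a single symbol
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tiebreaker, {symbol: code_length_so_far}).
    # The tiebreaker keeps the dicts from ever being compared.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, l1 = heapq.heappop(heap)
        w2, _, l2 = heapq.heappop(heap)
        # Merging two subtrees deepens every symbol in them by one bit.
        merged = {s: n + 1 for s, n in {**l1, **l2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]
```

With the 60/30/5/3/2 split above, the most common byte gets a 1-bit code and the rarest get 4 bits, versus a flat 8 bits each uncompressed.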

Delta coding is not compression in and of itself. It's simply an attempt to transform data so that it will be more easily compressed.

Usually one sees delta coding applied successively to bytes within an individual stream. This sort of thing is useful for things like simple audio data, where an input file of a sine wave with random noise might turn into a delta-coded file of mostly small numbers (since without the noise, the values would be limited by the maximum slope of the sine wave, which is small compared to the total range of permitted values). Of course, it also works to apply a "diff" between two streams, as you are doing.

You might also consider XOR instead of subtraction as your "difference" operator.
I'm thinking of Huffman, but as a method to compress the final packets, after all the object states are delta-encoded. However, I'm not sure how much benefit I will gain, given that the delta encoding would scramble the data into a chaotic mess.

Everything is better with Metal.

Maybe you can apply BWT on it and make it easier to compress?

http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
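For illustration, a naive BWT sketch (O(n^2 log n), using a 0x00 sentinel, so it assumes the input contains no zero bytes; real implementations use suffix arrays and are far more efficient):

```python
def bwt(data: bytes) -> bytes:
    """Naive Burrows-Wheeler transform: sort all rotations of the
    sentinel-terminated input and take the last column."""
    s = data + b'\x00'
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(r[-1] for r in rotations)

def inverse_bwt(last: bytes) -> bytes:
    """Invert the transform by repeatedly prepending the last column
    and re-sorting, then picking the row that ends in the sentinel."""
    table = [b''] * len(last)
    for _ in range(len(last)):
        table = sorted(bytes([c]) + t for c, t in zip(last, table))
    row = next(t for t in table if t.endswith(b'\x00'))
    return row[:-1]
```

The transform tends to group equal bytes together (e.g. b'banana' becomes b'annb\x00aa'), which is what makes the output friendlier to RLE and entropy coding afterwards.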

-ddn
Quote:Original post by oliii
However, I'm not sure how much benefit I will gain, given that the delta encoding would scramble the data into a chaotic mess.


You could just, you know, try it. You have nothing to lose but implementation time, and you should be able to find a library that will help. :)

