Sign in to follow this  
implicit

Actually Revealing Arithmetic Coding

Recommended Posts

As a bit of a holiday project I've decided to familiarize myself with arithmetic coding, yet despite the popularity of the technique I'm having a bit of trouble finding decent introductory material on the subject. Oh, there's plenty of tutorials and papers out there and by and large they cover the basic coding principles admirably, but there seems to be a decided lack of coverage of the actual implementation details (for a typical example, see this paper purporting to be a comprehensive guide to arithmetic coding.) Perhaps it's just me but it's far from intuitively obvious that a limited-precision implementation would even work reliably in the first place, certainly how to best rearrange the arithmetic and round the numbers to maximize precision while avoiding error is anything but obvious. Hell, I have trouble even figuring out the minimum number of bits to flush at the end on my own. I suppose I could just copy one of the ready-made implementations, but then I wouldn't learn anything, and besides I can think of a dozen possible ways to improve precision and performance in special-cases yet I lack any real idea of how to properly analyze these changes to convince myself that I didn't miss any edge cases. Are there any readable resources out there covering all of the minutiae or is expecting to find any such material in the first place merely a symptom of my Internet-age refusal to think for myself combined with my intrinsic aversion to formal proofs?

Share this post


Link to post
Share on other sites
I remember going through the exact same thing when studying arithmetic coding. The theory is simple enough, but going from theory to code with the range encoding was a bit tough to figure out since nothing really described that part well when it came to implementation. My suggestion is to just check out a very basic, unoptimized, fully-functional implementation of arithmetic coding in a language you are comfortable with and see how they do it. It might work for you, or it might not - everyone learns differently.

Share this post


Link to post
Share on other sites
Quote:
Original post by vaneger
wikipedia has a solid entry on range and arithmetic coding. There is some sample code there as well.
Thank you. The wikipedia article is solid as usual however at the end of the day it's still an encyclopedic article, and so it lacks the depth I'm looking for (it also makes odd claims like "In frequency concept many computational challenges are avoided" without much in the way of motivation.)

Quote:
Original post by Spodi
I remember going through the exact same thing when studying arithmetic coding. The theory is simple enough, but going from theory to code with the range encoding was a bit tough to figure out since nothing really described that part well when it came to implementation. My suggestion is to just check out a very basic, unoptimized, fully-functional implementation of arithmetic coding in a language you are comfortable with and see how they do it. It might work for you, or it might not - everyone learns differently.
I believe that I've mostly understood those algorithms, the bit I'm having trouble deciphering is the underlying motivation for why the code rounds the way it does and in general is written quite the way it is.
To put it bluntly my own implementation keeps failing in hard-to-find edge cases, so after half a dozen revisions intended to "just make things work" I'm having a hard time convincing myself that I have all of my I's dotted and t's crossed.

As an example it looks like people go to a ridiculous amount of trouble trying to deal with ranges straddling the halfway point, yet there is an "obvious" way of rearranging the arithmetic to increase precision and simplifying things in the encoder while avoid this mess entirely in the decoder. Now this means that either everyone is being stupid and I've just invented an awesome new trick, or proper rounding is a lot more subtle than I've been led to believe..

Quote:
Original post by nmi
I found this article very good to read:
http://dogma.net/markn/articles/arith/part1.htm
Thank you, I'll check it out.

Share this post


Link to post
Share on other sites
Well, it took me half of an eternity but I've finally caught on to the fact that using high precision and applying proper rounding was never all that important in the first place. I suppose what threw me was the fudge factors used in some coder sources to maintain an inclusive upper bound, so I had been trying to carefully and conservatively round the lower bound up and the upper bound down while keeping things separate during renormalization.

However a far less finicky way of doing things, and what I finally figured out that the real coders were really doing, is to use an exclusive upper bound and make sure it is recalculated using the exact same arithmetic which would have been used had it instead been the lower bound of the next higher character. A small bit of additional care needs to be taken to always round the boundaries down, since there's no guarantee that our imaginary next character will get renormalized in lock-step with our particular range, but that's about all there is to it.

I still haven't figured out what the deal is with seemingly everyone deferring renormalization of ranges straddling the middle yet. Summing up the lower bound of the characters into a separate accumulator and carrying back into the output buffer when it overflows appears to work just fine without complicating the decoder or losing one bit of precision.

In the unlikely event that someone is curious I have decided to submit the coder itself. It's still somewhere between one and two orders of magnitude slower than a fast Huffman coder and there's no adaptive modeling going on (although I'm dying to get to grips with those nifty Fenwick trees) but it's a start.
enum { AC_ALPHABET = 0x100 };
typedef uint8_t ac_char_t;

uint8_t *
ac_encode(uint8_t *dst, const ac_char_t *src, const uint32_t *model, size_t len)
{
size_t token;
uint8_t *carry;
uint32_t limit;
uint64_t accum;
uint32_t lower;
uint32_t range = 0x00000000;
uint32_t datum = 0x00000000;

while(len--)
{
token = *src++;

// Narrow down the range according to the chosen character
limit = model[token + 0];
accum = (uint64_t) limit * range;
lower = limit - (uint32_t) (accum >> 32);

limit = model[token + 1] - limit;
accum = (uint64_t) limit * range + (uint32_t) accum;
range = (uint32_t) (accum >> 32) - limit;

// Allow carries back onto the output buffer
if(datum += lower, datum < lower)
{
carry = dst;
while(!++*--carry);
}

// Renormalize range and write one byte at a time
while(range >= 0xFF000000)
{
*dst++ = (uint8_t) (datum >> 24);

datum <<= 8;
range <<= 8;
}
}

// Flush significant bytes in the output buffer. This almost certainly
// isn't optimal, and to be honest I'm not particularily confident
// that it it works reliably in the first place
lower = 0x80000000 | range >> 1;

if(datum >= lower)
{
carry = dst;
while(!++*--carry);
}

datum -= lower;
range -= lower;

for(;;)
{
*dst++ = (uint8_t) (datum >> 24);

if(range < 0xFF000000)
break;

datum <<= 8;
range <<= 8;
}

return dst;
}

const uint8_t *
ac_decode(ac_char_t *dst, const uint8_t *src, const uint32_t *model, size_t len)
{
size_t min;
size_t mid;
size_t max;
uint32_t limit;
uint32_t lower;
uint32_t upper;
uint32_t range = 0xFFFFFFFF;
uint32_t datum = 0x00000000;

while(len--)
{
// Renormalize range and read one byte at a time
while(range >= 0xFF000000)
{
datum = datum << 8 | *src++;
range <<= 8;
}

// Use a binary search to find the corresponding symbol
min = 0;
mid = AC_ALPHABET >> 1;
max = AC_ALPHABET;

lower = 0;
upper = -(int32_t) range;

do
{
// Compare fractions through multiplication
limit = model[mid];
limit -= (uint32_t) ((uint64_t) range * limit >> 32);

if(datum < limit)
max = mid, upper = limit;
else
min = mid, lower = limit;

mid = (min + max) >> 1;
}
while(mid != min);

// Write the results. Note that we get the recalculated ranges as a
// byproduct from the search
*dst++ = (ac_char_t) mid;
datum -= lower;
range = lower - upper;
}

return src;
}

void
ac_model(const ac_char_t *src, uint32_t *model, size_t len)
{
size_t i;
uint32_t fudge;
uint32_t scale;
uint32_t step;

// Count the uses of each character
model = memset(model, 0, (AC_ALPHABET + 1) * sizeof *model);

for(i = len; i; --i)
++model[*src++];

// Avoid zero weights by reserving the minimum amount of space needed to
// uniquely identify a character out of the reciprocal frequency space
fudge = 0;

for(i = AC_ALPHABET; i; --i)
if(!*model++)
fudge += 0x100;

// Assign weights inversely proportional to the characters' frequencies
scale = (uint32_t) ((0x100000000 - fudge) / len);

for(i = AC_ALPHABET; i; --i)
{
step = *--model;
len -= step;

if(!step)
fudge -= 0x100;

*model = fudge + len * scale;
}
}
Please note that the majority of the weirder expressions in the code stem from wanting to squeeze the limit of 2^32 into a 32-bit variable, hence the available range stored in negated form and the upper limit in the encoder is recalculated based on the symbols' ranges rather than directly on their own upper limits. Using 64-bit arithmetic throughout rather than merely relying on the existence of 32-bit multiplication instructions with 64-bit products would simplify things greatly.

(For some bizarre reason the board insists on blanking out my plus signs and the workaround of writing &#43 only works within [code] tags and not [source] boxes. Whatever.)

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this