Actually Revealing Arithmetic Coding

Started by implicit; 4 comments, last by implicit 14 years, 4 months ago
As a bit of a holiday project I've decided to familiarize myself with arithmetic coding, yet despite the popularity of the technique I'm having trouble finding decent introductory material on the subject. Oh, there are plenty of tutorials and papers out there, and by and large they cover the basic coding principles admirably, but there seems to be a decided lack of coverage of the actual implementation details (for a typical example, see this paper purporting to be a comprehensive guide to arithmetic coding).

Perhaps it's just me, but it's far from intuitively obvious that a limited-precision implementation would even work reliably in the first place, and how best to rearrange the arithmetic and round the numbers to maximize precision while avoiding error is anything but obvious. Hell, I have trouble even figuring out the minimum number of bits to flush at the end on my own.

I suppose I could just copy one of the ready-made implementations, but then I wouldn't learn anything. Besides, I can think of a dozen possible ways to improve precision and performance in special cases, yet I lack any real idea of how to analyze these changes properly and convince myself that I haven't missed any edge cases. Are there any readable resources out there covering all of the minutiae, or is expecting to find such material in the first place merely a symptom of my Internet-age refusal to think for myself, combined with my intrinsic aversion to formal proofs?
Wikipedia has a solid entry on range and arithmetic coding. There is some sample code there as well.
I remember going through the exact same thing when studying arithmetic coding. The theory is simple enough, but going from theory to code with the range encoding was a bit tough to figure out since nothing really described that part well when it came to implementation. My suggestion is to just check out a very basic, unoptimized, fully-functional implementation of arithmetic coding in a language you are comfortable with and see how they do it. It might work for you, or it might not - everyone learns differently.
I found this article very good to read:
http://dogma.net/markn/articles/arith/part1.htm
Quote: Original post by vaneger
Wikipedia has a solid entry on range and arithmetic coding. There is some sample code there as well.
Thank you. The Wikipedia article is solid as usual; however, at the end of the day it's still an encyclopedia article, and so it lacks the depth I'm looking for (it also makes odd claims like "In frequency concept many computational challenges are avoided" without much in the way of motivation).

Quote: Original post by Spodi
I remember going through the exact same thing when studying arithmetic coding. The theory is simple enough, but going from theory to code with the range encoding was a bit tough to figure out since nothing really described that part well when it came to implementation. My suggestion is to just check out a very basic, unoptimized, fully-functional implementation of arithmetic coding in a language you are comfortable with and see how they do it. It might work for you, or it might not - everyone learns differently.
I believe I've mostly understood those algorithms; the bit I'm having trouble deciphering is the underlying motivation for why the code rounds the way it does, and in general why it is written quite the way it is.
To put it bluntly, my own implementation keeps failing in hard-to-find edge cases, so after half a dozen revisions intended to "just make things work" I'm having a hard time convincing myself that I have all of my i's dotted and t's crossed.

As an example, people seem to go to a ridiculous amount of trouble dealing with ranges straddling the halfway point, yet there is an "obvious" way of rearranging the arithmetic that increases precision and simplifies things in the encoder while avoiding this mess entirely in the decoder. Now this means that either everyone is being stupid and I've just invented an awesome new trick, or proper rounding is a lot more subtle than I've been led to believe...
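For anyone who hasn't waded through the usual sources, this is the dance I mean: the classic bit-at-a-time renormalization with deferred "pending" bits, paraphrased from memory of coders like the one in the article nmi linked. The names and the put_bit stub are mine rather than anybody's actual API, and the constants are the customary 2^n, 2^(n-1) and 2^(n-2) bounds on the working interval.

#include <stdio.h>

enum { BITS = 16, TOP = 1 << BITS, HALF = TOP >> 1, QUARTER = TOP >> 2 };

static unsigned pending; // bits deferred while the range straddled HALF

static void put_bit(int bit)
{
    putchar('0' + bit); // stand-in for a real bit packer
}

static void put_bit_with_pending(int bit)
{
    put_bit(bit);
    while(pending) // the deferred bits are the opposite of the decided one
    {
        put_bit(!bit);
        --pending;
    }
}

static void renormalize(unsigned *low, unsigned *high)
{
    for(;;)
    {
        if(*high < HALF) // entirely in the lower half, so the next bit is 0
            put_bit_with_pending(0);
        else if(*low >= HALF) // entirely in the upper half, so the next bit is 1
        {
            put_bit_with_pending(1);
            *low -= HALF;
            *high -= HALF;
        }
        else if(*low >= QUARTER && *high < HALF + QUARTER)
        {
            ++pending; // straddles HALF: defer the decision for later
            *low -= QUARTER;
            *high -= QUARTER;
        }
        else
            break; // the range is wide enough again
        *low = *low * 2;
        *high = *high * 2 + 1;
    }
}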

Quote: Original post by nmi
I found this article very good to read:
http://dogma.net/markn/articles/arith/part1.htm
Thank you, I'll check it out.
Well, it took me half an eternity, but I've finally caught on to the fact that using high precision and applying proper rounding was never all that important in the first place. I suppose what threw me was the fudge factors used in some coder sources to maintain an inclusive upper bound, so I had been trying to carefully and conservatively round the lower bound up and the upper bound down while keeping them separate during renormalization.

However, a far less finicky way of doing things, and what I finally figured out the real coders were doing, is to use an exclusive upper bound and make sure it is recalculated using the exact same arithmetic that would have been used had it instead been the lower bound of the next higher character. A small bit of additional care needs to be taken to always round the boundaries down, since there's no guarantee that our imaginary next character will get renormalized in lock-step with our particular range, but that's about all there is to it.
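In sketch form the idea looks something like this, with made-up names and without the negated-range contortions of the listing below: cum[] is a cumulative-frequency table scaled so that cum[c] counts everything below character c, which makes cum[c + 1] the exclusive upper bound.

#include <stdint.h>

// Both edges of a character's interval go through the exact same scaling
// expression, always rounding down, so adjacent characters can never
// overlap or leave a gap between them
static uint32_t edge(uint32_t range, uint32_t cum_c)
{
    return (uint32_t) ((uint64_t) range * cum_c >> 32);
}

static void narrow(uint32_t *low, uint32_t *range, const uint32_t *cum, unsigned c)
{
    uint32_t lo = edge(*range, cum[c]);
    uint32_t hi = edge(*range, cum[c + 1]); // identical arithmetic, next edge up

    *low += lo; // any carry out of *low has to ripple into the output
    *range = hi - lo; // exclusive upper bound, so no inclusive +1/-1 fudging
}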

I still haven't figured out what the deal is with seemingly everyone deferring renormalization of ranges straddling the middle. Summing the characters' lower bounds into a separate accumulator and carrying back into the output buffer when it overflows appears to work just fine, without complicating the decoder or losing a single bit of precision.
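In isolation the carry step amounts to this; it is the same thing the "Allow carries back onto the output buffer" part of the listing below does:

datum += lower;
if(datum < lower) // unsigned wraparound means a carry out of the accumulator
{
    uint8_t *carry = dst;
    while(!++*--carry); // ripple the +1 back past any 0xFF bytes already written
}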

In the unlikely event that someone is curious, I've decided to post the coder itself. It's still somewhere between one and two orders of magnitude slower than a fast Huffman coder and there's no adaptive modeling going on (although I'm dying to get to grips with those nifty Fenwick trees), but it's a start.
#include <stdint.h>
#include <string.h>

enum { AC_ALPHABET = 0x100 };

typedef uint8_t ac_char_t;

uint8_t *ac_encode(uint8_t *dst, const ac_char_t *src, const uint32_t *model, size_t len)
{
    size_t token;
    uint8_t *carry;
    uint32_t limit;
    uint64_t accum;
    uint32_t lower;
    uint32_t range = 0x00000000; // the range is stored negated: 0 stands for 2^32
    uint32_t datum = 0x00000000;

    while(len--)
    {
        token = *src++;

        // Narrow down the range according to the chosen character
        limit = model[token + 0];
        accum = (uint64_t) limit * range;
        lower = limit - (uint32_t) (accum >> 32);
        limit = model[token + 1] - limit;
        accum = (uint64_t) limit * range + (uint32_t) accum;
        range = (uint32_t) (accum >> 32) - limit;

        // Allow carries back onto the output buffer
        if(datum += lower, datum < lower)
        {
            carry = dst;
            while(!++*--carry);
        }

        // Renormalize the range and write one byte at a time
        while(range >= 0xFF000000)
        {
            *dst++ = (uint8_t) (datum >> 24);
            datum <<= 8;
            range <<= 8;
        }
    }

    // Flush the significant bytes in the output buffer. This almost certainly
    // isn't optimal, and to be honest I'm not particularly confident that it
    // works reliably in the first place
    lower = 0x80000000 | range >> 1;
    if(datum >= lower)
    {
        carry = dst;
        while(!++*--carry);
    }
    datum -= lower;
    range -= lower;
    for(;;)
    {
        *dst++ = (uint8_t) (datum >> 24);
        if(range < 0xFF000000)
            break;
        datum <<= 8;
        range <<= 8;
    }
    return dst;
}

const uint8_t *ac_decode(ac_char_t *dst, const uint8_t *src, const uint32_t *model, size_t len)
{
    size_t min;
    size_t mid;
    size_t max;
    uint32_t limit;
    uint32_t lower;
    uint32_t upper;
    uint32_t range = 0xFFFFFFFF;
    uint32_t datum = 0x00000000;

    while(len--)
    {
        // Renormalize the range and read one byte at a time
        while(range >= 0xFF000000)
        {
            datum = datum << 8 | *src++;
            range <<= 8;
        }

        // Use a binary search to find the corresponding symbol
        min = 0;
        mid = AC_ALPHABET >> 1;
        max = AC_ALPHABET;
        lower = 0;
        upper = -(int32_t) range;
        do
        {
            // Compare fractions through multiplication
            limit = model[mid];
            limit -= (uint32_t) ((uint64_t) range * limit >> 32);
            if(datum < limit)
                max = mid, upper = limit;
            else
                min = mid, lower = limit;
            mid = (min + max) >> 1;
        }
        while(mid != min);

        // Write the results. Note that we get the recalculated ranges as a
        // byproduct of the search
        *dst++ = (ac_char_t) mid;
        datum -= lower;
        range = lower - upper;
    }
    return src;
}

void
ac_model(const ac_char_t *src, uint32_t *model, size_t len)
{
    size_t i;
    uint32_t fudge;
    uint32_t scale;
    uint32_t step;

    // Count the uses of each character
    model = memset(model, 0, (AC_ALPHABET + 1) * sizeof *model);
    for(i = len; i; --i)
        ++model[*src++];

    // Avoid zero weights by reserving the minimum amount of space needed to
    // uniquely identify a character out of the reciprocal frequency space
    fudge = 0;
    for(i = AC_ALPHABET; i; --i)
        if(!*model++)
            fudge += 0x100;

    // Assign weights inversely proportional to the characters' frequencies
    scale = (uint32_t) ((0x100000000ull - fudge) / len);
    for(i = AC_ALPHABET; i; --i)
    {
        step = *--model;
        len -= step;
        if(!step)
            fudge -= 0x100;
        *model = fudge + len * scale;
    }
}
Please note that the majority of the weirder expressions in the code stem from wanting to squeeze the limit of 2^32 into a 32-bit variable; hence the available range is stored in negated form, and the upper limit in the encoder is recalculated from the symbols' ranges rather than directly from their own upper limits. Using 64-bit arithmetic throughout, rather than merely relying on the existence of 32-bit multiplication instructions with 64-bit products, would simplify things greatly.
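And for completeness, a round trip would look something like the following. The buffer sizes are just generous guesses, and I'm leaving a zeroed guard byte in front of the output since the carry loop can in principle walk back past the first written byte.

#include <assert.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    static const ac_char_t text[] = "abracadabra abracadabra abracadabra";
    size_t len = sizeof text - 1;

    uint32_t model[AC_ALPHABET + 1];
    uint8_t packed[64] = { 0 }; // zeroed so any trailing reads past the flush are harmless
    ac_char_t unpacked[64];

    // Build the static model from the data itself; the decoder has to be
    // handed the very same table out of band
    ac_model(text, model, len);

    uint8_t *end = ac_encode(packed + 1, text, model, len);
    printf("%zu -> %zu bytes\n", len, (size_t) (end - packed - 1));

    ac_decode(unpacked, packed + 1, model, len);
    assert(!memcmp(unpacked, text, len));
    return 0;
}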

(For some bizarre reason the board insists on blanking out my plus signs and the workaround of writing &#43 only works within [code] tags and not [source] boxes. Whatever.)

