DXTn + Arithmetic Coding = Promising...
Well, here is the status...
I was recently skimming around and looked at the VP8 and VP9 specs (used in WebM), and a detail stuck out:
the entropy coder (called the "BoolCoder" or similar...).
at first I was like, "hey, I have written something similar already...". then I looked more, realized what they were doing with it, and was like "oh... nifty!...". basically, they compress the image data first using good old Huffman-style coding, and then feed the resulting bits through the arithmetic coder. this sidesteps the relative slowness of the arithmetic coder by having it not deal with as much data, while still getting most of the compression gain they would have had using arithmetic or range coding directly...
I was then like, "well, you know what, I think I will dust off some of my old code and mess with it...".
anyways, I got similar results, with the arithmetic coder able to squeeze another good 10-15% out of the JPEG images I was testing with.
then I decided to try it with my BTIC format, which is basically an LZ77 compressed version of DXTn textures...
for one image (*1), results were fairly impressive: output size dropped by around 41%, giving file sizes comparable to a poorly compressed JPEG version (20% quality), yet with higher image quality than said 20%-quality JPEG (which had very noticeable artifacts...).
for another image (a high-res photographic image, *2), results were much less impressive though, with the BTIC output file size more comparable to that of a higher-quality (90%) JPEG, and the arithmetic coder only shaving off about 9%.
*1: pony ("ponyboost")
*2: logs, leaves, car, lawnmower
in both cases, the BTIC images are still a fair bit smaller than the PNG versions though.
granted, more extensive fine-tuning and similar could possibly be done, and I haven't done any benchmarks on this yet (like, to see how much of a performance impact the arithmetic coder has; it may not be worthwhile if it turns out to be unreasonably slow...).
also remotely possible would be testing it in combination with my BTAC (audio codec) format.
anyways, here is the arithmetic-coder code:
EDIT/ADD: not so promising are my initial benchmark results... will see about getting it faster...
ADD: more fiddling and testing continues. getting a better size/speed tradeoff with BTIC using Deflate than with AC.
current magic values are "BTIC1" (raw, *1), "BTIC1Z" (Deflate), "BTIC1A" (Arithmetic), "BTIC1AZ" (Arithmetic+Deflate).
it seems the images which compress well with AC also compress well with Deflate, and with plain Deflate the speeds are better (and 'A' and 'AZ' don't reliably produce smaller images than 'Z' either, *2).
*1: though the base format itself does use its own internal compression, its output is byte-based and so is still subject to (some) gains from entropy coding. namely, it uses a filter to reduce the number of unique image blocks, and also an LZ77-based compression stage (replacing repeating block patterns with references into a sliding window).
*2: overall, Deflate seems to be winning here, mostly due to the higher speeds...