so, for now I ended up just doing something lazier,
namely gluing a few more things onto my existing BTIC1C format:
predicted / differential colors (saves bits by storing many colors as an approximate delta value);
support for 2x2 pixel blocks (as a compromise between flat-color blocks and 4x4 pixel blocks, a 2x2 pixel block needs 8 bits rather than 32 bits);
simplistic motion compensation (blocks from prior frames may be translated into the new frame).
all were pretty lazy, most worked ok.
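as a rough illustration of the 8-bit vs 32-bit point for the 2x2 blocks: if each pixel carries a 2-bit selector, a 2x2 block needs 4 selectors (8 bits) and a 4x4 block needs 16 (32 bits). the sketch below expands the former into the latter by replicating each selector over a 2x2 quad of pixels. the function name, the 2-bits-per-pixel assumption, and the row-major, low-bits-first layout are all hypothetical, and are not taken from the BTIC1C spec.

```c
#include <assert.h>
#include <stdint.h>

/* Expand 8 bits of 2x2 selectors (2 bits each) into 32 bits of 4x4
 * selectors by replicating each selector over a 2x2 quad of pixels.
 * ASSUMPTION: selectors are stored row-major, lowest bits first. */
static uint32_t expand_2x2_to_4x4(uint8_t sel2x2)
{
    uint32_t out = 0;
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            int src = (y / 2) * 2 + (x / 2);            /* source selector 0..3 */
            uint32_t s = ((uint32_t)sel2x2 >> (src * 2)) & 3u;
            out |= s << ((y * 4 + x) * 2);              /* destination pixel 0..15 */
        }
    }
    return out;
}
```

a decoder could then push the expanded value through the same path it already uses for 4x4 blocks, which is one way to keep the extra block type cheap.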
the differential colors are a bit problematic though, as they are prone to messing up and producing graphical glitches (blocks whose color values seem to overflow/underflow, or miscolored splotches);
basically, it uses a Paeth filter (like in PNG) and tries to predict the block colors from adjacent blocks, which allows (in principle) the use of 7-bit color deltas (as a 5x5x5 cube) instead of full RGB555 colors in many cases.
I suspect there is a divergence between the encoder-side blocks and decoder-side blocks, which would account for the colors screwing up (the blocks as they come out of the quantizer look fine, implying that the deltas and quantization are not themselves at fault).
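a minimal sketch of the two pieces involved: the Paeth predictor as PNG defines it, and packing three small per-channel deltas into one 5x5x5 index (125 values, so it fits in 7 bits). the function names and the unit-step -2..2 delta range are assumptions made for illustration; the actual format may well scale its deltas differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Paeth predictor as in PNG: a = left, b = above, c = upper-left.
 * Returns whichever neighbor is closest to a + b - c. */
static int paeth_predict(int a, int b, int c)
{
    int p  = a + b - c;
    int pa = abs(p - a);
    int pb = abs(p - b);
    int pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Pack three deltas, each in -2..2 (ASSUMED range), into an index 0..124. */
static int pack_delta555(int dr, int dg, int db)
{
    assert(dr >= -2 && dr <= 2);
    assert(dg >= -2 && dg <= 2);
    assert(db >= -2 && db <= 2);
    return ((dr + 2) * 5 + (dg + 2)) * 5 + (db + 2);
}

/* Inverse of pack_delta555. */
static void unpack_delta555(int idx, int *dr, int *dg, int *db)
{
    assert(idx >= 0 && idx < 125);
    *db = idx % 5 - 2;
    *dg = (idx / 5) % 5 - 2;
    *dr = idx / 25 - 2;
}

/* Clamp to a 5-bit channel so prediction + delta cannot wrap around. */
static int clamp5(int v)
{
    return v < 0 ? 0 : (v > 31 ? 31 : v);
}
```

the decoder side would compute something like `clamp5(paeth_predict(left, up, upleft) + delta)` per channel; without the clamp, a value leaving the 0..31 range would wrap around when stored back into a 5-bit field.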
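one common way this sort of divergence shows up in differential coding (not necessarily what is going on here) is the encoder predicting from the original values while the decoder can only predict from what it has reconstructed. the toy below does single-channel, previous-value prediction with deltas clamped to -2..2; all names are hypothetical, and a real encoder would presumably escape to a full color rather than clamp.

```c
#include <assert.h>

static int clamp5(int v)      { return v < 0 ? 0 : (v > 31 ? 31 : v); }
static int quant_delta(int d) { return d < -2 ? -2 : (d > 2 ? 2 : d); }

/* Open loop: predicts from the ORIGINAL previous value, which the
 * decoder never sees, so any quantization error sticks around. */
static void encode_open_loop(const int *src, int n, int *deltas)
{
    int prev = 0;
    for (int i = 0; i < n; i++) {
        deltas[i] = quant_delta(src[i] - prev);
        prev = src[i];
    }
}

/* Closed loop: predicts from the value the decoder will reconstruct,
 * so later deltas can correct earlier error. */
static void encode_closed_loop(const int *src, int n, int *deltas)
{
    int recon = 0;
    for (int i = 0; i < n; i++) {
        deltas[i] = quant_delta(src[i] - recon);
        recon = clamp5(recon + deltas[i]);   /* same step the decoder runs */
    }
}

/* Decoder: accumulates deltas onto its own reconstruction. */
static void decode_deltas(const int *deltas, int n, int *out)
{
    int recon = 0;
    for (int i = 0; i < n; i++) {
        recon = clamp5(recon + deltas[i]);
        out[i] = recon;
    }
}
```

with an input that jumps from 0 to 10 and stays there, the closed-loop version catches up over a few samples, while the open-loop version stays stuck at 2. checking that the encoder's predictor state matches the decoder's after every block would be one way to test for this.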
the 2x2 blocks and motion compensation were each a little more effective. while not pixel-accurate, the motion compensation can at least sort of deal with general movement and seems better than having nothing at all.
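for reference, block-granular motion compensation of this sort can be about as simple as copying an already-decoded block from the prior frame at some offset. the sketch below assumes offsets in whole blocks and a 16-byte block; the `Block` type, the function name, and the parameters are all made up for illustration, and are not the actual BTIC1C structures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* HYPOTHETICAL encoded block: 16 bytes, contents opaque here. */
typedef struct { uint8_t data[16]; } Block;

/* Copy the block at (bx+dx, by+dy) in the previous frame to (bx, by)
 * in the current frame. Frames are bw x bh arrays of blocks, and the
 * offsets are in whole blocks (hence not pixel-accurate).
 * Returns 1 on success, 0 if the source falls outside the frame. */
static int translate_block(const Block *prev, Block *cur,
                           int bw, int bh,
                           int bx, int by, int dx, int dy)
{
    int sx = bx + dx;
    int sy = by + dy;
    assert(bx >= 0 && bx < bw && by >= 0 && by < bh);
    if (sx < 0 || sx >= bw || sy < 0 || sy >= bh)
        return 0;                       /* caller falls back to a coded block */
    memcpy(&cur[by * bw + bx], &prev[sy * bw + sx], sizeof(Block));
    return 1;
}
```

since nothing is re-encoded, the cost is a copy per block, which fits with not wanting new features to hurt decode speed.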
I suspect in general it is doing "ok" with size/quality in that I can have a 2 minute video in 50MB at 512x512 and not have it look entirely awful.
decided to run a few benchmarks, partly to verify some of my new features didn't kill decode performance.
decode speed to RGBA: ~ 140 Mpix/sec;
decode speed to DXT5: ~ 670 Mpix/sec.
decode speed to RGBA: ~ 118 Mpix/sec;
decode speed to DXT5: ~ 389 Mpix/sec.
then started wondering what would be the results of trying a multi-threaded decoder (with 4 decoder threads):
420 Mpix/sec to RGBA;
2100 Mpix/sec DXT5 (IOW: approx 2.1 gigapixels per second).
this is for a non-Deflated version; for the Deflated version, performance kind of goes to crap, as the threads all end up ramming into a mutex protecting the inflater (which is not currently thread safe).
BTIC1C spec (working draft):
BTIC3A partial spec (idea spec):
(doesn't seem like much, but the issues are more subtle).
well, it looks like 3A may not be entirely dead: there are a few parts I am considering trying to "generalize out", so it may not all be a loss. for example, the bitstream code was originally generalized somewhat (mostly as I was like "you know what, copy-pasting a lot of this is getting stupid", and also because it still shares some structures with BTIC2C).
likewise, I may generalize out the use of 256-bit meta-blocks on the encoder end (rather than a 128-bit block format), partly as the format needs to deal both with representing pixel data and with some amount of internal metadata (mostly related to the block quantizer), and 256 bits provides a little more room to work with.
don't know yet if this could lead to a (probably less ambitious) 3B effort, or what exactly this would look like (several possibilities exist). partly tempted by thoughts of maybe using a PNG-like or DWT-based transform for the block colors.