the most basic idea is doing something like DXTn / BCn, but for audio.
this effectively means: fixed number of samples into fixed-size blocks.
it was initially planned to be like DXT, with only a single universal type of block, but experiments showed that no single block encoding really did well "for everything".
so, core idea:
a fixed-size block of mono or stereo samples (currently 64) is encoded into a fixed-size block (currently 256 bits, or 32 bytes). this leads to 4 bits per sample (for both mono and stereo), and a bit-rate of about 176 kbps for 44.1 kHz audio.
part of the idea here is to allow fast random access to sample blocks, without needing to first decode all of the audio (instead probably a fixed-size cache can be used). each block can then be decoded independently and stored in the cache. also, a design priority was that everything be a nice power-of-2 size.
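the bit-rate arithmetic can be sketched as below. this is just an illustration of the numbers above; the function name `block_bitrate` is made up for the example and is not from the actual code.

```c
#include <stdint.h>

/* Bits per second for a codec that stores `samples_per_block` samples
   in `bits_per_block` bits, at `sample_rate` samples per second.
   (hypothetical helper, for illustration only) */
uint32_t block_bitrate(uint32_t sample_rate,
                       uint32_t samples_per_block,
                       uint32_t bits_per_block)
{
    /* use 64-bit intermediate math so the product cannot overflow */
    return (uint32_t)(((uint64_t)sample_rate * bits_per_block)
                      / samples_per_block);
}
```

with 64 samples in 256 bits at 44100 Hz this gives 176400 bits/second (the "176 kbps" figure); with 128 samples in 256 bits it gives 88200.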
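as a rough sketch of what the random access could look like (not the actual code): since every block is 32 bytes and holds 64 samples, the block for any sample index is found with a divide and a multiply, and a small direct-mapped cache can hold recently decoded blocks. the names `block_cache`, `cache_get_sample`, `CACHE_SLOTS`, and the stand-in `stub_decode` are all assumptions made up for this example, and it only handles a single (mono) output channel.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_SAMPLES = 64, BLOCK_BYTES = 32, CACHE_SLOTS = 16 };

/* Any function that decodes one 32-byte block into 64 PCM samples. */
typedef void (*block_decode_fn)(const uint8_t *blk, int16_t *out);

typedef struct {
    size_t   tag[CACHE_SLOTS];  /* block index + 1; 0 marks an empty slot */
    int16_t  pcm[CACHE_SLOTS][BLOCK_SAMPLES];
    unsigned misses;            /* number of blocks actually decoded */
} block_cache;

/* Byte offset of the block that contains sample `sample_idx`. */
size_t block_byte_offset(size_t sample_idx)
{
    return (sample_idx / BLOCK_SAMPLES) * BLOCK_BYTES;
}

/* Fetch one sample, decoding its block only if it is not cached. */
int16_t cache_get_sample(block_cache *c, const uint8_t *data,
                         size_t sample_idx, block_decode_fn decode)
{
    size_t blk  = sample_idx / BLOCK_SAMPLES;
    size_t slot = blk % CACHE_SLOTS;          /* direct-mapped slot */
    if (c->tag[slot] != blk + 1) {
        decode(data + block_byte_offset(sample_idx), c->pcm[slot]);
        c->tag[slot] = blk + 1;
        c->misses++;
    }
    return c->pcm[slot][sample_idx % BLOCK_SAMPLES];
}

/* Stand-in decoder, for demonstration only: fills the output with the
   block's first byte. A real decoder would go here instead. */
void stub_decode(const uint8_t *blk, int16_t *out)
{
    for (int i = 0; i < BLOCK_SAMPLES; i++)
        out[i] = blk[0];
}
```

a zero-initialized `block_cache` starts out empty. two lookups that land in the same block only decode it once, which is the point of keeping blocks independently decodable.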
other options would have drawbacks:
for example, Ogg/Vorbis and MP3 aren't really well suited to random-access (likely requiring storing sound-effects decoded in advance into PCM form);
many traditional ADPCM variants also have a similar issue.
in both cases, it implies either linear stream-style playback, or needing to decode in advance.
also, if storing stereo audio, 4-bit ADPCM requires about 352 kbps at 44.1 kHz (because it stores both the left and right channels, rather than using joint-stereo).
staying at <= the bitrate of (mono) ADPCM, and ideally comparable or better audio quality, was a goal.
regardless of exact audio quality, the goal does seem to have been met.
(I can't say I have beaten ADPCM quality, but can at least say that in its present form it is not significantly worse...).
(my initial code I put up was just after I had started getting it working, and has considerably worse sound quality than the current form...).
/*
Experimental block-based audio codec.

Encodes blocks of 64 samples into 256 bits (32 bytes).
At 44.1kHz this is 176kbps.
It can encode stereo using a "naive joint stereo" encoding.

Most block formats will encode a single center channel and will offset it
for the left/right channel.

Basic Format 0:
    4 bit: Block-Mode (0)
    currently unused (12 bits, zeroed)
    16 bit min sample (center)
    16 bit max sample (center)
    8 bit left-center min (truncated)
    8 bit left-center max
    64 Samples, 1 bits/sample (64 bits)
    16x 4-bit min (64 bits)
    16x 4-bit max (64 bits)

The 4-bit values interpolate between the full min/max for the block.
The 1-bit samples select between the min and max value for each sample.

Note: Interpolated values are linear, thus
    0=0/15, 1=1/15, 2=2/15, ..., 14=14/15, 15=15/15

Bit packing is in low-high order, and multibyte values are little-endian.

Basic Format 1:
    4 bit: Block-Mode (1)
    currently unused (12 bits, zeroed)
    16 bit min sample (center)
    16 bit max sample (center)
    8 bit left-center min (truncated)
    8 bit left-center max
    32x 2-bit sample (64 bits)
    32x 4-bit sample (128 bits)

This directly codes all samples, with the 4-bit values encoding even samples,
and the 2-bit values encoding odd samples.

The 4-bit samples are encoded between the block min/max values,
and the 2-bit samples between the prior/next sample.

Sample interpolation (2 bit samples):
    0=prior sample
    1=next sample
    2=average
    3=quadratic interpolated value

Basic Format 2:
    4 bit: Block-Mode (2)
    currently unused (12 bits, zeroed)
    16 bit min sample (center)
    16 bit max sample (center)
    8 bit left-center min (truncated)
    8 bit left-center max
    32x 6-bit samples (192 bits)

This directly codes samples, with the 6-bit values encoding samples.
The 6-bit samples are encoded between the block min/max values.

This mode encodes even samples, with odd-samples being interpolated.
The last sample is extrapolated.

Stereo Format 3:
    4 bit: Block-Mode (3)
    currently unused (12 bits, zeroed)
    16 bit min sample (center)
    16 bit max sample (center)
    8 bit left-center min (truncated)
    8 bit left-center max
    32x 2-bit pan (64 bits)
    32x 4-bit sample (128 bits)

This directly codes samples, with the 4-bit values encoding even samples.
The 2-bit pan value encodes the relative pan of the sample.

The 4-bit samples are encoded between the block min/max values.

The 2-bit samples represent values as:
    0=center pan (offset):
        The sample will be offset for left/right channels.
    1=center-pan (duplicate):
        The sample will be the same (center) value for both channels.
    2=left-pan:
        The sample will be panned towards the left.
    3=right pan:
        The sample will be panned towards the right.

This mode encodes even samples, with odd-samples being interpolated.

Basic Format 4:
    4 bit: Block-Mode (4)
    currently unused (12 bits, zeroed)
    16 bit min sample (center)
    16 bit max sample (center)
    8 bit left-center min (truncated)
    8 bit left-center max
    8x 4-bit min (32 bits)
    8x 4-bit max (32 bits)
    64x 2-bit sample (128 bits)

The 4-bit values interpolate between the full min/max for the block.

The 2-bit samples interpolate between the min and max values for each
sub-block (0=min, 1=1/3, 2=2/3, 3=max).
*/
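to make the layout a bit more concrete, here is a minimal sketch of decoding the center channel of a Format 4 block. it is not the actual decoder, and it makes several assumptions that the spec comment does not spell out: that fields are packed in exactly the order listed (so the header is bits 0..63, sub-block mins start at bit 64, maxes at bit 96, samples at bit 128), that the 16-bit min/max are signed, that each of the 8 sub-blocks covers 8 consecutive samples, and that interpolation rounds by plain integer division. the left/right offset fields are ignored here. all function names are made up for the example.

```c
#include <stdint.h>

/* Read `n` bits (n <= 16) starting at absolute bit position `pos`,
   lowest bit first within each byte (low-high packing). */
static unsigned get_bits(const uint8_t *p, int pos, int n)
{
    unsigned v = 0;
    for (int i = 0; i < n; i++) {
        int b = pos + i;
        v |= (unsigned)((p[b >> 3] >> (b & 7)) & 1u) << i;
    }
    return v;
}

/* Sign-extend a 16-bit value held in an unsigned. */
static int sign16(unsigned v)
{
    return (int)(v ^ 0x8000u) - 0x8000;
}

/* Linear interpolation: lo + (hi - lo) * num / den. */
static int lerp_i(int lo, int hi, int num, int den)
{
    return lo + ((hi - lo) * num) / den;
}

/* Decode the center channel of a mode-4 block into 64 samples.
   Returns 0 on success, -1 if the block is not mode 4. */
int decode_mode4_center(const uint8_t blk[32], int16_t out[64])
{
    if (get_bits(blk, 0, 4) != 4)
        return -1;

    int bmin = sign16(get_bits(blk, 16, 16));   /* block min (center) */
    int bmax = sign16(get_bits(blk, 32, 16));   /* block max (center) */

    for (int i = 0; i < 64; i++) {
        int g = i >> 3;                         /* sub-block index, 0..7 */
        /* 4-bit values pick the sub-block range within the block range */
        int smin = lerp_i(bmin, bmax, (int)get_bits(blk, 64 + 4 * g, 4), 15);
        int smax = lerp_i(bmin, bmax, (int)get_bits(blk, 96 + 4 * g, 4), 15);
        /* 2-bit value picks 0, 1/3, 2/3 or 1 of the sub-block range */
        int s = (int)get_bits(blk, 128 + 2 * i, 2);
        out[i] = (int16_t)lerp_i(smin, smax, s, 3);
    }
    return 0;
}
```

the same `get_bits` style reader would also cover the other formats, since they only differ in how the 192 payload bits after the 64-bit header are split up.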
note that some things which seem like they would do better, actually do worse.
for example, 16x 12-bit samples with interpolated intermediate values actually did poorly (the increase in sample precision did not offset the reduction in the number of representable samples).
likewise, in early tests, storing all samples directly as 3-bit interpolated values didn't really do well (vs the use of 4-bit min-max values over groups with 1 or 2 bits per sample selecting each value).
likewise, block-mode 4 did pretty well, as it seems to partly overlap with the ranges of 0 and 1/2. however, no single mode does clearly better for the various songs tested, as different songs seem to give different breakdowns of relative filter choices.
granted, a simpler filter would probably need to choose an option which generally does fairly well, which at the moment is split mostly between 1, 2, and 4.
0 seems biased mostly for "noisy" sounds, and 3 is only really used much when there is a more significant left/right divergence.
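one simple way such a per-block mode choice could work is brute force: encode the block in every mode, decode each result, and keep the mode with the smallest error. the sketch below assumes that approach with a plain sum-of-squared-errors measure; the source does not say how the real filter picks, so treat `block_sse` and `pick_best_mode` as made-up names for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of squared differences between two runs of `n` samples. */
uint64_t block_sse(const int16_t *a, const int16_t *b, int n)
{
    uint64_t sum = 0;
    for (int i = 0; i < n; i++) {
        int64_t d = (int64_t)a[i] - (int64_t)b[i];
        sum += (uint64_t)(d * d);
    }
    return sum;
}

/* `recon` holds `n_modes` reconstructions back to back, `n` samples each:
   recon[m * n .. m * n + n - 1] is the block as it would decode after
   being encoded in mode m. Returns the index of the mode with the lowest
   error (the first one on ties), or -1 if n_modes <= 0. */
int pick_best_mode(const int16_t *orig, const int16_t *recon,
                   int n_modes, int n)
{
    int best = -1;
    uint64_t best_err = UINT64_MAX;
    for (int m = 0; m < n_modes; m++) {
        uint64_t err = block_sse(orig, recon + (size_t)m * (size_t)n, n);
        if (err < best_err) {
            best_err = err;
            best = m;
        }
    }
    return best;
}
```

counting how often each index comes back over a whole song would give the kind of per-mode breakdown described above.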
I can't just do an ADPCM block, as I don't really have enough bits to make this work out well (unless it were ADPCM + odd-sample interpolation, which is at least possible, but it is uncertain how well it would work compared with the range-based approach).
this is actually closer to what I had initially imagined though, but I couldn't think up any good way to fit ADPCM in power-of-2 sized blocks with a power-of-2 number of samples while using fewer bits than other ADPCM strategies. (things are a lot less pretty at ~ 3 bits / sample).
also, I was initially working at a lower target bit-rate: 88 kbps, but I soon doubted I could actually pull off the whole "doesn't sound like total crap" part, so "upgraded" the design to using 176 kbps (by halving the target number of samples), which was the next step up while still keeping everything power-of-2. (there is still some code from the earlier 128-samples in 256-bits form, which is most closely related to block-type 0, just with 128 1-bit samples, and a smaller number of groups each addressing a larger number of samples, namely: 8 groups of 16 samples).
also, the current choice of 64 samples was specific:
much larger, and the waveform generally actually starts looking like a wave (as opposed to a shaky line, *1);
much smaller, and block overhead would eat up pretty much everything else.
64 samples in 256 bits seemed to be roughly the "local minimum". it was also chosen because 176 kbps is fairly close to the "standard" 128 kbps used for Ogg and MP3, so it would produce "similar" sized files to MP3, albeit with somewhat worse quality... (vs 352 kbps, which would be a bit steep...).
*1: a curve is a bit more of a problem than a relatively flat line, and a full cycle is just bad (as then we have to deal with a much larger value range).
EDIT / ADD:
core code has been made available:
yes, it is a little bigger/more complex than would be ideal...