second test with Quake3 Arena, this time in 1440x900...
differences from before:
* More performance tweaks / micro-optimization;
* Now the rasterizer supports multiple threads (2 threads were used in this test, with the screen divided in half);
* Inline ASM is used in a few places;
it is still sort of laggy, but I am not really sure how far software rasterization can be pushed on a generic desktop PC.
CPU: Phenom II X4 3.4 GHz;
RAM: 4x4GB PC3-1066
note: it is a fair bit faster at 1024x768 or lower...
basically, lacking much better to do recently, I wrote a basic software rasterizer and put an OpenGL front-end on it, then proceeded to make Quake 2 work on it:
yes, kind of sucks, but there are some limits as to what is likely doable on the CPU. also, it is plain C and single threaded scalar code (no SSE). a rasterizer using multiple threads and/or SSE could potentially do a little more, but I don't know, and don't expect there to really be much practical use for something like this, so alas.
writing something like this, though, does make one a little more aware of what is involved in getting from geometry to the final output.
otherwise, may need to try to find something more relevant to do...
ADD: as-is, it basically mimics an "OpenGL Miniport" DLL, which means it exports the usual GL 1.1 calls, along with some WGL calls, and some wrapped GDI calls.
this is loaded up by Quake2, which then goes and uses "GetProcAddress" a bunch of times to fetch the various function pointers.
it has to export pretty much all of the 1.1 calls, though a lot of them are basically no-op stubs (would normally set an error status, currently rigged up to intentionally crash so the debugger can catch it...).
as for calls implemented: simple answer: about 1/4 to 1/2 of them.
as for functionality implemented by the rasterizer:
* most stuff related to glBegin/glEnd;
* things like glTexImage2D, glTexParameter, ...
* various misc things, like glClear, glClearColor, glDepthRange, ...
* matrix operations (glPushMatrix, glPopMatrix, ...)
* ...
as for functionality not implemented:
* texture-coordinate generation stuff;
* display lists, selection buffers, accumulation buffer, ...
* pretty much everything else where I was like "what is this and what would it be used for?"
* currently doesn't do DrawArrays or DrawElements, but this may change.
** would basically be needed for Quake3 to work IIRC.
** partial provisions have been made, but logic isn't written yet.
internally, it implements the actual drawing to the screen via CreateDIBSection and BitBlt and similar.
then it has a few buffers, for example, a color-buffer, implemented as an array of 32-bit pixel colors (in 0xAARRGGBB order, AKA, BGRA), as well as a Depth+Stencil buffer in Depth24_Stencil8 format (I was originally going to use Depth16 and no stencil, but then I realized that space for a stencil buffer could be provided "almost for free").
at its core, its main operation is "drawing spans", which look something like:
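(the actual snippet isn't reproduced here; instead, here is a minimal hypothetical sketch, with made-up names, of a flat-color span drawer with a depth test:)

```c
#include <stdint.h>

/* Hypothetical sketch of a flat-color span drawer (names are made up).
   Draws a horizontal run of 'n' pixels with a per-pixel GL_LESS depth
   test. 'clr' is a 0xAARRGGBB color; 'z'/'zstep' are depth values. */
void DrawSpanFlatZTest(uint32_t *cbuf, uint32_t *zbuf,
                       int n, uint32_t clr, uint32_t z, int32_t zstep)
{
    int i;
    for (i = 0; i < n; i++) {
        if (z < zbuf[i]) {      /* depth test (GL_LESS) */
            cbuf[i] = clr;      /* write color */
            zbuf[i] = z;        /* write depth */
        }
        z += zstep;
    }
}
```

a real version would also step texture coordinates and route through whatever blend/test logic applies.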
with functions which in turn use these for drawing triangles, ... (damn near everything is decomposed into triangles).
these functions are basically given raw pointers to the appropriate locations in the respective framebuffers.
the basic strategy for drawing each triangle is to sort the vertices from lowest to highest Y coordinate, then walk from one end of the triangle to the other, drawing each span.
* so: Y0=lowest, Y1=middle, Y2=highest
* calculate stepping vectors for left/right sides (Y0 to Y1)
* walk from Y0 to Y1, drawing each span.
* recalculate vectors for Y1 to Y2
* walk from Y1 to Y2, drawing spans.
dunno if there is a better way.
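for what it's worth, the sort-and-walk approach can be sketched roughly like this (hypothetical names, and floats instead of the fixed-point stepping a real rasterizer would use; coverage just gets marked in a small test framebuffer):

```c
#include <stdint.h>

/* Hypothetical sketch of the sort-and-walk triangle fill: sort the
   vertices by Y, then scan the two halves, stepping the left/right
   edge X values and drawing one span per scanline. A real rasterizer
   would use fixed-point stepping and interpolate Z/texcoords too. */
#define FB_W 16
#define FB_H 16
static uint8_t fb[FB_H][FB_W];

typedef struct { float x, y; } Vec2;

static void DrawSpan(int y, float xa, float xb)
{
    int x;
    if (xa > xb) { float t = xa; xa = xb; xb = t; }
    if (y < 0 || y >= FB_H) return;
    for (x = (int)xa; x <= (int)xb; x++)
        if (x >= 0 && x < FB_W) fb[y][x] = 1;
}

void DrawTriangleFlat(Vec2 v0, Vec2 v1, Vec2 v2)
{
    Vec2 t;
    float xl, xr, dxl, dxr;
    int y;

    /* sort: v0=lowest Y, v1=middle, v2=highest */
    if (v1.y < v0.y) { t = v0; v0 = v1; v1 = t; }
    if (v2.y < v0.y) { t = v0; v0 = v2; v2 = t; }
    if (v2.y < v1.y) { t = v1; v1 = v2; v2 = t; }

    /* upper half: edges v0->v1 and v0->v2 */
    if (v1.y > v0.y) {
        dxl = (v1.x - v0.x) / (v1.y - v0.y);
        dxr = (v2.x - v0.x) / (v2.y - v0.y);
        xl = xr = v0.x;
        for (y = (int)v0.y; y < (int)v1.y; y++) {
            DrawSpan(y, xl, xr);
            xl += dxl; xr += dxr;
        }
    }

    /* lower half: edges v1->v2 and v0->v2 */
    if (v2.y > v1.y) {
        dxl = (v2.x - v1.x) / (v2.y - v1.y);
        dxr = (v2.x - v0.x) / (v2.y - v0.y);
        xl = v1.x;
        xr = v0.x + (v1.y - v0.y) * dxr;
        for (y = (int)v1.y; y < (int)v2.y; y++) {
            DrawSpan(y, xl, xr);
            xl += dxl; xr += dxr;
        }
    }
}
```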
the actual process of rendering goes something like:
* build up arrays of vertices via the glBegin/glEnd/glVertex interface;
* if needed, decompose into triangles (pretty much everything other than GL_TRIANGLES is decomposed);
** ok, GL_QUADS is also semi-primitive, but in the rasterizer each quad is treated as a triangle pair.
* feed vertex data through a combined Projection*ModelView matrix;
* subdivide triangles, such that each triangle has a limited screen-area;
** needed or else textures warp all over the place (crazy bad texture deformation);
** each triangle, if sufficiently large, is split into 4 sub-triangles, which may happen recursively;
*** Zelda Triforce logo configuration.
** quads are also divided into 4 pieces.
* divide all vertex XYZ coordinates by W;
* clip all the triangles/quads/... to fit on screen;
* convert this into the form used by the rasterizer:
** fixed-point XY values with separate Z;
** set magic numbers to indicate which drawing logic will be used.
*** flat color? texture? needs fancy blending or tests? ...
* hand them off to the backend.
backend:
* walks lists of triangles/quads, passing each to the appropriate rasterizer function.
** quads basically just invoke the triangle-drawer twice, first for vertices 0/1/2, then for 0/2/3.
* there are different triangle-draw functions for different types of triangles (flat, textured, interpolated color, ...)
** in turn, function-pointers are often used to select the appropriate span-drawing functions.
* the handling of blending/tests/... is basically done by assembling the blend-and-test logic out of function pointers.
** different functions for the different collections of tests to be performed;
** functions to select each individual operator for a given test
*** GL_LESS vs GL_GEQUAL, GL_SRC_COLOR vs GL_DST_ALPHA, ...
*** don't want to use switches, as these are *very slow* if done per-pixel.
**** wanted to avoid the thing going at glacial speeds at least...
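the operator-selection idea can be illustrated with a small hypothetical sketch: the switch on the GL enum runs once at triangle/span setup, and the per-pixel work just calls through the returned pointer:

```c
#include <stdint.h>

/* Hypothetical sketch: selecting a depth-test comparator once per
   span/triangle setup via a function pointer, instead of switching on
   the GL enum per-pixel. Only two comparators shown for brevity. */

#define GL_LESS    0x0201
#define GL_GEQUAL  0x0206

typedef int (*DepthTestFn)(uint32_t z, uint32_t zb);

static int DepthTest_Less(uint32_t z, uint32_t zb)   { return z <  zb; }
static int DepthTest_GEqual(uint32_t z, uint32_t zb) { return z >= zb; }

DepthTestFn GetDepthTestFn(int glenum)
{
    switch (glenum) {           /* this switch runs once, not per-pixel */
    case GL_LESS:   return DepthTest_Less;
    case GL_GEQUAL: return DepthTest_GEqual;
    default:        return DepthTest_Less;
    }
}
```

the blend-factor selection works the same way, just with more functions to pick from.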
I had recently fiddled some with BTIC1C real-time recording, and have got some interesting results:
main desktop PC, mostly holds a solid 29/30 fps for recording.
* 1680x1050p30 on a 3.4 GHz Phenom II X4 with 16GB PC3-1066 RAM.
it holds 25-28 fps on my newer laptop:
* 1440x900p30 on a 2.1 GHz Pentium Dual-Core with 4GB RAM.
it does a solid 30 fps on my older laptop:
* 1024x768p30 on a 1.6 GHz Mobile Athlon (single-core) with 1GB RAM.
** thought it was 1.2 GHz, seems I was misremembering.
** kind of kills the comparison though, as this laptop is too fast for its screen resolution.
*** to be fair, for this resolution, the CPU would have needed to be ~ 1.2-1.4 GHz.
was half-considering testing on an ASUS EEE, but I seem to have misplaced it.
running off other calculations, there is a statistically high chance that an EEE would be able to record full-screen video at ~ 20 fps or so (given its clock-speeds and resolution).
or, if I could build for Android, maybe testing on my tablet or phone (Sony Xperia X8, *).
*: like the EEE, the question is basically what sort of video encoding I can get out of a ~600 MHz CPU.
linear extrapolation implies it should be able to pull around 20 fps from 800x480 and ~ 30fps from 640x480.
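the extrapolation is just scaling pixel throughput linearly with clock speed; roughly like so (treating the 1.6 GHz / 1024x768p30 result as the baseline is an assumption on my part):

```c
/* Back-of-envelope version of the extrapolation above: assume recorded
   pixels-per-second scales linearly with CPU clock, using the 1.6 GHz /
   1024x768 @ 30 fps result as the baseline (assumed, not measured). */
double ExtrapolatedFps(double base_w, double base_h, double base_fps,
                       double base_ghz, double ghz, double w, double h)
{
    double pps = base_w * base_h * base_fps * (ghz / base_ghz);
    return pps / (w * h);
}

/* ExtrapolatedFps(1024, 768, 30, 1.6, 0.6, 800, 480) -> ~23 fps
   ExtrapolatedFps(1024, 768, 30, 1.6, 0.6, 640, 480) -> ~28.8 fps */
```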
my tablet has HW stats on-par with my laptops though, so it may not mean a whole lot (ran 3DMark, got 5225... CPU speed is similar to old laptop, but graphics and framerates look a lot prettier than either of my laptops).
actually, theoretically, the Ouya also has HW stats on par with my laptops as well (raw clock speed in-between them, but has 4 cores and fast RAM).
ADD: ( testing desktop PC with 3DMark
Cloud Gate: 10890
Ice Storm: 81583
Fire Strike: 4083 (*)
*: Updated, didn't crash this time... )
here is from another test involving desktop recording and Minecraft:
most of the lag/jerkiness was actually from Minecraft itself, which is basically how it plays on my computer (though I was happy when I originally got some newer parts, mostly because Minecraft now usually stays above 20 fps).
messed recently with special 4:2:0 block-modes, which can improve image quality but hurt encoder speeds (they effectively store color information for each 2x2 pixel sub-block, rather than for the whole 4x4 block, and require more arithmetic in the pixel-to-block transform).
I did introduce a more limited form of differential color-coding, which seems to have actually increased encoder speed for some reason (colors will often be stored as a delta from the prior color rather than the color itself, *).
generally, color prediction has been largely restricted down to last-seen-color prediction, mostly because this can be done without needing to retrieve colors from the output blocks (it will simply keep track of the last-seen colors, using these as the predictors, which is a little cheaper, and also less problematic).
*: as-is, for 23-bit colors, a 15-bit delta-color will be used, and will fall back to explicit colors for large deltas.
so, what happened recently:
one of my HDDs (in this case, a Seagate 2TB drive) decided to die on me mostly without warning.
* curse of Seagate: damn near every time I have used a Seagate drive, it has usually ended up dying on me.
** sort of like my experience with ASUS motherboards...
*** had reliability problems with several ASUS MOBOs, got a Gigabyte MOBO, and everything worked.
*** maybe not a good sign when the MOBOs come wrapped in cardboard.
** actually the MOBO issue is less severe, a crashy OS is less of an issue than lots of lost data.
luckily for me, it wasn't my main OS drive (which is currently a "WD Caviar Black").
well, originally I was using a "Caviar Green" for my main OS drive, but it was 5400 RPM and often had performance issues (computer would often lag/stall, becoming somewhat IO bound). using a (more expensive) 7200 RPM drive made performance a bit better.
but, yeah, didn't lose much particularly important, as (luckily) I tend to try to keep multiple copies of a lot of more important stuff (on different drives, ...). but, still did lose some amount of stuff (like, all my 2D character art, and random downloaded YouTube videos, and my installations of Steam and VS2013, ...), and may have lost one of my newer fiction stories, ...
some of this is because, sadly, both drive failures (and the OS going and stupidly turning the filesystem into mincemeat) are not entirely uncommon IME (though the FS-mincemeat issue seemed more common with WinXP and NTFS drives; I have not seen it happen with Win7 thus far, nor had I seen it happen with FAT32 drives).
in this case, the HDD itself mostly just stopped working, with Windows mostly just giving lots of "drive controller not ready" error messages, and Windows otherwise not listing the drive as working. though a few times it had worked sort-of, and (with luck) when I get a new HDD, maybe I can see if I can image the old contents onto a new drive.
otherwise, recently added an in-engine profiler.
mostly this was because after the HDD crash, and resorting back to VS2008 for building my 3D engine (VS2013 had been installed on the crashed HDD), CodeXL stopped being able to effectively profile the thing for some reason or another.
since both CodeAnalyst and CodeXL have a lot of "often don't work for crap" issues, I was like "oh hell with it" and basically just made a rudimentary profiler which can be run inside the engine. it aggregates things and tells me which functions use most of the time, which is the main thing anyways (source-level profiling is nice, but would be more involved, and would probably require a proper UI vs just dumping crap to the console).
did observe that the majority of the execution time in these tests was going into "NtDelayExecution", which was mostly related to sleeping threads. made it so that the statistics aggregation ignores this function, mostly so that more sane percentages can be given to other functions.
beyond this, most of the execution time seems to be going into the OpenGL driver, and into some otherwise unknown machine-code (not part of any of the loaded DLLs, nor part of the BSVM JIT / executable-heap). may be part of OpenGL.
this becomes more so if the draw-distance is increased.
did otherwise make some animated clouds and a new sun effect.
basically, rather than simply using a static skybox, it now uses a skybox with a sun overlay and some animated clouds overlaid (though with a few unresolved issues, like color-blending not working on the clouds for some reason I have yet to figure out).
new clouds and some tweaks to metal biome can be seen here:
started working on a 2D animation tool, but then ran into UI complexities;
UI handling in my 3D engine has become a bit of a tangled mess, and there is no real abstraction over it (most things are, for the most part, handling keyboard events and mouse movements...);
there is theoretically support for GUI widgets, but I wrote the code in question 10 years ago, and managed to do it sufficiently badly that doing UI stuff via drawing stuff and raw input handling is actually easier (*1);
sometimes I look into trying to clean up the GUI widgets thing, and am like "blarg" and don't make a whole lot of progress;
other times, I consider "maybe I will make a new GUI widgets system that *doesn't* totally suck", followed by "but I already have these existing widgets, maybe I can fix it up?" followed by "blarg!".
10 years ago I wrote a few things which were ok, and some other stuff which is just plain nasty, but has largely become a black box in that I can't really make it not suck, but also often can't easily replace it without breaking other stuff.
*1: but it is GUI widgets. don't these normally suck?...
well, yes, but this one extra sucks, as it was basically based around a stack-based mapping of XHTML forms to C;
but, it was done without any way to distinguish form instances: every widget is identified via a global 'id' name, and there may only be a single widget with this name, anywhere. also, no facilities were provided, you know, to update widget contents.
also didn't turn out to really be a sane design for most of the types of stuff I am doing.
so, somehow, I managed to make something pretty much less usable or useful than GTK or GDI...
doesn't help much when looking into it and realizing that there isn't much logic behind it that isn't stuff one would need to replace anyways (some of the structs are useful, but seemingly this is about it).
but, a 2D animation tool, while it needs a UI, doesn't necessarily need a traditional GUI.
"well, there are always modes and keyboard shortcuts!". yes, fair enough, but it doesn't help when one is left with a UI where pretty much everything (in the 3D engine) is a big monolithic UI, and there are few good options for keyboard shortcuts remaining (and "CTRL+F1,CTRL+SHIFT+G" is a bit outside "good" territory).
yes, my 3D modeller, game, mapper, ... all use the same keyboard and mouse-handling code, just with a lot of internal flags controlling everything. in earlier forms of my game effort, it actually required doing an elaborate keyboard dance of various shortcuts to get into a mode where the controls would work as-expected. I then later made the engine front-end set this up by default and effectively lock the UI configuration (short of a special shortcut to "unlock" the UI).
theoretically, I have added a solution to partly address this: now you can "ESC,~" (or "ESC,SHIFT+`") into a tabbed selector for "running programs", along with a possible option for a considered drop-list for launching programs (which would probably work by stuffing commands into the console, probably launching scripts...). clicking on tabs can then be used to switch focus between programs, and possibly allowing a cleaner way to handle various use-cases ("hell, maybe I could add a text-editor and a graphics program and a file manager...", "oh, wait...").
but, on the positive side, effectively this mode bypasses nearly all of the normal user-input handling, allowing each "program" a lot more free rein over the keyboard shortcuts. architecturally, it is on-par with the console (toggled with "ALT+`", and is sort of like a shell, just currently without IO redirection or pipes).
but, OTOH, I am not so happy with the present UI situation...
a lot of this, is, horrid...
thus far, in the 2D animation tool, I can sort of add items and move them around and step between frames (with the movement being interpolated, ...), so it is a start, but still falls well short of what would be needed for a usable 2D animation tool (that is hopefully less effort than the current strategy of doing basic 2D animation via globs of script code...).
probably will need a concept of "scenes", where in each scene it will be possible to add objects and set various keyframes, ... but, at the moment, I am less certain, seems all like a bit of an undertaking.
did at least go and make some improvements to the in-game video recording:
switched from using RPZA to a BTIC1C subset for recording, which has somewhat better image quality and lower bitrate, in this case using a more speed-oriented encoder (vs the main encoder, which more prioritizes size/quality);
basically holds up pretty well with tests for recording at full-screen 1680x1050p24;
made some tweaks to reduce temporal aliasing issues (mostly related to inter-thread timing issues);
also fiddled some with trying to get audio more in sync (recorded video had the audio somewhat out of sync; sort of fudged it back into alignment by inserting about 400ms of silence at the start of the recording... but this is a crap solution... not sure at present of a good way to automatically adjust for internal A/V latency).
the BTIC1C variant encoder basically mostly just uses straight RGB23 blocks (with no quantization stage), and a higher-speed single-pass single-stop entropy backend (uses an extended Deflate-based format). this allows faster encoding albeit with worse compression.
the normal Deflate/BTLZH encoder uses a 3-pass encoding strategy:
LZ77 encode data, count up symbol statistics, build and emit Huffman tables, emit Huffman-coded LZ data.
the current encoder speeds this up slightly by using the prior statistics for building the Huffman table, then doing the LZ77 and Huffman coding at the same time. it also uses another trick which is that it doesn't actually "search" for matches, just hashes the data it encounters, and sees if the current hash-table entry points to a match.
the compression is a little worse, but the advantage is in being able to use the entropy backend for real-time encoding (vs the primary encoder which is a bit slow for real-time).
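the hash-and-check idea looks roughly like this (hypothetical simplified sketch: 3-byte hash, single-entry hash table, no lazy matching; a real encoder would also emit the literals and length/distance codes):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the "no search" LZ77 matcher described above:
   hash the next few bytes, check the single hash-table slot for a
   candidate match, and update the slot as we go. */

#define HASH_BITS 12
#define HASH_SIZE (1 << HASH_BITS)

static uint32_t Hash3(const uint8_t *p)
{
    uint32_t h = p[0] | (p[1] << 8) | (p[2] << 16);
    return (h * 2654435761u) >> (32 - HASH_BITS);
}

/* returns match length at 'pos' (0 if none), stores offset in *moff;
   'htab' holds prior positions, initialized to -1. */
int FindMatch(const uint8_t *buf, int pos, int end,
              int *htab, int *moff)
{
    uint32_t h;
    int mpos, len;

    if (pos + 3 > end)
        return 0;

    h = Hash3(buf + pos);
    mpos = htab[h];
    htab[h] = pos;              /* update slot to current position */

    if (mpos < 0 || mpos >= pos)
        return 0;
    if (memcmp(buf + mpos, buf + pos, 3) != 0)
        return 0;               /* hash hit, but the bytes differ */

    len = 3;
    while (pos + len < end && buf[mpos + len] == buf[pos + len])
        len++;
    *moff = pos - mpos;
    return len;
}
```

if the slot doesn't hold a match, the encoder just emits a literal and moves on, which is the whole speed trick: no chain walking.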
the temporal aliasing issue was mostly a problem which resulted in a notable drop in the effective framerate of the recording, as I had found that many of the frames which were captured were being lost and many frames were being duplicated in the output. I ended up making some tweaks to the handling of accumulation timers and similar, and the number of lost and duplicate frames is notably reduced.
test from in-game recording: 1680x1050p24 RGB23 uses about 19Mbps, and about 0.46 bpp.
in other tests, this works out to around 6-7 minutes per GB of recording.
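as a sanity check, these figures are self-consistent:

```c
/* Sanity-check of the figures above: bits-per-pixel and minutes-per-GB
   implied by ~19 Mbps at 1680x1050p24. */
double BitsPerPixel(double mbps, double w, double h, double fps)
{
    return (mbps * 1e6) / (w * h * fps);    /* ~0.45 bpp */
}

double MinutesPerGB(double mbps)
{
    return (8.0e9 / (mbps * 1e6)) / 60.0;   /* ~7 minutes */
}
```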
this is also a bit better than the roughly 2 minutes per GB I can get from M-JPEG, and seems to have "mostly similar" video quality (and without the JPEG encoder's limitation of being too slow for recording at higher resolutions, which is part of the reason I had switched over to RPZA to begin with).
don't yet have any videos up for the current version.
the most recent video I have at the time of this writing is for a version of the new codec prior to addressing a few image-quality issues, as well as the temporal-aliasing and audio-sync issues (so the video is a little laggy and the audio isn't really in-sync...).
what is it?... basically, an extended form of Deflate, intended mostly to offer a modest boost in compression with a "modest" impact on decoding speed.
in its simplest mode, it is basically just Deflate, and is binary compatible; otherwise, the decoder remains backwards compatible with Deflate.
its extensions are mostly as such:
* bigger maximum match length (64KiB);
* bigger maximum dictionary size (theoretical 4GB, likely smaller due to implementation limits);
* optional arithmetic-coded modes.
the idea was partly to have a compromise between Deflate and LZMA, with the encoder able to make some tradeoffs WRT compression settings (speed vs ratio, ...). the hope basically being to have something which could compress better than Deflate but decode faster than LZMA.
the arithmetic coder is currently applied after the Huffman and VLC coding. this speeds things up slightly by reducing the number of bits which have to be fed through the (otherwise slow) arithmetic coder, while at the same time still offering some (modest) compression benefit from the arithmetic coder.
otherwise, arithmetic coder can be left disabled (and bits are read/written more directly), in which case the decoding will be somewhat faster (it generally seems to make around a 10-15% size difference, but around a 2x decoding-speed difference).
ADD: in the tests with video stuff, overall I am getting around a 30% compression increase (vs Deflate).
what am I using it for? mostly as a Deflate alternative for the BTIC family of video codecs (many of which had used Deflate as their back-end entropy coder); possibly other use cases (compressing voxel region files?...). ...
otherwise, I am now much closer to being able to switch BTIC1C over to full RGB colors; most of the relevant logic has been written, so it is mostly finishing up and testing it at this point. this should improve the image-quality at higher quality settings for BC7 and RGBA output (but will have little effect on DXTn output).
most of the work here has been on the encoder end, mostly due to the original choice for the representation of pixel-blocks, and there being almost no abstraction over the block format here (it is sad when "move some of this crap into predicate functions and similar" is a big step forwards, a lot of this logic is basically decision trees and raw pointer arithmetic and bit-twiddling and similar). yeah, probably not a great implementation strategy in retrospect.
the current choice of blocks looks basically like:
  AlphaBlock: QWORD
  ColorBlock: QWORD
  ExtColorBlock: QWORD
  MetadataBlock: QWORD
so, each new encoder-side block is 256 bits, and spreads the color over the primary ColorBlock and ExtColorBlock. in total, there is currently about 60 bits for color data, which is currently used to (slightly inefficiently) encode a pair of 24-bit colors (had thought, "maybe I can use the other 32 bits for something else", may reconsider. had used a strategy where ExtColorBlock held a delta from the "canonical decoded color").
for 31F colors, I may need to use the block to hold the color-points directly:
  ExtColorBlock:
    ColorA: DWORD
    ColorB: DWORD
had also recently gained some quality improvement mostly by tweaking the algorithm for choosing color endpoints: rather than simply using a single gamma function and simply picking the brightest and darkest endpoints, it now uses 4 gamma functions. roughly, by fiddling, I got the best results with a CYGM (Cyan, Yellow, Green, Magenta) based color-space, where each gamma function is an impure form of these colors (permutations of 0.5, 0.35, 0.15). the block encoder then chooses the function (and endpoints) which generated the highest contrast. this basically improved quality with less impact on encoder speed than with some other options (it can still be done in a single pass over the input pixels). it generally improves the quality of sharp color transitions (reducing obvious color bleed), but does seem to come at the cost in these cases of slightly reducing the accuracy of preserved brightness.
this change was then also applied to my BC7 encoder and similar with good effect.
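roughly, the endpoint selection works like this (a hypothetical sketch; the exact weight permutations here are my guesses at "impure CYGM", not the actual tables):

```c
#include <stdint.h>

/* Hypothetical sketch of the endpoint-picking idea above: project each
   pixel through 4 impure CYGM-ish weightings (permutations of
   0.5/0.35/0.15), track min/max per function, and keep whichever
   function produced the highest contrast. Single pass over pixels. */
static const float wt[4][3] = {
    { 0.15f, 0.50f, 0.35f },    /* cyan-ish    (low red)    */
    { 0.50f, 0.35f, 0.15f },    /* yellow-ish  (low blue)   */
    { 0.35f, 0.50f, 0.15f },    /* green-ish   (high green) */
    { 0.50f, 0.15f, 0.35f },    /* magenta-ish (low green)  */
};

/* rgb: n pixels as packed R,G,B bytes; outputs pixel indices of the
   chosen dark/bright endpoints */
void PickEndpoints(const uint8_t *rgb, int n, int *imin, int *imax)
{
    int f, i;
    float best = -1.0f;

    for (f = 0; f < 4; f++) {
        float lo = 1e9f, hi = -1e9f;
        int ilo = 0, ihi = 0;
        for (i = 0; i < n; i++) {
            float v = wt[f][0] * rgb[i*3+0] +
                      wt[f][1] * rgb[i*3+1] +
                      wt[f][2] * rgb[i*3+2];
            if (v < lo) { lo = v; ilo = i; }
            if (v > hi) { hi = v; ihi = i; }
        }
        if (hi - lo > best) {   /* keep the highest-contrast projection */
            best = hi - lo;
            *imin = ilo;
            *imax = ihi;
        }
    }
}
```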
little provision is made for out-of-box use, as in, if anyone wants to compile or mess with it, probably some hackery will be needed. it provides VFW codec drivers for encoding/decoding, with the encoder currently hard-coded to use BTIC1C (with most of the encoding settings also hard-coded). in my tests it can encode videos with VirtualDub though, so it works here at least.
I may later consider adding a codec configuration UI or similar, as well as maybe clean up a few things (better provisions for handling logging and configuration). this would likely mean either putting the config information in the registry, or putting an INI somewhere (rather than just hard-coding stuff like where to put the log file and similar).
I have thus far not really finished the level of 1C encoder modifications needed to effectively support expanded color depths (moving forwards here largely requires some fairly non-trivial rewriting of the encoder, effectively moving the encoder over to a new intermediate block format, ...).
on another note: made a recent observation that speech is still intelligible at 8kHz 1bit/sample (just it has a harsh/buzzy "retro" sound); not sure as of yet if I will make much use of this. it could mostly be relevant WRT hand-editing sample data as sequences of hex-numbers or similar. a high-pass filter is needed though, otherwise there are significant audio problems. in my tests, I was having best results filtering out everything below about 250Hz. potentially, direct 4 bits/sample could also make sense, as it would map 1 sample per hex character.
example, simple sine wave (4 bits/sample):
  89AB CDEE FFEE DCBA 8976 5432 1100 1123 4567
  (36 samples, 0.0045 seconds, 222 Hz)
as 1bpp: FF FF C0 00 0
or, as 2bpp (compromise): AAFF FFFA A550 0000 55
well, first off, recently did a test showing the image quality for BTIC1C:
this test was for a video at 1024x1024 with 8.6 Mbps and 0.55 bpp.
as noted, the quality degradation is noticeable, but "mostly passable". some amount of it is due largely to the conversion to RGB555, rather than actual quantization artifacts (partly because video compression and dithering don't really mix well in my tests). however, some quantization artifacts are visible.
I have split apart BTIC1C and RPZA into different codecs, mostly as 1C has diverged sufficiently from RPZA that keeping them as a single codec was becoming problematic.
BTIC1C now has BC6H and BC7 decode routes, with single-thread decode speeds of around 320-340 Mpix/sec for BC7, and around 400 Mpix/sec for BC6H (the speed difference is mostly due to the lack of an alpha channel in 6H, and slightly awkward handling of alpha in BC7).
as-is, both effectively use a subset of the format (currently Mode 5 for BC7, and Mode 11 for 6H).
the (theoretical) color depth has been expanded, as it now supports 23-bit RGB and 31-bit RGB. RGB23 will give (approximately) a full 24-bit color depth (mostly for BC7, possibly could be used for RGBA).
RGB31 will support HDR (for BC6H), and comes in signed and unsigned variants. as-is, it stores 10-bits per component (as floating-point).
likewise, the 256-color indexed block-modes have been expanded to support 23 and 31 bit RGB colors.
these modes are coerced to RGB565 for DXTn decoding, as well as RGB555 still being usable with BC7 and BC6H, ... this means that video intended for one format can still be decoded for another if-needed (though videos will still have a "preferred format").
as-is, it will still require some work on the encoder end to be able to generate output supporting these color depths (likely moving from 128 to 256 blocks on the encoder end).
the current encoder basically uses a hacked form of DXT5 for its intermediate form, where:
* (AlphaA>AlphaB) && (ColorA>ColorB): basically the same as DXT5;
* (AlphaA<=AlphaB) || (ColorA<=ColorB): special cases (flat colors, skip blocks, ...).
however, there are no free bits for more color data (at least while keeping block-complexity "reasonable"). so, likely, it will be necessary to expand the block size to 256 bits and probably use a 128-bit color block.
ex:
* 64 bits: tag and metadata;
* 64 bits: alpha block;
* 128 bits: expanded color block.
this would not affect the output format, as these blocks are purely intermediate (used for frame conversion/quantization/encoding), but would require a bit of alteration to the encoder-side logic.
it sort of works I guess...
video-texture, now with audio...
had an idea here for how to do a DXTn-space deblocking filter, but it would likely come with a bit of a speed cost. may try it out and see if it works ok though.
well, the BTIC3A effort also kind of stalled out, mostly as the format turns out to be overly complex to implement (particularly on the encoder). I may revive the effort later, or maybe try again with a simpler design (leaving blocks in raster order and probably designing it to be easier to encode with a multi-stage encoder).
so, I ended up for now just going and doing something lazier: gluing a few more things onto my existing BTIC1C format.
these are:
* predicted / differential colors (saves bits by storing many colors as an approximate delta value);
* support for 2x2 pixel blocks (as a compromise between flat-color blocks and 4x4 pixel blocks, a 2x2 pixel block needs 8 bits rather than 32);
* simplistic motion compensation (blocks from prior frames may be translated into the new frame).
all were pretty lazy, most worked ok.
the differential colors are a bit problematic though as they are prone to mess up resulting in graphical glitches (blocks which seem to overflow/underflow the color values, or result in miscolored splotches);
basically, it uses a Paeth filter (like in PNG), and tries to predict the block colors from adjacent blocks, which allows (in premise), the use of 7-bit color deltas (as a 5x5x5 cube) instead of full RGB555 colors in many cases.
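for reference, the Paeth predictor (as defined in the PNG spec) predicts from the left (a), above (b), and upper-left (c) neighbors, picking whichever is closest to a + b - c:

```c
#include <stdlib.h>

/* The Paeth predictor from the PNG spec, applied here per color
   component: the delta actually stored is (actual - Paeth(a,b,c)). */
int Paeth(int a, int b, int c)
{
    int p  = a + b - c;         /* initial estimate */
    int pa = abs(p - a);
    int pb = abs(p - b);
    int pc = abs(p - c);

    /* return the neighbor nearest to the estimate, with ties broken
       in the order a, b, c */
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc)             return b;
    return c;
}
```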
I suspect there is a divergence between the encoder-side blocks and decoder-side blocks, which would account for the colors screwing up (the blocks as they come out of the quantizer look fine, implying that the deltas and quantization are not themselves at fault).
the 2x2 blocks and motion compensation were each a little more effective. while not pixel-accurate, the motion compensation can at least sort of deal with general movement and seems better than having nothing at all.
I suspect in general it is doing "ok" with size/quality in that I can have a 2 minute video in 50MB at 512x512 and not have it look entirely awful.
decided to run a few benchmarks, partly to verify some of my new features didn't kill decode performance.
non-Deflated version:
* decode speed to RGBA: ~ 140 Mpix/sec;
* decode speed to DXT5: ~ 670 Mpix/sec.
Deflated version:
* decode speed to RGBA: ~ 118 Mpix/sec;
* decode speed to DXT5: ~ 389 Mpix/sec.
then started wondering what would be the results of trying a multi-threaded decoder (with 4 decoder threads):
* 420 Mpix/sec to RGBA;
* 2100 Mpix/sec to DXT5 (IOW: approx 2.1 gigapixels per second).
this is for a non-Deflated version, as for the Deflated version, performance kind of goes to crap as the threads end up all ramming into a mutex protecting the inflater (not currently thread safe).
ADD2: well, it looks like 3A may not be entirely dead, there are a few parts I am considering trying to "generalize out", so it may not all be loss. for example, the bitstream code was originally generalized somewhat (mostly as I was like "you know what, copy-pasting a lot of this is getting stupid", as well as it still shares some structures with BTIC2C).
likewise, I may generalize out the use of 256-bit meta-blocks on the encoder end (rather than a 128-bit block format), partly as the format needs to deal both with representing pixel data, and also some amount of internal metadata (mostly related to the block quantizer), and 256-bits provides a little more room to work with.
don't know yet if this could lead to a (probably less ambitious) 3B effort, or what exactly this would look like (several possibilities exist). partly tempted by thoughts of maybe using a PNG-like or DWT-based transform for the block colors.
seeing as how my graphics hardware has a limited number of options for (non DXTn / S3TC) compressed texture formats, but does support BPTC / BC6H / BC7, whose complexity hinders effective real-time encoding (*), it may make sense to consider developing a video codec specifically for this.
*: though there is always the option of "just pick a block type and run with it", like always encoding BC7 in mode 5 or BC6H in mode 11 or something. note: BPTC here will be used (in the OpenGL sense) to refer both to BC6H and BC7. structurally, they are different formats, and need to be distinguished in-use. when relevant, BC6H and BC7 (their DirectX names) will be used (mostly because names like "RGBA_BPTC_UNORM" kind of suck...).
basic design: essentially fairly similar to BTIC1C and BTIC1D (which in turn both derive from Apple Video / RPZA).
unlike 1C and 1D, it (mostly) sidesteps a lot of the complexities of these texture formats, and essentially treats the blocks mostly as raw data. this should still allow a moderately simple and fast decoder (into BPTC or similar). also this stage of the process will be lossless.
this encoding allows a fairly arbitrary split between block-header and block data, which an encoder should be able to try to optimize for (and search for the "greatest savings" in terms of where to split up the block at). this also includes the ability to do "simple RLE runs" for repeating block-patterns, as well as to store raw/unencoded runs of blocks.
note that it isn't really viable to cleanly split between the header and index portions of a block given the way the blocks work.
Decode Process: Container/Packaging -> Inflate -> BTIC1E Decoder -> BPTC (passed to GL or similar).
the "Pixel Block Quantizer" step will basically try to fudge blocks to reduce the encoded image size; it is unclear exactly how it will tie in with the BPTC encoders. as-is, it is looking mostly like a tradeoff between an RGBA-space quantizer ("pre-cooking" the image) and a naive "slice and dice" quantizer (hack bits between blocks coming out of the BPTC encoder and see what it can get away with within the error threshold, basically by decoding the blocks to RGBA and comparing the results).
an issue: I have rather mixed feelings about BPTC. namely, it is only available in newer desktop-class GPUs, and could be rendered less relevant if ETC2 becomes widespread in upcoming GPUs (both having been promoted to core in OpenGL).
some of this could potentially lead to cases of needing multiple redundant animated-texture videos, which would be kind of lame (and would waste disk space and similar), though potentially still better than wasting video memory by always using an RGBA16F or RGB9_E5 version.
could almost be a case of needing to implement it and determine whether or not it sucks...
ADD: figured the likelihood of BTIC1E sucking was just too high.
which would be intended as a format to hopefully target both DXT and a BPTC subset, with other goals of being faster for getting to DXTn than BTIC2C, and compressing better than BTIC1C, target speed = 300 Mpix/sec for a single threaded decoder.
going and checking, the gap isn't quite as drastic as I had thought (if I can reduce the bitrate to 1/2 or 1/3 that of 1C, I will be doing pretty good, nevermind image quality for the moment).
I guess the reason many videos can fit 30 minutes in 200MB is mostly because of lower resolutions (640x360 has a lot fewer pixels than 1024x1024 or 2048x1024...).