
About this blog

Probably status updates or something...

Entries in this blog

second test with Quake3 Arena, this time in 1440x900...

differences from before:
* More performance tweaks / micro-optimization;
* Now the rasterizer supports multiple threads (2 threads were used in this test, with the screen divided in half);
* Inline ASM is used in a few places;
* ...

it is still sort of laggy, but I am not really sure how far software rasterization can be pushed on a generic desktop PC.

CPU: Phenom II X4 3.4 GHz;
RAM: 4x4GB PC3-1066

note: it is a fair bit faster at 1024x768 or lower...
basically, lacking much better to do recently, I wrote a basic software rasterizer and put an OpenGL front-end on it, then proceeded to make Quake 2 work on it:

yes, kind of sucks, but there are some limits as to what is likely doable on the CPU.
also, it is plain C and single threaded scalar code (no SSE).
a rasterizer using multiple threads and/or SSE could potentially do a little more, but I don't know, and don't expect there to really be much practical use for something like this, so alas.

writing something like this though does make one a little more aware of what is involved in getting from geometry to the final output.

otherwise, may need to try to find something more relevant to do...

as-is, it basically mimics an "OpenGL Miniport" DLL, which means it exports the usual GL 1.1 calls, along with some WGL calls, and some wrapped GDI calls.

this is loaded up by Quake2, which then goes and uses "GetProcAddress" a bunch of times to fetch the various function pointers.

it has to export pretty much all of the 1.1 calls, though a lot of them are basically no-op stubs (would normally set an error status, currently rigged up to intentionally crash so the debugger can catch it...).

as for calls implemented:
simple answer: about 1/4 to 1/2 of them.

as for functionality implemented by the rasterizer:
* most stuff related to glBegin/glEnd;
* things like glTexImage2D, glTexParameter, ...
* various misc things, like glClear, glClearColor, glDepthRange, ...
* matrix operations (glPushMatrix, glPopMatrix, ...)
* ...

as for functionality not implemented:
* texture-coordinate generation stuff;
* display lists, selection buffers, accumulation buffer, ...
* pretty much everything else where I was like "what is this and what would it be used for?"
* currently doesn't do DrawArrays or DrawElements, but this may change.
** would basically be needed for Quake3 to work IIRC.
** partial provisions have been made, but logic isn't written yet.

internally, it implements the actual drawing to the screen via CreateDIBSection and BitBlt and similar.

then it has a few buffers, for example, a color-buffer, implemented as an array of 32-bit pixel colors (in 0xAARRGGBB order, AKA, BGRA), as well as a Depth+Stencil buffer in Depth24_Stencil8 format (I was originally going to use Depth16 and no stencil, but then I realized that space for a stencil buffer could be provided "almost for free").
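for reference, the packing conventions work out something like this (a minimal sketch; the names `pack_bgra`/`pack_ds` are mine, not the engine's actual macros):

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t pixel_t;   /* 0xAARRGGBB, i.e. BGRA byte order in memory */
typedef uint32_t zsbuf_t;   /* Depth24_Stencil8: depth in high 24 bits, stencil in low 8 */

static pixel_t pack_bgra(int r, int g, int b, int a)
    { return ((pixel_t)a<<24)|((pixel_t)r<<16)|((pixel_t)g<<8)|(pixel_t)b; }

static zsbuf_t pack_ds(uint32_t depth24, uint8_t stencil)
    { return (depth24<<8)|stencil; }
```

this is also why the stencil space is "almost free": one 32-bit buffer holds both, rather than needing a separate allocation.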

at its core, its main operation is "drawing spans", which look something like:

    void BGBRASW_DrawSpanFlatBasic(
        bgbrasw_pixel *span, int npix, bgbrasw_pixel clr)
    {
        bgbrasw_pixel *ct, *cte;
        ct=span; cte=span+npix;
        while((ct+16)<=cte)
        {
            *ct++=clr; *ct++=clr; *ct++=clr; *ct++=clr;
            *ct++=clr; *ct++=clr; *ct++=clr; *ct++=clr;
            *ct++=clr; *ct++=clr; *ct++=clr; *ct++=clr;
            *ct++=clr; *ct++=clr; *ct++=clr; *ct++=clr;
        }
        if((ct+8)<=cte)
        {
            *ct++=clr; *ct++=clr; *ct++=clr; *ct++=clr;
            *ct++=clr; *ct++=clr; *ct++=clr; *ct++=clr;
        }
        if((ct+4)<=cte)
            { *ct++=clr; *ct++=clr; *ct++=clr; *ct++=clr; }
        while(ct<cte)
            *ct++=clr;
    }

    ...

    /* gradient variant; its header and setup were garbled in the original
       post (eaten by "<" spans), so the name, declarations, and setup here
       are a reconstruction from the surviving loop bodies */
    void BGBRASW_DrawSpanGradBasic(
        bgbrasw_pixel *span, int npix,
        bgbrasw_pixel clr0, bgbrasw_pixel clr1)
    {
        bgbrasw_pixel *ct, *cte;
        bgbrasw_pixel clr;
        u32 crag, crrb, cragv, crrbv;
        /* ... (setup lost: initializes crag/crrb, holding the A+G and R+B
           channel pairs as fixed-point, and their per-pixel steps
           cragv/crrbv from the clr0->clr1 difference over npix) ... */
        ct=span; cte=span+npix;
        while((ct+8)<=cte)
        {
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
        }
        if((ct+4)<=cte)
        {
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF); crag+=cragv; crrb+=crrbv; *ct++=clr;
        }
        while(ct<cte)
        {
            clr=(crag&0xFF00FF00)|((crrb>>8)&0x00FF00FF);
            crag+=cragv; crrb+=crrbv; *ct++=clr;
        }
    }

    ...

    void BGBRASW_DrawSpanTextureInterpTestBlend(
        BGBRASW_TestBlendData *testData,
        bgbrasw_pixel *span, bgbrasw_zbuf *spanz, int npix,
        bgbrasw_pixel *tex, int txs, int tys,
        int st0s, int st0t, int st1s, int st1t,
        bgbrasw_pixel clr0, bgbrasw_pixel clr1,
        bgbrasw_zbuf z0, bgbrasw_zbuf z1)
    {
        BGBRASW_TestBlendFunc_ft testBlend;
        bgbrasw_pixel *ct, *cte;
        bgbrasw_zbuf *ctz, *ctze;
        u32 tz, tzv, tz2;
        int ts, tt, tsv, ttv, tx, ty;
        int cr, crv, crt;
        int cg, cgv, cgt;
        int cb, cbv, cbt;
        int ca, cav, cat;
        int clr, clrt;

        if(npix<=0)return;

        if((clr0==clr1) && (clr0==0xFFFFFFFF))
        {
            BGBRASW_DrawSpanTextureBasicTestBlend(
                testData, span, spanz, npix,
                tex, txs, tys, st0s, st0t, st1s, st1t, z0, z1);
            return;
        }

        cr=BGBRASW_PIXEL_R(clr0)<<8; cg=BGBRASW_PIXEL_G(clr0)<<8;
        cb=BGBRASW_PIXEL_B(clr0)<<8; ca=BGBRASW_PIXEL_A(clr0)<<8;
        crv=((int)(BGBRASW_PIXEL_R(clr1)-BGBRASW_PIXEL_R(clr0))<<8)/npix;
        cgv=((int)(BGBRASW_PIXEL_G(clr1)-BGBRASW_PIXEL_G(clr0))<<8)/npix;
        cbv=((int)(BGBRASW_PIXEL_B(clr1)-BGBRASW_PIXEL_B(clr0))<<8)/npix;
        cav=((int)(BGBRASW_PIXEL_A(clr1)-BGBRASW_PIXEL_A(clr0))<<8)/npix;

        tz=z0;
        tzv=((s32)(z1-z0))/npix;
        tzv&=BGBRASW_MASK_DEPTH;

        ts=st0s; tt=st0t;
        tsv=(st1s-st0s)/npix;
        ttv=(st1t-st0t)/npix;

        testBlend=testData->testAndBlend;

        ct=span; cte=span+npix;
        ctz=spanz; ctze=spanz+npix;
        while(ct<cte)   /* loop header garbled in original; reconstructed */
        {
            tz2=tz;
            tx=((ts+128)>>8)&(txs-1);
            ty=((tt+128)>>8)&(tys-1);
            clrt=tex[ty*txs+tx];
            crt=(BGBRASW_PIXEL_R(clrt)*cr);
            cgt=(BGBRASW_PIXEL_G(clrt)*cg);
            cbt=(BGBRASW_PIXEL_B(clrt)*cb);
            cat=(BGBRASW_PIXEL_A(clrt)*ca);
            clr=BGBRASW_MAKEPIXEL(crt>>16, cgt>>16, cbt>>16, cat>>16);
            testBlend(testData, &clr, &tz2, ct, ctz);
            ct++; ctz++;
            cr+=crv; cg+=cgv; cb+=cbv; ca+=cav;
            ts+=tsv; tt+=ttv;
            tz+=tzv;
        }
    }

    ...
with functions which in turn use these for drawing triangles, ... (damn near everything is decomposed into triangles).

these functions are basically given raw pointers to the appropriate locations in the respective framebuffers.

the basic strategy for drawing each triangle is to sort the vertices from lowest to highest Y coordinate, then walk from one end of the triangle to the other, drawing each span.

* so: Y0=lowest, Y1=middle, Y2=highest
* calculate stepping vectors for left/right sides (Y0 to Y1)
* walk from Y0 to Y1, drawing each span.
* recalculate vectors for Y1 to Y2
* walk from Y1 to Y2, drawing spans.

dunno if there is a better way.
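the steps above, sketched as standalone C (the names and the 16.16 fixed-point details are my own simplification, not the engine's actual code):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

typedef struct { int x, y; } vec2i;

static void draw_span(uint32_t *fb, int w, int y, int x0, int x1, uint32_t clr)
{
    int x; if(x0>x1){ int t=x0; x0=x1; x1=t; }
    for(x=x0; x<x1; x++) fb[y*w+x]=clr;
}

static void fill_tri_flat(uint32_t *fb, int w, vec2i a, vec2i b, vec2i c, uint32_t clr)
{
    vec2i t; int y;
    /* sort so a.y <= b.y <= c.y (Y0/Y1/Y2 in the list above) */
    if(a.y>b.y){t=a;a=b;b=t;} if(b.y>c.y){t=b;b=c;c=t;} if(a.y>b.y){t=a;a=b;b=t;}
    if(c.y==a.y) return;                          /* degenerate */
    /* 16.16 fixed-point X along each edge */
    int32_t xl=a.x*65536, xr=a.x*65536;
    int32_t dxl=(b.y>a.y)?(((b.x-a.x)*65536)/(b.y-a.y)):0;
    int32_t dxr=((c.x-a.x)*65536)/(c.y-a.y);
    for(y=a.y; y<b.y; y++)                        /* upper half: edges a-b vs a-c */
        { draw_span(fb,w,y,xl>>16,xr>>16,clr); xl+=dxl; xr+=dxr; }
    xl=b.x*65536;                                 /* recalculate left edge for b-c */
    dxl=(c.y>b.y)?(((c.x-b.x)*65536)/(c.y-b.y)):0;
    for(y=b.y; y<c.y; y++)
        { draw_span(fb,w,y,xl>>16,xr>>16,clr); xl+=dxl; xr+=dxr; }
}
```

(for what it's worth, the main alternative would be half-space/edge-function rasterization, which parallelizes better but touches more pixels per triangle.)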

the actual rendering process goes something like:
* build up arrays of vertices via the glBegin/glEnd/glVertex interface;
* if needed, decompose into triangles (pretty much everything other than GL_TRIANGLES is decomposed);
** ok, GL_QUADS is also semi-primitive, but in the rasterizer each quad is treated as a triangle pair.
* feed vertex data through a combined Projection*ModelView matrix;
* subdivide triangles, such that each triangle has a limited screen-area;
** needed or else textures warp all over the place (crazy bad texture deformation);
** each triangle, if sufficiently large, is split into 4 sub-triangles, which may happen recursively;
*** Zelda Triforce logo configuration.
** quads are also divided into 4 pieces.
* divide all vertex XYZ coordinates by W;
* clip all the triangles/quads/... to fit on screen;
* convert this into the form used by the rasterizer:
** fixed point XY values with separate Z.
** set magic numbers to indicate which drawing logic will be used.
*** flat color? texture? needs fancy blending or tests? ...
* hand them off to backend.
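the subdivision step above can be sketched roughly like this (recursive midpoint split into 4 sub-triangles; the area threshold and names are assumptions):

```c
#include <assert.h>

typedef struct { float x,y,z; } vec3;

static vec3 mid(vec3 a, vec3 b)
    { vec3 m={(a.x+b.x)*0.5f,(a.y+b.y)*0.5f,(a.z+b.z)*0.5f}; return m; }

/* twice the signed screen-space area, made positive */
static float area2d(vec3 a, vec3 b, vec3 c)
    { float s=(b.x-a.x)*(c.y-a.y)-(c.x-a.x)*(b.y-a.y); return s<0?-s:s; }

/* emit() would feed the rasterizer; here we just count emitted triangles */
static int subdivide(vec3 a, vec3 b, vec3 c, float maxArea)
{
    if(area2d(a,b,c)<=maxArea) return 1;      /* small enough: emit as-is */
    vec3 ab=mid(a,b), bc=mid(b,c), ca=mid(c,a);
    /* corner triangles plus the inverted center one: the "Triforce" split */
    return subdivide(a,ab,ca,maxArea)+subdivide(ab,b,bc,maxArea)
         + subdivide(ca,bc,c,maxArea)+subdivide(ab,bc,ca,maxArea);
}
```

each level of recursion quarters the area, so the warping from affine texture stepping stays bounded.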

* the backend walks lists of triangles/quads, passing each to the appropriate rasterizer function.
** quads basically just invoke the triangle-drawer twice, first for vertices 0/1/2, then for 0/2/3.
* there are different triangle-draw functions for different types of triangles (flat, textured, interpolated color, ...)
** in turn, function-pointers are often used to select the appropriate span-drawing functions.
* the handling of blending/tests/... is basically done by assembling the blend-and-test logic out of function pointers.
** different functions for the different collections of tests to be performed;
** functions to select each individual operator for a given test
*** don't want to use switches, as these are *very slow* if done per-pixel.
**** wanted to avoid the thing going at glacial speeds at least...
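the function-pointer idea looks roughly like this (names invented; the point is that the switch runs once per span setup, not once per pixel):

```c
#include <stdint.h>
#include <assert.h>

typedef int (*depth_test_fn)(uint32_t src, uint32_t dst);

static int test_less(uint32_t s, uint32_t d)  { return s<d; }
static int test_lequal(uint32_t s, uint32_t d){ return s<=d; }
static int test_always(uint32_t s, uint32_t d){ (void)s;(void)d; return 1; }

static depth_test_fn pick_depth_test(int glFunc)
{
    switch(glFunc)             /* evaluated once, at span/triangle setup */
    {
    case 0x0201: return test_less;    /* GL_LESS */
    case 0x0203: return test_lequal;  /* GL_LEQUAL */
    default:     return test_always;
    }
}

/* per-pixel loop only does an indirect call, no branching on the mode */
static int count_passing(depth_test_fn t, uint32_t *src, uint32_t *dst, int n)
{
    int i, k=0;
    for(i=0;i<n;i++) if(t(src[i],dst[i])) k++;
    return k;
}
```

the real thing chains several such pointers (depth test, stencil, blend) into one composed test-and-blend function.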

I had recently fiddled some with BTIC1C real-time recording, and have got some interesting results:

main desktop PC, mostly holds a solid 29/30 fps for recording.
* 1680x1050p30 on a 3.4 GHz Phenom II X4 with 16GB PC3-1066 RAM.

it holds 25-28 fps on my newer laptop:
* 1440x900p30 on a 2.1 GHz Pentium Dual-Core with 4GB RAM.

it does a solid 30 fps on my older laptop:
* 1024x768p30 on a 1.6 GHz Mobile Athlon (single-core) with 1GB RAM.
** thought it was 1.2 GHz, seems I was misremembering.
** kind of kills things though as comparatively, this laptop is too fast for the screen resolution.
*** to be fair, for this resolution, the CPU would have needed to be ~ 1.2-1.4 GHz.

was half-considering testing on an ASUS EEE, but I seem to have misplaced it.
running off other calculations, there is a fairly high chance that an EEE would be able to record full-screen video at ~ 20 fps or so (given its clock-speed and resolution).

or, if I could build for Android, maybe testing on my tablet or phone (Sony Xperia X8, *).

*: like the EEE, the question is basically what sort of video encoding I can get out of a ~ 600MHz CPU.
linear extrapolation implies it should be able to pull around 20 fps from 800x480 and ~ 30fps from 640x480.

my tablet has HW stats on-par with my laptops though, so it may not mean a whole lot (ran 3DMark, got 5225... CPU speed is similar to old laptop, but graphics and framerates look a lot prettier than either of my laptops).

actually, theoretically, the Ouya also has HW stats on par with my laptops as well (raw clock speed in-between them, but has 4 cores and fast RAM).

ADD: ( testing desktop PC with 3DMark
Cloud Gate: 10890
Ice Storm: 81583
Fire Strike: 4083 (*)
*: Updated, didn't crash this time... )

here is from another test involving desktop recording and Minecraft:

most of the lag/jerkiness was actually from Minecraft itself, which is basically how it plays on my computer (but I was happy originally when I got some newer parts, mostly because Minecraft usually stays above 20).

messed recently with special 4:2:0 block-modes, which can improve image quality but hurt encoder speeds (they effectively store color information for each 2x2 pixel sub-block, rather than for the whole 4x4 block, and require more arithmetic in the pixel-to-block transform).

I did introduce a more limited form of differential color-coding, which seems to have actually increased encoder speed for some reason (colors will often be stored as a delta from the prior color rather than the color itself, *).

generally, color prediction has been largely restricted down to last-seen-color prediction, generally because this can be done without needing to retrieve colors from the output blocks (it will simply keep track of the last-seen colors, using these as the predictors, which is a little cheaper, and also less problematic).

*: as-is, for 23-bit colors, a 15-bit delta-color will be used, and will fall back to explicit colors for large deltas.
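a hedged sketch of the fallback rule (assuming the 15 bits split 5/5/5 signed across R/G/B; the actual bitstream layout may differ):

```c
#include <assert.h>

/* a component delta fits if it is representable in 5 signed bits;
   otherwise the encoder falls back to an explicit color */
static int fits_delta15(int dr, int dg, int db)
{
    return dr>=-16 && dr<=15 &&
           dg>=-16 && dg<=15 &&
           db>=-16 && db<=15;
}
```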
so, what happened recently:
one of my HDDs (in this case, a Seagate 2TB drive) decided to die on me mostly without warning.
* curse of Seagate: damn near every time I have used a Seagate drive, it has usually ended up dying on me.
** sort of like my experience with ASUS motherboards...
*** had reliability problems with several ASUS MOBOs, got a Gigabyte MOBO, and everything worked.
*** maybe not a good sign when the MOBOs come wrapped in cardboard.
** actually the MOBO issue is less severe, a crashy OS is less of an issue than lots of lost data.

luckily for me, it wasn't my main OS drive (which is currently a "WD Caviar Black").
well, originally I was using a "Caviar Green" for my main OS drive, but it was 5400 RPM and often had performance issues (computer would often lag/stall, becoming somewhat IO bound). using a (more expensive) 7200 RPM drive made performance a bit better.

but, yeah, didn't lose much particularly important, as (luckily) I tend to try to keep multiple copies of a lot of the more important stuff (on different drives, ...). but, still did lose some amount of stuff (like all my 2D character art, random downloaded YouTube videos, and my installations of Steam and VS2013, ...), and may have lost one of my newer fiction stories, ...

some of this is because, sadly, both drive reliability (and the OS going and stupidly turning the filesystem into mincemeat) are not entirely uncommon IME (though the FS mincemeat issue seemed more common with WinXP and NTFS drives, not seen it happen with Win7 thus far, nor had I seen it happen with FAT32 drives).

in this case, the HDD itself mostly just stopped working, with Windows mostly just giving lots of "drive controller not ready" error messages, and Windows otherwise not listing the drive as working. though a few times it had worked sort-of, and (with luck) when I get a new HDD, maybe I can see if I can image the old contents onto a new drive.

otherwise, recently added an in-engine profiler.

mostly this was because after the HDD crash, and resorting back to VS2008 for building my 3D engine (VS2013 had been installed on the crashed HDD), CodeXL stopped being able to effectively profile the thing for some reason or another.

since both CodeAnalyst and CodeXL have a lot of "often don't work for crap" issues, I was like "oh hell with it" and basically just made a rudimentary profiler which can be run inside the engine. it aggregates things and tells me which functions use most of the time, which is the main thing anyways (source-level profiling is nice, but would be more involved, and would probably require a proper UI vs just dumping crap to the console).

did observe that the majority of the execution time in these tests was going into "NtDelayExecution", which was mostly related to sleeping threads. made it so that the statistics aggregation ignores this function, mostly so that more sane percentages can be given to other functions.

beyond this, most of the execution time seems to be going into the OpenGL driver, and into some otherwise unknown machine-code (not part of any of the loaded DLLs, nor part of the BSVM JIT / executable-heap). may be part of OpenGL.

this becomes more so if the draw-distance is increased.

did otherwise make some animated clouds and a new sun effect.

basically, rather than simply using a static skybox, it now uses a skybox with a sun overlay and some animated clouds overlaid (though with a few unresolved issues, like color-blending not working on the clouds for some reason I have yet to figure out).

new clouds and some tweaks to metal biome can be seen here:
well, status recently:

started working on a 2D animation tool, but then ran into UI complexities;
UI handling in my 3D engine has become a bit of a tangled mess, and there is no real abstraction over it (most things are, for the most part, handling keyboard events and mouse movements...);
there is theoretically support for GUI widgets, but I wrote the code in question 10 years ago, and manage to do it sufficiently badly that doing UI stuff via drawing stuff and raw input handling is actually easier (*1);
sometimes I look into trying to clean up the GUI widgets thing, and am like "blarg" and don't make a whole lot of progress;
other times, I consider "maybe I will make a new GUI widgets system that *doesn't* totally suck", followed by "but I already have these existing widgets, maybe I can fix it up?" followed by "blarg!".

10 years ago I wrote a few things which were ok, and some other stuff which is just plain nasty, but has largely become a black box in that I can't really make it not suck, but also often can't easily replace it without breaking other stuff.

*1: but it is GUI widgets. don't these normally suck?...
well, yes, but this one extra sucks, as it was basically based around a stack-based mapping of XHTML forms to C;
but, there is no way to distinguish form instances... so every widget is identified via a global 'id' name, and there may only be a single widget with this name, anywhere. also, no facilities were provided to, you know, update widget contents.
also didn't turn out to really be a sane design for most of the types of stuff I am doing.

so, somehow, I managed to make something pretty much less usable or useful than GTK or GDI...
doesn't help much when looking into it and realizing that there isn't much logic behind it that isn't stuff one would need to replace anyways (some of the structs are useful, but seemingly this is about it).

but, a 2D animation tool, while it needs a UI, doesn't necessarily need a traditional GUI.
"well, there are always modes and keyboard shortcuts!". yes, fair enough, but it doesn't help when one is left with a UI where pretty much everything (in the 3D engine) is a big monolithic UI, and there are few good options for keyboard shortcuts remaining (and "CTRL+F1,CTRL+SHIFT+G" is a bit outside "good" territory).

yes, my 3D modeller, game, mapper, ... all use the same keyboard and mouse-handling code, just with a lot of internal flags controlling everything. in earlier forms of my game effort, it actually required doing an elaborate keyboard dance of various shortcuts to get into a mode where the controls would work as-expected. I then later made the engine front-end set this up by default and effectively lock the UI configuration (short of a special shortcut to "unlock" the UI).

theoretically, I have added a solution to partly address this: now you can "ESC,~" (or "ESC,SHIFT+`") into a tabbed selector for "running programs", along with a possible option for a considered drop-list for launching programs (which would probably work by stuffing commands into the console, probably launching scripts...). clicking on tabs can then be used to switch focus between programs, and possibly allowing a cleaner way to handle various use-cases ("hell, maybe I could add a text-editor and a graphics program and a file manager...", "oh, wait...").

but, on the positive side, effectively this mode bypasses nearly all of the normal user-input handling, allowing each "program" a lot more free-reign over the keyboard shortcuts. architecturally, it is on-par with the console (toggled with "ALT+`", and is sort of like a shell just currently without IO redirection or pipes).

but, OTOH, I am not so happy with the present UI situation...
a lot of this, is, horrid...

thus far, in the 2D animation tool, I can sort of add items and move them around and step between frames (with the movement being interpolated, ...), so it is a start, but still falls well short of what would be needed for a usable 2D animation tool (that is hopefully less effort than the current strategy of doing basic 2D animation via globs of script code...).

probably will need a concept of "scenes", where in each scene it will be possible to add objects and set various keyframes, ... but, at the moment, I am less certain, seems all like a bit of an undertaking.

did at least go and make some improvements to the in-game video recording:
switched from using RPZA to a BTIC1C subset for recording, which has somewhat better image quality and lower bitrate, in this case using a more speed-oriented encoder (vs the main encoder, which more prioritizes size/quality);
basically holds up pretty well with tests for recording at full-screen 1680x1050p24;
made some tweaks to reduce temporal aliasing issues (mostly related to inter-thread timing issues);
also fiddled some with trying to get audio more in sync (recorded video had the audio somewhat out of sync, sort of fudged them more back into alignment via inserting about 400ms of silence at the start of the recording... but this is a crap solution... not sure at present a good way to automatically adjust for internal A/V latency).

the BTIC1C variant encoder basically mostly just uses straight RGB23 blocks (with no quantization stage), and a higher-speed single-pass single-stop entropy backend (uses an extended Deflate-based format). this allows faster encoding albeit with worse compression.

the normal Deflate/BTLZH encoder uses a 3-pass encoding strategy:
LZ77 encode data, count up symbol statistics, build and emit Huffman tables, emit Huffman-coded LZ data.

the current encoder speeds this up slightly by using the prior statistics for building the Huffman table, then doing the LZ77 and Huffman coding at the same time. it also uses another trick which is that it doesn't actually "search" for matches, just hashes the data it encounters, and sees if the current hash-table entry points to a match.

the compression is a little worse, but the advantage is in being able to use the entropy backend for real-time encoding (vs the primary encoder which is a bit slow for real-time).
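the hash-only match finder amounts to something like this (the hash function and table size are assumptions, but the structure matches the description: one slot per hash, no chains, no searching):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

#define HASH_BITS 14
#define MIN_MATCH 4

static uint32_t hash4(const uint8_t *p)
{
    uint32_t v; memcpy(&v,p,4);
    return (v*2654435761u)>>(32-HASH_BITS);   /* Fibonacci-style hash */
}

/* returns the match length at 'pos' against the single hashed candidate,
   or 0 if the slot was empty or the bytes don't actually match */
static int find_match(const uint8_t *buf, int pos, int end,
                      int *table, int *matchPos)
{
    uint32_t h=hash4(buf+pos);
    int cand=table[h], len=0;
    table[h]=pos;                  /* always update: last occurrence wins */
    if(cand>=0 && cand<pos)
    {
        int max=end-pos;
        while(len<max && buf[cand+len]==buf[pos+len]) len++;
    }
    if(len<MIN_MATCH) return 0;
    *matchPos=cand;
    return len;
}
```

misses and hash collisions just fall through to literals, which is why compression is a little worse but encoding is much faster.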

the temporal aliasing issue was mostly a problem which resulted in a notable drop in the effective framerate of the recording, as I had found that many of the frames which were captured were being lost and many frames were being duplicated in the output. I ended up making some tweaks to the handling of accumulation timers and similar, and the number of lost and duplicate frames is notably reduced.

test from in-game recording: 1680x1050p24 RGB23 uses about 19Mbps, and about 0.46 bpp.
in other tests, this works out to around 6-7 minutes per GB of recording.
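a quick sanity check of these numbers:

```c
#include <assert.h>

/* bitrate implied by a resolution, framerate, and bits-per-pixel figure */
static double mbps(int w, int h, double fps, double bpp)
    { return w*h*fps*bpp/1.0e6; }

/* minutes of recording per (decimal) gigabyte: 1 GB = 8000 megabits */
static double min_per_gb(double mbps_rate)
    { return (8.0*1000.0)/(mbps_rate*60.0); }
```

1680x1050 at 24 fps and 0.46 bpp works out to ~19.5 Mbps, which in turn gives roughly 6.8 minutes per GB, consistent with the figures quoted above.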

this is also a bit better than the roughly 2 minutes per GB I can get from M-JPEG, and seems to have "mostly similar" video quality (and without the JPEG encoder's limitation of being too slow for recording at higher resolutions, which is part of the reason I had switched over to RPZA to begin with).

don't yet have any videos up for the current version.
the most recent video I have at the time of this writing is for a version of the new codec prior to addressing a few image-quality issues, as well as the temporal-aliasing and audio-sync issues (so the video is a little laggy and the audio isn't really in sync...).
recently threw together this thing:

what is it?...
basically, an extended form of Deflate, intended mostly to improve compression with a "modest" impact on decoding speed (while also offering a modest boost in compression).

in its simplest mode, it is basically just Deflate, and is binary compatible;
otherwise, the decoder remains backwards compatible with Deflate.

its extensions are mostly as such:
bigger maximum match length (64KiB);
bigger maximum dictionary size (theoretical 4GB, likely smaller due to implementation limits);
optional arithmetic coded modes.

the idea was partly to have a compromise between Deflate and LZMA, with the encoder able to make some tradeoffs WRT compression settings (speed vs ratio, ...). the hope basically being to have something which could compress better than Deflate but decode faster than LZMA.

the arithmetic coder is currently applied after the Huffman and VLC coding.
this speeds things up slightly by reducing the number of bits which have to be fed through the (otherwise slow) arithmetic coder, while at the same time still offering some (modest) compression benefit from the arithmetic coder.

otherwise, arithmetic coder can be left disabled (and bits are read/written more directly), in which case the decoding will be somewhat faster (it generally seems to make around a 10-15% size difference, but around a 2x decoding-speed difference).

ADD: in the tests with video stuff, overall I am getting around a 30% compression increase (vs Deflate).

what am I using it for?
mostly as a Deflate alternative for the BTIC family of video codecs (many of which had used Deflate as their back-end entropy coder);
possibly other use cases (compressing voxel region files?...).

otherwise, I am now much closer to being able to switch BTIC1C over to full RGB colors;
most of the relevant logic has been written, so it is mostly finishing up and testing it at this point.
this should improve the image-quality at higher quality settings for BC7 and RGBA output (but will have little effect on DXTn output).

most of the work here has been on the encoder end, mostly due to the original choice for the representation of pixel-blocks, and there being almost no abstraction over the block format here (it is sad when "move some of this crap into predicate functions and similar" is a big step forwards, a lot of this logic is basically decision trees and raw pointer arithmetic and bit-twiddling and similar). yeah, probably not a great implementation strategy in retrospect.

the current choice of blocks looks basically like:

so, each new encoder-side block is 256 bits, and spreads the color over the primary ColorBlock and ExtColorBlock.
in total, there is currently about 60 bits for color data, which is currently used to (slightly inefficiently) encode a pair of 24-bit colors (had thought, "maybe I can use the other 32 bits for something else", may reconsider. had used a strategy where ExtColorBlock held a delta from the "canonical decoded color").

for 31F colors, I may need to use the block to hold the color-points directly:

had also recently gained some quality improvement mostly by tweaking the algorithm for choosing color endpoints:
rather than simply using a single gamma function and simply picking the brightest and darkest endpoints, it now uses 4 gamma functions.
roughly, by fiddling, I got the best results with a CYGM (Cyan, Yellow, Green, Magenta) based color-space, where each gamma function is an impure form of these colors (permutations of 0.5, 0.35, 0.15). the block encoder then chooses the function (and endpoints) which generated the highest contrast.
this basically improved quality with less impact on encoder speed than with some other options (it can still be done in a single pass over the input pixels).
it generally improves the quality of sharp color transitions (reducing obvious color bleed), but does seem to come at the cost in these cases of slightly reducing the accuracy of preserved brightness.
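roughly, the endpoint selection works like this single-pass sketch (the exact weight permutations are my guess at "permutations of 0.5, 0.35, 0.15"; the real encoder works on integer pixels):

```c
#include <assert.h>

typedef struct { float r,g,b; } rgbf;

/* four impure CYGM-ish projection axes, each a permutation of the weights */
static const float W[4][3]={
    {0.15f,0.50f,0.35f},   /* impure cyan    */
    {0.50f,0.35f,0.15f},   /* impure yellow  */
    {0.35f,0.50f,0.15f},   /* impure green   */
    {0.50f,0.15f,0.35f},   /* impure magenta */
};

/* single pass over the pixels: track min/max along each axis, return the
   axis with the most contrast plus the darkest/brightest pixel indices */
static int pick_axis(const rgbf *px, int n, int *loIdx, int *hiIdx)
{
    float lo[4], hi[4]; int loi[4], hii[4];
    int i,a,best=0;
    for(a=0;a<4;a++){ lo[a]=1e9f; hi[a]=-1e9f; loi[a]=hii[a]=0; }
    for(i=0;i<n;i++)
        for(a=0;a<4;a++)
        {
            float v=px[i].r*W[a][0]+px[i].g*W[a][1]+px[i].b*W[a][2];
            if(v<lo[a]){lo[a]=v;loi[a]=i;}
            if(v>hi[a]){hi[a]=v;hii[a]=i;}
        }
    for(a=1;a<4;a++)
        if((hi[a]-lo[a])>(hi[best]-lo[best])) best=a;
    *loIdx=loi[best]; *hiIdx=hii[best];
    return best;
}
```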

this change was then also applied to my BC7 encoder and similar with good effect.
see here:

a new version of the engine is available.

as well, the image codec library (BGBBTJ) is also available as a stand-alone zip:

little provision is made for out-of-box use, as in, if anyone wants to compile or mess with it, probably some hackery will be needed. it provides VFW codec drivers for encoding/decoding, with the encoder currently hard-coded to use BTIC1C (with most of the encoding settings also hard-coded). in my tests it can encode videos with VirtualDub though, so it works here at least.

I may later consider adding a codec configuration UI or similar, as well as maybe clean up a few things (better provisions for handling logging and configuration). this would likely mean either putting the config information in the registry, or putting an INI somewhere (rather than just hard-coding stuff like where to put the log file and similar).

I have thus far not really finished the level of 1C encoder modifications needed to effectively support expanded color depths (moving forwards here largely requires some fairly non-trivial rewriting of the encoder, effectively moving the encoder over to a new intermediate block format, ...).

on another note:
made a recent observation that speech is still intelligible at 8kHz 1bit/sample (just it has a harsh/buzzy "retro" sound);
not sure as of yet if I will make much use of this. it could mostly be relevant WRT hand-editing sample data as sequences of hex-numbers or similar.
a high-pass filter is needed though, otherwise there are significant audio problems. in my tests, I was having best results filtering out everything below about 250Hz.
potentially, direct 4 bits/sample could also make sense, as it would map 1 sample per hex character.

example, simple sine wave:
89AB CDEE FFEE DCBA 8976 5432 1100 1123 4567
36 samples, 0.0045 seconds (222 Hz).
as 1bpp:
FF FF C0 00 0
or, as 2bpp (compromise):
AAFF FFFA A550 0000 55
well, first off, recently did a test showing the image quality for BTIC1C:

this test was for a video at 1024x1024 with 8.6 Mbps and 0.55 bpp.

as noted, the quality degradation is noticeable, but "mostly passable".
some amount of it is due largely to the conversion to RGB555, rather than actual quantization artifacts (partly because video compression and dithering don't really mix well in my tests). however, some quantization artifacts are visible.

as usual, working spec:

other recent changes:

I have split apart BTIC1C and RPZA into different codecs, mostly as 1C has diverged sufficiently from RPZA that keeping them as a single codec was becoming problematic.

BTIC1C now has BC6H and BC7 decode routes, with single-thread decode speeds of around 320-340 Mpix/sec for BC7, and around 400 Mpix/sec for BC6H (the speed difference is mostly due to the lack of an alpha channel in 6H, and slightly awkward handling of alpha in BC7).

as-is, both effectively use a subset of the format (currently Mode 5 for BC7, and Mode 11 for 6H).

the (theoretical) color depth has been expanded, as it now supports 23-bit RGB and 31-bit RGB.
RGB23 will give (approximately) a full 24-bit color depth (mostly for BC7, possibly could be used for RGBA).

RGB31 will support HDR (for BC6H), and comes in signed and unsigned variants. as-is, it stores 10-bits per component (as floating-point).

likewise, the 256-color indexed block-modes have been expanded to support 23 and 31 bit RGB colors.

these modes are coerced to RGB565 for DXTn decoding, as well as RGB555 still being usable with BC7 and BC6H, ...
this means that video intended for one format can still be decoded for another if-needed (though videos will still have a "preferred format").

as-is, it will still require some work on the encoder end to be able to generate output supporting these color depths (likely moving from 128 to 256 blocks on the encoder end).

the current encoder basically uses a hacked form of DXT5 for its intermediate form, where:
* (AlphaA>AlphaB) && (ColorA>ColorB): basically the same as DXT5;
* (AlphaA<=AlphaB) || (ColorA<=ColorB): special cases (flat colors, skip blocks, ...).

however, there are no free bits for more color data (at least while keeping block-complexity "reasonable").
so, likely, it will be necessary to expand the block size to 256 bits and probably use a 128-bit color block.

64-bits: tag and metadata
64-bits: alpha block
128-bits: expanded color block.

this would not affect the output format, as these blocks are purely intermediate (used for frame conversion/quantization/encoding), but would require a bit of alteration to the encoder-side logic.
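the proposed layout, as a struct (field names are mine; only the sizes come from the list above):

```c
#include <stdint.h>
#include <assert.h>

/* 256-bit intermediate encoder-side block */
typedef struct {
    uint64_t tagMeta;     /* 64 bits: tag and metadata */
    uint64_t alpha;       /* 64 bits: DXT5-style alpha block */
    uint64_t color[2];    /* 128 bits: expanded color block */
} EncBlock256;
```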

it sort of works I guess...


video-texture, now with audio...

had an idea here for how to do a DXTn-space deblocking filter, but it would likely come with a bit of a speed cost.
may try it out and see if it works ok though.
well, the BTIC3A effort also kind of stalled out, mostly as the format turns out to be overly complex to implement (particularly on the encoder). I may revive the effort later, or maybe try again with a simpler design (leaving blocks in raster order and probably designing it to be easier to encode with a multi-stage encoder).

so, I ended up for now just going and doing something lazier:
gluing a few more things onto my existing BTIC1C format.

these are:
predicted / differential colors (saves bits by storing many colors as an approximate delta value);
support for 2x2 pixel blocks (as a compromise between flat-color blocks and 4x4 pixel blocks, a 2x2 pixel block needs 8 bits rather than 32 bits);
simplistic motion compensation (blocks from prior frames may be translated into the new frame).

all were pretty lazy, most worked ok.

the differential colors are a bit problematic though, as they are prone to mess up, resulting in graphical glitches (blocks which seem to overflow/underflow the color values, or miscolored splotches).

basically, it uses a Paeth filter (like in PNG), and tries to predict the block colors from adjacent blocks, which allows (in premise), the use of 7-bit color deltas (as a 5x5x5 cube) instead of full RGB555 colors in many cases.
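for reference, the PNG-style Paeth predictor mentioned here looks like this (a/b/c being the left, up, and up-left neighbor values; in this scheme they would be per-component block colors rather than pixels):

```c
/* the standard PNG Paeth predictor: picks whichever of the left (a),
   up (b), and up-left (c) neighbors is closest to a + b - c. */
int paeth_predict(int a, int b, int c)
{
    int p = a + b - c;
    int pa = p > a ? p - a : a - p;
    int pb = p > b ? p - b : b - p;
    int pc = p > c ? p - c : c - p;
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}
```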

I suspect there is a divergence between the encoder-side and decoder-side blocks, which would account for the colors screwing up (the blocks as they come out of the quantizer look fine, implying that the deltas and quantization are not themselves at fault).

the 2x2 blocks and motion compensation were each a little more effective. while not pixel-accurate, the motion compensation can at least sort of deal with general movement and seems better than having nothing at all.

I suspect in general it is doing "ok" with size/quality in that I can have a 2 minute video in 50MB at 512x512 and not have it look entirely awful.

decided to run a few benchmarks, partly to verify some of my new features didn't kill decode performance.

non-Deflated version:
decode speed to RGBA: ~ 140 Mpix/sec;
decode speed to DXT5: ~ 670 Mpix/sec.

Deflated version:
decode speed to RGBA: ~ 118 Mpix/sec;
decode speed to DXT5: ~ 389 Mpix/sec.

then started wondering what would be the results of trying a multi-threaded decoder (with 4 decoder threads):
420 Mpix/sec to RGBA;
2100 Mpix/sec DXT5 (IOW: approx 2.1 gigapixels per second).

this is for a non-Deflated version, as for the Deflated version, performance kind of goes to crap as the threads end up all ramming into a mutex protecting the inflater (not currently thread safe).

or such...

BTIC1C spec (working draft):

BTIC3A partial spec (idea spec):
(doesn't seem like much, but the issues are more subtle).

well, it looks like 3A may not be entirely dead, there are a few parts I am considering trying to "generalize out", so it may not all be loss. for example, the bitstream code was originally generalized somewhat (mostly as I was like "you know what, copy-pasting a lot of this is getting stupid", as well as it still shares some structures with BTIC2C).

likewise, I may generalize out the use of 256-bit meta-blocks on the encoder end (rather than a 128-bit block format), partly as the format needs to deal both with representing pixel data, and also some amount of internal metadata (mostly related to the block quantizer), and 256-bits provides a little more room to work with.

don't know yet if this could lead to a (probably less ambitious) 3B effort, or what exactly this would look like (several possibilities exist). partly tempted by thoughts of maybe using a PNG-like or DWT-based transform for the block colors.
yes, yet more codec wackiness...

seeing as how my graphics hardware has a limited number of options for (non DXTn / S3TC) compressed texture formats, but does support BPTC / BC6H / BC7, whose complexity hinders effective real-time encoding (*), it may make sense to consider developing a video codec specifically for this.

*: though there is always the option of "just pick a block type and run with it", like always encoding BC7 in mode 5 or BC6H in mode 11 or something.
note: BPTC here will be used (in the OpenGL sense) to refer both to BC6H and BC7.
structurally, they are different formats, and need to be distinguished in-use.
when relevant, BC6H and BC7 (their DirectX names) will be used (mostly because names like "RGBA_BPTC_UNORM" kind of suck...).

basic design:
essentially fairly similar to BTIC1C and BTIC1D (which in turn both derive from Apple Video / RPZA).

unlike 1C and 1D, it (mostly) sidesteps a lot of the complexities of these texture formats, and essentially treats the blocks mostly as raw data. this should still allow a moderately simple and fast decoder (into BPTC or similar).
also this stage of the process will be lossless.

this encoding allows a fairly arbitrary split between block-header and block data, which an encoder should be able to try to optimize for (and search for the "greatest savings" in terms of where to split up the block at). this also includes the ability to do "simple RLE runs" for repeating block-patterns, as well as to store raw/unencoded runs of blocks.

note that it isn't really viable to cleanly split between the header and index portions of a block given the way the blocks work.

Encode Process:
RGB(A) Source Image -> Pixel Block Quantizer + BPTC Encoder -> BTIC1E Frame Encoder -> Deflate -> Packaging/Container.

Decode Process:
Container/Packaging -> Inflate -> BTIC1E Decoder -> BPTC (passed to GL or similar).

the "Pixel Block Quantizer" step will basically try to fudge blocks to reduce the encoded image size; it is unclear exactly how it will tie in with the BPTC encoders. as-is, it is looking mostly like a tradeoff between an RGBA-space quantizer ("pre-cooking" the image) and a naive "slice and dice" quantizer (hack bits between blocks coming out of the BPTC encoder and see what it can get away with within the error threshold, basically by decoding the blocks to RGBA and comparing the results).

an issue: I have rather mixed feelings about BPTC.
namely, it is only available in newer desktop-class GPUs, and could be rendered less relevant if ETC2 becomes widespread in upcoming GPUs (both having been promoted to core in OpenGL).

some of this could potentially lead to cases of needing multiple redundant animated-texture videos, which would be kind of lame (and would waste disk space and similar), though potentially still better than wasting video memory by always using an RGBA16F or RGB9_E5 version.

could almost be a case of needing to implement it and determine whether or not it sucks...

figured the likelihood of BTIC1E sucking was just too high.

started working on another design:

which would be intended as a format to hopefully target both DXT and a BPTC subset, with other goals of being faster for getting to DXTn than BTIC2C, and compressing better than BTIC1C (target speed: 300 Mpix/sec for a single-threaded decoder).

going and checking, the gap isn't quite as drastic as I had thought (if I can reduce the bitrate to 1/2 or 1/3 that of 1C, I will be doing pretty good, nevermind image quality for the moment).

I guess the reason many videos can fit 30 minutes in 200MB is mostly because of lower resolutions (640x360 has a lot fewer pixels than 1024x1024 or 2048x1024...).
recently was working some on a new interpreter design I was calling FRIR2.

what is it?

basically a Three-Address-Code Statically-Typed bytecode format;
the current intention was mostly to try to make a bytecode at least theoretically viable to JIT compile into a form which could be more performance competitive with native code, mostly for real-time audio/video stuff, while still allowing readily changing scripts (not requiring a rebuild, and possibly interactively being able to tweak things).

made some progress implementing it, but it still has a ways to go before it could be usable (and considerably more work before it is likely to be within the target range WRT performance).

not an immediate priority though.

ADD: FWIW, as-is FRIR2 ASM syntax will look something like:

neg.i r13, r9;          //2 byte instruction
add.i r14, r7, r11;     //3 byte instruction
...
neg.i r19, r23;         //5 byte instruction
add.i r42, r37, r119;   //6 byte instruction
...
neg.v3f r19, r23;       //6 byte instruction
add.v3f r42, r37, r119; //7 byte instruction
...
mov.i r3, 0
L0:
jmp_ge.ic r3, 10, L1
inc.i r3, r3
jmp L0
L1:
...

//with declarations:
var someVar:i;  //someVar is an integer
function SomeFunc:i(x:f, y:f)  //int SomeFunc(float, float)
{
    var z:f;
    add.f z, x, y;
    convto.f t0, z, 'i';
    ret.i t0;
}
otherwise, more idle thoughts for how to do alpha blending with Theora and XviD (within an AVI).

previously, I had tried the use of out-of-gamut colors, which while able to encode transparency, would do so with some ugly artifacts and limitations (namely violet bands and an inability to accurately encode colors for alpha-blended areas).

another possibility is to utilize some tricks similar to those used by Google for WebM, namely one of:
encode a secondary video channel containing alpha data (implementation PITA, little idea how existing video players will respond);
double the vertical resolution, encoding the extended information in the lower half, and indicating somehow that this has been done (would be handled via a special hack in the image decoder).

current leaning is toward the resolution-doubling strategy, as it is likely to be less effort.

the main issue is likely how to best encode the use of the hack:
somehow hacking it into one of the existing headers (how to best avoid breaking something?...);
possibly add an extra chunk which would mostly have the role of indicating certain format extensions (would need to be handled in the AVI code and passed back to the codec code).

contents of the extended components:
most likely, DAE (Depth, Alpha, Exponent).

Depth: used for bump-maps, possibly also for generating normal-maps via a Sobel filter (or cheaper analogue), ignored otherwise;
Alpha: obvious enough;
Exponent: Exponent for HDR images, ignored for LDR.

likely, DAE would still be subject to RGB/YUV conversions (could be skipped if only alpha were used).

resolution doubling at least should work without too much issue for existing video players and similar, but would double the height of the video for normal players (leaving all the alpha-related stuff in the bottom of the screen).
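a minimal sketch of the decode-side split, assuming the decoder hands back packed 8-bit RGB rows, and that the D/A/E components ride in the R/G/B channels of the bottom half (the channel assignment and layout here are my guesses):

```c
/* split a height-doubled frame into separate color and DAE planes.
   top half: normal RGB; bottom half: Depth/Alpha/Exponent packed as RGB.
   (layout is an assumption, not a spec.) */
void split_doubled_frame(const unsigned char *src, int width, int full_height,
                         unsigned char *rgb_out, unsigned char *dae_out)
{
    int half = full_height / 2;
    int row_bytes = width * 3;
    int i;
    for (i = 0; i < half * row_bytes; i++)
        rgb_out[i] = src[i];                      /* top half: color */
    for (i = 0; i < half * row_bytes; i++)
        dae_out[i] = src[half * row_bytes + i];   /* bottom half: D/A/E */
}
```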

Theora and XviD compress a little better than my BTIC2C format, so this could offer a better size/quality tradeoff, but likely worse decoding speeds (BTIC2C is roughly on-par with XviD as-is while already using an alpha channel);
unlike some other options, this would still not support specular or glow maps.

most likely, this is more relevant to video sequences than to animated textures, where raw RGB or RGBA is more likely to be sufficient.

still not sure if this is a big enough use-case to really bother with though.

this could potentially require a fairly significant increase in the cost of the color-conversion, doubling the amount of pixels handled and potentially adding some extra filtering cost for normal-maps;
this should still be fast enough for 720p-equivalent resolutions though.

FWIW, a similar cost is implied as with the BGBTech-JPEG format (which supports alpha and normal maps via additional images embedded within the main image).

otherwise, went and added more video textures (to my game project):
water and slime now are video-mapped (using the BTIC1C codec, *);
ended up using 256x256 for the video-textures (was going to use 512x512, figured this was overkill);
discovered and fixed a few bugs (some engine related, a few minor decoder bugs in 1C discovered and fixed, ...);
made a lot of minor cosmetic tweaks (scaling textures, ...);

a minor tweak is that 1C will now try to "guess" the missing green and alpha bits based on the other bits;
basically, 1C normally stores RGB in 555 format (vs 565 as DXTn uses), so there is a missing bit;
likewise, for alpha, which is stored using 7 bits, vs the usual 8.

in both cases, the guess is currently made by assuming that the low bit depends on the high bit, so it copies the bit, which while naive, seems to be better than just leaving it as 0.

the other option is preserving these bits, but the quality gain is not particularly noticeable vs the image size increase.
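the bit-guessing trick itself is tiny; assuming the obvious packing, it amounts to replicating the high bit of the narrow value into the missing low bit (function names here are just for illustration):

```c
/* guess the missing low bit by copying the high bit of the narrow value */
unsigned g5_to_g6(unsigned g5) { return (g5 << 1) | (g5 >> 4); } /* 5-bit green -> 6-bit */
unsigned a7_to_a8(unsigned a7) { return (a7 << 1) | (a7 >> 6); } /* 7-bit alpha -> 8-bit */
```

this keeps full white as full white (31 -> 63, 127 -> 255) and full black as full black, which is why it beats padding with 0.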

*: note, 1C and 2C are different formats. 1C uses an RPZA-based format (RPZA + Deflate + more features), whereas 2C is loosely JPEG-based (and does RGBA mostly by encoding 4-component YUVA images).

1C is primarily focused on decoding to DXTn. it is effectively LDR only (HDR is theoretically possible, but the size and quality from some tests is "teh suck"). while decoding to DXTn it is drastically faster than most other options.

2C is mostly intended for intermediate video and HDR (it can do HDR mostly by encoding images filled with 16-bit half-floats, and/or using one of several fixed-point formats). speed and perceptual size/quality are a little worse than XviD or Theora, but the image quality is much higher at higher bitrates (ex: 30-70 Mbps).

decode speeds are "similar" to those of XviD (both are fast enough to do 1080p30, but 2C can do 1080p30 with HFloat+Alpha). generally, it is ~80 Mpix/sec vs ~105 Mpix/sec.

if XviD were used at 2x resolution to do alpha, this would likely cut the effective speed to around 53 Mpix/sec.
similar applies to Theora.

note: BTJPEG is around 90 Mpix/sec for raw RGB images, and around 60 for RGB+Alpha, for similar reasons.

this leaves the advantage of XviD and Theora mostly in terms of better image quality at lower bitrates (IOW: not throwing 30+ Mbps at the problem...).

misc: dungeon test...

basically, example here:

essentially went and added basic procedurally generated dungeons and a new weapon (the "rocket shovel").
the dungeons are generated on a regular grid, basically by partially randomly spawning dungeon chunks and having new dungeon chunks grow off of existing dungeon chunks (when new terrain chunks are generated).

as the chunks grow outward, they replace voxels in the newly spawned chunks they grow into with their own voxels, and as more new chunks spawn next to these chunks, they grow into those as well.

currently, there are 16 types of dungeon chunk, most being variations on rooms and tunnels.

planned features are ladders or stairs between levels of dungeon and also occasional surface-level access.

the new weapon is basically just a rocket launcher with the behavior tweaked, where it has the special ability to destroy terrain even when most normal weapons have terrain destruction disabled, but with the actual damage to entities reduced.
well, I had tried recently initially to write a VfW codec driver, but alas, it didn't work (initially).

I had a little more success writing a VLC Media Player plugin, which basically works by feeding the request into my codec system and using this to decode the video.

then, later got the VfW driver to work; turns out the issue was mostly that some of the logic was broken, which was revealed after throwing together a mechanism to print messages to a log file.

also created a new codec I had called BTIC1D in an attempt to have higher image quality than RPZA / BTIC1C (mostly for storing intermediate video data, in contrast to BTIC1C which is "nearly ideal" for decoding to DXT1 or DXT5).

BTIC1D: summary:
it is a Block-VQ codec supporting Alpha, Layers, and HDR, using YCgCo and 4:2:0 chroma sub-sampling, and YUV bit-depths of 11:9:9 for LDR and 9:8:8 for HDR (with a 4-bit exponent), and also using Deflate for entropy-coding.

bit-rates are currently fairly high (30Mbps for 480p30 at 80% quality), and decode speeds "could be better" (decode is approx 120 Mpix/sec at present, for a single-threaded plain-C decoder), but it works (*).
most of the decode time goes into things like converting blocks to RGBA and also dealing with Deflate.
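the YCgCo<->RGB step it bogs down in is, in textbook integer form, roughly the following (the codec's exact scaling/offsets may differ; note this simple form is not exactly reversible for all inputs, unlike the lifting-based YCgCo-R variant):

```c
/* textbook integer YCgCo <-> RGB conversion */
void rgb_to_ycgco(int r, int g, int b, int *y, int *cg, int *co)
{
    *y  = (r + 2 * g + b) / 4;
    *cg = (2 * g - r - b) / 4;
    *co = (r - b) / 2;
}

void ycgco_to_rgb(int y, int cg, int co, int *r, int *g, int *b)
{
    int t = y - cg;   /* t = (r+b)/2 */
    *r = t + co;
    *g = y + cg;
    *b = t - co;
}
```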

*: during design I was expecting it to break 200 Mpix/s or better, so the performance is a little disappointing, but alas...
it is fast enough to probably do 1080p60 or 2160i30, but the files would be huge.

probably better compression at similar decode speeds (and probably faster encode) could be possible if using WHT or similar instead (as-in the BTIC2 family).

now having usable codec drivers opens up more options, mostly as fully custom codecs can now be viewed in normal video-players (if albeit lacking many features). otherwise, it would probably require a specialized video player or similar to see videos using the extended features. but, in the common case, it works...

but, I seem to be wasting a bit too much time on all this, not really getting a whole lot else done...

ADD, some info:

and its predecessor:

no code currently available, might do so eventually if anyone is interested.

ADD 2:
Revived a past effort (and finished implementing it), mostly to compare things (see how BTIC1D fares against a more conventional design).

interestingly, the bitrate difference between BTIC1D and BTIC2C isn't particularly drastic; however, the decoding speed of BTIC2C is currently a fair bit lower (once again, mostly bogging down in the YCgCo->RGBA conversion, *1, and currently only pulling off about 35-40 Mpix/sec with an optimized build), though the encode speed is a bit faster.
temporarily dropped a few things to get it implemented more quickly though.
its implementation was mostly done by copy-pasting parts from several of my other codecs: JPEG, BTIC1D, and a few parts from BTIC2B (never fully implemented). does bring up idle thoughts of whether it could make sense to hybridize VQ and WHT+Huffman (say, allowing both within a shared bitstream).

*1: less certain is the reason for the big speed difference here, though cache patterns could be a big factor.
likely, things may be changed to perform color conversion on a block-by-block basis rather than via a big monolithic conversion pass.
so, off in this strange place known as the physical world, I recently set up a blue-screen.
technically, it is just a blue bed sheet, as this is pretty much the only thing I had on-hand (yes, green cloth would be better, but I don't currently have any).

did do a few tests... put them up:

and, as part of the process wrote a tool to composite the video streams.

pardon the awful blue-screening quality, my current setup is pretty bad (my camcorder can barely see the blue, ...).
some work is still needed in all this to try to make it not suck...

the video streams are basically treated as layers, and can be placed independently (*), and will be blended together into the output. currently this is done on the CPU using a 16-bit fixed-point representation for pixels (though a big part of the process at present uses a floating-point representation for pixels).

currently it mimics a GLSL-like interface and the video composition isn't particularly high-performance.
a simple batch-style command-oriented language is also used to drive the process.
I may or may not consider supporting use of my scripting-language for pixel-level calculations (it could be nifty, but is fairly likely to be slow).

*: they have both a bounding box and currently 2 transformation matrices, one allowing the layer to be placed in various orientations within the video-frame, and another for local coordinates to be transformed within texture-coordinate space.

I might later also consider triangles or polygons and a simple rasterizer (or take the naive-and-slow route of checking pixel-coordinates against triangles).

while it probably seems like a waste to do all this on the CPU (vs the GPU), I figured for what I was doing it was likely to be less effort, and performance isn't really critical for batch-tool video composition.

as-is, it implements both Photoshop style and OpenGL style blending modes (for example: "normal"/"overlay"/"color_burn"/... or "src_color one_minus_src_color", ...), though at present they are mutually exclusive for a given layer.

when using GL-style blending, layer-opacity still behaves in a PS-like manner (IOW: it effects overall layer blending, rather than being factored directly into the blending-calculations).
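as a sketch of what one GL-style mode works out to per component in a 16-bit fixed-point representation (0..65535 mapping to 0.0..1.0; the scaling here is my assumption, not necessarily what the tool does):

```c
/* GL-style "src_color one_minus_src_color" blend for one component:
   result = src*src + dst*(1-src), in 16-bit fixed point */
unsigned blend_src_omsrc(unsigned src, unsigned dst)
{
    unsigned long long s = src, d = dst;
    return (unsigned)((s * s + d * (65535ULL - s)) / 65535);
}
```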

in more messing with video, I made another observation:

current major video codecs (H.263, H.264, XviD, Theora, ...) actually do a pretty poor job with things like image-quality and avoiding conversion loss and generational loss and similar, as-in, after transcoding video a few times, the image quality has basically gone to crap. like, they generally look ok in-motion, but if you pause and/or look closely, the poorness of the image quality becomes more obvious, even with maxed-out settings (100% quality), and gets worse with each transcode.

in contrast, you can save out an M-JPEG at 90% or 95% quality, and although the video file is huge, the image quality is generally pretty good. (and 90% quality in JPEG is somewhat higher than what passes for 100% in Theora or XviD).

I don't know of any obvious reason in the bitstream formats for why the major video codecs should have lackluster image quality, but suspect it is mostly because the existing encoders are tuned mostly for "streaming" bitrates, rather than ones more useful for editing or similar.

in all, the size/quality/speed tradeoff has mostly been leaning in favor of my custom RPZA variant (BTIC1C) though, as even if technically, the quality is pretty bad, and the compression is also pretty bad, for intermediate processing it still seems to fare a little better than XviD or Theora (encoding/decoding is considerably faster, and generation-loss seems to be a fair bit lower).

the most obvious quality limitation of BTIC1C at present for this use case though is the limitation to 15-bit colors.

I was left considering a possible extension to try to squeeze a little more image quality out of this (for dedicated 32-bit and 64-bit RGBA / HDR paths), namely probably using the main image to store a tone-mapped version, and then storing an additional extension-layer to recover the "true"/"absolute" RGBA values.

in this case, the RGB555 values wouldn't be absolute colors, but rather themselves treated as interpolated values.
the tone-mapping layer is likely to be stored in a format vaguely-similar to a byte-oriented PNG variant (stored at 1/4 resolution).

or such...
well, this was another experiment... (possibly a waste of time, but oh well...).

I had an idea for a tool to experiment with:
a tool which will take input video, try to up-sample it in a way which "hopefully doesn't look like crap", and save output video.

basically, it just decodes frames from one video, and then resamples and re-encodes them into another video.

the main thing here, granted, is the upsampler.

the algorithm used is currently fairly specialized, and currently only upsamples by powers of 2 (only 2x at a time, but may be used recursively).

I had originally considered trying a neural net (likely being trained over the video prior to upsampling), but realized I don't have anywhere near enough CPU cycles to make this viable (vs a more conventional strategy).

instead, an algorithm is used which combines bicubic filtering for the "main cross" (basically, a cross-shaped region over the pixels being interpolated), with linear extrapolation from the corners (these points become those used by the bicubic filter). the filter actually over-compensates slightly (by itself it introduces noticeable ringing, by design).

additionally, a very slight blur (~ 0.06) is applied both before and after upsampling, where the former helps reduce artifacts in the original image (which otherwise result in noise in the upsampled image, the noise is typically lower-intensity than are major details so is removed more easily by a slight blur), and the latter helps smooth out ringing artifacts.

the reason for over-compensating is to try to make up for the initial loss of sharpness from the first-pass blur, albeit at the cost of producing additional ringing artifacts near edges. the secondary blur helps smooth this out, and makes these look more like proper edges.
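the 4-tap midpoint filter used along the "main cross" is the sort of thing sketched below; the coefficients here are plain Catmull-Rom, whereas the actual filter over-compensates somewhat more than this:

```c
/* 4-tap cubic midpoint filter: interpolates halfway between p1 and p2.
   the negative outer taps (-1/16) are what produce the slight
   overshoot/ringing near edges. */
int cubic_mid(int p0, int p1, int p2, int p3)
{
    int v = (-p0 + 9 * p1 + 9 * p2 - p3 + 8) / 16;
    if (v < 0)   v = 0;    /* clamp to 8-bit pixel range */
    if (v > 255) v = 255;
    return v;
}
```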

in all, it looks "pretty good", or at least, gives slightly nicer looking results than the upsampling provided by VirtualDub (with the files tested, mostly low-res AVIs of various 90s era TV shows). while VirtualDub provides blur filters, they are too strong, and so using a blur before upsampling results in an obvious loss of sharpness, and doing a blur afterwards does not reduce the effect of artifacts, which become much more pronounced after upsampling.

however, beyond this, the goodness ends:
the tool is slow (currently slower than real-time), currently produces overly large files (due mostly to me saving the output as MJPEG with 95% quality), has a fairly limited range of supported formats and codecs (ex: AVI with MJPG/RPZA/XviD/... and PCM, ADPCM, or MP3 audio, ..., *1), is awkward to use (it is a command-line tool), ...

*1: doesn't currently work with other common formats like MPG, FLV, RM, MOV, ... and only a few formats have encoders (MJPEG, RPZA, ...). I tried initially saving as RPZA, but it made little sense to spend so many cycles upsampling simply to save in a format which doesn't really look very great. (while XviD would be good here, using it isn't without issues..., *2).

granted, there is a possible option of just getting funky with the MJPEG encoding to ensure that the file stays under a certain limit (say, we budget 750MB or 900MB for a 1hr video, with per-frame quality being adjusted to enforce this limit).
for example: 1hr at 24Hz and a 900MB limit means, 86400 frames, and a limit of 10.42 kB/frame.
or, alternatively, a running average could be used to adjust quality settings (to maintain an average bit-rate).

*2: patents, dealing with GPL code, or a mandatory VfW dependency.
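the running-average idea could be sketched as a trivial controller that nudges the per-frame JPEG quality toward the byte budget (the step sizes and clamps here are arbitrary choices of mine):

```c
/* nudge quality so the running-average frame size tracks the budget;
   e.g. ~10.42 kB/frame for 1hr at 24Hz under 900MB */
int adjust_quality(int quality, double avg_bytes, double target_bytes)
{
    if (avg_bytes > target_bytes && quality > 50)        quality--;
    if (avg_bytes < target_bytes * 0.9 && quality < 95)  quality++;
    return quality;
}
```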

another limitation is that I don't currently know any good way to counteract the effects of motion compensation.

likewise would be for trying to counteract VHS artifacts (tracking issues, etc...).

ultimately, lost detail is still lost...

but, even as such, a video might still look better upsampled, say, from 352x240 to 704x480, than if simply stretched out using nearest or linear filtering or similar.

all this would be a bit much work though, as my initial goal was mostly just such that some older shows could be watched with more ability to see what was going on (but, then, VirtualDub resampling is still more convenient and works with a wider range of videos...).

then again, for all I know, someone already has a much better tool for this.

or such...

meh, adding Theora support...

the lib (libtheora) is BSD, so no huge issue, but had to hack on it a bit to make it buildable.
its code is nearly unreadable; I am almost left wondering if I would be better off doing my own implementation, but I will probably just use the lib for now and replace it later if needed.

basically, actual video is a case where one has more need for an actual video codec (unlike animated textures / ...), as adding a feature to adaptively adjust quality has shown that 1 hour of MJPG video at a size < 1GB results in awful quality; passable results seem to require a minimum of around 2GB per hour.

this is mostly because MJPG lacks P-Frames or similar (and there is no way to add them without breaking existing decoders, or at least looking really nasty, namely, every P-Frame comes out as a big grey image).

finding *anything* on this was a pain, as apparently AVI+Theora is very uncommon (just, don't want to mess with Ogg at the moment). but, OTOH, Theora is also one of the few major codecs which isn't "yet another reimplementation or variant of MPEG-4".

left thinking that my "JPEG" library is becoming basically a bit of a beast, basically handling image and video compression/decompression with a collection of formats and custom implementations of various codecs.
could be a little better organized, and ideally with less tangling between the AVI logic and codecs.

elsewhere in the renderer, there is a place where texture-management is mixed up with AVI and codec internals, but this may need to be cleaned up eventually, most likely via some sort of proper API for extended components and layers, and making DXTn be a more canonical image representation, likely also with mipmap and cube-map support.

well, and separating layer-management from codec internals would be nice (as-is, it is necessary to jerk around with layers specifically for each codec which uses them).

documentation would have been nice...

I eventually ended up going and digging around in the FFmpeg source trying to answer the question of how exactly their implementation works, which mostly boiled down to the use of 16-bit length fields between headers. knowing these sorts of things could have saved me a good number of hours.

Reused text...

Once again, it is a codec intended for quickly decoding into DXT for use in animated textures and video-mapped textures.

It is a direct continuation of my effort to implement Apple Video, but with a slight rename.

I ended up calling it BTIC1C as it is (despite being based on Apple Video), still within the same general "family" as my other BTIC1 formats, as otherwise I probably would have needed to call it "BTIC3" or something.

There may also be a Deflate-compressed variant, which will increase compression at some cost in terms of decoding performance.

Test of using codec for video capture...

basically, created a newer video codec experiment I am calling BTIC1C:

it is basically a modified Apple Video / RPZA with a few more features glued on.

the idea is based on the observation that the RPZA and DXT block structures are sufficiently similar that it is possible to convert between them at reasonably high speeds (mostly by feeding bytes through tables and similar).

this is basically how my current implementation works, basically directly transcoding to/from DXT, and using some of my existing DXT related logic to handle conversions to/from RGBA space, and also things like lossy block quantization.
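the endpoint-color side of such a table-driven transcode might look like this (the table layout is my assumption; only the RGB555->RGB565 endpoint remap is shown, not the flag bits or index handling):

```c
/* RPZA stores RGB555 endpoints, DXT1 wants RGB565, so a 32K-entry
   table can remap each 15-bit color in a single lookup. */
static unsigned short rgb555_to_565[32768];

void init_555_to_565(void)
{
    int c;
    for (c = 0; c < 32768; c++) {
        unsigned r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
        /* widen green 5->6 bits by replicating its high bit */
        rgb555_to_565[c] =
            (unsigned short)((r << 11) | (((g << 1) | (g >> 4)) << 5) | b);
    }
}
```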

this then allows the decoder to operate at around 450-480 Mpix/s (for a single-threaded decoder, decoding to DXT). though, granted, the image quality isn't particularly great (slightly worse than DXT at best).

my additions were mostly adding support for alpha transparency, and LZ77 based dictionary compression.

the alpha transparency is basically compatible with existing decoders (seems to be compatible with FFmpeg, which displays an opaque version of the image).

there is support for blended-alpha transparency, but this will not work correctly if decoded by a decoder which expects RPZA (so non-blended transparency is needed).

the LZ77 support is not compatible with existing (RPZA) decoders, but results in a modest increase in compression (in my tests, a 25-50% size reduction with some of the clips tested). this does not add any real cost in terms of decoding speed. the dictionary compression is done in terms of runs of blocks (rather than bytes).

compressed file sizes are "competitive" with M-JPEG, though with M-JPEG generally able to deliver higher image quality (for pretty much anything above around 75% quality), and compressing better with clips which don't effectively utilize the abilities of the codec (it is most effective at compressing images with little movement and large flat-colored areas).

for clips though with mostly static background images and large flat-colored areas, such as "cartoon" graphics, the codec seems particularly effective, and does much better than M-JPEG in terms of size/quality (it requires setting the JPEG quality very low, resulting in an awful-looking mess).

additional competitiveness is added by being around 6-7 times faster than my current M-JPEG decoder (currently does around 70-90 Mpix/s with a single-threaded decoder, when decoding directly to DXT).

comparison between this and a prior format, BTIC1A, is that this format compresses better, but has lower decode speeds.

tests have implied that Huffman compression would likely help with compression (based on a few experiments deflate-compressing the frames), but would likely hurt performance regarding the decode speeds.

I have not done tests here for multi-threaded decoding.

or such...


Planets, and Apple Video...

I have decided to change my voxel engine from being an "infinite" 2D plane, to being a bounded 2D region, which I will assert as being a "planet".

this basically consisted of adding wrap-around, and generating a topographical map for the world.
currently, the world size is fairly small, roughly a space about the size of the surface of Phobos, or otherwise, about 64km around (65536 meters).

an issue with planets:
there is no really obvious "good" way to map a 2D plane onto a sphere.
some people try, but things get complicated.

I took the easy route though, and simply made it wrap around, effectively making the ground-level version of the planet have a torus as its topology.

then, for those who notice that the ground-level view of the planet is geometrically impossible, I can be like "actually, the planet is a Clifford Torus", and the spherical view represents the part of the torus passing through normal 3D space...

kind of a hand-wave, but oh well...

meanwhile, went and threw together an implementation of the Apple Video codec (FOURCC='rpza').

I was thinking "you know, Apple Video and DXT1 are pretty similar...".

basically, they are sufficiently similar to where moderately fast direct conversion between them is possible.
granted, the size/quality tradeoff for this codec isn't particularly good.
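one concrete piece of such a conversion is widening the 5-bit green channel of an RPZA-style RGB555 endpoint color to the 6 bits DXT1 expects (a sketch only; index remapping and block headers are a separate matter, and the function name is mine):

```c
/* convert a 15-bit RGB555 color (as in 'rpza') to a 16-bit RGB565 color
   (as in DXT1) by widening green from 5 to 6 bits; the top bit of the
   5-bit value is replicated into the low bit so that 31 maps to 63
   (full intensity stays full intensity). */
static unsigned rgb555_to_rgb565(unsigned c)
{
    unsigned r = (c >> 10) & 31;
    unsigned g = (c >> 5) & 31;
    unsigned b = c & 31;
    unsigned g6 = (g << 1) | (g >> 4);
    return (r << 11) | (g6 << 5) | b;
}
```

since both formats store two endpoint colors plus 2-bit per-pixel selectors for a 4x4 block, most of the remaining work is shuffling bits rather than re-encoding pixels, which is what makes the direct conversion fast.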

while I was implementing support for the codec, I did add alpha-support, so this particular variant supports transparent pixels.

quick test:
I am decoding frames from a test video at around 432 Mpix/s, or a 320x240 test video decoding at around 5700 frames/second. the above is with a vertical flip as a post-step (omitting the flip leaves the decoder running at 470 Mpix/s, and 6130 frames/second).

and, then even as fast as it is going, checking in the profiler reveals a spot which is taking a lot of cycles, causing me to be like "MSVC, FFS!!"

the expressions:
tb[6]=0; tb[7]=0;

compile to:
0xfe08c6a mov edx,00000001h BA 01 00 00 00 0.44537419
0xfe08c6f imul eax,edx,06h 6B C2 06
0xfe08c72 mov [ebp-44h],eax 89 45 BC 0.21256495
0xfe08c75 cmp [ebp-44h],10h 83 7D BC 10
0xfe08c79 jnb $+04h (0xfe08c7d) 73 02 1.38335919
0xfe08c7b jmp $+07h (0xfe08c82) EB 05 0.19569471
0xfe08c7d call $+00018ce1h (0xfe2195e) E8 DC 8C 01 00
0xfe08c82 mov ecx,[ebp-44h] 8B 4D BC
0xfe08c85 mov [ebp+ecx-14h],00h C6 44 0D EC 00 0.25305352
0xfe08c8a mov edx,00000001h BA 01 00 00 00 0.50273299
0xfe08c8f imul eax,edx,07h 6B C2 07
0xfe08c92 mov [ebp-3ch],eax 89 45 C4 0.30703825
0xfe08c95 cmp [ebp-3ch],10h 83 7D C4 10
0xfe08c99 jnb $+04h (0xfe08c9d) 73 02 1.26189351
0xfe08c9b jmp $+07h (0xfe08ca2) EB 05 0.2598016
0xfe08c9d call $+00018cc1h (0xfe2195e) E8 BC 8C 01 00
0xfe08ca2 mov ecx,[ebp-3ch] 8B 4D C4
0xfe08ca5 mov [ebp+ecx-14h],00h C6 44 0D EC 00 0.00337405

which is actually, pretty much, just plain ridiculous...

yes, the above was for debug code...
but, there is presumably some limit to how wacky code built in debug mode should get.


the implementation of Apple Video was extended slightly, with this extended version hereby dubbed BTIC1C.

the primary additions are mostly alpha support and some LZ77 related features.

well, here is something unexpected.
I have yet to confirm this in general, nor have I done any really extensive testing to confirm that what I encountered works in general, but it appears to be the case at least in the cases encountered.

(I can't actually seem to find any mention of it existing, like somehow encountering a feature which should not exist?...).

(ADD 4: Turns out it was much more limited, see end of post...).

basically, what happened:
I had looked around on the internet, and recently saw graphs showing that newer versions of MSVC produced code which was notably faster than the version I was using (Platform SDK v6.1 / VS 2008).

I was like, "hell, I think I will go download this, and try it out...".

so, long story short (on this front, basically went and downloaded and installed a bunch of stuff and ended up needing to reboot computer several times), I got my stuff to build on the new compiler (Visual Studio Express 2013).

except, in several cases, it crashed...

the thing was not just that it crashed, but how it crashed:
it was via bounds-check exceptions (I forget the name).

not only this, but on the actual lines of code which were writing out of bounds...
and, this occurred in several places and under several different scenarios.

in past compilers, one may still get a crash, but it was usually after-the-fact (when the function returns, or when "free()" is called), but this is different.

looking around, I couldn't find much information on this, but did run across this paper (from MS Research):

this implies that this (or maybe something similar) has actually been put into use in compilers deployed "in the wild", and that bounds-checking for C code, apparently, does now actually exist?... (ADD4: No, it does not, I was incorrect here...).

compiler: Visual Studio Express 2013 RC (CL version: 18.00.20827.3);
language: C;
code is compiled with debug settings.

ADD: ok, turns out this is a "Release Candidate" version, still can't find any reference to this feature existing.

I may have to go do some actual testing to confirm that this is actually the case, and/or figure out what else could be going on here... I am confused, like if something like this were added, wouldn't someone write about it somewhere?...

ADD2 (from VS, for one of the cases):
0FD83E3C sub eax,1
0FD83E3F mov dword ptr [ebp-0A4h],eax
0FD83E45 cmp dword ptr [ebp-0A4h],40h
0FD83E4C jae BASM_ParseOpcode+960h (0FD83E50h)
0FD83E4E jmp BASM_ParseOpcode+965h (0FD83E55h)
0FD83E50 call __report_rangecheckfailure (0FD9C978h)
0FD83E55 mov eax,dword ptr [ebp-0A4h]
0FD83E5B mov byte ptr b[eax],0

so, this one is a statically-inserted bounds-check.
(dunno about the others...).

ADD3 (this case has an explanation at least):

ADD4: more testing, it does not apply to memory allocated via "malloc()", which still does the crash on "free()" thing, rather than crash immediately.

the bounds-checking apparently only applies to arrays which the compiler knows the size for, but does not apply to memory accessed via a raw pointer.
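conceptually, the inserted check behaves something like the following (a model of the behavior, not the actual compiler output; the real code calls __report_rangecheckfailure, approximated here with abort()):

```c
#include <stdlib.h>

/* model of the compiler-inserted range check for a store into a
   fixed-size local array; stores through raw pointers (e.g. malloc'd
   memory) get no such check and fail later, if at all. */
#define CHECKED_STORE(arr, n, i, v) \
    do { \
        if ((size_t)(i) >= (size_t)(n)) \
            abort(); /* stands in for __report_rangecheckfailure */ \
        (arr)[(i)] = (v); \
    } while (0)
```

an out-of-bounds index trips the check at the store itself, which matches the "crashes on the actual offending line" behavior described above, rather than the usual delayed crash on return or free().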
idle thoughts here mostly:
project complexity is not nearly so easily estimated as some people seem to think it is;
time budget is, similarly, not nearly so easily estimated.

there is a common tendency to think that the effort investment in a project scales linearly (or exponentially) with the total size of the codebase.
my experiences here seem to imply that this is not the case, but rather that interconnectedness and non-local interactions are a much bigger factor.

like, the more code in more places that one has to interact with (even "trivially") in the course of implementing a feature, the more complex overall the feature is to implement.

similarly, the local complexity of a feature is not a good predictor of the total effort involved in implementing the feature.

like, a few examples, codecs and gameplay features and UIs:

codecs have a fairly high up-front complexity, and so seemingly are an area a person dare not tread if they hope to get anything else done.

however, there are a few factors which may be overlooked:
the behavior of a codec is typically extremely localized (typically, code outside the codec will have limited interaction with anything contained inside, so in the ideal case, the codec is mostly treated as a "black box");
internally, a codec may be very nicely divided into layers and stages, which means that the various parts can mostly be evaluated (and their behavior verified) mostly in isolation;
additionally, much of the logic can be copy/pasted from one place to another.

like, seriously, most of the core structure of my other codecs has been derived from combinations of parts of several other formats: JPEG, Deflate, and FLAC.
I originally just implemented them (for my own reasons, *1), and nearly everything since then has been generations of copy-paste and incremental mutation.

so, for example, BTAC-2A consisted mostly of copy-pasted logic.

*1: partly due to boredom and tinkering, the specs being available, and my inability at the time to make much that was "clearly better" (say, "what if I have something like Deflate, just with a 4MB sliding window?"... it turns out this gains surprisingly little; likewise, trying out various designs for possible JPEG-like and PNG-like formats didn't get much clearly better than JPEG and PNG, *2).

*2: the partial exception was some random fiddling years later, where I was able to improve over them, by combining parts from both (resulting in an ill-defined format I called "NBCES"), which in turn contributed to several other (incomplete) graphics codecs, and also BTAC-2A (yes, even though this one is audio, bits don't really care what they "are").
(basically its VLC scheme and table-structures more-or-less come from NBCES, followed by some of my "BTIC" formats).

the inverse case has generally been my experiences with adding gameplay, engine, and UI features:
generally, these tend to cross-cut over a lot of different areas;
they tend not to be nearly so well-defined or easily layered;

as a result, these tend to be (per-feature) considerably more work in my experience.
I have yet to find a really good way to handle these areas, but generally, it seems that keeping things "localized" helps somewhat here;
however, localization favors small code, since as code gets bigger, the more reason there is to try to break it down into smaller parts, which in turn goes against localization (in turn once again increasing the complexity of working on it).

another strategy seems to be trying to centralize control, so while the total code isn't necessarily smaller, the logic which directly controls behavior is kept smaller.

like, the ideal seems to be "everything relevant to X is contained within an instance of X", which "just works", without having to worry about the layers of abstraction behind X:
in reality, it has to be synchronized via delta messages over a socket, has to make use of a 3D model, has tie-ins with the sound-effect mixing, may get the various asset data involved, ...

it is like asking why some random guy in a game can't just suddenly have his outfit change to reflect his team colors and display his respective team emblem, ... like, despite the seemingly trivial nature of the feature, it may involve touching many parts of the game to make it work (assets, rendering, network code, ...).

OTOH, compiler and VM stuff tends to be a compromise, namely that the logic is not nearly so nicely layered, but it is much more easily divided into layers than is gameplay logic.

say, for example, the VM is structured something like:
parser -> (AST) -> front-end -> (bytecode)
(bytecode) -> back-end -> JIT -> (ASM)
(ASM) -> assembler -> run-time linker -> (machine-code)

it may well end up being a big ugly piece of machinery, but at the same time, has the "saving grace" that most of the layers can mostly ignore most of the other layers (there is generally less incidence of "cross-cutting").

so, in maybe a controversial way, I suspect tools and infrastructure are actually easier overall, even with all the time-wasting and "reinventing the wheel" they may seem to involve...

like, maybe the hard thing isn't making all this stuff, but rather making an interesting/playable game out of all this stuff?...

well, granted, my renderer and voxel system have more or less been giving me issues pretty much the whole time they have existed, mostly because the renderer is seemingly nearly always a bottleneck, and the voxel terrain system likes eating lots of RAM, ...

or such...

Some info available here:

this is basically a codec I threw together over the past some-odd days (on-off over the past 10 days), probably a solid 3 or 4 days worth of implementation effort though.

it was intended to try to have a better size / quality tradeoff than my prior BTAC, but thus far it hasn't quite achieved it, though it has gotten "pretty close to not having worse audio quality".

unlike the prior BTAC, it is a little more flexible.
given its use of Huffman-coding and similar, it is a bit more complicated though.

currently, the encoder is hard-coded to use 132 kbps encoding (basically, as a compromise between 88 and 176 kbps that is similar to 128).

both are designed for random access to blocks from within my mixer (as opposed to stream decoding), so this is their main defining feature (as far as the mixer can tell, it looks pretty similar to the prior BTAC).

(started writing a post, but not sure if this makes much sense as a post...).

currently, I have a 3D game (FPS style), but also a bit of a problem:
it kind of sucks, pretty bad;
I don't think anyone really cares.

I also have a few drawbacks:
not very good at 3D modeling;
I am worse at map-making (hence... why I ended up mostly using voxels...).

but, I can do 2D art "acceptably".

I was recently thinking of another possibility:
I instead do something "simpler" from a content-creation POV, like a platformer.

the core concept would be "basically rip-off something similar to the MegaMan games".

the next thought is how to structure the worlds.
I was initially thinking of using single giant PNG images for content creation, but ran into a problem: representing the whole world as a single giant graphics image (with a 1:1 pixel-density mapping to current monitors) would require absurdly large resolutions, and the graphics editor I am mostly using (Paint.NET) doesn't really behave well with these (it lags and stalls and uses GBs of RAM with a 65536 x 8192 image). note that the image would probably need to be carved up by a tool prior to in-game use (so that it can be streamed and fit more nicely in RAM).

another strategy would be basically to compromise, and maybe build the world as a collection of 1024x1024, 2048x2048, or 4096x4096 panels. each panel would then represent a given segment of the world, and is also a more manageable resolution.

if using 4096x4096 panels, it would probably still be needed to carve them up, mostly to avoid obvious stalls if demand-loading them.

the drawback: mostly having to deal with multiple panels in the graphics editor.

partly it has to do with world pixel density, as my estimates showed that a 16-meter tall visible area, with a close to 1:1 mapping to screen pixels (and a native resolution of 1680x1050 or 1920x1080), would use around 64 pixels per meter.

alternatives would be targeting 32 or 16 pixels/meter, but with the drawback of a lower relative resolution.
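the arithmetic behind these figures is just screen height divided by visible world height (a trivial sketch, function name mine):

```c
/* pixels of screen per meter of world, for a given visible world height */
static int px_per_meter(int screen_h_px, int visible_h_m)
{
    return screen_h_px / visible_h_m;
}
/* a 16 m tall visible area: 1050/16 = 65 and 1080/16 = 67, i.e. roughly
   64 px/m; targeting 32 or 16 px/m instead corresponds to showing a
   taller slice of the world at lower relative resolution. */
```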

or, alternatively, using tiles instead of panels (but, for this, I would probably need to go write a tile-editor or similar, or use the trick of using pixel-colors for panel types). could consider "frames" to allow for larger items if needed.

if using tiles, 1 meter is most likely, or maybe 4 meters as a compromise between tiles and larger panels.

not sure if there are any standard practices here...

as for gameplay/logic, probably:
character walks left and right, can fire a gun, optionally with up/down aiming, and can also jump;
enemies will probably walk back and forth, fire projectiles if the player is seen, or other behaviors, probably with anim-loop driven logic.

unclear: if using a client/server architecture makes sense for a platformer, or if it is more sensible to do it with gameplay directly tied to rendering? (does anyone actually do network multiplayer in platformers?...)

more likely I could just start out by having partially disjoint render-entities and game-entities, with the option of switching over to client/server later if needed.

could consider this time writing the game-logic primarily in my script-language.

does still seem a bit like "more of the same...".

can't say if this will go anywhere.

optimization, ...

so, not much notable recently...

pretty much the past month it seems has mostly been things like performance optimizing and trying to shave down the memory footprint...

well, and I also added Oculus Rift support...
however, it is still a limiting factor that I can only use the thing for short periods of time before motion-sickness sets in, but OTOH, neither my engine nor Minecraft apparently pulls off the required framerates to effectively avoid motion sickness.

have a few times considered idle ideas for a few things:
possibly moving to a 3D model format with precomputed frame vertices (and deciding between something like ".md3" or "compressed dumped VBOs");
as-is, most of this stuff is computed at runtime, which is kind of expensive, as well as the model loading times being an issue (want to load them ideally without introducing a perceptible delay, which is a problem if it would involve calculating a bunch of stuff during loading);
ideally, I also want to keep the skeleton around in the off chance I later want to add something like ragdoll or similar.

also experimentally moved textures to a package, where converting the textures from PNG to BTJ reduced them from 70MB to 22MB.

also packaged up the audio, taking it from 30MB to 8MB, but there seem to be quality issues with many of my sound-effects and my BTAC codec. (ADD, FIXED: integer overflow, needed to add range-checks and clamping).

I am tempted to look into other possible options, with the goals:
similar or lower bitrates;
better size/quality tradeoff;
supports random-access decoding.

possible options:
messing more with quantized entropy-coded ADPCM-like algorithms (some of my past audio codecs), but granted my past attempts had size/quality issues;
fiddling around with an MDCT based codec (similar to Vorbis or MP3), just designing it more for random-access rather than stream decoding;
hacked random-access-oriented version of Vorbis?;

ADD: a new codec may be unnecessary... it seems the main sound-quality issue I was dealing with was due mostly to an integer overflow, which has now been fixed.

otherwise, had another idle thought related to image and video coding.
BTIC2B was probably a bit overly ambitious of a design (with too many tunable parameters, ...).

I may possibly consider an idea for "BTIC2C", which would mostly aim to be a simpler format with less encoding options.
probably: AYUV 4:2:0 with 8x8 WHT, and a YCoCg color-space, ... (and ideally simpler and faster than JPEG).

would probably aim to include basic video features in it. these would be mostly frame-deltas and block-motion-compensation. (main other options here: work more on video-coding stuff for my BTJ-NBCES format, or maybe consider using something like a hacked version of Theora...).

or such...

biomes test...

went and added biomes:

Well, a few updates:
Biomes have now basically been implemented;
The engine now has (essentially) infinite terrain (regions and chunks are generated as-needed, and regions are unloaded as the player moves too far away).

As the player travels further away from the origin, new regions will be created automatically, and more of the math has been cleaned up to better avoid floating point precision issues (coordinates are region-relative, ...).

Also fixed some issues with the "reference origin" system being broken (where the reference-origin system basically handles larger coordinate spaces by having everything in the scene be represented relative to a movable origin point).

The boundary walls are still enabled for now, where these walls basically surround a 1km^2 starting area. These may or may not be disabled later.

Biome types currently are stored per-chunk, and things like colors use quadratic interpolation between chunks.

Currently, there are grass, water, and sun-light colors for each biome, and the grass and leaves textures have been partly desaturated (to allow color modulation to be more effective).

so, yeah, basically each region is drawn translated by RegionOrigin-ReferenceOrigin, and things like camera and entity positions are also relative to the reference origin, ...
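the idea can be sketched like this (hypothetical type and function names): positions are kept in double precision in world space, and everything handed to the renderer is rebased against the movable reference origin so the single-precision values stay small:

```c
typedef struct { double x, y, z; } vec3d;  /* world space, double precision */
typedef struct { float x, y, z; } vec3f;   /* render space, single precision */

/* rebase a world-space position against the reference origin; as long as
   the origin tracks the camera, the resulting floats stay small enough
   to avoid single-precision jitter far from the world origin. */
static vec3f rebase(vec3d world, vec3d ref_origin)
{
    vec3f r;
    r.x = (float)(world.x - ref_origin.x);
    r.y = (float)(world.y - ref_origin.y);
    r.z = (float)(world.z - ref_origin.z);
    return r;
}
```

drawing a region translated by RegionOrigin-ReferenceOrigin is the same operation applied to whole regions rather than points.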

I had implemented it a while ago, but it wasn't really well maintained partly because previously memory use and bugs kept the player mostly confined to a reasonably small space (bounded by walls).

recently, more work is going into basically cleaning up some of these issues, trying to reduce memory use, and hopefully getting around to things like working some on improving performance, ... (terrain handling stuff is currently fairly expensive regarding CPU cycles...).

or such...

well, now more work is once again going into working on the voxel terrain system.
this is, at present, one of the bigger users of time and memory-resources in my 3D engine.

so, a few new things:
added "random think events", which basically allow random chunk-updates to occur (for example, grass now spreads and cactuses now grow, ...);
went and fully added "infinite terrain" support, though practically it is still not nearly as "infinite" as in Minecraft, it is at least possible to travel several km from the origin and return without too much issue (chunks and regions will load and unload as needed);
I am considering adding biomes, and started some work on this;
I am also adding support for indexed chunks.

I had tried, not entirely successfully, to change the region size in a "friendly" way, but ended up reverting to the prior region size mostly as this started resulting in "creeping corruption".

basically, I was wanting to have more symmetric cube-shaped regions (256x256x256), rather than asymmetric ones (512x512x128), but such a change naturally breaks compatibility with the pre-existing region files. I tried using a hack to "fudge over it", but this causes things to start quickly falling apart.

I was thinking it could be possible to replace the current use of spawner-blocks with random-think-events, which would basically again make it more like MC, where mobs can spawn on any qualifying block, rather than needing explicit spawner blocks to spawn them.

index chunks are basically an in-development memory-saving trick.

currently, there are 2 types of chunks:
raw chunks, where each voxel requires 8 bytes (leading to 32kB for a 16x16x16 chunk);
RLE chunks, where the chunk is stored in an RLE-compressed format (it is decompressed when accessed, and raw chunks revert to this form if not accessed for a certain number of ticks).

with an index chunk, it instead will hold a table of voxel values, and each voxel will hold an index into this table.
this should save a fair bit of memory in the case of chunks where only a moderately small number (< 256) of unique voxels exist, without requiring explicit decompression (and only requires ~ 6kB for voxel data rather than 32kB).

they will also have their own RLE compressed form as well.
the main intent of the index-chunks is to reduce the need for dynamic RLE compression.
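a rough sketch of what such a layout could look like (hypothetical names and layout; the real chunk structures differ): ~2 kB of value table plus 4 kB of byte indices, versus 32 kB for a raw chunk:

```c
typedef unsigned long long voxel_t;  /* one 8-byte voxel value */

typedef struct {
    voxel_t table[256];              /* unique voxel values, up to 2 kB */
    unsigned char index[16*16*16];   /* one byte per voxel, 4 kB */
    int n_table;                     /* table entries currently in use */
} IndexChunk;                        /* ~6 kB total vs 32 kB raw */

/* reading a voxel is a single extra indirection, no decompression step */
static voxel_t idxchunk_get(const IndexChunk *c, int x, int y, int z)
{
    return c->table[c->index[(z << 8) | (y << 4) | x]];
}
```

writes are the expensive path here, since the value has to be looked up (or appended) in the table first, which matches the caveat below about index chunks not fully replacing raw chunks.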

for their RLE form, I had idly considered using a predictor, but predictors aren't free, and I don't really know of an "ideal" predictor strategy for 8-bit indexed data anyways (IOW: cheap+effective).

a predictor would make more sense for a more aggressively compressed format (IOW: one with entropy coding), rather than a bytewise RLEB variant.

index chunks will not fully replace raw chunks as:
there is still a need for chunks with possibly more than 256 unique voxels (word-indices are possible but are less competitive, *1);
writing voxels to index chunks is somewhat more expensive (need to index the voxel-value).

*1: especially for chunks which are maybe mostly stone or dirt and maybe a few other things, which I suspect are typically the majority, given my region files indicate an average of around 640 bytes per RLE chunk, which would not happen if there were any significant number of unique voxels in each chunk. never mind that I gained compression by hacking RLEB to support runs > 256 bytes (IOW: very long runs all of the same value), ...
though, strictly speaking, the memory break-even point is around 2k-3k unique voxels per chunk.
the indexing cost will also increase as the voxel-count increases.

here is the result of one test (for stereoscopic 3D output):

note: MovieMaker and YouTube seem to be conspiring against me on this one, basically making it a challenge trying to get a video with the correct aspect ratio.

and a slightly earlier test of direct anaglyph output (intended for green/magenta 3D glasses):

since then, I have gone and fixed up some things such that I can use higher resolutions, and also things like aspect ratio are tunable parameters.

in both cases though, for stereoscopic rendering, the image is naturally an awkward size.
I can render things scaled correctly with a video aspect of 8:3 or 32:9, but the video is basically a bar, and limits on horizontal resolution limit the vertical resolution, and MovieMaker inserts ugly black bars for everything that isn't 4:3 or 16:9.

this was what happened for the first stereo-3D video test (which came out looking awful):

for the later one, I rendered at a slightly higher resolution and with the image shoved into a 16:9 frame (to avoid the black bars), but alas it didn't come out at the correct aspect ratio (I rendered for 4:3, but YouTube displayed it at 16:9).

I could do a 3rd test, but I will leave it until later.

likely this would mean rendering/recording at 1280x720 or 1600x900 with dual 16:9.
this would work, but would halve the effective horizontal resolution (800x900).

note that a lot of this is also effectively limited by my monitor resolution (1680x1050).

note that for both stereoscopic and anaglyph modes, the scene is only rendered once and then warped as a post-processing effect to create the left/right views (the premise being that this is cheaper than rendering the scene twice, even if the quality is not as good).

or such...

3rd Test:

more experimentation and fiddling continues, along with trying to find the ideal settings for stereoscopic 3D.

I spent a while trying to figure out some math in the shader, later realizing that the 'relevance' of the variable amounted mostly to a constant factor. some other math works, somehow...

not really sure if there is any standardized convention for stereoscopic rendering.

from what I can tell, it is mostly a matter of divergence from a center point, with the divergence increasing with distance (implying that each eye looks slightly outward, rather than straight forward or slightly cross-eyed). there is also a certain factor due to approximate eye distance.
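one common way to model this (an assumption on my part, not a standard; names are mine): horizontal parallax that is zero at a chosen convergence distance and approaches the eye separation as depth goes to infinity:

```c
/* horizontal parallax for a point at depth z: zero at the convergence
   distance, approaching eye_sep with distance (each eye then takes half,
   with opposite signs). a larger eye_sep gives the faster divergence
   described above; a larger conv_dist pushes the "screen plane" deeper. */
static float stereo_parallax(float eye_sep, float conv_dist, float z)
{
    return eye_sep * (1.0f - conv_dist / z);
}
```

points nearer than the convergence distance get negative parallax (they appear in front of the screen), which is consistent with the divergence-from-a-center-point description above.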

most 3D content I have found, for whatever reason, seems to diverge much more rapidly than what feels natural to myself, and often to a degree where keeping the images integrated is difficult. I have generally gotten a more comfortable and natural effect using a smaller divergence.