

BGB

Member Since 02 Feb 2013
Offline Last Active Jul 10 2014 04:54 PM

Topics I've Started

New VM (WIP), thoughts?...

20 May 2014 - 01:26 PM

(I debated whether to put this here or in the journal, but I am looking for people's thoughts on some of this...)

 

 

well, here is how this got going:

for a while, I wanted a VM which could allow extending the reach of C to targets where C is normally unavailable or overly inconvenient (namely web-browsers and Android).

 

while there is Emscripten for browsers, my initial tests showed that the size of the output code expands considerably (*), raising doubts about its realistic viability for moderate/large codebases, more so as I am running a personal server and don't exactly have massive upload speeds (they give like 30 Mbps down but 2 Mbps up; a person could get 25 down / 5 up for "only" $179/month on a 3-year contract... cough...).

 

while the NDK is available for Android, it has some serious drawbacks, making it not really a great option (which version of ARM does the device run? what if it runs x86? ...). things are nicer if Java/Dalvik can be used here.

 

*: (EDIT: Removed. Issue turns out to be "not so simple".).

 

 

also, I have recently put my game project on hold, having basically reached a stage where I have run out of ideas for new functionality and burnt out on always dealing with the same basic issues.

 

I have gone and messed with writing small apps, as tests, to try out various pieces of technology (ex: various transcompilers to JavaScript, ...).

 

 

so, I decided recently (~several weeks ago) to start working on a new VM, consisting of several major components:

a C compiler front-end, which compiles C source into the bytecode format;

* based on a pre-existing C compiler frontend of mine, which originally targeted a different IL.

** it also had partial support for C++, so a C++ subset may be possible (will probably lack templates).

** it has spent recent years mostly just being used as a code-processing / glue-code generation tool.

* IR is statically-typed and register based, vaguely Dalvik-like

 

an interpreter backend, which will be rewritten as needed for each target.

* current (still incomplete/untested) backend is written in Java, but C and Dart or JS backends are planned.

* the plan for the JS backend would be to dynamically compile the bytecode into JS on the client.

** main roadblock: dealing with JS; not sure of the best approach for debugging in JS.

** Java via Google Web Toolkit is a more likely initial target for browsers ATM.

* it uses a design intended mostly to allow a (relatively) simple interpreter to generate efficient code.

** this basically means a Register-IR

** while stack-machines are simpler overall, a simple stack interpreter will give lackluster performance.

*** getting more speed out of a stack machine means a lot of additional complexity.

** also don't confuse simplicity with smallness

*** N*M cases may make code bulkier, but don't add much to overall implementation complexity.

* bytecode images are likely to be Deflate or maybe LZMA compressed.
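to illustrate the register-IR point: a minimal 3-address dispatch loop looks something like the sketch below (opcode names and the instruction encoding here are made up for illustration, not the actual FRBC2C format):

```c
#include <stdint.h>

/* hypothetical 3-address instruction: opcode + dest/src register numbers.
 * purely illustrative; not the FRBC2C encoding. */
typedef struct { uint8_t op, d, a, b; } Insn;

enum { OP_ADD_I, OP_MUL_I, OP_RET };

static int32_t vm_run(const Insn *ip, int32_t *r)
{
	for(;;)
	{
		Insn in = *ip++;
		switch(in.op)
		{
		case OP_ADD_I: r[in.d] = r[in.a] + r[in.b]; break;
		case OP_MUL_I: r[in.d] = r[in.a] * r[in.b]; break;
		case OP_RET:   return r[in.d];
		default:       return -1;	/* bad opcode */
		}
	}
}
```

each instruction does a complete typed operation on named registers, so the dispatch loop stays trivial; a stack machine would need either more opcodes executed per expression, or a smarter interpreter, to match this.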

 

 

the VM design in question here is much lower-level than some of my other VMs, and makes the compiler logic a fair bit more complicated, so this is still the main hold-up at present (can't finish/test the backend without being able to produce code to run on it).

 

the design is intended to be high-level enough to gloss over ABI differences, and allow some implementation freedom, but this is mostly about all it really does (it otherwise isn't that far above machine code in a level-of-abstraction sense).

 

note that, unlike, say, Flash, this will not require any browser plugins or similar; rather, the plan is that the VM itself will be sent over to the browser, and then used to transform the code client-side.

this does carry the assumption that the VM will be smaller than the code which runs on it.

 

 

this is worrying since, as-is, it means lots of code I have as of yet been unable to verify; I would need to write a bytecode assembler and disassembler to test the backend more directly, and will probably need these eventually anyway.

 

for those interested, here is a recent-ish working-spec for the bytecode:

http://cr88192.mooo.com:8080/wiki/index.php/FRBC2C

 

it remains to be seen whether I will be able to get all this to a usably complete level anytime soon.

 

 

thoughts?...


running flat, feeling burnt out...

11 April 2014 - 01:52 AM

so, here is my issue:

I had made my game project.

 

it kind of sucked as a game, granted.

no one cared, and no money was made.

now my ideas and motivation to work on it have largely run flat as well, with few new ideas, and fewer still that I feel motivated to work on.

 

 

spent a while working on video stuff, managed to get a specialized codec up to passable bitrate/quality.

bitrate/quality is worse than mainstream video codecs, but compares favorably to many other low-complexity VQ-based codecs, and it can be used for recording full-screen video in real time on my current main PC, and does pretty well even on an 11-year-old laptop.

 

I was actually left considering trying to encode video on an ASUS EEE 701, except I seem to have misplaced the thing.

extrapolation gave a good solid "maybe" as to whether real-time encoding could be done on it (basically, seeing if 800x480 could be encoded real-time on a 630 MHz Celeron).

 

but, motivation fizzled, and there wasn't really much left that needed to be worked on here.

it basically achieved design goals (all that would really be left would be fine-tuning for supporting decoding to BC7 partitioned-mode blocks).

 

 

then, stuff has gone in a more pointless direction...

 

I also sat around recently and wrote a basic software rasterizer, with the goal of using it as a limited-function OpenGL 1.x implementation. the OpenGL API layer hasn't been fully implemented, but it gives "reasonable" results for straightforward rendering tasks (simple rendering test-cases often run at triple-digit fps on a 3.4 GHz CPU, *).

 

*: it can also do 80-90 fps doing a "console" test, filling a 1440x900 image with 8x8 pixel textured triangles to draw characters. but, if drawing a single big triangle or quad, it can do ~ 250fps for a 1440x900 big textured quad, and ~ 400 fps at 1024x768 (and ~ 900 fps for flat-color fills).

 

it tries to gain speed mostly by cutting a lot of corners, for example only supporting NEAREST and NEAREST_MIPMAP_NEAREST filtering, ... and it would probably skip supporting a bunch of stuff I have pretty much never seen used. it also tries to gain some speed by hard-coding a lot of "common-special-case" span-drawing loops, and by mostly using fixed-point and pseudo-SIMD (where SIMD is faked mostly by packing multiple values into integer registers).
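for the curious, the pseudo-SIMD part is basically the old shift-and-mask game; a minimal sketch of the general idea (a generic version of the trick, not the actual rasterizer code):

```c
#include <stdint.h>

/* approximate per-byte average of two 32-bit words, each holding four
 * packed 8-bit values: halve each byte, mask off the bit that would
 * bleed in from the neighboring byte, then add. no unpacking needed. */
static uint32_t avg4_bytes(uint32_t a, uint32_t b)
{
	return ((a >> 1) & 0x7F7F7F7Fu) + ((b >> 1) & 0x7F7F7F7Fu);
}
```

the same masking games work for packed 16-bit fixed-point values; the win is doing 2-4 lanes of work per integer op without any actual SIMD instructions.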

 

hard to say how it would hold up in an actual game though. the big risk is mostly that a game may have a lot of overdraw, or may draw a lot of geometry which falls outside the special cases. the goal may be to eventually test it with Quake 2 or similar (once the OpenGL layer is mostly implemented).

 

 

but, the people I know basically express hate for the idea...

basically, people being like "but Mesa already does software rendering!".

 

it is already established that a software GL renderer is most likely pretty much useless though.

 

 

vaguely related to the above: I have been sitting around trying to write a basic Windows emulator, where most of the effort has gone into hacking together basic mock-ups of core Win32 DLLs (kernel32, gdi32, ws2_32, ...), which at this point consist almost entirely of stubs (basically, implementing functions whenever the program dies and helpfully reports which Win32 API function was called...).

 

it also runs in an x86 interpreter/emulator, so in this sense it is technically a little closer to something like DOSBox than to something like Wine. (also, the x86 interpreter itself runs a subset, limited pretty much entirely to Ring-3 32-bit protected mode and a 486-like ISA). the PE/COFF loader is also built into the interpreter.

 

the core libs are being rewritten partly because the Wine libs wouldn't fit very well on my x86 emulator, and partly because I don't particularly want the core machinery to be stuck with GPL or LGPL (probably I will use an MIT license).

 

also, it is limited to a small subset of the Win32 API (+ GL), mostly because, seriously, I can't possibly re-implement the entirety of Windows, nor am I trying to (and I largely consider DirectX and .NET to be outside the scope of such an effort).

 

 

people don't seem to get that the goal is not to recreate the entire OS, nor to be world-changing or community-building or whatever. (the initial goal would be more like being able to run things like MinGW and GLQuake and similar on the thing...).

 

but, yeah, also a lot of hate over this one, and once again, it is already established that this stuff is probably pretty much useless...

 

 

I am like, "I feel like doing this stuff, I will do it", unless something better or more interesting comes along...

 

like, this stuff is new for me, I haven't personally done it before, so it seems like something to do; and everything I am doing pretty much seems to be a waste of time anyway, so it can't be all that much worse off on this front...

 

 

thoughts / comments?...

 


CPU-side performance...

12 March 2014 - 03:32 AM

basically, I am sitting around mostly lazily writing plain C code, largely without resorting to anything "fancy" in an attempt to get performance (ex: ASM, SIMD, multithreading, ...), though admittedly there was some amount of micro-optimizing and "fine-tuning", ... (a lot of this is used elsewhere in the project, just not in the code in question...).

 

ok, so?...

 

 

I "recently" went and wrote a version of my currently-active video codec mostly for real-time capture, and with some fiddling got it up to doing around 1680x1050 at 30fps (with the recording front-end being VirtualDub). basically does full desktop capture at typically around 12% CPU load on my system (though if the CPU load for VirtualDub hits 15% it will drop below 30 and start dropping frames, with a single-threaded encoder).

 

what is a mystery is: ok, I have done this, so why are there so few other options that can do similarly or better?...

one program will encode using MS-CRAM (MS Video 1), which goes fast enough but the video quality leaves much to be desired.

FRAPS sort of holds up ok (despite the free version having a recording time-limit and similar), but grinds the HDD pretty bad and goes through unreasonably large amounts of HDD space.

 

another program based around x264 basically runs the CPU at full load (on all cores), lags computer some, and has to use downsampling to be able to record in real-time (on settings for high-speed encoding).

 

well, ok, there is Lagarith; using Lagarith via VirtualDub pulls around 27 fps and runs the CPU at 30%-40% load, with VirtualDub still dropping frames (still a viable option though).

 

also tested capturing using XviD, but it didn't hold up well (quickly dropped below 20 fps, while pretty much maxing-out the CPU in the process).

 

well, never mind the differences between the various formats, which can contribute a fair bit to the computational cost of encoding.

 

 

 

well, I have also thrown together a BC7 encoder (now with partition support) that is generally fast enough for load-time encoding (still probably a bit too slow for real-time though).

 

then again, I don't use a brute-force search, instead driving most of the process by arithmetic and lookup tables.

 

ex:

RGB -> YCbCr, then do VQ magic on the CbCr values, and use this to drive a lookup table (to get the partition; a series of LUTs mapping chroma-space vectors to partition numbers is built when the coder initializes);

the partition is then used by the good old CYGM endpoint-selector (*), which chooses endpoints independently per-partition;

the various options (block-format / etc) are then evaluated and then the final output block is produced.
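the RGB -> YCbCr step above is just the usual fixed-point form; a sketch with BT.601-style coefficients scaled by 256 (the codec's exact colorspace and scaling may well differ):

```c
/* fixed-point RGB -> YCbCr, coefficients scaled by 256 (BT.601-ish).
 * inputs 0..255; Cb/Cr are offset to center at 128.
 * assumes arithmetic right-shift for negative intermediates,
 * as on typical targets. */
static void rgb_to_ycbcr(int r, int g, int b, int *y, int *cb, int *cr)
{
	*y  = (  77*r + 150*g +  29*b) >> 8;
	*cb = (( -43*r -  85*g + 128*b) >> 8) + 128;
	*cr = (( 128*r - 107*g -  21*b) >> 8) + 128;
}
```

the CbCr pair could then be quantized down to a few bits each and used directly as an index into the partition LUTs.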

 

the logic could still be faster (ex: as-is, it has to run the filter / endpoint classifier multiple times per block, ... and seems to be the bottleneck here).

 

*: CYGM = Cyan, Yellow, Green, Magenta. basically the pixels are converted into a CYGM-based color-space, which is then used to evaluate the endpoints (in prior tests of naive linear classifier axes, CYGM seemed to give the best results). it was also computationally cheaper than some other options, while giving generally better results than simply classifying based on Luma.

 

( the CYGM filter was previously used some with video recording, but framerates didn't hold up at higher resolutions, so recording has reverted to a luma-driven selector, with a potentially cheaper algorithm being considered as an option to improve image quality. it is still used for batch encoding and for load-time texture conversion and similar, as these are less speed-critical...).

 

 

I don't really get it; it didn't seem all that difficult to get ok speeds out of BC7 encoding, but a lot of the existing encoders seem to take a very long time and resort to GPGPU stuff and similar...

 

unless it is mostly time spent searching for an "optimal" solution, rather than accepting a "good approximation" as sufficient?... it does still seem a little steep though.

 

 

well, BC7 is also used as the current primary texture format for sending video to the GPU (replacing DXT5, mostly since BC7 can give better image quality).

 

 

sorry if there is no particular question here, but, thoughts?...


misc: texture compression and video textures...

09 December 2013 - 02:04 AM

basically, a fair amount of work recently in my case has been going into video textures, mostly trying to find a "good" solution to the various sorts of problems that come up when doing these sorts of things.

 

sorry, not sure where to best post this (I am using OpenGL at least, I guess...).

 

 

basically, here is the issue:

 

one can have a more general-purpose video format (MJPEG / Theora / ...), which might internally represent the video frames in a YUV colorspace or similar, and then transcode to a compressed texture format (DXT1, DXT5, BC6H, BC7, or whatever else). the advantage is that the video can be more general-purpose and get a good size/quality tradeoff; the disadvantage is that decoding speed isn't as good, nor is the final image quality particularly great (since the texture conversion needs to be done in real-time).

 

for example, for normal texture maps, I am generally using load-time conversion, as the speed is less important and it can do a slightly better looking conversion.

 

 

alternatively, one can have a codec which specifically targets a given compressed texture format, and can potentially invest more effort (at encoding time) into generating better final image quality, as well as having higher decoding speeds, and more features can be supported for video-textures (such as mipmaps, ...).

 

the problem: the need for multiple versions of the video to target each format used, as well as to use a codec potentially specific to each compressed texture format (for example, having one codec that does DXT1 and DXT5, and another that does BC6H and BC7), as well as possibly still needing to keep around a "generic" version (for decoding to RGBA or similar).

 

for example, it isn't necessarily good if a person needs 2 or 3 versions of a given video-texture, each using a different codec (wasting space and similar). (there end up being a lot of specialized codecs in use, mostly since none does particularly great with "everything"...).

 

and, what of other compressed-texture formats? ...

 

 

for example:

textures/base_vid/sometex_dxt5.avi  //DXT5 or similar

textures/base_vid/sometex_bptc.avi  //BC6H or BC7

textures/base_vid/sometex.avi  //Generic (RGBA or real-time conversion)

 

or (what I have often currently been doing):

only having the version intended for decoding to DXTn, with a fallback case for decoding to RGBA or similar if needed (issue: weak image quality in fallback cases).

 

I am not sure if anyone has a particularly good strategy here?...

 

(well, granted, besides obvious things, like not using videos as animated textures or similar...).

 

 


A Wild switch() Appears!

20 November 2013 - 11:46 PM

yeah, switches are nothing new, and can sometimes, if left unchecked, get pretty big and ugly.

 

 

recently, in a piece of code of mine, this particularly glorious example has appeared:

	ret=0; mode=0; nextmode=0;
	while((cs<cse) && (ct<cte) && !ret)
	{
		i=*cs++;
		switch(i&0xE0)
		{
		case 0x80:
			n1=(i&31)+1;
			ct+=n1*stride;
			break;			
		case 0xE0:
			switch(i)
			{
			case 0xE0:
				if(mode)
					{ mode=0; nextmode=0; }
				else
					{ ret=1; }
				break;
			case 0xE1:
				cs+=3;
				break;
			case 0xE8:
				bgbbtj_rpza_memcpy8(ct, ct-stride);
				ct+=stride;
				break;
			case 0xE9:
				j=(*cs++)+1;
				while(j--)
				{
					bgbbtj_rpza_memcpy8(ct, ct-stride);
					ct+=stride;
				}
				break;
			case 0xEA:
				j=(*cs++)+1;
				if((ct-j*stride)<blks) { ret=-1; break; }
				bgbbtj_rpza_memcpy8(ct, ct-j*stride);
				ct+=stride;
				break;
			case 0xEB:
				j=(cs[0]<<8)+cs[1]+1;
				cs+=2;
				if((ct-j*stride)<blks) { ret=-1; break; }
				bgbbtj_rpza_memcpy8(ct, ct-j*stride);
				ct+=stride;
				break;
			case 0xEC:
				j=(*cs++)+1;
				k=(*cs++)+1;
				if((ct+j*stride)>cte) { ret=-1; break; }
				if((ct-k*stride)<blks) { ret=-1; break; }
				while(j--)
				{
					bgbbtj_rpza_memcpy8(ct, ct-k*stride);
					ct+=stride;
				}
				break;
			case 0xED:
				j=(cs[0])+1;
				k=(cs[1]<<8)+cs[2]+1;
				cs+=3;

				if((ct+j*stride)>cte) { ret=-1; break; }
				if((ct-k*stride)<blks) { ret=-1; break; }

				while(j--)
				{
					bgbbtj_rpza_memcpy8(ct, ct-k*stride);
					ct+=stride;
				}
				break;
			case 0xF0:
				switch(cs[0])
				{
				case 0xF0: cs++; mode=0; nextmode=0; break;
				case 0xF1: cs++; mode=1; nextmode=1; break;
				case 0xF2: cs++; mode=2; nextmode=2; break;
				case 0xF3: cs++; mode=3; nextmode=3; break;
				default: nextmode=mode; mode=0; break;
				}
				break;
			case 0xF1: nextmode=mode; mode=1; break;
			case 0xF2: nextmode=mode; mode=2; break;
			case 0xF3: nextmode=mode; mode=3; break;
			
			case 0xF8:
				switch(*cs++)
				{
				case 0x81:
					for(j=0; j<256; j++)
						{ ctx->pal256[j]=rpza_blkidxcolor[j]; }
					for(j=0; j<16; j++)
						{ ctx->pal16[j]=rpza_blkidxcolor[j]; }
					for(j=0; j<256; j++)
						{ ctx->pat256[j]=rpza_blkidx_pixpat[j]; }
					break;
				case 0x82:
					j=*cs++;
					k=(*cs++)+1;
					for(; j<k; j++)
					{
						l=(cs[0]<<8)|cs[1];
						cs+=2;
						ctx->pal256[j]=l;
					}
					break;
				case 0x83:
					j=(cs[0]>>4)&15;
					k=(cs[0]&15)+1;
					for(; j<k; j++)
						{ ctx->pal16[j]=ctx->pal256[*cs++]; }
					break;
				case 0x84:
					j=(cs[0]>>4)&15;
					k=(cs[0]&15)+1;
					for(; j<k; j++)
					{
						l=(cs[0]<<8)|cs[1]; cs+=2;
						ctx->pal16[j]=l;
					}
					break;
				case 0x85:
					j=*cs++;
					k=(*cs++)+1;
					for(; j<k; j++)
					{
						l=(cs[0]<<24)|(cs[1]<<16)|(cs[2]<<8)|cs[3];
						cs+=4;
						ctx->pat256[j]=l;
					}
					break;
				}
			default:
				break;
			}
			break;

		default:
			switch(mode)
			{
			case 0:
				switch(i&0xE0)
				{
				case 0xA0:
					j=(cs[0]<<8)|cs[1];
					cs+=2;
					l=j;
					j=((j&0x7FE0)<<1)|(j&0x1F);

					if(l&0x8000)
					{
						ctb[0]=j&0xFF;
						ctb[1]=(j>>8)&0xFF;
						ctb[2]=j&0xFF;
						ctb[3]=(j>>8)&0xFF;
						ctb[4]=0xFF; ctb[5]=0xFF;
						ctb[6]=0xFF; ctb[7]=0xFF;
					}else
					{
						ctb[0]=j&0xFF;
						ctb[1]=(j>>8)&0xFF;
						ctb[2]=j&0xFF;
						ctb[3]=(j>>8)&0xFF;
						ctb[4]=0; ctb[5]=0;
						ctb[6]=0; ctb[7]=0;
					}

					n1=(i&31)+1;
					for(i=0; i<n1; i++)
					{
						bgbbtj_rpza_memcpy8(ct, ctb);
						ct+=stride;
					}
			
					break;
				case 0xC0:
					j=(cs[0]<<8)|cs[1];
					k=(cs[2]<<8)|cs[3];
					cs+=4;
					l=k;
					j=((j&0x7FE0)<<1)|(j&0x1F);
					k=((k&0x7FE0)<<1)|(k&0x1F);

					if(l&0x8000)
					{
						if(j<=k)
						{
							ctb[0]=j&0xFF;
							ctb[1]=(j>>8)&0xFF;
							ctb[2]=k&0xFF;
							ctb[3]=(k>>8)&0xFF;
							csm=rpza_blkmap1;
						}else
						{
							ctb[0]=k&0xFF;
							ctb[1]=(k>>8)&0xFF;
							ctb[2]=j&0xFF;
							ctb[3]=(j>>8)&0xFF;
							csm=rpza_blkmap2;
						}
					}else
					{
						if(j>k)
						{
							ctb[0]=j&0xFF;
							ctb[1]=(j>>8)&0xFF;
							ctb[2]=k&0xFF;
							ctb[3]=(k>>8)&0xFF;
							csm=rpza_blkmap1;
						}else
						{
							ctb[0]=k&0xFF;
							ctb[1]=(k>>8)&0xFF;
							ctb[2]=j&0xFF;
							ctb[3]=(j>>8)&0xFF;
							csm=rpza_blkmap2;
						}
					}

					n1=(i&31)+1;
					for(i=0; i<n1; i++)
					{
						ctb[4]=csm[cs[0]]; ctb[5]=csm[cs[1]];
						ctb[6]=csm[cs[2]]; ctb[7]=csm[cs[3]];
						cs+=4;
						bgbbtj_rpza_memcpy8(ct, ctb);
						ct+=stride;
					}
					break;

				default:
					if(cs[1]&0x80)
					{
						cs--;
						j=(cs[0]<<8)|cs[1];
						k=(cs[2]<<8)|cs[3];
						cs+=4;
						j=((j&0x7FE0)<<1)|(j&0x1F);
						k=((k&0x7FE0)<<1)|(k&0x1F);

						if(j>k)
						{
							ctb[0]=j&0xFF;
							ctb[1]=(j>>8)&0xFF;
							ctb[2]=k&0xFF;
							ctb[3]=(k>>8)&0xFF;
							csm=rpza_blkmap1;
						}else
						{
							ctb[0]=k&0xFF;
							ctb[1]=(k>>8)&0xFF;
							ctb[2]=j&0xFF;
							ctb[3]=(j>>8)&0xFF;
							csm=rpza_blkmap2;
						}

						ctb[4]=csm[cs[0]]; ctb[5]=csm[cs[1]];
						ctb[6]=csm[cs[2]]; ctb[7]=csm[cs[3]];
						cs+=4;
						bgbbtj_rpza_memcpy8(ct, ctb);
						ct+=stride;
					}else
					{
						memset(ctb, 0, 8);
						//dummy...
						cs+=31;
						bgbbtj_rpza_memcpy8(ct, ctb); ct+=stride;
					}
					break;
				}
				break;
			case 1:
				switch(i&0xE0)
				{
				case 0xA0:
					j=ctx->pal256[*cs++];
					j=((j&0x7FE0)<<1)|(j&0x1F);

					ctb[0]=j&0xFF;
					ctb[1]=(j>>8)&0xFF;
					ctb[2]=j&0xFF;
					ctb[3]=(j>>8)&0xFF;
					ctb[4]=0; ctb[5]=0;
					ctb[6]=0; ctb[7]=0;

					n1=(i&31)+1;
					for(i=0; i<n1; i++)
					{
						bgbbtj_rpza_memcpy8(ct, ctb);
						ct+=stride;
					}
					break;
				case 0xC0:
					j=ctx->pal256[cs[0]];
					k=ctx->pal256[cs[1]];
					cs+=2;
					j=((j&0x7FE0)<<1)|(j&0x1F);
					k=((k&0x7FE0)<<1)|(k&0x1F);

					ctb[0]=j&0xFF;
					ctb[1]=(j>>8)&0xFF;
					ctb[2]=k&0xFF;
					ctb[3]=(k>>8)&0xFF;
					csm=rpza_blkmap1;

					n1=(i&31)+1;
					for(i=0; i<n1; i++)
					{
						j=rpza_blkidxmap1[cs[0]];
						k=rpza_blkidxmap1[cs[1]];
						cs+=2;

						ctb[4]=csm[(j>>8)&255]; ctb[5]=csm[j&255];
						ctb[6]=csm[(k>>8)&255]; ctb[7]=csm[k&255];
						bgbbtj_rpza_memcpy8(ct, ctb);
						ct+=stride;
					}
					break;
				default:
					cs--;
					j=ctx->pal256[cs[0]];
					k=ctx->pal256[cs[1]];
					cs+=2;
					j=((j&0x7FE0)<<1)|(j&0x1F);
					k=((k&0x7FE0)<<1)|(k&0x1F);

					ctb[0]=j&0xFF;
					ctb[1]=(j>>8)&0xFF;
					ctb[2]=k&0xFF;
					ctb[3]=(k>>8)&0xFF;
					csm=rpza_blkmap1;

					j=rpza_blkidxmap1[cs[0]];
					k=rpza_blkidxmap1[cs[1]];
					cs+=2;

					ctb[4]=csm[(j>>8)&255]; ctb[5]=csm[j&255];
					ctb[6]=csm[(k>>8)&255]; ctb[7]=csm[k&255];
					bgbbtj_rpza_memcpy8(ct, ctb);
					ct+=stride;

					break;
				}
				break;
			case 2:
				switch(i&0xE0)
				{
				case 0xA0:
					j=ctx->pal16[(*cs++)&15];
					j=((j&0x7FE0)<<1)|(j&0x1F);

					ctb[0]=j&0xFF;
					ctb[1]=(j>>8)&0xFF;
					ctb[2]=j&0xFF;
					ctb[3]=(j>>8)&0xFF;
					ctb[4]=0; ctb[5]=0;
					ctb[6]=0; ctb[7]=0;

					n1=(i&31)+1;
					for(i=0; i<n1; i++)
					{
						bgbbtj_rpza_memcpy8(ct, ctb);
						ct+=stride;
					}
					break;
				case 0xC0:
					j=ctx->pal16[(cs[0]>>4)&15];
					k=ctx->pal16[(cs[0]>>0)&15];
					cs++;
					j=((j&0x7FE0)<<1)|(j&0x1F);
					k=((k&0x7FE0)<<1)|(k&0x1F);

					ctb[0]=j&0xFF;
					ctb[1]=(j>>8)&0xFF;
					ctb[2]=k&0xFF;
					ctb[3]=(k>>8)&0xFF;
					csm=rpza_blkmap1;

					n1=(i&31)+1;
					for(i=0; i<n1; i++)
					{
						j=ctx->pat256[*cs++];
						ctb[4]=csm[(j>>24)&255]; ctb[5]=csm[(j>>16)&255];
						ctb[6]=csm[(j>> 8)&255]; ctb[7]=csm[(j    )&255];
						bgbbtj_rpza_memcpy8(ct, ctb);
						ct+=stride;
					}
					break;
				default:
					cs--;
					j=ctx->pal16[(cs[0]>>4)&15];
					k=ctx->pal16[(cs[0]>>0)&15];
					cs++;
					j=((j&0x7FE0)<<1)|(j&0x1F);
					k=((k&0x7FE0)<<1)|(k&0x1F);

					ctb[0]=j&0xFF;
					ctb[1]=(j>>8)&0xFF;
					ctb[2]=k&0xFF;
					ctb[3]=(k>>8)&0xFF;
					csm=rpza_blkmap1;

					j=ctx->pat256[*cs++];
					ctb[4]=csm[(j>>24)&255]; ctb[5]=csm[(j>>16)&255];
					ctb[6]=csm[(j>> 8)&255]; ctb[7]=csm[(j    )&255];
					bgbbtj_rpza_memcpy8(ct, ctb);
					ct+=stride;

					break;
				}
				break;
			case 3:
				switch(i&0xE0)
				{
				case 0xA0:
					j=ctx->pal256[*cs++];
					j=((j&0x7FE0)<<1)|(j&0x1F);

					ctb[0]=j&0xFF;
					ctb[1]=(j>>8)&0xFF;
					ctb[2]=j&0xFF;
					ctb[3]=(j>>8)&0xFF;
					ctb[4]=0; ctb[5]=0;
					ctb[6]=0; ctb[7]=0;

					n1=(i&31)+1;
					for(i=0; i<n1; i++)
					{
						bgbbtj_rpza_memcpy8(ct, ctb);
						ct+=stride;
					}
					break;
				case 0xC0:
					j=ctx->pal256[cs[0]];
					k=ctx->pal256[cs[1]];
					cs+=2;
					j=((j&0x7FE0)<<1)|(j&0x1F);
					k=((k&0x7FE0)<<1)|(k&0x1F);

					ctb[0]=j&0xFF;
					ctb[1]=(j>>8)&0xFF;
					ctb[2]=k&0xFF;
					ctb[3]=(k>>8)&0xFF;
					csm=rpza_blkmap1;

					n1=(i&31)+1;
					for(i=0; i<n1; i++)
					{
						j=ctx->pat256[*cs++];
						ctb[4]=csm[(j>>24)&255]; ctb[5]=csm[(j>>16)&255];
						ctb[6]=csm[(j>> 8)&255]; ctb[7]=csm[(j    )&255];
						bgbbtj_rpza_memcpy8(ct, ctb);
						ct+=stride;
					}
					break;
				default:
					cs--;
					j=ctx->pal256[cs[0]];
					k=ctx->pal256[cs[1]];
					cs+=2;
					j=((j&0x7FE0)<<1)|(j&0x1F);
					k=((k&0x7FE0)<<1)|(k&0x1F);

					ctb[0]=j&0xFF;
					ctb[1]=(j>>8)&0xFF;
					ctb[2]=k&0xFF;
					ctb[3]=(k>>8)&0xFF;
					csm=rpza_blkmap1;

					j=ctx->pat256[*cs++];
					ctb[4]=csm[(j>>24)&255]; ctb[5]=csm[(j>>16)&255];
					ctb[6]=csm[(j>> 8)&255]; ctb[7]=csm[(j    )&255];
					bgbbtj_rpza_memcpy8(ct, ctb);
					ct+=stride;

					break;
				}
				break;

			default:
				break;
			}
			mode=nextmode;
			break;
		}
	}

context:

this is the primary image decoding loop from one of my video codecs (BTIC1C).

 

it started out simpler (when the format had a few less features) but is getting a bit more hairy as time goes on.

part of the hair is due to the recent addition of modality in terms of the block-encoding (optional cheaper / lower-quality blocks).

 

side note: I actually have a small army of specialized codecs at this point, it is actually starting to get a bit silly...

each is slightly different though with different trade-offs.

 

 

OTOH, some big switches have appeared elsewhere, but have generally been big flat switches, rather than nested ones.

 

