cr88192

A Wild switch() Appears!

11 posts in this topic

@Nypyren:

When I wrote an x86 disassembler, and later a userland-only interpreter/emulator, I used listing tables (basically the same tables as used by my assembler). The listings were held in a text file using a notation derived from the Intel docs; during the build process a tool converted this into C source files and matching headers, so parts of the assembler and disassembler/emulator were procedurally generated.

 

For the interpreter, to speed up decoding, I used a set of index tables (IIRC, 2x 256): one table matched single-byte instructions, fall-back tables matched 2-byte forms, and anything that couldn't be matched in 2 bytes went to a table search (over a list of possibly matching instructions). This made the code a lot less hairy, though lots of big switches could potentially have been faster.

 

Granted, to make interpretation faster, I ended up effectively implementing a CPU cache which converted "traces" of x86 instructions into threaded code (sequences of calls via function pointers). This was later reused by my script-language interpreter (and my script VM's JIT uses a hybrid strategy, mixing direct instruction sequences with calls back into C land, which is part of the reason for its weak performance).

 

I'll refrain from going too much into the nature of, and issues with, trying to get my script VM to go faster (summary: most of the time goes into the C->script calls anyway, and there is no "good" way to optimize this short of making the C-side VM interface behave more like COM+ or similar, *, ...).

 

Granted, decoding instructions for my script language does itself use some giant switch() blocks (quick check: the "main one" is about 2000 lines, or about 4x longer than the one from BTIC1C). It mostly just serves to unpack each instruction into a structure and pick the correct function pointer to execute it. However, I consider it "cleaner" and less "wild", mostly because it is better organized and is a single flat-list switch, versus an unwieldy mess of nested switches implementing a small state machine.

 

Most of the actual "execution logic" on the C side of the VM then consists of short functions, often doing a single operation (such as adding two integers and storing the result, ...).

 

 

*: Basically, the ability to get C-side interfaces for various script-language classes, and then make more direct calls into the compiled script code, versus calling via variable-argument C functions which waste clock cycles doing things like copying function arguments from a va_list based on a signature string, ... As-is, this actually outweighs the cost of the naive JIT output, since most of the script functions/methods are fairly short.

Edited by BGB
My previous disassembler worked like that - a large text file that I would parse into a data structure used for disassembly. Unfortunately, my existing data structure wasn't well suited to dealing with SSE/VEX prefixes and was very difficult to debug, so when I updated my disassembler to support SSE/AVX instructions I decided to change to hardcoded switch statements. It's much easier to debug overall, and I can use more clever rules:
 
return ValidSV13(Opcode.VPCMPEQB, Vdq, Hdq, Wdq);
The "ValidSV13" is a special way of dealing with SSE vs. AVX instructions which are almost identical. Most AVX instructions only differ from their SSE counterparts by having one extra operand (a nondestructive output register). However, there are a few different patterns:

(excuse my naming convention - I often called AVX "VEX" in my code instead due to dealing with the VEX prefixes too often)

        private bool Valid(Opcode opcode, params Func<Operand>[] operandDecoders)
        {
            AccountForSIMD66(opcode); // If prefix 66 is present but the opcode is SIMD, revert the operandSize to default!
            
            Operand[] operands = new Operand[operandDecoders.Length];
            for (int i=0; i<operandDecoders.Length; ++i)
               operands[i] = operandDecoders[i](); 
            
            var machineCode = stream.ToArray().Take((int)stream.Position).ToArray();
            
            Instruction = new Instruction(addr, machineCode, opcode, operands);
            
            return true;
        }

        private void AccountForSIMD66(Opcode opcode)
        {
            if (g3 == 0x66 && simd66Opcodes.Contains(opcode))
            {
                g3 = 0;
                ApplyOperandSizePrefixes();
            }
        }

        /// <summary>
        /// Valid, only in non-64-bit modes.
        /// </summary>
        private bool Valid_i64(Opcode opcode, params Func<Operand>[] operandDecoders)
        {
            if (x64)
                return false;
            
            return Valid(opcode, operandDecoders);
        }
        
        /// <summary>
        /// Valid, only in 64-bit mode.
        /// </summary>
        private bool Valid_o64(Opcode opcode, params Func<Operand>[] operandDecoders)
        {
            if (!x64)
                return false;
            
            return Valid(opcode, operandDecoders);
        }
        
        /// <summary>
        /// Valid, in 64-bit mode the 32-bit operand size is forced to 64-bits.
        /// </summary>
        private bool Valid_d64(Opcode opcode, params Func<Operand>[] operandDecoders)
        {
            if (x64 && operandSize == 32)
                operandSize = 64;
            
            return Valid(opcode, operandDecoders);
        }
        
        /// <summary>
        /// Valid, in 64-bit mode all operand sizes are forced to 64-bits.
        /// </summary>
        private bool Valid_f64(Opcode opcode, params Func<Operand>[] operandDecoders)
        {
            if (x64)
                operandSize = 64;
            
            return Valid(opcode, operandDecoders);
        }
        
        // Naming convention for the following functions:
        // SV###.  Each digit indicates the operand that the SSE version takes.  The VEX version takes all arguments.
        
        // For example, SV13 means:
        // SV = SSE/VEX, automapping the opcode from VEX opcode to SSE.
        // 1 = argument 1 is the first SSE operand.
        // 3 = argument 3 is the second SSE operand.
        
        // Functions which take all the same arguments are provided for consistency.
        
        /// <summary>
        /// SSE or VEX both taking the same two operands.
        /// Auto-maps the VEX opcode to SSE when SSE is used.
        /// </summary>
        private bool ValidSV12(Opcode vex, Func<Operand> v1, Func<Operand> v2)
        {
            return vexPresent ? Valid(vex, v1, v2)
                              : Valid(VEXtoSSE[vex], v1, v2);
        }
        
        /// <summary>
        /// SSE or VEX both taking the same three operands.
        /// Auto-maps the VEX opcode to SSE when SSE is used.
        /// </summary>
        private bool ValidSV123(Opcode vex, Func<Operand> v1, Func<Operand> v2, Func<Operand> v3)
        {
            return vexPresent ? Valid(vex, v1, v2, v3)
                              : Valid(VEXtoSSE[vex], v1, v2, v3);
        }
        
        /// <summary>
        /// SSE with operands 1 and 3 or VEX with all three.
        /// Auto-maps the VEX opcode to SSE when SSE is used.
        /// </summary>
        private bool ValidSV13(Opcode vex, Func<Operand> v1, Func<Operand> v2, Func<Operand> v3)
        {
            return vexPresent ? Valid(vex, v1, v2, v3)
                              : Valid(VEXtoSSE[vex], v1, v3);
        }

        /// <summary>
        /// SSE with operands 2 and 3 or VEX with all three.
        /// Auto-maps the VEX opcode to SSE when SSE is used.
        /// </summary>
        private bool ValidSV23(Opcode vex, Func<Operand> v1, Func<Operand> v2, Func<Operand> v3)
        {
            return vexPresent ? Valid(vex, v1, v2, v3)
                              : Valid(VEXtoSSE[vex], v2, v3);
        }
        
        /// <summary>
        /// SSE with operands 1, 3 and 4 or VEX with all four.
        /// Auto-maps the VEX opcode to SSE when SSE is used.
        /// </summary>
        private bool ValidSV134(Opcode vex, Func<Operand> v1, Func<Operand> v2, Func<Operand> v3, Func<Operand> v4)
        {
            return vexPresent ? Valid(vex, v1, v2, v3, v4)
                              : Valid(VEXtoSSE[vex], v1, v3, v4);
        }

@Nypyren:

I still use the listing-based strategy with the SSE, XOP, and AVX stuff, but at one point I did have to tweak things to allow opcodes with more than 3 operands (IIRC, I increased the limit to 7).

 

I also had to extend the notation some to represent the rules for the added prefixes (there was a little hair WRT these prefixes, and neither Intel nor AMD stuck with a straightforward plain-ASCII notation, so another notation was devised). Actually, the listing notation also includes special characters for where the various prefixes go, ...

 

 

My interpreter didn't use SSE, FWIW, but was limited to a more 486-like subset, namely basic integer operations and the x87 FPU (emulated mostly via 64-bit integer math).

 

FWIW: at this point, this is the only really free/unencumbered form of x86, as having MMX or SSE support or similar risks stepping on patents (so, for example, a CPU vendor could freely make a chip using a 486-like version of the ISA, but would have to license things from Intel and AMD if they wanted things like SSE and x86-64).

 

 

I think I had planned on adding emulated MMX and SSE, but didn't get to it, and the interpreter turned out to be nowhere near as useful as originally imagined. At the time I had figured x86 could be used as an IL, but then I came to realize: x86 doesn't really make for a good IL.


...x86 doesn't really make for a good IL.


I wholeheartedly agree. The x86 encoding's main redeeming feature over common RISC encodings is that the flexible encoding length allows for much more space-efficient code (MOVs are typically 2 bytes, immediate operands don't require two separate hi/lo instructions, and common operations like push/pop are encoded in 1 byte). But by every other consideration it's WAY too complicated.

My decompiler uses an IL with 15 instructions (5 of which are for the SCAS/MOVS/etc. instructions, since those are almost always part of an inlined strlen/strcpy/memcpy and are much easier to detect if I don't convert them to full loops in the IL).

 

...x86 doesn't really make for a good IL.

I wholeheartedly agree. [...]

 

 

 

The main problem I had with trying to use x86 as an IL (for an interpreter) was that it ties too many architecture details into how the instructions work, so it is pretty much "nearly impossible" to gloss over things like the pointer size, or whether registers are naturally 32 or 64 bits, ... while still keeping it binary-compatible with native x86.

 

The result ends up looking like a sub-program operating within the main program, needing contortions to get data or calls into or out of the interpreter, which wasn't really worthwhile. It could work, maybe, but it compared poorly with just using a script language compiled to a bytecode (and leaving most other things to native code).

 

 

Meanwhile, a bytecode can be designed to gloss over a lot of architectural details (like how big the native pointer or integer is, how exactly various values are represented, ...).

 

Most of my bytecode formats have generally been stack-based (for example, the current script-VM bytecode is a statically-typed stack machine). It has a lot of instructions, but is in some ways "rather CISC".

 

I have considered moving over to 3AC (*) a few times, but the level of work involved (it would require re-implementing much of the VM core) has thus far prevented me from doing so.

 

Some early tests (with a small prototype interpreter loop and hand-crafted instruction sequences) showed some rather promising performance: I was getting within 9x of C with just an interpreter. My estimate is that a JIT for it could likely produce performance competitive with C.

 

Besides the effort thus far killing off most attempts to make the move, there has been some conflict over the best way to get from the ASTs to 3AC (whether to first compile to the existing bytecode and then convert that to 3AC and run it through an optimizer, or to write a new upper end and compile directly from the ASTs to 3AC), ...

 

 

*: Sort of like a RISC, but with variable-length instructions.

Most of the 3AC bytecode wouldn't be all that notable.

This sort of gives the basic idea:

http://en.wikipedia.org/wiki/Three_address_code

 

The logical representation of the instructions was an ASM-like format (*2), but the actual instructions generally consisted of a prefix+opcode pair followed by any operands (the prefix encodes the value type and operand layout, with the opcode saying which logical operation to perform).

 

*2:

opcode.type dest, srca, srcb

opcode.type dest, src, immed

...


OP

 

Can't help much - irregular logic code is just irregular...

 

Possibly use a C macro for the stuffing of ctb[0]..[3] to reduce it to one line, and another for ctb[4]..[7].


OP

 

Can't help much - irregular logic code is just irregular...

 

Possibly use a C macro for the stuffing of ctb[0]..[3] to reduce it to one line, and another for ctb[4]..[7].

 

 

Yeah, it is possible...

 

A lot of code is written fairly quickly and so tends to develop repeating patterns and some verbosity.

 

Some things are left expanded out in case I see a way to make them faster; putting things in macros can sometimes hide potential optimizations, or make issues for looking at and stepping through stuff in the debugger.

 

Though, for example, "*_memcpy8()" is actually a macro in this case (it chooses a target-specific mechanism to copy 8 bytes).


 

OP

[...]

though, for example, "*_memcpy8()" is actually a macro in this case (chooses a target-specific mechanism to copy 8 bytes).

 

I looked at it for a while, looking for processing commonality, and that was the closest/easiest/clearest thing to shrink the long code length down a bit.

 

Those 2 macros I mention seemed to be simple patterns reused more than a few times, and they can just stuff 4 given parameters (the assignment equations) into the 0,1,2,3 and 4,5,6,7 offsets of 4 consecutive ctb[] array slots - it could shrink the function length by some significant amount. (I usually use ALL-uppercase for macro names to make them stand out clearly, but that's just my style.) I'd make the macro name short so you won't have the params running off the right side (I also indent only 3 blanks, so I can fit quite long one-line macro param sets).

 

 

 

The other thing I notice (I think I've done something like this code at one time) is to document the mode data formats (words, bytes, bits) and command codes right in with the code, at the spots that carry out each variation (and maybe even name the encoding variations to make it more mnemonic).

 

Also, I'm not sure of the effects on flow clarity, but those 'modes' --- can they be broken out into a separate function for each mode, to get rid of some of the nesting?

 

Again, it's quite irregular, with very few common processing variations, so there's not much to do but spell most of it out linearly like that (I don't really see that going away).


 

 

OP

[...]

though, for example, "*_memcpy8()" is actually a macro in this case (chooses a target-specific mechanism to copy 8 bytes).

 

 
I looked at it for a while, looking for processing commonality, and that was the closest/easiest/clearest thing to shrink the long code length down a bit.
 
Those 2 macros I mention seemed to be simple patterns reused more than a few times, and they can just stuff 4 given parameters (the assignment equations) into the 0,1,2,3 and 4,5,6,7 offsets of 4 consecutive ctb[] array slots - it could shrink the function length by some significant amount. (I usually use ALL-uppercase for macro names to make them stand out clearly, but that's just my style.) I'd make the macro name short so you won't have the params running off the right side (I also indent only 3 blanks, so I can fit quite long one-line macro param sets).

 

 
It is possible; I may consider it.

Many of them are for things like filling in word values, which could be done in macros without obscuring the behavior.

Macros are usually, but not always, uppercase. In this case it is lowercase because it directly replaced the use of memcpy calls, mostly because early on in testing these memcpy calls were eating a lot of time, and MSVC can be a little "naive" about things like this sometimes.

 

Sometimes whether or not the macro is uppercase depends on whether or not it is intended to be treated mostly as if it were a normal function call; making function names all lowercase is also special in my conventions, as normal function names are FirstLettersCapital.

 

The other thing I notice (I think I've done something like this code at one time) is to document the mode data formats (words, bytes, bits) and command codes right in with the code, at the spots that carry out each variation (and maybe even name the encoding variations to make it more mnemonic).
 
Also, I'm not sure of the effects on flow clarity, but those 'modes' --- can they be broken out into a separate function for each mode, to get rid of some of the nesting?
 
Again, it's quite irregular, with very few common processing variations, so there's not much to do but spell most of it out linearly like that (I don't really see that going away).

the basic format is documented on my wiki:

http://cr88192.dyndns.org:8080/wiki/index.php/BTIC1C

 

I may consider reworking the organization of the loop some anyways, mostly as I had recently been thinking that some sort of motion compensation is needed (basically a sort of skip+translate thing), which will require some modest changes.

 

 

What the modes basically do is allow more compact block encodings to be used, at the cost of some color accuracy.

For example, there is a mode where 8- and 4-bit palette-based colors are used instead of RGB555, and where 1bpp and 0.5bpp interpolation is used (in the 0.5bpp interpolation mode, blocks are only represented "approximately", via tables of block patterns).

 

Effectively, the modes change the meaning and format of the commands which follow them, either for a single command or until the next time the mode is changed.

 

I had considered breaking it up into multiple functions a few times, but had put it off previously, as that would have made the code a little longer and also required moving more of the state into the context structure or similar (as opposed to keeping it in local variables).

 

 

 

I am also idly thinking about options, between:

implementing a new codec (which I am calling BTIC1E) to deal with BPTC (BC6H and BC7), which could offer high speed and high quality, but would likely use a high bitrate;

switching to a more generic format for BC6H and BC7 support (using real-time encoding, possibly via BTIC2C, which would cost in terms of performance and image quality);

a newer (possibly hacky) special BTIC1C decoder loop to target them (probably a subset, though this is unlikely to offer a whole lot over DXT);

mostly not bothering with BC6H and BC7 for video (I have been using them some for textures, though my BPTC encoder currently mostly ignores the block modes which use partitions, *).

 

*: I am not sure if there are any good algorithms for dealing with partitioned block modes.

It appears it would be necessary to iterate through all of the possible partitions to find a good one for a given block, which is likely to be slow.

There have been a few ideas for a more direct (partially DCT- and lookup-table-based) strategy, which could be faster, but is likely still a bit slow for use in real-time or load-time conversion.

