New VM (WIP), thoughts?...

Started by
8 comments, last by cr88192 9 years, 11 months ago

(I had debate whether to put this here or in the journal, but I am looking for peoples thoughts on some of this...)

well, here is the how this got going:

for a while, I wanted a VM which could allow extending the reach of C to targets where C is normally unavailable or overly inconvenient (namely web-browsers and Android).

while there is Emscripten for browsers, my initial tests had shown that the size of the output code expands considerably (*), bringing doubts to its realistic viability for moderate/large codebases, more so as I am running a personal-run server and don't exactly have massive upload speeds (they give like 30 Mbps down but 2 Mbps up, but a person could get 25 down / 5 up for "only" $179/month on a 3 year contract... cough...).

while the NDK is available for Android, it has some serious drawbacks, making it not really a great option (which version of ARM does the device run? what if it runs x86? ...). things are nicer if Java/Dalvik can be used here.

*: (EDIT: Removed. Issue turns out to be "not so simple".).

also, recently, I have put my game project on hold, for sake of basically reaching a stage where I have run out of ideas for new functionality and burnt out about dealing always with the same basic issues.

have gone and messed with writing small apps, as tests, to test out various pieces of technology (ex: various transcompilers to JavaScript, ...).

so, decided recently (~ several weeks ago) to start working on a new VM, consisting of several major components:

a C compiler front-end, which compiles C source into the bytecode format;

* based on a pre-existing C compiler frontend of mine, which originally targeted a different IL.

** it also had partial support for C++, so a C++ subset may be possible (will probably lack templates).

** it has spent recent years mostly just being used as a code-processing / glue-code generation tool.

* IR is statically-typed and register based, vaguely Dalvik-like

an interpreter backend, which will be rewritten in the form as needed for each target.

* current (still incomplete/untested) backend is written in Java, but C and Dart or JS backends are planned.

* the plan for the JS backend would be to dynamically compile the bytecode into JS on the client.

** main roadblock: dealing with JS, not sure the best approach to go about debugging in JS.

** Java via Google Web Toolkit is a more likely initial target for browsers ATM.

* it uses a design intended mostly to allow a (relatively) simple interpreter to generate efficient code.

** this basically means a Register-IR

** while stack-machines are simpler overall, a simple stack interpreter will give lackluster performance.

*** getting more speed out of a stack machine means a lot of additional complexity.

** also don't confuse simplicity with smallness

*** N*M cases may make code bulkier, but don't add much to overall implementation complexity.

* bytecode images are likely to be Deflate or maybe LZMA compressed.

the VM design in question here is much lower-level than some of my other VMs, and makes the compiler logic a fair bit more complicated, so this is still the main hold-up at present (can't finish/test the backend without being able to produce code to be able to run on it).

the design is intended to be high-level enough to gloss over ABI differences, and allow some implementation freedom, but this is mostly about all it really does (it otherwise that far above machine-code in a level-of-abstraction sense).

note that, unlike, say, Flash, this will not require any browser plugins or similar, rather the plan is that the VM will itself be sent over to the browser, and then used to transform the code client-side.

this does carry the assumption that the VM will be smaller than the code which runs on it.

this is worrying as, as-is, it means lots of code I have as-of-yet been unable to verify, but I would need to write a bytecode assembler and disassembler to test the backend more directly. will probably need these eventually though.

for those interested, here is a recent-ish working-spec for the bytecode:

http://cr88192.mooo.com:8080/wiki/index.php/FRBC2C

still to be seen if I will be able to get all this to usably complete levels anytime soon.

thoughts?...

Advertisement
I see a moderately detailed bytecode spec, but not a VM...


That aside, I have some reservations about your strategy. First, your bytecode is pretty low-level in a lot of ways (known bit widths, etc.) and looks to have been heavily inspired by stripped-down/cleaned-up x86. This isn't inherently a problem, but it will make your implementation a lot harder than it needs to be. It also locks you in to certain decisions that may be hard to anticipate until it's too late and you're stuck with some weird consequence on some random platform. I generally prefer very high level VM instructions - not to the level of moving bytes and words around, but rather more logical operations, like "perform computation" or "invoke virtual method on object."

My reasoning for this is pretty simple: if your bytecode is abstract, you save in two ways. For your case the big win is in code size. If I can use a single instruction to replace ten instructions, there's at least 9 bytes of savings depending on your encoding. The other major win is that it radically simplifies building both the front-end and the back-end of the compiler, and makes building the VM itself fairly easy by comparison. Worrying about register scheduling, for instance, will make the compiler back-end a lot more complex. If you just hoist yourself up one level of abstraction, you eliminate a lot of the hard work of writing a compiler.

Another consideration that I think merits a lot of careful thought: why are you introducing yet another bytecode/VM setup to the world? If it's purely to learn and have fun, then cool - but be upfront about your motives and don't try and target real applications as your justification for building the software.

On the other hand, if you're genuinely interested in building a system to solve portability issues, you should strongly consider piggybacking on the existing, massively supported projects that have similar aims. At the very least, you should research existing solutions prior to designing your own.

The LLVM ecosystem, and PNaCl in particular, would be great guidance for this project. Personally, this is how I'd do it:
  • Design a method of storing LLVM bitcode that suits your purposes. You might just use the LLVM implementation directly, perhaps
  • Build something that can turn these compact, binary-encoded bit streams back into LLVM IR
  • Write a thin shim that can pipe this IR into either the actual LLVM back-end for native generation, or into an interpreter
  • Modify/adapt existing LLVM IR software interpreter into something that suits your purposes
  • Glue it all together with a simple JS library or something that transports the binary stream (maybe Base64'd), unpacks it into IR, and executes it
This solves your stated problem as well as leaving you with three important advantages: you can interoperate with any other LLVM front-end, so e.g. you can use Clang to compile C++ for your platform; you can optionally use native compilation for speed if you want to pay the hit for embedding the LLVM libraries (which honestly isn't bad); and you leverage existing expertise from the open-source community surrounding LLVM to help with the details and with adoption of your finished product.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Also note that there is already Cling, which is just Clang + the LLVM JIT backend (for the most part).

while there is Emscripten for browsers, my initial tests had shown that the size of the output code expands considerably (*), bringing doubts to its realistic viability for moderate/large codebases,


Unreal and Unity manage just fine. Keep in mind that the output of Emscripten is not compressed. You typically serve Web content with DEFLATE enabled, which compresses text very well. In some cases the compressed asm.js has been shown to be _smaller_ than the equivalent native machine code (I don't know if this is the common case or a special case though).

In any event, the size of your textures/sounds/models/etc. are going to outstrip your binary size in any event if we're talking games. Toss in the traffic from interaction (WebSockets, XHR, etc.) and the size of your compiled binary is the least of your concerns, especially if you properly use the HTML5 application cache.

Sean Middleditch – Game Systems Engineer – Join my team!

@Apoch:

granted, but the choice for a very low-level representation was deliberate:
matches the execution model expected from typical C code (where C is the primary HLL);
is IME fairly effective at allowing going more directly from bytecode to passably efficient output with a "naive backend".


for the Java-based interpreter, much of the complexity thus-far actually is going more into the emulated Virtual-Address-Space, but this is pretty much inescapable I think (most C code assumes a linear address space).

most of the rest is basically a big mass of inner-classes, whole thing thus far being roughly about 9 kLOC.
the core ISA is basically implemented, but not implemented yet is the system-calls and API wrapping, but I will basically need to have the compiler written for a lot of this part.

while, arguably, the value sizes are not ideal, say, for targeting JavaScript, non-fixed sizes would pose a big headache for running C. again, mostly as for better or worse, lots of C code tends to assume certain value sizes. like, you will have a big mess if "int" isn't 32-bits, ... most of this stuff basically has to be emulated anyways on these targets.


I have used a lot of higher-level designs for VMs in the past (typically stack-machines), but these have the drawback of typically leaving more complexity in the backend, either in terms of code complexity or slower execution times (a simple stack-based interpreter is simple, but not particularly fast).

generally it requires (internally) largely going over to a register-machine model or similar anyways to get good speeds out of a threaded-code backend or JIT (typically by "unrolling" the stack over the logical temporaries/registers).

( ADD/CLARIFICATION: I had prioritized being able to have a moderately fast and low-complexity backend over that of having a simple compiler. )

the FRBC model is actually fairly close to that used in my BGBScript VM JIT backend, albeit the BSVM uses a stack-based high-level bytecode.

granted, there is a difference in that FRBC2C allows accessing locals and arguments directly, whereas most of my other VMs would require loading/storing and performing operations only on temporaries. this should cut operation counts by about 2x-3x.

for example:
y=m*x+b;
could be compiled into 2 operations rather than 6.


while x86 emulation would also be an option, my past x86 emulator would be pretty bulky and had pretty bad performance (~ 70x slower than native).

though, the address-space machinery in the Java-based interpreter was at least vaguely similar (though in some ways has more in common with how I did pointer-operations in BGBScript, namely using "pointer objects" for internal operations rather than raw addresses).

ADD 2: also like the BSVM, it breaks code into EBBs/Traces and uses Call-Threaded-Code combined with a trampoline-loop (as opposed to "loop-and-switch" or similar, which is usually a bit slower IME). some aspects of the bytecode were also designed around the behavior of the algorithms which do the trace-building (mostly to allow the trace-builder to be single-pass).


as for LLVM and PNaCl, I am aware of them.

PNaCl is a drawback mostly in that it is browser-specific, and I wanted to be able to support non-Chrome browsers, leaving pretty much JavaScript as the primary target (either directly, or via something like Google Web Toolkit).

I couldn't think of a good way to fit LLVM in with all this, and it wasn't an ideal fit for what I was doing, so I didn't bother, and I also had most of a C compiler front-end available anyways (from a past project trying to use C as a script-language, turned out C wasn't a great option for scripts...).

I had also looked at the Quake3 C VM, but noted that it would need a bit of modification to fully support C (they were essentially running a C subset), and also I would need to rewrite most of it anyways to not be stuck with GPL.

ADD: as for "why?":

partly because I am bored / feel like it, and partly for personal use (probably mostly for my own code).


or such...

Also note that there is already Cling, which is just Clang + the LLVM JIT backend (for the most part).


work well on Android + web-browsers?

while there is Emscripten for browsers, my initial tests had shown that the size of the output code expands considerably (*), bringing doubts to its realistic viability for moderate/large codebases,


Unreal and Unity manage just fine. Keep in mind that the output of Emscripten is not compressed. You typically serve Web content with DEFLATE enabled, which compresses text very well. In some cases the compressed asm.js has been shown to be _smaller_ than the equivalent native machine code (I don't know if this is the common case or a special case though).

In any event, the size of your textures/sounds/models/etc. are going to outstrip your binary size in any event if we're talking games. Toss in the traffic from interaction (WebSockets, XHR, etc.) and the size of your compiled binary is the least of your concerns, especially if you properly use the HTML5 application cache.


could be...

though, as-is, I have most of my assets compressed as well (most textures using a custom JPEG variant, and audio also uses a custom lossy codec).

the release-assets in their original formats (PNG, WAV, ...) are closer to about 400MB, but lossy compress to around 80MB.

the development-assets directory is closer to around 11GB (though includes lots of working data and data that is not currently included in the final version, ...).


the last release version of my game project was around 132MB, with about 80MB of this being data, and 52MB for code and other stuff (while Deflate-compressed).
text usually compresses to about 10%-30% its original size IME.

uncompressed, this is about 30MB for binaries (EXE/DLL), 27MB for C source+headers (21 MB C, 6 MB H).
project size is approx 880 kLOC.

( EDIT / Removed: prior observations of significant expansion in a few early tests. )

this was in contrast to GWT and Google Dart typically seemed to have a much smaller amount of code expansion for their JS output (though I lack any sufficiently-large globs of Java to throw at GWT or similar to really find out).

so, using Java (via GWT) or Dart would probably be mostly a non-issue.

the hope had been mostly to try generating most of the ASM.js client-side.


could look into it more...
was mostly just messing with it for now mostly due to lacking much better to do...

ADD / EDIT:

more testing with Emscripten has shown that the relation between input and output file sizes is apparently far from a simple matter, as I have gotten between significant code expansion, and little expansion, depending on which code is tested...

example of very little expansion: SQLite, for which the JS output is nearly the same size as the C input.

example of more significant expansion: my codec library. granted, it consists almost entirely of micro-optimized fixed-point arithmetic, bit-twiddly, and pointer arithmetic, and a fair bit of macros as well (Self-Correction: memory error, it was only ~ 10x bigger, not 100x bigger...).

ADD 2:

looking into it a bit more, I may be better off using C and Emscripten for compiling the FRBC VM backend (as opposed to the existing Java-based interpreter), and then maybe consider compilation to ASM.js from within Emscripten (apparently, this is plenty doable...).

advantages:

can leverage Emscripten's existing address-space and API wrappers (less cross-language issues);

should probably (hopefully) perform better than an interpreter written in Java (and compiled to JS).

still needs further investigation.

would probably implement (as-needed):

* basic threaded-code interpreter;

* ASM.js JIT (for Emscripten);

* x86 (and maybe x86-64) JIT (for native x86).

What we did was write a LLVM compiler plugin for VS. You could probably skip this step and use CLang etc.

Then we wrote a platform specific app to handle IO. We actually used OpenKode as a spec and built our app around that.

This worked very, very well.

The only issue with it is we required executable memory. The basic idea is ...

Run app

App loads game, translates to native machine code and links in the hooks to system IO

Game runs from memory

Works exceptionally well in web browsers, Android, Windows (all versions), Linux.... You have a single game binary for all platforms yet still have native speed.

Doesn't work at all on IOS

Apple decided that executable memory was a bad thing and nasty game developers could not have any.

So if you are thinking about supporting IOS you need to keep that in mind.

What we did was write a LLVM compiler plugin for VS. You could probably skip this step and use CLang etc.

Then we wrote a platform specific app to handle IO. We actually used OpenKode as a spec and built our app around that.

This worked very, very well.

The only issue with it is we required executable memory. The basic idea is ...

Run app

App loads game, translates to native machine code and links in the hooks to system IO

Game runs from memory

Works exceptionally well in web browsers, Android, Windows (all versions), Linux.... You have a single game binary for all platforms yet still have native speed.

Doesn't work at all on IOS

Apple decided that executable memory was a bad thing and nasty game developers could not have any.

So if you are thinking about supporting IOS you need to keep that in mind.

yeah.

part of the issue I am facing is that a few targets are kind of a pain to deploy C onto, hence why a VM to run the C code starts looking like more of an option (less need to deal with an assortment of wacky build-tools, multiple CPU architectures, ...).

(can't really support iOS though, don't really have money to afford Apple HW...).

as for RWX memory, yep.

not sure about iOS (never developed on it), but I am left wondering if the double-mapping trick would work (this being a workaround for targets which don't allow RWX memory, but only RW and RX, where one basically maps the memory twice using the RW mapping to write into the RX mapping), or if they went further and disallowed any user-created executable memory.

in the latter case, probably one will need an interpreter fallback.

for interpreters, I have generally had good luck with more or less unpacking the bytecode into structs and function pointers, then typically using manually-unrolled loops (themselves given via function pointers), to execute them.

like, with a "loop-and-switch", it seems hard to break much below the 100x mark (100x slower than native C, *), but with the function-pointer tricks, I have got it down to around 10x for some experimental interpreters (and around 40x for some in-use interpreters).

but, yeah, getting below 10x is pretty hard IME absent a JIT.

for a JIT, the last 2x-3x is the hard part (one is then competing with the C compilers' optimizer).

to a large degree, I haven't really gone up this final ramp.

*: at around 100x, one usually finds that the "while()" loop and "switch()" consume nearly the entire running time, earlier on I had called this the "switch limit" thinking this to be a theoretical limit for interpreter speed (until noting all the craziness I could do with function-pointers).

now, the limit is how quickly I can run things like:


	op=ops[ 0]; op->run(frm, op);	op=ops[ 1]; op->run(frm, op);
	op=ops[ 2]; op->run(frm, op);	op=ops[ 3]; op->run(frm, op);
	op=ops[ 4]; op->run(frm, op);	op=ops[ 5]; op->run(frm, op);
	op=ops[ 6]; op->run(frm, op);	op=ops[ 7]; op->run(frm, op);
	op=ops[ 8]; op->run(frm, op);	op=ops[ 9]; op->run(frm, op);
	op=ops[10]; op->run(frm, op);	op=ops[11]; op->run(frm, op);
	op=ops[12]; op->run(frm, op);	op=ops[13]; op->run(frm, op);
	op=ops[14]; op->run(frm, op);	op=ops[15]; op->run(frm, op);
	...

but, then one is still left back to trying to figure out ways to cut down how many operations are needed to accomplish a given task, which is part of the reason for some of the funkiness in my bytecode (basically, trying to match common-cases in C code).

so, for example, while a register-IR doesn't beat out a stack-machine either in terms of code-density nor necessarily in time-cost-per-operation, it can beat it out in terms of needing fewer operations-per-statement (by addressing variables more directly, giving more freedom to the front-end compiler, ...).

typically, there is a process which decodes the bytecode and tries to build a "trace-graph", and part of the overly low-level design is related to trying to keep this logic relatively simple in this case (for example: minimizing need for analysis or pattern-matching, in favor of linear unpacking, and mostly exposing the "guts" of the interpreter at the bytecode level).

also, partly because the VM would be distributed along with the code that runs on it (as opposed to distributing them separately), generality is less critical in this case.

( ADD: for an x86 JIT, typically each VM operation will map to 3-5 x86 instructions, but this can be cut down a bit if the JIT has the logic needed to cache things in registers...

a "naive JIT" may not bother, so say:

BINOP.I add, r0, l1, l2

BINOP.I add, l4, r0, l1

might become, say:

;;first op

mov eax, [ebp-12]

mov ecx, [ebp-16]

add eax, ecx

mov [ebp-48], eax

;;second op

mov eax, [ebp-48]

mov ecx, [ebp-12]

add eax, ecx

mov [ebp-24], eax

whereas, a slightly less stupid JIT would probably notice that some of this sort of stuff is redundant.

a naive JIT though has an advantage of being smaller and simpler. )

Doesn't work at all on IOS

Apple decided that executable memory was a bad thing and nasty game developers could not have any.

This is very common on 'closed platforms'... I've worked on a bunch of games that were written in Lua, but then *interpreted* by LuaJIT (because we had to disable the JIT part).
In many cases, it's still only ~5x slower than native though ;)

Keep in mind that depending on the closed platform, running an embedded interpreter might also be considered "evil." There was some FUD back in the early days of the App Store that Apple forbade the use of interpreters/JITs/etc. that downloaded code from outside sources. I don't know firsthand if that's true, but be prepared for some platforms to be off-limits because of draconian garbage like that.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Doesn't work at all on IOS

Apple decided that executable memory was a bad thing and nasty game developers could not have any.

This is very common on 'closed platforms'... I've worked on a bunch of games that were written in Lua, but then *interpreted* by LuaJIT (because we had to disable the JIT part).
In many cases, it's still only ~5x slower than native though ;)

yeah, this is one good point about having an interpreter fallback.

may be more portable, and doesn't leave things SOL if for some reason JIT is not really a possibility.

though 5x does seem a little fast for an interpreter, but I guess it depends a lot on what kinds of code one is testing with, and maybe also the specifics of the hardware one is running on, ...

my figures were mostly from past tests involving calculation-centric tests on AMD x86 hardware with prior interpreters.

I don't yet have any real information on the current interpreter though (now being rewritten in C).

ADD: very initial test (artificial, randomized counter-opcodes), interpreter loop pulls off around 130-200 Mips (~ 17-26 clock cycles per opcode).

could be better, could be worse... will need to get it more complete to be able to determine its C-relative performance.

ADD 2: Simple NOPs: 360 Mips (~ 9 cycles per operation), so about a 2x speed difference is due to each operation incrementing a counter in the former case ( "frm->ctx->stat_n_op++;" ), vs doing nothing (and accumulating at the end of the current trace).

it is 107 Mips (~ 32 cycles / operation) with randomized 3-address arithmetic operators (ex: "frm->reg[op->d].i=frm->reg[op->s].i + frm->reg[op->t].i;" ).

operation payload seems to be a notable factor in this case.

This topic is closed to new replies.

Advertisement