AngelScript JIT/AOT implementation details (technical)

quarnster · 2009-07-24T18:33:19

Hello!

I know the topic of JITs has come up before, but I couldn't find any details about actually implementing one. This is mostly directed to you, Andreas, but since this information will interest others who want to add JIT/AOT compilation for their processor of choice, I chose to post it here (and in English!) rather than emailing you directly.

I started playing around with AOT compiling for AS as a fun spare-time project to get to know the ARM architecture a bit better. But before I get too deep into the actual implementation (and have to rewrite the whole shebang to fit your ideas), I figured I'd ask what your thoughts are on JIT/AOT compiling and how it would all hook into the AngelScript engine.

I think writing a JIT/AOT compiler is a big project to take on unless you can split it into chunks and take it one step at a time. As such, one of the features I'm interested in is being able to mix native machine code with AS bytecode and switch between them. Thus I suggest that a new AS bytecode instruction, "JIT", is implemented (or some other name, but I'll use that one from now on).

I've not thought this through 100%, so the actual implementation will likely change during the course of the work. But my current idea is that the "JIT" bytecode would:

- if JIT is enabled:
  - save l_bc, l_sp, l_fp to the context
  - call "ExecuteJIT(this, machineCode)", where "this" is the asCContext and "machineCode" is a pointer to the array containing the native code to execute (where this code buffer is actually stored is to be decided...)
  - load l_bc, l_sp, l_fp from the context
- else:
  - nop

The JIT/AOT compiler does not replace the implemented bytecodes; it just injects the JIT bytecode before the section that it supports natively (see further down for an actual example). This allows us to disable the native code at runtime and treat the JIT instruction as a nop.
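The dispatch described above can be sketched in C++ roughly as follows. This is only an illustration under assumptions: `Context`, `Execute`, `OP_ADD_CONST`, `NativeAddFive` and the two-word instruction encoding are hypothetical stand-ins, not the real asCContext or the real AngelScript bytecode format.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for asCContext; the real class holds far more state.
struct Context {
    const uint32_t* bc;  // bytecode pointer (l_bc)
    int32_t* sp;         // stack pointer (l_sp)
    int32_t* fp;         // frame pointer (l_fp)
    bool jitEnabled;
};

// A native block receives the context and must advance ctx->bc past the
// JIT instruction and every bytecode it implemented before returning.
using JitFunction = void (*)(Context*);

// Toy instruction set: every instruction is one opcode word plus one argument word.
enum Opcode : uint32_t { OP_JIT, OP_ADD_CONST, OP_RET };

void Execute(Context& ctx, const std::vector<JitFunction>& jitBuffers) {
    const uint32_t* l_bc = ctx.bc;
    int32_t* l_sp = ctx.sp;
    int32_t* l_fp = ctx.fp;
    for (;;) {
        switch (*l_bc) {
        case OP_JIT:
            if (ctx.jitEnabled) {
                // Save the interpreter's local registers to the context...
                ctx.bc = l_bc; ctx.sp = l_sp; ctx.fp = l_fp;
                assert(l_bc[1] < jitBuffers.size());
                jitBuffers[l_bc[1]](&ctx);
                // ...and reload them, since the native code may have changed them.
                l_bc = ctx.bc; l_sp = ctx.sp; l_fp = ctx.fp;
            } else {
                l_bc += 2;  // JIT disabled: behave as a nop
            }
            break;
        case OP_ADD_CONST:
            l_fp[0] += static_cast<int32_t>(l_bc[1]);
            l_bc += 2;
            break;
        case OP_RET:
            ctx.bc = l_bc; ctx.sp = l_sp; ctx.fp = l_fp;
            return;
        default:
            assert(false && "unknown opcode");
            return;
        }
    }
}

// Native replacement for "OP_ADD_CONST 5": skips the JIT instruction (2 words)
// and the instruction it implements (2 words).
void NativeAddFive(Context* ctx) {
    ctx->fp[0] += 5;
    ctx->bc += 4;
}

// Runs the same program with the JIT on or off; both paths must agree.
int32_t RunDemo(bool jitEnabled) {
    const uint32_t program[] = {OP_JIT, 0, OP_ADD_CONST, 5, OP_RET};
    int32_t frame[1] = {10};
    Context ctx = {program, frame, frame, jitEnabled};
    std::vector<JitFunction> buffers;
    buffers.push_back(&NativeAddFive);
    Execute(ctx, buffers);
    return frame[0];
}
```

Calling `RunDemo(true)` and `RunDemo(false)` should give the same result, which is the property that makes it safe to switch the native code off at runtime.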
When the native code does run, exactly what happens to l_bc can be unknown at compile time, so l_bc in the context must be updated by the machine code to skip the AS bytecodes (and their arguments) that were implemented by the JIT/AOT. This could potentially also work when implementing jumps/calls/suspend in native code, although I've not yet figured out exactly how the return/resume would work. But figuring out how to really compile calls into native code is IMO a separate issue, as there are plenty of bytecodes to implement that are not affected by whether script calls are done in bytecode or machine code, so I've not spent any time thinking about this.

In any case, this ability to mix native code and AS bytecode would allow me to implement one AS bytecode at a time, would not break as soon as a new bytecode is added (although if it's a frequently used one it would obviously hurt perf), would still allow for co-routines, and would allow you to disable the JIT at runtime if you want to step through the AS bytecode with a debugger.

So with this new bytecode instruction in mind, let's turn our attention to which part of the code would emit it. As I'm only in the very early testing phase, I added a call into my AStoARM compiler between Optimize and ResolveJumpAddresses in asCByteCode's Finalize method. This is easy and straightforward; however, the biggest problem I can think of offhand is that it will mess with cross-platform loading and saving of the bytecode (when this becomes available). Surely code saved on one platform and loaded on another would want to have the JIT/AOT step redone for maximum perf.

The other alternative I can think of is to save/load a version that does not have any JIT instructions and do the native compilation step as a post-process operation. However, it's important to note that all PC-relative instructions would then have to be patched whenever the new instruction is inserted between, for example, a jump and its destination.
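To make the patching problem concrete, here is a small sketch of inserting one instruction into already-resolved code and fixing up the relative jumps around it. The `Instr` layout, the opcode names and the choice that a jump aimed exactly at the insertion point should land on the new JIT instruction are all assumptions made for the example, not how asCByteCode actually stores things.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum Op { OP_NOP, OP_JMP, OP_JIT };

struct Instr {
    Op op;
    int arg;  // OP_JMP: offset relative to the following instruction
              // OP_JIT: index of the native code buffer
};

// Inserts a JIT instruction before code[pos] and patches every relative jump
// whose source or destination moves because of the insertion.
void InsertJit(std::vector<Instr>& code, std::size_t pos, int bufferIndex) {
    assert(pos <= code.size());
    const long p = static_cast<long>(pos);
    for (std::size_t i = 0; i < code.size(); ++i) {
        if (code[i].op != OP_JMP) continue;
        const long from = static_cast<long>(i) + 1;  // offset base
        const long target = from + code[i].arg;
        // The jump itself moves down by one if it sits at or after pos.
        const long newFrom = from + (static_cast<long>(i) >= p ? 1 : 0);
        // A target strictly after pos moves down by one; a target equal to
        // pos now lands on the JIT instruction, so the native code is entered.
        const long newTarget = target + (target > p ? 1 : 0);
        code[i].arg = static_cast<int>(newTarget - newFrom);
    }
    code.insert(code.begin() + p, Instr{OP_JIT, bufferIndex});
}
```

A shared helper like this is what would keep the JIT itself from having to know which bytecodes are PC-relative.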
As I don't want the JIT to break once a new PC-relative instruction is added, I suggest a common interface for adding the JIT instruction that takes care of this patching, so that the JIT doesn't have to know how to do it itself. The problem I can think of with this solution is that it would have to shuffle memory around when inserting the JIT instruction, although it could keep a buffer of the JIT instructions to insert and not actually insert them until the compiler says it is done. This or something similar is probably a must for writing an actual JIT rather than an AOT. Any other suggestions? For the second alternative, where would the native compiler hook in?

Another interface needed is one for taking ownership of the buffer that is to contain the native machine code. I suggest that the JIT instruction's parameter is an index that identifies which buffer to use, although I don't know exactly whom the JIT/AOT compiler should talk to in order to allocate the buffer and get the index. The script function, the module, the engine? What do you suggest?

I'm sure there was something I forgot, but this has been a long post already. What do you think? Any alternative solutions you've been thinking of?

Disregarding the problems with having the compilation done before ResolveJumpAddresses, and with native machine code buffer management (i.e. I only have one buffer), as a test I have implemented two of AS's bytecode instructions: MULi and ADDi.
The AngelScript source code used in the test was:

    int TestBasic(int a, int b, int c)
    {
        return a + b * c;
    }

Which compiles into:

    Temps: 1

        0   0 *    PUSH     1
    - 3,5 -
        1   1 *    SUSPEND
        2   0      JIT      0            ; Notice the new instruction
        3   1 *    MULi     v1, v-1, v-2
        5   1 *    ADDi     v1, v0, v1
        7   1 *    CpyVtoR4 v1
        8   1 * 0:
        8   0 *    RET      3

With the native code for the JIT-supported section (from MULi up to and including ADDi) being:

    0xe92d4030  stmdb sp!, {r4, r5, lr}   ; Prologue
                                          ; Save return pointer and scratched
                                          ; registers we are required to save on
                                          ; the (native) stack
    0xe5902028  ldr   r2, [r0, #0x28]     ; Load AS's stack frame pointer from the asCContext
    0xe5921004  ldr   r1, [r2, #0x4]      ; Load v-1
    0xe5923008  ldr   r3, [r2, #0x8]      ; Load v-2
    0xe0040391  mul   r4, r1, r3          ; Perform the multiplication
    0xe5925000  ldr   r5, [r2]            ; Load v0
    0xe0854004  add   r4, r5, r4          ; Perform the add

    ; Epilogue
    ; At this point we notice that we don't support the next AS bytecode.
    ; As such we need to flush any changed data and return from the code.
    ;
    ; If on the other hand the next opcode had also been supported, we
    ; would have continued executing even further and would not write any data
    ; back until it is needed, either by running out of native registers to
    ; use, or when it is time to exit.
    0xe5024004  str   r4, [r2, #-0x4]     ; Save v1 to the stack as it changed
    0xe5904020  ldr   r4, [r0, #0x20]     ; Load byteCode pointer from the context
    0xe2844014  add   r4, r4, #0x14       ; Add the number of AS bytecode data to
                                          ; skip (includes the JIT instruction)
    0xe5804020  str   r4, [r0, #0x20]     ; Save back to the context
    0xe8bd8030  ldmia sp!, {r4, r5, pc}   ; Restore scratched registers we had to
                                          ; save (according to the ARM calling
                                          ; convention) and return (by setting
                                          ; pc = lr)

This piece of code has been verified as working. Obviously this is not optimized assembly code for the ARM platform, as multiple successive loads could be done with a single instruction, and the multiply and add could be combined too. But optimizing the generated native code is a completely different topic.
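For readers who don't read ARM assembly, the listing above does roughly what the following C++ does. This is a hand-written model, not generated code: the `Context` struct is a hypothetical stand-in that only names the two asCContext fields the listing touches (at byte offsets 0x20 and 0x28), and the frame indices are derived from the byte offsets in the listing (v0 at fp[0], v1 at fp[-1], v-1 at fp[1], v-2 at fp[2]).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two asCContext fields used by the listing.
struct Context {
    const uint32_t* byteCode;    // loaded from [r0, #0x20]
    int32_t* stackFramePointer;  // loaded from [r0, #0x28]
};

// Model of the native block covering "MULi v1, v-1, v-2" and "ADDi v1, v0, v1".
void NativeMulAdd(Context* ctx) {
    int32_t* fp = ctx->stackFramePointer;
    const int32_t r1 = fp[1];  // v-1, at [fp, #0x4]
    const int32_t r3 = fp[2];  // v-2, at [fp, #0x8]
    int32_t r4 = r1 * r3;      // MULi
    const int32_t r5 = fp[0];  // v0
    r4 = r5 + r4;              // ADDi
    fp[-1] = r4;               // flush v1, at [fp, #-0x4]
    // Skip JIT (1 dword) + MULi (2 dwords) + ADDi (2 dwords) = 0x14 bytes.
    ctx->byteCode += 5;
}
```

With v0 = 2, v-1 = 3 and v-2 = 4, the block leaves 2 + 3 * 4 = 14 in v1 and the bytecode pointer on CpyVtoR4, where the interpreter takes over again.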

Started by quarnster June 14, 2009 09:21 AM

48 comments, last by quarnster 14 years, 9 months ago

quarnster

266

Author

July 21, 2009 05:19 AM

Letting the jit manipulate the callstack doesn't need to be that hard; it could just call into asCContext::CallScriptFunction directly. Whether it is worth going to that extent I don't know, but it is definitely something to try out.

Compiling nanojit into an executable produces a 168 KB file on win32 x86 using MSVC, and around 100 KB on win32 arm. Compiling libjit into a dll using mingw32 (I won't go through the effort of trying to make it compile with msvc until maybe I decide that libjit is for me) creates a dll that's around 1 MB on both win32 xp and arm, but ~400 KB on my ppc mac.

I haven't decided which of the two I am going to use yet, so I'm just going to go ahead and create a small jit capable of running a simple benchmark, and compare the two in terms of ease of use, the quality of the code they output, and of course the execution speed. I'll report back with a more detailed pro/con table of the two systems once I've got them both tested.

BTW, for libjit I recommend using the head git version (http://git.savannah.gnu.org/cgit/dotgnu-pnet/libjit.git), as all the other versions have given me nothing but trouble.

quarnster

266

Author

July 21, 2009 03:38 PM

Ok, I've decided that nanojit is in too much flux at the moment for me to start using it. I don't want to commit to a moving target and have to change all of my code when updating to the latest nanojit to get the latest bug fixes.

There's Adobe's version here: http://hg.mozilla.org/tamarin-redux, Mozilla's version here: http://hg.mozilla.org/tracemonkey, and then the merge that is supposed to happen between them here (which doesn't compile currently): https://developer.mozilla.org/en/NanojitMerge.

If someone wants to experiment with nanojit I would recommend Mozilla's version, as it doesn't have as many dependencies on other code. In fact someone ripped a version of Mozilla's nanojit and all its dependencies out and placed it here for easy access: http://github.com/doublec/nanojit/tree/master.

Libjit it is for me then.

quarnster

266

Author

July 23, 2009 10:46 AM

WitchLord, is the type of a temp "register" (v3 for example) constant over the whole function or can it change mid-run from say int to float or 32 bit to 64 bit?

If it can change, the problem arises when saving the value back to the VM from a native register: the value could live in either an integer or a float register, for example, and there might be an unimplemented instruction or a suspend in between the two uses.

I guess one way to manage that is to keep a separate value (possibly packed) indicating which type the register currently is in.

Loading is fine as we can always load the value into each one of the separate registers needed to represent it.
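A packed type tag of the kind mentioned above could look something like this. It is only a sketch: the `TempType` values, the two bits per slot and the limit of 16 slots are arbitrary assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical set of types a temp slot may currently hold.
enum class TempType : uint32_t { Unknown = 0, Int32 = 1, Float32 = 2, Int64 = 3 };

// Packs the current type of up to 16 temp slots into one 32-bit word,
// two bits per slot, so the write-back code knows which native register
// class holds the live value.
class TempTypeTags {
public:
    void Set(unsigned slot, TempType type) {
        assert(slot < kSlots);
        const unsigned shift = slot * 2;
        bits_ = (bits_ & ~(uint32_t{3} << shift)) |
                (static_cast<uint32_t>(type) << shift);
    }

    TempType Get(unsigned slot) const {
        assert(slot < kSlots);
        return static_cast<TempType>((bits_ >> (slot * 2)) & 3u);
    }

private:
    static constexpr unsigned kSlots = 16;
    uint32_t bits_ = 0;
};
```

The native code would call `Set` whenever it writes a temp, and the exit path would consult `Get` to pick between storing from the integer or the float register.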

Btw, as a side note, I was wrong about the size of the libjit dll, as I had compiled it with debug info enabled; it is 327 KB on xp. I had not made that mistake with LLVM though...

WitchLord

4,834

July 23, 2009 04:46 PM

The stack space for temporary variables is reused for different primitive types, so you can't rely on a temp being an integer or float forever.

The type of the value should preferably be inferred from the use of the value, rather than having to be encoded somewhere.

I'll think about what can be done to aid the construction of a JIT compiler with regards to knowing the type and lifetime of temporary variables.

AngelCode.com - game development and more - Reference DB - game developer references
AngelScript - free scripting library - BMFont - free bitmap font generator - Tower - free puzzle game

WitchLord

4,834

July 24, 2009 11:54 AM

I could add a couple of byte codes to give hints to the JIT compiler about the type and scope of variables, e.g.

 asBC_RESERVE var, typeid
 asBC_DISCARD var

These would only be added in case the application has indicated the intention to use JIT compilation, so it won't affect other applications.

Would that work for you?
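As a sketch of how a JIT compiler might consume such hints, the following pass pairs each RESERVE with its DISCARD and records the type and lifetime of every temporary. The opcode names mirror the proposal above, but the `Instr` layout and the `CollectLifetimes` function are hypothetical; nothing here exists in AngelScript as written.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical instruction stream; only the two proposed hints matter here.
enum HintOp { BC_RESERVE, BC_DISCARD, BC_OTHER };

struct Instr {
    HintOp op;
    int var;     // stack slot of the temporary
    int typeId;  // only meaningful for BC_RESERVE
};

struct Lifetime {
    int var;
    int typeId;
    std::size_t begin;  // index of the RESERVE hint
    std::size_t end;    // index of the matching DISCARD hint
};

// Pairs each RESERVE with the next DISCARD of the same slot, so one stack
// slot can yield several lifetimes with different types.
std::vector<Lifetime> CollectLifetimes(const std::vector<Instr>& code) {
    std::map<int, Lifetime> open;
    std::vector<Lifetime> done;
    for (std::size_t i = 0; i < code.size(); ++i) {
        const Instr& in = code[i];
        if (in.op == BC_RESERVE) {
            open[in.var] = Lifetime{in.var, in.typeId, i, i};
        } else if (in.op == BC_DISCARD) {
            std::map<int, Lifetime>::iterator it = open.find(in.var);
            if (it != open.end()) {
                it->second.end = i;
                done.push_back(it->second);
                open.erase(it);
            }
        }
    }
    return done;
}
```

With the lifetimes known up front, a register allocator can tell that the same stack slot holds, say, an int in one expression and a float in the next.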


quarnster

266

Author

July 24, 2009 01:52 PM

Hmm, not so sure about that.

So imagine we use this basic construct for compiled functions:

1) call conv prologue
2) load AS stack variables into native registers
3) jump table for resuming at the correct jitEntry
4) jitentry1 code
jitentry2 code
...
5) save the temp stack variables back to the vm (I believe this can be ignored if we reach a "RET", but must be respected if the function exits for any other reason)
6) Save the value register and other non-temp variables
7) call conv epilogue
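In C++ terms the layout above might be modelled like this, with a switch standing in for the jump table and locals standing in for native registers. Everything here is an assumption made for illustration: the `Context` fields, the two entry points and the toy computation are invented, and a real implementation would emit machine code rather than C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical context: just enough state to show suspend and resume.
struct Context {
    int32_t* fp;            // AS stack frame pointer
    int resumeEntry;        // which jitEntry to resume at
    bool suspendRequested;  // set by the application
    int32_t valueRegister;  // return value of the script function
};

// Returns true when the function reached RET, false when control goes
// back to the VM and the function must be resumed later.
bool CompiledFunction(Context* ctx) {
    // 1) the calling convention prologue is implicit in C++
    // 2) load AS stack variables into "registers"
    int32_t v0 = ctx->fp[0];   // non-temp variable
    int32_t v1 = ctx->fp[-1];  // temporary
    // 3) jump table for resuming at the correct jitEntry
    switch (ctx->resumeEntry) {
    case 0:  // 4) jitEntry 0
        v1 = v0 * 2;
        if (ctx->suspendRequested) {
            ctx->resumeEntry = 1;
            goto saveState;
        }
        // fall through into the next entry
    case 1:  // 4) jitEntry 1
        ctx->valueRegister = v0 + v1;
        return true;  // RET: the temps are dead, so step 5 is skipped
    default:
        assert(false && "invalid resume entry");
        break;
    }
saveState:
    ctx->fp[-1] = v1;  // 5) save temps back to the VM
    ctx->fp[0] = v0;   // 6) save non-temp variables
    return false;      // 7) the epilogue is implicit in C++
}
```

Running it once with `suspendRequested` set and then again with it cleared gives the same value register as a single uninterrupted run, which is the invariant the jump table has to preserve.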

The problematic step is 5, as we could get there from different blocks in the code where the temp stack value is in different registers. Basically it is the same problem that the Phi function solves when merging two blocks in SSA form (http://en.wikipedia.org/wiki/Static_single_assignment_form)... I have yet to find anyone who describes how the Phi function is actually implemented, though.

Maybe the solution is as simple as making the temporaries that can change type be loaded/stored for each code block rather than at the global level.

And as yet another side point when it comes to my library evaluations: libjit produces sub-optimal code for ARM, so it is out of the question for me. Continuing with my own home-rolled ARM code is starting to look better and better for the needs I have, as neither libjit nor nanojit seems to solve the problems I originally hoped they would solve. I've gotten some good ideas from them though.

WitchLord

4,834

July 24, 2009 03:13 PM

I think it will be extremely difficult for you to load the variables from the stack into the registers at global level. Instead I suggest you load the variable into the registers as they are used.

You can eliminate the use of the stack for temporary variables if, and only if, there is no way for control to pass to or from the JIT function during the lifetime of the variable.

While the control is with the VM the application can actually go in and change the values of variables through the debug interface. Of course, you may choose to ignore that with your JIT compiler if your application won't use it.

Fortunately most temporary variables are short lived. They are allocated and freed with each expression, sometimes even sub expressions. The life cycle of most temporary variables is this: allocated -> written to -> read from -> deallocated. Very rarely is a temporary variable read multiple times.

Without the hints of when a temporary variable is allocated and deallocated, all you will see is the space on the stack that they occupy. But in reality this space is used by multiple temporary variables at different times.

Quote:
The problematic step is 5, as we could get there from different blocks in the code where the temp stack value is in different registers. Basically it is the same problem that the Phi function solves when merging two blocks in SSA form (http://en.wikipedia.org/wiki/Static_single_assignment_form)... I have yet to find anyone who describes how the Phi function is actually implemented, though.

From the article I understood that the Phi function is not really a function; instead it is just a hint to tell the compiler that all variables in the argument should occupy the same space. That is, if the variable Y is written to in two distinct branches, the Phi function tells the compiler that both of these new instances should occupy the same space (memory or register).


quarnster

266

Author

July 24, 2009 04:00 PM

Quote:Original post by WitchLord
I think it will be extremely difficult for you to load the variables from the stack into the registers at global level.

Really? Why? This is what I'm already doing, and it has been proven to work with the line callback/suspend. I don't see a problem with loading the most commonly used temp stack variables from the VM at the global scope: if a value can be both a float and an int, it could just be loaded into both native registers.

Quote:Instead I suggest you load the variable into the registers as they are used.

This is exactly what I want to avoid: if I need to load a variable into a register every time it is used, I also need to write it back every time it changes.

As I see it this is only necessary when:
a) We give back control to the VM
b) We run out of registers and need to flush something back to free one (or more) registers up
c) We need to resolve a Phi function

And thus that's what I will be aiming for.
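A lazy write-back scheme along those lines could be sketched as a small register cache with dirty flags. This is a toy model under assumptions: `TempCache` is hypothetical, it models registers as array entries, variable vN is taken to live at fp[-N] as in the earlier listing, and the spill policy (always evict entry 0) is deliberately naive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Caches AS stack variables in "registers" and writes them back only when
// forced to: on an explicit flush (cases a and c) or on a spill (case b).
class TempCache {
public:
    TempCache(int32_t* fp, std::size_t regCount) : fp_(fp), regs_(regCount) {
        assert(regCount > 0);
    }

    int32_t Read(int var) { return Slot(var).value; }

    void Write(int var, int32_t value) {
        Entry& e = Slot(var);  // note: loads the old value even if unused
        e.value = value;
        e.dirty = true;
    }

    // Call before returning control to the VM or at a block merge.
    void FlushAll() {
        for (Entry& e : regs_) Flush(e);
    }

    int StoreCount() const { return stores_; }

private:
    struct Entry {
        bool used = false;
        bool dirty = false;
        int var = 0;
        int32_t value = 0;
    };

    Entry& Slot(int var) {
        for (Entry& e : regs_)
            if (e.used && e.var == var) return e;
        for (Entry& e : regs_)
            if (!e.used) return Load(e, var);
        Flush(regs_[0]);  // out of registers: spill entry 0
        return Load(regs_[0], var);
    }

    Entry& Load(Entry& e, int var) {
        e.used = true;
        e.dirty = false;
        e.var = var;
        e.value = fp_[-var];
        return e;
    }

    void Flush(Entry& e) {
        if (e.used && e.dirty) {
            fp_[-e.var] = e.value;
            e.dirty = false;
            ++stores_;
        }
    }

    int32_t* fp_;
    std::vector<Entry> regs_;
    int stores_ = 0;
};
```

Two writes to v1 followed by one `FlushAll` result in a single store to the VM stack, which is the saving being argued for here.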

Quote:You can eliminate the use of the stack for temporary variables if, and only if, there is no way for control to pass to or from the JIT function during the lifetime of the variable.

I meant that if we reach the RET bytecode during runtime, the temp variables don't have to be written back to the VM, as they are no longer used for anything. All other exits out of the jit function would restore the stack to exactly the state that the original VM would have left it in; otherwise the jit is fundamentally broken.

Quote:Fortunately most temporary variables are short lived. They are allocated and freed with each expression, sometimes even sub expressions. The life cycle of most temporary variables is this: allocated -> written to -> read from -> deallocated. Very rarely is a temporary variable read multiple times.

Which sounds to me like there aren't that many temporary variables around at a time. In other words, there won't be too many register spills as most (if not all) temp variables will fit in native registers.

Quote:Without the hints of when a temporary variable is allocated and deallocated, all you will see is the space on the stack that they occupy. But in reality this space is used by multiple temporary variables at different times.

Indeed, and as this memory is non-volatile while the jit function is executing there's no need to load or store them more than once unless absolutely necessary as the only thing changing the meaning of the temp variables is the jit itself.

WitchLord

4,834

July 24, 2009 05:03 PM

Well, it's really just a hunch of mine. I've never tried writing a JIT compiler so I can't say what the best way of doing it is. However, I still believe that with the hints of when a temp is allocated and freed, you will have a much easier time optimizing things, because most of the time you won't have to load or store the value on the stack at all.

But, you're the one who is writing the JIT compiler, you're the one that knows what you need.


quarnster

266

Author

July 24, 2009 06:33 PM

Quote:Original post by WitchLord
I've never tried writing a JIT compiler so I can't say what the best way of doing it is.

That's a coincidence, neither have I ;)
Up until now that is.

Quote:However, I still believe that with the hints of when a temp is allocated and freed, you will have a much easier time optimizing things, because most of the time you won't have to load or store the value on the stack at all.

After giving this some thought, I think your approach is better if we assume that the jit will need to break out to the VM often. Each jitEntry must then be treated separately when it comes to loading/storing values, but in return it only loads/stores the values actually used in that block.

Globally loading/storing the temp variables is better if we assume that we don't need to break out to the vm as variables shared across jitEntry blocks will only be loaded/stored once.

I don't know, I think the globally loading/storing makes for a cleaner implementation, but maybe I'll change my mind if I run into some unforeseen problem.

Quote:But, you're the one who is writing the JIT compiler, you're the one that knows what you need.

Possibly. Throwing the ball around to generate new ideas always helps, though, so I appreciate your input.
