
About this blog

Tracks progress on my work for the Google Summer of Code 2011; work is being done on the Mono x86 runtime.

Entries in this blog


Progress Update

It's been a while since I've posted here, but work on the Mono runtime has moved forward steadily since I started. As indicated a few weeks back, my first step was to enhance all of the existing floating point opcodes to emit SSE instructions whenever the enhancement is turned on. This optimization, like all others in Mono, is gated both by a runtime flag and by the actual capabilities of the native hardware. The runtime flag has the added benefit of making it easy to test both code paths, which points out when I've screwed things up.
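
As a rough sketch of that gating, something like the following would do. The type, flag, and function names here are made up for illustration and are not Mono's actual identifiers; the only point is that the SSE path needs both the flag and the hardware to agree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; the real runtime has its own names. */
enum { OPT_SSE_FP = 1 << 0 };   /* the user asked for the SSE floating point path */
enum { CPU_HAS_SSE2 = 1 << 0 }; /* the hardware reports SSE2 support */

typedef struct {
    uint32_t requested_opts; /* set from the command line or environment */
    uint32_t cpu_features;   /* filled in once at startup, e.g. from CPUID */
} JitConfig;

/* The SSE path is taken only when the flag and the hardware both allow it. */
static bool use_sse_fp(const JitConfig *cfg)
{
    assert(cfg != NULL);
    return (cfg->requested_opts & OPT_SSE_FP) != 0
        && (cfg->cpu_features & CPU_HAS_SSE2) != 0;
}
```

Clearing the flag on a machine that does support SSE2 forces the old x87 path, which is what makes it easy to run the same tests down both code paths.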

Each method that gets JIT compiled is broken down into a graph of connected code chunks called basic blocks. These blocks are connected by branch instructions, and register allocation is done both within each block (locals) and on the method as a whole (globals). The Mono runtime breaks itself up into a collection of architecture independent pieces which call into specific functions that are "overloaded" at compile time by a preprocessor flag. Thus, the main codepath handles setting up a MonoCompile object for a given method before calling into the mono_arch_output_basic_block function to translate IR into raw machine code for a given CPU. For my project, I only have to worry about modifying this function for x86.
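
To illustrate the compile-time "overloading" idea, here is a minimal sketch. Everything in it is hypothetical: the SKETCH_TARGET_ARM flag, the simplified Ins and BasicBlock structs, and the arch_* functions are stand-ins, and the backends just count instructions instead of emitting code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Ins {
    int opcode;
    struct Ins *next;
} Ins;

typedef struct BasicBlock {
    Ins *first;                 /* instructions in this block, in order */
    struct BasicBlock *next_bb; /* next block in layout order */
} BasicBlock;

/* Shared helper: how many instructions a block holds. */
static size_t count_ins(const BasicBlock *bb)
{
    size_t n = 0;
    assert(bb != NULL);
    for (const Ins *ins = bb->first; ins != NULL; ins = ins->next)
        n++;
    return n;
}

/* Exactly one definition of the arch_* functions survives preprocessing.
   SKETCH_TARGET_ARM is a made-up flag for this example. */
#if defined(SKETCH_TARGET_ARM)
static const char *arch_name(void) { return "arm"; }
static size_t arch_output_basic_block(const BasicBlock *bb)
{
    return count_ins(bb); /* an ARM emitter would go here */
}
#else
static const char *arch_name(void) { return "x86"; }
static size_t arch_output_basic_block(const BasicBlock *bb)
{
    return count_ins(bb); /* an x86 emitter would go here */
}
#endif

/* Architecture independent driver: walks the blocks and hands each one to
   whichever backend was compiled in. Returns the instructions processed. */
static size_t compile_method(const BasicBlock *entry)
{
    size_t total = 0;
    for (const BasicBlock *bb = entry; bb != NULL; bb = bb->next_bb)
        total += arch_output_basic_block(bb);
    return total;
}
```

The driver never mentions a CPU by name; swapping the preprocessor flag swaps the backend without touching the shared code.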

A quick overview of that function:

void mono_arch_output_basic_block (Instruction *ins) {
    // some setup stuff
    switch (ins->opcode) {
    case OP_ADD: /* blah */ break;
    case OP_SUB: /* blah */ break;
    case OP_CALL: /* blah */ break;
    // etc.
    }
}

This goes on for hundreds of opcodes, with each case taking the current instruction's source and destination registers and using a whole host of macros to output the necessary machine code to implement that operation. I've been adjusting various cases that deal with floating point operations to support an SSE code path:

case OP_FMUL:
    if (X86_USE_SSE_FP(cfg)) {
        x86_sse_mulsd_reg_reg (code, ins->dreg, ins->sreg2);
    } else {
        x86_fp_op_reg (code, X86_FMUL, 1, TRUE);
    }
    break;

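
To make the two code paths concrete, here is a toy emitter that writes the raw bytes for each one. The function names are hypothetical (Mono's real emitters are macros), but the encodings are the standard ones: mulsd between two XMM registers is F2 0F 59 followed by a register-direct ModRM byte, and the x87 multiply-and-pop fmulp st(1), st(0) is DE C9.

```c
#include <assert.h>
#include <stdint.h>

/* mulsd xmm<dreg>, xmm<sreg>: F2 0F 59 /r with a register-direct ModRM byte.
   This is a two-operand instruction, so dreg must already hold the first source. */
static uint8_t *emit_sse_mulsd_reg_reg(uint8_t *code, int dreg, int sreg)
{
    assert(dreg >= 0 && dreg < 8 && sreg >= 0 && sreg < 8);
    *code++ = 0xF2;
    *code++ = 0x0F;
    *code++ = 0x59;
    *code++ = (uint8_t)(0xC0 | (dreg << 3) | sreg);
    return code;
}

/* fmulp st(1), st(0): multiplies the top two x87 stack slots and pops one. */
static uint8_t *emit_x87_fmulp(uint8_t *code)
{
    *code++ = 0xDE;
    *code++ = 0xC9;
    return code;
}

/* Picks a path the same way the OP_FMUL case does; returns the new write position. */
static uint8_t *emit_fmul(uint8_t *code, int use_sse, int dreg, int sreg2)
{
    return use_sse ? emit_sse_mulsd_reg_reg(code, dreg, sreg2)
                   : emit_x87_fmulp(code);
}
```

The x87 version needs no register operands at all because it always works on the top of the FP stack, which is a big part of why the two paths differ so much once register allocation gets involved.
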
Most of them have been fairly straightforward, with a few requiring more thought than the others. For example, the conversion operators to and from 64-bit integers are particularly difficult to deal with on the x86 platform, since the standard SSE conversion instructions only work on 32-bit integer operands when you don't have the x64 REX prefix applied. The runtime therefore splits the long across two separate 32-bit registers, and it's up to your opcode implementation to combine and convert them properly. For now I've simply taken the easy way out and pushed both halves onto the x87 FP stack and taken advantage of its ability to write a 64-bit memory location, but in the future I'd like to come back to this and see if there's a better way that doesn't involve pushing onto the stack, popping it back off, and then moving the result into the XMM registers.
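
For what it's worth, the arithmetic behind one possible x87-free approach can be sketched in plain C. This is not what the runtime currently does, and the function names are invented for the example. The idea is to convert each 32-bit half separately and combine them, taking care that the low half is unsigned (the 32-bit SSE conversion treats its operand as signed, so real generated code would need an extra fix-up there).

```c
#include <assert.h>
#include <stdint.h>

/* Converts a 64-bit signed integer, given as two 32-bit halves, to a double.
   hi * 2^32 is exact in a double and so is lo, so the only rounding happens
   in the final addition, which matches a direct 64-bit conversion. */
static double long_halves_to_double(int32_t hi, uint32_t lo)
{
    return (double)hi * 4294967296.0 + (double)lo;
}

/* Reference: rebuilds the 64-bit value from its halves using integer math only. */
static int64_t join_long(int32_t hi, uint32_t lo)
{
    return (int64_t)hi * 4294967296LL + (int64_t)lo;
}
```

Comparing long_halves_to_double(hi, lo) against (double)join_long(hi, lo) for a spread of values, including the extremes, is a cheap way to sanity check the approach before writing any machine code for it.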

Several other instructions have issues as well. The intrinsic sin/cos/tan and frem/round had nice implementations in x86 thanks to their built-in support on the FPU. However, since we're now storing all FP values in XMM registers, using that support requires several additional and wasteful moves. In some cases I went ahead and did this anyway, since I couldn't see any quick and easy way to get around it. That's another area to look at later when it comes time to polish everything up.

At this point I've got all the opcodes ported over to using SSE, and most of the regression tests will pass when run with SSE enabled. There are a few areas I need to focus on next:
  • Mono.SIMD intrinsics trample over registers that are now being shared, so any code involving them gets screwed up.
  • At the moment all method calls involve moving everything out of the XMM registers onto the stack and then loading them up again right away when the method starts. The procedure is reversed when leaving the method for outargs and the return value. All of this is quite wasteful, and I'll be focusing on letting the data remain in registers for as long as is possible.
  • P/Invoke and native calls operate with a specific calling convention, one that has now changed since I'm using XMM registers instead of the FP stack. I'll need to marshal values between registers whenever one of these calls occurs. Additionally, the XMM registers need to be saved during these calls, since native code might mess with them while it's running.
  • The register allocator currently uses compile-time switches to determine behavior regarding various register banks. Changing the allocator to share XMM registers for FP and SIMD is simple in theory, but will require a lot of plumbing to get the runtime flags down the call stack and into the appropriate methods. Since the register allocator is a shared component for all architectures, changing any public functions there will require sweeping changes across the runtime, which isn't desirable.

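
The sharing problem in the first and last items boils down to FP and SIMD values drawing from one pool of registers. A minimal sketch of such a pool, using a bitmask over the eight XMM registers available on 32-bit x86, might look like this. It is a toy, not Mono's allocator, and all names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_XMM 8 /* xmm0..xmm7 on 32-bit x86 */

typedef struct {
    uint8_t in_use; /* bit i set means xmm<i> is taken, by FP or by SIMD */
} XmmBank;

/* Hands out the lowest free register, or -1 when the caller has to spill. */
static int xmm_alloc(XmmBank *bank)
{
    for (int i = 0; i < NUM_XMM; i++) {
        if (!(bank->in_use & (1u << i))) {
            bank->in_use |= (uint8_t)(1u << i);
            return i;
        }
    }
    return -1;
}

/* Returns a register to the pool. */
static void xmm_free(XmmBank *bank, int reg)
{
    assert(reg >= 0 && reg < NUM_XMM);
    bank->in_use &= (uint8_t)~(1u << reg);
}
```

As long as both the FP opcodes and the SIMD intrinsics go through the same bank, neither can be handed a register the other is still using.
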
Once I get all of that done, I'll turn an eye towards optimizations and cleaning up the generated code. Right now a lot of extraneous work is being done, mostly in moves to and from the XMM registers, that can be optimized out. Additionally, there are optimizations that were precluded by the use of the x87 stack that are now open for exploration. While I'm working on this stage I'll be running all sorts of tests and writing performance benchmarks to help determine what impact my changes have made on the runtime as compared to the old codepaths using the FP stack.
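
As an example of the kind of cleanup meant here, this is a toy peephole pass over a flat instruction array that drops two obviously redundant register moves. The opcode ids and the Ins layout are made up for the sketch and are far simpler than the runtime's real IR.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_FMOVE, OP_FMUL }; /* made-up opcode ids for this sketch */

typedef struct {
    int op;
    int dreg; /* destination register */
    int sreg; /* source register */
} Ins;

/* Compacts the array in place and returns the new instruction count. */
static size_t peephole_moves(Ins *ins, size_t n)
{
    size_t out = 0;
    assert(ins != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        Ins cur = ins[i];
        if (cur.op == OP_FMOVE) {
            /* Moving a register onto itself does nothing. */
            if (cur.dreg == cur.sreg)
                continue;
            /* Right after "move a <- b", a "move b <- a" copies a value
               that is already there. */
            if (out > 0 && ins[out - 1].op == OP_FMOVE &&
                ins[out - 1].dreg == cur.sreg && ins[out - 1].sreg == cur.dreg)
                continue;
        }
        ins[out++] = cur;
    }
    return out;
}
```

The second pattern only fires when the two moves are adjacent in the output; anything in between could have changed either register, so the pass leaves those cases alone.
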




Start of Project

My proposal to work on the Mono runtime was accepted for the Google Summer of Code program, so I'll be spending some time on that this summer. The project involves modifying the x86 JIT to use SSE for floating point operations, as opposed to the x87 FP stack that it's using now. Using SSE registers is much cleaner and faster than using the FP stack, and the amd64 runtime already uses them exclusively, since the x64 calling convention dictates that they be used. My work will bring the x86 runtime closer to the 64-bit runtime, and hopefully give a modest performance boost as well.

The coding period for GSoC started last Monday, so I spent the last week getting to know the runtime source code. The whole system is pretty complex, large, and entirely in C, so it hasn't been easy following it all. Luckily my mentor has been great about answering questions whenever I have them. A quick overview of how the runtime works:

  • The JIT is started, internal calls are hooked up, and CPU features are detected.
  • The assembly to execute is loaded, along with its IL instructions.
  • The instructions are all read and converted into IR instructions, which are an expanded internal representation that is still architecture independent but is easier to manipulate and generate code for.
  • A control flow graph is built and the program is broken down into "basic blocks".
  • The IR instructions are run through a "decompose" pass that breaks large instructions down into several smaller ones, or equivalent faster operations.
  • At this point, we have a graph of blocks that are in decoded IR form. From here on out we move into separate paths depending on the architecture we're running, which could be x86, amd64, PowerPC, ARM, etc.
  • The IR opcodes are run through a lowering pass, which converts complex opcodes into simpler ones that more closely map to machine instructions.
  • The opcodes are then run through an initial peephole optimization pass, which performs any optimizations that can be made before register allocation.
  • The register allocator is run on the block to schedule registers for each instruction.
  • Another optimization pass is run on the instruction stream.
  • Finally, the JIT runs through each instruction in the block and outputs the corresponding machine instruction for the given architecture.

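
The architecture-specific half of those steps can be pictured as a fixed list of passes applied to each block in order. The sketch below is only a picture of that ordering: the Block struct, the stage ids, and the pass functions are all invented, and each "pass" just records that it ran.

```c
#include <assert.h>
#include <stddef.h>

enum {
    STAGE_LOWER = 1,
    STAGE_PEEPHOLE_1,
    STAGE_REGALLOC,
    STAGE_PEEPHOLE_2,
    STAGE_EMIT
};

typedef struct {
    int stages_run[8]; /* ids of the passes applied so far, in order */
    size_t count;
} Block;

typedef void (*Pass)(Block *);

/* Stand-in for real work: note which stage touched the block. */
static void record(Block *bb, int stage)
{
    assert(bb->count < sizeof bb->stages_run / sizeof bb->stages_run[0]);
    bb->stages_run[bb->count++] = stage;
}

static void lower(Block *bb)      { record(bb, STAGE_LOWER); }
static void peephole_1(Block *bb) { record(bb, STAGE_PEEPHOLE_1); }
static void regalloc(Block *bb)   { record(bb, STAGE_REGALLOC); }
static void peephole_2(Block *bb) { record(bb, STAGE_PEEPHOLE_2); }
static void emit(Block *bb)       { record(bb, STAGE_EMIT); }

/* Runs the per-block passes in the order described in the list above. */
static void compile_block(Block *bb)
{
    static const Pass pipeline[] = { lower, peephole_1, regalloc, peephole_2, emit };
    for (size_t i = 0; i < sizeof pipeline / sizeof pipeline[0]; i++)
        pipeline[i](bb);
}
```

The ordering matters: the first peephole pass sees virtual registers, while the second one runs after allocation and can clean up moves the allocator introduced.
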
My job will be to modify the x86-specific architecture code to output SSE opcodes for various floating point operations. This sounds rather simple, but the implications of not using the FP stack are wide-reaching. Besides modifying machine code output, I'll also have to modify the register allocator to make use of the XMM registers for floating point values being passed to and from functions, since the current runtime puts them all on the x87 stack.

Although the register allocator is probably the most complex part of the process, there are plenty of little things to take into account as well. I'll have to modify the P/Invoke and native interop code to save and restore the XMM registers whenever the boundary is crossed. Also, any code that uses Mono.SIMD could possibly conflict with registers that we're now using for FP ops, so checks will have to be put in place to handle that condition. Finally, there are optimizations that open up when SSE registers are used, and hopefully I'll have time to do some work on them as well.

It's a lot of work ahead of me, but it's interesting getting to know how a large and commercial-grade JIT operates, and the end result should be useful for a wide group of people. I'm looking forward to diving more deeply into this as the summer progresses.


