Progress Update


It's been a while since I've posted here, but progress has moved forward steadily since starting work on the Mono runtime. As indicated a few weeks back, my first step was to enhance all of the existing floating point opcodes to emit SSE instructions whenever the enhancement is turned on. This optimization, like all others in Mono, is gated by both a runtime flag and the actual capabilities of the native hardware. The former has the added benefit of allowing easy testing of both code paths to point out when I've screwed things up.
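To make the gating concrete, here is a minimal sketch of the idea in plain C. The names (JitConfig, OPT_SSE_FP, CPU_HAS_SSE2, use_sse_fp) are invented for illustration and are not Mono's actual flags or structures; the point is only that the SSE path is taken when the user asked for it and the CPU can actually do it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit flags, not Mono's real option or feature masks. */
enum { OPT_SSE_FP = 1u << 0 };    /* optimization requested at runtime */
enum { CPU_HAS_SSE2 = 1u << 0 };  /* capability detected on the host CPU */

/* Hypothetical per-compilation configuration. */
typedef struct {
    uint32_t requested_opts;  /* e.g. filled in from a command-line switch */
    uint32_t cpu_features;    /* e.g. filled in from a CPUID query at startup */
} JitConfig;

/* The SSE code path is used only when both conditions hold. */
static bool use_sse_fp(const JitConfig *cfg) {
    assert(cfg != NULL);
    return (cfg->requested_opts & OPT_SSE_FP) != 0 &&
           (cfg->cpu_features & CPU_HAS_SSE2) != 0;
}
```

Flipping the requested flag on a machine that supports SSE2 is what makes it cheap to run the same tests through both code paths.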

Each method that gets JIT compiled is broken down into a graph of connected code chunks called basic blocks. These blocks are connected by branch instructions, and register allocation is done both within each block (locals) and on the method as a whole (globals). The Mono runtime breaks itself up into a collection of architecture independent pieces which call into specific functions that are "overloaded" at compile time by a preprocessor flag. Thus, the main codepath handles setting up a MonoCompile object for a given method before calling into the mono_arch_output_basic_block function to translate IR into raw machine code for a given CPU. For my project, I only have to worry about modifying this function for x86.
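As a rough illustration of that split, the sketch below uses simplified, made-up types (Ins, BasicBlock) and an invented macro (SKETCH_TARGET_ARM). It is not Mono's code; it only shows an architecture-independent driver calling a per-architecture function whose body is picked by the preprocessor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified IR types; the real structures carry far more state. */
typedef struct Ins {
    int opcode;
    struct Ins *next;
} Ins;

typedef struct BasicBlock {
    Ins *first;                  /* linked list of IR instructions */
    struct BasicBlock *succ[2];  /* fall-through and branch target, if any */
} BasicBlock;

/* Exactly one backend is compiled in, chosen by a preprocessor flag.
 * SKETCH_TARGET_ARM is an invented macro name for this example. */
#if defined(SKETCH_TARGET_ARM)
static size_t arch_output_basic_block(const BasicBlock *bb) {
    (void)bb;  /* an ARM backend would live here */
    return 0;
}
#else
/* Stand-in for an x86 backend: counts one "emitted" unit per instruction. */
static size_t arch_output_basic_block(const BasicBlock *bb) {
    size_t emitted = 0;
    for (const Ins *ins = bb->first; ins != NULL; ins = ins->next)
        emitted++;
    return emitted;
}
#endif

/* Architecture-independent driver: visits every block of a method. */
static size_t compile_method(BasicBlock *const *blocks, size_t count) {
    size_t total = 0;
    assert(blocks != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        total += arch_output_basic_block(blocks[i]);
    return total;
}
```

In the runtime itself, the per-architecture hook playing this role is mono_arch_output_basic_block.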

A quick overview of that function:

void mono_arch_output_basic_block (Instruction *ins) {
    // some setup stuff
    switch (ins->opcode) {
    case OP_ADD: /* blah */ break;
    case OP_SUB: /* blah */ break;
    case OP_CALL: /* blah */ break;
    // etc.
    }
}

This goes on for hundreds of opcodes, with each case taking the current instruction's source and destination registers and using a wide host of macros to output the necessary machine code to implement that operation. I've been adjusting various cases that deal with floating point operations to support an SSE code path:

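case OP_FMUL:
    if (X86_USE_SSE_FP(cfg)) {
        /* SSE path: multiply dreg by sreg2 in place */
        x86_sse_mulsd_reg_reg (code, ins->dreg, ins->sreg2);
    } else {
        /* legacy path: multiply on the x87 FP stack */
        x86_fp_op_reg (code, X86_FMUL, 1, TRUE);
    }
    break;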

Most of them have been fairly straightforward, with a few requiring some more thought than the others. For example, the conversion operators to and from 64-bit integers are particularly difficult to deal with on the x86 platform, since the standard SSE conversion operations only work on 32-bit integer operands when you don't have the x64 REX prefix applied. Thus the runtime breaks up the long into two separate 32-bit registers, and it's up to your opcode implementation to combine and convert them properly. For now I've simply taken the easy way out and pushed both halves onto the x87 FP stack and taken advantage of its ability to write a 64-bit memory location, but in the future I'd like to come back to this and see if there's a better way that doesn't involve pushing onto the stack, popping it back off, and then moving the result into the XMM registers.
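To show what "combine and convert" means arithmetically, here is a small sketch in portable C. It is not the code the JIT emits and not necessarily the better approach I'm looking for; the helper names (split_int64, int64_halves_to_double) are made up for this example. The high half carries the sign and is scaled by 2^32, the low half is treated as unsigned, and the two are added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a 64-bit integer into the two 32-bit halves a 32-bit target
 * would keep in separate registers. The right shift of a negative value
 * is implementation-defined in C, but arithmetic on mainstream compilers. */
static void split_int64(int64_t v, int32_t *hi, uint32_t *lo) {
    assert(hi != NULL && lo != NULL);
    *hi = (int32_t)(v >> 32);
    *lo = (uint32_t)v;  /* low 32 bits, reduced modulo 2^32 */
}

/* Rebuild the value as a double from the two halves.
 * hi * 2^32 and lo are each exactly representable in a double,
 * so the only rounding happens in the final addition. */
static double int64_halves_to_double(int32_t hi, uint32_t lo) {
    return (double)hi * 4294967296.0 + (double)lo;
}
```

For example, splitting -1 gives hi = -1 and lo = 0xFFFFFFFF, and -4294967296 + 4294967295 comes back out as -1.0.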

Several other instructions have issues as well. The intrinsic sin/cos/tan and frem/round had nice implementations in x86 thanks to their built-in support on the FPU. However, since we're now storing all FP values in XMM registers, using that support requires several additional and wasteful moves. In some cases I went ahead and did this anyway, since I couldn't see any quick and easy way to get around it. That's another area to look at later when it comes time to polish everything up.

At this point I've got all the opcodes ported over to using SSE, and most of the regression tests will pass when run with SSE enabled. There are a few areas I need to focus on next:
  • Mono.SIMD intrinsics trample over registers that are now being shared, so any code involving them gets screwed up.
  • At the moment all method calls involve moving everything out of the XMM registers onto the stack and then loading them up again right away when the method starts. The procedure is reversed when leaving the method for outargs and the return value. All of this is quite wasteful, and I'll be focusing on letting the data remain in registers for as long as possible.
  • P/Invoke and native calls operate with a specific calling convention, one that has now changed since I'm using XMM registers instead of the FP stack. I'll need to marshal values between registers whenever one of these calls occurs. Additionally, the XMM registers need to be saved during these calls, since native code might mess with them while it's running.
  • The register allocator currently uses compile-time switches to determine behavior regarding various register banks. Changing the allocator to share XMM registers for FP and SIMD is simple in theory, but will require a lot of plumbing to get the runtime flags down the call stack and into the appropriate methods. Since the register allocator is a shared component for all architectures, changing any public functions there will require sweeping changes across the runtime, which isn't desirable.

Once I get all of that done, I'll turn an eye towards optimizations and cleaning up the generated code. Right now a lot of extraneous work is being done, mostly in moves to and from the XMM registers, that can be optimized out. Additionally, there are optimizations that were precluded by the use of the x87 stack that are now open for exploration. While I'm working on this stage I'll be running all sorts of tests and writing performance benchmarks to help determine what impact my changes have made on the runtime as compared to the old codepaths using the FP stack.