Start of Project
My proposal to work on the Mono runtime was accepted for the Google Summer of Code program, so I'll be spending some time on that this summer. The project involves modifying the x86 JIT to use SSE for floating point operations, as opposed to the x87 FP stack that it's using now. Using SSE registers is much cleaner and faster than using the FP stack, and the amd64 runtime already uses them exclusively, since the x64 calling convention dictates that they be used. My work will bring the x86 runtime closer to the 64 bit runtime, and hopefully give a modest performance boost as well.
The coding period for GSoC started last Monday, so I spent the last week getting to know the runtime source code. The whole system is pretty complex, large, and entirely in C, so it hasn't been easy following it all. Luckily my mentor has been great about answering questions whenever I have them. A quick overview of how the runtime works:
- The JIT is started, internal calls are hooked up, and CPU features are detected
- The assembly to execute is loaded and its IL instructions are read in.
- The instructions are all read and converted into IR instructions, an expanded internal representation that is still architecture-independent but is easier to manipulate and generate code for.
- A control flow graph is built and the program is broken down into "basic blocks".
- The IR instructions are run through a "decompose" pass that breaks large instructions down into several smaller ones, or equivalent faster operations.
- At this point, we have a graph of blocks that are in decoded IR form. From here on out we move into separate paths depending on the architecture we're running on, which could be x86, amd64, PowerPC, ARM, etc.
- The IR opcodes are run through a lowering pass, which converts complex opcodes into simpler ones that more closely map to machine instructions.
- The opcodes are then run through an initial peephole optimization pass, which performs any optimizations that can be made before register allocation.
- The register allocator is run on each block to assign registers to every instruction.
- Another optimization pass is run on the instruction stream.
- Finally, the JIT runs through each instruction in the block and outputs the corresponding machine instruction for the given architecture.
Although the register allocator is probably the most complex part of the process, there are plenty of little things to take into account as well. I'll have to modify the P/Invoke and native interop code to save and restore the XMM registers whenever the boundary is crossed. Also, any code that uses Mono.SIMD could possibly conflict with registers that we're now using for FP ops, so checks will have to be put in place to handle that condition. Finally, there are optimizations that open up when SSE registers are used, and hopefully I'll have time to do some work on them as well.
It's a lot of work ahead of me, but it's interesting getting to know how a large and commercial-grade JIT operates, and the end result should be useful for a wide group of people. I'm looking forward to diving more deeply into this as the summer progresses.