The coding period for GSoC started last Monday, so I spent the last week getting to know the runtime source code. The whole system is pretty complex, large, and entirely in C, so it hasn't been easy following it all. Luckily my mentor has been great about answering questions whenever I have them. A quick overview of how the runtime works:
- The JIT is started, internal calls are hooked up, and CPU features are detected.
- The assembly to execute is loaded, along with its IL instructions.
- The instructions are all read and converted into IR instructions, which are an expanded internal representation that is still architecture independent but is easier to manipulate and generate code for.
- A control flow graph is built and the program is broken down into "basic blocks".
- The IR instructions are run through a "decompose" pass that breaks large instructions down into several smaller ones, or replaces them with equivalent faster operations.
- At this point, we have a graph of blocks that are in decoded IR form. From here on out we move into separate paths depending on the architecture we're running on, which could be x86, amd64, PowerPC, ARM, etc.
- The IR opcodes are run through a lowering pass, which converts complex opcodes into simpler ones that more closely map to machine instructions.
- The opcodes are then run through an initial peephole optimization pass, which performs any optimizations that can be made before register allocation.
- The register allocator is run on the block to assign registers to each instruction.
- Another optimization pass is run on the instruction stream.
- Finally, the JIT runs through each instruction in the block and outputs the corresponding machine instruction for the given architecture.
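To make the "equivalent faster operations" idea from the decompose step a bit more concrete, here is a minimal sketch of that kind of rewrite over a toy IR. Everything in it (the `Toy*` types, the opcode names, `toy_strength_reduce`) is made up for illustration and is not taken from the runtime's source; the real passes are far more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy IR; these are NOT the runtime's real opcodes or structs. */
typedef enum { TOY_MUL_IMM, TOY_SHL_IMM, TOY_ADD_IMM } ToyOpcode;

typedef struct {
    ToyOpcode opcode;
    int dreg;       /* destination virtual register */
    int sreg;       /* source virtual register */
    unsigned imm;   /* immediate operand */
} ToyIns;

/* Returns log2(v) if v is an exact power of two, otherwise -1. */
static int toy_log2_exact(unsigned v)
{
    int shift = 0;

    if (v == 0 || (v & (v - 1)) != 0)
        return -1;
    while ((v >>= 1) != 0)
        shift++;
    return shift;
}

/* Rewrites "multiply by a power of two" into a left shift.
 * Returns the number of instructions that were rewritten. */
static size_t toy_strength_reduce(ToyIns *ins, size_t count)
{
    size_t changed = 0;

    assert(ins != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        int shift;

        if (ins[i].opcode != TOY_MUL_IMM)
            continue;
        shift = toy_log2_exact(ins[i].imm);
        if (shift < 0)
            continue;               /* not a power of two, leave it alone */
        ins[i].opcode = TOY_SHL_IMM;
        ins[i].imm = (unsigned)shift;
        changed++;
    }
    return changed;
}
```

Running `toy_strength_reduce` over a block containing a multiply by 8 and a multiply by 6 would turn the first into a shift by 3 and leave the second untouched.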
My job will be to modify the x86-specific architecture code to output SSE opcodes for various floating point operations. This sounds rather simple, but the implications of not using the FP stack are wide-reaching. Besides modifying machine code output, I'll also have to modify the register allocator to make use of the XMM registers for floating point values being passed to and from functions, since the current runtime puts them all on the x87 stack.
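As a rough illustration of what "output SSE opcodes" means at the byte level, the sketch below encodes a double-precision add both ways: as the SSE2 instruction `addsd` between two XMM registers, and as the x87 `faddp` on the FP stack. The helper names are hypothetical and are not the runtime's own emitter macros; only the instruction encodings come from the x86 instruction set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes "addsd xmm<dst>, xmm<src>" into buf and
 * returns the number of bytes written. Only xmm0-xmm7 are handled,
 * which is all that 32-bit x86 exposes. */
static size_t toy_emit_addsd(uint8_t *buf, unsigned dst, unsigned src)
{
    assert(dst < 8 && src < 8);
    buf[0] = 0xF2;                                /* prefix: scalar double */
    buf[1] = 0x0F;                                /* two-byte opcode escape */
    buf[2] = 0x58;                                /* ADD in the SSE opcode map */
    buf[3] = (uint8_t)(0xC0 | (dst << 3) | src);  /* ModRM: reg=dst, rm=src */
    return 4;
}

/* Hypothetical helper: writes "faddp st(<sti>), st(0)", which adds the
 * top of the x87 stack into st(sti) and then pops the stack. */
static size_t toy_emit_faddp(uint8_t *buf, unsigned sti)
{
    assert(sti < 8);
    buf[0] = 0xDE;
    buf[1] = (uint8_t)(0xC0 + sti);
    return 2;
}
```

The SSE form names both operands explicitly, while the x87 form only works relative to the top of the stack. That difference is why the change reaches beyond the emitter and into register allocation.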
Although the register allocator is probably the most complex part of the process, there are plenty of little things to take into account as well. I'll have to modify the P/Invoke and native interop code to save and restore the XMM registers whenever the boundary is crossed. Also, any code that uses Mono.SIMD could possibly conflict with registers that we're now using for FP ops, so checks will have to be put in place to handle that condition. Finally, there are optimizations that open up when SSE registers are used, and hopefully I'll have time to do some work on them as well.
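One simple way to picture the Mono.SIMD conflict check is as a bitmask of XMM registers that are already claimed. This is only a sketch under my own assumptions: the `ToyXmmMask` type and the `toy_xmm_*` functions are invented for illustration, and the real checks may end up looking quite different.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_XMM_COUNT 8          /* xmm0-xmm7 on 32-bit x86 */

typedef uint8_t ToyXmmMask;      /* bit i set => xmm<i> is in use */

/* Marks register 'reg' as in use and returns the updated mask. */
static ToyXmmMask toy_xmm_reserve(ToyXmmMask busy, int reg)
{
    assert(reg >= 0 && reg < TOY_XMM_COUNT);
    return (ToyXmmMask)(busy | (1u << reg));
}

/* Returns nonzero if SIMD code and scalar FP code claim a common register. */
static int toy_xmm_conflict(ToyXmmMask simd_regs, ToyXmmMask fp_regs)
{
    return (simd_regs & fp_regs) != 0;
}

/* Picks the lowest-numbered register not set in 'busy', or -1 if none is free. */
static int toy_xmm_pick_free(ToyXmmMask busy)
{
    for (int reg = 0; reg < TOY_XMM_COUNT; reg++) {
        if ((busy & (1u << reg)) == 0)
            return reg;
    }
    return -1;
}
```

With xmm0 and xmm1 reserved for SIMD values, `toy_xmm_pick_free` would hand out xmm2 for a scalar FP value, and `toy_xmm_conflict` would flag any attempt to reuse xmm0 or xmm1.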
There's a lot of work ahead of me, but it's interesting getting to know how a large, commercial-grade JIT operates, and the end result should be useful for a wide group of people. I'm looking forward to diving more deeply into this as the summer progresses.