Playing With The .NET JIT Part 4

Published September 14, 2014
As noted previously, there are cases where the performance of unmanaged code can beat that of the managed JIT; last time it was the matrix multiplication function. We do have one more performance lever to pull for our .NET code: we can NGEN it. NGEN is an interesting utility that can perform heavy optimizations which are not possible in the standard runtime JIT (as we shall see). The question before us is: will it give us enough of a boost to surpass the performance of our unmanaged matrix multiplication?

An Analysis of Existing Code
We haven't yet looked at the code produced for our previous tests, so it's time we gave it a look and saw what we have. To keep this short we'll only examine the inner product function; the code produced for the matrix multiplication suffers from the same problems and benefits from the same improvements. For the purposes of this writing we'll only consider the x64 platform. First up is our unmanaged inner product, which as we may recall is an SSE2 version. There are some things we should note: this method cannot be inlined into the managed code, and there are no frame pointers (they were optimized out).

```
00000001`800019c3 0f100a         movups  xmm1,xmmword ptr [rdx]
00000001`800019c6 0f59c8         mulps   xmm1,xmm0
00000001`800019c9 0f28c1         movaps  xmm0,xmm1
00000001`800019cc 0fc6c14e       shufps  xmm0,xmm1,4Eh
00000001`800019d0 0f58c8         addps   xmm1,xmm0
00000001`800019d3 0f28c1         movaps  xmm0,xmm1
00000001`800019d6 0fc6c11b       shufps  xmm0,xmm1,1Bh
00000001`800019da 0f58c1         addps   xmm0,xmm1
00000001`800019dd f3410f1100     movss   dword ptr [r8],xmm0
00000001`800019e2 c3             ret
```
The code used to produce the managed version shown below has undergone a slight modification: the method no longer returns a float, but instead writes its result through an out parameter. This change was made to eliminate some compilation issues in both the managed and unmanaged versions. In the case of the managed version below, without the out parameter the store operation (at 00000642`801673b3) would have required a conversion to a double and back to a single again. The new versions are shown at the end of this post. Examining the managed inner product we get a somewhat worse picture:

```
00000642`8016732f 4c8b4908       mov     r9,qword ptr [rcx+8]
00000642`80167333 4d85c9         test    r9,r9
00000642`80167336 0f8684000000   jbe     00000642`801673c0
00000642`8016733c f30f104110     movss   xmm0,dword ptr [rcx+10h]
00000642`80167341 488b4208       mov     rax,qword ptr [rdx+8]
00000642`80167345 4885c0         test    rax,rax
00000642`80167348 7676           jbe     00000642`801673c0
00000642`8016734a f30f104a10     movss   xmm1,dword ptr [rdx+10h]
00000642`8016734f f30f59c8       mulss   xmm1,xmm0
00000642`80167353 4983f901       cmp     r9,1
00000642`80167357 7667           jbe     00000642`801673c0
00000642`80167359 f30f105114     movss   xmm2,dword ptr [rcx+14h]
00000642`8016735e 483d01000000   cmp     rax,1
00000642`80167364 765a           jbe     00000642`801673c0
00000642`80167366 f30f104214     movss   xmm0,dword ptr [rdx+14h]
00000642`8016736b f30f59c2       mulss   xmm0,xmm2
00000642`8016736f f30f58c1       addss   xmm0,xmm1
00000642`80167373 4983f902       cmp     r9,2
00000642`80167377 7647           jbe     00000642`801673c0
00000642`80167379 f30f105118     movss   xmm2,dword ptr [rcx+18h]
00000642`8016737e 483d02000000   cmp     rax,2
00000642`80167384 763a           jbe     00000642`801673c0
00000642`80167386 f30f104a18     movss   xmm1,dword ptr [rdx+18h]
00000642`8016738b f30f59ca       mulss   xmm1,xmm2
00000642`8016738f f30f58c8       addss   xmm1,xmm0
00000642`80167393 4983f903       cmp     r9,3
00000642`80167397 7627           jbe     00000642`801673c0
00000642`80167399 f30f10511c     movss   xmm2,dword ptr [rcx+1Ch]
00000642`8016739e 483d03000000   cmp     rax,3
00000642`801673a4 761a           jbe     00000642`801673c0
00000642`801673a6 f30f10421c     movss   xmm0,dword ptr [rdx+1Ch]
00000642`801673ab f30f59c2       mulss   xmm0,xmm2
00000642`801673af f30f58c1       addss   xmm0,xmm1
00000642`801673b3 f3410f114040   movss   dword ptr [r8+40h],xmm0
...
00000642`801673bd f3c3           rep ret
00000642`801673bf 90             nop
00000642`801673c0 e88b9f8aff     call    mscorwks!JIT_RngChkFail (00000642`7fa11350)
```
Wow, lots of conditionals! The code isn't vectorized either, but we don't expect it to be; automatic vectorization is hit and miss in most optimizing compilers (including Intel's), and vectorizing in the runtime JIT would take far too much time anyway. This method is inlined for us (thankfully), but it is littered with conditionals and jumps. So where are they jumping to? They actually land just past the end of the method; note the nop instruction that keeps the jump destination paragraph (16 byte) aligned, which is intentional. As you can probably guess from the name of the jump destination, those conditionals are checking the indices being used against the array bounds stored in r9 and rax. The jumps aren't particularly friendly to branch prediction; for the most part they won't hamper this method much, but they are an additional cost. Unfortunately, they are rather problematic for the matrix version, where they cost quite a bit of performance.
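To make the pattern concrete, here is a rough C++ sketch (hypothetical names, not actual CLR code) of what each guarded access amounts to: the cmp/jbe pairs in the listing correspond to the `if` below, and JIT_RngChkFail plays the role of the throw.

```cpp
#include <cstddef>
#include <stdexcept>

// Illustrative only: each managed array access v[i] behaves as if it were
// guarded by a range check that branches to a failure routine on overflow.
inline float checked_load(const float* data, std::size_t length, std::size_t i)
{
    if (i >= length)                       // the cmp/jbe pair in the listing
        throw std::out_of_range("index");  // JIT_RngChkFail in the disassembly
    return data[i];
}

// Four loads from each array, each preceded by its own range check,
// mirroring the eight cmp/jbe pairs the JIT emitted for the inner product.
inline float inner_product_checked(const float* v1, std::size_t n1,
                                   const float* v2, std::size_t n2)
{
    float r = 0.0f;
    for (std::size_t i = 0; i < 4; ++i)
        r += checked_load(v1, n1, i) * checked_load(v2, n2, i);
    return r;
}
```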
We can also see that in x64 mode the JIT uses SSE2 for floating point operations. This is quite nice, but it has some interesting consequences: comparing floating point numbers generated using the FPU against those generated using SSE2 will more than likely fail, EVEN IF you truncate them to the appropriate sizes. The reason is that the XMM registers (when using the single precision forms of the instructions rather than the double ones) store floating point values as exactly 32 bit floats, while the FPU expands them to 80 bit floats; operations carried out on those 80 bit values before truncation can affect the low bits of the 32 bit result, leaving the two values differing in their lower portions. If you are wondering when this might become an issue, imagine the problems of running a managed networked game with both 64 bit and 32 bit clients sending packets to the server. This is just one more reason to use deltas when comparing floats. Also worth noting: with SSE2 support came instructions that save us loads and stores, such as cvtss2sd and cvtsd2ss, which perform single to double and double to single conversions respectively.
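A minimal sketch of the delta comparison suggested above (the tolerance value is my assumption; tune it to the magnitude of your data):

```cpp
#include <algorithm>
#include <cmath>

// Compare floats within a tolerance instead of bit-for-bit equality, so that
// results computed with 80-bit FPU intermediates and 32-bit SSE2 arithmetic
// can still compare equal despite differing in their low bits.
inline bool nearly_equal(float a, float b, float tol = 1e-5f)
{
    float diff    = std::fabs(a - b);
    float largest = std::max(std::fabs(a), std::fabs(b));
    // Relative tolerance, with an absolute floor of 'tol' for values near zero.
    return diff <= tol * std::max(largest, 1.0f);
}
```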
Examining the Call Stack

Of course, there is also the question of exactly what our program goes through to call our unmanaged methods. First off, the JIT has to generate several marshalling stubs (to deal with any non-blittable types, although in this case all of the passed types are blittable), along with the security demands. The total number of machine instructions for these stubs is around 10-30; nevertheless, they aren't inlinable and end up having to be created at runtime. The extra overhead of these calls can add up to quite a bit. First up we'll look at the pinvoke and the delegate stacks:

```
000006427f66bd14 ManagedMathLib!matrix_mul
0000064280168b85 mscorwks!DoNDirectCall__PatchGetThreadCall+0x78
0000064280168ccc ManagedMathLib!DomainBoundILStubClass.IL_STUB(Single[], Single[], Single[])+0xb5
0000064280168a0f PInvokeTest!SecurityILStubClass.IL_STUB(Single[], Single[], Single[])+0x5c
000006428016893e PInvokeTest!PInvokeTest.Program+<>c__DisplayClass8.b__0()+0x1f
0000064280167ca1 PInvokeTest!PInvokeTest.Program.TimeTest(TestMethod, Int32)+0x6e
000006427f66c5e2 PInvokeTest!PInvokeTest.Program.Main(System.String[])+0x591
```

```
000006427f66bd14 ManagedMathLib!matrix_mul
0000064280168465 mscorwks!DoNDirectCall__PatchGetThreadCall+0x78
00000642801685c1 ManagedMathLib!DomainBoundILStubClass.IL_STUB(Single[], Single[], Single[])+0xb5
0000064280168945 PInvokeTest!SecurityILStubClass.IL_STUB(Single[], Single[], Single[])+0x51
0000064280167d59 PInvokeTest!PInvokeTest.Program.TimeTest(TestMethod, Int32)+0x75
000006427f66c5e2 PInvokeTest!PInvokeTest.Program.Main(System.String[])+0x649
```
We can see the two stubs that were created, along with a final method called DoNDirectCall__PatchGetThreadCall, which actually does the work of calling our unmanaged function. Exactly what it does is probably what the name says, although I haven't dug in and tried to find out what's going on in its internals. One important thing to notice is the PInvokeTest!PInvokeTest.Program+<>c__DisplayClass8.b__0() call, which is actually a delegate used to call our unmanaged method (passed in to TimeTest). By using the delegate to call the matrix multiplication function, the JIT was able to eliminate the calls entirely. Other than that, the contents of the two sets of stubs are practically identical. The security stub asserts that we have the right to call unmanaged code; since this is a security demand that can change at runtime, it cannot be eliminated. Calling our unmanaged function from the managed DLL is up next, and it turns out that this is also the most direct call:

```
000006427f66bf32 ManagedMathLib!matrix_mul
0000064280169601 mscorwks!DoNDirectCallWorker+0x62
00000642801694ef ManagedMathLib!ManagedMathLib.ManagedMath.MatrixMul(Single[], Single[], Single[])+0xd1
0000064280168945 PInvokeTest!PInvokeTest.Program+<>c__DisplayClass8.b__3()+0x1f
0000064280167ecf PInvokeTest!PInvokeTest.Program.TimeTest(TestMethod, Int32)+0x75
000006427f66c5e2 PInvokeTest!PInvokeTest.Program.Main(System.String[])+0x7bf
```
As we can see, the only real work done to call our unmanaged method is the call to DoNDirectCallWorker. Digging around in that method, we find that it is basically a wrapper that saves registers, sets up a few registers, and then dispatches to the unmanaged function. Upon returning it restores the registers and returns to the caller. There is no dynamic method construction, nor does this require any extra overhead on our end. In fact, one could say the code is about as fast as we can expect a managed to unmanaged transition to be. Looking at the difference between the original unmanaged inner product call and the new version (which takes a pointer to the destination float), both made from the managed DLL, we can see a huge difference:

```
000006427f66bf32 ManagedMathLib!inner_product
0000064280169bd0 mscorwks!DoNDirectCallWorker+0x62
0000064280169acf ManagedMathLib!ManagedMathLib.ManagedMath.InnerProduct(Single[], Single[], Single ByRef)+0xc0
0000064280168955 PInvokeTest!PInvokeTest.Program+<>c__DisplayClass8.b__7()+0x1f
00000642801681c5 PInvokeTest!PInvokeTest.Program.TimeTest(TestMethod, Int32)+0x75
000006427f66c5e2 PInvokeTest!PInvokeTest.Program.Main(System.String[])+0xab5
```

```
000006427f66bd14 ManagedMathLib!inner_product
0000064280169ca3 mscorwks!DoNDirectCall__PatchGetThreadCall+0x78
0000064280169ba0 ManagedMathLib!DomainBoundILStubClass.IL_STUB(Single*, Single*)+0x43
0000064280169b00 ManagedMathLib!ManagedMathLib.ManagedMath.InnerProduct(Single[], Single[])+0x5
000006428016893e PInvokeTest!PInvokeTest.Program+<>c__DisplayClass8.b__7()+0x20
00000642801681c5 PInvokeTest!PInvokeTest.Program.TimeTest(TestMethod, Int32)+0x6e
000006427f66c5e2 PInvokeTest!PInvokeTest.Program.Main(System.String[])+0xab5
```
Notice that the second call stack has the marshalling stub (and note the parameters to the stub). Returning value types has all sorts of interesting consequences; by changing the signature to write out to a float (in the case of the managed DLL, via an out parameter), we eliminate the marshalling stub entirely. This improves performance by a decent bit, but nowhere near enough to make up for the cost of the call in the first place. The managed inner product is still significantly faster.
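The signature change is easiest to see on the unmanaged side. A sketch with hypothetical names (not the article's exact exports): per the discussion above, returning the value involves the runtime in marshalling a return value, while writing through a caller-supplied pointer keeps every parameter blittable.

```cpp
// Two ways of exposing the same computation to managed callers.
// The pointer-out form is the one that avoided the marshalling stub.
extern "C" float inner_product_ret(const float* v1, const float* v2)
{
    return v1[0]*v2[0] + v1[1]*v2[1] + v1[2]*v2[2] + v1[3]*v2[3];
}

extern "C" void inner_product_out(const float* v1, const float* v2, float* out)
{
    *out = v1[0]*v2[0] + v1[1]*v2[1] + v1[2]*v2[2] + v1[3]*v2[3];
}
```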
And then came NGEN

So, we've gone through and optimized our managed application, and yet it still runs too slowly. We contemplate the necessity of moving some code over to the unmanaged world and shudder at the implications. Security would be shot, bugs abound... what to do! But then we remember that there's yet one more option: NGEN!
Running NGEN on our test executable prejitted the whole thing, even methods that eventually ended up being inlined. So, what did it do to our managed inner product? First we'll look at the actual method that got prejitted:

```
PInvokeTest.Program.InnerProduct2(Single[], Single[], Single ByRef)
Begin 0000064288003290, size b0
00000642`88003290 4883ec28       sub     rsp,28h
00000642`88003294 4c8bc9         mov     r9,rcx
00000642`88003297 498b4108       mov     rax,qword ptr [r9+8]
00000642`8800329b 4885c0         test    rax,rax
00000642`8800329e 0f8696000000   jbe     PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032a4 33c9           xor     ecx,ecx
00000642`880032a6 488b4a08       mov     rcx,qword ptr [rdx+8]
00000642`880032aa 4885c9         test    rcx,rcx
00000642`880032ad 0f8687000000   jbe     PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032b3 4533d2         xor     r10d,r10d
00000642`880032b6 483d01000000   cmp     rax,1
00000642`880032bc 767c           jbe     PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032be 41ba01000000   mov     r10d,1
00000642`880032c4 4883f901       cmp     rcx,1
00000642`880032c8 7670           jbe     PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032ca 41ba01000000   mov     r10d,1
00000642`880032d0 483d02000000   cmp     rax,2
00000642`880032d6 7662           jbe     PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032d8 41ba02000000   mov     r10d,2
00000642`880032de 4883f902       cmp     rcx,2
00000642`880032e2 7656           jbe     PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032e4 483d03000000   cmp     rax,3
00000642`880032ea 764e           jbe     PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032ec b803000000     mov     eax,3
00000642`880032f1 4883f903       cmp     rcx,3
00000642`880032f5 7643           jbe     PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032f7 f30f104a14     movss   xmm1,dword ptr [rdx+14h]
00000642`880032fc f3410f594914   mulss   xmm1,dword ptr [r9+14h]
00000642`88003302 f30f104210     movss   xmm0,dword ptr [rdx+10h]
00000642`88003307 f3410f594110   mulss   xmm0,dword ptr [r9+10h]
00000642`8800330d f30f58c8       addss   xmm1,xmm0
00000642`88003311 f30f104218     movss   xmm0,dword ptr [rdx+18h]
00000642`88003316 f3410f594118   mulss   xmm0,dword ptr [r9+18h]
00000642`8800331c f30f58c8       addss   xmm1,xmm0
00000642`88003320 f30f10421c     movss   xmm0,dword ptr [rdx+1Ch]
00000642`88003325 f3410f59411c   mulss   xmm0,dword ptr [r9+1Ch]
00000642`8800332b f30f58c8       addss   xmm1,xmm0
00000642`8800332f f3410f1108     movss   dword ptr [r8],xmm1
00000642`88003334 4883c428       add     rsp,28h
00000642`88003338 f3c3           rep ret
00000642`8800333a e811e0a0f7     call    mscorwks!JIT_RngChkFail (00000642`7fa11350)
00000642`8800333f 90             nop
```
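The shape of this prejitted code — every range check hoisted to the top, then a branch-free arithmetic body — can be sketched in C++ as follows (hypothetical names; the real checks are against the managed arrays' lengths):

```cpp
#include <cstddef>
#include <stdexcept>

// Sketch of the NGEN'd layout: validate both array lengths once up front,
// then run a straight-line, check-free body. Contrast with the JIT version,
// which interleaves a cmp/jbe pair before every single element access.
inline float inner_product_hoisted(const float* v1, std::size_t n1,
                                   const float* v2, std::size_t n2)
{
    if (n1 < 4 || n2 < 4)                  // the block of cmp/jbe at the top
        throw std::out_of_range("index");  // JIT_RngChkFail equivalent
    return v1[0]*v2[0] + v1[1]*v2[1]       // branch-free body
         + v1[2]*v2[2] + v1[3]*v2[3];
}
```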
Interesting results, eh? First off, all of the checks are right up front, and ignoring the stack frames we can see exactly what will be inlined. This method looks a lot better than before, with all of the branches at the top, where one would assume branch prediction can best deal with them (the registers never change and are compared against constants). Nevertheless there are some oddities in this code; for instance there appear to be some extraneous instructions like mov eax,3. Yeah, don't ask me. Still, the code is clearly superior to its previous form, and in fact the matrix version is equally improved, with the range checks spaced out significantly more (and a bunch done right up front as well). Of course, the question now is: how much does this help our performance? First up are some results from the new code base, and then results from NGEN run on the same code base.

```
Count: 50
PInvoke MatrixMul :     00:00:07.6456226  Average: 00:00:00.1529124
Delegate MatrixMul:     00:00:06.6500307  Average: 00:00:00.1330006
Managed MatrixMul:      00:00:05.5783511  Average: 00:00:00.1115670
Internal MatrixMul:     00:00:04.5377141  Average: 00:00:00.0907542
PInvoke Inner Product:  00:00:05.4466987  Average: 00:00:00.1089339
Delegate Inner Product: 00:00:04.5001885  Average: 00:00:00.0900037
Managed Inner Product:  00:00:00.5535891  Average: 00:00:00.0110717
Internal Inner Product: 00:00:02.2694728  Average: 00:00:00.0453894

Count: 10
PInvoke MatrixMul :     00:00:01.5706254  Average: 00:00:00.1570625
Delegate MatrixMul:     00:00:01.2689247  Average: 00:00:00.1268924
Managed MatrixMul:      00:00:01.1501118  Average: 00:00:00.1150111
Internal MatrixMul:     00:00:00.9302144  Average: 00:00:00.0930214
PInvoke Inner Product:  00:00:01.0198933  Average: 00:00:00.1019893
Delegate Inner Product: 00:00:00.8538827  Average: 00:00:00.0853882
Managed Inner Product:  00:00:00.0987369  Average: 00:00:00.0098736
Internal Inner Product: 00:00:00.4287660  Average: 00:00:00.0428766
```
All in all, our performance changes have helped out the managed inner product a decent amount, although even the unmanaged calls managed to get a bit of a boost. Now for the NGEN results:

```
Count: 50
PInvoke MatrixMul :     00:00:07.5788052  Average: 00:00:00.1515761
Delegate MatrixMul:     00:00:06.2202549  Average: 00:00:00.1244050
Managed MatrixMul:      00:00:04.0376665  Average: 00:00:00.0807533
Internal MatrixMul:     00:00:04.5778189  Average: 00:00:00.0915563
PInvoke Inner Product:  00:00:05.2785764  Average: 00:00:00.1055715
Delegate Inner Product: 00:00:04.1814388  Average: 00:00:00.0836287
Managed Inner Product:  00:00:00.5579279  Average: 00:00:00.0111585
Internal Inner Product: 00:00:02.2419279  Average: 00:00:00.0448385

Count: 10
PInvoke MatrixMul :     00:00:01.3822036  Average: 00:00:00.1382203
Delegate MatrixMul:     00:00:01.1436108  Average: 00:00:00.1143610
Managed MatrixMul:      00:00:00.7386742  Average: 00:00:00.0738674
Internal MatrixMul:     00:00:00.8427460  Average: 00:00:00.0842746
PInvoke Inner Product:  00:00:00.9507331  Average: 00:00:00.0950733
Delegate Inner Product: 00:00:00.7428082  Average: 00:00:00.0742808
Managed Inner Product:  00:00:00.1005084  Average: 00:00:00.0100508
Internal Inner Product: 00:00:00.4025611  Average: 00:00:00.0402561
```
So now we can see that our unmanaged matrix multiplication doesn't offer any advantage over the managed version; in fact it's actually SLOWER than the managed version! We can also see that the unmanaged invocations benefitted from the NGEN process, as their managed call sites were also optimized somewhat, although the stub wrappers are still there and hence still add their overhead. We can also note that the unmanaged inner product function appears to have slowed down just a bit; this might be nothing, it might be machine load, or it might genuinely be slower. I'm tempted to say that it's actually slower now.

Conclusion
You may recall that this was all sparked by a discussion I had way back when about comparing managed and unmanaged benchmarks, and the disadvantages of just setting the /clr flag. I've gone a bit past that, though, in looking at our managed code and our optimized unmanaged code and asking when it is actually beneficial to call into unmanaged code. It still can be, but only for operations that are sufficiently taxing to be worth the transition. In this case our matrix code, which clearly beat the JIT produced code in the pure JIT situation, gets beaten by the managed version once NGEN is involved. So what is sufficiently taxing, then? Set processing might be: that is, applying a set of vectorized operations to a whole collection of objects. But the reality is, you MUST profile first before you can be sure that optimizations of this sort are anywhere near what you need; if you just assume they are, you're probably mistaken.
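As a sketch of what such set processing could look like (names and data layout are my assumption, not code from this article): a single unmanaged call that runs the SSE inner product over a whole batch of 4-component vectors, amortizing the managed-to-unmanaged transition cost across the entire set.

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Batch inner product: results[i] = dot(v1[i], v2[i]) for 'count' 4-float
// vectors stored contiguously. One interop call covers the whole set.
void inner_product_batch(const float* v1, const float* v2,
                         float* results, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        const float* a = v1 + 4 * i;
        const float* b = v2 + 4 * i;
        // Same horizontal-sum pattern as the article's inner_product.
        __m128 m = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
        m = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 0, 3, 2)));
        _mm_store_ss(&results[i],
                     _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(0, 1, 2, 3))));
    }
}
```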
On a final note, the x86 version also performs better when NGENed than the native version does, although in a surprising jump, the delegates now cost significantly more:

```
Count: 50
PInvoke MatrixMul :     00:00:07.9897235  Average: 00:00:00.1597944
Delegate MatrixMul:     00:00:27.2561396  Average: 00:00:00.5451227
Managed MatrixMul:      00:00:03.5224029  Average: 00:00:00.0704480
Internal MatrixMul:     00:00:04.5232549  Average: 00:00:00.0904650
PInvoke Inner Product:  00:00:05.5799834  Average: 00:00:00.1115996
Delegate Inner Product: 00:00:29.5660003  Average: 00:00:00.5913200
Managed Inner Product:  00:00:00.5755690  Average: 00:00:00.0115113
Internal Inner Product: 00:00:01.8218949  Average: 00:00:00.0364378
```
Exactly why this is I haven't investigated, and perhaps I will next time.
Sources for the new inner product functions:

```cpp
void __declspec(dllexport) inner_product(float const* v1, float const* v2, float* out)
{
    __m128 a = _mm_mul_ps(_mm_loadu_ps(v1), _mm_loadu_ps(v2));
    a = _mm_add_ps(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 0, 3, 2)));
    _mm_store_ss(out, _mm_add_ps(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 1, 2, 3))));
}
```

```cpp
static void InnerProduct(array<float>^ v1, array<float>^ v2,
                         [Runtime::InteropServices::Out] float% result)
{
    pin_ptr<float> pv1 = &v1[0];
    pin_ptr<float> pv2 = &v2[0];
    pin_ptr<float> out = &result;
    inner_product(pv1, pv2, out);
}
```

```csharp
public static void InnerProduct2(float[] v1, float[] v2, out float f)
{
    f = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2] + v1[3] * v2[3];
}
```

