SlimGen and You, Part ADD EAX, [EAX] of N

Published September 14, 2014
Advertisement
So far I've covered how SlimGen works and the difficulties in doing what it does, including calling convention issues that one must be made aware of when writing replacement methods for use with SlimGen.
So the next question arises, just how much of a difference can using SlimGen make? Well, a lot of that will depend on the developer and their skill level. But we also were pretty curious about this and so we slapped together a test sample that runs through a series of matrix multiplications and times it. It uses three arrays to perform the multiplications, two of the arrays contains 100,000 randomly generated matrixes, with the third being used as the destinations for the results. Both matrix multiplications (the SlimGen one and the .Net one) assume that a source can also be used as a destination, and so they are overlap safe.
The timing results will vary, of course, from machine to machine depending on the processor in the machine, how much ram you have and also on what you're doing at the time. Running the results against my Phenom 9850 I get: Total Matrix Count Per Run: 100,000 Multiply Total Ticks: 2,001,059 SlimGenMultiply Total Ticks: 1,269,200 Improvement: 36.57 %
While when I run it against my T8300 Core2 Duo laptop I get: Total Matrix Count Per Run: 100,000 Multiply Total Ticks: 2,175,380 SlimGenMultiply Total Ticks: 1,621,830 Improvement: 25.45 %
Still, 25-35% improvement over the FPU based multiply is quite significant. Since X64 support hasn't been fully hammered out (in that it "works" but hasn't been sufficiently verified as working), those numbers are unavailable at the moment. However, they should be available in the near future as we finalize error handling and ensure that there are no bugs in the x64 assembly handling.
So why the great difference in performance? Well, part of it is the method size, the .Net method is 566 bytes of pure code, that's over half a kilobyte of code that has to be walked through by the processor, code which needs to be brought into the instruction-cache on the CPU and executed, meanwhile the SSE2 method is around half that size, at 266 bytes. The smaller your footprint in the I-cache, the fewer hits you take and the more likely your code is to actually be IN the I-cache. Then there's the instructions, SSE2 has been around for a while, and so it has had plenty of time to be wrangled around with by CPU manufacturers to ensure optimal performance. Finally there's the memory hit issue, the SSE2 based code hits memory a minimal number of times, reducing the chances of cache misses, after the first read/write, except for a few cases.
Finally there's how it deals with storage of the temporary results. The .Net FPU based version allocates a Matrix type on the stack, calls the constructor (which 0 initializes it), and then proceeds to overwrite those entries one by one with the results of each set of dot products. At the end of the method it does what amounts to a memcpy, and copies the temporary matrix over the result matrix. The SSE2 version however doesn't bother with initializing the stack and only stores three of the results on the stack, opting to write out the final result directly to the destination. The three other rows are then moved back into XMM registers and then back out to the destination.
The SSE2 source code, followed by the .Net source code, note that both are functionally equivalent: start: mov eax, [esp + 4] movups xmm4, [edx] movups xmm5, [edx + 0x10] movups xmm6, [edx + 0x20] movups xmm7, [edx + 0x30] movups xmm0, [ecx] movaps xmm1, xmm0 movaps xmm2, xmm0 movaps xmm3, xmm0 shufps xmm0, xmm1, 0x00 shufps xmm1, xmm1, 0x55 shufps xmm2, xmm2, 0xAA shufps xmm3, xmm3, 0xFF mulps xmm0, xmm4 mulps xmm1, xmm5 mulps xmm2, xmm6 mulps xmm3, xmm7 addps xmm0, xmm2 addps xmm1, xmm3 addps xmm0, xmm1 movups [esp - 0x20], xmm0 ; store row 0 of new matrix movups xmm0, [ecx + 0x10] movaps xmm1, xmm0 movaps xmm2, xmm0 movaps xmm3, xmm0 shufps xmm0, xmm0, 0x00 shufps xmm1, xmm1, 0x55 shufps xmm2, xmm2, 0xAA shufps xmm3, xmm3, 0xFF mulps xmm0, xmm4 mulps xmm1, xmm5 mulps xmm2, xmm6 mulps xmm3, xmm7 addps xmm0, xmm2 addps xmm1, xmm3 addps xmm0, xmm1 movups [esp - 0x30], xmm0 ; store row 1 of new matrix movups xmm0, [ecx + 0x20] movaps xmm1, xmm0 movaps xmm2, xmm0 movaps xmm3, xmm0 shufps xmm0, xmm0, 0x00 shufps xmm1, xmm1, 0x55 shufps xmm2, xmm2, 0xAA shufps xmm3, xmm3, 0xFF mulps xmm0, xmm4 mulps xmm1, xmm5 mulps xmm2, xmm6 mulps xmm3, xmm7 addps xmm0, xmm2 addps xmm1, xmm3 addps xmm0, xmm1 movups [esp - 0x40], xmm0 ; store row 2 of new matrix movups xmm0, [ecx + 0x30] movaps xmm1, xmm0 movaps xmm2, xmm0 movaps xmm3, xmm0 shufps xmm0, xmm0, 0x00 shufps xmm1, xmm1, 0x55 shufps xmm2, xmm2, 0xAA shufps xmm3, xmm3, 0xFF mulps xmm0, xmm4 mulps xmm1, xmm5 mulps xmm2, xmm6 mulps xmm3, xmm7 addps xmm0, xmm2 addps xmm1, xmm3 addps xmm0, xmm1 movups [eax + 0x30], xmm0 ; store row 3 of new matrix movups xmm0, [esp - 0x40] movups [eax + 0x20], xmm0 movups xmm0, [esp - 0x30] movups [eax + 0x10], xmm0 movups xmm0, [esp - 0x20] movups [eax], xmm0 ret 4
The .Net matrix multiplication source code: public static void Multiply(ref Matrix left, ref Matrix right, out Matrix result) { Matrix r; r.M11 = (left.M11 * right.M11) + (left.M12 * right.M21) + (left.M13 * right.M31) + (left.M14 * right.M41); r.M12 = (left.M11 * right.M12) + (left.M12 * right.M22) + (left.M13 * right.M32) + (left.M14 * right.M42); r.M13 = (left.M11 * right.M13) + (left.M12 * right.M23) + (left.M13 * right.M33) + (left.M14 * right.M43); r.M14 = (left.M11 * right.M14) + (left.M12 * right.M24) + (left.M13 * right.M34) + (left.M14 * right.M44); r.M21 = (left.M21 * right.M11) + (left.M22 * right.M21) + (left.M23 * right.M31) + (left.M24 * right.M41); r.M22 = (left.M21 * right.M12) + (left.M22 * right.M22) + (left.M23 * right.M32) + (left.M24 * right.M42); r.M23 = (left.M21 * right.M13) + (left.M22 * right.M23) + (left.M23 * right.M33) + (left.M24 * right.M43); r.M24 = (left.M21 * right.M14) + (left.M22 * right.M24) + (left.M23 * right.M34) + (left.M24 * right.M44); r.M31 = (left.M31 * right.M11) + (left.M32 * right.M21) + (left.M33 * right.M31) + (left.M34 * right.M41); r.M32 = (left.M31 * right.M12) + (left.M32 * right.M22) + (left.M33 * right.M32) + (left.M34 * right.M42); r.M33 = (left.M31 * right.M13) + (left.M32 * right.M23) + (left.M33 * right.M33) + (left.M34 * right.M43); r.M34 = (left.M31 * right.M14) + (left.M32 * right.M24) + (left.M33 * right.M34) + (left.M34 * right.M44); r.M41 = (left.M41 * right.M11) + (left.M42 * right.M21) + (left.M43 * right.M31) + (left.M44 * right.M41); r.M42 = (left.M41 * right.M12) + (left.M42 * right.M22) + (left.M43 * right.M32) + (left.M44 * right.M42); r.M43 = (left.M41 * right.M13) + (left.M42 * right.M23) + (left.M43 * right.M33) + (left.M44 * right.M43); r.M44 = (left.M41 * right.M14) + (left.M42 * right.M24) + (left.M43 * right.M34) + (left.M44 * right.M44); result = r;}

Source
3 likes 0 comments

Comments

Nobody has left a comment. You can be the first!
You must log in to join the conversation.
Don't have a GameDev.net account? Sign up!
Advertisement