SlimGen and You, Part ADD EAX, [EAX] of N

posted in There is no escape from the Washu

Published September 14, 2014

So far I've covered how SlimGen works and the difficulties in doing what it does, including calling convention issues that one must be made aware of when writing replacement methods for use with SlimGen.
So the next question arises, just how much of a difference can using SlimGen make? Well, a lot of that will depend on the developer and their skill level. But we also were pretty curious about this and so we slapped together a test sample that runs through a series of matrix multiplications and times it. It uses three arrays to perform the multiplications, two of the arrays contains 100,000 randomly generated matrixes, with the third being used as the destinations for the results. Both matrix multiplications (the SlimGen one and the .Net one) assume that a source can also be used as a destination, and so they are overlap safe.
The timing results will vary, of course, from machine to machine depending on the processor in the machine, how much ram you have and also on what you're doing at the time. Running the results against my Phenom 9850 I get:

Total Matrix Count Per Run:  100,000  Multiply        Total Ticks: 2,001,059  SlimGenMultiply Total Ticks: 1,269,200  Improvement:                 36.57 %

While when I run it against my T8300 Core2 Duo laptop I get:

Total Matrix Count Per Run:  100,000  Multiply        Total Ticks: 2,175,380  SlimGenMultiply Total Ticks: 1,621,830  Improvement:                 25.45 %

Still, 25-35% improvement over the FPU based multiply is quite significant. Since X64 support hasn't been fully hammered out (in that it "works" but hasn't been sufficiently verified as working), those numbers are unavailable at the moment. However, they should be available in the near future as we finalize error handling and ensure that there are no bugs in the x64 assembly handling.
So why the great difference in performance? Well, part of it is the method size, the .Net method is 566 bytes of pure code, that's over half a kilobyte of code that has to be walked through by the processor, code which needs to be brought into the instruction-cache on the CPU and executed, meanwhile the SSE2 method is around half that size, at 266 bytes. The smaller your footprint in the I-cache, the fewer hits you take and the more likely your code is to actually be IN the I-cache. Then there's the instructions, SSE2 has been around for a while, and so it has had plenty of time to be wrangled around with by CPU manufacturers to ensure optimal performance. Finally there's the memory hit issue, the SSE2 based code hits memory a minimal number of times, reducing the chances of cache misses, after the first read/write, except for a few cases.
Finally there's how it deals with storage of the temporary results. The .Net FPU based version allocates a Matrix type on the stack, calls the constructor (which 0 initializes it), and then proceeds to overwrite those entries one by one with the results of each set of dot products. At the end of the method it does what amounts to a memcpy, and copies the temporary matrix over the result matrix. The SSE2 version however doesn't bother with initializing the stack and only stores three of the results on the stack, opting to write out the final result directly to the destination. The three other rows are then moved back into XMM registers and then back out to the destination.
The SSE2 source code, followed by the .Net source code, note that both are functionally equivalent:

start:      mov     eax, [esp + 4]              movups  xmm4, [edx]            movups  xmm5, [edx + 0x10]            movups  xmm6, [edx + 0x20]            movups  xmm7, [edx + 0x30]            movups  xmm0, [ecx]            movaps  xmm1, xmm0            movaps  xmm2, xmm0            movaps  xmm3, xmm0            shufps  xmm0, xmm1, 0x00            shufps  xmm1, xmm1, 0x55            shufps  xmm2, xmm2, 0xAA            shufps  xmm3, xmm3, 0xFF            mulps   xmm0, xmm4            mulps   xmm1, xmm5            mulps   xmm2, xmm6            mulps   xmm3, xmm7            addps   xmm0, xmm2            addps   xmm1, xmm3            addps   xmm0, xmm1            movups  [esp - 0x20], xmm0 ; store row 0 of new matrix            movups  xmm0, [ecx + 0x10]            movaps  xmm1, xmm0            movaps  xmm2, xmm0            movaps  xmm3, xmm0            shufps  xmm0, xmm0, 0x00            shufps  xmm1, xmm1, 0x55            shufps  xmm2, xmm2, 0xAA            shufps  xmm3, xmm3, 0xFF            mulps   xmm0, xmm4            mulps   xmm1, xmm5            mulps   xmm2, xmm6            mulps   xmm3, xmm7            addps   xmm0, xmm2            addps   xmm1, xmm3            addps   xmm0, xmm1            movups  [esp - 0x30], xmm0 ; store row 1 of new matrix            movups  xmm0, [ecx + 0x20]            movaps  xmm1, xmm0            movaps  xmm2, xmm0            movaps  xmm3, xmm0            shufps  xmm0, xmm0, 0x00            shufps  xmm1, xmm1, 0x55            shufps  xmm2, xmm2, 0xAA            shufps  xmm3, xmm3, 0xFF            mulps   xmm0, xmm4            mulps   xmm1, xmm5            mulps   xmm2, xmm6            mulps   xmm3, xmm7            addps   xmm0, xmm2            addps   xmm1, xmm3            addps   xmm0, xmm1            movups  [esp - 0x40], xmm0 ; store row 2 of new matrix            movups  xmm0, [ecx + 0x30]            movaps  xmm1, xmm0            movaps  xmm2, xmm0            movaps  xmm3, xmm0            shufps  xmm0, xmm0, 0x00            shufps  xmm1, xmm1, 0x55            shufps  xmm2, xmm2, 0xAA            shufps  xmm3, xmm3, 0xFF            mulps   xmm0, xmm4            mulps   xmm1, xmm5            mulps   xmm2, xmm6            mulps   xmm3, xmm7            addps   xmm0, xmm2            addps   xmm1, xmm3            addps   xmm0, xmm1            movups  [eax + 0x30], xmm0 ; store row 3 of new matrix            movups  xmm0, [esp - 0x40]            movups  [eax + 0x20], xmm0            movups  xmm0, [esp - 0x30]            movups  [eax + 0x10], xmm0            movups  xmm0, [esp - 0x20]            movups  [eax], xmm0            ret     4

The .Net matrix multiplication source code:

public static void Multiply(ref Matrix left, ref Matrix right, out Matrix result) {      Matrix r;    r.M11 = (left.M11 * right.M11) + (left.M12 * right.M21) + (left.M13 * right.M31) + (left.M14 * right.M41);    r.M12 = (left.M11 * right.M12) + (left.M12 * right.M22) + (left.M13 * right.M32) + (left.M14 * right.M42);    r.M13 = (left.M11 * right.M13) + (left.M12 * right.M23) + (left.M13 * right.M33) + (left.M14 * right.M43);    r.M14 = (left.M11 * right.M14) + (left.M12 * right.M24) + (left.M13 * right.M34) + (left.M14 * right.M44);    r.M21 = (left.M21 * right.M11) + (left.M22 * right.M21) + (left.M23 * right.M31) + (left.M24 * right.M41);    r.M22 = (left.M21 * right.M12) + (left.M22 * right.M22) + (left.M23 * right.M32) + (left.M24 * right.M42);    r.M23 = (left.M21 * right.M13) + (left.M22 * right.M23) + (left.M23 * right.M33) + (left.M24 * right.M43);    r.M24 = (left.M21 * right.M14) + (left.M22 * right.M24) + (left.M23 * right.M34) + (left.M24 * right.M44);    r.M31 = (left.M31 * right.M11) + (left.M32 * right.M21) + (left.M33 * right.M31) + (left.M34 * right.M41);    r.M32 = (left.M31 * right.M12) + (left.M32 * right.M22) + (left.M33 * right.M32) + (left.M34 * right.M42);    r.M33 = (left.M31 * right.M13) + (left.M32 * right.M23) + (left.M33 * right.M33) + (left.M34 * right.M43);    r.M34 = (left.M31 * right.M14) + (left.M32 * right.M24) + (left.M33 * right.M34) + (left.M34 * right.M44);    r.M41 = (left.M41 * right.M11) + (left.M42 * right.M21) + (left.M43 * right.M31) + (left.M44 * right.M41);    r.M42 = (left.M41 * right.M12) + (left.M42 * right.M22) + (left.M43 * right.M32) + (left.M44 * right.M42);    r.M43 = (left.M41 * right.M13) + (left.M42 * right.M23) + (left.M43 * right.M33) + (left.M44 * right.M43);    r.M44 = (left.M41 * right.M14) + (left.M42 * right.M24) + (left.M43 * right.M34) + (left.M44 * right.M44);    result = r;}

Source

Previous Entry SlimGen and You, Part ADD AL, [RAX] of N

Next Entry C++ Quiz #2

3 likes 0 comments

Comments

Nobody has left a comment. You can be the first!

You must log in to join the conversation.

Don't have a GameDev.net account? Sign up!

Washu

Author

SlimGen and You, Part ADD EAX, [EAX] of N

Comments

Washu

Latest Entries

Sweet Snippets - Handling Input and Callbacks with Awesomium

Sweet Snippets - More Using Awesomium and Direct3D

Sweet Snippets - Rendering Web Pages to Texture using Awesomium and Direct3D

Sweet Snippets - More Text Rendering with DirectWrite/Direct2D and Direct3D11.

Sweet Snippets - Rendering Text with DirectWrite/Direct2D and Direct3D11.

C++ Quiz #4

C++ Quiz #4

C++ Quiz #3

C++ Quiz #3

C++ Quiz #2

SlimGen and You, Part ADD EAX, [EAX] of N

Comments

Washu

Latest Entries

Sweet Snippets - Handling Input and Callbacks with Awesomium

Sweet Snippets - More Using Awesomium and Direct3D

Sweet Snippets - Rendering Web Pages to Texture using Awesomium and Direct3D

Sweet Snippets - More Text Rendering with DirectWrite/Direct2D and Direct3D11.

Sweet Snippets - Rendering Text with DirectWrite/Direct2D and Direct3D11.

C++ Quiz #4

C++ Quiz #4

C++ Quiz #3

C++ Quiz #3

C++ Quiz #2

Reticulating splines