SlimGen and You, Part ADD AL, [RAX] of N

posted in Washu's Journal

Published September 14, 2014

When you use SlimGen and write your SSE replacement methods, a question naturally arises: what calling convention does the CLR use?
The CLR uses a version of fastcall. On x86 processors this means that the first two parameters (that are DWORD or smaller) are passed in ECX and EDX. However, and this is where the CLR differs from standard fastcall, the parameters after the first two are pushed onto the stack from left to right, not right to left. This is important to remember, especially for functions that take a variable number of arguments. So a call like X('c', 2, 3.0f, "Hello"); becomes:

    X('c', 2, 3.0f, "Hello");
    00000025  push        40400000h                   ; 3.0f
    0000002a  push        dword ptr ds:[03402088h]    ; Address of "Hello"
    00000030  mov         edx,2
    00000035  mov         ecx,63h                     ; 'c'
    0000003a  call        FFB8B040

The situation is the same for member functions, except that the this pointer is passed in ECX, which leaves only EDX to hold an additional parameter. The rest are passed on the stack as before:

    p.Y(2, 3.0f);
    0000006d  push        40400000h                   ; 3.0f
    00000072  mov         ecx,dword ptr [ebp-40h]     ; this
    00000075  mov         edx,2
    0000007c  call        FFA1B048
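The two x86 listings above can be summed up in a small model. This is only a sketch in C, not anything the CLR or SlimGen exposes: the ArgKind enum and the assign_x86 function are hypothetical names invented for illustration, and the rule that floating-point values never occupy ECX or EDX is an assumption read off the "DWORD or smaller" wording and the listings.

```c
/* Hypothetical argument kinds for this model; these are not CLR types. */
typedef enum { ARG_INT32, ARG_REF, ARG_FLOAT32 } ArgKind;

#define MAX_ARGS 16

/*
 * Models the x86 rule described above.
 *   where[i]      receives "ECX", "EDX" or "stack" for argument i.
 *   push_order[]  receives the indices of the stack arguments in the
 *                 order they are pushed (left to right).
 * Returns the number of arguments that end up on the stack.
 */
static int assign_x86(const ArgKind *args, int count, int is_instance,
                      const char *where[], int push_order[])
{
    static const char *const regs[2] = { "ECX", "EDX" };
    int next_reg = is_instance ? 1 : 0;   /* 'this' already occupies ECX */
    int pushed = 0;

    for (int i = 0; i < count; ++i) {
        /* Assumption: floating-point values are never enregistered here. */
        int fits_register = (args[i] != ARG_FLOAT32);

        if (fits_register && next_reg < 2) {
            where[i] = regs[next_reg++];
        } else {
            where[i] = "stack";
            push_order[pushed++] = i;     /* pushed in source order */
        }
    }
    return pushed;
}
```

Feeding it the kinds for X('c', 2, 3.0f, "Hello") as a static call puts the first two arguments in ECX and EDX and pushes 3.0f before the string reference, which is the order the first listing shows. Feeding it Y(2, 3.0f) as an instance call puts 2 in EDX and 3.0f on the stack, as in the second listing.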

This all seems clear enough, but it's important to note these differences, especially when you're poking around in the low-level bowels of the CLR or doing what SlimGen does: replacing actual method bodies.
This raises the question: what about the x64 platform? Well, again, the calling convention is fastcall with a few differences. The first four parameters are passed in RCX, RDX, R8 and R9 (or their smaller sub-registers, such as EDX or CX), unless a parameter is a floating-point type, in which case it is passed in the XMM register for its position (XMM0 through XMM3).

    Z('c', 2, 3.0f, "Hello", 1.0, pa);
    000000c0  mov         r9,124D3100h
    000000ca  mov         r9,qword ptr [r9]           ; "Hello"
    000000cd  mov         rax,qword ptr [rsp+38h]     ; pa (IntPtr[])
    000000d2  mov         qword ptr [rsp+28h],rax     ; pa - stack spill
    000000d7  movsd       xmm0,mmword ptr [00000118h] ; 1.0
    000000df  movsd       mmword ptr [rsp+20h],xmm0   ; 1.0 - stack spill
    000000e5  movss       xmm2,dword ptr [00000110h]  ; 3.0f
    000000ed  mov         edx,2                       ; int (2)
    000000f2  mov         cx,63h                      ; 'c'
    000000f6  call        FFFFFFFFFFEC9300

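Whew, that looks pretty nasty, doesn't it? Look closely, though, and you'll see that the first four parameters are all passed in registers: 'c' in CX, 2 in EDX, 3.0f in XMM2 and "Hello" in R9. The two stores labeled "stack spill" are the fifth and sixth parameters (1.0 and pa), which no longer fit in registers and are passed on the stack at [rsp+20h] and [rsp+28h]. The 32 bytes below them, [rsp] through [rsp+1Fh], are reserved by the calling convention so that the callee can spill the four register parameters into memory (or read them back) when it needs those registers for something else. Calling an instance method follows pretty much the same rules, except that the this pointer is passed in RCX first.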

    p.Q(~0L, ~1L, ~2L, ~3);
    0000010a  mov         rcx,qword ptr [rsp+30h]     ; this pointer
    0000010f  mov         qword ptr [rsp+20h],0FFFFFFFFFFFFFFFCh ; ~3L, spilled to stack
    00000118  mov         r9,0FFFFFFFFFFFFFFFDh       ; ~2L
    0000011f  mov         r8,0FFFFFFFFFFFFFFFEh       ; ~1L
    00000126  mov         rdx,0FFFFFFFFFFFFFFFFh      ; ~0L
    0000012d  call        FFFFFFFFFFEC9310
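The register and stack locations in the two x64 listings above follow a purely positional rule, which can be sketched as a small C helper. The names ArgClass and locate_x64 are hypothetical and exist only for this illustration. The 20h offset for the first stack parameter is taken from the listings, and treating the 32 bytes below it as reserved space follows the standard Windows x64 convention, which is assumed to apply here.

```c
#include <stdio.h>

/* Hypothetical classification used only by this model. */
typedef enum { ARG_INT, ARG_FLOAT } ArgClass;

/*
 * Writes the location of the parameter in position `slot` into buf.
 * Slots are 0-based; for an instance method, 'this' is slot 0 and the
 * first declared parameter is slot 1.
 */
static void locate_x64(int slot, ArgClass cls, char *buf, size_t len)
{
    static const char *const int_regs[4] = { "RCX", "RDX", "R8", "R9" };

    if (slot < 4) {
        if (cls == ARG_FLOAT)
            snprintf(buf, len, "XMM%d", slot);   /* register chosen by position */
        else
            snprintf(buf, len, "%s", int_regs[slot]);
    } else {
        /* Assumption: 20h bytes are reserved below the stack parameters,
           so slot 4 lives at [rsp+20h], slot 5 at [rsp+28h], and so on. */
        unsigned offset = 0x20u + 8u * (unsigned)(slot - 4);
        snprintf(buf, len, "[rsp+%Xh]", offset);
    }
}
```

For Z('c', 2, 3.0f, "Hello", 1.0, pa), slot 2 is a float and lands in XMM2, slot 3 lands in R9, and slots 4 and 5 land at [rsp+20h] and [rsp+28h]. For p.Q(...), this takes slot 0 (RCX), so the fourth declared parameter becomes slot 4 and goes to [rsp+20h], matching the listing.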

Calling a function and passing something larger than a register can store does pose an interesting problem. The CLR deals with it by copying the entire value onto the stack and passing the address of that copy (hence call by value):

    var v = new Vector();
    p.R(v);
    00000169  lea         rcx,[rsp+40h]
    0000016e  mov         rax,qword ptr [rcx]
    00000171  mov         qword ptr [rsp+50h],rax
    00000176  mov         rax,qword ptr [rcx+8]
    0000017a  mov         qword ptr [rsp+58h],rax
    0000017f  lea         rdx,[rsp+50h]
    00000184  mov         rcx,r8
    00000187  call        FFFFFFFFFFEC9318

As you can see, it copies the data from the vector onto the stack, puts the address of that copy in RDX, stores the this pointer in RCX, and then calls the function. This is why pass by reference is the preferred method (for fast code) to move around structures that are non-trivial.
All of this goes into writing our matrix multiplication method (which assumes the output is not one of the inputs):
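The difference between the two ways of passing a structure can be shown with a short C analogy. This is a sketch, not CLR code: the 16-byte Vector layout is an assumption based on the two 8-byte copies in the listing, and the two function names are made up for illustration. A pointer parameter stands in for a C# ref parameter.

```c
/* Hypothetical 16-byte structure, matching the two qword copies above. */
typedef struct { float x, y, z, w; } Vector;

/* By value: the callee receives its own copy, so the write to v.x
   is invisible to the caller. */
static float sum_by_value(Vector v)
{
    v.x = 0.0f;
    return v.x + v.y + v.z + v.w;
}

/* By pointer (the analogue of ref): nothing is copied, and the write
   to v->x changes the caller's structure. */
static float sum_by_ref(Vector *v)
{
    v->x = 0.0f;
    return v->x + v->y + v->z + v->w;
}
```

Calling sum_by_value leaves the caller's Vector untouched at the cost of a copy per call, while sum_by_ref avoids the copy but lets the callee modify the original.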

    BITS        32
    ORG         0x59f0
    ;           void Multiply(ref Matrix, ref Matrix, out Matrix)
    ;           ecx = first matrix, edx = second matrix, [esp + 4] = output
    start:      mov     eax, [esp + 4]
                ; Load the four rows of the second matrix
                movups  xmm4, [edx]
                movups  xmm5, [edx + 0x10]
                movups  xmm6, [edx + 0x20]
                movups  xmm7, [edx + 0x30]
                ; Row 0: broadcast each element of the row, multiply, then sum
                movups  xmm0, [ecx]
                movaps  xmm1, xmm0
                movaps  xmm2, xmm0
                movaps  xmm3, xmm0
                shufps  xmm0, xmm1, 0x00
                shufps  xmm1, xmm1, 0x55
                shufps  xmm2, xmm2, 0xAA
                shufps  xmm3, xmm3, 0xFF
                mulps   xmm0, xmm4
                mulps   xmm1, xmm5
                mulps   xmm2, xmm6
                mulps   xmm3, xmm7
                addps   xmm0, xmm2
                addps   xmm1, xmm3
                addps   xmm0, xmm1
                movups  [eax], xmm0 ; Calculate row 0 of new matrix
                movups  xmm0, [ecx + 0x10]
                movaps  xmm1, xmm0
                movaps  xmm2, xmm0
                movaps  xmm3, xmm0
                shufps  xmm0, xmm0, 0x00
                shufps  xmm1, xmm1, 0x55
                shufps  xmm2, xmm2, 0xAA
                shufps  xmm3, xmm3, 0xFF
                mulps   xmm0, xmm4
                mulps   xmm1, xmm5
                mulps   xmm2, xmm6
                mulps   xmm3, xmm7
                addps   xmm0, xmm2
                addps   xmm1, xmm3
                addps   xmm0, xmm1
                movups  [eax + 0x10], xmm0 ; Calculate row 1 of new matrix
                movups  xmm0, [ecx + 0x20]
                movaps  xmm1, xmm0
                movaps  xmm2, xmm0
                movaps  xmm3, xmm0
                shufps  xmm0, xmm0, 0x00
                shufps  xmm1, xmm1, 0x55
                shufps  xmm2, xmm2, 0xAA
                shufps  xmm3, xmm3, 0xFF
                mulps   xmm0, xmm4
                mulps   xmm1, xmm5
                mulps   xmm2, xmm6
                mulps   xmm3, xmm7
                addps   xmm0, xmm2
                addps   xmm1, xmm3
                addps   xmm0, xmm1
                movups  [eax + 0x20], xmm0 ; Calculate row 2 of new matrix
                movups  xmm0, [ecx + 0x30]
                movaps  xmm1, xmm0
                movaps  xmm2, xmm0
                movaps  xmm3, xmm0
                shufps  xmm0, xmm0, 0x00
                shufps  xmm1, xmm1, 0x55
                shufps  xmm2, xmm2, 0xAA
                shufps  xmm3, xmm3, 0xFF
                mulps   xmm0, xmm4
                mulps   xmm1, xmm5
                mulps   xmm2, xmm6
                mulps   xmm3, xmm7
                addps   xmm0, xmm2
                addps   xmm1, xmm3
                addps   xmm0, xmm1
                movups  [eax + 0x30], xmm0 ; Calculate row 3 of new matrix
                ret     4                  ; Pop the one stack parameter (the output)
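For checking the SSE routine above, it helps to have a scalar version that computes the same thing. The following C function is a reference sketch, not the code SlimGen replaces: the Matrix layout (sixteen contiguous floats, row-major) is an assumption inferred from the 0x10-byte row offsets in the assembly, and the function name is hypothetical. Like the assembly, it assumes the output does not alias either input.

```c
/* Assumed layout: 4x4 floats, row-major, m[row][col]. */
typedef struct { float m[4][4]; } Matrix;

/*
 * out = a * b. Row i of the result is the sum over k of
 * a[i][k] times row k of b, which is what each block of
 * shufps/mulps/addps instructions computes four columns at a time.
 * out must not be the same object as a or b.
 */
static void multiply_reference(const Matrix *a, const Matrix *b, Matrix *out)
{
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            /* Same pairing as the addps sequence: (k0 + k2) + (k1 + k3). */
            float t02 = a->m[i][0] * b->m[0][j] + a->m[i][2] * b->m[2][j];
            float t13 = a->m[i][1] * b->m[1][j] + a->m[i][3] * b->m[3][j];
            out->m[i][j] = t02 + t13;
        }
    }
}
```

Comparing the output of this function against the replaced method on a handful of matrices is a cheap way to catch a wrong shuffle mask or a swapped register.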
