SlimGen and You, Part ADD AL, [RAX] of N

posted in Washu's Journal
Published September 14, 2014
When you use SlimGen to write your SSE replacement methods, one question does arise: what kind of calling convention does the CLR use?
The CLR uses a version of fastcall. On x86 processors this means that the first two parameters (that are DWORD or smaller) are passed in ECX and EDX. However, and this is where the CLR differs from standard fastcall, the parameters after the first two are pushed onto the stack from left to right, not right to left. This is important to remember, especially for functions that take a variable number of arguments. So a call like X('c', 2, 3.0f, "Hello"); becomes:

    X('c', 2, 3.0f, "Hello");
    00000025  push        40400000h                  ; 3.0f
    0000002a  push        dword ptr ds:[03402088h]   ; Address of "Hello"
    00000030  mov         edx,2
    00000035  mov         ecx,63h                    ; 'c'
    0000003a  call        FFB8B040
The situation is the same for member functions as well, except with this being passed in ECX, which leaves only EDX to hold an additional parameter. The rest are passed on the stack as before:

    p.Y(2, 3.0f);
    0000006d  push        40400000h                  ; 3.0f
    00000072  mov         ecx,dword ptr [ebp-40h]    ; this
    00000075  mov         edx,2
    0000007c  call        FFA1B048
So this all seems clear enough, but it's important to note these differences, especially when you're poking around in the low-level bowels of the CLR or when you're doing what SlimGen does, which is replacing actual method bodies.
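To make the x86 rules concrete, here is a minimal sketch in C that models where each argument ends up. Everything in it (the `Arg` and `Placement` types, `place_args_x86`, and the `fits_register` flag) is a hypothetical name made up for illustration, and deciding which types count as "DWORD or smaller" is left to the caller. It is a model of the rules described above, not code taken from the CLR.

```c
#include <assert.h>
#include <stddef.h>

/* Where a modelled argument ends up. */
typedef enum { LOC_ECX, LOC_EDX, LOC_STACK } ArgLocation;

typedef struct {
    const char *name;   /* label for the argument, illustration only */
    int fits_register;  /* nonzero if the caller treats it as register-sized */
} Arg;

typedef struct {
    ArgLocation location;
    int push_order;     /* 0-based order of the push, or -1 if in a register */
} Placement;

/* Walk the arguments left to right: the first two register-sized ones
   take ECX and EDX, everything else is pushed in left-to-right order. */
void place_args_x86(const Arg *args, size_t count, Placement *out)
{
    int registers_used = 0;
    int pushes = 0;

    assert(args != NULL && out != NULL);
    for (size_t i = 0; i < count; ++i) {
        if (args[i].fits_register && registers_used < 2) {
            out[i].location = (registers_used == 0) ? LOC_ECX : LOC_EDX;
            out[i].push_order = -1;
            ++registers_used;
        } else {
            out[i].location = LOC_STACK;
            out[i].push_order = pushes++;
        }
    }
}
```

Feeding it the four arguments of X('c', 2, 3.0f, "Hello") puts 'c' in ECX and 2 in EDX, then pushes 3.0f followed by "Hello", which matches the listing. For an instance method, model the this pointer as the first argument.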
So this does raise the question, what about on the x64 platform? Well, again, the calling convention is fastcall with a few differences. The first four parameters are passed in RCX, RDX, R8 and R9 (or their smaller sub-registers, such as EDX or CX), unless those parameters are floating point types, in which case they are passed in the XMM register that matches their position (XMM0 through XMM3). Any parameters after the fourth are passed on the stack.

    Z('c', 2, 3.0f, "Hello", 1.0, pa);
    000000c0  mov         r9,124D3100h
    000000ca  mov         r9,qword ptr [r9]            ; "Hello"
    000000cd  mov         rax,qword ptr [rsp+38h]      ; pa (IntPtr[])
    000000d2  mov         qword ptr [rsp+28h],rax      ; pa - sixth parameter, on the stack
    000000d7  movsd       xmm0,mmword ptr [00000118h]  ; 1.0
    000000df  movsd       mmword ptr [rsp+20h],xmm0    ; 1.0 - fifth parameter, on the stack
    000000e5  movss       xmm2,dword ptr [00000110h]   ; 3.0f
    000000ed  mov         edx,2                        ; int (2)
    000000f2  mov         cx,63h                       ; 'c'
    000000f6  call        FFFFFFFFFFEC9300
Whew, that looks pretty nasty, doesn't it? But if you look closely, the first four parameters are all passed in registers: 'c' in CX, 2 in EDX, 3.0f in XMM2 and "Hello" in R9. The fifth and sixth parameters (1.0 and pa) have no register left, so they are written to the stack at [rsp+20h] and [rsp+28h]. They start at offset 20h because the calling convention has the caller reserve the 32 bytes from [rsp] through [rsp+1Fh] as home space, where the callee can spill the four register parameters to memory (and read them back) when it needs those registers for something else. Calling an instance method follows pretty much the same rules, except the this pointer is passed in RCX first, which leaves three registers for the remaining parameters:

    p.Q(~0L, ~1L, ~2L, ~3);
    0000010a  mov         rcx,qword ptr [rsp+30h]                    ; this pointer
    0000010f  mov         qword ptr [rsp+20h],0FFFFFFFFFFFFFFFCh     ; ~3, passed on the stack
    00000118  mov         r9,0FFFFFFFFFFFFFFFDh                      ; ~2L
    0000011f  mov         r8,0FFFFFFFFFFFFFFFEh                      ; ~1L
    00000126  mov         rdx,0FFFFFFFFFFFFFFFFh                     ; ~0L
    0000012d  call        FFFFFFFFFFEC9310
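The x64 placement rule depends only on a parameter's position and on whether it is a floating point type, so it can be sketched in a few lines of C. The names below (`Slot`, `SlotKind`, `place_arg_x64`, `int_reg_name`) are hypothetical and exist only for this illustration. The sketch assumes every parameter fits in 8 bytes and that the caller reserves 32 bytes (20h) of stack ahead of the first stack-passed parameter, as the listings suggest.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { IN_INT_REG, IN_XMM_REG, ON_STACK } SlotKind;

typedef struct {
    SlotKind kind;
    unsigned value;  /* register number 0..3, or byte offset from RSP */
} Slot;

/* Name of the integer register used for parameter position 0..3. */
const char *int_reg_name(unsigned position)
{
    static const char *const names[4] = { "RCX", "RDX", "R8", "R9" };
    assert(position < 4);
    return names[position];
}

/* Position 0..3 selects a register (integer or XMM with the same number);
   later positions go on the stack, starting at [rsp+20h]. */
Slot place_arg_x64(size_t position, int is_float)
{
    Slot slot;
    if (position < 4) {
        slot.kind = is_float ? IN_XMM_REG : IN_INT_REG;
        slot.value = (unsigned)position;
    } else {
        slot.kind = ON_STACK;
        slot.value = 0x20u + 8u * (unsigned)(position - 4);
    }
    return slot;
}
```

For Z('c', 2, 3.0f, "Hello", 1.0, pa) this yields RCX, RDX, XMM2, R9, [rsp+20h] and [rsp+28h], in that order. For an instance method, treat the this pointer as position 0.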
Calling a function and passing something larger than a register can hold does pose an interesting problem. The CLR deals with it by copying the entire value into a temporary on the stack and passing the address of that copy (hence call by value):

    var v = new Vector();
    p.R(v);
    00000169  lea         rcx,[rsp+40h]
    0000016e  mov         rax,qword ptr [rcx]
    00000171  mov         qword ptr [rsp+50h],rax
    00000176  mov         rax,qword ptr [rcx+8]
    0000017a  mov         qword ptr [rsp+58h],rax
    0000017f  lea         rdx,[rsp+50h]
    00000184  mov         rcx,r8
    00000187  call        FFFFFFFFFFEC9318
As you can see, it copies the 16 bytes of the vector to a new location on the stack, puts the address of that copy in RDX, stores the this pointer in RCX, and then calls the function. This is why pass by reference is the preferred method (for fast code) to move around structures that are non-trivial.
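The same trade-off shows up in C, which makes for a compact illustration. In the sketch below the `Vector` layout (four floats, 16 bytes) and both function names are assumptions made up for the example. The by-value version works on a private copy of the struct, while the by-reference version receives only a pointer and modifies the caller's data in place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 16-byte vector, the same size as the one copied above. */
typedef struct { float x, y, z, w; } Vector;

/* By value: the callee gets its own copy, so the caller's vector
   is untouched. Returns the sum of the scaled components. */
float scaled_sum_by_value(Vector v, float s)
{
    v.x *= s; v.y *= s; v.z *= s; v.w *= s;
    return v.x + v.y + v.z + v.w;
}

/* By reference: only an address is passed and no copy is made,
   so the scaling is visible to the caller. */
void scale_by_reference(Vector *v, float s)
{
    assert(v != NULL);
    v->x *= s; v->y *= s; v->z *= s; v->w *= s;
}
```

The larger the struct, the more the by-value copy costs, which is why a 64-byte matrix is a much better fit for ref and out parameters.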
All of this goes into writing our matrix multiplication method (which assumes the output is not one of the inputs). Following the x86 rules above, the two ref parameters arrive in ECX and EDX, the out parameter is on the stack at [esp + 4], and ret 4 pops it on the way out:

    BITS 32
    ORG 0x59f0
    ; void Multiply(ref Matrix, ref Matrix, out Matrix)
    start:
        mov eax, [esp + 4]
        movups xmm4, [edx]
        movups xmm5, [edx + 0x10]
        movups xmm6, [edx + 0x20]
        movups xmm7, [edx + 0x30]

        movups xmm0, [ecx]
        movaps xmm1, xmm0
        movaps xmm2, xmm0
        movaps xmm3, xmm0
        shufps xmm0, xmm1, 0x00
        shufps xmm1, xmm1, 0x55
        shufps xmm2, xmm2, 0xAA
        shufps xmm3, xmm3, 0xFF
        mulps xmm0, xmm4
        mulps xmm1, xmm5
        mulps xmm2, xmm6
        mulps xmm3, xmm7
        addps xmm0, xmm2
        addps xmm1, xmm3
        addps xmm0, xmm1
        movups [eax], xmm0          ; Calculate row 0 of new matrix

        movups xmm0, [ecx + 0x10]
        movaps xmm1, xmm0
        movaps xmm2, xmm0
        movaps xmm3, xmm0
        shufps xmm0, xmm0, 0x00
        shufps xmm1, xmm1, 0x55
        shufps xmm2, xmm2, 0xAA
        shufps xmm3, xmm3, 0xFF
        mulps xmm0, xmm4
        mulps xmm1, xmm5
        mulps xmm2, xmm6
        mulps xmm3, xmm7
        addps xmm0, xmm2
        addps xmm1, xmm3
        addps xmm0, xmm1
        movups [eax + 0x10], xmm0   ; Calculate row 1 of new matrix

        movups xmm0, [ecx + 0x20]
        movaps xmm1, xmm0
        movaps xmm2, xmm0
        movaps xmm3, xmm0
        shufps xmm0, xmm0, 0x00
        shufps xmm1, xmm1, 0x55
        shufps xmm2, xmm2, 0xAA
        shufps xmm3, xmm3, 0xFF
        mulps xmm0, xmm4
        mulps xmm1, xmm5
        mulps xmm2, xmm6
        mulps xmm3, xmm7
        addps xmm0, xmm2
        addps xmm1, xmm3
        addps xmm0, xmm1
        movups [eax + 0x20], xmm0   ; Calculate row 2 of new matrix

        movups xmm0, [ecx + 0x30]
        movaps xmm1, xmm0
        movaps xmm2, xmm0
        movaps xmm3, xmm0
        shufps xmm0, xmm0, 0x00
        shufps xmm1, xmm1, 0x55
        shufps xmm2, xmm2, 0xAA
        shufps xmm3, xmm3, 0xFF
        mulps xmm0, xmm4
        mulps xmm1, xmm5
        mulps xmm2, xmm6
        mulps xmm3, xmm7
        addps xmm0, xmm2
        addps xmm1, xmm3
        addps xmm0, xmm1
        movups [eax + 0x30], xmm0   ; Calculate row 3 of new matrix

        ret 4
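If you want something to check the assembly against, a scalar version of the same computation is handy. The sketch below is a plain C reference, assuming a hypothetical `Matrix` type that stores 16 floats in row-major order (which is what the 0x10 row stride in the assembly suggests). It is not SlimGen code, and the names `Matrix` and `multiply` are made up for the example. Each output row is the sum of the four rows of the second matrix, each scaled by one element of the matching row of the first matrix, added in the same order as the addps instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical row-major 4x4 matrix, 64 bytes. */
typedef struct { float m[4][4]; } Matrix;

/* out = a * b. Like the assembly, this assumes out is not a or b. */
void multiply(const Matrix *a, const Matrix *b, Matrix *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    assert(out != a && out != b);

    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            /* One lane of the four mulps results for this row. */
            float t0 = a->m[row][0] * b->m[0][col];
            float t1 = a->m[row][1] * b->m[1][col];
            float t2 = a->m[row][2] * b->m[2][col];
            float t3 = a->m[row][3] * b->m[3][col];
            /* Same pairing as addps xmm0,xmm2 / xmm1,xmm3 / xmm0,xmm1. */
            out->m[row][col] = (t0 + t2) + (t1 + t3);
        }
    }
}
```

Running both versions over the same inputs and comparing the 16 floats is a cheap way to catch a wrong shuffle mask or a swapped operand before the replacement method goes anywhere near the CLR.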
