SlimGen and You, Part ADD AL, RAX of N

Quote:
Originally published on my ScapeCode blog.

When you use SlimGen to write SSE replacement methods, one question comes up quickly: what calling convention does the CLR use?

The CLR uses a variant of fastcall. On x86 processors this means that the first two parameters (that are DWORD sized or smaller) are passed in ECX and EDX. However, and this is where the CLR differs from standard fastcall, the parameters after the first two are pushed onto the stack from left to right, not right to left. This is important to remember, especially for functions that take a variable number of arguments. So a call like X('c', 2, 3.0f, "Hello"); becomes the following, with 3.0f pushed before "Hello":

X('c', 2, 3.0f, "Hello");
00000025 push 40400000h // 3.0f
0000002a push dword ptr ds:[03402088h] //Address of "Hello"
00000030 mov edx,2
00000035 mov ecx,63h //'c'
0000003a call FFB8B040

The situation is the same for member functions, except that the this pointer is passed in ECX, which leaves only EDX to hold an additional parameter. The rest are passed on the stack as before:

p.Y(2, 3.0f);
0000006d push 40400000h // 3.0f
00000072 mov ecx,dword ptr [ebp-40h] //this
00000075 mov edx,2
0000007c call FFA1B048
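
To make the x86 rules above concrete, here is a minimal sketch in C that models where each argument would land. It is not CLR or SlimGen code: `X86ArgLoc`, `assign_x86` and `print_x86` are hypothetical names made up for this illustration, and the model assumes that anything which is not a DWORD-sized (or smaller) integer-like value, such as a float, simply skips the registers and goes to the stack.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical model of the x86 rules described above. */
typedef struct {
    int in_register; /* 1: passed in ECX/EDX, 0: pushed on the stack */
    int index;       /* register number (0 = ECX, 1 = EDX) or push order */
} X86ArgLoc;

/* fits_dword[i] != 0 marks an integer-like argument of DWORD size or smaller.
   For an instance method, treat `this` as argument 0. */
void assign_x86(const int *fits_dword, int count, X86ArgLoc *locs)
{
    int next_reg = 0;
    int next_push = 0;

    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        if (fits_dword[i] && next_reg < 2) {
            locs[i].in_register = 1;
            locs[i].index = next_reg++;
        } else {
            /* Stack arguments are pushed in source order, left to right. */
            locs[i].in_register = 0;
            locs[i].index = next_push++;
        }
    }
}

/* Print one line per argument, e.g. "arg 0 -> ecx". */
void print_x86(const X86ArgLoc *locs, int count)
{
    static const char *regs[] = { "ecx", "edx" };

    for (int i = 0; i < count; ++i) {
        if (locs[i].in_register)
            printf("arg %d -> %s\n", i, regs[locs[i].index]);
        else
            printf("arg %d -> stack (push #%d)\n", i, locs[i].index);
    }
}
```

Feeding it the flags { 1, 1, 0, 1 } for X('c', 2, 3.0f, "Hello") yields ECX, EDX, first push, second push, which lines up with the first listing. For p.Y(2, 3.0f) the flags are { 1, 1, 0 } with this as argument 0.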

This all seems clear enough, but the differences matter when you're poking around in the low-level bowels of the CLR, or when you're doing what SlimGen does: replacing actual method bodies.

This raises the question: what about the x64 platform? Well, again, the calling convention is fastcall with a few differences. The first four parameters are passed in RCX, RDX, R8 and R9 (or their smaller sub-registers, such as EDX or CX), unless those parameters are floating point types, in which case they are passed in the XMM register for the same position (XMM0 through XMM3).

Z('c', 2, 3.0f, "Hello", 1.0, pa);
000000c0 mov r9,124D3100h
000000ca mov r9,qword ptr [r9] // "Hello"
000000cd mov rax,qword ptr [rsp+38h] //pa (IntPtr[])
000000d2 mov qword ptr [rsp+28h],rax //pa - stack spill
000000d7 movsd xmm0,mmword ptr [00000118h] //1.0
000000df movsd mmword ptr [rsp+20h],xmm0 //1.0 - stack spill
000000e5 movss xmm2,dword ptr [00000110h] //3.0f
000000ed mov edx,2 //int (2)
000000f2 mov cx,63h //'c'
000000f6 call FFFFFFFFFFEC9300

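Whew, that looks pretty nasty, doesn't it? But notice that the first four parameters are all passed in registers: 'c' in CX, 2 in EDX, 3.0f in XMM2 and "Hello" in R9. Registers are assigned by position, which is why the third parameter lands in XMM2 rather than XMM0. The fifth and sixth parameters (1.0 and pa) have no register left, so they are written to the stack at [rsp+20h] and [rsp+28h] (the stores marked "stack spill" in the listing). The 32 bytes below them, [rsp] through [rsp+1Fh], are reserved by the calling convention so that the callee can spill the four register parameters into memory (and read them back) when it needs those registers for something else. Calling an instance method follows pretty much the same rules, except the this pointer is passed in RCX first.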

p.Q(~0L, ~1L, ~2L, ~3);
0000010a mov rcx,qword ptr [rsp+30h] // this pointer
0000010f mov qword ptr [rsp+20h],0FFFFFFFFFFFFFFFCh //~3L, spilled to stack
00000118 mov r9,0FFFFFFFFFFFFFFFDh //~2L
0000011f mov r8,0FFFFFFFFFFFFFFFEh //~1L
00000126 mov rdx,0FFFFFFFFFFFFFFFFh //~0L
0000012d call FFFFFFFFFFEC9310
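
The positional nature of the x64 rules is easier to see in a small model. The sketch below is C written purely for illustration: `X64ArgLoc`, `assign_x64` and `print_x64` are hypothetical names, and it assumes every argument occupies exactly one 8-byte slot, with stack offsets measured from RSP at the call instruction as in the listings above.

```c
#include <assert.h>
#include <stdio.h>

typedef enum { X64_INT_REG, X64_XMM_REG, X64_STACK } X64Where;

/* Hypothetical model of where one argument is passed on x64. */
typedef struct {
    X64Where where;
    int value; /* register slot 0..3, or byte offset from RSP for stack args */
} X64ArgLoc;

/* is_float[i] != 0 marks a float or double argument.
   For an instance method, treat `this` as argument 0. */
void assign_x64(const int *is_float, int count, X64ArgLoc *locs)
{
    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        if (i < 4) {
            /* The slot number picks the register: slot 2 is R8 or XMM2. */
            locs[i].where = is_float[i] ? X64_XMM_REG : X64_INT_REG;
            locs[i].value = i;
        } else {
            /* Stack arguments start above the 32 reserved bytes. */
            locs[i].where = X64_STACK;
            locs[i].value = 0x20 + 8 * (i - 4);
        }
    }
}

/* Print one line per argument, e.g. "arg 2 -> xmm2". */
void print_x64(const X64ArgLoc *locs, int count)
{
    static const char *int_regs[] = { "rcx", "rdx", "r8", "r9" };

    for (int i = 0; i < count; ++i) {
        switch (locs[i].where) {
        case X64_INT_REG:
            printf("arg %d -> %s\n", i, int_regs[locs[i].value]);
            break;
        case X64_XMM_REG:
            printf("arg %d -> xmm%d\n", i, locs[i].value);
            break;
        case X64_STACK:
            printf("arg %d -> [rsp+%02Xh]\n", i, (unsigned)locs[i].value);
            break;
        }
    }
}
```

For Z('c', 2, 3.0f, "Hello", 1.0, pa) the flags are { 0, 0, 1, 0, 1, 0 }, and the model puts 3.0f in XMM2, "Hello" in R9, and the last two arguments at [rsp+20h] and [rsp+28h], the same places the disassembly uses. For p.Q the this pointer takes slot 0, which pushes the fourth long onto the stack.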

Passing something larger than a register can hold poses an interesting problem. The CLR deals with it by copying the entire value onto the stack and passing the address of that copy, which preserves call-by-value semantics:

var v = new Vector();
p.R(v);
00000169 lea rcx,[rsp+40h]
0000016e mov rax,qword ptr [rcx]
00000171 mov qword ptr [rsp+50h],rax
00000176 mov rax,qword ptr [rcx+8]
0000017a mov qword ptr [rsp+58h],rax
0000017f lea rdx,[rsp+50h]
00000184 mov rcx,r8
00000187 call FFFFFFFFFFEC9318

As you can see, it copies the 16 bytes of the vector to a temporary on the stack, puts the address of that copy in RDX, stores the this pointer in RCX, and then calls the function. This copying is why passing by reference is the preferred way (for fast code) to move non-trivial structures around.
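
The same trade-off exists in C, which makes for a compact illustration of the semantics (not of the CLR's code generation). In this sketch `Vector4`, `scaled_copy`, `scale_in_place` and `demo_copy_semantics` are hypothetical names, and the 16-byte struct is only assumed to resemble the Vector in the listing.

```c
#include <assert.h>

/* A 16-byte value type, assumed to resemble the Vector in the listing. */
typedef struct { float x, y, z, w; } Vector4;

/* By value: the callee works on its own copy of all 16 bytes. */
Vector4 scaled_copy(Vector4 v, float s)
{
    v.x *= s; v.y *= s; v.z *= s; v.w *= s;
    return v;
}

/* By reference: only an address is passed, and the caller's data changes. */
void scale_in_place(Vector4 *v, float s)
{
    v->x *= s; v->y *= s; v->z *= s; v->w *= s;
}

/* Shows that the by-value call leaves the caller's struct untouched. */
void demo_copy_semantics(void)
{
    Vector4 v = { 1.0f, 2.0f, 3.0f, 4.0f };

    Vector4 doubled = scaled_copy(v, 2.0f);
    assert(v.x == 1.0f && doubled.x == 2.0f); /* original is unchanged */

    scale_in_place(&v, 2.0f);
    assert(v.x == 2.0f && v.w == 8.0f);       /* original is modified */
}
```

The by-value version pays for a copy on every call, which is the cost the ref and out parameters in the Multiply signature avoid.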


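All of this goes into writing our matrix multiplication method (which assumes the output is not one of the inputs). The two ref parameters arrive in ECX and EDX, the out parameter is on the stack at [esp + 4], and ret 4 removes it on return: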

BITS        32
ORG 0x59f0
; void Multiply(ref Matrix, ref Matrix, out Matrix)
start: mov eax, [esp + 4]
movups xmm4, [edx]
movups xmm5, [edx + 0x10]
movups xmm6, [edx + 0x20]
movups xmm7, [edx + 0x30]

movups xmm0, [ecx]
movaps xmm1, xmm0
movaps xmm2, xmm0
movaps xmm3, xmm0
shufps xmm0, xmm0, 0x00
shufps xmm1, xmm1, 0x55
shufps xmm2, xmm2, 0xAA
shufps xmm3, xmm3, 0xFF

mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm0, xmm2
addps xmm1, xmm3
addps xmm0, xmm1

movups [eax], xmm0 ; Calculate row 0 of new matrix

movups xmm0, [ecx + 0x10]
movaps xmm1, xmm0
movaps xmm2, xmm0
movaps xmm3, xmm0
shufps xmm0, xmm0, 0x00
shufps xmm1, xmm1, 0x55
shufps xmm2, xmm2, 0xAA
shufps xmm3, xmm3, 0xFF

mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm0, xmm2
addps xmm1, xmm3
addps xmm0, xmm1

movups [eax + 0x10], xmm0 ; Calculate row 1 of new matrix

movups xmm0, [ecx + 0x20]
movaps xmm1, xmm0
movaps xmm2, xmm0
movaps xmm3, xmm0
shufps xmm0, xmm0, 0x00
shufps xmm1, xmm1, 0x55
shufps xmm2, xmm2, 0xAA
shufps xmm3, xmm3, 0xFF

mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm0, xmm2
addps xmm1, xmm3
addps xmm0, xmm1

movups [eax + 0x20], xmm0 ; Calculate row 2 of new matrix

movups xmm0, [ecx + 0x30]
movaps xmm1, xmm0
movaps xmm2, xmm0
movaps xmm3, xmm0
shufps xmm0, xmm0, 0x00
shufps xmm1, xmm1, 0x55
shufps xmm2, xmm2, 0xAA
shufps xmm3, xmm3, 0xFF

mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm0, xmm2
addps xmm1, xmm3
addps xmm0, xmm1

movups [eax + 0x30], xmm0 ; Calculate row 3 of new matrix
ret 4
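
When replacing a method body with hand-written SSE, it helps to have a plain scalar version to compare results against. The following is a minimal reference sketch in C, not the managed method SlimGen replaces: the `Matrix` struct and `multiply_reference` are hypothetical, and the layout is assumed to be 16 contiguous floats in row-major order, which is what the 0x10-byte row offsets in the assembly suggest.

```c
#include <assert.h>

/* Assumed layout: four rows of four floats, 64 bytes, row-major. */
typedef struct { float m[4][4]; } Matrix;

/* Scalar reference for out = a * b. Like the SSE version, it requires
   that `out` is not one of the inputs. */
void multiply_reference(const Matrix *a, const Matrix *b, Matrix *out)
{
    assert(out != a && out != b);

    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            /* Same pairing of terms as the addps sequence:
               (a0*b0 + a2*b2) + (a1*b1 + a3*b3). */
            float even = a->m[row][0] * b->m[0][col]
                       + a->m[row][2] * b->m[2][col];
            float odd  = a->m[row][1] * b->m[1][col]
                       + a->m[row][3] * b->m[3][col];
            out->m[row][col] = even + odd;
        }
    }
}
```

Each output row is the sum of the four rows of b, weighted by the four elements of the matching row of a, which is what the shufps broadcasts followed by mulps and addps compute four columns at a time.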