SSE question

7 comments, last by mattnewport 16 years, 5 months ago
Hi, can someone please explain to me why the former code is faster than the latter? (It should be the other way around.)

extern "C" Vector maskVectorW; //(1,1,1,0)

inline void GetBlinn(Vector& vLight, Vector* vertexVectors, float& cosh)
{
    __asm
    {
        mov edi, vertexVectors
        mov esi, vLight
        movaps xmm0, [edi+0x10]
        movaps xmm1, xmm0
        mulps xmm0, xmm0
        mulps xmm0, maskVectorW
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        rsqrtps xmm0, xmm0
        mulps xmm1, xmm0 //xmm1 - normalized tVector
        movaps xmm0, [esi]
        subps xmm0, xmm1 //xmm0 - vl vector
        movaps xmm1, xmm0
        mulps xmm1, xmm1
        mulps xmm1, maskVectorW
        haddps xmm1, xmm1
        haddps xmm1, xmm1 //vl length
        sqrtps xmm1, xmm1
        divps xmm0, xmm1 //xmm0 - HalfWay vector
        movaps xmm1, [edi+0x30]
        mulps xmm1, maskVectorW
        mulps xmm0, xmm1
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        mov edi, cosh
        movss [edi], xmm0
    }
}

//LATTER CODE :)
inline void GetBlinn(Vector& vLight, Vector* vertexVectors, float& cosh)
{
    __asm
    {
        mov edi, vertexVectors
        mov esi, vLight
        movaps xmm3, maskVectorW //put maskVectorW in xmm3 - that's the only difference
        movaps xmm0, [edi+0x10]
        movaps xmm1, xmm0
        mulps xmm0, xmm0
        mulps xmm0, xmm3
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        rsqrtps xmm0, xmm0
        mulps xmm1, xmm0 //xmm1 - normalized tVector
        movaps xmm0, [esi]
        subps xmm0, xmm1 //xmm0 - vl vector
        movaps xmm1, xmm0
        mulps xmm1, xmm1
        mulps xmm1, xmm3
        haddps xmm1, xmm1
        haddps xmm1, xmm1 //vl length
        sqrtps xmm1, xmm1
        divps xmm0, xmm1 //xmm0 - HalfWay vector
        movaps xmm1, [edi+0x30]
        mulps xmm1, xmm3
        mulps xmm0, xmm1
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        mov edi, cosh
        movss [edi], xmm0
    }
}

Thanks, regards.
How are you timing the two functions? Generally the versions of instructions that take a memory operand have slightly longer latency than the versions that take a register operand, but the effect is going to be fairly small for this code. I would guess you'd have to average over a lot of runs, and make sure your benchmark isn't dominated by cache effects, to see a reliable timing difference between these two versions.
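For what it's worth, here's a minimal sketch of the kind of harness I'd use for something this small: time a whole batch per trial, take the best of many trials (so the first trial doubles as a cache warm-up), and feed both versions identical data. The Vector layout, the GetBlinnA/GetBlinnB names and the four-Vectors-per-vertex stride are assumptions standing in for your actual code.

#include <intrin.h>   // __rdtsc (MSVC)
#include <cstddef>    // size_t

struct __declspec(align(16)) Vector { float x, y, z, w; }; // assumed 16-byte aligned layout

void GetBlinnA(Vector& vLight, Vector* vertexVectors, float& cosh); // first version in the post
void GetBlinnB(Vector& vLight, Vector* vertexVectors, float& cosh); // second version in the post

typedef void (*BlinnFn)(Vector& vLight, Vector* vertexVectors, float& cosh);

unsigned __int64 timeBest(BlinnFn f, Vector& light, Vector* verts, float* results,
                          size_t numVerts, int trials)
{
    unsigned __int64 best = (unsigned __int64)-1;
    for (int t = 0; t < trials; ++t)               // best-of-N; trial 0 warms the cache
    {
        unsigned __int64 start = __rdtsc();
        for (size_t i = 0; i < numVerts; ++i)      // time a batch, not a single call
            f(light, &verts[i * 4], results[i]);   // 4 Vectors per vertex, matching the [edi+0x10]/[edi+0x30] offsets
        unsigned __int64 elapsed = __rdtsc() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

Then compare timeBest(GetBlinnA, ...) against timeBest(GetBlinnB, ...) over the same buffers; if the two numbers aren't consistently different across runs, what you're seeing is probably noise rather than the register-versus-memory operand.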

This kind of optimization is a bit of a last resort. You'd see much better gains in this case from trying to interleave multiple shading calculations to hide latency, and you'll probably get better results using intrinsics rather than inline assembly to allow the compiler to do some of the heavy lifting for instruction scheduling and register allocation.

Game Programming Blog: www.mattnewport.com/blog

Quote:Original post by mattnewport
How are you timing the two functions? Generally the versions of instructions that take a memory operand have slightly longer latency than the versions that take a register operand but the effect is going to be fairly small for this code


...the effect is going to be non-existent for this code.

Quote:Original post by mattnewport
You'd see much better gains in this case from trying to interleave multiple shading calculations to hide latency


I think that this is likely to be correct.

Quote:Original post by mattnewport
and you'll probably get better results using intrinsics rather than inline assembly to allow the compiler to do some of the heavy lifting for instruction scheduling and register allocation.


I'm sorry. Did you even look at the code? Intrinsics won't help at all because there is no heavy lifting to be done. It's damn near one big dependency chain.

I responded to his post in another thread (he appears to have cross-posted the same question to two different categories), and there I pointed out that the division instruction dominates the runtime. In short, the two instructions following the division are guaranteed to be completely free. In fact, he could stuff a lot more non-dependent work in there for free if he has such work available.
Quote:Original post by Rockoon1
I'm sorry. Did you even look at the code? Intrinsics won't help at all because there is no heavy lifting to be done. It's damn near one big dependency chain.

I'm assuming this is a frequently called function inside an inner loop somewhere (it looks like part of a lighting calculation). If that's the case then, using intrinsics, the compiler is more likely to be able to inline the code effectively and hide latency by interleaving the dependent instructions with other work at the call site or by unrolling the loop. A block of inline assembly like this in a small function is frequently counter-productive because it basically prevents the compiler from doing any instruction re-ordering or scheduling at the call site to hide latency.

Even if this function is just being called in a tight loop over an array of Vectors, using intrinsics will still probably be faster because there's a decent chance the compiler will unroll the loop. Unrolling the loop with the inline assembly version won't be much help because the compiler won't be able to interleave work from multiple loop iterations.

Game Programming Blog: www.mattnewport.com/blog

In regards to intermixing with calling code:

Nearly every instruction here has a 3-cycle latency or worse. On an AMD64 the return to the caller will likely execute ~9 cycles PRIOR to the completion of the dependency chain. Things look even more futile on a Core 2.

In short, CPUs have been designed to handle precisely this sort of situation as well as can be expected. Out-of-order execution and register renaming exist precisely for finding more work to do during dependency chains.

The CPU doesn't need a compiler's help in this situation, and compilers are now designed with that in mind.

It is unlikely that a compiler would voluntarily inline this function if it were recoded with intrinsics, because inlining it is likely to HURT performance.

In regards to unrolling:

Approaching optimal here requires a staggered loop unroll (structured such that the CPU always has a division in progress), and compilers still don't do that. It's true that compilers are pretty good these days, but don't trust them blindly.
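To make the staggered unroll concrete, here's a minimal sketch of the structure (written with intrinsics only for readability; the same shape applies to hand-written assembly). Each iteration issues the divide for element i before finishing the post-divide work for element i-1, so a division is always in flight. The Vert layout, the helper names and the omission of the view-vector normalization are simplifications for illustration; this is not the OP's exact math.

#include <pmmintrin.h> // SSE3: _mm_hadd_ps
#include <cstddef>     // size_t

struct Vert { __m128 viewVec; __m128 normal; }; // assumed layout for the sketch

static inline __m128 dot4(__m128 a, __m128 b)   // dot product splatted across all lanes
{
    __m128 d = _mm_mul_ps(a, b);
    d = _mm_hadd_ps(d, d);
    return _mm_hadd_ps(d, d);
}

// Staggered loop: start the divide for element i, then finish element i-1
// (dot with its normal and scalar store) while that divide is in flight.
void blinnAll(__m128 lightDir, const Vert* verts, float* cosh, size_t n) // assumes n >= 1
{
    __m128 num  = _mm_sub_ps(lightDir, verts[0].viewVec);            // un-normalized half vector
    __m128 half = _mm_div_ps(num, _mm_sqrt_ps(dot4(num, num)));      // divide #0 issued

    for (size_t i = 1; i < n; ++i)
    {
        __m128 numI  = _mm_sub_ps(lightDir, verts[i].viewVec);       // start element i...
        __m128 halfI = _mm_div_ps(numI, _mm_sqrt_ps(dot4(numI, numI)));

        _mm_store_ss(&cosh[i - 1], dot4(half, verts[i - 1].normal)); // ...finish element i-1

        half = halfI;
    }
    _mm_store_ss(&cosh[n - 1], dot4(half, verts[n - 1].normal));     // drain the pipeline
}

The point of the structure is simply that there is always independent work available between issuing a divide and consuming its result.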

--

However, he has not asked how to make the code faster. He instead wondered why the second version was slower than the first, and it's because the code is basically one long dependency chain.
Here's an example of what I mean. Say you're actually doing diffuse and specular lighting at the same time in your loop (seems quite likely). Given this code:

struct Vertex
{
    __m128 pad0;
    __m128 viewVec;
    __m128 pad1;
    __m128 normal;
};

__m128 makeAndMaskVectorW()
{
    __m128 mask;
    mask.m128_i32[0] = -1;
    mask.m128_i32[1] = -1;
    mask.m128_i32[2] = -1;
    mask.m128_i32[3] = 0;
    return mask;
}

const __m128 zeroVec = { 0.f, 0.f, 0.f, 0.f };
const __m128 maskVectorW = { 1.f, 1.f, 1.f, 0 };
const __m128 andMaskVectorW = makeAndMaskVectorW();

inline void GetBlinn(const __m128& vLight, const __m128* vertexVectors, float& cosh)
{
    __asm
    {
        mov edi, vertexVectors
        mov esi, vLight
        movaps xmm0, [edi+0x10]
        movaps xmm1, xmm0
        mulps xmm0, xmm0
        mulps xmm0, maskVectorW
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        rsqrtps xmm0, xmm0
        mulps xmm1, xmm0 //xmm1 - normalized tVector
        movaps xmm0, [esi]
        subps xmm0, xmm1 //xmm0 - vl vector
        movaps xmm1, xmm0
        mulps xmm1, xmm1
        mulps xmm1, maskVectorW
        haddps xmm1, xmm1
        haddps xmm1, xmm1 //vl length
        sqrtps xmm1, xmm1
        divps xmm0, xmm1 //xmm0 - HalfWay vector
        movaps xmm1, [edi+0x30]
        mulps xmm1, maskVectorW
        mulps xmm0, xmm1
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        mov edi, cosh
        movss [edi], xmm0
    }
}

inline __m128 dot(const __m128& a, const __m128& b)
{
    __m128 temp = _mm_mul_ps(a, b);
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);
}

inline __m128 normalize(const __m128& v)
{
    return _mm_mul_ps(_mm_rsqrt_ps(dot(v, v)), v);
}

inline float GetBlinnIntrin(const __m128 l, const Vertex* __restrict vertexVectors)
{
    const __m128 viewVec = _mm_and_ps(vertexVectors->viewVec, andMaskVectorW);
    const __m128 v = normalize(viewVec);
    const __m128 h = normalize(_mm_sub_ps(l, v));
    const __m128 n = _mm_and_ps(vertexVectors->normal, andMaskVectorW);
    const __m128 nDotH = dot(n, h);
    float res;
    _mm_store_ss(&res, nDotH);
    return res;
}

inline float GetDiffuse(const __m128 l, const Vertex* __restrict vertexVectors)
{
    __m128 nDotL = dot(vertexVectors->normal, l);
    float res;
    _mm_store_ss(&res, nDotL);
    return res;
}


Here GetBlinn() is the first inline assembly version from the OP and GetBlinnIntrin() is my version using intrinsics (with a couple of extra optimizations). These test functions calculate both specular and diffuse:

__declspec(noinline) void testGetBlinn(const __m128& lightDir, const Vertex* __restrict verts,
                                       float* __restrict results, const size_t numVerts)
{
    for (size_t i = 0; i < numVerts; ++i)
    {
        GetBlinn(lightDir, &verts[i].pad0, results[i]);
        results[i] += GetDiffuse(lightDir, &verts[i]);
    }
}

__declspec(noinline) void testGetBlinnIntrin(const __m128& lightDir, const Vertex* __restrict verts,
                                             float* __restrict results, const size_t numVerts)
{
    for (size_t i = 0; i < numVerts; ++i)
    {
        results[i] = GetBlinnIntrin(lightDir, &verts[i]) + GetDiffuse(lightDir, &verts[i]);
    }
}


VC 2008 generates this for the inline assembly version:
__declspec(noinline) void testGetBlinn(const __m128& lightDir, const Vertex* __restrict verts,
                                       float* __restrict results, const size_t numVerts)
{
00401C80  push        ebp
00401C81  mov         ebp,esp
00401C83  and         esp,0FFFFFFF0h
00401C86  sub         esp,14h
00401C89  push        ebx
00401C8A  mov         ebx,dword ptr [ebp+8]
00401C8D  push        esi
00401C8E  push        edi
    for (size_t i = 0; i < numVerts; ++i)
00401C8F  mov         edx,400h
    {
        GetBlinn(lightDir, &verts[i].pad0, results[i]);
00401C94  mov         dword ptr [esp+18h],eax
00401C98  mov         dword ptr [esp+1Ch],ecx
00401C9C  mov         edi,dword ptr [esp+1Ch]
00401CA0  mov         esi,dword ptr [lightDir]
00401CA3  movaps      xmm0,xmmword ptr [edi+10h]
00401CA7  movaps      xmm1,xmm0
00401CAA  mulps       xmm0,xmm0
00401CAD  mulps       xmm0,xmmword ptr [___xi_z+54h (403160h)]
00401CB4  haddps      xmm0,xmm0
00401CB8  haddps      xmm0,xmm0
00401CBC  rsqrtps     xmm0,xmm0
00401CBF  mulps       xmm1,xmm0
00401CC2  movaps      xmm0,xmmword ptr [esi]
00401CC5  subps       xmm0,xmm1
00401CC8  movaps      xmm1,xmm0
00401CCB  mulps       xmm1,xmm1
00401CCE  mulps       xmm1,xmmword ptr [___xi_z+54h (403160h)]
00401CD5  haddps      xmm1,xmm1
00401CD9  haddps      xmm1,xmm1
00401CDD  sqrtps      xmm1,xmm1
00401CE0  divps       xmm0,xmm1
00401CE3  movaps      xmm1,xmmword ptr [edi+30h]
00401CE7  mulps       xmm1,xmmword ptr [___xi_z+54h (403160h)]
00401CEE  mulps       xmm0,xmm1
00401CF1  haddps      xmm0,xmm0
00401CF5  haddps      xmm0,xmm0
00401CF9  mov         edi,dword ptr [esp+18h]
00401CFD  movss       dword ptr [edi],xmm0
        results[i] += GetDiffuse(lightDir, &verts[i]);
00401D01  movaps      xmm0,xmmword ptr [ecx+30h]
00401D05  movaps      xmm1,xmmword ptr [ebx]
00401D08  mulps       xmm0,xmm1
00401D0B  movss       xmm1,dword ptr [eax]
00401D0F  haddps      xmm0,xmm0
00401D13  haddps      xmm0,xmm0
00401D17  addss       xmm1,xmm0
00401D1B  movss       dword ptr [eax],xmm1
00401D1F  add         ecx,40h
00401D22  add         eax,4
00401D25  sub         edx,1
00401D28  jne         testGetBlinn+14h (401C94h)
    }
}
00401D2E  pop         edi
00401D2F  pop         esi
00401D30  pop         ebx
00401D31  mov         esp,ebp
00401D33  pop         ebp
00401D34  ret


And this for the intrinsics version:

__declspec(noinline) void testGetBlinnIntrin(const __m128& lightDir, const Vertex* __restrict verts,
                                             float* __restrict results, const size_t numVerts)
{
00401D90  push        ebp
00401D91  mov         ebp,esp
00401D93  and         esp,0FFFFFFF0h
00401D96  movaps      xmm3,xmmword ptr [ecx]
    for (size_t i = 0; i < numVerts; ++i)
00401D99  mov         ecx,dword ptr [verts]
00401D9C  movaps      xmm4,xmmword ptr [andMaskVectorW (404390h)]
00401DA3  xor         eax,eax
00401DA5  add         ecx,30h
00401DA8  jmp         testGetBlinnIntrin+20h (401DB0h)
00401DAA  lea         ebx,[ebx]
    {
        results[i] = GetBlinnIntrin(lightDir, &verts[i]) + GetDiffuse(lightDir, &verts[i]);
00401DB0  movaps      xmm1,xmmword ptr [ecx-20h]
00401DB4  movaps      xmm2,xmmword ptr [ecx]
00401DB7  andps       xmm1,xmm4
00401DBA  movaps      xmm0,xmm1
00401DBD  mulps       xmm0,xmm1
00401DC0  haddps      xmm0,xmm0
00401DC4  haddps      xmm0,xmm0
00401DC8  rsqrtps     xmm0,xmm0
00401DCB  mulps       xmm0,xmm1
00401DCE  movaps      xmm1,xmm3
00401DD1  subps       xmm1,xmm0
00401DD4  movaps      xmm0,xmm1
00401DD7  mulps       xmm0,xmm1
00401DDA  haddps      xmm0,xmm0
00401DDE  haddps      xmm0,xmm0
00401DE2  rsqrtps     xmm0,xmm0
00401DE5  mulps       xmm0,xmm1
00401DE8  movaps      xmm1,xmm2
00401DEB  mulps       xmm2,xmm3
00401DEE  andps       xmm1,xmm4
00401DF1  mulps       xmm0,xmm1
00401DF4  haddps      xmm2,xmm2
00401DF8  haddps      xmm0,xmm0
00401DFC  haddps      xmm0,xmm0
00401E00  haddps      xmm2,xmm2
00401E04  addss       xmm2,xmm0
00401E08  movss       dword ptr [edx+eax*4],xmm2
00401E0D  inc         eax
00401E0E  add         ecx,40h
00401E11  cmp         eax,400h
00401E16  jb          testGetBlinnIntrin+20h (401DB0h)
    }
}
00401E18  mov         esp,ebp
00401E1A  pop         ebp
00401E1B  ret

Clearly the code generated for the version with intrinsics is better - the two calculations are interleaved, which should help hide some latency, and the compiler has also pulled constants like the light direction and the mask out of the loop and kept them in registers. Admittedly a modern x86 processor may get some of the benefit of the interleaving on its own thanks to the out-of-order execution engine, but I'd rather feed the processor decently scheduled code to start with than trust entirely to its re-ordering abilities.

In a real-world situation you would likely have more work to do in the loop, and the more work you do, the more room you're giving the compiler to reschedule things to hide latency. It can only do that if you're using intrinsics, though - it won't reorder anything inside an inline assembly block.

[Edited by - mattnewport on November 23, 2007 5:55:12 AM]

Game Programming Blog: www.mattnewport.com/blog

Thanks mattnewport. Very nice.

Regards.

I suggest comparing apples with apples. Maybe it's just me, but it seems like you are stroking yourself when you decided to remove the division, deliberately replaced it with another low-precision estimate, and then came here to brag about how great the compiler-generated code is.


Given that the function appears to be intended to calculate the specular term of a lighting equation, slightly reduced precision seemed a reasonable trade-off. On the random test data I used, the results never differed by more than 5e-4 from the version using the divide, which would normally be adequate precision for a lighting calculation.

If the OP needs the extra precision then the code can easily be changed to match the original calculations exactly. It's not relevant to the point I was demonstrating, which is that using intrinsics means the compiler is more likely to be able to schedule the code to hide latency.
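For example, here's a minimal sketch of one way to recover most of the precision without reintroducing divps or sqrtps: refine the _mm_rsqrt_ps estimate with one Newton-Raphson step. The rsqrt_nr/normalizeNR names are just for illustration (it reuses the dot() helper from my earlier listing); this is not what the posted timings used.

#include <pmmintrin.h> // SSE3 horizontal adds used by the dot() helper above

// One Newton-Raphson refinement of the ~12-bit rsqrt estimate:
// y' = 0.5 * y * (3 - x*y*y), bringing 1/sqrt(x) close to full single precision.
inline __m128 rsqrt_nr(const __m128& x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y = _mm_rsqrt_ps(x);
    return _mm_mul_ps(_mm_mul_ps(half, y),
                      _mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(y, y))));
}

// Drop-in alternative to the normalize() above, trading a few extra multiplies
// for accuracy while still avoiding the long-latency divide and square root.
inline __m128 normalizeNR(const __m128& v)
{
    return _mm_mul_ps(rsqrt_nr(dot(v, v)), v);
}

The extra multiplies lengthen the dependency chain slightly, but they are fully pipelined, so the cost is small compared with sqrtps followed by divps.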

If you want to have a useful discussion about the quality of compiler optimizations then I'm interested to have one. If you're just going to be a dick then don't bother posting.

[Edited by - mattnewport on November 23, 2007 4:57:07 PM]

Game Programming Blog: www.mattnewport.com/blog

