SSE question
Hi, can someone please explain to me why the former code is faster than the latter?
(it should be the other way around)
extern "C" Vector maskVectorW; //(1,1,1,0)
inline void GetBlinn(Vector& vLight, Vector* vertexVectors, float& cosh)
{
__asm
{
mov edi, vertexVectors
mov esi, vLight
movaps xmm0, [edi+0x10]
movaps xmm1, xmm0
mulps xmm0, xmm0
mulps xmm0, maskVectorW
haddps xmm0, xmm0
haddps xmm0, xmm0
rsqrtps xmm0, xmm0
mulps xmm1, xmm0 //xmm1-normalized tVector
movaps xmm0, [esi]
subps xmm0, xmm1 //xmm0 - vl vector
movaps xmm1, xmm0
mulps xmm1, xmm1
mulps xmm1, maskVectorW
haddps xmm1, xmm1
haddps xmm1, xmm1//vl length
sqrtps xmm1, xmm1
divps xmm0, xmm1 //xmm0 - HalfWay vector
movaps xmm1, [edi+0x30]
mulps xmm1, maskVectorW
mulps xmm0, xmm1
haddps xmm0, xmm0
haddps xmm0, xmm0
mov edi, cosh
movss [edi], xmm0
}
}
//LATTER CODE :)
inline void GetBlinn(Vector& vLight, Vector* vertexVectors, float& cosh)
{
__asm
{
mov edi, vertexVectors
mov esi, vLight
movaps xmm3, maskVectorW //put in xmm3 - that's the diff
movaps xmm0, [edi+0x10]
movaps xmm1, xmm0
mulps xmm0, xmm0
mulps xmm0, xmm3
haddps xmm0, xmm0
haddps xmm0, xmm0
rsqrtps xmm0, xmm0
mulps xmm1, xmm0 //xmm1-normalized tVector
movaps xmm0, [esi]
subps xmm0, xmm1 //xmm0 - vl vector
movaps xmm1, xmm0
mulps xmm1, xmm1
mulps xmm1, xmm3
haddps xmm1, xmm1
haddps xmm1, xmm1//vl length
sqrtps xmm1, xmm1
divps xmm0, xmm1 //xmm0 - HalfWay vector
movaps xmm1, [edi+0x30]
mulps xmm1, xmm3
mulps xmm0, xmm1
haddps xmm0, xmm0
haddps xmm0, xmm0
mov edi, cosh
movss [edi], xmm0
}
}
thanks, regards.
How are you timing the two functions? Generally the versions of instructions that take a memory operand have slightly longer latency than the versions that take a register operand, but the effect is going to be fairly small for this code. I would guess you'd have to average over a lot of runs, and be careful how you benchmark so you're not seeing cache effects, to see a reliable timing difference between these two versions.
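For what it's worth, something along these lines is roughly what I mean by averaging over a lot of runs - the light and vertex data declarations and the run count below are placeholders I've made up, not your code, and the data needs to be 16-byte aligned for the movaps loads:
#include <intrin.h>   // __rdtsc on MSVC
#include <cstdio>

extern Vector vLightTest;        // hypothetical, 16-byte aligned test data
extern Vector vertexData[4];     // laid out like your vertexVectors array

void timeGetBlinn()
{
    float cosh = 0.0f;
    const int kRuns = 1000000;

    GetBlinn(vLightTest, vertexData, cosh);   // warm the cache first

    unsigned __int64 start = __rdtsc();
    for (int i = 0; i < kRuns; ++i)
        GetBlinn(vLightTest, vertexData, cosh);
    unsigned __int64 elapsed = __rdtsc() - start;

    printf("%.1f cycles per call (cosh = %f)\n", double(elapsed) / kRuns, cosh);
}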
This kind of optimization is a bit of a last resort. You'd see much better gains in this case from trying to interleave multiple shading calculations to hide latency and you'll probably get better results using intrinsics rather than inline assembly to allow the compiler to do some of the heavy lifting for instruction scheduling and register allocation.
Quote:Original post by mattnewport
How are you timing the two functions? Generally the versions of instructions that take a memory operand have slightly longer latency than the versions that take a register operand, but the effect is going to be fairly small for this code
...the effect is going to be non-existent for this code.
Quote:Original post by mattnewport
You'd see much better gains in this case from trying to interleave multiple shading calculations to hide latency
I think that this is likely to be correct.
Quote:Original post by mattnewport
and you'll probably get better results using intrinsics rather than inline assembly to allow the compiler to do some of the heavy lifting for instruction scheduling and register allocation.
I'm sorry. Did you even look at the code? Intrinsics won't help at all because there is no heavy lifting to be done. It's damn near one big dependency chain.
I responded to his post in another thread (he appears to have cross-posted the same question to two different categories) and in that I point out that the division instruction dominates the runtime. In short, the two instructions following the division are guaranteed to be completely free. In fact, he could stuff a lot more non-dependent work in there for free if he has said work available.
Quote:Original post by Rockoon1
I'm sorry. Did you even look at the code? Intrinsics won't help at all because there is no heavy lifting to be done. It's damn near one big dependency chain.
I'm assuming this is a frequently called function inside an inner loop somewhere (looks like part of a lighting calculation). If that's the case then using intrinsics the compiler is more likely to be able to inline the code effectively and hide latency by interleaving the dependent instructions with other work at the call site or by unrolling the loop. A block of inline assembly like this in a small function is frequently counter-productive because it basically prevents the compiler from doing any instruction re-ordering or scheduling at the call site to hide latency.
Even if this function is just being called in a tight loop over an array of Vectors, using intrinsics will still probably be faster because there's a decent chance the compiler will unroll the loop. Unrolling the loop with the inline assembly version won't be much help because the compiler won't be able to interleave work from multiple loop iterations.
In regards to intermixing with calling code:
Nearly every instruction here has 3-cycle latency or worse. Likely the return to the caller will be executed ~9 cycles PRIOR to the completion of the dependency chain on an AMD64. Things look even more futile on a Core 2.
In short, CPUs have been designed to handle precisely this sort of situation as well as can be expected. Out-of-order execution and register renaming exist precisely for finding more work to do during dependency chains.
The CPU doesn't need a compiler's help in this situation, and compilers are now designed with that in mind.
It is unlikely that a compiler would voluntarily inline this function if recoded with intrinsics, because inlining it is likely to HURT performance.
In regards to unrolling:
This requires a staggered loop unroll to approach optimal (structured such that the CPU always has a division in progress), and compilers still don't do that. It's true that compilers are pretty good these days, but don't trust them blindly.
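To illustrate what I mean by a staggered unroll, here's a rough sketch in intrinsics (the helper names and the simplified normalize-an-array loop body are mine, not the OP's code): the next iteration's sqrtps/divps is issued before the previous iteration's result is consumed, so a division is always in progress.
#include <pmmintrin.h>   // SSE3: _mm_hadd_ps
#include <cstddef>

static inline __m128 lengthSq3(__m128 v, __m128 mask)
{
    __m128 t = _mm_mul_ps(_mm_mul_ps(v, v), mask);   // zero out the w component
    t = _mm_hadd_ps(t, t);
    return _mm_hadd_ps(t, t);                        // splatted x*x + y*y + z*z
}

void normalizeAll(__m128* vecs, size_t n, __m128 mask)
{
    if (n == 0) return;

    // Prologue: start the first division before entering the loop.
    __m128 len = _mm_sqrt_ps(lengthSq3(vecs[0], mask));
    __m128 cur = _mm_div_ps(vecs[0], len);            // division 0 in flight

    for (size_t i = 1; i < n; ++i)
    {
        // Issue iteration i's sqrt/div before storing iteration i-1's result.
        __m128 nextLen = _mm_sqrt_ps(lengthSq3(vecs[i], mask));
        __m128 next = _mm_div_ps(vecs[i], nextLen);
        vecs[i - 1] = cur;                            // consume the previous result
        cur = next;
    }
    vecs[n - 1] = cur;                                // epilogue
}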
--
He however has not asked how to make the code faster. He instead wondered why the second code was slower than the first, and it's because it's basically one long dependency chain.
Here's an example of what I mean. Say you're actually doing diffuse and specular lighting at the same time in your loop (seems quite likely). Given this code:
struct Vertex
{
    __m128 pad0;
    __m128 viewVec;
    __m128 pad1;
    __m128 normal;
};

__m128 makeAndMaskVectorW()
{
    __m128 mask;
    mask.m128_i32[0] = -1;
    mask.m128_i32[1] = -1;
    mask.m128_i32[2] = -1;
    mask.m128_i32[3] = 0;
    return mask;
}

const __m128 zeroVec = { 0.f, 0.f, 0.f, 0.f };
const __m128 maskVectorW = { 1.f, 1.f, 1.f, 0 };
const __m128 andMaskVectorW = makeAndMaskVectorW();

inline void GetBlinn(const __m128& vLight, const __m128* vertexVectors, float& cosh)
{
    __asm
    {
        mov edi, vertexVectors
        mov esi, vLight
        movaps xmm0, [edi+0x10]
        movaps xmm1, xmm0
        mulps xmm0, xmm0
        mulps xmm0, maskVectorW
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        rsqrtps xmm0, xmm0
        mulps xmm1, xmm0 //xmm1-normalized tVector
        movaps xmm0, [esi]
        subps xmm0, xmm1 //xmm0 - vl vector
        movaps xmm1, xmm0
        mulps xmm1, xmm1
        mulps xmm1, maskVectorW
        haddps xmm1, xmm1
        haddps xmm1, xmm1 //vl length
        sqrtps xmm1, xmm1
        divps xmm0, xmm1 //xmm0 - HalfWay vector
        movaps xmm1, [edi+0x30]
        mulps xmm1, maskVectorW
        mulps xmm0, xmm1
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        mov edi, cosh
        movss [edi], xmm0
    }
}

inline __m128 dot(const __m128& a, const __m128& b)
{
    __m128 temp = _mm_mul_ps(a, b);
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);
}

inline __m128 normalize(const __m128& v)
{
    return _mm_mul_ps(_mm_rsqrt_ps(dot(v, v)), v);
}

inline float GetBlinnIntrin(const __m128 l, const Vertex* __restrict vertexVectors)
{
    const __m128 viewVec = _mm_and_ps(vertexVectors->viewVec, andMaskVectorW);
    const __m128 v = normalize(viewVec);
    const __m128 h = normalize(_mm_sub_ps(l, v));
    const __m128 n = _mm_and_ps(vertexVectors->normal, andMaskVectorW);
    const __m128 nDotH = dot(n, h);
    float res;
    _mm_store_ss(&res, nDotH);
    return res;
}

inline float GetDiffuse(const __m128 l, const Vertex* __restrict vertexVectors)
{
    __m128 nDotL = dot(vertexVectors->normal, l);
    float res;
    _mm_store_ss(&res, nDotL);
    return res;
}
Where GetBlinn() is the first inline assembly version in the OP and GetBlinnIntrin() is my version using intrinsics (with a couple of extra optimizations), and these test functions which calculate both specular and diffuse:
__declspec(noinline) void testGetBlinn(const __m128& lightDir, const Vertex* __restrict verts, float* __restrict results, const size_t numVerts)
{
    for (size_t i = 0; i < numVerts; ++i)
    {
        GetBlinn(lightDir, &verts[i].pad0, results[i]);
        results[i] += GetDiffuse(lightDir, &verts[i]);
    }
}

__declspec(noinline) void testGetBlinnIntrin(const __m128& lightDir, const Vertex* __restrict verts, float* __restrict results, const size_t numVerts)
{
    for (size_t i = 0; i < numVerts; ++i)
    {
        results[i] = GetBlinnIntrin(lightDir, &verts[i]) + GetDiffuse(lightDir, &verts[i]);
    }
}
VC 2008 generates this for the inline assembly version:
__declspec(noinline) void testGetBlinn(const __m128& lightDir, const Vertex* __restrict verts, float* __restrict results, const size_t numVerts)
{
00401C80  push        ebp
00401C81  mov         ebp,esp
00401C83  and         esp,0FFFFFFF0h
00401C86  sub         esp,14h
00401C89  push        ebx
00401C8A  mov         ebx,dword ptr [ebp+8]
00401C8D  push        esi
00401C8E  push        edi
    for (size_t i = 0; i < numVerts; ++i)
00401C8F  mov         edx,400h
    {
        GetBlinn(lightDir, &verts[i].pad0, results[i]);
00401C94  mov         dword ptr [esp+18h],eax
00401C98  mov         dword ptr [esp+1Ch],ecx
00401C9C  mov         edi,dword ptr [esp+1Ch]
00401CA0  mov         esi,dword ptr [lightDir]
00401CA3  movaps      xmm0,xmmword ptr [edi+10h]
00401CA7  movaps      xmm1,xmm0
00401CAA  mulps       xmm0,xmm0
00401CAD  mulps       xmm0,xmmword ptr [___xi_z+54h (403160h)]
00401CB4  haddps      xmm0,xmm0
00401CB8  haddps      xmm0,xmm0
00401CBC  rsqrtps     xmm0,xmm0
00401CBF  mulps       xmm1,xmm0
00401CC2  movaps      xmm0,xmmword ptr [esi]
00401CC5  subps       xmm0,xmm1
00401CC8  movaps      xmm1,xmm0
00401CCB  mulps       xmm1,xmm1
00401CCE  mulps       xmm1,xmmword ptr [___xi_z+54h (403160h)]
00401CD5  haddps      xmm1,xmm1
00401CD9  haddps      xmm1,xmm1
00401CDD  sqrtps      xmm1,xmm1
00401CE0  divps       xmm0,xmm1
00401CE3  movaps      xmm1,xmmword ptr [edi+30h]
00401CE7  mulps       xmm1,xmmword ptr [___xi_z+54h (403160h)]
00401CEE  mulps       xmm0,xmm1
00401CF1  haddps      xmm0,xmm0
00401CF5  haddps      xmm0,xmm0
00401CF9  mov         edi,dword ptr [esp+18h]
00401CFD  movss       dword ptr [edi],xmm0
        results[i] += GetDiffuse(lightDir, &verts[i]);
00401D01  movaps      xmm0,xmmword ptr [ecx+30h]
00401D05  movaps      xmm1,xmmword ptr [ebx]
00401D08  mulps       xmm0,xmm1
00401D0B  movss       xmm1,dword ptr [eax]
00401D0F  haddps      xmm0,xmm0
00401D13  haddps      xmm0,xmm0
00401D17  addss       xmm1,xmm0
00401D1B  movss       dword ptr [eax],xmm1
00401D1F  add         ecx,40h
00401D22  add         eax,4
00401D25  sub         edx,1
00401D28  jne         testGetBlinn+14h (401C94h)
    }
}
00401D2E  pop         edi
00401D2F  pop         esi
00401D30  pop         ebx
00401D31  mov         esp,ebp
00401D33  pop         ebp
00401D34  ret
And this for the intrinsics version:
__declspec(noinline) void testGetBlinnIntrin(const __m128& lightDir, const Vertex* __restrict verts, float* __restrict results, const size_t numVerts)
{
00401D90  push        ebp
00401D91  mov         ebp,esp
00401D93  and         esp,0FFFFFFF0h
00401D96  movaps      xmm3,xmmword ptr [ecx]
    for (size_t i = 0; i < numVerts; ++i)
00401D99  mov         ecx,dword ptr [verts]
00401D9C  movaps      xmm4,xmmword ptr [andMaskVectorW (404390h)]
00401DA3  xor         eax,eax
00401DA5  add         ecx,30h
00401DA8  jmp         testGetBlinnIntrin+20h (401DB0h)
00401DAA  lea         ebx,[ebx]
    {
        results[i] = GetBlinnIntrin(lightDir, &verts[i]) + GetDiffuse(lightDir, &verts[i]);
00401DB0  movaps      xmm1,xmmword ptr [ecx-20h]
00401DB4  movaps      xmm2,xmmword ptr [ecx]
00401DB7  andps       xmm1,xmm4
00401DBA  movaps      xmm0,xmm1
00401DBD  mulps       xmm0,xmm1
00401DC0  haddps      xmm0,xmm0
00401DC4  haddps      xmm0,xmm0
00401DC8  rsqrtps     xmm0,xmm0
00401DCB  mulps       xmm0,xmm1
00401DCE  movaps      xmm1,xmm3
00401DD1  subps       xmm1,xmm0
00401DD4  movaps      xmm0,xmm1
00401DD7  mulps       xmm0,xmm1
00401DDA  haddps      xmm0,xmm0
00401DDE  haddps      xmm0,xmm0
00401DE2  rsqrtps     xmm0,xmm0
00401DE5  mulps       xmm0,xmm1
00401DE8  movaps      xmm1,xmm2
00401DEB  mulps       xmm2,xmm3
00401DEE  andps       xmm1,xmm4
00401DF1  mulps       xmm0,xmm1
00401DF4  haddps      xmm2,xmm2
00401DF8  haddps      xmm0,xmm0
00401DFC  haddps      xmm0,xmm0
00401E00  haddps      xmm2,xmm2
00401E04  addss       xmm2,xmm0
00401E08  movss       dword ptr [edx+eax*4],xmm2
00401E0D  inc         eax
00401E0E  add         ecx,40h
00401E11  cmp         eax,400h
00401E16  jb          testGetBlinnIntrin+20h (401DB0h)
    }
}
00401E18  mov         esp,ebp
00401E1A  pop         ebp
00401E1B  ret
Clearly the code generated for the version with intrinsics is better - the two calculations are interleaved, which should help hide some latency, and the compiler has also pulled constants like the light direction and the mask out of the loop and kept them in registers. Now admittedly a modern x86 processor may be able to get some of the benefits of the interleaving thanks to the out-of-order execution engine, but I'd rather feed the processor decently scheduled code to start with than trust entirely to its re-ordering abilities.
In a real-world situation you would likely have more work to do in the loop, and the more work you do, the more room you're giving the compiler to reschedule things to hide latency. It can only do that if you're using intrinsics, though - it won't reorder anything inside an inline assembly block.
[Edited by - mattnewport on November 23, 2007 5:55:12 AM]
I suggest comparing apples with apples. Maybe it's just me, but it seems like you are stroking yourself when you decided to remove the division and deliberately replace it with another low-precision estimate, and then came here to brag about how great the compiler-generated code is.
Given the function appears to be intended to calculate the specular term in a lighting equation, some slightly reduced precision seemed a reasonable trade-off. On the random test data I used the results never differed by more than 5e-4 from the version using the divide, which would normally be adequate precision for a lighting calculation.
If the OP has a need for the extra precision then the code could be easily changed to match the original calculations exactly. It's not relevant to the point I was demonstrating which is that using intrinsics means the compiler is more likely to be able to schedule the code to hide latency.
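If most of the precision needs to come back without paying for a full divide, the usual trick is a single Newton-Raphson refinement step on the _mm_rsqrt_ps estimate. A sketch (the helper name is mine, not from the code above):
#include <xmmintrin.h>

// One Newton-Raphson iteration on the rsqrtps estimate: y' = 0.5 * y * (3 - x*y*y).
// This roughly doubles the number of accurate bits (from ~12 to ~22) for a few
// cheap multiplies, which is still much less latency than divps.
static inline __m128 rsqrt_nr(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y = _mm_rsqrt_ps(x);
    return _mm_mul_ps(_mm_mul_ps(half, y),
                      _mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(y, y))));
}
Swapping rsqrt_nr in for _mm_rsqrt_ps inside the normalize() helper above would keep divps out of the dependency chain while bringing the results much closer to the divide-based version.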
If you want to have a useful discussion about the quality of compiler optimizations then I'm interested to have one. If you're just going to be a dick then don't bother posting.
[Edited by - mattnewport on November 23, 2007 4:57:07 PM]