Quote:Original post by apatriarca
I think the code of the loop isn't generated at all in Release mode when using SSE. That loop in fact does nothing. This is a common optimization in modern compilers. In the other case the compiler generates the loop because the function isn't inlined. You have to read the assembly generated by the compiler.
Ah! Good observation!
Indeed, when replacing the loops with something a bit less deterministic, e.g.:
for(DWORD i = 0; i < (1 << 10); i++) IVector v3 = v * (rand()%(i+1));
The results become:
72000 (SSE) vs 103000 (non-SSE) cycles in release mode. I'm still not impressed: the non-SSE version is double-precision, it's not inlined and it's not accelerated, yet it takes well under twice as many cycles. Bleh. Granted, though - now most of the time is spent inside rand().
EDIT:
Okay, now I'm confused: I removed the rand() from the loop by changing the test code to this:
__int64 p1, p2;
float rn_f[10];
double rn_d[10];

// Precompute the random factors so rand() stays out of the timed loops
for(int i = 0; i < 10; i++)
{
    rn_f[i] = rand();
    rn_d[i] = rand();
}

// SSE version
IVector v3;
DWORD t1 = timeGetTime();
p1 = __rdtsc();
for(DWORD i = 0; i < (1 << 20); i++)
    v3 = v * (rn_f[i%10]);
p2 = __rdtsc();
DWORD t2 = timeGetTime();
lout << "PROFILE1: " << (DWORD)(p2 - p1) << " cycles (" << (t2 - t1) << " ms)" << endl;

// non-SSE (double-precision) version
TVector3D rv3;
t1 = timeGetTime();
p1 = __rdtsc();
for(DWORD i = 0; i < (1 << 20); i++)
    rv3 = rv1 * (rn_d[i%10]);
p2 = __rdtsc();
t2 = timeGetTime();
lout << "PROFILE2: " << (DWORD)(p2 - p1) << " cycles (" << (t2 - t1) << " ms)" << endl;
Results in release mode with optimizations disabled (1 << 20 iterations):
SSE = ~120M cycles (47ms)
non-SSE = ~60M cycles (23ms)
Once more, SSE fails miserably. Is it because disabling optimizations also disables instruction set enhancements? Because the disassembly suggests it doesn't:
return IVector(_mm_mul_ps(_sse, *(__m128*)value));
000E1396 movaps xmm0,xmmword ptr [ebp-20h]
000E139A mov ecx,dword ptr [ebp-8]
000E139D movaps xmm1,xmmword ptr [ecx+10h]
000E13A1 mulps xmm1,xmm0
000E13A4 movaps xmmword ptr [ebp-30h],xmm1
000E13A8 movaps xmm0,xmmword ptr [ebp-30h]
000E13AC mov ecx,dword ptr [ebx+8]
000E13AF call IVector::IVector (0E1180h)
000E13B4 mov eax,dword ptr [ebx+8]
I'm using VS2008 on Windows 7.
[Edited by - irreversible on April 16, 2010 8:16:35 AM]