SSE/3DNow! vs. DX8

Started by
14 comments, last by malyskolacek 22 years, 5 months ago
Hi all! I have a simple question about performance. Which will be faster: using, for example, D3DXMatrixMultiply() from the DX8 SDK, or using a function that does the same thing, provided by AMD or Intel with the downloadable Processor Pack for MSVC? I don't know whether the DX functions use any SSE/3DNow! instructions. Please help. Thanks very much!
The DX8.1 D3DX tools are supposed to be processor-optimized, but the DX8.0 D3DX tools are not.


Edited by - meZmo on November 15, 2001 2:26:58 PM
I just finished an Intel SSE pipeline, so this issue is of some interest to me.

DX8.0's D3DXMatrixMultiply() was really slow FPU code.
DX8.1's D3DXMatrixMultiply() has been improved greatly, but it still uses the FPU.

I just looked at DX8.1's D3DXMatrixMultiply(), and it seems to have been optimized for a hot cache, which is a bit dumb because that's not how we use matrix multiplies in practice. It's fast enough for just playing around, though. When you've finished your game and have some extra time, have a look at SSE & 3DNow!.

SSE is always going to be faster than the FPU, because it's effectively four FPUs in one.
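To make that concrete, here's a minimal sketch of the general idea (not D3DX's code; the function names and data layout are my own) comparing a plain scalar loop with its 4-wide SSE equivalent using the <xmmintrin.h> intrinsics:

#include <xmmintrin.h>   // SSE intrinsics (Processor Pack era, or any modern compiler)

// Scalar: four separate FPU adds.
void addFour_fpu(const float* a, const float* b, float* out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

// SSE: one packed instruction adds all four floats at once.
// Pointers are assumed to be 16-byte aligned.
void addFour_sse(const float* a, const float* b, float* out)
{
    __m128 va = _mm_load_ps(a);             // load 4 packed floats
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(out, _mm_add_ps(va, vb));  // 4 adds in one operation
}

Whether that turns into a real 4x win depends on alignment, memory access and how well the surrounding code keeps the SIMD unit fed, which is exactly why the cold-cache case is the interesting one.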



Edited by - Abstract Thought on November 15, 2001 2:38:17 PM

Edited by - Abstract Thought on November 16, 2001 7:11:45 AM
Only by art can we get outside ourselves; instead of seeing only one world, our own, we see it under multiple forms.
Thanks for clearing that up for me, AT!
Abstract Thought:

How did you determine that D3DX in 8.1 doesn't use SSE/SSE2? Was it by static disassembly, or at runtime?

Take a look at the symbols in d3dx8.lib and you'll find some with interesting names:

x86_D3DXMatrixMultiply()
sse_D3DXMatrixMultiply()
sse2_D3DXMatrixMultiply()
x3d_D3DXMatrixMultiply()


Also look in the docs under "What's New in DirectX Graphics", specifically "Math Library".

One of the bullet points says: "Math library. Added CPU specific optimizations for most important functions for 3DNow, SSE, and SSE2."



AFAIK, the D3DX 8.1 maths functions are statically linked to the x86 (scalar FPU) version, so that's what you'll see with static disassembly.

The FIRST time you use a function from the maths library, D3DX detects the CPU features and overwrites the virtual function table of the relevant class to point at the CPU-specific version. Doing this avoids any overhead of repeated flag or CPUID checks. You should see this with dynamic disassembly.

(In fact, one of the early beta versions appeared not to do the detection properly and always picked SSE, which was fun for me testing on an AMD CPU ;-)
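As a rough illustration of that detect-once idea, here's a sketch of the general pattern (an illustration only, NOT D3DX's actual code; every name in it is invented and the "CPU specific" bodies are just stand-ins):

// Detect the CPU once on the first call, patch the function pointer,
// then dispatch directly on every later call. Illustration only.
#include <cstdio>

typedef void (*AddFn)(float* out, const float* a, const float* b, int n);

static void add_scalar(float* out, const float* a, const float* b, int n)
{
    for (int i = 0; i < n; ++i) out[i] = a[i] + b[i];   // plain FPU path
}

static void add_simd(float* out, const float* a, const float* b, int n)
{
    for (int i = 0; i < n; ++i) out[i] = a[i] + b[i];   // stand-in for an SSE/3DNow! path
}

static bool cpuHasSimd() { return true; }   // a real library would use CPUID here

static void add_detect(float* out, const float* a, const float* b, int n);

// The dispatch pointer starts off aimed at the detector...
static AddFn g_add = add_detect;

static void add_detect(float* out, const float* a, const float* b, int n)
{
    // ...which picks the best routine, patches the pointer so later calls
    // skip the feature check entirely, then forwards this first call.
    g_add = cpuHasSimd() ? add_simd : add_scalar;
    g_add(out, a, b, n);
}

int main()
{
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, out[4];
    g_add(out, a, b, 4);    // first call: detect, patch, forward
    g_add(out, out, b, 4);  // later calls: straight to the chosen routine
    std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}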


--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing | LinkedIn | Personal site

Just confirmed it in the debugger (name decoration stripped from labels):


D3DXMatrixMultiply( &out, &in1, &in2 );


Step into disassembly:

1) Client call to D3DXMatrixMultiply()
    ; push matrices etc onto the stack and call D3DX
    mov         eax,dword ptr [ebp-4]
    add         eax,5Ch
    push        eax
    mov         ecx,dword ptr [ebp-4]
    add         ecx,1Ch
    push        ecx
    lea         edx,[ebp-68h]
    push        edx
    call        _D3DXMatrixMultiply@12 (1000e259)



2) Use jump table to jump to CPU specific version
_D3DXMatrixMultiply:
    jmp         dword ptr [g_D3DXFastTable+0Ch (1009bc84)]



3) Jumps to the 3DNow!-specific version (this machine has an AMD CPU)
x3d_D3DXMatrixMultiply:
    femms
    sub         esp,44h
    mov         edx,dword ptr [esp+50h]
    mov         dword ptr [esp+40h],ebp
    mov         ebp,esp
    and         esp,0FFFFFFF8h
    ; SIMD register moves, not very FPU ;)
    movq        mm0,mmword ptr [edx]
    movq        mm1,mmword ptr [edx+10h]
    movq        mm3,mmword ptr [edx+28h]
    movq        mm4,mmword ptr [edx+38h]
    movq        mm2,mm0
    ; HMM looks like 3DNow! to me...
    punpckldq   mm0,mm1
    punpckhdq   mm2,mm1
    [snip]
    ; yep, it's 3DNow!...
    pfmul       mm5,mmword ptr [esp+28h]
    pfmul       mm6,mmword ptr [esp+30h]
    pfmul       mm7,mmword ptr [esp+38h]
    pfacc       mm0,mm2
    pfacc       mm1,mm3
    pfacc       mm4,mm6


So I was wrong about them modifying the VTABLE (I'm sure they used to), but they do use a jump table and do jump to CPU-specific code.
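For reference, the kind of CPU feature detection that would fill a table like g_D3DXFastTable looks roughly like this; my own sketch, not D3DX's code, assuming a compiler that provides the __cpuid intrinsic in <intrin.h> and using the documented CPUID feature bits:

// Detect SSE / SSE2 / 3DNow! support via CPUID. A library would run this
// once and use the result to fill a table of function pointers, so that
// each maths call is just an indirect jump like the one shown above.
#include <intrin.h>
#include <cstdio>

int main()
{
    int regs[4] = {0};

    __cpuid(regs, 1);                                  // standard feature flags
    const bool sse  = (regs[3] & (1 << 25)) != 0;      // EDX bit 25 = SSE
    const bool sse2 = (regs[3] & (1 << 26)) != 0;      // EDX bit 26 = SSE2

    bool amd3dnow = false;
    __cpuid(regs, 0x80000000);                         // highest extended function
    if ((unsigned)regs[0] >= 0x80000001u)
    {
        __cpuid(regs, 0x80000001);                     // AMD extended feature flags
        amd3dnow = ((unsigned)regs[3] & (1u << 31)) != 0;  // EDX bit 31 = 3DNow!
    }

    std::printf("SSE: %d  SSE2: %d  3DNow!: %d\n", sse, sse2, amd3dnow);
    return 0;
}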

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing | LinkedIn | Personal site

re: Beta: How did you determine that D3DX in 8.1 doesn't use SSE/SSE2? Was it by static disassembly, or at runtime?

rumtime, I mean... runtime

re: Beta: You should see this with dynamic disassembly.

I'll look into that...

Grrrrrr... I guess I wasted all those lonely late nights faithfully hacking away for nothing.

Cry. If only I could have waited for DirectX 8.1 to come along to give me the matrix mul of my dreams.

Sniffle, but alas, my quest wasn't in vain, for D3DXMatrixMultiply in dynamic disassembly still bites the big one.

ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha

Dude, it's only 1 cycle faster than their FPU version.

ha ha ha ha ha ha ha ha

Intel's server has one that's 20 cycles faster than that.

ha ha ha ha

and I...

he he

Sorry, I lost it, but that was really funny.

hmmm

OK, I'm better now...

Thank you, no really, I did learn something new from what you said, really, thank you.
Only by art can we get outside ourselves; instead of seeing only one world, our own, we see it under multiple forms.
AFAIK it was Intel who wrote the D3DX one, just as they wrote the D3D PSGP, and likewise for AMD. So it should be quicker (remember to take cache coherency, call latency and pipelining into account in those cycle timings).

It depends on how you are comparing the two... The correct way would be to use a profiler such as VTune and profile a release executable twice: once normally, and a second time with the "DisableD3DXPSGP" registry key set (Software\Microsoft\Direct3D). Any other profile would be pretty pointless and totally inaccurate.
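For anyone who wants to try that comparison, something like the sketch below should flip the key programmatically. Note that HKEY_LOCAL_MACHINE as the hive and a DWORD value of 1 are my assumptions about how the flag is read, so check the SDK docs (and writing to HKLM needs admin rights):

// Sets DisableD3DXPSGP so D3DX should fall back to the plain x86 path.
// The key path comes from the post above; the hive (HKLM) and the value
// type/data (DWORD = 1) are assumptions; verify against the SDK docs.
#include <windows.h>

int main()
{
    HKEY hKey = NULL;
    if (RegCreateKeyExA(HKEY_LOCAL_MACHINE, "Software\\Microsoft\\Direct3D",
                        0, NULL, 0, KEY_SET_VALUE, NULL, &hKey, NULL) == ERROR_SUCCESS)
    {
        DWORD disable = 1;   // assumed: non-zero disables the PSGP path
        RegSetValueExA(hKey, "DisableD3DXPSGP", 0, REG_DWORD,
                       (const BYTE*)&disable, sizeof(disable));
        RegCloseKey(hKey);
    }
    return 0;
}

Remember to remove or zero the value afterwards so the other profiling run goes back to the PSGP code.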

If their SIMD code is really only a cycle quicker, then that sucks quite a bit and someone at Microsoft wasted a lot of their time!
I can see how a single SSE/3DNow! matrix multiply could appear slow due to the overhead, though; a repeated call of something like D3DXVec3Transform() with the same matrix and differing vectors should show more of a difference.
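Something like the following sketch is the kind of repeated-call test I mean; the rotation matrix, vector count and timing wrapper are arbitrary choices, reusing the QueryPerformanceCounter approach from the program further down:

// Rough sketch of a batched transform test: one matrix, many vectors.
#pragma comment(lib, "d3dx8.lib")
#include <windows.h>
#include <stdio.h>
#include <d3dx8.h>

#define NUM_VECTORS 1000000

D3DXVECTOR3 g_in[NUM_VECTORS];
D3DXVECTOR4 g_out[NUM_VECTORS];

int main(void)
{
    LARGE_INTEGER freq, start, finish;
    int i;

    QueryPerformanceFrequency( &freq );

    D3DXMATRIX m;
    D3DXMatrixRotationY( &m, 0.5f );           // any non-identity matrix will do

    for (i = 0; i < NUM_VECTORS; ++i)
        g_in[i] = D3DXVECTOR3( (float)i, (float)(i & 255), (float)(i % 97) );

    QueryPerformanceCounter( &start );
    for (i = 0; i < NUM_VECTORS; ++i)          // same matrix, differing vectors
        D3DXVec3Transform( &g_out[i], &g_in[i], &m );
    QueryPerformanceCounter( &finish );

    float ms = (float)((finish.QuadPart - start.QuadPart) * 1000) / (float)freq.QuadPart;
    printf( "%d transforms: %f ms\n", NUM_VECTORS, ms );
    return 0;
}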


Intel and AMD are in competition, which means they have both been supplying Microsoft with hand-tuned code for D3D for the past few years. [Think about it: if D3D ran significantly faster on AMD CPUs because AMD engineers hand-optimised it, Intel wouldn't be happy, so they get their engineers to do the same, which results in very well optimised code... Or at least that's what Microsoft, Intel and AMD say, anyway!]

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing | LinkedIn | Personal site

Beta: AFAIK it was Intel who wrote the D3DX one, just the same as they wrote the D3D PSGP

That's what I thought too, hmmmm. It's not the same code that I saw on Intel's site. I know it pretty well because, at the time, it was 10 cycles faster than anything I could write.

Beta: remember to take cache coherency, call latency and pipelining into account on those cycle timings.

I know my way around cache coherency, latency and pipelining; now if I could only spell.

Beta: It depends how you are comparing the two... The correct way would be to use a profiler such as VTune and profile a release executable twice, once normally, a second time with the "DisableD3DXPSGP" registry key set (Software\Microsoft\Direct3D). Any other profile would be pretty pointless and totally inaccurate.

I know, but I don't have that kind of money; I'm just a student. I know for a fact my results are accurate, because the cycle counts match Intel's published instruction cycle counts.

Beta: I can see how a single SSE/3DNow! matrix multiply could appear slow though due to the overhead...

Who said I do only one?

Beta: If their SIMD code is really only a cycle quicker, then that sucks quite a bit and someone at Microsoft wasted a lot of their time!

Well, that's what happens when you call through a jump table. I timed that once; it's really slow. Everyone says don't use self-modifying code, though.

Beta: a repeated call of something like D3DXVec3Transform() with the same matrix and differing vectors should show more of a difference

I think we both know it's possible to do better than that.

Hey, you're an interesting guy. Thanks for the discussion; all the dudes in my school are potheads. Bring it on.

Edited by - Abstract Thought on November 16, 2001 7:49:09 AM
Only by art can we get outside ourselves; instead of seeing only one world, our own, we see it under multiple forms.
Abstract Thought:

If, as you assert, the SIMD version of D3DXMatrixMultiply is taking the same time as, or longer than, the scalar x86 FPU version of D3DXMatrixMultiply, then your timings are flawed in some way.

I just timed the differences between the SIMD PSGP version of D3DXMatrixMultiply and the scalar x86 version with the following program (try putting what's below into an MSVC console app and building it in Release mode; comment out the profiles not relevant to your CPU)...

#pragma comment(lib, "d3dx8.lib")

#include <windows.h>
#include <stdio.h>
#include <d3dx8.h>

#define NUM_MATRICES 1000000

D3DXMATRIX      g_source[NUM_MATRICES];
D3DXMATRIX      g_dest;
LARGE_INTEGER   g_liPerfFreq;

extern D3DXMATRIX* WINAPI x86_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );
extern D3DXMATRIX* WINAPI sse_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );
extern D3DXMATRIX* WINAPI sse2_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );
extern D3DXMATRIX* WINAPI x3d_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );

void initMatrices(void)
{
    // seed with same value so that all tests are with equal values
    srand(1974);

    D3DXMatrixIdentity( &g_dest );

    // make some non-identity matrices
    // this also ensures the PSGP has been invoked at least once
    D3DXVECTOR3 eye, at, up;
    for (int i=0; i<NUM_MATRICES; ++i)
    {
        eye.x = (float)rand();  eye.y = (float)rand();  eye.z = (float)rand();
        at.x = (float)rand();   at.y = (float)rand();   at.z = (float)rand();
        up.x = (float)rand();   up.y = (float)rand();   up.z = (float)rand();

        D3DXMatrixLookAtLH( &g_source[i], &eye, &at, &up );
    }
}

void profile( D3DXMATRIX* (WINAPI *pfnMethod)( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 ), char* comment )
{
    initMatrices();

    LARGE_INTEGER liStart, liFinish;
    QueryPerformanceCounter( &liStart );

    for (int i=0; i<NUM_MATRICES; ++i)
    {
        pfnMethod( &g_dest, &g_dest, &g_source[i] );
    }

    QueryPerformanceCounter( &liFinish );

    // calculate the time taken in milliseconds
    float time = (float)((liFinish.QuadPart - liStart.QuadPart) * 1000) / (float)g_liPerfFreq.QuadPart;

    char string[80];
    sprintf( string, "%s:\t%f\n", comment, time );
    OutputDebugString( string );
}

int main(void)
{
    QueryPerformanceFrequency( &g_liPerfFreq );

    profile( &x86_D3DXMatrixMultiply, "x86 FPU D3DXMatrixMultiply" );
    profile( &x3d_D3DXMatrixMultiply, "3DNow! D3DXMatrixMultiply" );
    profile( &D3DXMatrixMultiply, "Automatic D3DXMatrixMultiply" );
//  profile( &sse_D3DXMatrixMultiply, "SSE D3DXMatrixMultiply" );
//  profile( &sse2_D3DXMatrixMultiply, "SSE2 D3DXMatrixMultiply" );

    return 0;
}


The program makes a million D3DXMatrixMultiply calls and measures the amount of time this takes.

I'm at home at the moment so I've only been able to profile on a 1.3GHz AMD Thunderbird (thus the SSE and SSE2 profiles are commented out). I'll try it on my work Northwood P4 on Monday but I'm 99% certain I'll see similar results.

x86_D3DXMatrixMultiply, x3d_D3DXMatrixMultiply, sse_D3DXMatrixMultiply and sse2_D3DXMatrixMultiply are the functions that the jump at the start of D3DXMatrixMultiply jumps to. I call them directly to test the differences without needing to flip the "DisableD3DXPSGP" registry key. The code also profiles the D3DXMatrixMultiply function directly (including the jump); this is so we can see just how long that jump is taking ("Automatic" in the profile) and check that D3DX is really using the CPU-specific version.

Running the above in a loop 4 times gives me the following results:

Automatic D3DXMatrixMultiply:   186.367522
Automatic D3DXMatrixMultiply:   186.712818
Automatic D3DXMatrixMultiply:   187.064818
Automatic D3DXMatrixMultiply:   187.223219
x86 FPU D3DXMatrixMultiply:     263.830269
x86 FPU D3DXMatrixMultiply:     263.993698
x86 FPU D3DXMatrixMultiply:     264.038117
x86 FPU D3DXMatrixMultiply:     263.629964
3DNow! D3DXMatrixMultiply:      186.660018
3DNow! D3DXMatrixMultiply:      176.100840
3DNow! D3DXMatrixMultiply:      181.384200
3DNow! D3DXMatrixMultiply:      185.410416


Conclusions:
1. The 3DNow! routine called directly is significantly quicker than the x86 FPU routine.

2. The D3DXMatrixMultiply() routine is also significantly quicker than the x86 FPU routine.

3. The D3DXMatrixMultiply() routine takes a very similar amount of time to calling the 3DNow! version directly. The conclusion is that the D3DX library is automatically (and correctly) calling the 3DNow! PSGP code. The small difference between the two is likely a combination of the jmp instruction overhead, instruction cache overhead, and data cache overhead.

4. Since the profiles were done under Windows, varying thread quantums, context switch overhead etc will have some effect on the individual timings (that's why the times vary).


Bring it on? A phrase involving eggs and grandmothers springs to mind.


BTW: My handle on this board is S1CA, not Beta.

--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

Edited by - s1ca on November 17, 2001 7:28:59 PM

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing | LinkedIn | Personal site

This topic is closed to new replies.
