malyskolacek

SSE/3dNow! vs. DX8

Hi all! I have a simple question about performance. Which will be faster: using, for example, D3DXMatrixMultiply() from the DX8 SDK, or using an equivalent function provided by AMD or Intel with the downloadable Processor Pack for MSVC? I don't know whether the DX functions use any SSE/3DNow! instructions. Please help, and thanks very much.

DX8.1 D3DX tools are supposed to be processor optimized, but DX8 D3DX tools are not.


Edited by - meZmo on November 15, 2001 2:26:58 PM

I just finished an Intel SSE pipeline, so this issue is of some interest to me.

DX8.0 D3DXMatrixMultiply() was really slow FPU code
DX8.1 D3DXMatrixMultiply() has been improved greatly, but still uses the FPU.

I just looked at DX8.1 D3DXMatrixMultiply(). It seems to have been optimized for a hot cache, which is kind of dumb because we don't use matrix muls that way, but it's fast enough for just playing around. When you've finished your game and have some extra time, have a look at SSE & 3DNow!.

SSE is always going to be faster than the FPU, because it's really four FPUs in one.
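A rough illustration of the "four FPUs in one" point, written as a sketch rather than the D3DX implementation: each SSE instruction works on four floats at once, so one output row of a row-major 4x4 product can be built from four packed multiplies and three packed adds. The `Mat4` type and the names `mulScalar` and `mulSimd` are hypothetical, and the code falls back to the scalar path when SSE intrinsics are not available.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Hypothetical row-major 4x4 matrix (16 floats, same layout idea as D3DXMATRIX).
struct Mat4 { float m[4][4]; };

// Scalar reference: one multiply and one add per step.
void mulScalar(Mat4& out, const Mat4& a, const Mat4& b)
{
    Mat4 r;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k) s += a.m[i][k] * b.m[k][j];
            r.m[i][j] = s;
        }
    }
    out = r;  // temporary makes out == a or out == b safe
}

// SIMD version: row i of the result is a[i][0]*b.row0 + ... + a[i][3]*b.row3.
void mulSimd(Mat4& out, const Mat4& a, const Mat4& b)
{
#if HAVE_SSE
    // Unaligned loads, since Mat4 is not guaranteed to be 16-byte aligned.
    const __m128 b0 = _mm_loadu_ps(b.m[0]);
    const __m128 b1 = _mm_loadu_ps(b.m[1]);
    const __m128 b2 = _mm_loadu_ps(b.m[2]);
    const __m128 b3 = _mm_loadu_ps(b.m[3]);
    Mat4 r;
    for (int i = 0; i < 4; ++i) {
        __m128 row = _mm_mul_ps(_mm_set1_ps(a.m[i][0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][3]), b3));
        _mm_storeu_ps(r.m[i], row);
    }
    out = r;
#else
    mulScalar(out, a, b);  // no SSE on this target
#endif
}
```

Whether this beats a well-scheduled FPU routine in practice depends on alignment, cache state and call overhead, which is exactly what the rest of the thread argues about.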



Edited by - Abstract Thought on November 15, 2001 2:38:17 PM

Edited by - Abstract Thought on November 16, 2001 7:11:45 AM

Abstract Thought:

How did you determine that D3DX in 8.1 doesn't use SSE/SSE2?
Was it by static disassembly? Or at runtime?

Take a look at the symbols in d3dx8.lib and you'll find some with interesting names:

x86_D3DXMatrixMultiply()
sse_D3DXMatrixMultiply()
sse2_D3DXMatrixMultiply()
x3d_D3DXMatrixMultiply()


Also look in the docs under "What's New in DirectX Graphics", specifically "Math Library".

One of the bullet points says: "Math library. Added CPU specific optimizations for most important functions for 3DNow, SSE, and SSE2."



AFAIK, the D3DX 8.1 maths functions are statically linked to use the x86 version (scalar FPU), so this is what you'll see with static disassembly.

The FIRST time you use a function from the maths library, D3DX detects the CPU features and overwrites the virtual function table of the relevant class to point at the CPU specific version. Doing this avoids any overhead of repeated flag or CPUID checks. You should see this with dynamic disassembly.

(In fact one of the early beta versions appeared not to do the detection properly and always picked SSE, which was fun for me testing on an AMD CPU ;-)


--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

Just confirmed it in the debugger (name decoration stripped from labels):


D3DXMatrixMultiply( &out, &in1, &in2 );


Step into disassembly:

1) Client call to D3DXMatrixMultiply()
  
; push matrices etc onto the stack and call D3DX
mov eax,dword ptr [ebp-4]
add eax,5Ch
push eax
mov ecx,dword ptr [ebp-4]
add ecx,1Ch
push ecx
lea edx,[ebp-68h]
push edx
call _D3DXMatrixMultiply@12 (1000e259)



2) Use jump table to jump to CPU specific version
  
_D3DXMatrixMultiply:
jmp dword ptr [g_D3DXFastTable+0Ch (1009bc84)]



3) Jumps to 3DNow! specific version (this machine has AMD CPU)
  
x3d_D3DXMatrixMultiply:
femms
sub esp,44h
mov edx,dword ptr [esp+50h]
mov dword ptr [esp+40h],ebp
mov ebp,esp
and esp,0FFFFFFF8h
; SIMD register moves, not very FPU ;)
movq mm0,mmword ptr [edx]
movq mm1,mmword ptr [edx+10h]
movq mm3,mmword ptr [edx+28h]
movq mm4,mmword ptr [edx+38h]
movq mm2,mm0
; HMM looks like 3DNow! to me...
punpckldq mm0,mm1
punpckhdq mm2,mm1

[snip]

; yep, it's 3DNow!...
pfmul mm5,mmword ptr [esp+28h]
pfmul mm6,mmword ptr [esp+30h]
pfmul mm7,mmword ptr [esp+38h]
pfacc mm0,mm2
pfacc mm1,mm3
pfacc mm4,mm6


So I was wrong about them modifying the vtable (I'm sure they used to), but they do use a jump table and do jump to CPU specific code.
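The dispatch seen in the debugger can be approximated in portable C++ as a table of function pointers that is patched once after CPU detection. This is only a sketch under stated assumptions: `Mat4`, `g_fastTable`, `initFastTable`, `x86_multiply` and `simd_multiply` are hypothetical names, the "SIMD" routine is a stand-in that runs the same scalar maths, and the detection result is passed in as a flag instead of being read from CPUID.

```cpp
#include <cassert>

// Hypothetical row-major 4x4 matrix.
struct Mat4 { float m[4][4]; };

typedef Mat4* (*MatMulFn)(Mat4* out, const Mat4* a, const Mat4* b);

// Call counters, only here so the sketch can show which path ran.
int g_genericCalls = 0;
int g_simdCalls = 0;

// Shared scalar maths used by both stand-in implementations.
void mulCore(Mat4* out, const Mat4* a, const Mat4* b)
{
    Mat4 r;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k) s += a->m[i][k] * b->m[k][j];
            r.m[i][j] = s;
        }
    }
    *out = r;
}

// Generic fallback, playing the role of the x86 FPU version.
Mat4* x86_multiply(Mat4* out, const Mat4* a, const Mat4* b)
{
    ++g_genericCalls;
    mulCore(out, a, b);
    return out;
}

// Stand-in for a CPU specific version (3DNow!, SSE, ...).
Mat4* simd_multiply(Mat4* out, const Mat4* a, const Mat4* b)
{
    ++g_simdCalls;
    mulCore(out, a, b);
    return out;
}

// One slot per dispatched function; every slot starts on the generic version.
MatMulFn g_fastTable[1] = { x86_multiply };

// Hypothetical detection hook: a real library would query the CPU here.
void initFastTable(bool hasSimd)
{
    g_fastTable[0] = hasSimd ? simd_multiply : x86_multiply;
}

// Public entry point: a single indirect jump through the table,
// so no per-call feature check is needed.
Mat4* matrixMultiply(Mat4* out, const Mat4* a, const Mat4* b)
{
    return g_fastTable[0](out, a, b);
}
```

Calling `initFastTable(true)` once at startup reroutes every later `matrixMultiply()` call. The per-call cost is one indirect jump, which corresponds to the `jmp dword ptr [g_D3DXFastTable+0Ch]` in the listing above.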

re: Beta: How did you determine that D3DX in 8.1 doesn't use SSE/SSE2?... Was it by static disassembly? Or runtime?

rumtime, I mean... runtime

re: Beta: You should see this with dynamic disassembly.

I'll look into that...

Grrrrrr... I guess I wasted all those lonely late nights faithfully hacking away for nothing.

Cry. If I only could have waited for DirectX 8.1 to come along, to give me the matrix mul of my dreams.

Sniffle, but alas, my quest wasn't in vain, for D3DXMatrixMultiply, in dynamic disassembly, still bites the big one.

ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha

Dude, it's only 1 cycle faster than their FPU version.

ha ha ha ha ha ha ha ha

Intel's server has one that's 20 cycles faster than that.

ha ha ha ha

and I...

he he

Sorry, I lost it, but that was really funny.

hmmm

Ok, I'm better now...

Thank you, no really, I did learn something new from what you said, really, thank you.

AFAIK it was Intel who wrote the D3DX one, just the same as they wrote the D3D PSGP. And the same from AMD. So it should be quicker (remember to take cache coherency, call latency and pipelining into account on those cycle timings).

It depends how you are comparing the two... The correct way would be to use a profiler such as VTune and profile a release executable twice, once normally, a second time with the "DisableD3DXPSGP" registry key set (Software\Microsoft\Direct3D). Any other profile would be pretty pointless and totally inaccurate.

If their SIMD code is really only a cycle quicker, then that sucks quite a bit and someone at Microsoft wasted a lot of their time!
I can see how a single SSE/3DNow! matrix multiply could appear slow, though, due to the overhead; a repeated call of something like D3DXVec3Transform() with the same matrix and differing vectors should show more of a difference.


Intel and AMD are in competition, which means they have both been supplying Microsoft with hand tuned code for D3D for the past few years. [Think about it: if D3D ran significantly faster with AMD CPUs because AMD engineers hand optimised it, Intel wouldn't be happy, so they get their engineers to do the same, which results in very well optimised code... Or at least that's what Microsoft, Intel and AMD say anyway!]

Beta: AFAIK it was Intel who wrote the D3DX one, just the same as they wrote the D3D PSGP

That's what I thought too, hmmmm. It's not the same code that I saw on Intel's site. I know it pretty well because, at the time, it was 10 cycles faster than anything I could write.

Beta: remember to take cache coherency, call latency and pipelining into account on those cycle timings.

I know my way around cache coherency, latency and pipelining; now if I could only spell.

Beta: It depends how you are comparing the two... The correct way would be to use a profiler such as VTune and profile a release executable twice, once normally, a second time with the "DisableD3DXPSGP" registry key set (Software\Microsoft\Direct3D). Any other profile would be pretty pointless and totally inaccurate.

I know, but I don't have that kind of money; I'm just a student. I know for a fact my results are accurate, because the cycle counts match Intel's published instruction cycle counts.

Beta: I can see how a single SSE/3DNow! matrix multiply could appear slow though due to the overhead.

Who said I do only one?

Beta: If their SIMD code is really only a cycle quicker, then that sucks quite a bit and someone at Microsoft wasted a lot of their time!

Well, that's what happens when you call through a jump table. I timed that one time, and it's really slow. Everyone says don't do self modifying code, though.

Beta: a repeated call of something like D3DXVec3Transform() with the same matrix and differing vectors should show more of a difference

I think we both know it's possible to do better than that.

Hey, you're an interesting guy, thanks for the discussion. All the dudes in my school are pot heads, bring it on.

Edited by - Abstract Thought on November 16, 2001 7:49:09 AM

Abstract Thought:

If, as you assert, the SIMD version of D3DXMatrixMultiply is taking the same time or longer than the scalar x86 FPU version of D3DXMatrixMultiply, then your timings are flawed in some way.

I just timed the difference between the SIMD PSGP version of D3DXMatrixMultiply and the scalar x86 version with the following program (try putting what's below into an MSVC console app and building it in Release mode; comment out the profiles not relevant to your CPU)...

    
#pragma comment(lib, "d3dx8.lib")

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>   // srand, rand
#include <d3dx8.h>

#define NUM_MATRICES 1000000


D3DXMATRIX g_source[NUM_MATRICES];
D3DXMATRIX g_dest;
LARGE_INTEGER g_liPerfFreq;


// CPU specific entry points, named after the symbols found in d3dx8.lib and declared here by hand
extern D3DXMATRIX* WINAPI x86_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );
extern D3DXMATRIX* WINAPI sse_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );
extern D3DXMATRIX* WINAPI sse2_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );
extern D3DXMATRIX* WINAPI x3d_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );


void initMatrices(void)
{
    // seed with the same value so that all tests run on equal values
    srand(1974);

    D3DXMatrixIdentity( &g_dest );

    // make some non-identity matrices
    // this also ensures the PSGP has been invoked at least once
    D3DXVECTOR3 eye, at, up;
    for (int i=0; i<NUM_MATRICES; ++i)
    {
        eye.x = (float)rand(); eye.y = (float)rand(); eye.z = (float)rand();
        at.x = (float)rand(); at.y = (float)rand(); at.z = (float)rand();
        up.x = (float)rand(); up.y = (float)rand(); up.z = (float)rand();
        D3DXMatrixLookAtLH( &g_source[i], &eye, &at, &up );
    }
}


// comment is const because string literals are passed in
void profile( D3DXMATRIX* (WINAPI *pfnMethod)( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 ), const char* comment )
{
    initMatrices();

    LARGE_INTEGER liStart, liFinish;

    QueryPerformanceCounter( &liStart );

    for (int i=0; i<NUM_MATRICES; ++i)
    {
        pfnMethod( &g_dest, &g_dest, &g_source[i] );
    }

    QueryPerformanceCounter( &liFinish );

    // calculate the time taken in milliseconds
    float time = (float)((liFinish.QuadPart - liStart.QuadPart) * 1000) / (float)g_liPerfFreq.QuadPart;

    // results go to the debugger output window, not the console
    char string[80];
    sprintf( string, "%s:\t%f\n", comment, time );
    OutputDebugStringA( string );
}


int main(void)
{
    QueryPerformanceFrequency( &g_liPerfFreq );

    profile( &x86_D3DXMatrixMultiply, "x86 FPU D3DXMatrixMultiply" );
    profile( &x3d_D3DXMatrixMultiply, "3DNow! D3DXMatrixMultiply" );
    profile( &D3DXMatrixMultiply, "Automatic D3DXMatrixMultiply" );
    // profile( &sse_D3DXMatrixMultiply, "SSE D3DXMatrixMultiply" );
    // profile( &sse2_D3DXMatrixMultiply, "SSE2 D3DXMatrixMultiply" );

    return 0;
}


The program makes a million D3DXMatrixMultiply calls and measures the amount of time this takes.

I'm at home at the moment so I've only been able to profile on a 1.3GHz AMD Thunderbird (thus the SSE and SSE2 profiles are commented out). I'll try it on my work Northwood P4 on Monday but I'm 99% certain I'll see similar results.

The x86_D3DXMatrixMultiply, x3d_D3DXMatrixMultiply, sse_D3DXMatrixMultiply and sse2_D3DXMatrixMultiply functions are the ones that the jump at the start of D3DXMatrixMultiply jumps to. I call them directly to test the difference without needing to flip the "DisableD3DXPSGP" registry key. The code also profiles the D3DXMatrixMultiply function itself (including the jump), so we can see just how long that jump is taking ("Automatic" in the profile) and check that D3DX is really using the CPU specific version.

Running the above in a loop 4 times gives me the following results (milliseconds per million calls):


Automatic D3DXMatrixMultiply: 186.367522
Automatic D3DXMatrixMultiply: 186.712818
Automatic D3DXMatrixMultiply: 187.064818
Automatic D3DXMatrixMultiply: 187.223219

x86 FPU D3DXMatrixMultiply: 263.830269
x86 FPU D3DXMatrixMultiply: 263.993698
x86 FPU D3DXMatrixMultiply: 264.038117
x86 FPU D3DXMatrixMultiply: 263.629964

3DNow! D3DXMatrixMultiply: 186.660018
3DNow! D3DXMatrixMultiply: 176.100840
3DNow! D3DXMatrixMultiply: 181.384200
3DNow! D3DXMatrixMultiply: 185.410416


Conclusions:
1. The 3DNow! routine called directly is significantly quicker than the x86 FPU routine.

2. The D3DXMatrixMultiply() routine is also significantly quicker than the x86 FPU routine.

3. The D3DXMatrixMultiply() routine takes a very similar amount of time to calling the 3DNow! version directly. The conclusion is that the D3DX library is automatically (and correctly) calling the 3DNow! PSGP code. The difference in time between the two is likely to be a combination of the jmp instruction overhead, instruction cache overhead, and data cache overhead.

4. Since the profiles were done under Windows, varying thread quanta, context switch overhead etc. will have some effect on the individual timings (that's why the times vary).
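Because of the run-to-run noise described in point 4, one common approach is to repeat each measurement and compare the fastest or median run instead of a single sample. The following is a portable sketch of that idea using `std::chrono` rather than `QueryPerformanceCounter`. The names `timeRuns`, `fastest`, `median` and `sampleWork` are hypothetical, and `sampleWork` is just a stand-in workload, not the matrix benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Runs work() `runs` times and returns each run's wall time in milliseconds.
std::vector<double> timeRuns(void (*work)(), int runs)
{
    std::vector<double> ms;
    for (int r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}

// Fastest run: the sample least disturbed by other threads.
// Assumes at least one sample.
double fastest(const std::vector<double>& ms)
{
    return *std::min_element(ms.begin(), ms.end());
}

// Median run: a typical sample that ignores outliers.
// Takes a copy so the caller's order is preserved; assumes at least one sample.
double median(std::vector<double> ms)
{
    std::sort(ms.begin(), ms.end());
    return ms[ms.size() / 2];
}

// Hypothetical workload standing in for the million matrix multiplies.
volatile float g_sink = 0.0f;
void sampleWork()
{
    float acc = 0.0f;
    for (int i = 0; i < 100000; ++i) acc += i * 0.5f;
    g_sink = acc;  // volatile store keeps the loop from being optimised away
}
```

A call such as `timeRuns(sampleWork, 4)` mirrors the four runs per routine listed above; comparing `fastest()` values between routines reduces the effect of context switches on the comparison.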


Bring it on? A phrase involving eggs and grandmothers springs to mind.


BTW: My handle on this board is S1CA not Beta.

--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

Edited by - s1ca on November 17, 2001 7:28:59 PM
