# SSE/3dNow! vs. DX8

Hi all! I have a simple question about performance. Which will be faster: D3DXMatrixMultiply() from the DX8 SDK, or the equivalent function provided by AMD or Intel for use with the downloadable Processor Pack for MSVC? I don't know whether the DX functions use any SSE/3DNow! instructions. Please help, and thanks very much.

##### Share on other sites
DX8.1 D3DX tools are supposed to be processor optimized, but DX8 D3DX tools are not.

Edited by - meZmo on November 15, 2001 2:26:58 PM

##### Share on other sites
I just finished an Intel SSE pipeline, so this issue is of some interest to me.

DX8.0 D3DXMatrixMultiply() was really slow FPU code.
DX8.1 D3DXMatrixMultiply() has been improved greatly, but still uses the FPU.

I just looked at DX8.1 D3DXMatrixMultiply(). It seems to have been optimized for a hot cache, which is kind of dumb because we don't use matrix muls that way, but it's fast enough for just playing around. When you've finished your game and have some extra time, have a look at SSE & 3DNow!.

SSE is generally going to be faster than the FPU for this kind of work, because it's really 4 FPUs in one.

Edited by - Abstract Thought on November 15, 2001 2:38:17 PM

Edited by - Abstract Thought on November 16, 2001 7:11:45 AM

##### Share on other sites
Thanks for clearing that up for me, AT!

##### Share on other sites
Abstract Thought:

How did you determine that D3DX in 8.1 doesn't use SSE/SSE2?
Was it by static disassembly, or at runtime?

Take a look at the symbols in d3dx8.lib and you'll find some with interesting names:

x86_D3DXMatrixMultiply()
sse_D3DXMatrixMultiply()
sse2_D3DXMatrixMultiply()
x3d_D3DXMatrixMultiply()

Also look in the docs under "What's New in DirectX Graphics", specifically "Math Library".

One of the bullet points says: "Math library. Added CPU specific optimizations for most important functions for 3DNow, SSE, and SSE2."

AFAIK, the D3DX 8.1 maths functions are statically linked to use the x86 version (scalar FPU), so that is what you'll see with static disassembly.

The FIRST time you use a function from the maths library, D3DX detects the CPU features and overwrites the virtual function table of the relevant class to point at the CPU specific version. Doing this avoids any overhead of repeated flag or CPUID checks. You should see this with dynamic disassembly.

(In fact one of the early beta versions appeared not to do the detection properly and always picked SSE, which was fun for me testing on an AMD CPU ;-)

--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

##### Share on other sites
Just confirmed it in the debugger (name decoration stripped from labels):

    D3DXMatrixMultiply( &out, &in1, &in2 );

Step into disassembly:

1) Client call to D3DXMatrixMultiply()

    ; push matrices etc onto the stack and call D3DX
    mov eax,dword ptr [ebp-4]
    add eax,5Ch
    push eax
    mov ecx,dword ptr [ebp-4]
    add ecx,1Ch
    push ecx
    lea edx,[ebp-68h]
    push edx
    call _D3DXMatrixMultiply@12 (1000e259)

2) D3DXMatrixMultiply() itself is just a jump through a table

    _D3DXMatrixMultiply:
    jmp dword ptr [g_D3DXFastTable+0Ch (1009bc84)]

3) Jumps to 3DNow! specific version (this machine has an AMD CPU)

    x3d_D3DXMatrixMultiply:
    femms
    sub esp,44h
    mov edx,dword ptr [esp+50h]
    mov dword ptr [esp+40h],ebp
    mov ebp,esp
    and esp,0FFFFFFF8h
    ; SIMD register moves, not very FPU ;)
    movq mm0,mmword ptr [edx]
    movq mm1,mmword ptr [edx+10h]
    movq mm3,mmword ptr [edx+28h]
    movq mm4,mmword ptr [edx+38h]
    movq mm2,mm0
    ; HMM looks like 3DNow! to me...
    punpckldq mm0,mm1
    punpckhdq mm2,mm1
    [snip]
    ; yep, it's 3DNow!...
    pfmul mm5,mmword ptr [esp+28h]
    pfmul mm6,mmword ptr [esp+30h]
    pfmul mm7,mmword ptr [esp+38h]
    pfacc mm0,mm2
    pfacc mm1,mm3
    pfacc mm4,mm6

So I was wrong about them modifying the VTABLE (I'm sure they used to), but they do use a jump table and do jump to CPU specific code.
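For anyone who wants to see the jump-table idea outside a debugger, here is a minimal C++ sketch of it. It is not D3DX's code: `Mat4`, `g_fastTable`, `cpu_has_simd()` and the other names are made up for illustration, the "SIMD" routine is only a placeholder that forwards to the scalar one, and patching the table on the first call is an assumption based on the description earlier in the thread.

```cpp
#include <cassert>

// Hypothetical 4x4 row-major matrix; stands in for D3DXMATRIX.
struct Mat4 { float m[4][4]; };

using MulFn = Mat4* (*)(Mat4* out, const Mat4* a, const Mat4* b);

// Plain scalar multiply; plays the role of the x86 FPU fallback.
// Works through a temporary so that out may alias a or b.
Mat4* mul_scalar(Mat4* out, const Mat4* a, const Mat4* b) {
    Mat4 t;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k) s += a->m[i][k] * b->m[k][j];
            t.m[i][j] = s;
        }
    *out = t;
    return out;
}

// Placeholder for a CPU specific routine (SSE, 3DNow!, ...).
// Here it simply forwards to the scalar code.
Mat4* mul_simd_placeholder(Mat4* out, const Mat4* a, const Mat4* b) {
    return mul_scalar(out, a, b);
}

// Hypothetical feature probe; real code would query CPUID.
bool cpu_has_simd() { return false; }

Mat4* mul_first_call(Mat4* out, const Mat4* a, const Mat4* b);

// One-entry "fast table". It starts out pointing at the detecting stub.
MulFn g_fastTable[1] = { mul_first_call };

// First call: pick an implementation once, patch the table, then forward.
Mat4* mul_first_call(Mat4* out, const Mat4* a, const Mat4* b) {
    g_fastTable[0] = cpu_has_simd() ? mul_simd_placeholder : mul_scalar;
    return g_fastTable[0](out, a, b);
}

// Public entry point: a single indirect jump, no per-call feature check.
inline Mat4* MatrixMultiply(Mat4* out, const Mat4* a, const Mat4* b) {
    assert(g_fastTable[0] != nullptr);
    return g_fastTable[0](out, a, b);
}
```

After the first call, every later call to `MatrixMultiply()` costs one indirect jump on top of the routine itself, which is the overhead being argued about in this thread.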

##### Share on other sites
re: Beta: How did you determine that D3DX in 8.1 doesn't use SSE/SSE2 ?... Was it by static disassembly? Or runtime?

rumtime, I mean... runtime

re: Beta: You should see this with dynamic disassembly.

I'll look into that...

Grrrrrr... I guess I wasted all those lonely late nights faithfully hacking away for nothing.

Cry. If I only could have waited for DirectX 8.1 to come along, to give me the matrix mul of my dreams.

Sniffle, but alas, my quest wasn't in vain, for D3DXMatrixMultiply, in dynamic disassembly, still bites the big one.

ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha

Dude, it's only 1 cycle faster than their FPU version.

ha ha ha ha ha ha ha ha

Intel's server has one that's 20 cycles faster than that.

ha ha ha ha

and I...

he he

Sorry, I lost it, but that was really funny.

hmmm

Ok, I'm better now...

Thank you, no really, I did learn something new from what you said, really, thank you.

##### Share on other sites
AFAIK it was Intel who wrote the D3DX one, just as they wrote the D3D PSGP, and the same goes for AMD. So it should be quicker (remember to take cache coherency, call latency and pipelining into account on those cycle timings).

It depends how you are comparing the two... The correct way would be to use a profiler such as VTune and profile a release executable twice: once normally, and a second time with the "DisableD3DXPSGP" registry key set (Software\Microsoft\Direct3D). Any other profile would be pretty pointless and totally inaccurate.

If their SIMD code is really only a cycle quicker, then that sucks quite a bit and someone at Microsoft wasted a lot of their time!
I can see how a single SSE/3DNow! matrix multiply could appear slow due to the overhead, though. A repeated call of something like D3DXVec3Transform() with the same matrix and differing vectors should show more of a difference.

Intel and AMD are in competition, which means they have both been supplying Microsoft with hand tuned code for D3D for the past few years. [Think about it: if D3D ran significantly faster on AMD CPUs because AMD engineers hand optimised it, Intel wouldn't be happy, so they get their engineers to do the same, which results in very well optimised code... Or at least that's what Microsoft, Intel and AMD say anyway!]

##### Share on other sites
Beta: AFAIK it was Intel who wrote the D3DX one, just the same as they wrote the D3D PSGP

That's what I thought too, hmmmm. It's not the same code that I saw on Intel's site. I know that code pretty well because, at the time, it was 10 cycles faster than anything I could write.

Beta: remember to take cache coherency, call latency and pipelining into account on those cycle timings.

I know my way around cache coherency, latency and pipelining. Now if I could only spell.

Beta: It depends how you are comparing the two... The correct way would be to use a profiler such as VTune and profile a release executable twice, once normally, a second time with the "DisableD3DXPSGP" registry key set (Software\Microsoft\Direct3D). Any other profile would be pretty pointless and totally inaccurate.

I know, but I don't have that kind of money, I'm just a student. I know for a fact my results are accurate, because the cycle counts match Intel's published instruction cycle counts.

Beta: I can see how a single SSE/3DNow! matrix multiply could appear slow though due to the overhead.

Who said I do only one?

Beta: If their SIMD code is really only a cycle quicker, then that sucks quite a bit and someone at Microsoft wasted a lot of their time!

Well, that's what happens when you call through a jump table. I timed that once and it's really slow. Everyone says not to write self-modifying code, though.

Beta: a repeated call of something like D3DXVec3Transform() with the same matrix and differing vectors should show more of a difference

I think we both know it's possible to do better than that.

Hey, you're an interesting guy, thanks for the discussion. All the dudes in my school are pot heads. Bring it on.

Edited by - Abstract Thought on November 16, 2001 7:49:09 AM

##### Share on other sites
Abstract Thought:

If, as you assert, the SIMD version of D3DXMatrixMultiply is taking the same time or longer than the scalar x86 FPU version of D3DXMatrixMultiply, then your timings are flawed in some way.

I just timed the difference between the SIMD PSGP version of D3DXMatrixMultiply and the scalar x86 version with the following program (try putting what's below into an MSVC console app and building it in Release mode; comment out the profiles not relevant to your CPU)...

    #pragma comment(lib, "d3dx8.lib")

    // The header names were stripped from the archived post; these are an
    // assumption based on what the surviving code uses.
    #include <windows.h>
    #include <stdlib.h>
    #include <d3dx8.h>

    #define NUM_MATRICES 1000000

    D3DXMATRIX g_source[NUM_MATRICES];
    D3DXMATRIX g_dest;
    LARGE_INTEGER g_liPerfFreq;

    extern D3DXMATRIX* WINAPI x86_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );
    extern D3DXMATRIX* WINAPI sse_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );
    extern D3DXMATRIX* WINAPI sse2_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );
    extern D3DXMATRIX* WINAPI x3d_D3DXMatrixMultiply( D3DXMATRIX *pOut, CONST D3DXMATRIX *pM1, CONST D3DXMATRIX *pM2 );

    void initMatrices(void)
    {
        // seed with same value so that all tests are with equal values
        srand(1974);

        D3DXMatrixIdentity( &g_dest );

        // make some non-identity matrices
        // this also ensures the PSGP has been invoked at least once
        D3DXVECTOR3 eye, at, up;
        for (int i=0; i
        // [the rest of the listing is missing from the archived post]

The program makes a million D3DXMatrixMultiply calls and measures the amount of time this takes.

I'm at home at the moment so I've only been able to profile on a 1.3GHz AMD Thunderbird (thus the SSE and SSE2 profiles are commented out). I'll try it on my work Northwood P4 on Monday, but I'm 99% certain I'll see similar results.

The x86_D3DXMatrixMultiply, x3d_D3DXMatrixMultiply, sse_D3DXMatrixMultiply and sse2_D3DXMatrixMultiply functions are the ones that the jump at the start of D3DXMatrixMultiply goes to. I call them directly to test the difference without needing to flip the "DisableD3DXPSGP" registry key. The code also profiles the D3DXMatrixMultiply function itself (including the jump), so we can see just how long that jump is taking ("Automatic" in the profile) and check that D3DX is really using the CPU specific version.

Running the above in a loop 4 times gives me the following results:

    Automatic D3DXMatrixMultiply:   186.367522
    Automatic D3DXMatrixMultiply:   186.712818
    Automatic D3DXMatrixMultiply:   187.064818
    Automatic D3DXMatrixMultiply:   187.223219
    x86 FPU D3DXMatrixMultiply:     263.830269
    x86 FPU D3DXMatrixMultiply:     263.993698
    x86 FPU D3DXMatrixMultiply:     264.038117
    x86 FPU D3DXMatrixMultiply:     263.629964
    3DNow! D3DXMatrixMultiply:      186.660018
    3DNow! D3DXMatrixMultiply:      176.100840
    3DNow! D3DXMatrixMultiply:      181.384200
    3DNow! D3DXMatrixMultiply:      185.410416

Conclusions:
1. The 3DNow! routine called directly is significantly quicker than the x86 FPU routine.

2. The D3DXMatrixMultiply() routine is also significantly quicker than the x86 FPU routine.

3. The D3DXMatrixMultiply() routine takes a very similar amount of time to calling the 3DNow! version directly, so the D3DX library is automatically (and correctly) calling the 3DNow! PSGP code. The difference in time between the two is likely to be a combination of the jmp instruction overhead, instruction cache overhead, and data cache overhead.

4. Since the profiles were done under Windows, varying thread quantums, context switch overhead etc. will have some effect on the individual timings (that's why the times vary).
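Since the listing above did not survive intact, here is a rough, portable sketch of the same kind of harness. It is not the original program: it uses `std::chrono::steady_clock` in place of QueryPerformanceCounter, a plain scalar multiply in place of the D3DX variants, and `Mat4`, `make_matrices()` and `time_pass_ms()` are hypothetical names. The fixed seed (1974) is taken from the post.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <vector>

// Hypothetical 4x4 row-major matrix; stands in for D3DXMATRIX.
struct Mat4 { float m[4][4]; };

using MulFn = Mat4* (*)(Mat4* out, const Mat4* a, const Mat4* b);

// Scalar reference multiply, standing in for the routine under test.
Mat4* mul_scalar(Mat4* out, const Mat4* a, const Mat4* b) {
    Mat4 t;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k) s += a->m[i][k] * b->m[k][j];
            t.m[i][j] = s;
        }
    *out = t;
    return out;
}

// Fill n matrices with repeatable pseudo-random values in [0, 1].
// The seed is fixed so that every run multiplies the same data.
std::vector<Mat4> make_matrices(std::size_t n) {
    std::srand(1974);
    std::vector<Mat4> v(n);
    for (Mat4& mat : v)
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                mat.m[i][j] = static_cast<float>(std::rand()) / RAND_MAX;
    return v;
}

// Time one pass of fn over every source matrix; returns milliseconds.
// The call goes through a function pointer, much like the jump table.
double time_pass_ms(MulFn fn, const std::vector<Mat4>& src, Mat4* dest) {
    assert(fn != nullptr && dest != nullptr && !src.empty());
    const auto t0 = std::chrono::steady_clock::now();
    for (const Mat4& mat : src) fn(dest, &src.front(), &mat);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling `time_pass_ms()` several times per routine and comparing the spread, as in the results above, gives a feel for how much of the variation is scheduler noise rather than the routine itself.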

"Bring it on"? A phrase involving eggs and grandmothers springs to mind.

BTW: My handle on this board is S1CA not Beta.

--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

Edited by - s1ca on November 17, 2001 7:28:59 PM

##### Share on other sites
For S1CA

quote:

If, as you assert, the SIMD version of D3DXMatrixMultiply is taking the same time or longer than the scalar x86 FPU version of D3DXMatrixMultiply, then your timings are flawed in some way.

Yes, you are correct. It took a while, but I understand now what I was doing wrong: I wasn't compiling with the DX 8.1 libs. Dumb.

quote:

I just timed differences between the SIMD PSGP version of D3DXMatrixMultiply and the scalar x86 version with the following program (try putting what's below into an MSVC console app and build it in Release mode, comment out profiles not relevant to your CPU)...

I tried your timing code, and D3DXMatrixMultiply and sse_D3DXMatrixMultiply ran at the same speed. I also traced into the asm and saw that they were both executing the same SSE code; there wasn't any FPU code in sight.

I think there are some problems with your testing method. This is how I test code.

Here is that 90 cycle code from Intel that I was talking about; sse_D3DXMatrixMultiply is 105 cycles.

Testing code in a 1,000,000 iteration loop also has some problems, read this. The effect can really be seen if you reduce that count to 10000.

Here is a matrix mul I wrote when I was learning SSE about a month ago. It takes 111 cycles, not as fast as 90, but it doesn't pollute the cache as much either. If you add this code to your timing code, run it on a P3, and reduce the count to 10000, it should be faster than the sse_D3DXMatrixMultiply version.

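    // The matrices must be 16-byte aligned for movaps,
    // e.g. declared with __declspec(align(16)).
    __declspec(naked) D3DXMATRIX* WINAPI v1_MatrixMul(D3DXMATRIX *m3, CONST D3DXMATRIX *m1, CONST D3DXMATRIX *m2)
    {
        __asm
        {
            // stdcall with no prologue: [esp+4] = m3 (out), [esp+8] = m1, [esp+12] = m2.
            // The offsets as posted did not match this parameter order.
            mov     ecx, dword ptr [esp+12]   // m2
            movaps  xmm4, [ecx]               // rows 0..3 of m2 stay in xmm4..xmm7
            movaps  xmm5, [ecx+16]
            mov     eax, dword ptr [esp+8]    // m1
            mov     edx, dword ptr [esp+4]    // m3
            movaps  xmm6, [ecx+32]
            movaps  xmm7, [ecx+48]
            mov     ecx, 4                    // one iteration per output row
        l1:
            dec     ecx                       // sets ZF for the jnz below; nothing in between changes the flags
            movss   xmm3, [eax]
            shufps  xmm3, xmm3, 0             // broadcast m1[i][0]
            mulps   xmm3, xmm4
            movss   xmm2, [eax+4]
            shufps  xmm2, xmm2, 0             // broadcast m1[i][1]
            mulps   xmm2, xmm5
            movss   xmm1, [eax+8]
            shufps  xmm1, xmm1, 0             // broadcast m1[i][2]
            mulps   xmm1, xmm6
            addps   xmm3, xmm2
            movss   xmm2, [eax+12]
            shufps  xmm2, xmm2, 0             // broadcast m1[i][3]
            mulps   xmm2, xmm7
            addps   xmm1, xmm3
            addps   xmm1, xmm2
            movaps  [edx], xmm1               // store row i of m3
            lea     edx, [edx+16]
            lea     eax, [eax+16]
            jnz     l1
            mov     eax, dword ptr [esp+4]    // return m3, as the signature promises
            ret     12
        }
    }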
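For readers without MSVC inline asm, the same row-broadcast scheme can be sketched with SSE intrinsics. This is a reading of the routine above, not the poster's code: `Mat4` and `mul_rows()` are hypothetical names, intrinsics replace the hand-written assembly, and a scalar fallback is included (an addition for portability) for compilers that are not targeting SSE. No cycle counts are claimed for it.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Hypothetical 16-byte aligned 4x4 row-major matrix.
struct alignas(16) Mat4 { float m[4][4]; };

// out = a * b in the row-vector convention.
// All three pointers must be 16-byte aligned.
Mat4* mul_rows(Mat4* out, const Mat4* a, const Mat4* b) {
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
#if HAVE_SSE
    // Keep the four rows of b in registers for the whole loop.
    const __m128 b0 = _mm_load_ps(b->m[0]);
    const __m128 b1 = _mm_load_ps(b->m[1]);
    const __m128 b2 = _mm_load_ps(b->m[2]);
    const __m128 b3 = _mm_load_ps(b->m[3]);
    for (int i = 0; i < 4; ++i) {
        // Row i of out = sum over k of a[i][k] * (row k of b).
        __m128 r = _mm_mul_ps(_mm_set1_ps(a->m[i][0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a->m[i][1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a->m[i][2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a->m[i][3]), b3));
        _mm_store_ps(out->m[i], r);
    }
#else
    // Scalar fallback with the same result; uses a temporary for aliasing.
    Mat4 t;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k) s += a->m[i][k] * b->m[k][j];
            t.m[i][j] = s;
        }
    *out = t;
#endif
    return out;
}
```

Loading all of `b` once and broadcasting one element of `a` at a time avoids any transpose, which is why each output row needs only four multiplies and three adds.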

I apologise for my behavior on "16 November 2001 6:49:02 AM". That was 3:49 AM my time and I had just come home from a club. I guess computers and Guinness don't mix too well, who would have thought.

Edited by - Abstract Thought on November 19, 2001 8:08:44 PM

##### Share on other sites
quote:

I think there are some problems with your testing method. This is how I test code.

QueryPerformanceCounter() uses RDTSC internally if available and suitable. If it isn't available, or on an SMP machine, it'll use the 8254 Programmable Interval Timer for its timing.

RDTSC isn't free of problems itself. Try it on a CPU which doesn't support it. More importantly, on a machine with multiple CPUs it doesn't make any sense, since you're only getting the cycle count of one CPU. If the work is spread across CPUs (i.e. split across threads) your timing is screwed.

Granted, QueryPerformanceCounter() does have some documented problems on machines with buggy BIOS implementations.

A much larger problem is the fact that the code being timed isn't the only code running in the system. Taking the absolute time between two points with QueryPerformanceCounter() or RDTSC isn't accurate, since the Windows thread scheduler will switch away from the thread when it reaches the end of its current quantum (~15-20ms). Making the thread realtime wouldn't totally avoid it either, since Windows will schedule a double length quantum for starved threads.

The only real way to get completely accurate and stable timings for that code is to do it in pure DOS (*not* a DOS console), where there isn't any thread switching going on. Unfortunately, IIRC D3DX looks at registry keys at startup, so it wouldn't work in DOS.

quote:

Testing code in 1,000,000 loop, also has some problems, read this. This effect can really be seen, if you reduce that count to 10000.

The only issue is the one I mention above. The article you linked to is about Write Combined AGP memory, which has nothing whatsoever to do with what we're discussing. The arrays you pass for matrices and points to D3DX are typically in cached system memory.
The only consequence apart from linear cache misses is (linear) page faults from the virtual memory system. The assumption I made is that you're testing on a machine with enough physical memory to not require regular paging. When I have some time I'll probably add some explicit page touching code to ensure everything is paged in.

quote:

Here is a Matrix Mul I wrote when I was learning SSE about a month ago. Its takes 111 cycles, not a fast as 90, but it doesn't pollute the cache as much either. If you add this code to your timing code, and run it on a P3, and reduce the count to 10000, this code should be faster than sse_D3DXMatrixMultiply version

If you want to use that directly, and have your own jump table to do it on different CPUs, fine. It depends on whether matrix multiplication is a major bottleneck in your app or not.
If you're not matrix multiply limited, D3DX is a great solution. If profiling the app as a whole shows that as a problem, then sure, saving 10 cycles on each might be a help. But realistically you'd have to be transforming a hell of a lot of bones/hierarchies for that to ever be the case.

BTW: Try replacing D3DXMATRIX in my code with D3DXMATRIXA16 if you're using the processor pack or VC7 and then re-profile.

quote:

I apologies for my behavior on "16 November 2001 6:49:02 AM", that was 3:49 AM my time, I had just come home from a club. I guess computers and Guinness don't mix to well, who would have thought.

No worries mate, we're essentially in agreement now anyway regarding the original topic. D3DX will use CPU specific code and is pretty darn fast. It may be possible to beat it with some hand tuned code, but as a general purpose, tested solution for all x86 CPUs it's a good choice.

--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

##### Share on other sites
quote:

The only issue is the one I mention above. The article you linked to is about Write Combined AGP memory which has nothing whatsoever to do with what we're discussing.

No, it doesn't have much to do with matrix mul. However, a great deal of the time measured by your timing software was being spent in the virtual memory manager; that's why I use Intel's method of timing code. This is also an interest of mine, if you care to discuss it.

I'm interested in transformation, lighting and tweening (new); you may have seen my crappy little demo. I was also thinking of large transformed arrays of vertices stored in AGP memory for quick processing by the GPU. I just can't believe how much time is wasted on the write back to memory.

quote:

The only real way to get completely accurate and stable timings for that code is to do it in pure DOS (*not* a DOS console) where there isn't any thread switching going on. Unfortunately IIRC D3DX looks at registry keys at startup so wouldn't work in DOS.

That may be an option if I use cli and sti. A while ago I wrote an asm program that used the processor stack as an interpreter in pure DOS, but that was only 16-bit and I would have to switch to 32-bit. Hmmm, thanks, good idea.

quote:

The assumption I made is that you're testing on a machine with enough physical memory to not require regular paging.

I only have 128MB. Let's see, that's 16*4*1000000 = 64000000 bytes, which eats half of it. I did see some paging the first time I ran it.

quote:

If profiling the app as a whole shows that as a problem, then sure, saving 10 cycles on each might be a help. But realistically you'd have to be transforming a hell of a lot of bones/hierarchies for that to ever be the case.

I don't know much about bones yet. My work has focused on learning transformation and lighting; there's no point in moving a character or other object if I don't even know how the math works. So now I do, and I have my own pipeline to prove it.

The basic equation I optimized is Matrix matSet = matWorld * matView * matProj; I haven't timed this yet, but you could say that sse_D3DXMatrixMultiply would take about 105*3. I was able to improve this substantially, and the method would also work with AMD. There are so many more ways that the pipeline can be improved; it's a shame that we're losing control of it to the new GPUs. On the other hand, having full control over a fast GPU would be great.

quote:
My goal is to understand, optimizing is just for fun .

##### Share on other sites
quote:

No its doesn't have much to do with matrix mul. However a great deal of the time used by your timing software was being used by the virtual memory manger, that's why I use Intel's method of timing code. This is also an interest of mine care to discuss it.

The difference here is in the number of items being operated on, not the actual profiling method. Counting cycles with RDTSC isn't really any different to timing cycles with QueryPerformanceCounter (which itself uses RDTSC when possible).
My point about that article is that it's referring to what happens with data living in WC AGP memory rather than cached system memory (which is where the code in the program I posted expects matrices to live).

The code I posted wasn't intended to be the "perfect" profiler. It was *ONLY* to indicate a difference between the x86, 3DNow! and SSE versions of the D3DX maths functions. The machine I tested that on has 512MB and had nothing else running.

quote:

I'm interested in transformation, lighting and tweening (new ), you may have seen my crappy little demo. I was also thinking of large transformed arrays of vertices that were being stored in AGP for quick processing by the GPU. I just can't believe how much time is wasted on the write back to memory

Whether that will be faster than doing the maths in system memory and burst writing to AGP memory will depend on your *exact* program. Remember that the maths usually requires READ operations on its data. If the data is stored in non-cached memory (WC AGP), then any read takes the time it takes to go to memory rather than the cache (IIRC it's something like a 20-40 cycle penalty per read from non-cached memory).

quote:

That may be an option if I use cli and sti.

Manually disabling interrupts on Windows NT, XP or 2000 is a great, easy way to persuade the task manager to suspend/terminate your application for trying to damage the stability/security of the machine. You should never need to do that nowadays.

quote:

There are so many more ways that the pipeline can be improved, its a shame that were losing control of it, to the new GPU's. On the other hand having full control over a fast GPU would be great.

You'd like the PlayStation 2 then: lots of low level control (i.e. you decide exactly how data gets shifted around and you manage what happens concurrently etc), although it has to be said that re-inventing the wheel can get boring after a while!

Handing the boring work (like transforming vertices etc) off to specialised custom CPUs (like a GPU) is definitely a better way to approach hardware architectures than having a single serialised CPU do everything.

--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

##### Share on other sites
S1CA:

quote:

Counting cycles with RDTSC isn't really any different to timing cycles with QueryPerformanceCounter (which itself uses RDTSC when possible).

I traced into QueryPerformanceCounter. It seems to go to ring0, and past ring0 I can't see, of course. I see a difference between 1 instruction and 11+ who knows how many instructions, but maybe I'm just a freak. I guess I could measure its overhead first.
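One way to find out a timer's own overhead is to read it twice in a row, many times, and keep the smallest gap. The sketch below is only an illustration of that idea: it uses `std::chrono::steady_clock` as a stand-in for QueryPerformanceCounter or RDTSC, and `timer_overhead_ns()` and `corrected_ns()` are hypothetical helper names, not anything from the thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Smallest observed gap, in nanoseconds, between two back-to-back reads
// of the clock. Taking the minimum over many samples filters out samples
// that were hit by a context switch or an interrupt.
std::int64_t timer_overhead_ns(int samples) {
    assert(samples > 0);
    using clock = std::chrono::steady_clock;
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < samples; ++i) {
        const auto t0 = clock::now();
        const auto t1 = clock::now();
        const std::int64_t d =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (d < best) best = d;
    }
    return best;
}

// Subtract the timer's own cost from a raw measurement, clamping at zero.
std::int64_t corrected_ns(std::int64_t raw_ns, std::int64_t overhead_ns) {
    return raw_ns > overhead_ns ? raw_ns - overhead_ns : 0;
}
```

The correction only matters when the code being timed is very short; over a loop of a million multiplies the timer's own cost is lost in the noise.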

I was reading a few months ago that the Intel processor switches tasks only when a jmp or a call is executed. I could be wrong, and I can't find that info right now. It's hiding somewhere on my 30 GB hard drive.

quote:

The code I posted wasn't intended to be the "perfect" profiler. It was *ONLY* to indicate a difference between the x86, 3DNow! and SSE versions of the D3DX maths functions. The machines I tested that on has 512Mb and had nothing else running.

Cool, I didn't mean to bash it. I appreciate the discussion, thanks.

quote:

Whether that will be faster than doing the maths in system memory and burst writing to AGP memory will depend on your *exact* program.

In fact, on my system anyway, a little test app I wrote does benefit from this method, but I have a feeling that DX 8.1 also does it this way. I'm not sure. Are you?

quote:

Manually disabling interrupts on Windows NT, XP or 2000 is a great easy way to persuade the task manager to suspend/terminate your application for trying to damage the stability/security of the machine. You should never need to do that nowadays.

Really, I didn't know that about Windows NT, XP or 2000, but I was referring to pure DOS.

quote:

You'd like the PlayStation 2 then , lots of low level control

Ya, that sounds great. I'll look into that when I finish my studies.

quote:

although it has to be said that re-inventing the wheel can get boring after a while!

It's not about re-inventing the wheel, it's about making a better one.

quote:

Handing the boring work (like transforming vertices etc) off to specialized custom CPUs (like a GPU) is definitely a better way to approach hardware architectures than having a single serialized CPU do everything.

For game developers who need to get their products out quickly without worrying about every single cycle, I think it's great. I just think it could turn game programmers into point-and-clickers, with all the real work already done for them. Isn't that really what MS wants, a whole generation of programmers that use DX and Windows for game programming?

What about the cost of those fast GPUs that do T&L? In my country, Canada, they cost between $300 and $600. I'm sure that their motives are the same as MS's.

It reminds me of one time I was looking for a game to buy. A little kid, about 10, asked if I could help him find Diablo, so I pointed it out. Poor little bugger fell into a deep depression when he saw the price tag. He said, "I only wanted a disk, not the whole thing".

What's my point? Developers, don't forget what it's like to be a kid.

Edited by - Abstract Thought on November 20, 2001 9:22:17 PM
