SSE/3dNow! vs. DX8

Started by
14 comments, last by malyskolacek 22 years, 5 months ago
Only by art can we get outside ourselves; instead of seeing only one world, our own, we see it under multiple forms.
For S1CA

quote:
If, as you assert, the SIMD version of D3DXMatrixMultiply is taking this same time or longer than the scalar x86 FPU version of D3DXMatrixMultiply, then your timings are flawed in some way.


Yes, you are correct. It took a while, but I understand now what I was doing wrong: I wasn't compiling with the DX8.1 libs. Dumb.

quote:
I just timed differences between the SIMD PSGP version of D3DXMatrixMultiply and the scalar x86 version with the following program (try putting what's below into an MSVC console app and build it in Release mode, comment out profiles not relevant to your CPU)...


I tried your timing code, and D3DXMatrixMultiply and sse_D3DXMatrixMultiply ran at the same speed. I also traced into the asm and saw that they were both executing the same SSE; there wasn't any FPU code in sight.

I think there are some problems with your testing method. This is how I test code.

Here is that 90-cycle code from Intel that I was talking about; sse_D3DXMatrixMultiply is 105 cycles.

Testing code in a 1,000,000 loop also has some problems; read this. This effect can really be seen if you reduce that count to 10,000.

Here is a Matrix Mul I wrote when I was learning SSE about a month ago. It takes 111 cycles, not as fast as 90, but it doesn't pollute the cache as much either. If you add this code to your timing code, run it on a P3, and reduce the count to 10,000, this code should be faster than the sse_D3DXMatrixMultiply version.

    //you need to use __declspec(align(16)) D3DXMATRIX;
    __declspec(naked) D3DXMATRIX* WINAPI v1_MatrixMul(D3DXMATRIX *m3, CONST D3DXMATRIX *m1, CONST D3DXMATRIX *m2)
    {
        __asm
        {
            mov ecx, dword ptr [esp+8]   // m2
            movaps xmm4, [ecx]
            movaps xmm5, [ecx+16]
            mov eax, dword ptr [esp+4]   // m1
            mov edx, dword ptr [esp+12]  // m3
            movaps xmm6, [ecx+32]
            movaps xmm7, [ecx+48]
            mov ecx, 4
    l1:
            dec ecx
            movss  xmm3, [eax]
            shufps xmm3, xmm3, 0
            mulps  xmm3, xmm4
            movss  xmm2, [eax+4]
            shufps xmm2, xmm2, 0
            mulps  xmm2, xmm5
            movss  xmm1, [eax+8]
            shufps xmm1, xmm1, 0
            mulps  xmm1, xmm6
            addps  xmm3, xmm2
            movss  xmm2, [eax+12]
            shufps xmm2, xmm2, 0
            mulps  xmm2, xmm7
            addps  xmm1, xmm3
            addps  xmm1, xmm2
            movaps [edx], xmm1
            lea edx, [edx+16]
            lea eax, [eax+16]
            jnz l1
            ret 12
        }
    }
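For reference (my addition, not part of the original post): a plain scalar C version of the same row-major 4x4 multiply. The function name is mine; it's handy for verifying the output of the SSE routine above.

```c
#include <stddef.h>

/* Scalar reference for the SSE routine above: out = a * b,
   all three matrices row-major 4x4 arrays of 16 floats. */
static void mat4_mul_scalar(float *out, const float *a, const float *b)
{
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (size_t k = 0; k < 4; ++k)
                s += a[i * 4 + k] * b[k * 4 + j];
            out[i * 4 + j] = s;
        }
}
```

Multiplying by the identity should return the other matrix unchanged, which makes a quick sanity check.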


I apologize for my behavior on "16 November 2001 6:49:02 AM"; that was 3:49 AM my time, and I had just come home from a club. I guess computers and Guinness don't mix too well; who would have thought?



Edited by - Abstract Thought on November 19, 2001 8:08:44 PM
quote:
I think there are some problems with your testing method. This is how I test code.


QueryPerformanceCounter() uses RDTSC internally if available and suitable. If not available, or on SMP systems, it'll use the 8254 Programmable Interrupt Timer for its timing.

RDTSC isn't free of problems itself. Try it on a CPU which doesn't support it. And more importantly, on a machine with multiple CPUs it doesn't make any sense, since you're only getting the cycle count of one CPU. If the work is spread across CPUs (i.e. split across threads), your timing is screwed.

Granted, QPF() does have some documented problems on machines with buggy BIOS implementations.

A much larger problem is the fact that the code being timed isn't the only code running in the system. Using QPF() or RDTSC to get the absolute time between two points isn't accurate, since the Windows thread scheduler will switch away from the thread when we reach the end of our current quantum (~15-20ms). Making the thread realtime wouldn't totally avoid it either, since Windows will schedule a double-length quantum for starved threads.
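One common way to reduce that scheduler noise (my sketch, not something either poster suggested) is to repeat the measurement and keep the minimum, on the theory that the fastest run is the one least disturbed by preemption. The helper names and the placeholder workload here are mine:

```c
#include <time.h>

static volatile double timing_sink;   /* keeps the workload from being optimized away */

static void sample_work(void)         /* placeholder for the code under test */
{
    double s = 0.0;
    for (int i = 0; i < 100000; ++i)
        s += (double)i * 0.5;
    timing_sink = s;
}

/* Run the code under test several times and keep the minimum elapsed time;
   runs that got preempted show up as outliers and are discarded by the min.
   clock() is used here for portability; on Windows you'd swap in
   QueryPerformanceCounter or RDTSC. */
static double time_min_seconds(void (*fn)(void), int runs)
{
    double best = 1e30;
    for (int r = 0; r < runs; ++r) {
        clock_t t0 = clock();
        fn();
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (dt < best)
            best = dt;
    }
    return best;
}
```

Something like `time_min_seconds(sample_work, 20)` then gives a best-case figure that is far more repeatable than a single timed run.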

The only real way to get completely accurate and stable timings for that code is to do it in pure DOS (*not* a DOS console) where there isn't any thread switching going on. Unfortunately, IIRC, D3DX looks at registry keys at startup, so it wouldn't work in DOS.


quote:
Testing code in a 1,000,000 loop also has some problems; read this. This effect can really be seen if you reduce that count to 10,000.


The only issue is the one I mention above. The article you linked to is about Write Combined AGP memory which has nothing whatsoever to do with what we're discussing. The arrays you pass for matrices and points to D3DX are typically in cached system memory.
The only consequence, apart from linear cache misses, is (linear) page faults from the virtual memory system. The assumption I made is that you're testing on a machine with enough physical memory to not require regular paging. When I have some time I'll probably add some explicit page-touching code to ensure everything is paged in.
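The "explicit page touching" idea sketches out like this (my illustration; the function name is mine, and the 4 KB page size is the usual x86 assumption):

```c
#include <stddef.h>

/* Sketch: write to one byte in every page of a buffer before timing,
   so first-touch page faults are taken outside the measured region.
   PAGE_SIZE is assumed to be the usual x86 4 KB. */
#define PAGE_SIZE 4096

static void touch_pages(unsigned char *buf, size_t bytes)
{
    for (size_t off = 0; off < bytes; off += PAGE_SIZE)
        buf[off] = 0;
    if (bytes)
        buf[bytes - 1] = 0;  /* cover a partial final page */
}
```

Calling this on the matrix arrays before starting the timer keeps the virtual memory manager's work out of the measurement.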


quote:
Here is a Matrix Mul I wrote when I was learning SSE about a month ago. It takes 111 cycles, not as fast as 90, but it doesn't pollute the cache as much either. If you add this code to your timing code, run it on a P3, and reduce the count to 10,000, this code should be faster than the sse_D3DXMatrixMultiply version.


If you want to use that directly, and have your own jump table to dispatch it on different CPUs, fine. It depends on whether matrix multiplication is a major bottleneck in your app or not.
If you're not matrix-multiply limited, D3DX is a great solution. If profiling the app as a whole shows that as a problem, then sure, saving 10 cycles on each might be a help. But realistically you'd have to be transforming a hell of a lot of bones/hierarchies for that to ever be the case.

BTW: Try replacing D3DXMATRIX in my code with D3DXMATRIXA16 if you're using the processor pack or VC7 and then re-profile.


quote:
I apologize for my behavior on "16 November 2001 6:49:02 AM"; that was 3:49 AM my time, and I had just come home from a club. I guess computers and Guinness don't mix too well; who would have thought?


No worries mate, we're essentially in agreement now anyway regarding the original topic. D3DX will use CPU-specific code and is pretty darn fast. It may be possible to beat it with some hand-tuned code, but as a general-purpose, tested solution for all x86 CPUs it's a good choice.

--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing | LinkedIn | Personal site

quote:
The only issue is the one I mention above. The article you linked to is about Write Combined AGP memory which has nothing whatsoever to do with what we're discussing.


No, it doesn't have much to do with matrix mul. However, a great deal of the time used by your timing software was being used by the virtual memory manager; that's why I use Intel's method of timing code. This is also an interest of mine; care to discuss it?

I'm interested in transformation, lighting and tweening (new), you may have seen my crappy little demo. I was also thinking of large transformed arrays of vertices that were being stored in AGP for quick processing by the GPU. I just can't believe how much time is wasted on the write back to memory.

quote:
The only real way to get completely accurate and stable timings for that code is to do it in pure DOS (*not* a DOS console) where there isn't any thread switching going on. Unfortunately, IIRC, D3DX looks at registry keys at startup, so it wouldn't work in DOS.


That may be an option if I use cli and sti. A while ago I wrote an asm program that used the processor stack as an interpreter in pure DOS, but that was only 16-bit; I would have to switch to 32-bit. Hmmm, thanks, good idea.

quote:
The assumption I made is that you're testing on a machine with enough physical memory to not require regular paging.


I only have 128 MB; let's see, that's 16*4*1,000,000 = 64,000,000 bytes, which eats half of it. I did see some paging the first time I ran it.

quote:
If profiling the app as a whole shows that as a problem, then sure, saving 10 cycles on each might be a help. But realistically you'd have to be transforming a hell of a lot of bones/hierarchies for that to ever be the case.


I don't know much about bones yet; my work has focused on learning transformation and lighting. There's no point in moving a character or other object if I don't even know how the math works. So now I do, and I have my own pipeline to prove it.

The basic equation I optimized is Matrix matSet = matWorld * matView * matProj; I haven't timed this yet, but you could say that sse_D3DXMatrixMultiply would take about 105*3 cycles. I was able to improve on that substantially, and the method would also work on AMD. There are so many more ways that the pipeline can be improved; it's a shame that we're losing control of it to the new GPUs. On the other hand, having full control over a fast GPU would be great.
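To make the 105*3 point concrete, here's a self-contained sketch (my helper names, plain scalar C rather than SSE or D3DX) of why concatenating world * view * proj pays off: two matrix-matrix multiplies up front, after which each vertex costs a single matrix transform instead of three.

```c
/* Illustrative helpers (not D3DX): row-major 4x4 matrices, row-vector
   convention as in Direct3D, points as (x, y, z, 1). */
static void mat4_mul(float out[16], const float a[16], const float b[16])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += a[i * 4 + k] * b[k * 4 + j];
            out[i * 4 + j] = s;
        }
}

static void transform_point(float out[4], const float p[4], const float m[16])
{
    /* row vector times matrix */
    for (int j = 0; j < 4; ++j)
        out[j] = p[0] * m[j]     + p[1] * m[4 + j]
               + p[2] * m[8 + j] + p[3] * m[12 + j];
}

/* Build the concatenated matrix once per frame (two matrix multiplies)... */
static void build_wvp(float wvp[16], const float world[16],
                      const float view[16], const float proj[16])
{
    float tmp[16];
    mat4_mul(tmp, world, view);   /* world * view          */
    mat4_mul(wvp, tmp, proj);     /* (world * view) * proj */
}
```

...and by associativity, transforming a point through wvp gives the same result as pushing it through world, view, and proj one after another.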

My goal is to understand; optimizing is just for fun.
quote:
No, it doesn't have much to do with matrix mul. However, a great deal of the time used by your timing software was being used by the virtual memory manager; that's why I use Intel's method of timing code. This is also an interest of mine; care to discuss it?


The difference here is in the number of items being operated on, not the actual profiling method. Counting cycles with RDTSC isn't really any different to timing cycles with QueryPerformanceCounter (which itself uses RDTSC when possible).
My point about that article is that it's referring to what happens with data living in WC AGP memory rather than cached system memory (which is where the code in the program I posted expects matrices to live).

The code I posted wasn't intended to be the "perfect" profiler. It was *ONLY* to indicate a difference between the x86, 3DNow! and SSE versions of the D3DX maths functions. The machine I tested that on has 512 MB and had nothing else running.


quote:
I'm interested in transformation, lighting and tweening (new), you may have seen my crappy little demo. I was also thinking of large transformed arrays of vertices that were being stored in AGP for quick processing by the GPU. I just can't believe how much time is wasted on the write back to memory.


Whether that will be faster than doing the maths in system memory and burst-writing to AGP memory will depend on your *exact* program. Remember that the maths usually requires READ operations on its data. If the data is stored in non-cached memory (WC AGP), then any read takes the time it takes to go to memory rather than the cache (IIRC it's something like a 20-40 cycle penalty per read from non-cached memory).


quote:
That may be an option if I use cli and sti.

Manually disabling interrupts on Windows NT, XP or 2000 is a great easy way to persuade the task manager to suspend/terminate your application for trying to damage the stability/security of the machine. You should never need to do that nowadays.


quote:
There are so many more ways that the pipeline can be improved; it's a shame that we're losing control of it to the new GPUs. On the other hand, having full control over a fast GPU would be great.


You'd like the PlayStation 2 then, lots of low-level control (i.e. you decide exactly how data gets shifted around and you manage what happens concurrently etc), although it has to be said that re-inventing the wheel can get boring after a while!

Handing the boring work (like transforming vertices etc) off to specialised custom CPUs (like a GPU) is definitely a better way to approach hardware architectures than having a single serialised CPU do everything.



S1CA:

quote:
Counting cycles with RDTSC isn't really any different to timing cycles with QueryPerformanceCounter (which itself uses RDTSC when possible).


I traced into QueryPerformanceCounter; it seems to go to ring 0, and past ring 0 I can't see, of course. I see a difference between 1 instruction and 11+ (who knows how many) instructions, but maybe I'm just a freak. I guess I could find out its overhead first.

I was reading a few months ago that the Intel processor switches tasks only when a jmp or a call is executed. I could be wrong, and I can't find that info right now; it's hiding somewhere on my 30 GB hard drive.

quote:
The code I posted wasn't intended to be the "perfect" profiler. It was *ONLY* to indicate a difference between the x86, 3DNow! and SSE versions of the D3DX maths functions. The machine I tested that on has 512 MB and had nothing else running.


Cool, didn't mean to bash it. I appreciate the discussion, thanks.

quote:
Whether that will be faster than doing the maths in system memory and burst writing to AGP memory will depend on your *exact* program.


In fact, on my system anyway, a little test app I wrote does benefit from this method, but I have a feeling that DX8.1 also does it this way. Not sure; you?

quote:
Manually disabling interrupts on Windows NT, XP or 2000 is a great easy way to persuade the task manager to suspend/terminate your application for trying to damage the stability/security of the machine. You should never need to do that nowadays.


Really, I didn't know that about Windows NT, XP or 2000, but I was referring to pure DOS.

quote:
You'd like the PlayStation 2 then, lots of low-level control


Ya, that sounds great. I'll look into that when I finish my studies.

quote:
although it has to be said that re-inventing the wheel can get boring after a while!


It's not about re-inventing the wheel, it's about making a better one.

quote:
Handing the boring work (like transforming vertices etc) off to specialized custom CPUs (like a GPU) is definitely a better way to approach hardware architectures than having a single serialized CPU do everything.


For game developers who need to get their products out quickly without worrying about every single cycle, I think it's great. I just think it could make game programmers into point-and-clickers; all the real work has already been done for them. Isn't that really what MS wants: a whole generation of programmers that use DX and Windows for game programming?

What about the cost of those fast GPUs that do T&L? In my country, Canada, they cost between $300 and $600. I'm sure that their motives are the same as MS's.

It reminds me of this one time I was looking for a game to buy. This little kid, about 10, asked if I could help him find Diablo, so I pointed it out. Poor little bugger fell into a deep depression when he saw the price tag. He said, "I only wanted a disk, not the whole thing".

What's my point? Developers, don't forget what it's like to be a kid.




Edited by - Abstract Thought on November 20, 2001 9:22:17 PM

This topic is closed to new replies.
