d3dx library sucks! please read this!

Graphics and GPU Programming

Started by ashade March 03, 2005 02:12 PM

22 comments, last by DrunkenHyena 19 years, 1 month ago

100

Author

March 03, 2005 02:12 PM

Everyone says the D3DX library functions are very fast, so I decided to make a simple test.

1) First, I made my own vector structure:

    __declspec(align(16)) struct myVec3
    {
        float x, y, z, w;
        myVec3() { x = y = z = w = 0.0f; };
        myVec3(float x, float y, float z) { this->x = x; this->y = y; this->z = z; this->w = 0.0f; };
    };

Note that it has 4 members, but the last one will always be zero!

Then I made these two functions:

    #define MATHinl __forceinline

    MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross2(myVec3 *v1, myVec3 *v2, myVec3* pOut)
    {
        __asm
        {
            mov ebx, [esp+4]
            /*
               0 a1 a3 a2 -- 0 b2 b1 b3
               0 a2 a1 a3 -- 0 b1 b3 b2
            */
            movaps xmm0, [ecx]
            movaps xmm1, xmm0
            movaps xmm2, [edx]
            movaps xmm3, xmm2
            shufps xmm0, xmm0, 11001001b
            shufps xmm1, xmm1, 11010010b
            shufps xmm2, xmm2, 11010010b
            shufps xmm3, xmm3, 11001001b
            mulps xmm0, xmm2
            mulps xmm1, xmm3
            subps xmm0, xmm1
            movaps [ebx], xmm0
            ret 4
        }
    }

    MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross(myVec3 *v1, myVec3 *v2, myVec3* pOut)
    {
        __asm
        {
            mov ebx, [esp+4]
            /*
               0 a1 a3 a2 -- 0 b2 b1 b3
               0 a2 a1 a3 -- 0 b1 b3 b2
            */
            movq mm0, [ecx+4]       //a3 a2
            movd mm1, [ecx]         //0 a1
            movd mm2, [ecx+8]       //0 a3
            movd mm4, [edx+8]       //0 b3
            punpckldq mm4, [edx]    //b1 b3
            movd mm5, [edx+4]       //0 b2
            movq mm6, [edx+4]       //b3 b2
            movd mm7, [edx]         //0 b1
            movq mm3, mm0           //0 a2
            punpckldq mm2, mm1      //a1 a3
            pfmul mm0, mm4
            pfmul mm1, mm5
            pfmul mm2, mm6
            pfmul mm3, mm7
            pfsub mm0, mm2
            pfsub mm1, mm3
            movq [ebx], mm0
            movq [ebx+8], mm1
            femms
            ret 4
        }
    }

The first one uses SSE and the second uses 3DNow!. Then, inside my main() function, I made the test like this:

    myVec3 v1(1.5f, 2.0f, 1.0f);
    myVec3 v2(5.0f, 4.0f, 1.0f), fd;
    D3DXVECTOR3 v3(1.5f, 2.0f, 1.0f), v4(5.0f, 4.0f, 1.0f), v5;

    unsigned long i1 = timeGetTime();
    for(unsigned int i = 0; i<100000000; i++)
    {
        MATHMyVec3Cross(&v1, &v2, &fd);
        //MATHMyVec3Cross2(&v1, &v2, &fd);
        //D3DXVec3Cross(&v5, &v4, &v3);
    }
    unsigned long i2 = timeGetTime();
    i2-=i1;

Switch the comments to test each function.
I tested on my Athlon XP 1700+ with 256 MB of memory. The results were amazing: D3DX: 17650 milliseconds; MATHMyVec3Cross: 1250 milliseconds; MATHMyVec3Cross2: 1700 milliseconds. The 3DNow! version is about 14 times faster than the D3DX one, and the SSE version is about 10 times faster than the D3DX one! I want you to test it and post your results and your machine specs here... and don't forget that the 3DNow! version only works on AMD processors! Hey, compile it with VC++ 2003!
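For readers who do not read shufps masks fluently, here is a portable scalar sketch of what the SSE routine above appears to compute: mask 11001001b reorders the lanes to (y, z, x, w) and mask 11010010b to (z, x, y, w), so the result is a.yzx * b.zxy - a.zxy * b.yzx. The names Vec4, shuffleYZX, shuffleZXY and crossByShuffle are made up for this illustration and are not part of the original post or of D3DX.

```cpp
#include <array>
#include <cstddef>

// Four-float vector mirroring the post's myVec3 layout (x, y, z, w).
// Hypothetical name, used only for this sketch.
using Vec4 = std::array<float, 4>;

// Scalar stand-in for shufps with mask 11001001b: lanes become (y, z, x, w).
inline Vec4 shuffleYZX(const Vec4& v) { return {v[1], v[2], v[0], v[3]}; }

// Scalar stand-in for shufps with mask 11010010b: lanes become (z, x, y, w).
inline Vec4 shuffleZXY(const Vec4& v) { return {v[2], v[0], v[1], v[3]}; }

// cross(a, b) = a.yzx * b.zxy - a.zxy * b.yzx, evaluated lane by lane,
// which matches the two mulps and one subps in the SSE routine.
inline Vec4 crossByShuffle(const Vec4& a, const Vec4& b) {
    const Vec4 aYZX = shuffleYZX(a);
    const Vec4 aZXY = shuffleZXY(a);
    const Vec4 bZXY = shuffleZXY(b);
    const Vec4 bYZX = shuffleYZX(b);
    Vec4 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = aYZX[i] * bZXY[i] - aZXY[i] * bYZX[i];
    }
    return out;
}
```

With the test vectors from the post, crossByShuffle({1.5f, 2.0f, 1.0f, 0.0f}, {5.0f, 4.0f, 1.0f, 0.0f}) gives (-2, 3.5, -4, 0). The fourth lane is w*w - w*w, so it only stays zero as long as w is kept at zero.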

Evil Steve

2,021

March 03, 2005 02:34 PM

That *was* a release build, wasn't it? And you were linking with d3dx9.lib, not d3dx9d.lib?
Also, there's bound to be some overhead in using the D3DX functions, since they'll work on any machine, even one without SSE or 3DNow!. Also, using the D3DX functions saves a lot of time compared with writing your own.

EDIT: Also, D3DX functions aren't __forceinline'd. I'm testing your code just now on my Athlon 2600XP, 512Mb RAM. I'll post my results shortly.

Syranide

375

March 03, 2005 02:37 PM

You have to understand that they are fast, but as usual, Microsoft doesn't use assembly except for very critical functions...

There is, and always will be, a faster way of doing everything... no doubt.

ashade

100

Author

March 03, 2005 02:44 PM

The D3DX vector functions are __forceinline! And I am linking to d3dx9.lib! What's more, every modern processor has at least SSE. No one uses a Pentium II anymore, and if you do, I'm afraid you can't run a heavy (computationally expensive) 3D application.

Muhammad Haggag

1,358

March 03, 2005 02:56 PM

1) No one said that the D3DX functions are unbeatable, you can beat them, but the performance benefit is unlikely to justify the time spent on it (How much time have you spent on just those 2 functions?)

2) D3DX Functions select optimum versions for the current hardware (i.e. SSE, 3DNow, SSE2). Your code doesn't, it assumes support for something and uses it.

3) It seems that you tested in a debug build, because I think even a poor compiler would have optimized out the whole loop (because you're not doing anything with v5).
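Point 3 can be illustrated with a small portable sketch of a timing loop whose result is actually used, so an optimizer has less room to delete the work. This is not the code anyone in the thread ran: Vec3, cross, BenchResult and timeCross are hypothetical names, the cross product is plain scalar C++, and std::chrono stands in for the Windows timers.

```cpp
#include <chrono>
#include <cstdint>

struct Vec3 { float x, y, z; };

// Plain scalar cross product, used here only as the function under test.
inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

struct BenchResult {
    double milliseconds;  // wall-clock time for the whole loop
    float  checksum;      // derived from the final vector, so the loop has a visible result
};

BenchResult timeCross(std::uint32_t iterations) {
    volatile float startX = 1.5f;         // volatile read: the input is not a compile-time constant
    Vec3 a{ startX, 2.0f, 1.0f };
    const Vec3 axis{ 0.0f, 0.0f, 1.0f };  // unit axis, so the values stay bounded

    const auto begin = std::chrono::steady_clock::now();
    for (std::uint32_t i = 0; i < iterations; ++i) {
        a = cross(a, axis);               // each iteration feeds the next one
    }
    const auto end = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = end - begin;
    return { elapsed.count(), a.x + a.y + a.z };
}
```

The caller should print or otherwise use the checksum. Chaining each result into the next call measures latency rather than throughput, so the numbers are not directly comparable with a loop of independent calls.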

Evil Steve

2,021

March 03, 2005 02:58 PM

D3DX functions are just declared as inline, so the compiler can choose not to inline them if it wants. And as Syranide said, most of the vector functions don't use assembly at all. The matrix functions do, which is why they're in the .lib (So actually I was wrong, it doesn't matter what version of the d3dx lib you link to, it's not used).

My results are:
Custom 3DNow!... 27664784 ticks, 7728 ms
D3DX... 49081947 ticks, 13711 ms
So, it's still twice as fast.

I was using QueryPerformanceCounter() for profiling. What compiler were you using? I had to edit the code to stop the compiler optimizing the D3DX version away... Here's my complete code:

    #include <windows.h>
    #include <stdio.h>
    #include <d3dx9math.h>

    __declspec(align(16)) struct myVec3
    {
        float x, y, z, w;
        myVec3() {x = y = z = w = 0.0f;}
        myVec3(float _x, float _y, float _z)
            {x = _x; y = _y; z = _z; w = 0.0f;}
    };

    #define MATHinl inline

    // SSE version
    MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross2(myVec3 *v1, myVec3 *v2, myVec3* pOut)
    {
        __asm
        {
            mov ebx, [esp+4]
            movaps xmm0, [ecx]
            movaps xmm1, xmm0
            movaps xmm2, [edx]
            movaps xmm3, xmm2
            shufps xmm0, xmm0, 11001001b
            shufps xmm1, xmm1, 11010010b
            shufps xmm2, xmm2, 11010010b
            shufps xmm3, xmm3, 11001001b
            mulps xmm0, xmm2
            mulps xmm1, xmm3
            subps xmm0, xmm1
            movaps [ebx], xmm0
            ret 4
        }
    }

    // 3DNow! version
    MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross(myVec3 *v1, myVec3 *v2, myVec3* pOut)
    {
        __asm
        {
            mov ebx, [esp+4]
            movq mm0, [ecx+4]       //a3 a2
            movd mm1, [ecx]         //0 a1
            movd mm2, [ecx+8]       //0 a3
            movd mm4, [edx+8]       //0 b3
            punpckldq mm4, [edx]    //b1 b3
            movd mm5, [edx+4]       //0 b2
            movq mm6, [edx+4]       //b3 b2
            movd mm7, [edx]         //0 b1
            movq mm3, mm0           //0 a2
            punpckldq mm2, mm1      //a1 a3
            pfmul mm0, mm4
            pfmul mm1, mm5
            pfmul mm2, mm6
            pfmul mm3, mm7
            pfsub mm0, mm2
            pfsub mm1, mm3
            movq [ebx], mm0
            movq [ebx+8], mm1
            femms
            ret 4
        }
    }

    int main(int argc, char** argv)
    {
        myVec3 v1(1.5f, 2.0f, 1.0f);
        myVec3 v2(5.0f, 4.0f, 1.0f), fd;
        D3DXVECTOR3 v3(1.5f, 2.0f, 1.0f), v4(5.0f, 4.0f, 1.0f), v5;
        LARGE_INTEGER liFreq, liStart, liEnd;
        double dTime;

        printf("Profiling custom 3DNow!... ");
        QueryPerformanceFrequency(&liFreq);
        QueryPerformanceCounter(&liStart);
        for(DWORD i=0; i<1000000000; ++i)
        {
            MATHMyVec3Cross(&v1, &v2, &fd);
        }
        QueryPerformanceCounter(&liEnd);
        liEnd.QuadPart -= liStart.QuadPart;
        // Convert the tick delta to milliseconds
        dTime = (double)liEnd.QuadPart / ((double)liFreq.QuadPart / 1000.0);
        // Pass the 64-bit QuadPart member rather than the whole LARGE_INTEGER
        printf("%I64d ticks, %d ms\n",liEnd.QuadPart,(int)dTime);

        printf("Profiling D3DX... ");
        QueryPerformanceFrequency(&liFreq);
        QueryPerformanceCounter(&liStart);
        for(DWORD i=0; i<1000000000; ++i)
        {
            // Output aliases an input so the compiler can't discard the call
            D3DXVec3Cross(&v3, &v4, &v3);
        }
        QueryPerformanceCounter(&liEnd);
        liEnd.QuadPart -= liStart.QuadPart;
        dTime = (double)liEnd.QuadPart / ((double)liFreq.QuadPart / 1000.0);
        printf("%I64d ticks, %d ms\n",liEnd.QuadPart,(int)dTime);
    }

EDIT: I changed __forceinline to inline, just so it's the same as D3DX. But the compiler will almost certainly put it inline anyway.
Also, I'm having problems... If I add a loop to test the SSE version of the code, the 3DNow! code takes forever to complete (I gave up after a minute). I don't know why that is.

Also, as coder said - this is another reason to use D3DX - it's already been debugged and everything :P
If you really want to, try writing some code to profile matrix multiplication, that should be a bit more fair, since the D3DX version uses assembly. However, I don't think the compiler will be able to inline the code, since it's in a .lib. It might be interesting to try though.
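As a cross-check on the numbers in this post, the tick-to-millisecond conversion can be pulled out into a small portable function. This is only a sketch: ticksToMilliseconds and kAssumedFrequency are hypothetical names, and the frequency of 3,579,545 Hz is an assumption (the thread never states it), chosen because it is consistent with both posted tick counts.

```cpp
#include <cstdint>

// Converts a tick delta to milliseconds using the same expression as the
// posted code: ticks / (frequency / 1000.0).
inline double ticksToMilliseconds(std::int64_t ticks, std::int64_t ticksPerSecond) {
    return static_cast<double>(ticks) / (static_cast<double>(ticksPerSecond) / 1000.0);
}

// Assumed counter frequency in Hz. This value is NOT given in the thread;
// it is inferred from the posted tick counts and millisecond figures.
constexpr std::int64_t kAssumedFrequency = 3579545;
```

Under that assumption, 27664784 ticks truncates to 7728 ms and 49081947 ticks to 13711 ms, matching the posted results, for a ratio of roughly 1.8 between the two timings.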

Namethatnobodyelsetook

1,260

March 03, 2005 03:03 PM

The D3DX functions work on non align16 memory, and don't trample the fourth float. Most apps have vec3s that really are vec3 (your fixed pipeline and most shaders expect it as such).

Because D3DX cannot fetch and store 4 floats at once, it's already going to be slower. Add in non-alignment, and again it's going to be slower. You've changed the definition of a Vec3 and written code to take advantage of it. Great. Use it when you can. Comparing it to a function with completely different, more generic goals is absurd.
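The layout difference described here can be made concrete with a short sketch. PackedVec3 stands in for a three-float vector such as D3DXVECTOR3 and PaddedVec3 for the original poster's four-float aligned structure; both names, and the two helper functions, are invented for this illustration. It assumes a 4-byte float.

```cpp
#include <cstddef>

// Three floats, tightly packed: the shape of a "real" vec3 in a vertex buffer.
struct PackedVec3 { float x, y, z; };

// Four floats with 16-byte alignment: the shape of the post's myVec3.
struct alignas(16) PaddedVec3 { float x, y, z, w; };

static_assert(sizeof(PackedVec3) == 12, "assumes 4-byte float and no padding");
static_assert(sizeof(PaddedVec3) == 16, "four floats fill one 16-byte slot");
static_assert(alignof(PaddedVec3) == 16, "every element starts on a 16-byte boundary");

// Bytes that a 16-byte (four-float) store would write past the end of one element.
constexpr std::size_t spillBytes(std::size_t elementSize) {
    return elementSize < 16 ? 16 - elementSize : 0;
}

// Whether element `index` of an array is 16-byte aligned, assuming the array
// itself starts on a 16-byte boundary.
constexpr bool elementIsAligned16(std::size_t elementSize, std::size_t index) {
    return (elementSize * index) % 16 == 0;
}
```

For the packed layout, spillBytes(sizeof(PackedVec3)) is 4, which is the "trampled" fourth float, and only every fourth array element lands on a 16-byte boundary. For the padded layout nothing spills and every element is aligned, which is what lets the SSE routine use movaps for its loads and stores.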

ashade

100

Author

March 03, 2005 03:08 PM

Hey Coder, I'm planning to check the processor's capabilities at the beginning of the program and build a function pointer table that holds, for each function, one of 3 versions of the same code (D3DX code, SSE, 3DNow!). And I was only giving an example with D3DXVec3Cross, which is a simple and very frequently used function... for more complex functions, I think I could get a remarkable gain in speed, let's say, 2 fps...
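A minimal sketch of the function pointer table idea, under stated assumptions: CpuPath, MathTable, makeMathTable and the cross* functions are hypothetical names, the SSE and 3DNow! entries are placeholders that forward to the scalar version, and real capability detection (for example via CPUID) is left to the caller.

```cpp
struct Vec3 { float x, y, z; };

// Signature shared by every implementation in the table.
using CrossFn = void (*)(const Vec3& a, const Vec3& b, Vec3& out);

// Which code path the startup check selected. Detection itself is not shown.
enum class CpuPath { Generic, Sse, ThreeDNow };

// Portable fallback, standing in for the D3DX version.
inline void crossGeneric(const Vec3& a, const Vec3& b, Vec3& out) {
    const Vec3 r{ a.y * b.z - a.z * b.y,
                  a.z * b.x - a.x * b.z,
                  a.x * b.y - a.y * b.x };
    out = r;  // written last so `out` may alias `a` or `b`
}

// Placeholders: a real program would put the SSE and 3DNow! bodies here.
inline void crossSse(const Vec3& a, const Vec3& b, Vec3& out) { crossGeneric(a, b, out); }
inline void cross3DNow(const Vec3& a, const Vec3& b, Vec3& out) { crossGeneric(a, b, out); }

// One pointer per math routine; only the cross product is shown.
struct MathTable { CrossFn cross; };

// Built once at startup from the detected path.
inline MathTable makeMathTable(CpuPath path) {
    switch (path) {
        case CpuPath::Sse:       return MathTable{ &crossSse };
        case CpuPath::ThreeDNow: return MathTable{ &cross3DNow };
        case CpuPath::Generic:   break;
    }
    return MathTable{ &crossGeneric };
}
```

Callers would then write table.cross(a, b, out). One trade-off worth keeping in mind is that a call through a pointer generally cannot be inlined, so for a routine as small as a cross product the call overhead may offset part of the gain.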

Muhammad Haggag

1,358

March 03, 2005 03:14 PM

Using the following code (copied Evil Steve's and modified a bit to prevent optimizing the loop out):

    #include <windows.h>
    #include <stdio.h>
    #include <d3dx9math.h>

    __declspec(align(16)) struct myVec3
    {
        float x, y, z, w;
        myVec3() {x = y = z = w = 0.0f;}
        myVec3(float _x, float _y, float _z)
            {x = _x; y = _y; z = _z; w = 0.0f;}
    };

    #define MATHinl inline

    // SSE version
    MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross2(myVec3 *v1, myVec3 *v2, myVec3* pOut)
    {
        __asm
        {
            mov ebx, [esp+4]
            movaps xmm0, [ecx]
            movaps xmm1, xmm0
            movaps xmm2, [edx]
            movaps xmm3, xmm2
            shufps xmm0, xmm0, 11001001b
            shufps xmm1, xmm1, 11010010b
            shufps xmm2, xmm2, 11010010b
            shufps xmm3, xmm3, 11001001b
            mulps xmm0, xmm2
            mulps xmm1, xmm3
            subps xmm0, xmm1
            movaps [ebx], xmm0
            ret 4
        }
    }

    // 3DNow! version
    MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross(myVec3 *v1, myVec3 *v2, myVec3* pOut)
    {
        __asm
        {
            mov ebx, [esp+4]
            movq mm0, [ecx+4]       //a3 a2
            movd mm1, [ecx]         //0 a1
            movd mm2, [ecx+8]       //0 a3
            movd mm4, [edx+8]       //0 b3
            punpckldq mm4, [edx]    //b1 b3
            movd mm5, [edx+4]       //0 b2
            movq mm6, [edx+4]       //b3 b2
            movd mm7, [edx]         //0 b1
            movq mm3, mm0           //0 a2
            punpckldq mm2, mm1      //a1 a3
            pfmul mm0, mm4
            pfmul mm1, mm5
            pfmul mm2, mm6
            pfmul mm3, mm7
            pfsub mm0, mm2
            pfsub mm1, mm3
            movq [ebx], mm0
            movq [ebx+8], mm1
            femms
            ret 4
        }
    }

    int main(int argc, char** argv)
    {
        myVec3 v1(1.5f, 2.0f, 1.0f);
        myVec3 v2(5.0f, 4.0f, 1.0f), fd;
        D3DXVECTOR3 v3(1.5f, 2.0f, 1.0f), v4(5.0f, 4.0f, 1.0f), v5;
        LARGE_INTEGER liFreq, liStart, liEnd;
        double dTime;

        printf("Profiling... ");
        QueryPerformanceFrequency(&liFreq);
        QueryPerformanceCounter(&liStart);
        for(DWORD i=0; i<1000000000; ++i)
        {
            // Each call feeds the next, so the loop can't be optimized out.
            // Switch the comments to profile the other version.
            //MATHMyVec3Cross(&v1, &v2, &fd);
            //MATHMyVec3Cross(&fd, &v2, &v1);
            D3DXVec3Cross(&v5, &v4, &v3);
            D3DXVec3Cross(&v4, &v5, &v3);
        }
        QueryPerformanceCounter(&liEnd);
        liEnd.QuadPart -= liStart.QuadPart;
        // Convert the tick delta to milliseconds
        dTime = (double)liEnd.QuadPart / ((double)liFreq.QuadPart / 1000.0);
        // Pass the 64-bit QuadPart member rather than the whole LARGE_INTEGER
        printf("%I64d ticks, %d ms\n",liEnd.QuadPart,(int)dTime);
        return 0;
    }

On an AthlonXP 2000, 512 MB RAM:
3DNow!: 39996 ms
D3DX: 27224 ms

i.e. D3DX took 68% of the time your custom functions took. Compiler is VC++ 2005 Express Beta, release build, all optimizations on (except Global Optimization), Favor Fast Code, etc.

Tough luck.

[edit]Clarifying the last paragraph

[Edited by - Coder on March 3, 2005 3:14:21 PM]

intrest86

742

March 03, 2005 03:15 PM

ASSUMING... that you are spending all your time doing vector math. There is a lot more going on in any game that probably takes up much more of the processing time than this. Try profiling a full game using the two different functions, and then maybe the results will be more relevant.

Until then, it was a fun project that was a premature optimization.

Turing Machines are better than C++ any day ^_~

This topic is closed to new replies.
