ashade

d3dx library sucks! please read this!


Everyone says the D3DX library functions are very fast, so I decided to make a simple test.

1) First, I made my own vector structure:

__declspec(align(16)) struct myVec3
{
    float x, y, z, w;

    myVec3() { x = y = z = w = 0.0f; }
    myVec3(float x, float y, float z)
    { this->x = x; this->y = y; this->z = z; this->w = 0.0f; }
};

Note that it has 4 members, but the last one will always be zero!

2) Then I made these two functions:

#define MATHinl __forceinline

MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross2(myVec3 *v1, myVec3 *v2, myVec3* pOut)
{
    __asm
    {
        mov ebx, [esp+4]    // pOut (v1 is in ecx, v2 in edx); note: ebx is callee-saved and is not restored here

        /*
        0 a1 a3 a2 -- 0 b2 b1 b3
        0 a2 a1 a3 -- 0 b1 b3 b2
        */

        movaps xmm0, [ecx]
        movaps xmm1, xmm0
        movaps xmm2, [edx]
        movaps xmm3, xmm2

        shufps xmm0, xmm0, 11001001b
        shufps xmm1, xmm1, 11010010b
        shufps xmm2, xmm2, 11010010b
        shufps xmm3, xmm3, 11001001b

        mulps xmm0, xmm2
        mulps xmm1, xmm3

        subps xmm0, xmm1
        movaps [ebx], xmm0

        ret 4
    }
}

MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross(myVec3 *v1, myVec3 *v2, myVec3* pOut)
{
    __asm
    {
        mov ebx, [esp+4]    // pOut (v1 is in ecx, v2 in edx); note: ebx is callee-saved and is not restored here

        /*
        0 a1 a3 a2 -- 0 b2 b1 b3
        0 a2 a1 a3 -- 0 b1 b3 b2
        */

        movq mm0, [ecx+4]       //a3 a2
        movd mm1, [ecx]         //0 a1
        movd mm2, [ecx+8]       //0 a3

        movd mm4, [edx+8]       //0 b3
        punpckldq mm4, [edx]    //b1 b3
        movd mm5, [edx+4]       //0 b2
        movq mm6, [edx+4]       //b3 b2
        movd mm7, [edx]         //0 b1

        movq mm3, mm0           //0 a2
        punpckldq mm2, mm1      //a1 a3

        pfmul mm0, mm4
        pfmul mm1, mm5
        pfmul mm2, mm6
        pfmul mm3, mm7

        pfsub mm0, mm2
        pfsub mm1, mm3

        movq [ebx], mm0
        movq [ebx+8], mm1

        femms

        ret 4
    }
}

The first one uses SSE and the second uses 3DNow!.

3) Then, inside my main() function, I made the test like this:

myVec3 v1(1.5f, 2.0f, 1.0f);
myVec3 v2(5.0f, 4.0f, 1.0f), fd;
D3DXVECTOR3 v3(1.5f, 2.0f, 1.0f), v4(5.0f, 4.0f, 1.0f), v5;

unsigned long i1 = timeGetTime();
for(unsigned int i = 0; i<100000000; i++)
{
    MATHMyVec3Cross(&v1, &v2, &fd);
    //MATHMyVec3Cross2(&v1, &v2, &fd);
    //D3DXVec3Cross(&v5, &v4, &v3);
}
unsigned long i2 = timeGetTime();
i2 -= i1;

Switch the comments to test each function.

I tested on my Athlon XP 1700+ with 256 MB of memory. The results were amazing:
D3DX: 17650 milliseconds
MATHMyVec3Cross: 1250 milliseconds
MATHMyVec3Cross2: 1700 milliseconds
The 3DNow! version is about 14 times faster than the D3DX one, and the SSE version is about 10 times faster! I want you to test and post your results and machine specs here... and don't forget that the 3DNow! version only works on AMD processors! Hey, compile with VC++ 2003!
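For readers who want to try the SSE shuffle pattern without inline assembly, here is a minimal sketch using compiler intrinsics. It is not the code that was benchmarked in this thread. The names `Vec4` and `crossSse` are hypothetical, and the sketch assumes an x86 or x86-64 target with SSE enabled and a C++11 compiler (for `alignas`).

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_shuffle_ps, ...

// Hypothetical 16-byte-aligned vector mirroring myVec3: w is expected to be 0.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Cross product with the same shuffle constants as the assembly version:
//   _MM_SHUFFLE(3, 0, 2, 1) == 11001001b  -> (y, z, x, w)
//   _MM_SHUFFLE(3, 1, 0, 2) == 11010010b  -> (z, x, y, w)
inline void crossSse(const Vec4* a, const Vec4* b, Vec4* out) {
    // _mm_load_ps / _mm_store_ps need 16-byte alignment, just like movaps.
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    const __m128 va = _mm_load_ps(&a->x);
    const __m128 vb = _mm_load_ps(&b->x);

    const __m128 aYzx = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 aZxy = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 1, 0, 2));
    const __m128 bZxy = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 1, 0, 2));
    const __m128 bYzx = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 0, 2, 1));

    // (a.yzx * b.zxy) - (a.zxy * b.yzx); the w lane stays 0 when both inputs have w == 0.
    const __m128 r = _mm_sub_ps(_mm_mul_ps(aYzx, bZxy), _mm_mul_ps(aZxy, bYzx));
    _mm_store_ps(&out->x, r);
}
```

With the vectors from the test above, (1.5, 2, 1) and (5, 4, 1), this should produce (-2, 3.5, -4). Unlike a naked function, an intrinsic-based function can actually be inlined by the compiler, and it leaves register allocation to the compiler.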

That *was* a release build, wasn't it? And you were linking with d3dx9.lib, not d3dx9d.lib?
Also, there's bound to be some overhead in using the D3DX functions, since they'll work on any machine, even one without SSE or 3DNow!. It also saves a lot of time to use the D3DX functions rather than write your own.

EDIT: Also, D3DX functions aren't __forceinline'd. I'm testing your code just now on my Athlon XP 2600+ with 512 MB RAM. I'll post my results shortly.

You have to understand that they are fast, but as usual, Microsoft doesn't use assembly except for very critical functions...

There is, and always will be, a faster way of doing everything... no doubt.

The D3DX vector functions are __forceinline! And I am linking to d3dx9.lib! What's more, every modern processor has at least SSE. No one uses a Pentium II anymore, and if you do, I'm afraid you can't run a heavy (computationally expensive) 3D application.

1) No one said that the D3DX functions are unbeatable. You can beat them, but the performance benefit is unlikely to justify the time spent on it. (How much time have you spent on just those 2 functions?)

2) D3DX functions select the optimum version for the current hardware (i.e. SSE, 3DNow!, SSE2). Your code doesn't; it assumes support for something and uses it.

3) It seems that you tested in a debug build, because I think even a basic optimizing compiler would have optimized out the whole loop (because you're not doing anything with v5).
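Point 3 can be illustrated with a small timing harness that keeps the optimizer from discarding the loop. This is only a sketch under stated assumptions: `Vec3`, `cross`, `BenchResult` and `benchCross` are hypothetical names, and `std::chrono::steady_clock` is swapped in for `timeGetTime()` so the example is not tied to Windows. The second operand is the unit Z axis so that the values stay bounded however many iterations run.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

struct Vec3 { float x, y, z; };

inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return Vec3{a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x};
}

struct BenchResult {
    double elapsedMs;  // wall-clock time of the loop
    Vec3 last;         // final value, returned so the work is observable
};

BenchResult benchCross(std::uint32_t iterations) {
    Vec3 a{1.5f, 2.0f, 1.0f};
    const Vec3 b{0.0f, 0.0f, 1.0f};  // unit Z: each step is a 90 degree rotation in XY

    const auto start = std::chrono::steady_clock::now();
    for (std::uint32_t i = 0; i < iterations; ++i) {
        a = cross(a, b);  // each iteration depends on the previous result
    }
    const auto stop = std::chrono::steady_clock::now();

    // Writing to a volatile is an observable side effect, so the compiler
    // cannot treat the loop's result as dead code.
    volatile float sink = a.x + a.y + a.z;
    (void)sink;

    const double ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    assert(ms >= 0.0);
    return BenchResult{ms, a};
}
```

A call such as `benchCross(100000000)` would then time the scalar version. The same harness shape applies to any of the functions in this thread, as long as each call's output feeds something the compiler has to keep.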

D3DX functions are just declared as inline, so the compiler can choose not to inline them if it wants. And as Syranide said, most of the vector functions don't use assembly at all. The matrix functions do, which is why they're in the .lib (So actually I was wrong, it doesn't matter what version of the d3dx lib you link to, it's not used).

My results are:
Custom 3DNow!... 27664784 ticks, 7728 ms
D3DX... 49081947 ticks, 13711 ms
So the custom version is still nearly twice as fast (about 1.8x).

I was using QueryPerformanceCounter() for profiling. What compiler were you using? I had to edit the code to stop the compiler optimizing the D3DX version away... Here's my complete code:

#include <windows.h>
#include <stdio.h>
#include <d3dx9math.h>

__declspec(align(16)) struct myVec3
{
    float x, y, z, w;

    myVec3() {x = y = z = w = 0.0f;}
    myVec3(float _x, float _y, float _z)
    {x = _x; y = _y; z = _z; w = 0.0f;}
};

#define MATHinl inline

// SSE version. With __fastcall, v1 arrives in ecx, v2 in edx and pOut on the stack.
MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross2(myVec3 *v1, myVec3 *v2, myVec3* pOut)
{
    __asm
    {
        mov ebx, [esp+4]    // pOut; note: ebx is callee-saved and is not restored here

        movaps xmm0, [ecx]
        movaps xmm1, xmm0
        movaps xmm2, [edx]
        movaps xmm3, xmm2

        shufps xmm0, xmm0, 11001001b
        shufps xmm1, xmm1, 11010010b
        shufps xmm2, xmm2, 11010010b
        shufps xmm3, xmm3, 11001001b

        mulps xmm0, xmm2
        mulps xmm1, xmm3

        subps xmm0, xmm1
        movaps [ebx], xmm0

        ret 4
    }
}

// 3DNow! version (AMD only).
MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross(myVec3 *v1, myVec3 *v2, myVec3* pOut)
{
    __asm
    {
        mov ebx, [esp+4]    // pOut; note: ebx is callee-saved and is not restored here

        movq mm0, [ecx+4]       //a3 a2
        movd mm1, [ecx]         //0 a1
        movd mm2, [ecx+8]       //0 a3

        movd mm4, [edx+8]       //0 b3
        punpckldq mm4, [edx]    //b1 b3
        movd mm5, [edx+4]       //0 b2
        movq mm6, [edx+4]       //b3 b2
        movd mm7, [edx]         //0 b1

        movq mm3, mm0           //0 a2
        punpckldq mm2, mm1      //a1 a3

        pfmul mm0, mm4
        pfmul mm1, mm5
        pfmul mm2, mm6
        pfmul mm3, mm7

        pfsub mm0, mm2
        pfsub mm1, mm3

        movq [ebx], mm0
        movq [ebx+8], mm1

        femms

        ret 4
    }
}

int main(int argc, char** argv)
{
    myVec3 v1(1.5f, 2.0f, 1.0f);
    myVec3 v2(5.0f, 4.0f, 1.0f), fd;
    D3DXVECTOR3 v3(1.5f, 2.0f, 1.0f), v4(5.0f, 4.0f, 1.0f), v5;
    LARGE_INTEGER liFreq, liStart, liEnd;
    double dTime;

    printf("Profiling custom 3DNow!... ");
    QueryPerformanceFrequency(&liFreq);
    QueryPerformanceCounter(&liStart);
    for(DWORD i=0; i<1000000000; ++i)
    {
        MATHMyVec3Cross(&v1, &v2, &fd);
    }
    QueryPerformanceCounter(&liEnd);
    liEnd.QuadPart -= liStart.QuadPart;
    // Convert ticks to milliseconds: ticks / (ticks per second / 1000)
    dTime = (double)liEnd.QuadPart / ((double)liFreq.QuadPart / 1000.0);
    // Pass the 64-bit QuadPart member, not the LARGE_INTEGER union itself
    printf("%I64d ticks, %d ms\n", liEnd.QuadPart, (int)dTime);

    printf("Profiling D3DX... ");
    QueryPerformanceFrequency(&liFreq);
    QueryPerformanceCounter(&liStart);
    for(DWORD i=0; i<1000000000; ++i)
    {
        // The output feeds back into v3 so the compiler cannot drop the call
        D3DXVec3Cross(&v3, &v4, &v3);
    }
    QueryPerformanceCounter(&liEnd);
    liEnd.QuadPart -= liStart.QuadPart;
    dTime = (double)liEnd.QuadPart / ((double)liFreq.QuadPart / 1000.0);
    printf("%I64d ticks, %d ms\n", liEnd.QuadPart, (int)dTime);
}



EDIT: I changed __forceinline to inline, just so it's the same as D3DX. But the compiler will almost certainly put it inline anyway.
Also, I'm having problems... If I add a loop to test the SSE version of the code, the 3DNow! code takes forever to complete (I gave up after a minute). I don't know why that is.

Also, as Coder said, this is another reason to use D3DX: it's already been debugged and everything :P
If you really want to, try writing some code to profile matrix multiplication. That should be a bit fairer, since the D3DX version uses assembly. However, I don't think the compiler will be able to inline the code, since it's in a .lib. It might be interesting to try though.

The D3DX functions work on memory that isn't 16-byte aligned, and they don't trample the fourth float. Most apps have vec3s that really are vec3s (the fixed pipeline and most shaders expect them as such).

Because D3DX cannot fetch and store 4 floats at once, it's already going to be slower. Add in the lack of alignment, and again it's going to be slower. You've changed the definition of a Vec3 and written code to take advantage of it. Great. Use it when you can. Comparing it to a function with completely different, more generic goals is absurd.
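The layout difference described here can be made concrete with a short sketch. The names `PackedVec3`, `PaddedVec3`, `isAligned16` and `crossScalar` are hypothetical. `PackedVec3` stands in for a plain three-float vector such as D3DXVECTOR3, and `PaddedVec3` for the 16-byte myVec3 from the first post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct PackedVec3 { float x, y, z; };                  // three floats, no padding
struct alignas(16) PaddedVec3 { float x, y, z, w; };   // four floats, 16-byte aligned

static_assert(sizeof(PackedVec3) == 12, "packed vec3 is 12 bytes");
static_assert(sizeof(PaddedVec3) == 16, "padded vec3 is 16 bytes");
static_assert(alignof(PaddedVec3) == 16, "padded vec3 is 16-byte aligned");

// True if p could legally be the operand of an aligned 16-byte load (movaps).
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Generic scalar cross product: touches exactly three floats per operand and
// works at any address a float may live at. The temporary lets out alias a or b.
inline void crossScalar(const PackedVec3* a, const PackedVec3* b, PackedVec3* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    const PackedVec3 r = {a->y * b->z - a->z * b->y,
                          a->z * b->x - a->x * b->z,
                          a->x * b->y - a->y * b->x};
    *out = r;
}
```

In a tightly packed array of `PackedVec3`, elements sit 12 bytes apart, so only every fourth element can land on a 16-byte boundary. That is why a routine with these goals cannot simply use aligned 4-float loads and stores, and why it must not write a fourth float: whatever follows the vector in memory belongs to something else.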

Hey Coder, I'm planning to check the processor's capabilities at the beginning of the program and build a function pointer table that holds, for each function, one of 3 versions of the same code (D3DX, SSE, 3DNow!). I was only giving an example with the D3DX vec3 cross function, which is a simple and heavily used function... for more complex functions, I think I could get a noticeable speed gain, let's say, 2 fps...
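A minimal sketch of that startup-dispatch idea, assuming an x86 or x86-64 target, might look like the following. All names here (`Vec4`, `CrossFn`, `cpuHasSse`, `MathTable`, `buildMathTable`) are hypothetical. Only an SSE path and a scalar fallback are shown; a 3DNow! path would need its own probe and is left out.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86 or x86-64 target
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#endif

// Hypothetical padded vector mirroring myVec3 (w stays 0).
struct alignas(16) Vec4 { float x, y, z, w; };

using CrossFn = void (*)(const Vec4* a, const Vec4* b, Vec4* out);

// Portable fallback: plain scalar arithmetic.
static void crossScalar(const Vec4* a, const Vec4* b, Vec4* out) {
    const Vec4 r = {a->y * b->z - a->z * b->y,
                    a->z * b->x - a->x * b->z,
                    a->x * b->y - a->y * b->x,
                    0.0f};
    *out = r;
}

// SSE path: same shuffle pattern as MATHMyVec3Cross2.
static void crossSse(const Vec4* a, const Vec4* b, Vec4* out) {
    const __m128 va = _mm_load_ps(&a->x);
    const __m128 vb = _mm_load_ps(&b->x);
    const __m128 aYzx = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 aZxy = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 1, 0, 2));
    const __m128 bYzx = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 bZxy = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 1, 0, 2));
    _mm_store_ps(&out->x,
                 _mm_sub_ps(_mm_mul_ps(aYzx, bZxy), _mm_mul_ps(aZxy, bYzx)));
}

// Hypothetical probe: true if the CPU reports SSE support.
static bool cpuHasSse() {
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);                    // CPUID leaf 1: feature flags
    return (regs[3] & (1 << 25)) != 0;   // EDX bit 25 = SSE
#elif defined(__GNUC__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse") != 0;
#else
    return false;                        // unknown compiler: use the fallback
#endif
}

// One slot per math routine; filled once at startup.
struct MathTable {
    CrossFn cross;
};

MathTable buildMathTable() {
    MathTable table;
    table.cross = cpuHasSse() ? crossSse : crossScalar;
    assert(table.cross != nullptr);
    return table;
}
```

Usage would be to call `buildMathTable()` once at startup and then go through `table.cross(&a, &b, &out)`. One design consequence worth weighing: a call through a function pointer generally cannot be inlined, so for a routine as small as a cross product the indirect call may cost more than the SIMD path saves.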

Using the following code (copied Evil Steve's and modified a bit to prevent optimizing the loop out):
#include <windows.h>
#include <stdio.h>
#include <d3dx9math.h>

__declspec(align(16)) struct myVec3
{
    float x, y, z, w;

    myVec3() {x = y = z = w = 0.0f;}
    myVec3(float _x, float _y, float _z)
    {x = _x; y = _y; z = _z; w = 0.0f;}
};

#define MATHinl inline

// SSE version. With __fastcall, v1 arrives in ecx, v2 in edx and pOut on the stack.
MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross2(myVec3 *v1, myVec3 *v2, myVec3* pOut)
{
    __asm
    {
        mov ebx, [esp+4]    // pOut; note: ebx is callee-saved and is not restored here

        movaps xmm0, [ecx]
        movaps xmm1, xmm0
        movaps xmm2, [edx]
        movaps xmm3, xmm2

        shufps xmm0, xmm0, 11001001b
        shufps xmm1, xmm1, 11010010b
        shufps xmm2, xmm2, 11010010b
        shufps xmm3, xmm3, 11001001b

        mulps xmm0, xmm2
        mulps xmm1, xmm3

        subps xmm0, xmm1
        movaps [ebx], xmm0

        ret 4
    }
}

// 3DNow! version (AMD only).
MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross(myVec3 *v1, myVec3 *v2, myVec3* pOut)
{
    __asm
    {
        mov ebx, [esp+4]    // pOut; note: ebx is callee-saved and is not restored here

        movq mm0, [ecx+4]       //a3 a2
        movd mm1, [ecx]         //0 a1
        movd mm2, [ecx+8]       //0 a3

        movd mm4, [edx+8]       //0 b3
        punpckldq mm4, [edx]    //b1 b3
        movd mm5, [edx+4]       //0 b2
        movq mm6, [edx+4]       //b3 b2
        movd mm7, [edx]         //0 b1

        movq mm3, mm0           //0 a2
        punpckldq mm2, mm1      //a1 a3

        pfmul mm0, mm4
        pfmul mm1, mm5
        pfmul mm2, mm6
        pfmul mm3, mm7

        pfsub mm0, mm2
        pfsub mm1, mm3

        movq [ebx], mm0
        movq [ebx+8], mm1

        femms

        ret 4
    }
}

int main(int argc, char** argv)
{
    myVec3 v1(1.5f, 2.0f, 1.0f);
    myVec3 v2(5.0f, 4.0f, 1.0f), fd;
    D3DXVECTOR3 v3(1.5f, 2.0f, 1.0f), v4(5.0f, 4.0f, 1.0f), v5;
    LARGE_INTEGER liFreq, liStart, liEnd;
    double dTime;

    printf("Profiling... ");
    QueryPerformanceFrequency(&liFreq);
    QueryPerformanceCounter(&liStart);
    for(DWORD i=0; i<1000000000; ++i)
    {
        // Each pair feeds one call's output into the next call's input,
        // so the compiler cannot optimize the loop away.
        //MATHMyVec3Cross(&v1, &v2, &fd);
        //MATHMyVec3Cross(&fd, &v2, &v1);
        D3DXVec3Cross(&v5, &v4, &v3);
        D3DXVec3Cross(&v4, &v5, &v3);
    }
    QueryPerformanceCounter(&liEnd);
    liEnd.QuadPart -= liStart.QuadPart;
    // Convert ticks to milliseconds: ticks / (ticks per second / 1000)
    dTime = (double)liEnd.QuadPart / ((double)liFreq.QuadPart / 1000.0);
    // Pass the 64-bit QuadPart member, not the LARGE_INTEGER union itself
    printf("%I64d ticks, %d ms\n", liEnd.QuadPart, (int)dTime);

    return 0;
}



On an AthlonXP 2000, 512 MB RAM:
3DNow!: 39996 ms
D3DX: 27224 ms

i.e. D3DX took 68% of the time your custom functions took. The compiler is VC++ 2005 Express Beta, release build, all optimizations on (except Global Optimization), Favor Fast Code, etc.

Tough luck.

[edit]Clarifying the last paragraph

[Edited by - Coder on March 3, 2005 3:14:21 PM]

ASSUMING... that you are spending all your time doing vector math. There is a lot more going on in any game that probably takes up much more of the processing time than this. Try profiling a full game using the two different functions, and then maybe the results will be more relevant.

Until then, it was a fun project that was a premature optimization.
