d3dx library sucks! please read this!
Everyone says the D3DX library functions are very fast, so I decided to make a simple test:
1) first, I made my own vector structure:
__declspec(align(16)) struct myVec3 {
    float x, y, z, w;
    myVec3() {
        x = y = z = w = 0.0f;
    }
    myVec3(float x, float y, float z)
    {
        this->x = x; this->y = y; this->z = z; this->w = 0.0f;
    }
};
note that it has 4 members, but the last one will always be zero!!!
then I made these two functions:
#define MATHinl __forceinline
MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross2(myVec3 *v1, myVec3 *v2, myVec3* pOut) {
    __asm {
        mov ebx, [esp+4]            // pOut (third argument, passed on the stack)
        /*
        0 a1 a3 a2 -- 0 b2 b1 b3
        0 a2 a1 a3 -- 0 b1 b3 b2
        */
        movaps xmm0, [ecx]          // v1
        movaps xmm1, xmm0
        movaps xmm2, [edx]          // v2
        movaps xmm3, xmm2
        shufps xmm0, xmm0, 11001001b
        shufps xmm1, xmm1, 11010010b
        shufps xmm2, xmm2, 11010010b
        shufps xmm3, xmm3, 11001001b
        mulps xmm0, xmm2
        mulps xmm1, xmm3
        subps xmm0, xmm1
        movaps [ebx], xmm0
        ret 4
    }
}
MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross(myVec3 *v1, myVec3 *v2, myVec3* pOut) {
    __asm {
        mov ebx, [esp+4]            // pOut (third argument, passed on the stack)
        /*
        0 a1 a3 a2 -- 0 b2 b1 b3
        0 a2 a1 a3 -- 0 b1 b3 b2
        */
        movq mm0, [ecx+4]           //a3 a2
        movd mm1, [ecx]             //0 a1
        movd mm2, [ecx+8]           //0 a3
        movd mm4, [edx+8]           //0 b3
        punpckldq mm4, [edx]        //b1 b3
        movd mm5, [edx+4]           //0 b2
        movq mm6, [edx+4]           //b3 b2
        movd mm7, [edx]             //0 b1
        movq mm3, mm0               //a3 a2 (only a2 matters: the high half of mm7 is 0)
        punpckldq mm2, mm1          //a1 a3
        pfmul mm0, mm4
        pfmul mm1, mm5
        pfmul mm2, mm6
        pfmul mm3, mm7
        pfsub mm0, mm2
        pfsub mm1, mm3
        movq [ebx], mm0
        movq [ebx+8], mm1
        femms
        ret 4
    }
}
the first one uses sse and the second 3dnow
then, inside my main() function, I made the test like this:
myVec3 v1(1.5f, 2.0f, 1.0f);
myVec3 v2(5.0f, 4.0f, 1.0f), fd;
D3DXVECTOR3 v3(1.5f, 2.0f, 1.0f), v4(5.0f, 4.0f, 1.0f), v5;
unsigned long i1 = timeGetTime();
for(unsigned int i = 0; i<100000000; i++) {
MATHMyVec3Cross(&v1, &v2, &fd);
//MATHMyVec3Cross2(&v1, &v2, &fd);
//D3DXVec3Cross(&v5, &v4, &v3);
}
unsigned long i2 = timeGetTime();
i2-=i1;
Switch the comments to test each function in turn. I tested on my Athlon XP 1700+ with 256 MB of memory. The results were amazing:
d3dx: 17650 milliseconds;
MATHMyVec3Cross: 1250 milliseconds;
MATHMyVec3Cross2: 1700 milliseconds.
The 3DNow! version is about 14 times faster than the d3dx one!! The SSE version is about 10 times faster than the d3dx one!! I want you to test and post your results and your machine specs here... and don't forget that the 3DNow! version only works on AMD processors!!!
HEY, COMPILE ON VC++ 2003!!
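For anyone who can't build the inline assembly, the shuffle/multiply/subtract sequence in MATHMyVec3Cross2 can be sketched with SSE intrinsics instead. This is only an illustration, not the original poster's code: the Vec4 struct and the crossSse name are made up here, and it assumes an x86 target with SSE and a C++11 compiler.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_shuffle_ps, ...

// Hypothetical 16-byte-aligned vector mirroring myVec3: w is expected to be 0.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Cross product with the same pattern as the hand-written SSE routine:
// (a.yzx * b.zxy) - (a.zxy * b.yzx)
inline Vec4 crossSse(const Vec4& a, const Vec4& b) {
    const __m128 va = _mm_load_ps(&a.x);  // aligned load, like movaps
    const __m128 vb = _mm_load_ps(&b.x);

    // _MM_SHUFFLE(3, 0, 2, 1) == 11001001b -> lanes (y, z, x, w)
    // _MM_SHUFFLE(3, 1, 0, 2) == 11010010b -> lanes (z, x, y, w)
    const __m128 aYzx = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 aZxy = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 1, 0, 2));
    const __m128 bZxy = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 1, 0, 2));
    const __m128 bYzx = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 0, 2, 1));

    Vec4 out;
    _mm_store_ps(&out.x, _mm_sub_ps(_mm_mul_ps(aYzx, bZxy),
                                    _mm_mul_ps(aZxy, bYzx)));
    return out;
}
```

Unlike a naked assembly function, a version like this leaves register allocation and inlining to the compiler, so it may behave differently in a benchmark.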
That *was* a release build, wasn't it? And you were linking with d3dx9.lib, not d3dx9d.lib?
Also, there's bound to be some overhead in using the D3DX functions, since they'll work on any machine, even one without SSE or 3DNow!. And it saves a lot of time to just use the D3DX functions rather than write your own.
EDIT: Also, D3DX functions aren't __forceinline'd. I'm testing your code just now on my Athlon 2600XP, 512Mb RAM. I'll post my results shortly.
You have to understand they are fast, but as usual, Microsoft doesn't use assembly except for very critical functions...
There is always and will always be a faster way of doing everything... no doubt.
The D3DX vector functions are __forceinline! And I am linking to d3dx9.lib! What's more, every modern processor has at least SSE. No one uses a Pentium II anymore, and if you do, I'm afraid you can't run a heavy (computationally expensive) 3D application.
1) No one said that the D3DX functions are unbeatable, you can beat them, but the performance benefit is unlikely to justify the time spent on it (How much time have you spent on just those 2 functions?)
2) D3DX functions select the optimum version for the current hardware (e.g. SSE, 3DNow!, SSE2). Your code doesn't; it assumes support for one instruction set and uses it.
3) It seems that you tested in a debug build, because I think even a dumb compiler would've optimized out the whole loop in a release build (you're not doing anything with v5).
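Point 3 is easy to trip over in any micro-benchmark. As a rough sketch (not the code used in this thread), a timing loop can be kept alive by varying the input each iteration and consuming the result. The Vec3, cross and benchCross names here are hypothetical, and it uses std::chrono rather than the Windows timers so it builds anywhere.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical plain 3-float vector and scalar cross product.
struct Vec3 { float x, y, z; };

inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

struct BenchResult {
    double milliseconds;  // elapsed wall-clock time for the loop
    float checksum;       // depends on every iteration's result
};

inline BenchResult benchCross(std::uint32_t iterations) {
    Vec3 a{1.5f, 2.0f, 1.0f};
    const Vec3 b{5.0f, 4.0f, 1.0f};
    float checksum = 0.0f;

    const auto start = std::chrono::steady_clock::now();
    for (std::uint32_t i = 0; i < iterations; ++i) {
        // Vary the input so the call is not loop-invariant...
        a.x = static_cast<float>(i & 3u);
        const Vec3 c = cross(a, b);
        // ...and consume the output so the call is not dead code.
        checksum += c.x + c.y + c.z;
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return { elapsed.count(), checksum };
}
```

Printing or otherwise using the returned checksum is what stops an optimizing compiler from discarding the loop. The extra add per iteration is part of what gets measured, so the same harness should wrap every variant being compared.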
D3DX functions are just declared as inline, so the compiler can choose not to inline them if it wants. And as Syranide said, most of the vector functions don't use assembly at all. The matrix functions do, which is why they're in the .lib (So actually I was wrong, it doesn't matter what version of the d3dx lib you link to, it's not used).
My results are:
Custom 3DNow!... 27664784 ticks, 7728 ms
D3DX... 49081947 ticks, 13711 ms
So, your version is still nearly twice as fast.
I was using QueryPerformanceCounter() for profiling. What compiler were you using? I had to edit the code to stop the compiler optimizing the D3DX version away... Here's my complete code:
#include <windows.h>
#include <stdio.h>
#include <d3dx9math.h>

__declspec(align(16)) struct myVec3
{
    float x, y, z, w;
    myVec3() {x = y = z = w = 0.0f;}
    myVec3(float _x, float _y, float _z) {x = _x; y = _y; z = _z; w = 0.0f;}
};

#define MATHinl inline

MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross2(myVec3 *v1, myVec3 *v2, myVec3* pOut)
{
    __asm {
        mov ebx, [esp+4]
        movaps xmm0, [ecx]
        movaps xmm1, xmm0
        movaps xmm2, [edx]
        movaps xmm3, xmm2
        shufps xmm0, xmm0, 11001001b
        shufps xmm1, xmm1, 11010010b
        shufps xmm2, xmm2, 11010010b
        shufps xmm3, xmm3, 11001001b
        mulps xmm0, xmm2
        mulps xmm1, xmm3
        subps xmm0, xmm1
        movaps [ebx], xmm0
        ret 4
    }
}

MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross(myVec3 *v1, myVec3 *v2, myVec3* pOut)
{
    __asm {
        mov ebx, [esp+4]
        movq mm0, [ecx+4]           //a3 a2
        movd mm1, [ecx]             //0 a1
        movd mm2, [ecx+8]           //0 a3
        movd mm4, [edx+8]           //0 b3
        punpckldq mm4, [edx]        //b1 b3
        movd mm5, [edx+4]           //0 b2
        movq mm6, [edx+4]           //b3 b2
        movd mm7, [edx]             //0 b1
        movq mm3, mm0               //a3 a2 (only a2 matters: the high half of mm7 is 0)
        punpckldq mm2, mm1          //a1 a3
        pfmul mm0, mm4
        pfmul mm1, mm5
        pfmul mm2, mm6
        pfmul mm3, mm7
        pfsub mm0, mm2
        pfsub mm1, mm3
        movq [ebx], mm0
        movq [ebx+8], mm1
        femms
        ret 4
    }
}

int main(int argc, char** argv)
{
    myVec3 v1(1.5f, 2.0f, 1.0f);
    myVec3 v2(5.0f, 4.0f, 1.0f), fd;
    D3DXVECTOR3 v3(1.5f, 2.0f, 1.0f), v4(5.0f, 4.0f, 1.0f), v5;
    LARGE_INTEGER liFreq, liStart, liEnd;
    double dTime;

    printf("Profiling custom 3DNow!... ");
    QueryPerformanceFrequency(&liFreq);
    QueryPerformanceCounter(&liStart);
    for(DWORD i=0; i<1000000000; ++i)
    {
        MATHMyVec3Cross(&v1, &v2, &fd);
    }
    QueryPerformanceCounter(&liEnd);
    liEnd.QuadPart -= liStart.QuadPart;
    dTime = (double)liEnd.QuadPart / ((double)liFreq.QuadPart / 1000.0);
    // Pass the 64-bit QuadPart, not the LARGE_INTEGER union itself
    printf("%I64d ticks, %d ms\n", liEnd.QuadPart, (int)dTime);

    printf("Profiling D3DX... ");
    QueryPerformanceFrequency(&liFreq);
    QueryPerformanceCounter(&liStart);
    for(DWORD i=0; i<1000000000; ++i)
    {
        // Writing back into v3 keeps the compiler from optimizing the call away
        D3DXVec3Cross(&v3, &v4, &v3);
    }
    QueryPerformanceCounter(&liEnd);
    liEnd.QuadPart -= liStart.QuadPart;
    dTime = (double)liEnd.QuadPart / ((double)liFreq.QuadPart / 1000.0);
    printf("%I64d ticks, %d ms\n", liEnd.QuadPart, (int)dTime);
}
EDIT: I changed __forceinline to inline, just so it's the same as D3DX. But the compiler will almost certainly put it inline anyway.
Also, I'm having problems... If I add a loop to test the SSE version of the code, the 3DNow! code takes forever to complete (I gave up after a minute). I don't know why that is.
Also, as coder said - this is another reason to use D3DX - it's already been debugged and everything :P
If you really want to, try writing some code to profile matrix multiplication, that should be a bit more fair, since the D3DX version uses assembly. However, I don't think the compiler will be able to inline the code, since it's in a .lib. It might be interesting to try though.
The D3DX functions work on memory that isn't 16-byte aligned, and don't trample the fourth float. Most apps have vec3s that really are vec3s (the fixed pipeline and most shaders expect them as such).
Because D3DX cannot fetch and store 4 floats at once, it's already going to be slower. Add in non-alignment, and again it's going to be slower. You've changed the definition of a Vec3 and written code to take advantage of it. Great. Use it when you can. Comparing it to a function with completely different, more generic goals is absurd.
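To make the layout difference concrete, here is a small sketch with hypothetical PackedVec3 and PaddedVec3 types (stand-ins for a 12-byte D3DXVECTOR3-style struct and the 16-byte myVec3, not the real definitions). It assumes 4-byte floats and a C++11 compiler.

```cpp
#include <cstddef>
#include <cstdint>

// Three tightly packed floats, the layout a generic vec3 routine must handle.
struct PackedVec3 { float x, y, z; };

// Four floats on a 16-byte boundary, the layout aligned SSE loads rely on.
struct alignas(16) PaddedVec3 { float x, y, z, w; };

static_assert(sizeof(PackedVec3) == 12, "three packed floats");
static_assert(sizeof(PaddedVec3) == 16, "three floats plus one padding float");
static_assert(alignof(PaddedVec3) == 16, "every element is 16-byte aligned");

// True if the address could be used with an aligned 16-byte load (movaps).
inline bool isAligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) % 16u) == 0u;
}
```

In an array of PackedVec3 the elements sit 12 bytes apart, so most of them fail isAligned16 even when the array itself starts on a 16-byte boundary, and a 16-byte load of the last element would read past the end of the array.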
Hey Coder, I'm planning to check the processor's capabilities at the beginning of the program and build a function pointer table holding 3 versions of each function (D3DX code, SSE, 3DNow!). And I was only giving an example with the D3DXVec3Cross function, which is a simple and heavily used function... for more complex functions, I think I could get a remarkable speed gain, let's say, 2 fps...
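A function pointer chosen once at startup could look roughly like the sketch below. Everything in it is hypothetical (CrossFn, cpuHasSse, selectCross), and the "SSE" entry just forwards to the scalar code so the example stays portable. The feature check uses the GCC/Clang builtin __builtin_cpu_supports; on MSVC you would query CPUID yourself, which this sketch does not do and simply reports false.

```cpp
// Hypothetical plain 3-float vector.
struct Vec3 { float x, y, z; };

// One entry of the dispatch table: a pointer to some cross-product routine.
using CrossFn = void (*)(const Vec3& a, const Vec3& b, Vec3& out);

// Portable fallback. Temporaries make it safe when out aliases a or b.
inline void crossScalar(const Vec3& a, const Vec3& b, Vec3& out) {
    const float x = a.y * b.z - a.z * b.y;
    const float y = a.z * b.x - a.x * b.z;
    const float z = a.x * b.y - a.y * b.x;
    out.x = x; out.y = y; out.z = z;
}

// Placeholder for a real SSE routine; here it only forwards to the scalar one.
inline void crossSsePlaceholder(const Vec3& a, const Vec3& b, Vec3& out) {
    crossScalar(a, b, out);
}

// Runtime feature check (assumption: GCC or Clang on x86; otherwise false).
inline bool cpuHasSse() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse") != 0;
#else
    return false;
#endif
}

// Pick the implementation once; callers keep the pointer and call through it.
inline CrossFn selectCross() {
    return cpuHasSse() ? crossSsePlaceholder : crossScalar;
}
```

One thing to keep in mind with this design: a call through a pointer can't be inlined, so for a function as small as a cross product the indirect call may cost as much as the math it dispatches to.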
Using the following code (copied Evil Steve's and modified a bit to prevent optimizing the loop out):
#include <windows.h>
#include <stdio.h>
#include <d3dx9math.h>

__declspec(align(16)) struct myVec3
{
    float x, y, z, w;
    myVec3() {x = y = z = w = 0.0f;}
    myVec3(float _x, float _y, float _z) {x = _x; y = _y; z = _z; w = 0.0f;}
};

#define MATHinl inline

MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross2(myVec3 *v1, myVec3 *v2, myVec3* pOut)
{
    __asm {
        mov ebx, [esp+4]
        movaps xmm0, [ecx]
        movaps xmm1, xmm0
        movaps xmm2, [edx]
        movaps xmm3, xmm2
        shufps xmm0, xmm0, 11001001b
        shufps xmm1, xmm1, 11010010b
        shufps xmm2, xmm2, 11010010b
        shufps xmm3, xmm3, 11001001b
        mulps xmm0, xmm2
        mulps xmm1, xmm3
        subps xmm0, xmm1
        movaps [ebx], xmm0
        ret 4
    }
}

MATHinl void __declspec(naked) __fastcall MATHMyVec3Cross(myVec3 *v1, myVec3 *v2, myVec3* pOut)
{
    __asm {
        mov ebx, [esp+4]
        movq mm0, [ecx+4]           //a3 a2
        movd mm1, [ecx]             //0 a1
        movd mm2, [ecx+8]           //0 a3
        movd mm4, [edx+8]           //0 b3
        punpckldq mm4, [edx]        //b1 b3
        movd mm5, [edx+4]           //0 b2
        movq mm6, [edx+4]           //b3 b2
        movd mm7, [edx]             //0 b1
        movq mm3, mm0               //a3 a2 (only a2 matters: the high half of mm7 is 0)
        punpckldq mm2, mm1          //a1 a3
        pfmul mm0, mm4
        pfmul mm1, mm5
        pfmul mm2, mm6
        pfmul mm3, mm7
        pfsub mm0, mm2
        pfsub mm1, mm3
        movq [ebx], mm0
        movq [ebx+8], mm1
        femms
        ret 4
    }
}

int main(int argc, char** argv)
{
    myVec3 v1(1.5f, 2.0f, 1.0f);
    myVec3 v2(5.0f, 4.0f, 1.0f), fd;
    D3DXVECTOR3 v3(1.5f, 2.0f, 1.0f), v4(5.0f, 4.0f, 1.0f), v5;
    LARGE_INTEGER liFreq, liStart, liEnd;
    double dTime;

    printf("Profiling... ");
    QueryPerformanceFrequency(&liFreq);
    QueryPerformanceCounter(&liStart);
    for(DWORD i=0; i<1000000000; ++i)
    {
        // Each call feeds its output into the next one, so neither can be dropped
        //MATHMyVec3Cross(&v1, &v2, &fd);
        //MATHMyVec3Cross(&fd, &v2, &v1);
        D3DXVec3Cross(&v5, &v4, &v3);
        D3DXVec3Cross(&v4, &v5, &v3);
    }
    QueryPerformanceCounter(&liEnd);
    liEnd.QuadPart -= liStart.QuadPart;
    dTime = (double)liEnd.QuadPart / ((double)liFreq.QuadPart / 1000.0);
    // Pass the 64-bit QuadPart, not the LARGE_INTEGER union itself
    printf("%I64d ticks, %d ms\n", liEnd.QuadPart, (int)dTime);
    return 0;
}
On an AthlonXP 2000, 512 MB RAM:
3DNow!: 39996 ms
D3DX: 27224 ms
i.e. D3DX took 68% of the time your custom functions took. Compiler is VC++ 2005 Express Beta, release build, all optimizations on (except Global Optimization), Favor fast code, etc.
Tough luck.
[edit]Clarifying the last paragraph
[Edited by - Coder on March 3, 2005 3:14:21 PM]
ASSUMING... that you are spending all your time doing vector math. There is a lot more going on in any game that probably takes up much more of the processing time than this. Try profiling a full game using the two different functions, and then maybe the results will be more relevant.
Until then, it was a fun project that was a premature optimization.