Direct3D faster than OpenGL? how...

Started by
7 comments, last by MJP 16 years, 1 month ago
Hi guys, (Sorry for my bad English, I'm French-Canadian)

I am a bit desperate now. The engine I am building supports both OpenGL and Direct3D.

First, some statistics (with shaders off, and with the exact same view and number of triangles):

Direct3D: 100 FPS
OpenGL: 280 FPS
PC: Dual-Xeon Duo (4 cores)
VCard: Ati FireGL V3400

I am having difficulty achieving a good FPS in Direct3D. In both cases, I am using vertex buffers (ARB in OpenGL). I know 100 FPS is not much of a problem, but when I add the shaders on top of this, or run it on a lower-end machine, it gets worse.

This is what I tried:
- Set to hardware vertex processing. (I gained 10 FPS, from 90 to 100)
- Batching objects. (No gain in D3D)
- Removed ALL gets to the device and set it to D3DCREATE_PUREDEVICE (No gain at all)
- Sorted by material (Went from 73 material switches to 25, and strangely no gain at all)
- Using only static buffers, with usage 0 and D3DPOOL_DEFAULT

I have read the MSDN performance tips. And I am doing all of that, except for TriangleStrips (Using triangle enumeration), but it is not different in OpenGL. And they also say that hardware now achieves that optimization for us.

The only thing I am worried about is that when I create a texture I am not using D3DPOOL_DEFAULT, but MANAGED, because I have to lock the texture afterwards to put the data in it. (We use our own format, which simplifies the switch between OpenGL and Direct3D.) So is there a way to create a texture in the DEFAULT pool and set its data without having to "lock" it? All our textures are loaded this way:

//--- Create the empty texture

//--- Lock surface to write data on it
D3DLOCKED_RECT lockedRect;
m_dxTexture->LockRect(0, &lockedRect, 0, 0);

//--- Fill it with the data
HGubyte *imageData = (HGubyte*)lockedRect.pBits;
// imageData is NULL if I create the texture using D3DPOOL_DEFAULT

//--- Unlock surface

That is the only thing I am still wondering about, and the only possible optimization left. I have read in these forums that D3DPOOL_DEFAULT is faster, and we are using MANAGED for ALL our textures.

Thanks a bunch guys for your help :)
Well, the first step would be to do some profiling to figure out where your bottlenecks are. Probably the first thing you want to figure out is whether you're CPU-limited or GPU-limited. PIX can be invaluable for this kind of work, as it will break down the CPU time and GPU time per frame. It will also profile your API calls, so you can see if you're spending a significant amount of time in any particular functions. See this Gamefest presentation for more info.

Also, you don't want to put your textures in the DEFAULT pool unless you have to (for example, to use as a render target). It's best to keep them in MANAGED, so that the runtime can manage the memory for you.
Quote:Original post by Daivuk
- Sorted by material (Went from 73 material switches to 25, and strangely no gain at all)
How many DrawPrimitive() and DrawIndexedPrimitive() calls do you have? Those calls are particularly expensive in D3D; decreasing them should help.
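
To make the point about draw call counts concrete, here is a minimal sketch of sorting by material and then merging neighbours into batches. The `DrawItem` struct and `buildBatches` function are hypothetical names made up for illustration, and it assumes the merged geometry can sit in one vertex buffer and go out in a single draw call, which may not hold for every engine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical draw record; the names are illustrative, not taken from the engine.
struct DrawItem {
    int materialId;     // material/texture set this draw needs
    int triangleCount;  // triangles submitted by this draw
};

// Sorts by material, then merges neighbours that share a material into one
// batch. Assumes (hypothetically) that merged geometry can live in one
// vertex buffer and be submitted with a single draw call.
std::vector<DrawItem> buildBatches(std::vector<DrawItem> items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.materialId < b.materialId;
                     });
    std::vector<DrawItem> batches;
    for (const DrawItem& item : items) {
        assert(item.triangleCount >= 0);
        if (!batches.empty() && batches.back().materialId == item.materialId) {
            batches.back().triangleCount += item.triangleCount;  // same material: grow the batch
        } else {
            batches.push_back(item);  // new material: start a new batch
        }
    }
    return batches;
}
```

Sorting alone only reduces material switches; it is the merge step that actually lowers the number of draw calls, since the result has one entry per material instead of one per object.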

Quote:Original post by Daivuk
- Using only static buffers, with usage 0 and D3DPOOL_DEFAULT
As MJP said, don't. The managed pool is basically one copy of the data in the default pool and one in system memory. D3D will upload data to the graphics card as needed - and so long as you're not running out of video memory, all your resources should be kept in video memory.

Quote:Original post by Daivuk
I have read the MSDN performance tips.

And I am doing all of that, except for TriangleStrips (Using triangle enumeration), but it is not different in OpenGL. And they also say that hardware now achieves that optimization for us.
I was under the impression that indexed triangle lists are generally faster now.

Also, as MJP said, you'll need to profile to find out exactly what's going on.


I've been profiling now for a couple of days.
Printing a lot of info on screen, hiding/showing parts and looking at the impact on the FPS/draw count/texture count. I also did some tests with PIX.

The statistics for the whole scene:

---- Overall ----
fps : 96
triangle count: 19945
texture calls : 69
draw calls : 342
state calls : 203 (That's SetRenderState)

My number of draw calls was almost 1000 before, and texture calls were around 500.

So I thought I had found the problem, but no... Batching my stuff and lowering the state changes didn't change anything. Actually, what takes all those calls is everything but the walls. Statistics for the walls only:

---- Walls ----
fps : 158
triangle count: 13718
texture calls : 8
draw calls : 50
state calls : 0

Note that those take hardly any calls at all!! But they account for most of the triangles, and for most of the FPS cost.

In conclusion: the number of triangles costs a lot. But, what the... ? I have only 20k triangles here!! I tried running with hardware vertex processing and a pure device. On some machines it gets slower (a MacBook Pro, for example).

And why do I have 300 fps in OpenGL for the exact same thing?

I am using triangle enumeration; the next step is to test indexed. It will be hard because some normals aren't meant to be smoothed and UVs don't always fit.
Quote:Original post by Daivuk
fps : 96
triangle count: 19945
texture calls : 69
draw calls : 342
state calls : 203 (That's SetRenderState)
Uhm... Do I get it right? The average batch length is about 58 triangles. No wonder your results are skewed. I wonder if it makes sense, on hardware like yours, to benchmark that way and expect to obtain meaningful numbers...
Quote:Original post by Daivuk
---- Walls ----
fps : 158
triangle count: 13718
texture calls : 8
draw calls : 50
state calls : 0

About 274 triangles per call? Why are you doing this?
The batching problem comes from two sources. The first is CPU overhead, a bottleneck you don't seem to have; the second is that GPUs are, by themselves, painfully slow with small numbers of triangles (they make more power at high RPM, in a certain sense).
I have no experience at all with FireGLs, but 2 Mtris/sec, considering the average batch size, doesn't sound SO bad. My experience was worse (ktris/sec with about 30 vertices as a batch size).
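
For reference, the batch size and throughput figures can be recomputed straight from the posted counters. This is only a small sketch; the `FrameStats` struct and the function names are made up for illustration.

```cpp
#include <cassert>

// Per-scene counters as printed in the statistics above.
struct FrameStats {
    double fps;
    long triangles;
    long drawCalls;
};

// Average number of triangles submitted per draw call.
double trianglesPerCall(const FrameStats& s) {
    assert(s.drawCalls > 0);
    return static_cast<double>(s.triangles) / static_cast<double>(s.drawCalls);
}

// Triangles rendered per second at the measured frame rate.
double trianglesPerSecond(const FrameStats& s) {
    return static_cast<double>(s.triangles) * s.fps;
}
```

With the whole scene (96 fps, 19945 triangles, 342 draw calls) this gives roughly 58 triangles per call and about 1.9 million triangles per second. With the walls only (158 fps, 13718 triangles, 50 draw calls) it gives roughly 274 triangles per call and about 2.2 million triangles per second.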

The APIs perform "more or less" well when properly used. You're giving D3D too big a handicap here by hitting it where it hurts most. If you're replicating the walls just to save on memory, I don't think it's future-proof.
Both APIs are "good" (or "bad" depending on your viewpoint) but you cannot expect them to have the same performance pattern.
Quote:Original post by Daivuk
And why do I have 300 fps in OpenGL for the exact same thing?
I suppose that the inner core is the culprit here but I don't know much about it... people on the GL forum would say that ATi's probably playing dirty with its internal locks and semantics... although I feel this isn't going to happen on FireGLs. ;)
Quote:Original post by Daivuk
I am using triangle enumeration; the next step is to test indexed. It will be hard because some normals aren't meant to be smoothed and UVs don't always fit.
I have a bad feeling here. Could you elaborate a bit on how you manage your geometry?

Previously "Krohm"

Indexed primitives are significantly quicker on modern hardware because they enable caching of processed vertices. With optimal reuse this can give you up to a sixfold boost in vertex throughput, but in reality you should expect significantly less than that.
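
One way to see where a figure like six can come from (this is my own illustration, not something derived in the thread): in a regular grid of quads, each interior vertex is shared by six triangles, so a non-indexed list references each vertex about six times. The sketch below just counts this for an n x n grid; `GridCounts`, `countGrid` and `reuseRatio` are hypothetical names.

```cpp
#include <cassert>

// Vertex references vs unique vertices for an n x n grid of quads,
// each quad split into two triangles.
struct GridCounts {
    long uniqueVertices;    // (n + 1)^2 shared corner vertices
    long triangles;         // 2 * n^2
    long vertexReferences;  // 3 per triangle, as in a non-indexed triangle list
};

GridCounts countGrid(long n) {
    assert(n > 0);
    GridCounts c;
    c.uniqueVertices = (n + 1) * (n + 1);
    c.triangles = 2 * n * n;
    c.vertexReferences = 3 * c.triangles;
    return c;
}

// How many times each unique vertex is referenced on average.
double reuseRatio(const GridCounts& c) {
    return static_cast<double>(c.vertexReferences) /
           static_cast<double>(c.uniqueVertices);
}
```

For a single quad the ratio is only 1.5, and it climbs towards 6 as the grid grows (about 5.88 for a 100 x 100 grid). Real meshes and real vertex caches will land somewhere below that ideal.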

Where UVs or normals don't match but positions do, you need to duplicate the vertex for each different usage. One option is to duplicate it for all four combinations, and then just delete the unused ones afterwards. This should obviously be done in a converter tool, and not at load time, as it can take a while.
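
A minimal sketch of that conversion, going the other way round: start from the non-indexed triangle list and only share a vertex when position, normal and UV all match exactly. The `Vertex` layout and the `buildIndexedMesh` name are assumptions for illustration, and the exact float comparison is a simplification (a real tool would probably want a tolerance).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical vertex layout: position, normal, one UV set.
struct Vertex {
    float px, py, pz;
    float nx, ny, nz;
    float u, v;
};

// Orders vertices by all attributes so that identical ones map to the same key.
inline bool operator<(const Vertex& a, const Vertex& b) {
    return std::tie(a.px, a.py, a.pz, a.nx, a.ny, a.nz, a.u, a.v) <
           std::tie(b.px, b.py, b.pz, b.nx, b.ny, b.nz, b.u, b.v);
}

struct IndexedMesh {
    std::vector<Vertex> vertices;        // unique vertices
    std::vector<std::uint32_t> indices;  // three per triangle
};

// Converts a non-indexed triangle list into an indexed one.
IndexedMesh buildIndexedMesh(const std::vector<Vertex>& triangleList) {
    assert(triangleList.size() % 3 == 0);
    IndexedMesh mesh;
    std::map<Vertex, std::uint32_t> seen;  // vertex -> index already assigned
    for (const Vertex& v : triangleList) {
        auto it = seen.find(v);
        if (it == seen.end()) {
            // First time we see this exact combination: add a new vertex.
            it = seen.emplace(v, static_cast<std::uint32_t>(mesh.vertices.size())).first;
            mesh.vertices.push_back(v);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}
```

Vertices on a hard edge keep their own copies automatically, because their normals differ even though the positions match, so flat-shaded walls and mismatched UVs are not a blocker for indexing.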
Quote:Original post by Daivuk
And why do I have 300 fps in OpenGL for the exact same thing?

Because of the way the drivers run on Windows. The GL drivers run in user space, so if you make many glDrawRangeElements calls, it is not such a problem.
With D3D, every call causes a switch to ring 0 (kernel space) and this is expensive.

You need to adjust your code. Render more triangles per DrawPrimitive call.

That's not true. Not every Direct3D call requires a ring transition.

The D3D calls are buffered by the D3D runtime, then the entire command buffer is sent to the driver at one time (typically when the app calls Present).
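
As a toy model of that buffering (emphatically not the actual Direct3D runtime, and every name here is made up), calls can be queued in user space and handed over in one go:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of call buffering; not the actual Direct3D runtime.
class CommandBuffer {
public:
    // Queues a command in user space; nothing is submitted here.
    void record(const std::string& command) { pending_.push_back(command); }

    // Hands every queued command to the (imaginary) driver in one submission.
    // Returns how many commands were submitted.
    std::size_t flush() {
        const std::size_t count = pending_.size();
        if (count > 0) ++submissions_;
        pending_.clear();
        assert(pending_.empty());  // the buffer starts the next frame empty
        return count;
    }

    std::size_t submissions() const { return submissions_; }
    std::size_t pending() const { return pending_.size(); }

private:
    std::vector<std::string> pending_;
    std::size_t submissions_ = 0;
};
```

In this model, recording 342 draw calls and then flushing once at the end of the frame costs a single submission rather than 342 of them.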

If you've improved your batching and haven't seen a performance increase, it's very likely you're not CPU-bound. This is why I said you needed to figure out where your bottleneck is: relieving pressure on the CPU doesn't do much for you when it's your GPU that's holding things up.
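
A rough way to picture this, under the simplifying assumption that CPU submission and GPU execution overlap fully so a frame takes as long as the slower of the two (real pipelines are messier, and the function names are just for illustration):

```cpp
#include <algorithm>
#include <cassert>

// Simplified model: CPU submission and GPU execution overlap, so the frame
// takes as long as the slower of the two.
double frameTimeMs(double cpuMs, double gpuMs) {
    assert(cpuMs >= 0.0 && gpuMs >= 0.0);
    return std::max(cpuMs, gpuMs);
}

// Converts a frame time in milliseconds to frames per second.
double framesPerSecond(double frameMs) {
    assert(frameMs > 0.0);
    return 1000.0 / frameMs;
}
```

With made-up numbers of 6 ms of CPU work and 10 ms of GPU work, the frame takes 10 ms (100 fps). Halving the CPU work to 3 ms still gives 10 ms, while cutting the GPU work to 5 ms brings the frame down to 6 ms.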

