OpenGL faster than Direct3D? How?


Daivuk    413
Hi guys (sorry for the bad English, I'm French-Canadian). I am getting a bit desperate now. The engine I am building supports both OpenGL and Direct3D.

-------------------
First, some statistics, with shaders off and with the exact same view and number of triangles:

Direct3D: 100 FPS
OpenGL: 280 FPS
PC: dual dual-core Xeon (4 cores)
Video card: ATI FireGL V3400

--------------------
I am having trouble reaching a good frame rate in Direct3D. In both cases I am using vertex buffers (ARB vertex buffer objects in OpenGL). I know 100 FPS is not a problem in itself, but once shaders run on top of this, or it runs on a slower machine, it gets worse.

--------------------
Here is what I have tried:
- Set hardware vertex processing. (I gained 10 FPS, from 90 to 100)
- Batched objects. (No gain in D3D)
- Removed ALL gets on the device and created it with D3DCREATE_PUREDEVICE. (No gain at all)
- Sorted by material. (Went from 73 material switches down to 25, and strangely no gain at all)
- Used only static buffers, with usage 0 and D3DPOOL_DEFAULT.

--------------------
I have read the MSDN performance tips:
http://msdn2.microsoft.com/en-us/library/bb147263(VS.85).aspx

I am doing everything they recommend, except triangle strips (I am using triangle lists), but that is no different in OpenGL, and they also say that hardware now performs that optimization for us.

--------------------
The only thing I am unsure about is that when I create a texture I am not using D3DPOOL_DEFAULT but MANAGED, because I have to lock the texture afterwards to fill it with data (we use our own texture format, which simplifies switching between OpenGL and Direct3D). So is there a way to create a texture with the DEFAULT pool and set its data without having to lock it? All our textures are loaded this way:
//--- Create the empty texture
D3DXCreateTexture(
    m_D3DDevice,
    m_size[0],        // width
    m_size[1],        // height
    0,                // 0 = create a full mipmap chain
    0,                // usage
    format,
    D3DPOOL_MANAGED,
    &m_dxTexture);

//--- Lock the top-level surface to write data to it
D3DLOCKED_RECT lockedRect;
m_dxTexture->LockRect(0, &lockedRect, NULL, 0);

//--- Fill it with the data
HGubyte *imageData = (HGubyte*)lockedRect.pBits;
// LockRect fails (pBits stays NULL) if the texture was created
// with D3DPOOL_DEFAULT, since default-pool textures are not lockable.

//--- Unlock the surface
m_dxTexture->UnlockRect(0);



That is my only remaining question, and the only optimization avenue I have left. I have read on these forums that D3DPOOL_DEFAULT is faster, and we are using MANAGED for ALL our textures.

--------------------
Thanks a bunch guys for your help :)

MJP    19753
Well, the first step would be to do some profiling to figure out where your bottlenecks are. Probably the first thing you want to determine is whether you're CPU-limited or GPU-limited. PIX can be invaluable for this kind of work, as it breaks down CPU time and GPU time per frame. It will also profile your API calls, so you can see whether you're spending a significant amount of time in any particular function. See this Gamefest presentation for more info.

Also, you don't want to put your textures in the DEFAULT pool unless you have to (for example, to use one as a render target). It's best to keep them in MANAGED, so that the runtime can manage the memory for you.
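If you ever do need a filled DEFAULT-pool texture (which, as the original post notes, cannot be locked), the usual D3D9 route is to lock and fill a SYSTEMMEM staging texture and copy it across with IDirect3DDevice9::UpdateTexture. A minimal sketch, assuming a tightly packed source image; the function name and parameters here are placeholders, not the poster's code:

#include <cstring>
#include <d3d9.h>
#include <d3dx9.h>

// Sketch: fill a DEFAULT-pool texture without locking it directly.
// Lock a SYSTEMMEM staging texture, then copy it to the GPU texture.
IDirect3DTexture9 *CreateFilledDefaultTexture(
    IDirect3DDevice9 *device, UINT width, UINT height,
    D3DFORMAT format, const BYTE *srcPixels, UINT rowBytes)
{
    IDirect3DTexture9 *staging = NULL;
    IDirect3DTexture9 *gpuTex  = NULL;

    D3DXCreateTexture(device, width, height, 1, 0,
                      format, D3DPOOL_SYSTEMMEM, &staging);
    D3DXCreateTexture(device, width, height, 1, 0,
                      format, D3DPOOL_DEFAULT, &gpuTex);

    // The staging texture lives in system memory, so it is lockable.
    D3DLOCKED_RECT lr;
    staging->LockRect(0, &lr, NULL, 0);
    BYTE *dst = (BYTE*)lr.pBits;
    for (UINT y = 0; y < height; ++y)
        std::memcpy(dst + y * lr.Pitch, srcPixels + y * rowBytes, rowBytes);
    staging->UnlockRect(0);

    // Copies every matching mip level from system memory to the GPU.
    device->UpdateTexture(staging, gpuTex);
    staging->Release();
    return gpuTex;
}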

Evil Steve    2017
Quote:
Original post by Daivuk
- Sorted by material. (Went from 73 material switches down to 25, and strangely no gain at all)
How many DrawPrimitive() and DrawIndexedPrimitive() calls do you have? Those calls are particularly expensive in D3D; decreasing their number should help.
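For illustration, the usual fix is to pack everything that shares a material into one vertex/index buffer at load time and issue a single call per material. A rough sketch of the idea; Batch and its fields are made-up names, not the poster's code:

#include <d3d9.h>

// Sketch: one DrawIndexedPrimitive per material instead of one per
// object. Geometry sharing a material is packed into a shared VB/IB
// at load time; each batch just remembers its offsets.
struct Batch {
    INT  baseVertex;   // first vertex of this batch in the shared VB
    UINT vertexCount;  // number of vertices the batch spans
    UINT startIndex;   // first index of this batch in the shared IB
    UINT primCount;    // number of triangles
};

void DrawBatch(IDirect3DDevice9 *device, const Batch &b)
{
    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST,
                                 b.baseVertex,   // BaseVertexIndex
                                 0,              // MinVertexIndex
                                 b.vertexCount,  // NumVertices
                                 b.startIndex,   // StartIndex
                                 b.primCount);   // PrimitiveCount
}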

Quote:
Original post by Daivuk
- Used only static buffers, with usage 0 and D3DPOOL_DEFAULT.
As MJP said, don't. The managed pool basically keeps one copy of the data in the default pool and one in system memory. D3D uploads data to the graphics card as needed, so as long as you're not running out of video memory, all your resources should stay in video memory.


Quote:
Original post by Daivuk
I have read the MSDN performance tips:
http://msdn2.microsoft.com/en-us/library/bb147263(VS.85).aspx

I am doing everything they recommend, except triangle strips (I am using triangle lists), but that is no different in OpenGL, and they also say that hardware now performs that optimization for us.
I was under the impression that indexed triangle lists are generally faster now.

Also, as MJP said, you'll need to profile to find out exactly what's going on.

Daivuk    413
Ok,

I've been profiling for a couple of days now: printing a lot of info on screen, hiding/showing parts of the scene and watching the impact on the FPS, draw count and texture count. I also did some tests with PIX.

The statistics for the whole scene:

---- Overall ----
fps : 96
triangle count: 19945
texture calls : 69
draw calls : 342
state calls : 203 (That's SetRenderState)


My number of draw calls was almost 1000 before, and texture calls were around 500.

So I thought I had found the problem, but no... batching my stuff and lowering the state changes didn't change anything. Actually, what is eating all those calls is everything except the walls. Statistics for the walls only:

---- Walls ----
fps : 158
triangle count: 13718
texture calls : 8
draw calls : 50
state calls : 0

Note that the walls make almost no calls at all, yet they hold the major part of the triangles and give the lowest FPS.

In conclusion: the number of triangles costs a lot. But what the...? I have only 20k triangles here! I tried running with hardware vertex processing and a pure device. On some machines it gets slower (a MacBook Pro, for example).

And why do I have 300 fps in OpenGL for the exact same thing?


------------------
I am using triangle lists; the next step is to test indexed primitives. It will be hard because some normals aren't meant to be smoothed and the UVs don't always match.

Krohm    5030
Quote:
Original post by Daivuk
fps : 96
triangle count: 19945
texture calls : 69
draw calls : 342
state calls : 203 (That's SetRenderState)
Uhm... do I get this right? The average batch length is about 58 triangles (19,945 triangles over 342 draw calls). No wonder your results are skewed. I wonder if it makes sense, on hardware like yours, to benchmark that way and expect meaningful numbers...
Quote:
Original post by Daivuk
---- Walls ----
fps : 158
triangle count: 13718
texture calls : 8
draw calls : 50
state calls : 0

That leaves barely 21 triangles per call for everything that isn't a wall ((19,945 − 13,718) triangles over (342 − 50) calls). Why are you doing this?
The batching problem comes from two sources. The first is the CPU bottleneck, which you don't have; the second is that GPUs by themselves are terribly slow with small batches of triangles (they deliver their power at high RPM, in a sense).
I have no experience at all with FireGLs, but roughly 2 Mtris/sec (19,945 triangles × 96 FPS) doesn't sound SO bad considering the average batch size. My experience was worse (ktris/sec with a batch size of about 30 vertices).

The APIs perform more or less equally well when used properly. You're giving D3D too big a handicap here by punching it exactly where it hurts most. If you're replicating the walls just to save memory, I don't think that's future-proof.
Both APIs are "good" (or "bad", depending on your viewpoint), but you cannot expect them to share the same performance profile.
Quote:
Original post by Daivuk
And why do I have 300 fps in OpenGL for the exact same thing?
I suppose the driver's inner core is the culprit here, but I don't know much about it... people on the GL forum would say that ATI is probably playing dirty with its internal locks and semantics... although I feel that isn't likely on FireGLs. ;)
Quote:
Original post by Daivuk
I am using triangle lists; the next step is to test indexed primitives. It will be hard because some normals aren't meant to be smoothed and the UVs don't always match.
I have a bad feeling here. Could you elaborate a bit on how you manage your geometry?

Adam_42    3629
Indexed primitives are significantly quicker on modern hardware because they enable caching of processed vertices. With optimal reuse this can give up to a six-times boost in vertex throughput (a plain triangle list processes three vertices per triangle, while a perfectly cached regular grid approaches half a vertex per triangle), but in practice you should expect significantly less than that.

Where UVs or normals don't match but positions do, you need to duplicate the vertex for each distinct usage. One option is to duplicate it for all the combinations up front and then just delete the unused ones afterwards. This should obviously be done in a converter tool, not at load time, as it can take a while.
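For what it's worth, a converter can build the index buffer by treating the full vertex (position + normal + UV) as the lookup key, so a vertex is only shared when every attribute matches. A minimal sketch of the idea; the Vertex layout and the exact-match comparison are assumptions, not the poster's format:

#include <cstring>
#include <map>
#include <vector>

// Sketch: convert a raw (non-indexed) triangle list into unique
// vertices plus an index buffer. A vertex is only reused when the
// position, normal, and UV all match exactly.
struct Vertex {
    float pos[3];
    float normal[3];
    float uv[2];
};

// Byte-wise ordering so identical vertices map to the same index.
// A real converter would usually weld with a small epsilon instead.
struct VertexLess {
    bool operator()(const Vertex &a, const Vertex &b) const {
        return std::memcmp(&a, &b, sizeof(Vertex)) < 0;
    }
};

void BuildIndexed(const std::vector<Vertex> &triList,
                  std::vector<Vertex> &outVerts,
                  std::vector<unsigned short> &outIndices)
{
    std::map<Vertex, unsigned short, VertexLess> lookup;
    for (size_t i = 0; i < triList.size(); ++i) {
        std::map<Vertex, unsigned short, VertexLess>::const_iterator it
            = lookup.find(triList[i]);
        if (it == lookup.end()) {
            unsigned short index = (unsigned short)outVerts.size();
            outVerts.push_back(triList[i]);
            lookup.insert(std::make_pair(triList[i], index));
            outIndices.push_back(index);
        } else {
            outIndices.push_back(it->second);
        }
    }
}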

V-man    813
Quote:
Original post by Daivuk
And why do I have 300 fps in OpenGL for the exact same thing?


It's because of the way the drivers run on Windows. The GL driver runs in user space, so making many glDrawRangeElements calls is not such a problem.
With D3D, every call causes a switch to ring 0 (kernel space), and that is expensive.

You need to adjust your code: render more triangles per DrawPrimitive call.

don    431
@V-man

That's not true: not every Direct3D call requires a ring transition.

The D3D calls are buffered by the D3D runtime, and the entire command buffer is then sent to the driver in one go (typically when the app calls Present).

MJP    19753
If you've improved your batching and haven't seen a performance increase, it's very likely you're not CPU-bound. This is why I said you needed to figure out where your bottleneck is: relieving pressure on the CPU doesn't do much for you when it's your GPU that's holding things up.
