
OpenGL Seeking maximum performance with OpenGL!


Greetings,

I once read the analogy that drawing with OpenGL is like crossing an ocean: if you want to get your people to the other side as fast as possible, you had better load one fully packed cruise ship rather than make several small voyages. That made me think. I’m currently issuing several draw calls per frame with glDrawRangeElements and VBOs, and I’m wondering what the next step is. Considering a single-pass pipeline, is it possible to make only one draw call per frame? And bam!

It's probably not possible for a scene of any reasonable complexity, and certain techniques are fundamentally multi-pass. Moreover, trying to pack everything into one draw call isn't going to get you anything. 4 draw calls don't necessarily take half the time of 8, so it's not as if a single draw call is some sort of ring you should be reaching for.

I'm not really familiar with what the reasonable limits are from personal experience, but I do recall reading an article once which said that anything fewer than a couple hundred draw calls per frame didn't have any real impact. Even that data may be dated, since CPUs and GPUs have moved on, and more recent models for 3D driver interaction have reduced the number of kernel/user mode switches and have begun to embrace multi-threading. The reasons to avoid many draw calls are to avoid those (relatively) slow kernel/user mode switches, and to avoid changing the GPU's worldview so frequently, which causes GPU caches and pipelines to be flushed more often than necessary.

Organize your draw calls by like material to minimize your draw calls, but don't get caught up trying to chase the 1 draw call game -- though that might make an interesting theme for a coding competition [grin]
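
For what it's worth, here is a minimal sketch of what "organize by like material" could look like on the CPU side. DrawItem, stateKey and the two-field key are made up for illustration (they are not from any engine mentioned in this thread); the idea is only that sorting by a packed state key puts draws with the same shader and texture next to each other.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-draw record; the field names are illustrative only.
struct DrawItem {
    std::uint32_t shaderId;
    std::uint32_t textureId;
    std::uint32_t meshId;
};

// Pack the state into one key so that items sharing a shader, and then a
// texture, become neighbours after sorting.
inline std::uint64_t stateKey(const DrawItem &d) {
    return (static_cast<std::uint64_t>(d.shaderId) << 32) | d.textureId;
}

inline bool stateLess(const DrawItem &a, const DrawItem &b) {
    return stateKey(a) < stateKey(b);
}

// Sort the draw list by state key, keeping the submission order of items
// that share the same state.
inline void sortByState(std::vector<DrawItem> &items) {
    std::stable_sort(items.begin(), items.end(), stateLess);
    assert(std::is_sorted(items.begin(), items.end(), stateLess));
}

// Count how often the state key changes while walking the list in order.
// The first item always counts as one change.
inline std::size_t countStateChanges(const std::vector<DrawItem> &items) {
    std::size_t changes = 0;
    for (std::size_t i = 0; i < items.size(); ++i)
        if (i == 0 || stateKey(items[i]) != stateKey(items[i - 1]))
            ++changes;
    return changes;
}
```

Comparing countStateChanges before and after sortByState gives a rough idea of how many shader/texture switches the sort saves. It does not reduce the number of draw calls by itself.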

I've heard a rule of thumb that a draw-call is only "waste" if there's less than ~100 triangles included in it. As long as you're making decent use of your draw-calls (i.e. each one does quite a bit of work) then there's not too much to worry about.

Some anecdotes:
On my GeForce200 I can do about 1M polygons with 1000 draw-calls at 30hz (that's 1000 triangles per draw-call).

I know of one proprietary game engine (DX9/360/PS3) that starts displaying warnings to the artists (optimize your meshes!) once the draw-call counter goes over 2000/frame. i.e. too many draw-calls is recognized here as an art problem, not a code problem.

I've worked on a planetary renderer that was doing 150K triangles in 20K draw-calls per frame (that's only 7-8 triangles per draw-call!). By grouping triangles into the same draw-calls (from 20K down to 1K), we improved the per-frame timings from 350ms (~3 fps) to 30ms (~30fps). That's a 20x reduction in draw-calls for a 10x speed improvement (you can't make a general rule out of that though - there's too many other factors!). n.b. this kind of ties in with the ">100 tri per batch" rule of thumb - we were breaking that rule and had bad performance, then we complied with the rule and had good performance.
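
The arithmetic behind these anecdotes is easy to check. The helpers below are only a sketch with made-up names; the 100-triangle default is the rule of thumb quoted above, not a measured constant.

```cpp
#include <cassert>

// Average number of triangles submitted per draw call.
inline double trianglesPerCall(double triangles, double drawCalls) {
    assert(drawCalls > 0.0);
    return triangles / drawCalls;
}

// True when the average batch is smaller than the rule-of-thumb minimum.
inline bool underfilled(double triangles, double drawCalls,
                        double minTriangles = 100.0) {
    return trianglesPerCall(triangles, drawCalls) < minTriangles;
}

// Speed-up factor between two frame times given in milliseconds.
inline double speedup(double beforeMs, double afterMs) {
    assert(afterMs > 0.0);
    return beforeMs / afterMs;
}
```

With the planetary renderer numbers, 150K triangles in 20K calls averages 7.5 triangles per call (under the threshold) and 150K in 1K calls averages 150 (over it). 350ms down to 30ms is a factor of roughly 11.7, which is where the "10x" comes from.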

On the PS3, one of the most expensive state changes is switching shaders - this can overshadow draw-call costs.

Quote:
Original post by golgoth13
is it possible to make only one draw call per frame? And bam!


Only if every object has the same rendering state (textures, materials, shaders, etc).
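
One common way to act on that, sketched here under assumptions: geometry that shares the same state is appended into one vertex array and one index array, and the indices of each mesh are rebased by the number of vertices already written. Mesh, Vertex and mergeMeshes are hypothetical names. The sketch also assumes the vertices are already in a common space, since a single draw call gets a single set of uniforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex { float x, y, z; };

// Hypothetical CPU-side mesh: positions plus triangle-list indices.
struct Mesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint32_t> indices;
};

// Append every mesh into one vertex array and one index array, rebasing
// each mesh's indices by the number of vertices written before it.
inline Mesh mergeMeshes(const std::vector<Mesh> &meshes) {
    Mesh merged;
    for (const Mesh &m : meshes) {
        const std::uint32_t base =
            static_cast<std::uint32_t>(merged.vertices.size());
        merged.vertices.insert(merged.vertices.end(),
                               m.vertices.begin(), m.vertices.end());
        for (std::uint32_t idx : m.indices) {
            assert(idx < m.vertices.size()); // index must refer to its own mesh
            merged.indices.push_back(base + idx);
        }
    }
    return merged;
}
```

The merged arrays would then be uploaded once and drawn with a single indexed call such as glDrawRangeElements. The trade-off is that the merged objects can no longer be moved or culled individually without rebuilding the buffer.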

Yup. My experiments indicated that the "sweet spot" was somewhere around 4000 triangles per call. Reducing the number of triangles per call below 4000 increased overhead fairly quickly, while increasing the number of triangles per call above 4000 decreased overhead rather slowly.

If you're really crazy about squeezing every last gram of performance this way, you might find it worthwhile to draw somewhere between 16K and 64K triangles at a time. Beyond that, the overhead is insignificant.

Interesting, seems like the real rule of thumb here is “It’s all relative”.

Quote:
Yup. My experiments indicated that the "sweet spot" was somewhere around 4000 triangles per call. Reducing the number of triangles per call below 4000 increased overhead fairly quickly, while increasing the number of triangles per call above 4000 decreased overhead rather slowly.


I can almost see the light, but there is a missing link: how the polygon count per call is controlled. Even if we have one material for 100 geometries, we still need to draw them one by one with 100 draw calls, right?

If not, how can we render several geometries with the same rendering state in one draw call?

Another way to minimize draw calls might be to do high-level visibility, transform, morphing, etc. calculations on the GPU, store the result in a buffer using GL_ARB_transform_feedback3 support, and then use glDrawElementsIndirect or glDrawArraysIndirect. What would really be great, though, is to be able to switch shaders without an API call.
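
To make the indirect part a bit more concrete: glDrawElementsIndirect reads its parameters from a small structure of five 32-bit fields stored in a buffer. The sketch below fills that structure on the CPU with plain fixed-width types, so it makes no GL calls. SubMesh and makeCommand are hypothetical helpers, and writing an instance count of zero to skip an invisible object is an assumption about how one might use it, not something prescribed here.

```cpp
#include <cassert>
#include <cstdint>

// Layout read by glDrawElementsIndirect: five tightly packed 32-bit fields.
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices to draw
    std::uint32_t instanceCount; // number of instances (0 draws nothing)
    std::uint32_t firstIndex;    // offset into the index buffer, in indices
    std::int32_t  baseVertex;    // value added to every index
    std::uint32_t baseInstance;  // reserved, must be zero before OpenGL 4.2
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "the command must be five packed 32-bit values");

// Hypothetical record of where one object lives inside shared buffers.
struct SubMesh {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
};

// Build the command for one object. In the GPU-driven setup described
// above, a shader would write these same fields into the buffer instead.
inline DrawElementsIndirectCommand makeCommand(const SubMesh &s, bool visible) {
    assert(s.indexCount > 0);
    return DrawElementsIndirectCommand{
        s.indexCount, visible ? 1u : 0u, s.firstIndex, s.baseVertex, 0u};
}
```

Note that in GL 4.0 each glDrawElementsIndirect call still consumes one command, so this moves the parameters to the GPU rather than reducing the call count. Submitting many commands in one call came later, with glMultiDrawElementsIndirect.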

As I posted in another thread, I would strongly recommend finding the best vertex layout for the vertex cache in order to increase your VBO performance per draw call.

That is the next most interesting step (for normal VBO counts, not 20K of course).

Regards.

Quote:
calculations on the GPU, storing the result in a buffer using GL_ARB_transform_feedback3 support, and use glDrawElementsIndirect or glDrawArraysIndirect.


This does seem to push the envelope to the next step. My guess is that this technique involves geometry shaders. I have never heard of or seen anything like this, so it has to be part of GL 4. I'm currently bound to GL 2.0~3.3.

Quote:
As I posted in another thread, I would strongly recommend finding the best vertex layout for the vertex cache in order to increase your VBO performance per draw call.


I'm currently working with glBindBuffer, glBufferData, and glBufferSubData (for vertex attributes).

Does anybody have experience with this to share, and advice on how to achieve good results?

Thx for all your inputs.

Check the specifications of the most common graphics cards and align your vertex memory structure to their cache block fetch size. Also make sure your data is interleaved. This should help your VBO performance a lot, as all the vertex data for a single pass of your vertex shader will be a cache hit.
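
As a rough illustration of "interleaved and aligned", here is a sketch of a vertex struct that packs position, normal and one UV set into 32 bytes. The 32-byte target is only an assumption for the example (the right size depends on the hardware you check), and InterleavedVertex and attribOffset are invented names.

```cpp
#include <cassert>
#include <cstddef>

// One vertex with all of its attributes stored next to each other.
struct InterleavedVertex {
    float position[3]; // bytes  0..11
    float normal[3];   // bytes 12..23
    float uv[2];       // bytes 24..31
};
static_assert(sizeof(InterleavedVertex) == 32,
              "vertex should stay at the assumed 32-byte size");

// Stride and per-attribute byte offsets, as they would be passed to
// glVertexAttribPointer(index, size, type, normalized, stride, pointer).
constexpr std::size_t kStride       = sizeof(InterleavedVertex);
constexpr std::size_t kNormalOffset = offsetof(InterleavedVertex, normal);
constexpr std::size_t kUvOffset     = offsetof(InterleavedVertex, uv);

// With a VBO bound, the "pointer" argument is really a byte offset into the
// buffer; this helper performs the usual cast.
inline const void *attribOffset(std::size_t bytes) {
    assert(bytes < sizeof(InterleavedVertex));
    return reinterpret_cast<const void *>(bytes);
}
```

A normal attribute would then be set up with something like glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, kStride, attribOffset(kNormalOffset)), where the attribute index 1 is again just an example.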

Another great optimization you can make on the vertex side is within your vertex shader. For instance, on PS3 an "if" statement on the Cell or the GPU is extremely costly because of the stalls it incurs.

You are worrying too much about how your bandwidth is used, but your bottleneck might be elsewhere. Having the fastest draw-call system on the planet might mean nothing if you are fill-rate capped or the like.

Quote:
For instance, on PS3 an "if" statement on the Cell or the GPU is extremely costly because of the stalls it incurs… You are worrying too much about how your bandwidth is used, but your bottleneck might be elsewhere.


Touchdown. I decided to go for the unique shader that does it all, and I have 4-5 if statements in my shader. I’m guessing this also applies to the GeForce 8 family. It was too good to be true…

We can’t bypass the multiple-shaders concept, can we? I despise the idea… damn it.
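
A middle ground that is often used, sketched here as an assumption rather than as anything proposed in this thread: keep one shader source, but turn each runtime "if" into a preprocessor switch and build one variant per feature combination on demand. The feature names and the PermutationCache class are hypothetical, and the cache stores only the generated #define block where a real renderer would store a compiled program handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical feature bits; each one replaces a runtime "if" in the shader.
enum Feature : std::uint32_t {
    kNormalMap = 1u << 0,
    kSkinning  = 1u << 1,
    kFog       = 1u << 2,
};

// Build the #define block for one feature combination. It would be placed
// after the "#version" line of the shared GLSL source before compiling.
inline std::string buildDefines(std::uint32_t features) {
    assert((features & ~(kNormalMap | kSkinning | kFog)) == 0); // known bits only
    std::string s;
    if (features & kNormalMap) s += "#define USE_NORMAL_MAP 1\n";
    if (features & kSkinning)  s += "#define USE_SKINNING 1\n";
    if (features & kFog)       s += "#define USE_FOG 1\n";
    return s;
}

// Cache keyed by feature mask, so each variant is built only once.
class PermutationCache {
public:
    const std::string &get(std::uint32_t features) {
        auto it = cache_.find(features);
        if (it == cache_.end())
            it = cache_.emplace(features, buildDefines(features)).first;
        return it->second;
    }
    std::size_t size() const { return cache_.size(); }

private:
    std::map<std::uint32_t, std::string> cache_;
};
```

This still means several shader programs at run time, but only one source file to maintain, and the same feature mask can double as a sort key for grouping draw calls.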

Quote:
Check the specifications of the most common graphics cards and align your vertex memory structure to their cache block fetch size. Also make sure your data is interleaved.


You mean there is an ultimate way to build VBOs? I’m curious to find out how I can optimize this:


void OpenGL::SetVertexBuffer(Geometry *in_geometry)
{
    VertexBuffer *l_vbo = in_geometry->GetVertexBuffer();
    UInt &l_id = l_vbo->GetArrayId();

    if (l_id == 0 || in_geometry->IsState(STATE_RESET_VOB))
    {
        // Create the buffer name once. On a reset the same name is reused:
        // glBufferData below replaces the old data store, so no delete is
        // needed. (The posted version deleted the buffer but kept the stale
        // name in l_id, and binding a deleted name is an error in core profiles.)
        if (l_id == 0)
            glGenBuffers(1, &l_id);

        glBindBuffer(GL_ARRAY_BUFFER, l_id);

        // Allocate storage for all attributes; each one is uploaded below
        // at its own offset.
        glBufferData(GL_ARRAY_BUFFER, l_vbo->GetArraySize(), NULL, l_vbo->GetType());

        for (UInt i = 0; i < in_geometry->GetUVSets().GetCount(); ++i)
        {
            // The "[i]" is restored here; it was missing from the posted code,
            // most likely swallowed by the forum markup.
            ArrayBuffer<Vector2f> &l_uvs = in_geometry->GetUVSets()[i]->GetUVs();
            glBufferSubData(GL_ARRAY_BUFFER, l_uvs.GetOffset(), l_uvs.GetSize(), l_uvs.GetData());
        }

        if (in_geometry->GetVertexColors().IsValid())
            glBufferSubData(GL_ARRAY_BUFFER, in_geometry->GetVertexColors().GetOffset(), in_geometry->GetVertexColors().GetSize(), in_geometry->GetVertexColors().GetData());

        if (in_geometry->GetCurvatures().IsValid())
            glBufferSubData(GL_ARRAY_BUFFER, in_geometry->GetCurvatures().GetOffset(), in_geometry->GetCurvatures().GetSize(), in_geometry->GetCurvatures().GetData());

        if (in_geometry->GetEdgeFlags().IsValid())
            glBufferSubData(GL_ARRAY_BUFFER, in_geometry->GetEdgeFlags().GetOffset(), in_geometry->GetEdgeFlags().GetSize(), in_geometry->GetEdgeFlags().GetData());

        if (in_geometry->GetFogCoords().IsValid())
            glBufferSubData(GL_ARRAY_BUFFER, in_geometry->GetFogCoords().GetOffset(), in_geometry->GetFogCoords().GetSize(), in_geometry->GetFogCoords().GetData());

        if (in_geometry->GetNormals().IsValid())
            glBufferSubData(GL_ARRAY_BUFFER, in_geometry->GetNormals().GetOffset(), in_geometry->GetNormals().GetSize(), in_geometry->GetNormals().GetData());

        if (in_geometry->GetUTangents().IsValid())
            glBufferSubData(GL_ARRAY_BUFFER, in_geometry->GetUTangents().GetOffset(), in_geometry->GetUTangents().GetSize(), in_geometry->GetUTangents().GetData());

        if (in_geometry->GetVTangents().IsValid())
            glBufferSubData(GL_ARRAY_BUFFER, in_geometry->GetVTangents().GetOffset(), in_geometry->GetVTangents().GetSize(), in_geometry->GetVTangents().GetData());

        if (in_geometry->GetVertices().IsValid())
            glBufferSubData(GL_ARRAY_BUFFER, in_geometry->GetVertices().GetOffset(), in_geometry->GetVertices().GetSize(), in_geometry->GetVertices().GetData());

        if (in_geometry->GetWeights().IsValid())
            glBufferSubData(GL_ARRAY_BUFFER, in_geometry->GetWeights().GetOffset(), in_geometry->GetWeights().GetSize(), in_geometry->GetWeights().GetData());

        if (in_geometry->GetDeformers().IsValid())
            glBufferSubData(GL_ARRAY_BUFFER, in_geometry->GetDeformers().GetOffset(), in_geometry->GetDeformers().GetSize(), in_geometry->GetDeformers().GetData());
    }
    else
    {
        // Buffer already exists: only re-upload the positions when they change.
        glBindBuffer(GL_ARRAY_BUFFER, l_id);

        if (in_geometry->GetVertices().IsDynamic())
            glBufferSubData(GL_ARRAY_BUFFER, in_geometry->GetVertices().GetOffset(), in_geometry->GetVertices().GetSize(), in_geometry->GetVertices().GetData());
    }
}


[Edited by - golgoth13 on September 2, 2010 2:57:01 AM]

Quote:
Original post by golgoth13
I'm currently working with glBindBuffer, glBufferData, and glBufferSubData (for vertex attributes).

Does anybody have experience with this to share, and advice on how to achieve good results?

If you're doing dynamic updates, I strongly suggest you search this forum and the opengl.org forums for when glBuffer[Sub]Data has better performance and when glMapBuffer[Range] + memcpy has better performance. You probably want to support both mechanisms for updates and your program would decide based on the size of the data which method to use.
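
A sketch of the "decide based on the size of the data" idea, kept free of GL calls so it stands alone. UploadPolicy and chooseUploadPath are hypothetical names. Neither the crossover size nor which path wins on which side of it is stated in this thread, so both are left as fields to fill in from your own measurements.

```cpp
#include <cassert>
#include <cstddef>

// The two update mechanisms being compared.
enum class UploadPath {
    BufferSubData, // glBufferSubData (or glBufferData)
    MapBufferRange // glMapBufferRange (or glMapBuffer) followed by memcpy
};

// Hypothetical policy, filled in from measurements on the target driver.
struct UploadPolicy {
    std::size_t crossoverBytes; // size at which the faster path changes
    UploadPath  below;          // path for updates smaller than crossoverBytes
    UploadPath  atOrAbove;      // path for updates of crossoverBytes or more
};

// Pick the update path for one upload of the given size.
inline UploadPath chooseUploadPath(const UploadPolicy &policy, std::size_t bytes) {
    assert(bytes > 0); // an empty update should not reach this point
    return bytes < policy.crossoverBytes ? policy.below : policy.atOrAbove;
}
```

The caller would switch on the returned value and issue the matching GL calls. Keeping the policy in data means it can differ per vendor or driver version without touching the upload code.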

[Edit] And let's not forget about bindless VBOs.
My only concern here is that if you're not careful, I could imagine the possibility that the driver might end up optimizing better than you...

Quote:
And let's not forget about bindless VBOs.


Just out of curiosity: bindless VBOs look promising indeed, but it’s a hardware-specific feature! If it’s so good, why not make it part of the OpenGL core?

When I see hardware-specific extensions I dismiss them automatically… I’m not sure if I should get a new perspective on this. As it is for now, I’m not touching gl*NV extensions with a 10-foot pole. Perhaps this still belongs in the “It’s all relative” department.

Quote:
Original post by golgoth13
Just out of curiosity: bindless VBOs look promising indeed, but it’s a hardware-specific feature! If it’s so good, why not make it part of the OpenGL core?

When I see hardware-specific extensions I dismiss them automatically… I’m not sure if I should get a new perspective on this. As it is for now, I’m not touching gl*NV extensions with a 10-foot pole. Perhaps this still belongs in the “It’s all relative” department.

There's nothing wrong with supporting multiple hardware targets so that you can leverage whatever is available. The number of combinations is quite small, actually--NVIDIA and ATI, OpenGL 4 capable hardware or not--that's at most four variations. Your VBOs should be abstracted with some wrapper class(es) and then you can hide the complexity of using whichever of the three approaches for VBO updates mentioned here.
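
Roughly what such a wrapper could look like, as a sketch only. The class names, the Caps flags and the order of preference in makeUpdater are all assumptions, and the update bodies are left as comments because the real GL and NV calls need a context to run.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Common interface the rest of the renderer talks to.
class VertexBufferUpdater {
public:
    virtual ~VertexBufferUpdater() = default;
    virtual std::string name() const = 0;
    virtual void update(std::size_t offset, std::size_t bytes, const void *data) = 0;
};

class SubDataUpdater : public VertexBufferUpdater {
public:
    std::string name() const override { return "glBufferSubData"; }
    void update(std::size_t, std::size_t bytes, const void *data) override {
        assert(data != nullptr && bytes > 0);
        // glBufferSubData(GL_ARRAY_BUFFER, offset, bytes, data) would go here.
    }
};

class MapRangeUpdater : public VertexBufferUpdater {
public:
    std::string name() const override { return "glMapBufferRange"; }
    void update(std::size_t, std::size_t bytes, const void *data) override {
        assert(data != nullptr && bytes > 0);
        // glMapBufferRange + memcpy + glUnmapBuffer would go here.
    }
};

class BindlessUpdater : public VertexBufferUpdater {
public:
    std::string name() const override { return "bindless"; }
    void update(std::size_t, std::size_t bytes, const void *data) override {
        assert(data != nullptr && bytes > 0);
        // The vendor-specific (NV) path would go here.
    }
};

// Hypothetical capability flags, queried once at start-up.
struct Caps {
    bool hasMapBufferRange;
    bool hasNvBindless;
};

// Pick a backend. The order of preference is an assumption for this sketch.
inline std::unique_ptr<VertexBufferUpdater> makeUpdater(const Caps &caps) {
    if (caps.hasNvBindless)
        return std::unique_ptr<VertexBufferUpdater>(new BindlessUpdater);
    if (caps.hasMapBufferRange)
        return std::unique_ptr<VertexBufferUpdater>(new MapRangeUpdater);
    return std::unique_ptr<VertexBufferUpdater>(new SubDataUpdater);
}
```

The calling code only ever holds a VertexBufferUpdater, so adding or dropping a vendor path touches makeUpdater and nothing else.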

Quote:
Original post by golgoth13
Just out of curiosity: bindless VBOs look promising indeed, but it’s a hardware-specific feature! If it’s so good, why not make it part of the OpenGL core?


Bindless is not a hardware feature but a vendor-specific extension, and it is quite useful if you have thousands of VBOs. It is hard to achieve more than a 1.3x to 2.0x speedup, but that is not negligible. In short, by avoiding VAOs and using bindless in a scene with thousands of VBOs, a speed boost of about two times is likely. I have been using bindless for some time and have had a very pleasant experience.
