# Bad performance when rendering medium amount of meshes

I'm getting massive performance loss when rendering a medium-sized scene with a reasonable(?) amount of meshes.

The stats for all rendered (non-occluded) objects in the scene are:

Triangles: 204432
Vertices: 68449
Material Changes (glBindTexture): 239
Meshes: 7153
Render Duration: 23ms (~43fps)


I know that rendering a lot of low-poly meshes is a lot more expensive than rendering a handful of high-poly meshes, but still, 7153 meshes with an average of ~28 triangles doesn't seem like a big deal to me, and yet the performance goes down the drain.

Before rendering, all of my meshes are first sorted by shader, then by material. The main render process is as follows:

    foreach shader
        foreach material
            glBindTexture(material)
            foreach mesh
                glBindVertexArray(vao) // Vertex Array (Vertex +UV +Normal Buffers)
                glBindBuffer(ibo) // Index Buffer
                glDrawElements(GL_TRIANGLES,vertexCount,GL_UNSIGNED_INT,(void*)0)
            end
        end
    end


(Pseudo Code)
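
The "sort by shader, then by material" step described above is often implemented by packing both IDs into one integer and sorting on that. A minimal sketch, assuming each shader and material has a 32-bit numeric ID; `makeSortKey` and the 32/32 bit split are hypothetical, not taken from the original post:

```cpp
#include <cstdint>

// Hypothetical sort key: shader ID in the high 32 bits, material ID in the
// low 32 bits. Sorting draw items by this key in ascending order groups them
// by shader first, then by material within each shader.
constexpr std::uint64_t makeSortKey(std::uint32_t shaderId, std::uint32_t materialId) {
    return (static_cast<std::uint64_t>(shaderId) << 32) | materialId;
}
```

Any draw item with a lower shader ID sorts before every item with a higher one, regardless of material, which is what keeps shader switches to a minimum.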

I have a decent graphics card (AMD Radeon R9 200 Series) which I believe should be able to handle a lot more stress than this. I've spent hours profiling with both CPU and GPU profilers, debugging, trying various optimization methods, but the bottleneck is definitely the central rendering process (Code above).

Is the amount of meshes really the problem here? If not, what could be causing this massive decrease in performance?

I'm not looking for culling methods, right now I'm just trying to improve my rendering pipeline.

##### Share on other sites

Do you have a lot of overdraw (many overlapping pixels)? What you described is the number of vertices, but there are usually A LOT more pixels than vertices when rendering a mesh. You can quickly check whether fill rate (overdraw, or just a slow pixel shader) is the problem by changing the window size (this changes the number of pixels but keeps the vertex count about the same).

EDIT:

btw, you can associate the index buffer with the VAO (just like VBOs) if you don't specifically need to use multiple index buffers with the same VAO

and you probably should use 16-bit indices if your meshes only have ~28 verts.

Edited by Waterlimon

##### Share on other sites

It is the number of draw calls you are making. Every time you call glDrawElements you incur an overhead. You are much better off combining meshes together. Like sethhope said, you will want to combine anything static into batches. If you have a bunch of crates in your scene, for example, combine them all into a single mesh and draw that once instead of drawing them individually.
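
To put rough numbers on that overhead, the figures from the first post can be turned into a per-draw cost. This is only back-of-the-envelope arithmetic, under the strong assumption that the whole 23 ms is draw-call overhead; the constant names are made up for illustration:

```cpp
// Figures taken from the stats in the first post.
constexpr double kRenderMs  = 23.0;
constexpr int    kDrawCalls = 7153;
constexpr int    kTriangles = 204432;

// Average time attributed to one draw call, in microseconds, assuming
// (strongly) that per-draw overhead accounts for the whole render time.
constexpr double kMicrosPerDraw = kRenderMs * 1000.0 / kDrawCalls;

// Average number of triangles submitted per draw call.
constexpr double kTrianglesPerDraw = static_cast<double>(kTriangles) / kDrawCalls;
```

That works out to a little over 3 microseconds and fewer than 29 triangles per call, so each call submits very little work to offset its fixed cost.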

##### Share on other sites

Yeah, draw calls. About 1,000 is a decent maximum to aim for. About 4,000 is a rough upper limit for the older APIs.

You want both instancing and mesh combining (baking). Which is better is a trade-off you have to evaluate for your specific case.

Baking "all crates" is a bad idea, since that creates a single mesh that spans your whole level, which is really bad for culling and any dynamic bounding box system you have in place. You can bake localized clusters of objects, but then you lose out on instancing, and of course the objects must be static for baking to be possible.

Instancing doesn't scale forever, though, so just relying on instancing to solve everything isn't a guaranteed solution either. But for ~7,000 objects, it's almost certainly what you want, assuming most of those 7,000 objects are the same mesh drawn with different transforms.

##### Share on other sites
How are you measuring time? I'm guessing that's total CPU per frame?
Add some more timing code to measure glSwapBuffers, so you can exclude it from the per-frame total. Also get some timings for how long your mesh loop takes.
You can also use ARB_timer_query to measure GPU time per frame.

If your problem is that your GPU time per frame is the bottleneck, then you'll have to optimize your shaders / data formats / overdraw / etc.
If your problem is that your CPU time per frame is the bottleneck, then it's a more traditional optimization problem. Measure your CPU-side code to see where the time is going.

##### Share on other sites

What is also bad, unless this is just a test, is that your objects are so small that every 28 triangles it draws, you have to stall the GPU to figure out what is going to happen next and setup things.  You want the GPU to just draw as many triangles in one go as you can.

##### Share on other sites

Material Changes (glBindTexture): 239

This is extreme. Post your GPU first, so we can tell whether you have a performance issue at all.

##### Share on other sites

Uhm, am I wrong or might it just be the high number of buffer binds (both the VAO and the index buffer)?

It would be way better to pack stuff into bigger VAOs and use an offset in glDrawElements.

##### Share on other sites

glBindVertexArray(vao) // Vertex Array (Vertex +UV +Normal Buffers)
Yep, that's going to be slow.

As Ryokeen suggested, you totally can pack meshes into a single buffer and just send an offset to the draw call. Check ARB_draw_elements_base_vertex; there are similar calls for drawing plain arrays and for instanced array/element draws.

That way you can just pack all your static meshes in one big buffer, managing the offsets yourself (which is fun :P) and have only a couple VAO switches. Since you're essentially doing memory management there, you need to have in mind things like memory fragmentation (ie, what happens if you pack 500 meshes then remove 200 randomly from the same buffer, things get fragmented), so beware.

The idea is not to use VAOs to specify "this is a single mesh that I can draw and the buffers attached have only that mesh" but more like "this is one kind of vertex format I support, and the buffers attached have tons of meshes with the same format".
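
The bookkeeping for such a shared buffer can be sketched as follows. This is a minimal append-only version, assuming 16-bit indices; `MeshRange`, `BufferPacker` and `exampleSecondMesh` are hypothetical names, and because nothing is ever freed it sidesteps the fragmentation problem mentioned above rather than solving it:

```cpp
#include <cstddef>
#include <cstdint>

// Where one mesh lives inside the shared vertex and index buffers.
struct MeshRange {
    std::int32_t baseVertex;      // value for the basevertex argument
    std::size_t  indexByteOffset; // byte offset of the mesh's first index
    std::int32_t indexCount;      // number of indices to draw
};

// Hypothetical append-only packer: hands out consecutive ranges, never frees.
struct BufferPacker {
    std::int32_t vertexCursor = 0; // vertices packed so far
    std::size_t  indexCursor  = 0; // indices packed so far

    constexpr MeshRange append(std::int32_t vertexCount, std::int32_t indexCount) {
        MeshRange range{vertexCursor, indexCursor * sizeof(std::uint16_t), indexCount};
        vertexCursor += vertexCount;
        indexCursor  += static_cast<std::size_t>(indexCount);
        return range;
    }
};

// Example: pack a 24-vertex / 36-index mesh, then an 8-vertex / 12-index
// mesh, and report where the second one landed.
constexpr MeshRange exampleSecondMesh() {
    BufferPacker packer{};
    packer.append(24, 36);
    return packer.append(8, 12);
}
```

Each stored `MeshRange` would then feed one call along the lines of `glDrawElementsBaseVertex(GL_TRIANGLES, range.indexCount, GL_UNSIGNED_SHORT, (void*)range.indexByteOffset, range.baseVertex)`, with no VAO or buffer rebinds between meshes.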

##### Share on other sites

That way you can just pack all your static meshes in one big buffer, managing the offsets yourself (which is fun :P) and have only a couple VAO switches. Since you're essentially doing memory management there, you need to have in mind things like memory fragmentation (ie, what happens if you pack 500 meshes then remove 200 randomly from the same buffer, things get fragmented), so beware.

So, basically I need 3 "global" buffers (1 for vertices, 1 for normals, 1 for uv coordinates), then pack all static (Why just static? My dynamic meshes have the same format, can't I just include them as well?) mesh data in those three. During rendering I then just bind these three buffers once at the beginning (=1 vao switch) and use glDrawElementsBaseVertex for each mesh with the appropriate offset.

How are you measuring time? I'm guessing that's total CPU per frame?

No, it's just the time for the render loop (The pseudo code). I've used std::chrono::high_resolution_clock to measure it, so it's just the CPU time. I'll give ARB_timer_query a try.

According to the profiler "Very Sleepy", the main CPU bottleneck is with "DrvPresentBuffers". I'm not sure if that means it's the GPU itself, or the synchronization/data transfer from CPU to GPU.

If your problem is that your GPU time per frame is the bottleneck, then you'll have to optimize your shaders / data formats / overdraw / etc.
If your problem is that your CPU time per frame is the bottleneck, then it's a more traditional optimization problem. Measure your CPU-side code to see where the time is going.

I'm pretty sure the shader isn't the problem, the fps stay the same even if I simply discard all fragments and deactivate the vertex shader.

Changing the resolution also changes nothing (I've tried switching between 640x480 and 1920x1080, fps is the same), so I think I can also throw out overdraw as a possible candidate?

Edited by Silverlan

##### Share on other sites
You don't need one buffer per attribute, you can put them all in the same buffer (either interleaved or separate).

##### Share on other sites

You don't need one buffer per attribute, you can put them all in the same buffer (either interleaved or separate).

Hm... I don't think I understand how that's supposed to work.

So, I create a single buffer, and push all of my vertex, normal, uv and index data into that buffer:

V = Vertex

N = Normal

I = Index

|x| = 4 Bytes

Buffer Data: ...|V1|V1|V1|V2|V2|V2|V3|V3|V3|V4|V4|V4|N1|N1|N1|N2|N2|N2|N3|N3|N3|N4|N4|N4|UV1|UV1|UV2|UV2|UV3|UV3|UV4|UV4|I1|I2|I3|I4|I5|I6|...

Then, during rendering, I can use glDrawElementsBaseVertex to point it to the first index (I1) and draw the mesh:

offsetToFirstIndex = grabOffset()

glDrawElementsBaseVertex(GL_TRIANGLES,2,GL_UNSIGNED_INT,(void*)0,offsetToFirstIndex)

But what about the normals and uv coordinates? I'd still have to use glVertexAttribPointer for both to specify their respective offsets, which means I'd still need a VAO for each mesh.

What am I missing?

##### Share on other sites

Interleave the attributes per vertex:

V1|V1|V1|N1|N1|N1|UV1|UV1|V2|V2|V2|N2|N2|N2|UV2|UV2|V3|V3|V3|N3|N3|N3|UV3|UV3|V4|V4|V4|N4|N4|N4|UV4|UV4

and use GL_ARB_vertex_attrib_binding so that the buffer is decoupled from the format.

##### Share on other sites

You don't need one buffer per attribute, you can put them all in the same buffer (either interleaved or separate).

Hm... I don't think I understand how that's supposed to work.
So, I create a single buffer, and push all of my vertex, normal, uv and index data into that buffer.
I'd still have to use glVertexAttribPointer for both to specify their respective offsets, which means I'd still need a VAO for each mesh.
What am I missing?

I didn't mention VAO's. Just that you can use a single VBO for all attributes, rather than multiple VBOs.

Some people recommend making one VAO per "type" of object / "set of attributes" -- e.g. a VAO for use when your object has positions + normals, and a different VAO for when your object has positions + normals + UV's (and then you constantly modify these VAOs for different objects, if they use different VBO pointers).
Other people recommend using one VAO per object so that you don't have to modify them.
Other people still, recommend using one global VAO that's shared by everything, and just constantly modify the hell out of it for every draw call as if you're writing pre-VAO OpenGL code...

Also, it's common for attributes to be interleaved, such as:
V1|N1|UV1||V2|N2|UV2||V3|N3|UV3
or stored as separate contiguous arrays within the same buffer, such as:
V1|V2|V3||N1|N2|N3||UV1|UV2|UV3
Or, semi-interleaving is also common if the same mesh is rendered with multiple different shaders. e.g. if one shader only reads positions, but another reads position+normal+uv, then you might use a compromise layout such as:
V1|V2|V3||N1|UV1||N2|UV2||N3|UV3

These layouts make good use of the stride and offset parameters when binding a VBO.
If multiple different objects exist within the same VBO, you don't have to rebind the VBO/attributes (i.e. don't have to modify the VAO) -- the offset from glDrawElementsBaseVertex is added to the offset that was passed into glVertexAttribPointer for each attribute.

Buffer Data: ...|V1|V1|....|UV4|I1|I2|I3|I4|I5|I6|...
Then, during rendering, I can use glDrawElementsBaseVertex to point it to the first index (I1) and draw the mesh:
offsetToFirstIndex = grabOffset()
glDrawElementsBaseVertex(GL_TRIANGLES,2,GL_UNSIGNED_INT,(void*)0,offsetToFirstIndex)

In your example code, you're passing "0" as the index buffer offset, and "offsetToFirstIndex" as the vertex buffer offset.

Indices will be fetched from the index buffer (aka "element array buffer") using: eab + 0 * sizeof(byte).
Vertex attributes will be fetched from their vbo using: attribute.vbo + attribute.offset * sizeof(byte) + (index + offsetToFirstIndex) * attribute.stride

So you want to replace that "0" with the offset in bytes from the start of the buffer to the first index that you want to fetch, and replace "offsetToFirstIndex" with the offset in vertices that should be added to every index value that is fetched / how far into the VBO vertex #0 is located. Because this parameter is an offset-in-vertices, not an offset-in-bytes, it works across all attributes equally, no matter how they are laid out in memory.
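
The fetch formula above can be written out as a small function to check offsets by hand. This is a sketch only: `attributeByteOffset` is a hypothetical helper, and it assumes the index plus the base vertex is never negative:

```cpp
#include <cstddef>
#include <cstdint>

// Byte offset inside the VBO from which one attribute of one vertex is read.
//   attribOffset: the pointer/offset passed to glVertexAttribPointer
//   stride:       the stride passed to glVertexAttribPointer
//   index:        the value fetched from the index buffer
//   baseVertex:   the last argument of glDrawElementsBaseVertex
constexpr std::size_t attributeByteOffset(std::size_t attribOffset, std::size_t stride,
                                          std::uint32_t index, std::int32_t baseVertex) {
    // The base vertex is added to the fetched index before the stride is applied.
    return attribOffset +
           static_cast<std::size_t>(static_cast<std::int64_t>(index) + baseVertex) * stride;
}
```

For a 32-byte vertex with the normal at byte 12, index 2 of a mesh whose base vertex is 100 reads its normal from byte 12 + 102 * 32 = 3276. The same base vertex works for every attribute because it is counted in vertices, not bytes.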

Also, GL_UNSIGNED_SHORT is a much more performant (and memory-saving) format for indices -- try to use 16bit indices instead of 32bit ones wherever possible, even on modern hardware (and especially on old hardware).
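
A sketch of how the index type could be chosen per mesh, assuming primitive restart is not in use. `chooseIndexFormat` and `IndexFormat` are hypothetical names; the two constants are the numeric values of the GL enums, copied so the snippet compiles without GL headers:

```cpp
#include <cstddef>
#include <cstdint>

// Numeric values of the OpenGL enums, duplicated here to avoid GL headers.
constexpr unsigned kGlUnsignedShort = 0x1403; // GL_UNSIGNED_SHORT
constexpr unsigned kGlUnsignedInt   = 0x1405; // GL_UNSIGNED_INT

struct IndexFormat {
    unsigned    glType;        // value to pass as the draw call's type argument
    std::size_t bytesPerIndex; // size of one index in the index buffer
};

// 16-bit indices can address vertices 0..65535, so any mesh with at most
// 65536 vertices fits; larger meshes fall back to 32-bit indices.
constexpr IndexFormat chooseIndexFormat(std::size_t vertexCount) {
    return vertexCount <= 65536
        ? IndexFormat{kGlUnsignedShort, sizeof(std::uint16_t)}
        : IndexFormat{kGlUnsignedInt, sizeof(std::uint32_t)};
}
```

Because the base vertex is added after the index is fetched, the limit applies per mesh rather than to the whole shared buffer, so meshes with ~28 vertices fit comfortably even when thousands of them share one VBO.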

Edited by Hodgman

##### Share on other sites

Thank you, but I'm still unclear on a couple of things.

I've switched the data order to:

V1|N1|UV1|V2|N2|UV2|V3|N3|UV3

But what about the indices? Is it not possible to just append them to the same buffer (i.e. V1|N1|UV1|V2|N2|UV2|V3|N3|UV3|I1|I2|I3|I4|I5|I6), or is an element buffer absolutely required?

Either way, I've created a test-scenario with just one object and no vao.

There are two buffers, the vbo with the data as described above, and the element buffer with the vertex indices.

During rendering I then use:

    glBindBuffer(GL_ARRAY_BUFFER,dataBuffer) // vbo

    // Vertex Data
    glEnableVertexAttribArray(0)
    glVertexAttribPointer(
        0,
        3, // 3 Floats
        GL_FLOAT,
        GL_FALSE,
        sizeof(float) *5, // Offset between vertices is sizeof(normal) +sizeof(uv)
        (void*)0 // First vertex starts at the beginning
    );

    // Normal Data
    glEnableVertexAttribArray(1)
    glVertexAttribPointer(
        1,
        3, // 3 Floats
        GL_FLOAT,
        GL_FALSE,
        sizeof(float) *5, // Offset between normals is sizeof(uv) +sizeof(vertex)
        (void*)(sizeof(float) *3) // First normal starts after first vertex
    );

    // UV Data
    glEnableVertexAttribArray(2)
    glVertexAttribPointer(
        2,
        2, // 2 Floats
        GL_FLOAT,
        GL_FALSE,
        sizeof(float) *6, // Offset between uvs is sizeof(vertex) +sizeof(normal)
        (void*)(sizeof(float) *6) // First uv starts after first normal
    );

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER,indexBuffer); // index/element buffer
    glDrawElementsBaseVertex(
        GL_TRIANGLES,
        numTriangles,
        GL_UNSIGNED_INT,
        (void*)0, // For testing purposes; Index buffer contains only one mesh, which starts at index 0
        0 // Not sure about this one? VBO vertex #0 is located at position 0 in the data buffer
    );


(I know this isn't effective code, I'm doing it this way to help me understand. I'll optimize it once I got it working)

The mesh is rendered, however not correctly (Vertices, normals and uv coordinates are wrong).

Edited by Silverlan

##### Share on other sites

I've switched the data order to:
V1|N1|UV1|V2|N2|UV2|V3|N3|UV3

During rendering I then use:

...
The mesh is rendered, however not correctly (Vertices, normals and uv coordinates are wrong).
Your stride and offset values are all wacky.

If your vertex structure looks like:

struct Vertex { float px, py, pz, nx, ny, nz, u, v; };

Then:

position has offset=0, stride=sizeof(Vertex)

normal has offset = sizeof(float)*3, stride=sizeof(Vertex)

uv has offset = sizeof(float)*6, stride=sizeof(Vertex)

Offset is the number of bytes from the start of the vertex to the start of that element. Stride is how many bytes you have to advance by to arrive at the next vertex's value of that same element.

But what about the indices? Is it not possible to just append them to the same buffer (i.e. V1|N1|UV1|V2|N2|UV2|V3|N3|UV3|I1|I2|I3|I4|I5|I6), or is an element buffer absolutely required?
"Element array buffer" is a binding slot in the API. You should be able to bind the same VBO to that slot (and then the indices parameter to glDrawElementsBaseVertex is an offset in bytes into that VBO of where to read the indices from).

##### Share on other sites

You'll find that things will be much easier for you if you define a struct for your vertex format; e.g:

    struct myVertex {
        float Position[3];
        float Normal[3];
        float TexCoord[2];
    };

You'll then be able to use sizeof (myVertex) and offsetof (myVertex, Position) instead of messing around with multiplying sizeof (float) by whatever and introducing possibility of errors into your program.
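
As a sketch of that suggestion, the offset and stride for each attribute can be derived from the struct at compile time. `AttribLayout` and the three `k...Layout` constants are hypothetical names for illustration, and the sketch assumes a 4-byte `float`:

```cpp
#include <cstddef>

struct myVertex {
    float Position[3];
    float Normal[3];
    float TexCoord[2];
};

// One entry per glVertexAttribPointer call.
struct AttribLayout {
    int         componentCount; // the size argument (number of floats)
    std::size_t offset;         // byte offset from the start of the vertex
    std::size_t stride;         // bytes from one vertex to the next
};

// Every attribute shares the same stride: the size of the whole vertex.
constexpr AttribLayout kPositionLayout{3, offsetof(myVertex, Position), sizeof(myVertex)};
constexpr AttribLayout kNormalLayout  {3, offsetof(myVertex, Normal),   sizeof(myVertex)};
constexpr AttribLayout kTexCoordLayout{2, offsetof(myVertex, TexCoord), sizeof(myVertex)};

// Guard against the compiler inserting padding we did not expect.
static_assert(sizeof(myVertex) == 8 * sizeof(float), "unexpected padding in myVertex");
```

Each entry maps onto one call of the form `glVertexAttribPointer(location, layout.componentCount, GL_FLOAT, GL_FALSE, layout.stride, (void*)layout.offset)`, so adding or reordering struct members cannot silently leave stale hand-computed numbers behind.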

##### Share on other sites

But what about the indices? Is it not possible to just append them to the same buffer (i.e. V1|N1|UV1|V2|N2|UV2|V3|N3|UV3|I1|I2|I3|I4|I5|I6), or is an element buffer absolutely required?
"Element array buffer" is a binding slot in the API. You should be able to bind the same VBO to that slot (and then the indices parameter to glDrawElementsBaseVertex is an offset in bytes into that VBO of where to read the indices from).

Thanks, I think I got it now.

One more thing however:

The last parameter of glDrawElementsBaseVertex is an offset in vertices. So that means I can't add the indices to the same buffer after all, unless I make sure they 'fit' into 'packages' the same size as a multiple of that of one vertex (sizeof(position) +sizeof(normal) +sizeof(uv)) by leaving some bytes after the indices unused, is that correct?

I think just using a second global buffer for the indices would be easier and shouldn't have any performance impact, since I still only need to bind it once, right?

// EDIT:

Well, sadly it turns out that this method didn't help at all, in fact it made matters worse. Execution time for the render loop went up from 23ms to 39ms. My render loop now looks like this:

    glBindVertexArray(vao) // Enables the vertex/uv/normal arrays and binds the global buffer
    foreach material
        glBindTexture(material)
        foreach mesh
            glDrawElementsBaseVertex(GL_TRIANGLES,vertexCount,GL_UNSIGNED_INT,...);
        end
    end


There are no buffer binds or changes at all between meshes, but since the performance dropped substantially, I suppose that wasn't the issue after all. Or perhaps glDrawElementsBaseVertex just isn't supported by my drivers very well?

// EDIT2:

On second try, it did actually increase performance slightly, but it's definitely not the main bottleneck.

Edited by Silverlan

##### Share on other sites

The last parameter of glDrawElementsBaseVertex is an offset in vertices. So that means I can't add the indices to the same buffer after all, unless I make sure they 'fit' into 'packages' the same size as a multiple of that of one vertex (sizeof(position) +sizeof(normal) +sizeof(uv)) by leaving some bytes after the indices unused, is that correct?
I think just using a second global buffer for the indices would be easier and shouldn't have any performance impact, since I still only need to bind it once, right?

It doesn't mean anything about where you put the indices, because it's an offset of where to begin reading vertices from... not where to read indices from.
It does mean that if you want to put different types of vertices into the same buffer, you can end up with alignment considerations though, yeah.
Yes, putting indices into a separate buffer is much easier to handle

it's definitely not the main bottleneck

Put some timing code into your loops to find out what part is taking up the most time.

##### Share on other sites

Well, I've run into another impasse.

I've decided to add the indices to the same buffer as the vertex data, so the structure of the global buffer now looks like this:

V1|N1|UV1|V2|N2|UV2|V3|N3|UV3|I1|I2|I3|I4|...

This works just fine.

However some meshes require additional vertex data aside from the positions, normals and uv coordinates. All vertices in the global buffer need to have the same structure, otherwise I run into problems when rendering shadows (Which skip the normal +uv data and don't need to know about the additional data (except in a few special cases)).

My initial idea was that I could keep the format of the global buffer (Positions, Normals, UV and Indices), and create a separate buffer for each mesh that requires additional data. This would result in more buffer changes during rendering, however since these type of meshes are a lot more uncommon than regular meshes, it wouldn't be a problem.

So, basically all regular vertex data is still stored in the global buffer.

All meshes with additional data have an additional buffer, which contains said data.

This is fine in theory, however the last parameter of "glDrawElementsBaseVertex" basically makes that impossible from what I can tell.

I'd need the basevertex to only affect the global buffer, but not the additional buffer (Because the additional buffer only contains data for the mesh that is currently being rendered). Is that in any way possible?

If not, what are my options?

Do I have to separate these types of meshes from the global buffer altogether, and just use my old method?

##### Share on other sites

If not, what are my options?

Do I have to separate these types of meshes from the global buffer altogether, and just use my old method?

In general, you should aim to pack as many attributes per vertex as possible into 32 bytes, to accommodate as many vertex programs as possible. Vertex alignment is actually one of the most important performance issues (so much so that if you have a 27-byte vertex, the driver will either insert padding bytes or leave it unpadded and render several times slower). When you batch geometry into a common buffer, then yes, the adjusted indices really do tie the draw to that particular vertex buffer, unless you pack and index the second buffer with the other attributes in the same way. The base vertex applies to the whole draw call, just as the index buffer is a single common thing.

##### Share on other sites

The answers here have helped me a lot, I've been able to increase the performance significantly, thanks everyone!

However, I haven't quite reached my goal yet.

I have a small scene with a bunch of models (trees) scattered all over the place:

The trees are still a major bottleneck, but I'm not sure what I can do to optimize it. I'm already doing frustum culling.

Occlusion queries wouldn't help, considering almost nothing is obstructed and most meshes are very small.

There are several different tree models with several LODs each, so instancing doesn't make much sense either.

The trees don't require any additional buffer changes (They're also part of the global buffer), but I believe the main problem stems from uploading the object matrices.

The matrices are a std140 uniform block inside the shader, and they're uploaded for each object using glBufferSubData. (I'm assuming there's no performance difference to using glUniform*?)

Since the trees are static, I could potentially create an array buffer during initialization and only upload the matrices once at the start. During rendering I'd then just have to upload an index.

However, is it even possible to tie an array buffer to a uniform/uniform block in that way? If so, how?

Also, can I bundle several glDrawElementsBaseVertex-calls together, similar to how display lists used to work, and then just call them as a batch somehow?

// Edit:

Another problem is that I'm using cascaded shadow mapping with 4 cascades, which means I have to bind the matrix of all shadow casters 5 times total.

This is especially problematic considering I can't use any culling when rendering shadows.

Edited by Silverlan

##### Share on other sites

There are several different tree models with several LODs each, so instancing doesn't make much sense either.

Instancing can still make sense. For example, in your screenshots, how many LODs of the visible trees are being used? 3? I see what, 250 trees? So 250 divided by 3 (assuming one base model) is still enough to justify using instancing.
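
A sketch of that grouping: bucket the visible trees by (model, LOD) and issue one instanced draw per non-empty bucket. The model and LOD counts and all names here (`Tree`, `BatchCounts`, `countInstances`, `drawCallsNeeded`) are assumptions for illustration, not from the thread:

```cpp
#include <cstddef>

constexpr std::size_t kModelCount = 4; // assumed number of distinct tree models
constexpr std::size_t kLodCount   = 3; // assumed number of LODs per model

struct Tree {
    std::size_t model; // which tree model, in [0, kModelCount)
    std::size_t lod;   // which LOD is currently selected, in [0, kLodCount)
};

struct BatchCounts {
    std::size_t count[kModelCount * kLodCount]; // instances per (model, LOD) bucket
};

// Count how many visible trees fall into each (model, LOD) bucket.
template <std::size_t N>
constexpr BatchCounts countInstances(const Tree (&trees)[N]) {
    BatchCounts batches{};
    for (std::size_t i = 0; i < N; ++i) {
        batches.count[trees[i].model * kLodCount + trees[i].lod] += 1;
    }
    return batches;
}

// Each non-empty bucket needs one instanced draw call.
constexpr std::size_t drawCallsNeeded(const BatchCounts& batches) {
    std::size_t calls = 0;
    for (std::size_t i = 0; i < kModelCount * kLodCount; ++i) {
        if (batches.count[i] > 0) {
            ++calls;
        }
    }
    return calls;
}
```

With these assumed counts, the worst case is 12 draw calls for the trees no matter how many are visible, compared with one call per tree.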

##### Share on other sites

The matrices are a std140 uniform block inside the shader, and they're uploaded for each object using glBufferSubData. (I'm assuming there's no performance difference to using glUniform*?)

Actually it could be a huge difference. Calling glBufferSubData for each tree every frame will make the GPU wait for the CPU to upload the data. This can and will kill your performance. The only way to actually make UBOs perform better than glUniformX is to make a huge UBO for all your trees, upload all the matrices at once before rendering, and then for each tree use glBindBufferRange to bind the correct transforms. This will be nice and fast. At least that is my experience with UBOs. To avoid synchronization between GPU and CPU you should use buffer orphaning or manual synchronization. More info here: https://www.opengl.org/wiki/Buffer_Object_Streaming. And here: http://www.gamedev.net/topic/655969-speed-gluniform-vs-uniform-buffer-objects/
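
One detail to watch with the "one big UBO plus glBindBufferRange" approach is that the offset of each range has to be a multiple of the implementation's GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. A sketch of the offset arithmetic, assuming one 4x4 float matrix per object; `alignUp`, `matrixSlotOffset` and `uboSizeFor` are hypothetical helpers, and the alignment value must be queried at runtime rather than hard-coded:

```cpp
#include <cstddef>

// Round value up to the next multiple of alignment (alignment > 0).
constexpr std::size_t alignUp(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// One 4x4 float matrix as laid out in a std140 block: 64 bytes.
constexpr std::size_t kMat4Bytes = 16 * sizeof(float);

// Byte offset of object i's matrix inside the shared UBO. offsetAlignment is
// the value returned by glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...).
constexpr std::size_t matrixSlotOffset(std::size_t objectIndex, std::size_t offsetAlignment) {
    return objectIndex * alignUp(kMat4Bytes, offsetAlignment);
}

// Total buffer size needed for objectCount matrices at that slot spacing.
constexpr std::size_t uboSizeFor(std::size_t objectCount, std::size_t offsetAlignment) {
    return objectCount * alignUp(kMat4Bytes, offsetAlignment);
}
```

Per object, the bind would then look like `glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo, matrixSlotOffset(i, alignment), kMat4Bytes)`, with all matrices uploaded in a single call beforehand. If the driver reports an alignment of 256, each 64-byte matrix occupies a 256-byte slot, so 250 trees need 64000 bytes.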