Bad performance when rendering medium amount of meshes

24 comments, last by JohnnyCode 8 years, 6 months ago

Well, I've run into another impasse.

I've decided to add the indices to the same buffer as the vertex data, so the structure of the global buffer now looks like this:

V1|N1|UV1|V2|N2|UV2|V3|N3|UV3|I1|I2|I3|I4|...

This works just fine.

However, some meshes require additional vertex data aside from the positions, normals and UV coordinates. All vertices in the global buffer need to have the same structure, otherwise I run into problems when rendering shadows (which skip the normal + UV data and don't need the additional data, except in a few special cases).

My initial idea was that I could keep the format of the global buffer (positions, normals, UVs and indices) and create a separate buffer for each mesh that requires additional data. This would mean more buffer changes during rendering, but since these types of meshes are much less common than regular meshes, it wouldn't be a problem.

So, basically all regular vertex data is still stored in the global buffer.

All meshes with additional data have an additional buffer, which contains said data.

This is fine in theory, however the last parameter of "glDrawElementsBaseVertex" basically makes that impossible from what I can tell.

I'd need the basevertex to only affect the global buffer, but not the additional buffer (Because the additional buffer only contains data for the mesh that is currently being rendered). Is that in any way possible?

If not, what are my options?

Do I have to separate these types of meshes from the global buffer altogether, and just use my old method?



If not, what are my options?

Do I have to separate these types of meshes from the global buffer altogether, and just use my old method?

In general, you should aim to pack as many attributes per vertex as possible into 32 bytes, so the layout can accommodate as many vertex programs as possible. Vertex alignment is one of the most important performance issues (so much so that with, say, a 27-byte vertex, the driver will either insert padding bytes or render several times slower). When you batch geometries into a common buffer, yes, the shared index stream really does demand one particular vertex buffer layout, unless you pack, batch and index the second attribute buffer in exactly the same way. The base vertex in the draw call is common to the entire draw call, just as the index buffer is a single shared thing.

The answers here have helped me a lot, I've been able to increase the performance significantly, thanks everyone!

However, I haven't quite reached my goal yet.

I have a small scene with a bunch of models (trees) scattered all over the place:

[screenshot: scene with scattered trees]

The trees are still a major bottleneck, but I'm not sure what I can do to optimize it. I'm already doing frustum culling.

Occlusion queries wouldn't help, considering almost nothing is obstructed and most meshes are very small.

There are several different tree models with several LODs each, so instancing doesn't make much sense either.

The trees don't require any additional buffer changes (They're also part of the global buffer), but I believe the main problem stems from uploading the object matrices.

The matrices are a std140 uniform block inside the shader, and they're uploaded for each object using glBufferSubData. (I'm assuming there's no performance difference to using glUniform*?)

Since the trees are static, I could potentially create an array buffer during initialization and only upload the matrices once at the start. During rendering I'd then just have to upload an index.

However, is it even possible to tie an array buffer to a uniform/uniform block in that way? If so, how?

Also, can I bundle several glDrawElementsBaseVertex-calls together, similar to how display lists used to work, and then just call them as a batch somehow?

// Edit:

Another problem is that I'm using cascaded shadow mapping with 4 cascades, which means I have to bind the matrix of every shadow caster 5 times in total.

[screenshot: cascaded shadow maps]

This is especially problematic considering I can't use any culling when rendering shadows.


There are several different tree models with several LODs each, so instancing doesn't make much sense either.

Instancing can still make sense. For example, in your screenshot, how many LODs of the visible trees are actually in use? Three? I see roughly 250 trees, so 250 divided by 3 (assuming one base model) is still enough to justify using instancing.

-potential energy is easily made kinetic-

The matrices are a std140 uniform block inside the shader, and they're uploaded for each object using glBufferSubData. (I'm assuming there's no performance difference to using glUniform*?)

Actually, it can make a huge difference. Calling glBufferSubData for each tree every frame makes the GPU wait for the CPU to upload the data, and this can and will kill your performance. The only way to actually make UBOs perform better than glUniform* is to create one huge UBO for all your trees, upload all the matrices at once before rendering, and then use glBindBufferRange per tree to bind the correct transforms. This will be nice and fast; at least that is my experience with UBOs. To avoid synchronization stalls between GPU and CPU, you should use buffer orphaning or manual synchronization. More info here: https://www.opengl.org/wiki/Buffer_Object_Streaming. And here: http://www.gamedev.net/topic/655969-speed-gluniform-vs-uniform-buffer-objects/

The trees are still a major bottleneck, but I'm not sure what I can do to optimize it. I'm already doing frustum culling.

The distant trees cover only a small amount of fill (altering the render target is the GPU's main busy work), so if their presence affects the framerate significantly, you should definitely investigate the issue further. You will be amazed if you manage to get a big scene with small fill coverage running smoothly, with no LOD hacks or other tricks involved!

This topic is closed to new replies.
