OpenGL performance tips

5 comments, last by Lodeman 10 years, 1 month ago

Hi all,

For my engine I need to render a bunch of different meshes, including an alpha pass.
Currently the way I do it is fairly straight-forward:

- I have one big vertex and index buffer into which I load all my mesh data. During the render phase I use this to "instance" my geometry.

- I have frustum culling to skip scenery that doesn't need to be drawn

- During the update part of the game-loop, I create a render-queue, which sorts the meshes so they can be drawn more efficiently

- At render-time, I bind the large mesh-buffer one time before drawing any meshes

- Drawing is done by going through the render-queue. I bind the correct texture 2D array for the mesh and draw it with glDrawElementsBaseVertex, passing the right offsets so each mesh is drawn out of one and the same buffer (what I called "instancing" above; see the sketch after this list)

- I unbind the buffer at the end of the render loop

- After all opaque objects have been rendered, I do the alpha pass in a similar way, also using the big buffer. In this case, however, I cannot sort them mesh by mesh, since they need to be sorted by depth instead.

- I use one and the same shader-program for drawing all these meshes, and only one sampler2DArray at texture index 0. The array contains a diffuse map and an optional bumpmap.
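
For reference, here is a minimal C++ sketch of such a shared-buffer draw loop. MeshEntry, DrawQueue and the field names are illustrative, not my actual code; a GL loader header is assumed to be included.

#include <vector>

struct MeshEntry {
    GLsizei    indexCount;    // number of indices for this mesh
    GLsizeiptr indexOffset;   // byte offset of its first index within the shared index buffer
    GLint      baseVertex;    // offset of its first vertex within the shared vertex buffer
    GLuint     textureArray;  // texture 2D array holding the diffuse map and optional bumpmap
};

void DrawQueue(GLuint sharedVao, const std::vector<MeshEntry>& queue)
{
    glBindVertexArray(sharedVao);              // the big buffer (s_MeshBuffer.VAO), bound once
    GLuint lastTexture = 0;
    for (const MeshEntry& e : queue)
    {
        if (e.textureArray != lastTexture)     // re-bind only when the texture actually changes
        {
            glBindTexture(GL_TEXTURE_2D_ARRAY, e.textureArray);
            lastTexture = e.textureArray;
        }
        glDrawElementsBaseVertex(GL_TRIANGLES, e.indexCount, GL_UNSIGNED_INT,
                                 reinterpret_cast<void*>(e.indexOffset), e.baseVertex);
    }
    glBindVertexArray(0);                      // unbind at the end of the render loop
}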

I'm finding that with the current setup I'm not quite getting the performance I'd like. Therefore I'm hoping to receive some tips on how this sort of mesh-rendering problem is usually tackled by more experienced programmers. For example, is it common practice to use just one shader-program for rendering all meshes? Or is there a much more efficient way that would remove the need to always re-bind the correct texture when switching between meshes?

Any suggestions are very welcome!

Cheers!


is it common practice to use just one shader-program for rendering all meshes?

No. Use permutations, breaking shaders up sensibly between run-time branches and compile-time variants.


is there a much more efficient way that would remove the need to always re-bind the correct texture when switching between meshes?

You should be sorting by texture as a second criterion if shaders match (which would seem to always be your case).
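
One common way to encode those sort criteria is a packed key, so a single integer comparison gives the desired order. A minimal sketch (the field widths and names are arbitrary):

#include <cstdint>

// 64-bit sort key: shader in the highest bits, then texture, then quantized depth,
// so sorting by the key alone yields shader -> texture -> depth ordering.
uint64_t MakeSortKey(uint32_t shaderId, uint32_t textureId, uint16_t depthBits)
{
    return (uint64_t(shaderId & 0xFFFFu) << 48) |   // 16 bits: shader/program
           (uint64_t(textureId)          << 16) |   // 32 bits: texture (array) id
            uint64_t(depthBits);                    // 16 bits: depth
}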


In both cases, setting textures and shaders should be done only through custom wrappers that keep track of the last shader/textures set and early-out if the same is being set again.
And not just shaders and textures but every state change should be redundancy checked. Culling on/off, depth-test function, nothing should be set to the same value that it already is.
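
A minimal sketch of such a wrapper, assuming C++ and a GL loader already set up (class and member names are illustrative):

class GLStateCache {
public:
    void UseProgram(GLuint program) {
        if (program != m_program) { glUseProgram(program); m_program = program; }
    }
    void BindTexture2DArray(GLuint unit, GLuint texture) {
        if (texture != m_texture[unit]) {
            glActiveTexture(GL_TEXTURE0 + unit);
            glBindTexture(GL_TEXTURE_2D_ARRAY, texture);
            m_texture[unit] = texture;
        }
    }
    // Real code would track every other state too: depth func, culling, blending, ...
private:
    GLuint m_program = 0;
    GLuint m_texture[16] = {};   // one slot per texture unit actually used
};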


I create a render-queue, which sorts the meshes so they can be drawn more efficiently

A bad render queue is worse than no render queue at all. Did you time it?
Make sure you are taking advantage of per-frame temporal coherence with an insertion sort on item indices.
Do not sort actual render-queue objects and do not use std::sort().


Is your shader optimized? Are you reducing overdraw with a render-queue check on depth (following matching shaders and textures)?
Are you doing something silly such as recreating or copying over vertex buffers that are in use each frame?


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

You don't give nearly enough information to stimulate a meaningful answer. For example, "...which sorts meshes so they can be rendered more efficiently..." says little.

Pictures, polygon counts, texture sizes, number of GL calls... these things build a basis for consideration where performance is concerned. Given the vagueness of what you provide, I can imagine scenarios where you are bus bound, geometry bound, fill-rate bound, or ALU bound.

I really would like to help :-)

You should be sorting by texture as a second criterion if shaders match (which would seem to always be your case).


In both cases, setting textures and shaders should be done only through custom wrappers that keep track of the last shader/textures set and early-out if the same is being set again.
And not just shaders and textures but every state change should be redundancy checked. Culling on/off, depth-test function, nothing should be set to the same value that it already is.

Ah yes, currently I sort per "mesh". So for example I'll have a few pinetree variants, and I'd first go over pine variant 1 and draw all those instances, then variant 2 etc... As they do share textures, it would indeed be a good move to sort per texture instead of per mesh.
The custom wrapper is also a great suggestion, I'll get to work on that too.

A bad render queue is worse than no render queue at all. Did you time it?
Make sure you are taking advantage of per-frame temporal coherence with an insertion sort on item indices.
Do not sort actual render-queue objects and do not use std::sort().

I do suppose my current queue, based on per-mesh sorting, is inefficient. To clarify, my current render queue is essentially a map<int MeshID, vector<MeshInstance>>.
As for how I construct it each frame: I use an octree to do frustum culling. For any mesh instance that falls within the view frustum, I check whether its mesh type is already in the render queue. If so, I append the instance to the corresponding MeshInstance vector; if the mesh type is not yet in the queue, I add a new MeshID to the map.
This queue worked well back when I was only testing instances that didn't share any textures (1 pinetree variant, 1 house, 1 bush, etc.): it definitely gave a performance boost compared to just switching between meshes randomly (I did time this), but it is now outdated. So yeah, I'll look into improving this by sorting per texture.

Is your shader optimized? Are you reducing overdraw with a render-queue check on depth (following matching shaders and textures)?
Are you doing something silly such as recreating or copying over vertex buffers that are in use each frame?

I am only calling this each frame: glBindVertexArray(s_MeshBuffer.VAO);

So not recreating or copying over buffers.
As for reducing overdraw, could you elaborate on that? I'm not familiar with this.

No. Use permutations, breaking shaders up sensibly between run-time branches and compile-time variants.

Could you also elaborate on this? Currently all my scenery requires the same shader code. They have the same lighting calculations, apply an optional bumpmap (I use a uniform boolean to check whether a bumpmap needs to be sampled), sample the diffuse texture, and sample shadowmaps.
I'd like to have some examples as to when one would really distinguish between using another shader program, or just having a boolean to check if a certain functionality is needed.

You don't give nearly enough information to stimulate a meaningful answer.

I'm afraid that's because I don't have sufficient OpenGL monitoring yet, I was first trying to make things "work" before sufficiently considering performance. Definitely on the todo list though. Mainly my purpose for this thread was getting general performance improvement tips.

To sketch a bit of context, this is the type of scene I'm rendering:
Polygon count for the scenery isn't anything out of the ordinary (although I can't atm give a number), texture sizes depend on the asset, but for example both the bark texture on the tree and the texture on the rocks are 512*512.
Scenery does not have LODs yet (another item on the infamous todo list), terrain however does (terrain performs decently on its own).

(screenshot of the scene: pine trees and rocks on terrain)

Thanks for the feedback so far.
Cheers!

Use mipmaps.
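
For the texture arrays used here that would be something along these lines (a sketch; assumes the array texture has already been uploaded):

glBindTexture(GL_TEXTURE_2D_ARRAY, textureArray);   // textureArray: an existing, filled array texture
glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glGenerateMipmap(GL_TEXTURE_2D_ARRAY);              // builds the mip chain for every layer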

Realtime shadows are slow.

You can potentially batch draw calls together by sending in texture ids through a vertex attribute.
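
A sketch of that idea, using a per-vertex integer attribute as the layer index into the existing sampler2DArray; the attribute location, struct and variable names are illustrative:

#include <cstddef>
// assumes a GL loader is included and the shared VAO/VBO are currently bound

struct Vertex {
    float  position[3];
    float  uv[2];
    GLuint textureLayer;   // which layer of the texture array this vertex samples
};

void SetupLayerAttribute()
{
    // glVertexAttribIPointer (note the I) keeps the value an integer on the shader side.
    glEnableVertexAttribArray(3);
    glVertexAttribIPointer(3, 1, GL_UNSIGNED_INT, sizeof(Vertex),
                           (const void*)offsetof(Vertex, textureLayer));
}

// Fragment-shader side (GLSL excerpt):
//   uniform sampler2DArray diffuseMaps;
//   flat in uint vTextureLayer;   // forwarded unchanged from the vertex shader
//   vec4 diffuse = texture(diffuseMaps, vec3(uv, float(vTextureLayer)));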

"Only" frustum culling >for *indoor* scene geometry< is not quite adequate.. Spatial culling is usually needed, PVS like in quake or umbra being the best. I use an octree with hardware occlusion.

currently I sort per "mesh". So for example I'll have a few pinetree variants, and I'd first go over pine variant 1 and draw all those instances, then variant 2 etc.

The leaves have transparency so they shouldn’t even be in the same render queue as the trunks. You should be creating a small render-queue item for each mesh part, so that you end up drawing all the trunks with the same texture first, then switch to the rock texture and draw all the rocks, etc., and on the second pass draw all the translucent items in a similarly sorted order.
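
In other words, something like two queues of small per-part items (a sketch; the field names are illustrative):

#include <cstdint>
#include <vector>

struct QueueItem {
    uint32_t meshPart;    // which piece of the shared vertex/index buffer to draw
    uint32_t texture;     // texture array to bind for it
    float    viewDepth;   // distance from the camera, used for depth sorting
};

std::vector<QueueItem> opaqueQueue;       // sort by shader, then texture, then front-to-back
std::vector<QueueItem> translucentQueue;  // drawn second, sorted strictly back-to-front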


Currently all my scenery requires the same shader code.

Probably a problem. For your leaves you likely use “discard” to perform alpha testing. If this keyword appears also in your opaque objects’ code (such as the trunks and rocks) then you likely have a large performance issue, since just the very presence of “discard” disables early-Z and other early-outs.
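
One way to keep discard out of the opaque variants entirely is a compile-time guard in the shared fragment source, e.g. (sketch, with ALPHA_TEST as an illustrative define name):

// Fragment-shader excerpt, stored as a C++ string and compiled per variant:
const char* alphaTestSnippet = R"glsl(
#ifdef ALPHA_TEST
    if (diffuse.a < 0.5)   // cutoff value is illustrative
        discard;           // only the leaf/translucent variant ever contains this keyword
#endif
)glsl";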


I'd like to have some examples as to when one would really distinguish between using another shader program, or just having a boolean to check if a certain functionality is needed.

If the boolean is switched on and off rapidly it is best as a uniform. If many objects (say 25% of the scene or more) can be drawn with the boolean fixed to the same value, it should be a second shader.
I just mentioned a perfect example: Translucent vs. opaque objects. Plus what about creating shadow maps? Surely you don’t sample textures for that…
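
A sketch of building such variants from one source string by prepending defines (function names are illustrative; error checking and program linking omitted):

#include <string>
// assumes a GL loader is included and the source itself contains no #version line

GLuint CompileFragment(const std::string& defines, const std::string& source)
{
    std::string full = "#version 330 core\n" + defines + source;
    const char* src  = full.c_str();
    GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(shader, 1, &src, nullptr);
    glCompileShader(shader);                 // real code checks GL_COMPILE_STATUS here
    return shader;                           // link each variant into its own program
}

// GLuint opaqueFS    = CompileFragment("",                        meshSource);
// GLuint alphaTestFS = CompileFragment("#define ALPHA_TEST 1\n",  meshSource);
// GLuint shadowFS    = CompileFragment("#define SHADOW_PASS 1\n", meshSource);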


As for reducing overdraw, could you elaborate on that? I'm not familiar with this.

Don’t draw objects on top of each other except when necessary as in the case of translucency; draw objects behind each other so the same pixel does not execute the pixel shader more than once.
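
For opaque objects that usually means sorting front-to-back by camera depth (as the last criterion, after shader and texture). A sketch of quantizing that depth for a sort key:

#include <cstdint>

// Map view-space depth into 16 bits so nearer objects sort (and draw) first.
// For translucent objects the order is reversed instead.
uint16_t QuantizeDepth(float viewDepth, float zNear, float zFar)
{
    float t = (viewDepth - zNear) / (zFar - zNear);   // 0 at the near plane, 1 at the far plane
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    return (uint16_t)(t * 65535.0f);
}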


So yeah, I'll look into improving this by sorting per texture.

You should look into completely redesigning it. It should be as simple as checking the parts of a mesh for being inside the frustum, rebuilding the render queue from whichever mesh parts are visible, and re-sorting an array of indices into that render queue. Those indices should already be in near-sorted order from the previous frame (this is called taking advantage of temporal coherence), so sort them in-place with an insertion sort: like bubble sort, it has a best case of O(n) when items are already in sorted order, but bubble sort requires at least twice as many writes as insertion sort, twice as many cache misses, and asymptotically more branch mispredictions.
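
A minimal sketch of that index sort; RenderItem and sortKey stand in for whatever the real queue items store:

#include <cstdint>
#include <vector>

struct RenderItem { uint64_t sortKey; /* mesh part, texture, depth, ... */ };

// 'indices' is kept alive between frames, so it is usually already almost sorted
// and this insertion sort runs in close to O(n).
void SortQueueIndices(const std::vector<RenderItem>& items, std::vector<uint32_t>& indices)
{
    for (size_t i = 1; i < indices.size(); ++i)
    {
        const uint32_t idx = indices[i];
        size_t j = i;
        while (j > 0 && items[indices[j - 1]].sortKey > items[idx].sortKey)
        {
            indices[j] = indices[j - 1];   // shift larger keys up by one slot
            --j;
        }
        indices[j] = idx;                  // drop the current index into its place
    }
}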


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

You've given me very useful and relevant information, time to get crackin'.

Thanks!
Lodeman

