Uniform buffer updates

7 comments, last by AbandonedAccount 10 years, 1 month ago

I use uniform buffers to set constants in the shaders.

Currently each uniform block is backed by a uniform buffer of the appropriate size; glBindBufferBase is called once per frame, and glNamedBufferSubDataEXT is called for every object, without orphaning.

I tried to optimize this by using a larger uniform buffer, calling glBindBufferRange and updating subsequent regions in the buffer, and this turned out to be significantly slower. After looking around I found this and similar threads that talk about the same problem. The suggestion seems to be to use one large uniform buffer for all objects, update it only once with the data for all objects, and call glBindBufferRange for every draw call.

Is this the definitive way to go in OpenGL, regardless of whether you use BufferSubData or MapBufferRange? In one place it was suggested that for small amounts of data glUniform*fv is the fastest choice. It would be nice to achieve comparable levels of performance with uniform buffers.

What is your experience with updating shader uniforms in OpenGL?

For runtime performance, it is obviously better to upload only once and bind the range. This takes more memory but runs faster.

But of course this isn't viable for truly dynamic data that can't be precalculated in a single upload stage. That's when runtime BufferSubData calls become viable. If your data isn't truly dynamic in this sense, then I'd say BufferSubData should only be used at the single upload stage.

Regardless, I think direct glUniform* calls have perfectly fine performance. The usage of uniform buffers might be overhyped. Especially with truly dynamic data, the cases in which they improve performance diminish: basically, they pay off when you use the same (updated) data multiple times. If you're updating a truly dynamic uniform buffer and only using that data once, then you may as well just use a plain glUniform*fv call.

When using a larger uniform buffer and ranges, there are a lot of things you have to do right. And it won't always be the best performance for every situation. While I can't know if it would provide better performance in your specific circumstance, I can ask a few things to make sure you were setting it up right. Were updates far enough apart to avoid conflicts (can be perf hit if call A writes to first half of block A, and call B writes to second half of block A)? Were you invalidating (orphaning) old data early enough or were you orphaning just before writing new data to the same location? Were you only orphaning large chunks at a time, preferably that map directly to memory pages? Was your buffer at least twice the size of the chunks you orphan? Were you using unsynchronized when writing new data? Were you ever blocking on your sync? Were you only sync'ing per orphan call instead of per frame? Did you test performance to make sure STREAM is faster than DYNAMIC (buffer hint)?

I almost certainly left out some important performance traps and considerations, but that should get you started. If you were already aware of those rules and doing everything right, it's probably wiser to just provide sample code to see if someone catches something you missed. It's always possible your code just won't gain a performance benefit from this stuff anyways, and this is complicated enough it's probably not worth your time unless you need the extra performance.

I'm the author of the original recommendation, and it came about through considerable experimentation in an attempt to get comparable performance out of GL UBOs as you can get from D3D cbuffers.

The issue was that in D3D the per-object orphan/write/use/orphan/write/use/orphan/write/use pattern (with a single small buffer containing storage for a single object) works, it runs fast and gives robust and consistent behaviour across different hardware vendors. In GL none of this happens.

The answer to the question "have you tried <insert tired old standard buffer object recommendation here>?" is "yes - and it didn't work".

The only solution that worked robustly across different hardware from different vendors was to iterate over each object twice - this may not be such a big deal as you're probably already doing it anyway (e.g. once to build a list of drawable objects, once to draw them) - with the first iteration choosing a range in a single large UBO (sized large enough for the max number of objects you want to handle in a frame) to use for the object and writing in its data, the second calling glBindBufferRange and drawing.

The data was originally written to a system memory intermediate storage area, then that was loaded (between the two iterations) to the UBO using glBufferSubData. Mapping (even MapBufferRange with the appropriate flags) sometimes worked on one vendor but failed on another and gave no appreciable performance difference anyway, so it wasn't pursued much. glBufferData (NULL) before the glBufferSubData call gave no measurable performance difference. glBufferData on its own gave no measurable performance difference. Different usage hints were a waste of time as drivers ignore these anyway.
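
A minimal sketch of the first pass and the system-memory staging area described above, assuming each object's block is a single 4x4 float matrix. The names Mat4, UboRange, StagedFrame and stageObjects are made up for illustration, and the stride is assumed to already satisfy the driver's offset alignment. The GL calls appear only as comments because they need a live context.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical per-object block: one 4x4 float matrix (64 bytes).
using Mat4 = std::array<float, 16>;

struct UboRange {
    std::size_t offset; // byte offset to pass to glBindBufferRange
    std::size_t size;   // byte size of the object's block
};

struct StagedFrame {
    std::vector<unsigned char> bytes; // system-memory copy of the UBO contents
    std::vector<UboRange> ranges;     // one range per object, in draw order
};

// Pass 1: choose a range for each object and write its data to system memory.
// `stride` is assumed to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT.
StagedFrame stageObjects(const std::vector<Mat4>& transforms, std::size_t stride) {
    assert(stride >= sizeof(Mat4));
    StagedFrame frame;
    frame.bytes.resize(transforms.size() * stride, 0);
    frame.ranges.reserve(transforms.size());
    for (std::size_t i = 0; i < transforms.size(); ++i) {
        const std::size_t offset = i * stride;
        std::memcpy(frame.bytes.data() + offset, transforms[i].data(), sizeof(Mat4));
        frame.ranges.push_back({offset, sizeof(Mat4)});
    }
    return frame;
}

// Between the passes (one upload for all objects):
//   glBindBuffer(GL_UNIFORM_BUFFER, ubo);
//   glBufferSubData(GL_UNIFORM_BUFFER, 0, frame.bytes.size(), frame.bytes.data());
// Pass 2, for each object i:
//   glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo,
//                     frame.ranges[i].offset, frame.ranges[i].size);
//   ...issue the draw call for object i...
```

Here ubo and bindingPoint stand for whatever buffer object and binding index the application already uses. The UBO itself would be allocated once, large enough for the maximum number of objects per frame.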

I'm fully satisfied that all of this is 100% a specification/driver issue, particularly coming from the OpenGL buffer object specification, and poor driver implementations. After all, the hardware vendors can write a driver where the single-small-buffer and per-object orphan/write/use pattern works and works fast - they've done it in their D3D drivers. It would be interesting to test if the GL4.4 immutable buffer storage functionality helps any, but in the absence of AMD and Intel implementations I don't see anything meaningful or useful coming from such a test.

Finally, by way of comparison with standalone uniforms, the problem there was that in GL standalone uniforms are per-program state and there are no global/shared uniforms, outside of UBOs.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

What I currently do:


bindBufferBase(smallUniformBuffer);

for (o : objects)
{
    bufferSubData(smallUniformBuffer, o.transformation);
    draw(o.vertices);
}

What I think I should be doing:


offset = 0;
for (o : objects)
{
    memory[offset] = o.transformation;
    ++offset;
}

bufferData(hugeBuffer, memory);

offset = 0;
for (o : objects)
{
    bindBufferRange(hugeBuffer, offset);
    draw(o.vertices);
    ++offset; // advance to the next object's range
}

At first I was a bit frustrated because I am used to the Effects of the D3D SDK, but after reading the presentation about batched buffer updates it seems a D3D application can also benefit from doing it this way. So the architecture can be the same for both APIs.

@richardurich:

"Were updates far enough apart to avoid conflicts (can be perf hit if call A writes to first half of block A, and call B writes to second half of block A)?"

Can you explain this a bit more? Are you saying that it is not good to write to the first half of the buffer and then to the second half, although the ranges don't intersect?

Thanks, that answers my question!

Have you tried this architecture that works well for OpenGL with a D3D backend? Will it also work well there? At least that would make it less annoying that this is apparently a driver weakness.

I'm indeed already iterating over each object twice but I'm wondering if I don't have to do it a third time now, because the first iteration is followed by a sort which could tell me when I don't need to update the per-material buffers for instance.

I'm also wondering about another thing. Is it very important that the uniform buffer is large enough to fit every single object drawn per frame, or can you achieve good performance with a buffer that only holds the data for some of the objects before it has to be rewritten? To me it sounds like that should already help, but then again the handling of uniform buffers shouldn't be so hard in the first place.

A different way to look at the concept I'm talking about is don't put uniforms for group A at offset 0 and uniforms for group B at offset 12 (a single vec3). If you're updating an odd amount of data like 1 vec3, pad out to the next 64B boundary for the start of object B. Do not trust my 64 byte number there. I have no idea if 64 bytes is the right value. I feel like a single 4x4 matrix was the right size, but I could easily be remembering wrong and it could have changed anyways.

I don't know which calls or vendors took the hit, just the end result was X-byte align each group of uniforms. You'll also have to test that this is better performance since it may hurt you if you are bandwidth starved.

I know I've said it before, and I'm sure you already know it, but make sure you are bottlenecking on something this fixes before wasting a lot of time tuning the performance. And make sure you're late enough in development that the perf data you collect is going to be right. Unless you're familiar with coding something for development that also accounts for future optimizations (var for group alignment, one for map alignment, etc.), you're best off delaying this type of coding as long as humanly possible. For example, you want to 1-byte align during development so you catch any bugs where you stomp past your buffer instead of hiding the bug due to padding.

OK, I think I understand. If we are thinking about the same thing, calls to glBindBufferRange actually have to use an offset that is a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, which on my machine (and I think on many other cards) is 256.
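
A small sketch of the padding idea, assuming the alignment is kept in one variable so it can be changed between builds. The helper names alignUp and layoutGroups are hypothetical. In a real program the alignment for glBindBufferRange offsets would come from glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...) rather than being hard-coded.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round `value` up to the next multiple of `alignment` (alignment must be > 0).
std::size_t alignUp(std::size_t value, std::size_t alignment) {
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

// Given the byte size of each uniform group, return the byte offset at which
// each group starts when every start is padded to `groupAlignment`.
std::vector<std::size_t> layoutGroups(const std::vector<std::size_t>& groupSizes,
                                      std::size_t groupAlignment) {
    std::vector<std::size_t> offsets;
    offsets.reserve(groupSizes.size());
    std::size_t cursor = 0;
    for (std::size_t size : groupSizes) {
        cursor = alignUp(cursor, groupAlignment); // pad to the next boundary
        offsets.push_back(cursor);
        cursor += size;
    }
    return offsets;
}
```

For groups of 12, 64 and 12 bytes (a vec3, a 4x4 matrix, a vec3), an alignment of 256 gives start offsets 0, 256 and 512, while an alignment of 1 packs them at 0, 12 and 76. Offsets produced with an alignment smaller than the driver's value are only usable for writes into the buffer, not as glBindBufferRange offsets.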

Just now I tried to evaluate if the uniform updates are a bottleneck in my case. For this test I stripped down the rendering pipeline as much as I could, regarding OpenGL interaction. I simulated the performance of the "optimized" uniform updates by replacing glBufferSubData(float4x4) with glBindBufferRange().

I compared the two approaches with 1K and 4K draw calls for very simple geometry (same vb for every draw call) and could not see any noticeable difference.

I concluded that the optimized version could not possibly be faster than just calling glBindBufferRange() for every differently transformed object, which in turn means this is not my bottleneck.

So has the driver situation improved or is my test/conclusion flawed?

If you slim down the OpenGL calls in the rendering pipeline so the driver is only busy 20% of the time, the frame rate won't change if you increase that to 40% (twice as slow) or decrease it to 10% (twice as fast). You basically just guaranteed the driver has plenty of time to do all the memory management required, and that's mostly what you were trying to take off the driver's plate in the first place.
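
The reasoning can be put into a toy model. This is only an illustration under a strong assumption: the driver's work overlaps completely with whatever else bounds the frame, so the frame takes as long as the slower of the two. The function name and the millisecond figures are made up.

```cpp
#include <algorithm>

// Toy model: two fully overlapping stages; the slower one sets the frame time.
// driverMs: time the driver spends on uniform updates and memory management.
// otherMs:  time taken by whatever else bounds the frame (e.g. the GPU).
double frameTimeMs(double driverMs, double otherMs) {
    return std::max(driverMs, otherMs);
}
```

With otherMs fixed at 10, driver times of 1, 2 or 4 ms all give a 10 ms frame, so halving or doubling the driver cost is invisible in the frame rate. A difference only shows up once the driver time exceeds the other stage, for example 12 ms against 10 ms.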

It sounds like you do not need to be worrying about this stuff yet, and may never need to worry about it.

This topic is closed to new replies.
