SPEED: glUniform Vs. uniform buffer objects.

10 comments, last by cippyboy 9 years, 11 months ago

So recently I dropped all support for legacy OpenGL matrix functionality from my 3D engine, and instead of GL_MODELVIEW, for example, I'm uploading my view matrices to a Uniform Buffer Object to be shared and accessed by my shaders. But as stated in a previous post, this caused a huge FPS drop (from 1000 to 100). I've managed to get this number up to about 400 by trimming out as many matrix calls as possible (for example, instead of transforming meshes before drawing, I now transform the vertices before uploading them to the GPU), but I'm still not happy with the performance.

Would regular glUniform calls be quicker than using buffer objects? For example, when I bind a shader, I would pass the current view/projection matrices as uniforms rather than using the UBOs.

This would mean each shader would have its own copy of the matrices rather than the global (UBO) ones...

Does anyone know if glUniform() calls are faster than UBO updates (which have proven to drop FPS quite significantly)?

Jonathan.


Try it and see?

NOTJon.

Assuming that you are not doing something wrong (bad usage flags or such) glUniform is almost guaranteed to be slower than a buffer object (although obviously this will vary by platform and you'd have to try to be 100% sure).

The reason why I'm saying this is that just like when you draw with the deprecated fixed-function pipeline, uniforms will require the driver to batch together several uniform values into the equivalent of a buffer object anyway. The way the hardware works, uniforms are just values in a memory block on the GPU, and they need to be transmitted via PCIe or some similar bus. Such a transfer involves a lot of setup and synchronization (and a complete stall on some hardware, e.g. pre-Kepler NVIDIA consumer cards, and all ATI cards that I know of), and has a very non-negligible fixed overhead to start at all (and a very noticeable latency), plus some small overhead that depends on the amount of data transmitted. In other words, transmitting a single byte is more or less as expensive as transmitting a hundred kilobytes. Doing a transfer at all takes "forever" but bandwidth is pretty abundant.

If you do a couple of glUniform calls by hand, the driver must guess when it's time to batch together a few of them, and transfer them. That may work fine, or it may have to do another transfer if another few values follow. Also, more commands necessarily mean more function calls and more items on OpenGL's command queue being pushed and popped, which of course is not that much overhead, but still... it adds up.

If you just put everything in one buffer object (and you know how many you have!), it's one function call, one transfer, and it avoids the batch management work inside the driver, which, at least in theory, must necessarily be faster.

In theory UBOs should be faster than glUniform calls, for all the reasons given in the previous post. They also give the advantage that uniforms can be shared by different program objects.

In practice UBOs can be considerably slower.

This reduction in speed is nothing to do with use of uniforms in buffers, nothing to do with number of function calls.

It's everything to do with OpenGL's buffer object API and how you use it.

With buffer objects you can't just treat them the same as a block of system memory that you can grab a pointer to and write to, read from, as required without suffering serious performance overhead. You have to manage them carefully and you have to know what kind of performance characteristics you can expect.

I'm going to assume that you have 1000 objects and that you don't have persistent buffer mapping. There are a number of ways you can manage this.

(1) Each object can have its own UBO. To draw an object you update the UBO, bind it, then draw. You have 1000 UBO binds and 1000 UBO updates per frame. This is going to run slow.

(2) You have a single small UBO that all objects share. It's bound once during startup. To draw an object you update the UBO, then draw. You have a single UBO bind at startup, but 1000 UBO updates per frame. This is going to run slow.

(3) You have a single large UBO sized for 1000 objects. At the start of each frame you make a pass through your objects. You update the data they're going to use and copy it off to a system memory buffer. Then you make a single glBufferSubData call. Each object stores an object id from which you can reconstruct the offset its data is at in the UBO. To draw an object you make a glBindBufferRange call, then draw. You have 1000 glBindBufferRange calls but one UBO update per frame. This is going to run fast.
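As a rough sketch of the CPU side of (3): the ObjectUniforms layout, the function names and the stride parameter below are assumptions made for illustration, not something taken from an actual engine. The stride is assumed to be a multiple of whatever GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT reports on the target driver. The GL calls are shown only as a comment, since they need a live context.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-object uniform data; the real layout is up to the engine. */
typedef struct {
    float mvp[16];
    float model_view[16];
} ObjectUniforms;

/* Byte offset of an object's slot in the big UBO, reconstructed from its id.
 * `stride` is the slot size and is assumed to be a multiple of
 * GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. */
static size_t object_offset(size_t object_id, size_t stride)
{
    return object_id * stride;
}

/* Copy every object's data into one system-memory staging buffer.
 * Returns the number of bytes to upload, or 0 if the staging buffer
 * is too small to hold `count` slots. */
static size_t pack_uniforms(unsigned char *staging, size_t staging_size,
                            const ObjectUniforms *objects, size_t count,
                            size_t stride)
{
    assert(stride >= sizeof(ObjectUniforms));
    size_t total = count * stride;
    if (total > staging_size)
        return 0;
    for (size_t i = 0; i < count; ++i)
        memcpy(staging + object_offset(i, stride), &objects[i],
               sizeof(ObjectUniforms));
    return total;
}

/* Per frame, with a current GL context (not exercised in this sketch):
 *
 *   glBindBuffer(GL_UNIFORM_BUFFER, ubo);
 *   glBufferSubData(GL_UNIFORM_BUFFER, 0, total, staging);   // one update
 *   for each object i:
 *       glBindBufferRange(GL_UNIFORM_BUFFER, 0, ubo,
 *                         object_offset(i, stride), sizeof(ObjectUniforms));
 *       ...issue the draw call...
 */
```

The point of the staging buffer is that the per-object work stays in system memory, and the driver only sees one upload per frame.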

The conclusion is that using UBOs involves some re-architecting. You can't just take a bunch of code using standalone uniforms, port it over to UBOs without changing anything else, and expect to get the same performance from it. You need to think about your updates, group them all together, and that doesn't mean doing 1000 updates at the same time, that means doing one update that covers all 1000 objects.

How can I so confidently lay the blame at the API here? Because you can do (1) and (2) in D3D, and with all other things being equal they run fast (with (2) being faster than (1) owing to D3D's explicit "discard" semantics). UBOs in GL, however, do not run fast under those circumstances.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

@mhagain

I thought I was doing something wrong as well when I saw they're slower (because nobody tells you that, just that using them is cool and fast). I too turned all uniforms into a single UBO, updating and binding it on a per-object basis (just like in D3D11), and it's slower. I even made it so that if the camera doesn't move the update doesn't happen at all, but binding 100 UBOs for each draw call was still slower than using a dozen plain glUniform calls, by around 1-3%.

I was planning to do #3 as described above when I found the topic :).

Relative Games - My apps

Beware that this isn't necessarily going to run faster than standalone uniforms either (depends on how much data you have in the buffer vs how many standalone uniforms you were using, other factors, and will be very driver-dependent). 1% to 3% speed difference in either direction is IMO acceptable. At that stage the primary advantage of UBOs is one of convenience: being able to share uniforms among multiple different programs rather than having to reload them every time you change program.


Yeah, but sharing a small UBO between all programs while still having a second UBO per program is kind of bad too. I haven't tested this on GL, but I did it on D3D11: I had 3 CBs per object with different update frequencies (per frame, per material and per object), and using 3 CBs per object was always slower than using just one. This happened on both my old HD5770 and my newer HD7850; I initially thought it was down to the first-generation DX11 drivers, but it seems it wasn't. Some still argue that they see benefits from sharing some buffers, but I haven't found any case in favor of that yet. I tried with 30-bone skinned meshes too, and it was still faster to keep all the constant data in one CB than to split the bones into one CB and the rest into another (so that I could, for example, reuse the bones CB in a shadow pass where I needed just that).

UPDATE: Just implemented the global uniform buffer with glBindBufferRange calls. I discovered that the offset you pass to glBindBufferRange needs to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, and still observed a performance penalty of about 1% for using a dozen calls to glBindBufferRange compared to using just glUniform calls. I'll also add that for glUniform calls I have a state minimizer which compares the data with the previously set data (from a previous object, say) and, if it's the same, doesn't call glUniform. The comparison was done standing still, so basically I don't even map the buffer to update it, because I'm not moving or changing any data.
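For readers wondering what such a state minimizer might look like, here is a minimal sketch for a single mat4 uniform. The Mat4Cache type and mat4_needs_upload name are made up for illustration and are not the code used in the tests above. Since uniform values are part of a program object's state, a real version would presumably keep one cache per program and per uniform location.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical cache remembering the last value sent for one mat4 uniform. */
typedef struct {
    float last[16];
    bool  valid;   /* false until the first upload */
} Mat4Cache;

/* Returns true when `value` differs from the cached copy and therefore has
 * to be uploaded; the cache is updated in that case. The comparison is
 * bitwise, so 0.0f and -0.0f count as different values. */
static bool mat4_needs_upload(Mat4Cache *cache, const float value[16])
{
    assert(cache != NULL);
    if (cache->valid && memcmp(cache->last, value, sizeof cache->last) == 0)
        return false;
    memcpy(cache->last, value, sizeof cache->last);
    cache->valid = true;
    return true;
}

/* Usage with a current GL context (not exercised in this sketch):
 *
 *   if (mat4_needs_upload(&cache, matrix))
 *       glUniformMatrix4fv(location, 1, GL_FALSE, matrix);
 */
```

With a filter like this in front of glUniform, a static scene issues almost no uniform calls at all, which is worth keeping in mind when reading the comparison numbers.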

UPDATE2: Did it on D3D11.1 using VSSetConstantBuffers1 and PSSetConstantBuffers1 and there is indeed a difference, of approximately 0.4% or less. I already had 99% GPU usage though, so it's not all that unexpected; on GL I had ~63% GPU usage and it's mostly doing the same thing. Perhaps it's because I have relatively small constant buffers? I have 464 bytes per object, which need to be padded to 512 for alignment purposes.
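The 464 to 512 padding is just rounding up to the next multiple of the alignment. A small sketch, assuming an alignment of 256 bytes (the actual value has to be queried with glGetIntegerv and GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT); align_up is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Round `size` up to the next multiple of `alignment`.
 * Works for any non-zero alignment, not only powers of two. */
static size_t align_up(size_t size, size_t alignment)
{
    assert(alignment != 0);
    return (size + alignment - 1) / alignment * alignment;
}

/* With an assumed alignment of 256:
 *   align_up(464, 256) gives 512, so each object's slot is 512 bytes and
 *   object n starts at byte n * 512 in the buffer. */
```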



(3) You have a single large UBO sized for 1000 objects. At the start of each frame you make a pass through your objects. You update the data they're going to use and copy it off to a system memory buffer. Then you make a single glBufferSubData call. Each object stores an object id from which you can reconstruct the offset its data is at in the UBO. To draw an object you make a glBindBufferRange call, then draw. You have 1000 glBindBufferRange calls but one UBO update per frame. This is going to run fast.
Thing is, there are limitations:

It seems to me that for drawing at such scale, you'd have to maintain several big UBOs. Say, one for matrices, another for material data, etc.

For example, my card supports UBOs of up to 64 KiB (65536 bytes, from querying GL_MAX_UNIFORM_BLOCK_SIZE), and I've seen that number mentioned a few times. Say that for drawing you need to upload a block with an MVP matrix and a modelView matrix for lighting. That's 128 bytes per instance, i.e. room for 512 instances.

I haven't tried it yet but maybe you can bind ranges of different UBOs to the same binding slot in the shader program?

Or maybe you could just say "upload all the stuff I can, draw, upload the rest of the stuff, keep drawing". That would reduce glBufferSubData calls drastically, but you'd need some logic to check how many of those blocks you can upload at once.

There are also alignment requirements which I don't quite understand yet (I'm not sure if my card is saying I should align my updates to 256 byte blocks or 256 bit blocks from what I've seen querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT).
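The bookkeeping for that kind of batching is small. The sketch below uses the numbers from this post (65536 bytes, 128 bytes per object) and hypothetical helper names. One caveat worth hedging: as far as I understand it, GL_MAX_UNIFORM_BLOCK_SIZE limits the size of the block a shader sees, not the storage of the buffer object, so this only matters if a single bound range has to hold many objects at once (for example an array indexed in the shader) rather than one glBindBufferRange per object.

```c
#include <assert.h>
#include <stddef.h>

/* How many per-object entries fit into one uniform block of the given size. */
static size_t objects_per_batch(size_t max_block_size, size_t bytes_per_object)
{
    assert(bytes_per_object != 0);
    return max_block_size / bytes_per_object;
}

/* How many upload-then-draw rounds are needed for `object_count` objects
 * when at most `per_batch` objects fit at once (rounds up). */
static size_t batch_count(size_t object_count, size_t per_batch)
{
    assert(per_batch != 0);
    return (object_count + per_batch - 1) / per_batch;
}

/* With the figures from the post:
 *   objects_per_batch(65536, 128) is 512, and
 *   batch_count(1000, 512) is 2, i.e. two uploads per frame for 1000 objects. */
```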

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator

There are also alignment requirements which I don't quite understand yet (I'm not sure if my card is saying I should align my updates to 256 byte blocks or 256 bit blocks from what I've seen querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT).

I'll need to check my code (I have all of this written and working) but if memory serves it's bytes. Yes, that typically means that you'll have some empty space at the end of each object's block in the UBO, but that's OK; the more important thing is to minimise the updates as much as possible by doing as many of them as possible in a single operation.

The unfortunate consequence of this is that GL code using UBOs will look quite different to D3D code using cbuffers, so it can be messy if you want to support both APIs in the same program.

What I personally haven't tested is the single-bind/multiple-update pattern using GL_ARB_buffer_storage and persistent mapping; I might try that later on as it would be interesting to get some visibility on how that works as an option.

There are also alignment requirements which I don't quite understand yet (I'm not sure if my card is saying I should align my updates to 256 byte blocks or 256 bit blocks from what I've seen querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT).

It's bytes, and it applies only to the offset argument, so if it says 256 you can do

glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 256, Range1 );

glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 512, Range2 );

....

glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, n * 256, Range3 );

@mhagain

You talked about fewer updates; I do no updates, just bindings, and it's still slower than glUniform calls. I'll probably try the persistent mapping in GL next as well.
