SPEED: glUniform Vs. uniform buffer objects.


12 replies to this topic

#1 TRONJon   Members   -  Reputation: 212

Posted 27 April 2014 - 11:20 AM

So recently I dropped all support for the legacy OpenGL matrix functionality from my 3D engine. Instead of using GL_MODELVIEW, for example, I'm uploading my view matrices to a Uniform Buffer Object to be shared and accessed by my shaders. But as stated in a previous post, this caused a huge FPS drop (from 1000 to 100). I've managed to get this number back up to about 400 by trimming out as many matrix calls as possible (for example, instead of transforming meshes before drawing, I now transform the vertices before uploading them to the GPU), but I'm still not happy with the performance.

Would regular glUniform calls be quicker than using buffer objects? For example, when I bind a shader, I would pass the current view/projection matrices as uniforms rather than using the UBOs.

This would mean each shader would have its own copy of the matrices rather than the global (UBO) ones...

Does anyone know if glUniform() calls are faster than UBO calls? (The latter have proven to drop FPS quite significantly.)

 

Jonathan.



#2 3TATUK2   Members   -  Reputation: 730

Posted 27 April 2014 - 11:56 AM

Try it and see?

 

NOTJon.



#3 samoth   Crossbones+   -  Reputation: 4684

Posted 27 April 2014 - 12:35 PM

Assuming that you are not doing something wrong (bad usage flags or such) glUniform is almost guaranteed to be slower than a buffer object (although obviously this will vary by platform and you'd have to try to be 100% sure).

 

The reason why I'm saying this is that, just like when you draw with the deprecated fixed-function pipeline, uniforms will require the driver to batch together several uniform values into the equivalent of a buffer object anyway. The way the hardware works, uniforms are just values in a memory block on the GPU, and they need to be transmitted via PCIe or some similar bus. Such a transfer involves a lot of setup and synchronization (and a complete stall on some hardware, e.g. pre-Kepler NVIDIA consumer cards, and all ATI cards that I know of), and has a very non-negligible fixed overhead to start at all (and a very noticeable latency), plus some small overhead that depends on the amount of data transmitted. In other words, transmitting a single byte is more or less as expensive as transmitting a hundred kilobytes. Doing a transfer at all takes "forever", but bandwidth is pretty abundant.

 

If you do a couple of glUniform calls by hand, the driver must guess when it's time to batch together a few of them, and transfer them. That may work fine, or it may have to do another transfer if another few values follow. Also, more commands necessarily mean more function calls and more items on OpenGL's command queue being pushed and popped, which of course is not that much overhead, but still... it adds up.

If you just put everything in one buffer object (and you know how many you have!), it's one function call, one transfer, and it avoids the batch management work inside the driver. Which, at least in theory, must necessarily be faster.



#4 mhagain   Crossbones+   -  Reputation: 7821

Posted 27 April 2014 - 01:19 PM

In theory UBOs should be faster than glUniform calls, for all the reasons given in the previous post.  They also give the advantage that uniforms can be shared by different program objects.

 

In practice UBOs can be considerably slower.

 

This reduction in speed is nothing to do with use of uniforms in buffers, nothing to do with number of function calls.

 

It's everything to do with OpenGL's buffer object API and how you use it.

 

With buffer objects you can't just treat them the same as a block of system memory that you can grab a pointer to and write to, read from, as required without suffering serious performance overhead.  You have to manage them carefully and you have to know what kind of performance characteristics you can expect.

 

I'm going to assume that you have 1000 objects and that you don't have persistent buffer mapping.  There are a number of ways you can manage this.

 

(1) Each object can have its own UBO.  To draw an object you update the UBO, bind it, then draw.  You have 1000 UBO binds and 1000 UBO updates per frame.  This is going to run slow.

 

(2) You have a single small UBO that all objects share.  It's bound once during startup.  To draw an object you update the UBO, then draw.  You have a single UBO bind at startup, but 1000 UBO updates per frame.  This is going to run slow.

 

(3) You have a single large UBO sized for 1000 objects.  At the start of each frame you make a pass through your objects.  You update the data they're going to use and copy it off to a system memory buffer.  Then you make a single glBufferSubData call.  Each object stores an object id from which you can reconstruct the offset its data is at in the UBO.  To draw an object you make a glBindBufferRange call, then draw.  You have 1000 glBindBufferRange calls but one UBO update per frame.  This is going to run fast.
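
For what it's worth, the CPU side of option (3) might be sketched roughly like this. It is only an illustration under assumptions: PerObjectData, FrameStaging and packFrame are made-up names, the stride is assumed to already be a multiple of whatever GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT reports on your driver, and the actual GL calls only appear in comments so that the snippet compiles without a context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-object uniform data: two 4x4 float matrices (128 bytes).
struct PerObjectData {
    float modelViewProj[16];
    float modelView[16];
};

// CPU-side staging copy of the whole UBO for one frame.
struct FrameStaging {
    std::size_t stride = 0;           // bytes between consecutive objects
    std::vector<std::uint8_t> bytes;  // contents to upload in a single call

    // Byte offset of an object's block, reconstructed from its object id.
    std::size_t offsetOf(std::size_t objectId) const { return objectId * stride; }
};

// Packs every object's data into one contiguous buffer.
// Assumption: `stride` is a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT.
FrameStaging packFrame(const std::vector<PerObjectData>& objects, std::size_t stride) {
    assert(stride >= sizeof(PerObjectData));
    FrameStaging staging;
    staging.stride = stride;
    staging.bytes.assign(objects.size() * stride, 0);
    for (std::size_t id = 0; id < objects.size(); ++id) {
        std::memcpy(staging.bytes.data() + staging.offsetOf(id),
                    &objects[id], sizeof(PerObjectData));
    }
    return staging;
}

// Per frame, with a GL context and the UBO bound (not compiled here):
//   glBufferSubData(GL_UNIFORM_BUFFER, 0, staging.bytes.size(), staging.bytes.data());
//   for each object id:
//     glBindBufferRange(GL_UNIFORM_BUFFER, 0, ubo, staging.offsetOf(id), sizeof(PerObjectData));
//     ... draw ...
```

The point of the layout is that the per-frame cost is one upload plus one cheap range bind per object, instead of one upload per object.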

 

The conclusion is that using UBOs involves some re-architecting.  You can't just take a bunch of code using standalone uniforms, port it over to UBOs without changing anything else, and expect to get the same performance from it.  You need to think about your updates, group them all together, and that doesn't mean doing 1000 updates at the same time, that means doing one update that covers all 1000 objects.

 

How can I so confidently lay the blame at the API here?  Because you can do (1) and (2) in D3D, and with all other things being equal they run fast (with (2) being faster than (1) owing to D3D's explicit "discard" semantics).  UBOs in GL, however, do not run fast under those circumstances.


Edited by mhagain, 27 April 2014 - 01:28 PM.

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#5 Ciprian Stanciu   Members   -  Reputation: 206

Posted 11 May 2014 - 07:04 PM

@mhagain

 

I thought I was doing something wrong as well when I saw they're slower (because nobody will tell you that, just that using them is cool and fast). I too turned all uniforms into a single UBO, updating and binding it on a per-object basis (just like in D3D11), and it's slower. I even made it so that if the camera doesn't move the update doesn't even happen, but binding 100 UBOs for each draw call was still slower than using a plain dozen glUniform calls, by around 1-3%.

 

I was planning to do #3 as described above when I found the topic :).


Relative Games - My apps


#6 mhagain   Crossbones+   -  Reputation: 7821

Posted 12 May 2014 - 03:02 AM

Beware that this isn't necessarily going to run faster than standalone uniforms either (depends on how much data you have in the buffer vs how many standalone uniforms you were using, other factors, and will be very driver-dependent).  1% to 3% speed difference in either direction is IMO acceptable.  At that stage the primary advantage of UBOs is one of convenience: being able to share uniforms among multiple different programs rather than having to reload them every time you change program.




#7 Ciprian Stanciu   Members   -  Reputation: 206

Posted 12 May 2014 - 09:23 AM

Yeah, but sharing a small UBO between all programs while still having a second UBO per program is kind of bad too. I haven't tested this on GL, but I did it on D3D11: I had 3 CBs per object with different update frequencies (per frame, per material and per object), and using 3 CBs per object was always slower than using just one. This happened on both my old HD5770 and my newer HD7850; I initially thought it was down to the first-generation DX11 drivers, but it seems it wasn't. Some still argue that they see benefits from sharing some buffers, but I haven't found any case in favor of that yet. I tried with 30-bone skinned meshes too, and it was still faster to keep all the constant data in one CB than to split the bones into one CB and the rest into another (so that I could, for example, reuse the bones CB in a shadow pass where I needed just that).

 

UPDATE: Just implemented the global uniform buffer with glBindBufferRange calls. I discovered that the offset you pass to glBindBufferRange needs to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, and I still observed a performance penalty of about 1% for using a dozen calls to glBindBufferRange compared to using just glUniform calls. I'll also add that for glUniform calls I have a state minimizer which compares the data with the previously set data (from a previous object, say), and if it's the same, it doesn't call glUniform. And the comparison was done standing still, so basically I don't even map the buffer to update it, because I'm not moving or changing any data.
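
A state minimizer of the kind described above might look roughly like this. It is a sketch, not the actual engine code: UniformCache and needsUpload are hypothetical names, and it assumes one cache per program object, since standalone uniform values belong to the program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <unordered_map>
#include <vector>

// Remembers the last bytes sent to each uniform location of one program
// and reports whether a new value actually differs from them.
class UniformCache {
public:
    // Returns true if the caller should go ahead and issue the glUniform* call.
    bool needsUpload(int location, const void* data, std::size_t size) {
        assert(data != nullptr && size > 0);
        std::vector<unsigned char>& last = lastValues_[location];
        if (last.size() == size && std::memcmp(last.data(), data, size) == 0) {
            return false;  // identical to what was set before: skip the call
        }
        const unsigned char* bytes = static_cast<const unsigned char*>(data);
        last.assign(bytes, bytes + size);  // remember the new value
        return true;
    }

private:
    std::unordered_map<int, std::vector<unsigned char>> lastValues_;
};

// Intended use, with a GL context (not compiled here):
//   if (cache.needsUpload(loc, matrix, sizeof(float) * 16))
//       glUniformMatrix4fv(loc, 1, GL_FALSE, matrix);
```

With a scene that is standing still, every call after the first frame would be filtered out, which is worth keeping in mind when comparing against the UBO path.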

 

UPDATE2: Did it on D3D11.1 using VSSetConstantBuffers1 and PSSetConstantBuffers1, and there is indeed a difference, of at most about 0.4%. I already had 99% GPU usage though, so it's not all that unexpected; on GL I had ~63% GPU usage and it's mostly doing the same thing. Perhaps it's because I have relatively small constant buffers? I have 464 bytes per object, which have to be padded to 512 for alignment purposes.


Edited by cippyboy, 12 May 2014 - 08:10 PM.



#8 TheChubu   Crossbones+   -  Reputation: 4073

Posted 13 May 2014 - 03:45 AM


"(3) You have a single large UBO sized for 1000 objects.  At the start of each frame you make a pass through your objects.  You update the data they're going to use and copy it off to a system memory buffer.  Then you make a single glBufferSubData call.  Each object stores an object id from which you can reconstruct the offset its data is at in the UBO.  To draw an object you make a glBindBufferRange call, then draw.  You have 1000 glBindBufferRange calls but one UBO update per frame.  This is going to run fast."

Thing is, there are limitations:

 

It seems to me that for drawing at such scale, you'd have to maintain several big UBOs. Say, one for matrices, other for material data, etc.

 

For example, my card supports UBOs of up to 64 KiB (65536 bytes, from querying GL_MAX_UNIFORM_BLOCK_SIZE), and I've heard that number a few times. Say that for drawing you need to upload a block with an MVP matrix and a modelView matrix for lighting. That's 128 bytes per instance, i.e. room for 512 of those instances.

 

I haven't tried it yet but maybe you can bind ranges of different UBOs to the same binding slot in the shader program?

 

Or maybe you could just say "upload all the stuff I can, draw, upload the rest of the stuff, keep drawing". That would reduce glBufferSubData calls drastically, but you'd need some logic to check how many of those blocks you can upload at once.
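
The "how many can I upload at once" logic could be something as simple as the following sketch. Batch and planBatches are hypothetical names, and the 65536 and 128 used in the comment are just the numbers from this post; the real limit would come from querying GL_MAX_UNIFORM_BLOCK_SIZE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A run of consecutive objects that fits into one uniform block.
struct Batch {
    std::size_t firstObject;
    std::size_t count;
};

// Splits `objectCount` objects into batches so that each batch's data
// (count * bytesPerObject) stays within `maxBlockSize` bytes.
std::vector<Batch> planBatches(std::size_t objectCount,
                               std::size_t bytesPerObject,
                               std::size_t maxBlockSize) {
    assert(bytesPerObject > 0);
    std::vector<Batch> batches;
    const std::size_t perBatch = maxBlockSize / bytesPerObject;
    if (perBatch == 0) {
        return batches;  // a single object does not fit into the block
    }
    for (std::size_t first = 0; first < objectCount; first += perBatch) {
        batches.push_back({first, std::min(perBatch, objectCount - first)});
    }
    return batches;
}

// Example: planBatches(1000, 128, 65536) yields two batches,
// objects 0..511 and objects 512..999.
```

As far as I understand it, the limit applies to the block the shader sees rather than to the buffer object itself, so this mostly matters when the whole array lives in a single uniform block.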

 

There are also alignment requirements which I don't quite understand yet (from querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, I'm not sure if my card is saying I should align my updates to 256-byte blocks or 256-bit blocks).


"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

 

My journals: dustArtemis ECS framework and Making a Terrain Generator


#9 mhagain   Crossbones+   -  Reputation: 7821

Posted 13 May 2014 - 07:43 AM

There is also alignment requirements which I don't quite understand yet (I'm not sure if my card is saying I should align my updates to 256 byte blocks or 256 bit blocks from what I've seen querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT).

 

I'll need to check my code (I have all of this written and working) but if memory serves it's bytes.  Yes, that typically means that you'll have some empty space at the end of each object's block in the UBO, but that's OK; the more important thing is to minimise the updates as much as possible by doing as many of them as possible in a single operation.

 

The unfortunate consequence of this is that GL code using UBOs will look quite different to D3D code using cbuffers, so it can be messy if you want to support both APIs in the same program.

 

What I personally haven't tested is the single-bind/multiple-update pattern using GL_ARB_buffer_storage and persistent mapping; I might try that later on as it would be interesting to get some visibility on how that works as an option.




#10 Ciprian Stanciu   Members   -  Reputation: 206

Posted 13 May 2014 - 08:06 AM

There is also alignment requirements which I don't quite understand yet (I'm not sure if my card is saying I should align my updates to 256 byte blocks or 256 bit blocks from what I've seen querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT).

 

 

It's bytes and it refers to the Offset only, so if it says 256 you can do

glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 256, Range1 );

glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 512, Range2 );

....

glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, n * 256, Range3 );
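
To get from a block size to a legal offset, you'd round the per-object size up to the alignment and multiply by the object index. A small sketch, where alignUp and paddingFor are hypothetical helper names and 256 is only an example value (query GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT for the real one):

```cpp
#include <cstddef>

// Rounds `size` up to the next multiple of `alignment` (alignment must be > 0).
constexpr std::size_t alignUp(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) / alignment * alignment;
}

// Bytes of unused padding at the end of each object's slot.
constexpr std::size_t paddingFor(std::size_t size, std::size_t alignment) {
    return alignUp(size, alignment) - size;
}

// Offset to pass to glBindBufferRange for object number `index`.
constexpr std::size_t offsetForObject(std::size_t index, std::size_t size,
                                      std::size_t alignment) {
    return index * alignUp(size, alignment);
}

// With the 464-byte per-object data mentioned earlier in the thread and an
// assumed alignment of 256, each slot takes 512 bytes, 48 of them padding.
static_assert(alignUp(464, 256) == 512, "464 bytes round up to 512");
static_assert(paddingFor(464, 256) == 48, "48 bytes of padding per object");
```

Only the offset has to respect the alignment; the size passed to glBindBufferRange can stay at the real block size.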

 

@mhagain

You talked about fewer updates; I do no updates, just bindings, and it's still slower than glUniform calls. I'll probably try the persistent mapping in GL next as well.




#11 mhagain   Crossbones+   -  Reputation: 7821

Posted 13 May 2014 - 09:25 AM

 

@mhagain

You talked about less updates, I do no updates, just bindings and it's still slower than glUniforms. I'll probably try the persistent mapping in GL next as well.

 

 

There's obviously a tipping point beyond which UBOs are faster than X number of glUniform calls, where X varies depending on your hardware and driver.

 

With a performance difference of under 1% I'd still incline towards using UBOs anyway as the interface is going to be cleaner and you get to share uniforms among multiple programs.




#12 TheChubu   Crossbones+   -  Reputation: 4073

Posted 13 May 2014 - 11:59 AM

What about having a UBO which is an array of blocks? (I'm not even sure you can use uniforms for indexing like that.)

 

 
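const uint MAX_INSTANCES = 1024;

// Index of the instance being drawn, set with glUniform per draw.
uniform uint currentIndex;

struct Matrices
{
    mat4 modelViewProj;
    mat4 modelView;
};

layout (std140, binding = 0) uniform MatrixBlocks
{
    Matrices matrices[MAX_INSTANCES];
};

void main ()
{
    // doMatrixStuff is a placeholder for whatever the shader does with them.
    doMatrixStuff(matrices[currentIndex]);
}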
Would that make the matrix array more tightly packed? (instead of having matrix, then another matrix at the 256 byte mark, and so on). You'd update "currentIndex" per instance drawn.

 

MAX_INSTANCES could be a constant or also a uniform (i.e., the instance count for this frame).
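
On the packing question: under std140 a mat4 is stored as four vec4 columns (64 bytes), so a struct of two mat4s is 128 bytes and the array stride is also 128, with no padding to the 256-byte mark. A C++ mirror of that layout could look like the sketch below; maxInstancesFor and arrayOffset are hypothetical helper names, and the block-size numbers in the comment are examples taken from this thread.

```cpp
#include <cstddef>

// C++ mirror of the GLSL `Matrices` struct as laid out by std140:
// each mat4 is four vec4 columns, i.e. 16 floats = 64 bytes.
struct Matrices {
    float modelViewProj[16];
    float modelView[16];
};
static_assert(sizeof(Matrices) == 128, "two mat4s occupy 128 bytes");

// Number of array elements that fit in a block of the given size
// (the size would come from querying GL_MAX_UNIFORM_BLOCK_SIZE).
constexpr std::size_t maxInstancesFor(std::size_t maxUniformBlockSize) {
    return maxUniformBlockSize / sizeof(Matrices);
}

// Byte offset of matrices[index] inside the block.
constexpr std::size_t arrayOffset(std::size_t index) {
    return index * sizeof(Matrices);
}

// With a 65536-byte block limit this gives 512 elements, so a
// MAX_INSTANCES of 1024 (131072 bytes) would not fit on such a card.
static_assert(maxInstancesFor(65536) == 512, "512 instances in 64 KiB");
```

So the array is tightly packed, but MAX_INSTANCES has to be chosen to fit the block size limit of the target hardware.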


"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

 

My journals: dustArtemis ECS framework and Making a Terrain Generator


#13 Ciprian Stanciu   Members   -  Reputation: 206

Posted 13 May 2014 - 09:01 PM


Interesting idea to do a single buffer binding AND do a glUniform call per object, using the best of both worlds I guess.

 

And yes, you can index into block variables, but the drivers have to support it and it's a bit off spec. I did that on AMD, and once I'm at around 80% of the max 16K / block it sometimes crashes. On GLES3, the last time I tried it, it crashed the shader compilation on Adreno 320. I told PowerVR about it and they exhibited the same issue; now it's supposedly fixed (on the reference driver at least). I just have to get a GLES3 iOS device now and see for myself :).

 

I tried the glBindBufferRange thing on my Nexus 4, btw, and I'm seeing a serious 25% performance degradation (again with absolutely no updates per frame). I also can't put one of my structures inside the block: it gives off some stupid errors. I think it ignores some variables and then ends up with different blocks pretending to be the same one, so I had to put my light properties (PS only) in standard glUniform calls. On Nexus 4 the offset alignment is 4 bytes, btw, pretty small in comparison.






