OpenGL Uniform buffer updates


I use uniform buffers to set constants in the shaders.

Currently each uniform block is backed by a uniform buffer of the appropriate size, glBindBufferBase is called once per frame and glNamedBufferSubDataEXT is called for every object without orphaning.

 

I tried to optimize this by using a larger uniform buffer, calling glBindBufferRange and updating subsequent regions in the buffer and this turned out to be significantly slower. After looking around I found this and similar threads that talk about the same problem. The suggestion seems to be to use one large uniform buffer for all objects, only update once with the data for all objects and call glBindBufferRange for every drawcall. 

 

Is this the definitive way to go in OpenGL, regardless of whether BufferSubData or MapBufferRange is used? In one place it was suggested that for small amounts of data glUniform*fv calls are the fastest choice. It would be nice to reach comparable levels of performance with uniform buffers.

 

What is your experience with updating shader uniforms in OpenGL?


Obviously, for runtime performance it would be better to upload only once and bind ranges. This takes more memory but runs faster.

 

But of course this wouldn't be viable for truly dynamic data, which can't be precalculated in a single upload stage. That's when runtime BufferSubData calls become viable. If your data isn't truly dynamic like this, then I'd say BufferSubData should only be used in the single upload stage.

 

Regardless, I think direct glUniform* calls have perfectly fine performance. The use of uniform buffers might be overhyped. Especially with truly dynamic data, the cases in which they add performance diminish: they mainly pay off when you use the SAME identical (updated) data multiple times. If you're updating a really dynamic uniform buffer and only using that data once, then you may as well just use a straight glUniform*fv call.


When using a larger uniform buffer and ranges, there are a lot of things you have to do right, and it won't always give the best performance in every situation. While I can't know whether it would be faster in your specific circumstance, I can ask a few things to make sure you were setting it up right:

- Were updates far enough apart to avoid conflicts? (It can be a perf hit if call A writes to the first half of block A and call B writes to the second half of block A.)
- Were you invalidating (orphaning) old data early enough, or were you orphaning just before writing new data to the same location?
- Were you orphaning only large chunks at a time, preferably ones that map directly to memory pages?
- Was your buffer at least twice the size of the chunks you orphan?
- Were you using unsynchronized mapping when writing new data?
- Were you ever blocking on your sync?
- Were you syncing once per orphan call instead of once per frame?
- Did you test that STREAM is actually faster than DYNAMIC (the buffer usage hint)?
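Several of the questions above (orphaning granularity, buffer twice the chunk size, syncing per orphan) boil down to managing a write cursor over a streamed buffer. Here's a minimal CPU-side sketch of that bookkeeping; the type and function names are mine, not from the thread, and it assumes a power-of-two alignment:

```c
#include <stddef.h>

/* Hypothetical bookkeeping for a streamed uniform ring buffer.
   ring_alloc returns the write offset for `size` bytes and sets
   *orphan to 1 when the cursor wraps, i.e. when the caller should
   orphan/invalidate the buffer before writing. */
typedef struct {
    size_t capacity;  /* total buffer size in bytes */
    size_t cursor;    /* next free byte */
    size_t alignment; /* must be a power of two, e.g. the value of
                         GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT */
} StreamRing;

size_t ring_alloc(StreamRing *r, size_t size, int *orphan)
{
    /* round the cursor up to the required alignment */
    size_t aligned = (r->cursor + r->alignment - 1) & ~(r->alignment - 1);
    *orphan = 0;
    if (aligned + size > r->capacity) { /* wrap: orphan and restart */
        *orphan = 1;
        aligned = 0;
    }
    r->cursor = aligned + size;
    return aligned;
}
```

With this, you orphan exactly once per wrap of the whole buffer rather than per write, which matches the "only orphan large chunks" advice above.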

 

I almost certainly left out some important performance traps and considerations, but that should get you started. If you were already aware of those rules and doing everything right, it's probably wiser to just post sample code so someone can catch anything you missed. It's always possible your code just won't gain a performance benefit from this stuff anyway, and it's complicated enough that it's probably not worth your time unless you need the extra performance.


I'm the author of the original recommendation, and it came about through considerable experimentation in an attempt to get comparable performance out of GL UBOs as you can get from D3D cbuffers.

 

The issue was that in D3D the per-object orphan/write/use/orphan/write/use/orphan/write/use pattern (with a single small buffer containing storage for a single object) works, it runs fast and gives robust and consistent behaviour across different hardware vendors.  In GL none of this happens.

 

The answer to the question "have you tried <insert tired old standard buffer object recommendation here>?" is "yes - and it didn't work".

 

The only solution that worked robustly across hardware from different vendors was to iterate over each object twice - this may not be such a big deal, as you're probably already doing it anyway (e.g. once to build a list of drawable objects, once to draw them) - with the first iteration choosing a range in a single large UBO (sized for the maximum number of objects you want to handle in a frame) and writing in the object's data, and the second calling glBindBufferRange and drawing.

 

The data was originally written to a system memory intermediate storage area, then that was loaded (between the two iterations) to the UBO using glBufferSubData.  Mapping (even MapBufferRange with the appropriate flags) sometimes worked on one vendor but failed on another and gave no appreciable performance difference anyway, so it wasn't pursued much.  glBufferData (NULL) before the glBufferSubData call gave no measurable performance difference.  glBufferData on its own gave no measurable performance difference.  Different usage hints were a waste of time, as drivers ignore these anyway.

 

I'm fully satisfied that all of this is 100% a specification/driver issue, particularly coming from the OpenGL buffer object specification, and poor driver implementations.  After all, the hardware vendors can write a driver where the single-small-buffer and per-object orphan/write/use pattern works and works fast - they've done it in their D3D drivers.  It would be interesting to test if the GL4.4 immutable buffer storage functionality helps any, but in the absence of AMD and Intel implementations I don't see anything meaningful or useful coming from such a test.

 

Finally, by way of comparison with standalone uniforms: the problem there was that in GL standalone uniforms are per-program state, and there are no global/shared uniforms outside of UBOs.

Edited by mhagain


What I currently do:

bindBufferBase(smallUniformBuffer);

for (o : objects)
{
    bufferSubData(smallUniformBuffer, o.transformation);
    draw(o.vertices);
}

What I think I should be doing:

offset = 0;
for (o : objects)
{
    memory[offset] = o.transformation;
    ++offset;
}

bufferData(hugeBuffer, memory);

offset = 0;
for (o : objects)
{
    bindBufferRange(hugeBuffer, offset);
    draw(o.vertices);
    ++offset;
}
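Concretely, the first pass of the pseudocode above could look like this on the CPU side. This is a sketch under my own assumptions (one 4x4 matrix per object, names like pack_transforms are illustrative), with the actual GL calls indicated only in comments:

```c
#include <stddef.h>
#include <string.h>

enum { MAT4_BYTES = 64 }; /* one 4x4 float matrix */

/* Pack one mat4 per object into a CPU staging area, each at an offset
   that is a multiple of `alignment` (as glBindBufferRange requires).
   offsets[i] receives the buffer offset for object i; the return value
   is the number of staging bytes actually used. */
size_t pack_transforms(const float *mats /* count * 16 floats */,
                       size_t count, size_t alignment,
                       unsigned char *staging, size_t *offsets)
{
    size_t stride = (MAT4_BYTES + alignment - 1) / alignment * alignment;
    for (size_t i = 0; i < count; ++i) {
        offsets[i] = i * stride;
        memcpy(staging + offsets[i], mats + i * 16, MAT4_BYTES);
    }
    return count ? offsets[count - 1] + MAT4_BYTES : 0;
    /* Then, once per frame:
         glBufferSubData(GL_UNIFORM_BUFFER, 0, usedBytes, staging);
       and per draw call:
         glBindBufferRange(GL_UNIFORM_BUFFER, binding, hugeBuffer,
                           offsets[i], MAT4_BYTES); */
}
```

Note that the per-object stride ends up much larger than the matrix itself once the driver's offset alignment (often 256 bytes) is honored.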

At first I was a bit frustrated, because I am used to the Effects framework of the D3D SDK, but after reading the presentation about batched buffer updates it seems a D3D application can also benefit from doing it this way. So the architecture can be the same for both APIs.

 

@richardurich:

"Were updates far enough apart to avoid conflicts (can be perf hit if call A writes to first half of block A, and call B writes to second half of block A)?"

Can you explain this a bit more? Are you saying that it is not good to write to the first half of the buffer and then to the second half, even though the ranges don't intersect?


[mhagain's post quoted in full - see above]

Thanks, that answers my question!

Have you tried this architecture, which works well for OpenGL, with a D3D backend? Will it also work well there? At least that would make it less annoying that this is apparently a driver weakness.

 

I'm indeed already iterating over each object twice, but I'm wondering if I now have to do it a third time, because the first iteration is followed by a sort which could tell me when I don't need to update the per-material buffers, for instance.

 

I'm also wondering about another thing. Is it very important that the uniform buffer is large enough to fit every single object drawn per frame, or can you achieve good performance with a buffer that is only large enough to hold some objects' data before it has to be rewritten? To me it sounds like that should already help, but then again the handling of uniform buffers shouldn't be this hard in the first place.


[B_old's question about update conflicts quoted - see above]

A different way to look at the concept I'm talking about: don't put uniforms for group A at offset 0 and uniforms for group B at offset 12 (a single vec3 later). If you're updating an odd amount of data like one vec3, pad out to the next 64-byte boundary for the start of object B. Don't trust my 64-byte number there; I have no idea if 64 bytes is the right value. I feel like a single 4x4 matrix was the right size, but I could easily be remembering wrong, and it could have changed anyway.

 

I don't know which calls or vendors took the hit; the end result was just to X-byte align each group of uniforms. You'll also have to test that this gives better performance, since it may hurt you if you are bandwidth starved.

 

I know I've said it before, and I'm sure you already know it, but make sure you are bottlenecked on something this fixes before spending a lot of time tuning performance. And make sure you're late enough in development that the perf data you collect will still be valid. Unless you're used to writing development code that also accounts for future optimizations (a variable for group alignment, one for map alignment, etc.), you're best off delaying this type of coding as long as humanly possible. For example, you want to 1-byte align during development so you catch any bugs where you stomp past your buffer, instead of hiding the bug behind padding.
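The "1-byte align during development, X-byte align for release" idea falls out naturally if the alignment is a variable rather than a hard-coded constant. A tiny helper like the following (my naming, not from the thread) makes that switch a one-line change:

```c
#include <stddef.h>

/* Round `offset` up to the next multiple of `alignment` (any positive
   alignment, not just powers of two).  During development, pass 1 so
   out-of-bounds writes aren't hidden by padding; in release, pass the
   driver's value, queried once via
   glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align). */
size_t align_up(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) / alignment * alignment;
}
```

For example, align_up(12, 64) pads a lone vec3 at offset 12 out to a 64-byte boundary, while align_up(12, 1) leaves it untouched so an overrun is immediately visible.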


[richardurich's reply quoted in full - see above]

OK, I think I understand. If we are thinking about the same thing, calls to glBindBufferRange actually have to use an offset that is a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, which on my machine (and I think on many other cards) is 256.

 

Just now I tried to evaluate whether the uniform updates are a bottleneck in my case. For this test I stripped down the rendering pipeline as much as I could with regard to OpenGL interaction, and simulated the cost of the "optimized" uniform updates by replacing the per-object glBufferSubData(float4x4) call with a glBindBufferRange() call.

I compared the two approaches with 1K and 4K draw calls of very simple geometry (the same vertex buffer for every draw call) and could not see any noticeable difference.

 

I concluded that the optimized version could not possibly be faster than just calling glBindBufferRange() for every differently transformed object, which in turn means this is not my bottleneck. 

 

So has the driver situation improved or is my test/conclusion flawed?

Edited by B_old


If you slim down the OpenGL calls in the rendering pipeline so the driver is only busy 20% of the time, the frame rate won't change whether you increase that to 40% (twice as slow) or decrease it to 10% (twice as fast). You basically just guaranteed the driver has plenty of time to do all the memory management required, and that's mostly what you were trying to take off the driver's plate in the first place.

 

It sounds like you do not need to be worrying about this stuff yet, and may never need to worry about it.

