# OpenGL SPEED: glUniform Vs. uniform buffer objects.

This topic is 1343 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


I recently dropped all support for the legacy OpenGL matrix functionality from my 3D engine. Instead of using GL_MODELVIEW, for example, I now upload my view matrices to a Uniform Buffer Object that is shared and accessed by my shaders. But as stated in a previous post, this caused a huge FPS drop (from 1000 to 100). I've managed to get this back up to about 400 by trimming out as many matrix calls as possible (for example, instead of transforming meshes before drawing, I now transform the vertices before uploading them to the GPU), but I'm still not happy with the performance.

Would regular glUniform calls be quicker than using buffer objects? For example, when I bind a shader, I would pass the current view/projection matrices as uniforms rather than using the UBOs.

This would mean each shader would have its own copy of the matrices rather than the global (UBO) ones...

Does anyone know if glUniform() calls are faster than UBO calls (which have proven to drop FPS quite significantly)?

Jonathan.

Try it and see?

NOTJon.

##### Share on other sites

I thought I was doing something wrong as well when I saw they're slower (because nobody will tell you that, just that using them is cool and fast). I too turned all uniforms into a single UBO, updating and binding it on a per-object basis (just like in D3D11), and it's slower. I even made it so that if the camera doesn't move the update doesn't happen at all, but binding 100 UBOs for each draw call was still around 1-3% slower than using a plain dozen glUniform calls.

I was planning to do #3 as described above when I found the topic :).

##### Share on other sites

Beware that this isn't necessarily going to run faster than standalone uniforms either (depends on how much data you have in the buffer vs how many standalone uniforms you were using, other factors, and will be very driver-dependent).  1% to 3% speed difference in either direction is IMO acceptable.  At that stage the primary advantage of UBOs is one of convenience: being able to share uniforms among multiple different programs rather than having to reload them every time you change program.

##### Share on other sites

Yeah, but sharing a small UBO between all programs while still having a second UBO per program is kind of bad too. I haven't tested this on GL, but I did it on D3D11: I had 3 CBs per object with different update frequencies (per frame, per material and per object), and using 3 CBs per object was always slower than using just one. This happened on both my old HD5770 and my newer HD7850; I initially thought it was down to the first-generation DX11 drivers, but it seems it wasn't. Some still argue that they see benefits from sharing some buffers, but I haven't found a case in favor of that yet. I tried with 30-bone skinned meshes too, and it was still faster to keep all the constant data in one CB than to split the bones into one CB and the rest into another (so that I could, for example, reuse the bones CB in a shadow pass where I needed just that).

UPDATE: Just implemented the global uniform buffer with glBindBufferRange calls. I discovered that the offset you pass to glBindBufferRange needs to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, and I still observed a performance penalty of about 1% when using a dozen calls to glBindBufferRange compared to using just glUniform calls. I'll also add that for glUniform calls I have a state minimizer which compares the data with what was previously set (by a previous object, say) and skips the glUniform call if it's the same. The comparison was also done standing still, so I don't even map the buffer to update it, because I'm not moving or changing any data.
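A minimal sketch of the kind of state minimizer described above, written without any GL calls so it can be read on its own. The `UniformCache` type and `uniform_needs_upload` function are hypothetical names, not taken from the poster's engine; the idea is simply to remember the last bytes sent for a uniform and skip the upload when the new value is identical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cache for the last value uploaded to one uniform.
   64 bytes is enough for a mat4; the layout is an assumption. */
typedef struct {
    unsigned char last[64];
    size_t size;  /* number of bytes currently cached */
    int valid;    /* 0 until the first upload */
} UniformCache;

/* Returns 1 if the caller should issue the glUniform* call,
   0 if the value matches what was uploaded last time. */
int uniform_needs_upload(UniformCache *c, const void *data, size_t size)
{
    assert(c != NULL && data != NULL);
    if (size > sizeof c->last)
        return 1;  /* too large to cache: always upload */
    if (c->valid && c->size == size && memcmp(c->last, data, size) == 0)
        return 0;  /* unchanged since the previous upload */
    memcpy(c->last, data, size);
    c->size = size;
    c->valid = 1;
    return 1;
}
```

A caller would keep one cache per uniform location per program and wrap each glUniform call in `if (uniform_needs_upload(...))`. Whether the comparison costs less than the call it avoids is driver-dependent, so it is worth measuring.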

UPDATE2: Did it on D3D11.1 using VSSetConstantBuffers1 and PSSetConstantBuffers1, and there is indeed a difference of approximately 0.4% or less. I already had 99% GPU usage though, so it's not all that unexpected; on GL I had ~63% GPU usage while doing mostly the same thing. Perhaps it's because I have relatively small constant buffers? I have 464 bytes per object, which need to be padded to 512 for alignment purposes.
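The padding from 464 to 512 bytes can be sketched as a round-up to the next multiple of the offset alignment. The helper name `align_up` is an assumption, and so is the alignment of 256 used in the usage note; in a real program the value would come from querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT at runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of alignment (alignment > 0). */
size_t align_up(size_t size, size_t alignment)
{
    assert(alignment > 0);
    return (size + alignment - 1) / alignment * alignment;
}
```

With an alignment of 256, `align_up(464, 256)` gives 512, matching the figure in the post. The result is the stride between consecutive per-object blocks in the buffer.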

Edited by cippyboy

##### Share on other sites

(3) You have a single large UBO sized for 1000 objects. At the start of each frame you make a pass through your objects. You update the data they're going to use and copy it off to a system memory buffer. Then you make a single glBufferSubData call. Each object stores an object id from which you can reconstruct the offset its data is at in the UBO. To draw an object you make a glBindBufferRange call, then draw. You have 1000 glBindBufferRange calls but one UBO update per frame. This is going to run fast.

Thing is, there are limitations:

It seems to me that for drawing at such scale, you'd have to maintain several big UBOs: say, one for matrices, another for material data, etc.

For example, my card supports UBOs of up to 65kB (from querying GL_MAX_UNIFORM_BLOCK_SIZE), and I've heard that number around a few times. Say that for drawing you need to upload a block with an MVP matrix and a modelView matrix for lighting. That's 128 bytes per instance, i.e. 512 instances per block.
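The arithmetic above can be checked with a small sketch. The struct and function names are hypothetical, and the block size is assumed to be exactly 65536 bytes (the "65kB" figure); the real limit should be read from GL_MAX_UNIFORM_BLOCK_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Two 4x4 float matrices per instance, as in the example above. */
typedef struct {
    float mvp[16];
    float model_view[16];
} InstanceMatrices;

/* How many instances of the given size fit in one uniform block. */
size_t max_instances_per_block(size_t max_block_size, size_t bytes_per_instance)
{
    assert(bytes_per_instance > 0);
    return max_block_size / bytes_per_instance;
}
```

Here `sizeof(InstanceMatrices)` is 128 on platforms with 4-byte floats, and `max_instances_per_block(65536, sizeof(InstanceMatrices))` gives the 512 instances mentioned in the post.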

I haven't tried it yet but maybe you can bind ranges of different UBOs to the same binding slot in the shader program?

Or maybe you could just say "upload all the stuff I can, draw, upload the rest of the stuff, keep drawing". That would reduce glBufferSubData calls drastically, but you'd need some logic to check how many of those blocks you can upload at once.
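The "upload what fits, draw, upload the rest" logic might look roughly like the sketch below. Both function names are hypothetical, and `capacity` stands for however many per-object blocks fit in one upload (512 in the example above).

```c
#include <assert.h>
#include <stddef.h>

/* Number of upload-then-draw rounds needed for object_count objects
   when at most capacity blocks fit in one upload. */
size_t batch_count(size_t object_count, size_t capacity)
{
    assert(capacity > 0);
    return (object_count + capacity - 1) / capacity;
}

/* Number of objects in batch number `batch` (0-based);
   0 if that batch lies past the last object. */
size_t batch_size(size_t object_count, size_t capacity, size_t batch)
{
    assert(capacity > 0);
    size_t first = batch * capacity;
    if (first >= object_count)
        return 0;
    size_t remaining = object_count - first;
    return remaining < capacity ? remaining : capacity;
}
```

For 1000 objects and a capacity of 512 this gives two rounds, of 512 and 488 objects, so two glBufferSubData calls per frame instead of 1000.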

There are also alignment requirements which I don't quite understand yet (from querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, I'm not sure if my card is saying I should align my updates to 256-byte blocks or 256-bit blocks).

##### Share on other sites

I'll need to check my code (I have all of this written and working) but if memory serves it's bytes. Yes, that typically means that you'll have some empty space at the end of each object's block in the UBO, but that's OK; the more important thing is to minimise the updates as much as possible by doing as many of them as possible in a single operation.
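A rough sketch of the "pad each block, then upload once" approach described above, kept free of GL calls. The function `pack_blocks` and its parameters are hypothetical; `stride` is assumed to be the per-object block size already rounded up to the offset alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy object_count blocks of block_size bytes from src (tightly packed)
   into staging, placing block i at offset i * stride. The padding between
   blocks is zeroed. Returns the number of bytes to upload. */
size_t pack_blocks(unsigned char *staging, size_t staging_size,
                   const unsigned char *src, size_t block_size,
                   size_t stride, size_t object_count)
{
    assert(staging != NULL && src != NULL);
    assert(stride >= block_size);
    size_t total = stride * object_count;
    assert(total <= staging_size);
    memset(staging, 0, total);
    for (size_t i = 0; i < object_count; ++i)
        memcpy(staging + i * stride, src + i * block_size, block_size);
    return total;
}
```

The returned byte count would then go to one glBufferSubData call per frame, and object `i` would be bound with glBindBufferRange at offset `i * stride` before its draw call.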

The unfortunate consequence of this is that GL code using UBOs will look quite different to D3D code using cbuffers, so it can be messy if you want to support both APIs in the same program.

What I personally haven't tested is the single-bind/multiple-update pattern using GL_ARB_buffer_storage and persistent mapping; I might try that later on as it would be interesting to get some visibility on how that works as an option.

##### Share on other sites

It's bytes, and it refers to the offset only, so if it says 256 you can do

    glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 256, Range1 );
    glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 512, Range2 );
    ....
    glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, n * 256, Range3 );

@mhagain

You talked about fewer updates; I do no updates, just bindings, and it's still slower than glUniform calls. I'll probably try persistent mapping in GL next as well.

##### Share on other sites


There's obviously a tipping point beyond which UBOs are faster than X number of glUniform calls, where X varies depending on your hardware and driver.

With a performance difference of under 1% I'd still incline towards using UBOs anyway as the interface is going to be cleaner and you get to share uniforms among multiple programs.

##### Share on other sites

What about having a UBO that holds an array of per-instance structs? (I'm not even sure you can use uniforms for indexing like that.)

    const uint MAX_INSTANCES = 1024;

    uniform uint currentIndex;

    struct Matrices
    {
        mat4 modelViewProj;
        mat4 modelView;
    };

    // "binding = 0" in the layout qualifier needs GLSL 4.20 or later
    layout (std140, binding = 0) uniform MatrixBlocks
    {
        Matrices matrices[MAX_INSTANCES];
    };

    void main()
    {
        // doMatrixStuff is a placeholder for whatever uses the matrices
        doMatrixStuff(matrices[currentIndex]);
    }

Would that make the matrix array more tightly packed (instead of having a matrix, then another matrix at the 256-byte mark, and so on)? You'd update "currentIndex" per instance drawn.

MAX_INSTANCES could be a constant or also a uniform (i.e. the instance count for this frame).
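On the packing question, a small sketch comparing the two layouts. It assumes the std140 rules, under which a struct of two mat4 members occupies 128 bytes and array elements follow each other with a 128-byte stride; the function names and the alignment values passed in are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

enum { MAT4_BYTES = 64 };                    /* 16 floats of 4 bytes */
enum { MATRICES_BYTES = 2 * MAT4_BYTES };    /* struct with two mat4 */

/* Bytes used when the structs sit back to back in one std140 array. */
size_t packed_array_bytes(size_t instances)
{
    return instances * MATRICES_BYTES;
}

/* Bytes used when each struct starts at its own aligned offset,
   as required when binding one range per object. */
size_t aligned_ranges_bytes(size_t instances, size_t alignment)
{
    assert(alignment > 0);
    size_t stride = (MATRICES_BYTES + alignment - 1) / alignment * alignment;
    return instances * stride;
}
```

For 512 instances the array takes 65536 bytes, while per-object ranges with a 256-byte alignment take 131072 bytes, so the array is indeed tighter in that case. The array as a whole still has to fit within GL_MAX_UNIFORM_BLOCK_SIZE, which at the 65kB figure quoted earlier caps it at 512 such structs.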

##### Share on other sites


Interesting idea to do a single buffer binding AND do a glUniform call per object, using the best of both worlds I guess.

And yes, you can index into block variables; the drivers have to support it though, and it's a bit off spec. I did that on AMD, and once I'm at around 80% of the max 16K per block, it sometimes crashes. On GLES3, the last time I tried it, it crashed the shader compilation on Adreno 320. I told PowerVR about it and they exhibited the same issue; it's now supposedly fixed (on the reference driver at least), so I just have to get a GLES3 iOS device and see for myself :).

By the way, I tried the glBindBufferRange thing on my Nexus 4 and I'm seeing a serious 25% performance degradation (again with absolutely no updates per frame). I also can't put one of my structures inside the block: it gives off some stupid errors. I think it ignores some variables and then ends up with different blocks pretending to be the same one, so I had to put my light properties (PS only) in standard glUniform calls. On Nexus 4 the offset alignment is 4 bytes, which is pretty small in comparison.

##### Share on other sites

