
OpenGL SPEED: glUniform Vs. uniform buffer objects.

11 posts in this topic

So recently I dropped all support for legacy OpenGL matrix functionality from my 3D engine; instead of GL_MODELVIEW and friends, I now upload my view matrices to a Uniform Buffer Object shared by my shaders. But as stated in a previous post, this caused a huge FPS drop (from 1000 to 100). I've managed to get that back up to about 400 by trimming out as many matrix operations as possible (for example, instead of transforming meshes at draw time, I now transform the vertices before uploading them to the GPU), but I'm still not happy with the performance.

Would regular glUniform calls be quicker than using buffer objects? For example, when I bind a shader, I could pass the current view/projection matrices as uniforms rather than reading them from the UBO.

This would mean each shader has its own copy of the matrices rather than the global (UBO) ones...

Does anyone know if glUniform() calls are faster than UBO accesses? (The UBOs have proven to drop FPS quite significantly.)

Jonathan.


Try it and see?

NOTJon.


I thought I was doing something wrong as well when I saw they're slower (because nobody will tell you that, just that using them is cool and fast). I too moved all my uniforms into a single UBO, updating and binding it on a per-object basis (just like in D3D11), and it's slower. I even made it so that the update is skipped entirely if the camera doesn't move, but binding 100 UBOs for the draw calls was still around 1-3% slower than using a plain dozen glUniform calls.

I was planning to do #3 as described above when I found the topic :).


Beware that this isn't necessarily going to run faster than standalone uniforms either (it depends on how much data you have in the buffer versus how many standalone uniforms you were using, among other factors, and will be very driver-dependent). A 1% to 3% speed difference in either direction is IMO acceptable. At that stage the primary advantage of UBOs is one of convenience: being able to share uniforms among multiple different programs rather than having to reload them every time you change program.


Yeah, but sharing a small UBO between all programs while still having a second UBO per program is kind of bad too. I haven't tested this on GL, but I did it on D3D11: I had 3 constant buffers per object with different update frequencies (per frame, per material and per object), and using 3 CBs per object was always slower than using just one. This happened on both my old HD5770 and my newer HD7850; I initially blamed the first-generation DX11 drivers, but apparently it wasn't that. Some still argue that they see benefits from sharing some buffers, but I haven't found a case in that favor yet. I tried with 30-bone skinned meshes too, and it was still faster to keep all the constant data in one CB than to split the bones into one CB and the rest into another (so that, for example, I could reuse the bones CB in a shadow pass where I needed just that data).

UPDATE: Just implemented the global uniform buffer with glBindBufferRange calls. I discovered that the offset you pass to glBindBufferRange needs to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, and I still observed a performance penalty of about 1% for a dozen glBindBufferRange calls compared to plain glUniform calls. I'll also add that for glUniform calls I have a state minimizer which compares the data with what was previously set (e.g. by the previous object) and skips the glUniform call if it's identical. And the comparison was done standing still, so I don't even map the buffer to update it, because I'm not moving/changing any data.

UPDATE 2: Did it on D3D11.1 using VSSetConstantBuffers1 and PSSetConstantBuffers1, and there is indeed a difference of approximately <=0.4%. I already had 99% GPU usage, though, so that's not unexpected; on GL I had ~63% GPU usage while doing mostly the same thing. Perhaps it's because I have relatively small constant buffers? I have about 464 bytes per object, which needs to be rounded up to 512 for alignment purposes.

Edited by cippyboy

(3) You have a single large UBO sized for 1000 objects. At the start of each frame you make a pass through your objects, update the data they're going to use and copy it off to a system-memory buffer, then make a single glBufferSubData call. Each object stores an object id from which you can reconstruct the offset its data sits at in the UBO. To draw an object you make a glBindBufferRange call, then draw. You have 1000 glBindBufferRange calls but only one UBO update per frame. This is going to run fast.
Thing is, there are limitations:

It seems to me that for drawing at such scale, you'd have to maintain several big UBOs. Say, one for matrices, another for material data, etc.

For example, my card supports up to 64 KB UBOs (querying GL_MAX_UNIFORM_BLOCK_SIZE), and I've seen that number around a few times. Say that for drawing you need to upload a block with an mvp matrix and a modelView matrix for lighting. That's 128 bytes, i.e., 512 of those instances.

I haven't tried it yet but maybe you can bind ranges of different UBOs to the same binding slot in the shader program?

Or maybe you could just say "upload all the stuff I can, draw, upload the rest of the stuff, keep drawing", which would reduce glBufferSubData calls drastically, but you'd need some logic to check how many of those blocks you can upload at once.

There are also alignment requirements which I don't quite understand yet (from querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, I'm not sure whether my card is saying I should align my updates to 256-byte blocks or 256-bit blocks).

There are also alignment requirements which I don't quite understand yet (from querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, I'm not sure whether my card is saying I should align my updates to 256-byte blocks or 256-bit blocks).

I'll need to check my code (I have all of this written and working), but if memory serves it's bytes. Yes, that typically means you'll have some empty space at the end of each object's block in the UBO, but that's OK; the more important thing is to minimise the updates by doing as many of them as possible in a single operation.

The unfortunate consequence of this is that GL code using UBOs will look quite different to D3D code using cbuffers, so it can be messy if you want to support both APIs in the same program.

What I personally haven't tested is the single-bind/multiple-update pattern using GL_ARB_buffer_storage and persistent mapping; I might try that later on as it would be interesting to get some visibility on how that works as an option.

There are also alignment requirements which I don't quite understand yet (from querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, I'm not sure whether my card is saying I should align my updates to 256-byte blocks or 256-bit blocks).

It's bytes, and it refers to the offset only, so if it says 256 you can do:

glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 256, Range1 );
glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, 512, Range2 );
...
glBindBufferRange( GL_UNIFORM_BUFFER, 0, BufferHandle, n * 256, RangeN );

@mhagain

You talked about fewer updates; I do no updates at all, just bindings, and it's still slower than glUniforms. I'll probably try persistent mapping in GL next as well.


@mhagain

You talked about fewer updates; I do no updates at all, just bindings, and it's still slower than glUniforms. I'll probably try persistent mapping in GL next as well.

There's obviously a tipping point beyond which UBOs are faster than X number of glUniform calls, where X varies depending on your hardware and driver.

With a performance difference of under 1% I'd still incline towards using UBOs anyway as the interface is going to be cleaner and you get to share uniforms among multiple programs.


What about having a UBO which is an array of blocks? (I'm not even sure you can use uniforms for indexing like that.)


const uint MAX_INSTANCES = 1024;

uniform uint currentIndex;

struct Matrices
{
    mat4 modelViewProj;
    mat4 modelView;
};

layout (std140, binding = 0) uniform MatrixBlocks
{
    Matrices matrices[MAX_INSTANCES];
};

void main ()
{
    doMatrixStuff(matrices[currentIndex]);
}

Would that make the matrix array more tightly packed (instead of having one matrix, then another matrix at the 256-byte mark, and so on)? You'd update "currentIndex" per instance drawn.

MAX_INSTANCES could be a constant or also a uniform (i.e., the instance count for this frame).


What about having a UBO which is an array of blocks... Would that make the matrix array more tightly packed? You'd update "currentIndex" per instance drawn.

Interesting idea, doing a single buffer binding AND a glUniform call per object; the best of both worlds, I guess.

And yes, you can index into block variables; the drivers have to support it though, and it's a bit off-spec. I did that on AMD and once I'm at around 80% of the max 16K per block, it sometimes crashes. On GLES3, the last time I tried it, it crashed the shader compiler on an Adreno 320; I told PowerVR about it and they exhibited the same issue, which is now supposedly fixed (on the reference driver at least). I just have to get a GLES3 iOS device now and see for myself :).

I tried the glBindBufferRange thing on my Nexus 4, by the way, and I'm seeing a serious 25% performance degradation (again with absolutely no updates per frame). I also can't put one of my structs inside the block; it gives some stupid errors. I think it ignores some variables and then ends up with different blocks pretending to be the same one, so I had to put my light properties (PS only) in standard glUniform calls. On the Nexus 4 the offset alignment is 4 bytes, btw, pretty small in comparison.
