OpenGL [SOLVED] Uniform buffer actually viable?

Hello.

I've just implemented a pretty nice batch renderer, but I'm struggling with uniform buffers. I have a great system which only maps a single buffer (built on a lot of experience with doing fast buffer mapping), and I place all my uniform block data in a single uniform buffer. I was assuming that this would be the fastest way to handle uniform changes, but it turns out that uniform buffers have some fatal drawbacks. For example, changing a single variable in a block forces me to allocate a new one with a minimum size of 256 bytes, which is HUGE. I can barely get over 64 bytes right now, so there's a lot of wasted space which seems to inhibit performance a lot, especially for simpler shaders with few uniforms. In almost all cases I change some kind of uniform value between draw calls, and in many cases I end up

I had this idea that I would split up uniforms into different blocks so that I only had to update the ones that change (view+projection matrices in one block, materials in one block, etc), but as it is now the winning move is to just pack everything into one block so that I don't waste that much space and reupload stuff that has changed to avoid the risk of having to update two smaller blocks with even more padding. It's getting to a point where I think it would be faster to just build a list of glUniform**() calls to do instead of bothering with uniform buffers.

Are uniform buffers just nonviable for real-life usage? Can I work around the offset alignment problem to reduce the padding? Is glUniform() simply superior in most cases and on most drivers?

EDIT: After googling a bit, I want to clarify that my buffer handling is very effective. I place all my uniform data in a single mapped buffer (persistently coherently mapped if possible, otherwise cycling unsynchronized), so there's only a single map operation done per frame. The problem is that the data uploading is simply really slow when the padding is added (can't batch upload it), and the buffers get really big. I think I'm gonna implement some hacky glUniform() calls to compare performance.

Also, this is OpenGL for PC, tested on an Nvidia card.

Edited by theagentd

Share on other sites
EDIT: After googling a bit, I want to clarify that my buffer handling is very effective. I place all my uniform data in a single mapped buffer (persistently coherently mapped if possible, otherwise cycling unsynchronized), so there's only a single map operation done per frame.

What do you mean by mapped once per frame? Persistent mapping is meant to map only once per lifetime of the buffer.

And you should not use mapping at all if you can't use persistent mapping (aka GL_ARB_buffer_storage).
Even when you map unsynchronized, the driver thread still has to sync with your application thread, and that can hurt performance badly.

Share on other sites

I can barely get over 64 bytes right now

Are you sure you are calculating that correctly?

64 bytes is a single 4x4 matrix...

Share on other sites
I'm not totally sure what the problem is- buffer updates should be a simple matter of dispatching a memcpy to the right place before draw. You complain about sizes, but have you run the numbers? 1k of uniform data times 4096 draw calls in a frame is 4 MB of uniform data, times triple buffer is 12 MB for all your uniforms. That's trivial, and I am guessing your scene is nowhere near that massive.
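To make that arithmetic easy to redo with your own numbers, here is a minimal sketch in C. The function name `uniform_budget_bytes` is made up for illustration, and it assumes every draw call gets its own slice of the buffer and that the whole region is replicated once per frame in flight.

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes of uniform storage when every draw call gets its own slice
 * and the whole region is replicated once per frame in flight.
 * Hypothetical helper, not part of any OpenGL API. */
size_t uniform_budget_bytes(size_t bytes_per_draw,
                            size_t draws_per_frame,
                            size_t frames_in_flight)
{
    assert(frames_in_flight > 0);
    return bytes_per_draw * draws_per_frame * frames_in_flight;
}
```

With 1024 bytes per draw, 4096 draws and 3 frames in flight, this gives the 12 MB figure quoted above.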

Share on other sites

EDIT: After googling a bit, I want to clarify that my buffer handling is very effective. I place all my uniform data in a single mapped buffer (persistently coherently mapped if possible, otherwise cycling unsynchronized), so there's only a single map operation done per frame.

What do you mean by mapped once per frame? Persistent mapping is meant to map only once per lifetime of the buffer.

And you should not use mapping at all if you can't use persistent mapping (aka GL_ARB_buffer_storage).
Even when you map unsynchronized, the driver thread still has to sync with your application thread, and that can hurt performance badly.

Either I'm mapping it unsynchronized once per frame (a single huge buffer for everything) or if persistent buffers are supported I map it once and reuse it forever. My data upload code is not the problem. The problem is the extreme amount of padding I end up with.

I can barely get over 64 bytes right now

Are you sure you are calculating that correctly?

64 bytes is a single 4x4 matrix...

Yes, especially if I want to split them up into multiple blocks to maximize reuse. I usually only have a few uniform variables I need to change each frame. Let's say I upload a view and a projection matrix to one block once, then I still need a 256 byte block just to fit in a few bytes of data (material data, uniform particle parameters, etc). There's no way I'll be even getting close to 256 bytes in 95% of all cases.

I really feel like I'm missing something here. In many cases I just want to change textures and a single or a couple of vec4()s of uniform data and I have to allocate an entirely new 256 byte block?

Share on other sites

The problem is the extreme amount of padding I end up with

Let's say I upload a view and a projection matrix to one block once, then I still need a 256 byte block just to fit in a few bytes of data

Don't allocate a block for each object. Pack all objects into one big block.

Edited by swiftcoder

Share on other sites

The problem is the extreme amount of padding I end up with

I have to? The offset passed into glBindBufferRange() has to be a multiple of 256 on my card, so any block smaller than 256 bytes (i.e. all of them) has to be padded to that. Packing in all uniforms regardless of update rate would defeat the purpose of being able to update different blocks at different rates.
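For reference, the padding being described can be sketched in C as below. The helper names `align_up` and `padding_after` are hypothetical. In a real renderer the alignment value would come from `glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...)`; here it is passed in as a plain number so the snippet needs no GL context.

```c
#include <assert.h>
#include <stddef.h>

/* Round `offset` up to the next multiple of `alignment`.
 * Works for any non-zero alignment, not only powers of two. */
size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment > 0);
    return (offset + alignment - 1) / alignment * alignment;
}

/* Bytes wasted after a block of `block_size` bytes when the next block
 * has to start on an `alignment` boundary. */
size_t padding_after(size_t block_size, size_t alignment)
{
    return align_up(block_size, alignment) - block_size;
}
```

For a 64-byte block and a 256-byte alignment, `padding_after(64, 256)` is 192, so three quarters of each bound range would be padding.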

Let's say I upload a view and a projection matrix to one block once, then I still need a 256 byte block just to fit in a few bytes of data

Don't allocate a block for each object. Pack all objects into one big block.

I already use instancing for objects with identical materials and vertex data, which is quite common. Each object type currently requires a 256-byte block. Are you saying that I should pack data for multiple object types into a single block? How do I do that? BTW, I'm limited to OpenGL 3.3.

Edited by theagentd

Share on other sites
Padding should only really be a memory usage issue, right? How does it interact with your code that fills the buffer (and get in the way of performance)? How many milliseconds are you spending on that UBO update?

Share on other sites

Here: http://www.gamedev.net/topic/655969-speed-gluniform-vs-uniform-buffer-objects/ I ended up implementing the idea I had at that time.

There was also a discussion with Mathias that I can't seem to find; he explained the instanceId in more detail.

Anyway, say that we have some per instance data. Like mv and mvp matrices:

That's a single struct:

struct Transform
{
    mat4 mvp;
    mat4 mv;
    // Then some padding to respect std140 if necessary. 12 bytes / vec3 at most.
};


Now, that's 128 bytes per struct, right? If you wanted to place them sequentially in a buffer and bind the range for each Transform struct, yeah, you'd need to place 128 bytes, then pad, for every Transform instance.

Let's say we have our typical 64kB UBO and we define it like this:

// Note: the binding qualifier needs GLSL 4.20 or ARB_shading_language_420pack.
layout (std140, binding = TRANSFORM_SLOT) uniform TransformsBlock
{
    Transform[MAX_TRANSFORMS] transforms;
};


Where MAX_TRANSFORMS is the maximum UBO size divided by the size of one Transform instance. Given our 64kB UBO, that'd be 512 instances, tightly packed.
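As a rough sketch of that calculation on the CPU side, in C: the `Transform` typedef below is an assumed mirror of the GLSL struct (two 4x4 float matrices, which is already a multiple of 16 bytes so std140 adds no extra stride), and `max_array_elements` is a made-up helper. The real block size limit would be queried with `glGetIntegerv(GL_MAX_UNIFORM_BLOCK_SIZE, ...)`; here it is just a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed CPU-side mirror of the GLSL Transform struct:
 * two 4x4 float matrices, 64 bytes each. */
typedef struct Transform {
    float mvp[16];
    float mv[16];
} Transform;

/* How many tightly packed elements of `element_bytes` fit into a uniform
 * block of at most `max_block_bytes`. Hypothetical helper. */
size_t max_array_elements(size_t max_block_bytes, size_t element_bytes)
{
    assert(element_bytes > 0);
    return max_block_bytes / element_bytes;
}
```

Calling `max_array_elements(64 * 1024, sizeof(Transform))` gives the 512 mentioned above, and the result is what you would feed into the shader as MAX_TRANSFORMS.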

Now the issue here is that while now you don't need to pad, since you're binding a lot of transforms at the same time, you need to index into the array to get the proper one for whatever you're drawing. There are many ways of providing an instance index, like with an additional attribute, with a normal uniform, with the instance ID, with a combination of instance ID and a vertex attribute, etc. I think Mathias talked about the instance indexing in the Vulkan thread, can't remember.

Anyway, once you've got the per-instance index uploaded, it's as straightforward as:

mat4 mvp = transforms[instanceId].mvp;


There, now you just need to bind the whole range to that TRANSFORM_SLOT, no more padding in between instance data.

Also, keep in mind that you shouldn't put everything into a single buffer. Separate the data into "globals" (stuff that never changes), per-frame parameters, and per-instance parameters, and choose an appropriate update strategy for each. Mapping a global or per-frame buffer for a single tiny update is probably a waste; glBufferSubData would suffice.

The strategy I use right now is to have a sort of ring buffer of a couple MB. I compute the maximum amount of instances I can upload in a single pass, given the kind of buffers that pass needs (say, TransformBlock and MaterialBlock).

Say that the max is 512. First I bind the ring buffer. Then iterate over the transform data, upload those 512 instances, then bind that range to the transform slot. Then iterate over the material data, upload 512 instances, then bind that range to the material slot. Then draw those 512 instances. Rinse and repeat for the rest of the draw tasks. The only padding I have is in between the kind of block I'm updating. Each block itself has its internal array of structs tightly packed.

Since it's a ring buffer, I just upload to the next available range until it wraps around and starts again. By the time it wraps around, that data will be quite a few frames old if you give it a couple of megabytes.
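The offset bookkeeping for such a ring buffer could look roughly like the C sketch below. `RingAllocator` and `ring_alloc` are illustrative names, not code from my renderer, and the sketch deliberately does no GPU synchronization: it simply assumes the buffer is large enough that wrapped-over data is no longer in use.

```c
#include <assert.h>
#include <stddef.h>

typedef struct RingAllocator {
    size_t capacity;   /* total size of the buffer in bytes */
    size_t head;       /* first byte that has not been handed out yet */
    size_t alignment;  /* required alignment of every returned offset */
} RingAllocator;

/* Reserve `size` bytes and return the aligned offset of the reservation.
 * If the request does not fit in the remaining space, start over at 0.
 * Assumption: the GPU has finished with whatever was stored there. */
size_t ring_alloc(RingAllocator *ring, size_t size)
{
    assert(ring->alignment > 0 && size <= ring->capacity);
    size_t offset = (ring->head + ring->alignment - 1)
                    / ring->alignment * ring->alignment;
    if (offset + size > ring->capacity) {
        offset = 0;  /* wrap around to the start of the buffer */
    }
    ring->head = offset + size;
    return offset;
}
```

The returned offset is what would go into glBindBufferRange() for that block. Only the start of each uploaded range gets aligned, so the array of structs inside a range stays tightly packed.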

That means that I can draw 10 thousand different things with 20 updates, 20 bind ranges, and only one buffer binding (drawcall count is a different matter, ideally you can also draw each instance batch with a single draw).
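Those numbers follow from splitting the instances into chunks of at most 512. A small C sketch of the split, with hypothetical names (`Batch`, `batch_count`, `batch_range`) and no GL calls:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Batch {
    size_t first;  /* index of the first instance in this batch */
    size_t count;  /* number of instances in this batch */
} Batch;

/* Number of batches needed to cover `total` instances when at most
 * `max_per_batch` fit into one bound range. */
size_t batch_count(size_t total, size_t max_per_batch)
{
    assert(max_per_batch > 0);
    return (total + max_per_batch - 1) / max_per_batch;
}

/* Instance range covered by batch number `index` (0-based). */
Batch batch_range(size_t total, size_t max_per_batch, size_t index)
{
    assert(index < batch_count(total, max_per_batch));
    Batch b;
    b.first = index * max_per_batch;
    b.count = total - b.first < max_per_batch ? total - b.first
                                              : max_per_batch;
    return b;
}
```

For 10,000 instances and 512 per batch, `batch_count` returns 20, and the last batch covers the remaining 272 instances. Each batch corresponds to one upload and one bind range per block type.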

You can get smarter and pack UBOs in a way to reduce the calls even further via passing reduced forms of the matrices, packing different kinds of data into the same struct (ie, instead of having separate slots for transforms and material, just put them in the same struct), uploading all of the instance data in one step, and then just do a loop of bindRange-draw for all of them, and so on. That way you can handle batches of thousands with a dozen calls tops.

Share on other sites

Padding should only really be a memory usage issue, right? How does it interact with your code that fills the buffer (and get in the way of performance)? How many milliseconds are you spending on that UBO update?

That's exactly what I thought. What are your timing results? Even with what you are experiencing (I still can't see how), I don't think packing semantics would come into play here, as that would affect shader load operations and not buffer updates.
