agleed

Batching draws - keeping buffers up to date


Recommended Posts

I'm currently trying to optimize a WebGL renderer. 

 

I'm currently doing "brute force" batching in our renderer - every frame, we dynamically generate a list of stuff that can be batched together. When you draw draw, say, 10 sprites with texture A, then 1 of something else, than another 10 sprites with texture B, and we will draw the first 10 sprites, the 1 object of something else, and the other 10 sprites each in one draw call.

 

We currently do this by generating - every frame, for every batch - buffers that hold all the vertex data of the batched objects as well as stuff like matrix transforms (currently we actually replicate all matrix data for each vertex... will investigate an object ID stream and a matrix texture soon), and using those buffers to draw all of it. Each batch has one associated set of buffers used for drawing. It's a really simplistic approach, but it was already a HUGE win in terms of performance. Keep in mind we were already trying to minimize state changes.
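
A rough sketch of what that per-frame generation boils down to (the sprite fields like corners, uvs and worldMatrix are hypothetical, just to show the layout):

var FLOATS_PER_VERTEX = 2 + 2 + 9;   // position, uv, replicated 3x3 transform
var VERTS_PER_SPRITE  = 6;           // two triangles per quad

function buildBatch(gl, sprites, batchBuffer) {
  var data = new Float32Array(sprites.length * VERTS_PER_SPRITE * FLOATS_PER_VERTEX);
  var o = 0;
  for (var i = 0; i < sprites.length; i++) {
    var s = sprites[i];
    for (var v = 0; v < VERTS_PER_SPRITE; v++) {
      data[o++] = s.corners[v * 2];      // x
      data[o++] = s.corners[v * 2 + 1];  // y
      data[o++] = s.uvs[v * 2];          // u
      data[o++] = s.uvs[v * 2 + 1];      // v
      data.set(s.worldMatrix, o);        // 3x3 transform, replicated per vertex
      o += 9;
    }
  }
  gl.bindBuffer(gl.ARRAY_BUFFER, batchBuffer);
  gl.bufferData(gl.ARRAY_BUFFER, data, gl.DYNAMIC_DRAW); // full re-upload, every frame
  return sprites.length * VERTS_PER_SPRITE;              // vertex count for the draw call
}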

 

But now one of the bottlenecks (we don't really have a single huge one anymore; it's more like several distinct parts that each take up 10-20% of the frame time or so) is the generation of these draw batching lists and their vertex buffers. So I would like to optimize it.

 

The biggest opportunity I see for increasing performance is somehow exploiting the temporal coherence of these batching lists and their buffers. When you look at that first list of 10 sprites over time, you will find that each frame there's a decent chance that one or two new sprites will be added, or some will be removed. Every other frame one of the sprites might change its vertex colors or texture coordinates. It's also very likely that the transformation data for a lot of those sprites changes. But overall, most of the data stays the same.

 

Are there algorithms for keeping the number of updates as small as possible, and the related work of detecting what changed as small as possible? There seem to be a lot of factors that complicate it... for example:

 

  • About 85% of the objects in our game are completely solid (or at least have only fully opaque or fully transparent pixels), and about 15% use more complicated transparency. The solid ones can be drawn in any order, so the updating of their 'batch lists' and buffers can and should probably be handled differently from the transparent stuff, right?
  • How then do we detect similar batch lists and detect what exactly changed in the data of the respective objects? Say in frame 1 we have a batch that contains sprites A,B,C,D,E,F, all solid sprites, and in the next frame the list looks like F,E,D,C,B,A because the 'order of arrival' has changed somehow. In terms of rendering, we actually wouldn't need to change anything (maybe just update the transformation matrices because the depth of the sprites has changed); you could still draw everything with the data from frame 1 and it would look correct (this is not true for sprites that require transparency blending, of course).
  • What about the case when the batch remains mostly the same but something new is introduced or removed? E.g. A,B,C,X,D,E,F or A,B,D,E,F? It seems like in those cases you could just add X to the end of the buffer, or in the second case degenerate the triangles of the missing C, but you would have to detect somehow (quickly, such that the overhead from doing this isn't slower than just updating the entire buffer) that you can do that.
  • In the naive approach, we have one client side array of all the data that is going to be uploaded, and then one big gl.bufferSubData call. With these optimizations this could either turn into multiple bufferSubData calls (quickly becomes bad, I think), or still one big one but such that the buffer you use to upload with at least stays mostly the same. What to do here?
  • What about when you want to do stuff like explicit double or triple buffering of buffers, to avoid synchs with the driver?
  • What to do differently about objects that require in-order rendering?

 

edit: I should note that all rendered objects are in a parent-child hierarchy, and removing something from being drawn either means setting it invisible or removing it from its parent. 

Edited by agleed


I really like this question (or set of questions) as it presents a large number of the issues developers in your position face.

From my experience it sounds like you are already on the right track: you identified a large bottleneck and improved it.

 

What you need to avoid is trying to catch/handle every situation in the most efficient way - this can often end up worse! I recommend an approach that doesn't try to solve all of the issues, just the main ones. Often enough the performance of such a solution is plenty good enough for most applications.

 

If you are optimising your engine for this game and this game alone, things might be different, but suppose you want to use the engine for a future game, one that has more transparent objects, or transparent objects completely mixed in with opaque ones all over the place? Then a seemingly good solution might end up being a bad one.

 

To avoid over-complexity, you could start with very simple dirty flags for things like whether the children changed or not. Rather than trying to work out which exact child was removed or added, it might be easier just to know IF one was added or removed and respond accordingly - hope that makes some kind of sense.
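
As a purely hypothetical sketch of what that could look like on the scene-graph nodes:

// Coarse dirty flags: addChild/removeChild/setVisible just mark the chain of
// parents; at render time a batch is rebuilt only if any node feeding it was marked.
function markChildrenDirty(node) {
  while (node && !node.childrenDirty) { // stop once an ancestor is already marked
    node.childrenDirty = true;
    node = node.parent;
  }
}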

 

You will have to make those calls though based on your needs.

 

That said, some kind of double buffer might work well for handling changes (or the lack thereof), i.e. each frame you swap buffers, copy large chunks from the last one where nothing has changed, and just write in the new data otherwise.

Just remember that, because of cache misses, blasting through more linear data can often be faster than jumping around all over the place trying to avoid unnecessary work.
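
A rough sketch of that double-buffering idea, assuming each batch keeps two client-side arrays and a list of sprites that changed this frame (all hypothetical names):

function updateBatchArray(batch) {
  var src = batch.arrays[batch.frame & 1];
  var dst = batch.arrays[(batch.frame + 1) & 1];
  dst.set(src);                              // bulk copy of last frame's data
  for (var i = 0; i < batch.dirtySprites.length; i++) {
    var s = batch.dirtySprites[i];
    dst.set(s.vertexData, s.offsetInBatch);  // overwrite only what changed
  }
  batch.dirtySprites.length = 0;
  batch.frame++;
  return dst;                                // this array gets uploaded
}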

 

Look forward to seeing other replies in here :)

b


For this kind of use case, buffer object streaming is the typical technique; it has been tried and trusted for over 15 years in Direct3D land (OpenGL didn't gain the ability to do a similar streaming model until much more recently).

 

The thinking is that, because whether or not a range of the buffer is still required by a pending draw call is such an important factor for performance, it's often faster to just replace everything than it is to replace small subranges. This can seem counter-intuitive (writing more data is faster?) but this approach allows the CPU and GPU to continue operating asynchronously, which is a net win.

 

A brief outline of the technique:

 

Beginning at location 0 in the buffer, you append incoming data to the buffer, building up a batch as you go.  If state needs to change you flush the current batch (i.e. a draw call) then keep appending.

 

Eventually you run out of space in the buffer: the next set of incoming data won't fit.  At that stage you invalidate/orphan/discard the buffer (you'll see all 3 terms used depending on which sources you read, but it's just terminology and the end result is the same) - using glBufferData (..., NULL, ...) or GL_MAP_INVALIDATE_BUFFER_BIT - and what the driver will do is keep the existing buffer storage alive for pending draw calls, but allocate a new block of storage for future writing.  Reset to location 0 and begin again.  Again, this can seem counter-intuitive - allocating new storage at runtime surely must be slower - but remember that this has been a common pattern for over 15 years.  Drivers are optimized around it.  So what will actually happen is that after a few frames things will settle down and, instead of making new allocations, the driver will have its own internal pool of 3 or so buffers that it just automatically cycles through.  New allocations will stop, and the driver will do its own multi-buffering for you.
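
In WebGL terms there is no buffer mapping, so here is a minimal sketch of the same pattern approximated with bufferData/bufferSubData only (the buffer size and helper names are made up):

// A streaming vertex buffer: append batches until full, then orphan and restart.
var STREAM_BYTES = 4 * 1024 * 1024;

function createStream(gl) {
  var buf = gl.createBuffer();
  gl.bindBuffer(gl.ARRAY_BUFFER, buf);
  gl.bufferData(gl.ARRAY_BUFFER, STREAM_BYTES, gl.DYNAMIC_DRAW); // allocate storage only
  return { buffer: buf, offset: 0 };
}

// Appends one batch worth of vertex data; returns the byte offset to draw from.
function streamAppend(gl, stream, data) {
  gl.bindBuffer(gl.ARRAY_BUFFER, stream.buffer);
  if (stream.offset + data.byteLength > STREAM_BYTES) {
    // Out of space: orphan the storage. The driver keeps the old allocation
    // alive for pending draws and gives us a fresh block to write into.
    gl.bufferData(gl.ARRAY_BUFFER, STREAM_BYTES, gl.DYNAMIC_DRAW);
    stream.offset = 0;
  }
  var offset = stream.offset;
  gl.bufferSubData(gl.ARRAY_BUFFER, offset, data);
  stream.offset += data.byteLength;
  return offset; // feed into vertexAttribPointer(...) before issuing the draw call
}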

 

A further writeup (by one of the authors of the GL_ARB_map_buffer_range extension) is available here: https://www.opengl.org/discussion_boards/showthread.php/170118-VBOs-strangely-slow?p=1197780#post1197780

Edited by mhagain


I kinda got lost in your post there buddy. So forgive me if this is not quite touching it.

Check this link out here...

 

http://realtimecollisiondetection.net/blog/?p=86

The general idea is to use a radix sort to sort your draw calls and to batch and group as much together as possible. It works pretty fast for larger groups of data, and will actually beat quicksort if the data set is large enough. But the real reason you want to use it is that it actually helps you with state sorting.
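
A small sketch of the idea, roughly along the lines of that article - the key layout (which bits hold translucency, shader, texture, depth) is made up and would be tuned to your renderer:

// Pack render state into a 32-bit sort key: translucency in the top bit,
// then shader, texture and quantized depth.
function makeSortKey(sprite) {
  return (((sprite.translucent ? 1 : 0) << 31) |
          (sprite.shaderId  << 24) |   // up to 128 shaders
          (sprite.textureId << 12) |   // up to 4096 textures
          (sprite.depthBits & 0xFFF)) >>> 0;
}

// One stable 8-bit counting-sort pass; run with shift = 0, 8, 16, 24 to sort the
// full keys, then walk the sorted list and merge adjacent draws that share the
// same texture/shader into one batch.
function radixPass(items, keys, shift) {
  var counts = new Uint32Array(257);
  for (var i = 0; i < items.length; i++) counts[((keys[i] >>> shift) & 0xFF) + 1]++;
  for (var d = 1; d < 257; d++) counts[d] += counts[d - 1];
  var outItems = new Array(items.length), outKeys = new Uint32Array(keys.length);
  for (var j = 0; j < items.length; j++) {
    var dst = counts[(keys[j] >>> shift) & 0xFF]++;
    outItems[dst] = items[j];
    outKeys[dst] = keys[j];
  }
  return { items: outItems, keys: outKeys };
}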


Thank you for the replies.

 

A bit more detail on my problem: my problem isn't really how to detect what to batch and how to batch it. I'm aware of baking the required render state into IDs and sorting based on that. My problem is rather that even if you do that, the deltas between frames - in terms of what data actually needs updating - are very small, yet if you manage the data naively (just upload all the data required for every batch anew every frame, like we do now), you end up preparing (in terms of CPU work, which is actually quite a fair amount since we're running JavaScript, which is pretty slow) and transferring a ton of data where 90% of it is in principle unchanged on a frame-to-frame basis.

 

It is most likely true that, just in terms of the uploaded amount, uploading the entire buffer again is probably faster (an easy job for the driver and PCIe bus) than trying to split it up into lots of small bufferSubDatas to minimize the amount sent. But some of these updates would probably be such that I would still have a single bufferSubData call, just a lot smaller than one updating the entire buffer? Then again, this might not matter at all if the orphaning trick just makes it THAT much faster...

 

But also, you could potentially skip most (or at least a lot) of the CPU work involved in putting the various data of many different objects into a single array for uploading (actually, on desktop GL this isn't really necessary thanks to mapped buffers, but WebGL doesn't have those, unfortunately...). I'm not sure, maybe this sounds like a trivial amount of work, but it's actually about 10% of our entire rendering workload right now on desktop PCs.

 

I wondered and hoped that there would be some kind of algorithm or data structure that helps with this problem... you have a "too large" data set and need to send it somewhere, so you only send deltas, but you need to find what changed and how it changed in an efficient and quick manner...

Edited by agleed


The problem you have is that even if only 10% of your data actually changes, it's not guaranteed that this data is in the same area in memory. You could have 100k sprites of which 10k are dynamic, but those 10k are scattered around the buffer, so you would actually require 10k bufferSubData calls to update it (which is pretty bad compared to just one update that discards the entire last buffer).

 

But if you have static vertex data that can be identified or marked as static, then you can put it into a separate buffer that isn't updated on a per-frame basis, only when needed. You then have a separate buffer that is refilled each frame, and you draw from both of them.
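
A minimal sketch of that split (packSprites and the buffer/flag names are hypothetical):

function uploadBuffers(gl, buffers, staticSprites, dynamicSprites, staticDirty) {
  // Static geometry: only re-uploaded when something static actually changed.
  if (staticDirty) {
    gl.bindBuffer(gl.ARRAY_BUFFER, buffers.staticVbo);
    gl.bufferData(gl.ARRAY_BUFFER, packSprites(staticSprites), gl.STATIC_DRAW);
  }
  // Dynamic geometry: refilled every frame.
  gl.bindBuffer(gl.ARRAY_BUFFER, buffers.dynamicVbo);
  gl.bufferData(gl.ARRAY_BUFFER, packSprites(dynamicSprites), gl.DYNAMIC_DRAW);
  // Draw calls are then issued from both buffers:
  //   bind staticVbo  -> set attribute pointers -> draw
  //   bind dynamicVbo -> set attribute pointers -> draw
}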

Edited by Juliean

While Tangletail's link is an excellent read, I'm afraid your problem is that you're using WebGL and JavaScript.
There are a lot of optimizations that are not possible in WebGL, and what you're doing has shifted the work from draw call overhead to filling buffers on the CPU using JavaScript.
Even if you got C or asm speed for filling those buffers, you're still going to hit RAM or bus bandwidth problems that are avoidable in modern APIs.

The best thing I can recommend is to set up geometry-instanced buffers (aka software or manual instancing) and use the vertex count (or index count) to control how many instances you render (i.e. indexCount * numInstances), passing the individual transform data in a uniform array.
Another solution is to use the ANGLE_instanced_arrays extension when available.
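
For the extension path, a minimal sketch (quadBuffer, instanceBuffer and the attribute locations are assumed to exist; the per-instance layout is made up):

var ext = gl.getExtension('ANGLE_instanced_arrays');
if (ext) {
  // Shared unit-quad geometry, 6 vertices, used by every instance.
  gl.bindBuffer(gl.ARRAY_BUFFER, quadBuffer);
  gl.vertexAttribPointer(aCorner, 2, gl.FLOAT, false, 0, 0);
  gl.enableVertexAttribArray(aCorner);

  // One vec4 per sprite (e.g. x, y, scale, rotation), rebuilt each frame.
  gl.bindBuffer(gl.ARRAY_BUFFER, instanceBuffer);
  gl.vertexAttribPointer(aInstanceData, 4, gl.FLOAT, false, 0, 0);
  gl.enableVertexAttribArray(aInstanceData);
  ext.vertexAttribDivisorANGLE(aInstanceData, 1); // advance once per instance, not per vertex

  ext.drawArraysInstancedANGLE(gl.TRIANGLES, 0, 6, numSprites);
}

Without the extension, the manual variant replicates the quad geometry N times in the vertex buffer with a per-vertex instance index and looks the transform up in a uniform array, as described above.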


The problem you have is that even if only 10% of your data actually changes, it's not guaranteed that this data is in the same area in memory. You could have 100k sprites of which 10k are dynamic, but those 10k are scattered around the buffer, so you would actually require 10k bufferSubData calls to update it (which is pretty bad compared to just one update that discards the entire last buffer).

 

But if you have static vertex data that can be identified or marked as static, then you can put it into a separate buffer that isn't updated on a per-frame basis, only when needed. You then have a separate buffer that is refilled each frame, and you draw from both of them.

 

I know what you mean, but maybe to update those 10k sprites, you can find an optimal set of 3 bufferSubData calls that covers various ranges of changed and unchanged data inside the buffer, but in such a manner that the overall work done for transfers AND CPU work for preparing the transfers is much lower, if that makes sense?  :D Maybe it's a pipe dream and this is actually theoretically impossible to do in a fast manner... but I hope not!

 

 

While Tangletail's link is an excellent read, I'm afraid your problem is that you're using WebGL and JavaScript.
There are a lot of optimizations that are not possible in WebGL, and what you're doing has shifted the work from draw call overhead to filling buffers on the CPU using JavaScript.

 

True, so now I am trying to find a way to reduce the amount of work I actually need to do, as opposed to optimizing how I do it (although there's probably still quite some room left to optimize there).

 

 

The best thing I can recommend is to set up geometry-instanced buffers (aka software or manual instancing) and use the vertex count (or index count) to control how many instances you render (i.e. indexCount * numInstances), passing the individual transform data in a uniform array.
Another solution is to use the ANGLE_instanced_arrays extension when available.

 

Instancing is probably a really good idea to look into for particle systems and whatnot (also, ANGLE_instanced_arrays is really widespread), but most of our drawn sprites actually have distinct texture coordinate data - wouldn't that prevent the use of instancing? Unless I use another manual vertex or instance ID stream and get the texture coordinate data from somewhere else...

Edited by agleed

I know what you mean, but maybe to update those 10k sprites, you can find an optimal set of 3 bufferSubData calls that covers various ranges of changed and unchanged data inside the buffer, but in such a manner that the overall work done for transfers AND CPU work for preparing the transfers is much lower, if that makes sense?  :D Maybe it's a pipe dream and this is actually theoretically impossible to do in a fast manner... but I hope not!

 

If those 10k sprites were scattered all over the place, what you want is impossible (literally). If they are batched together in, say, 3 places, then you could use an algorithm that merges changes into groups of bufferSubData calls (which is really simple: loop over all changed sprites in order; as long as they are adjacent you keep batching, and as soon as there is a hole you issue a bufferSubData call). Or alternatively just use a separate buffer, or place those dynamic sprites at the end of the regular buffer. Why find some fancy algorithm when you can do the work with ease at an earlier stage? :)
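
A sketch of that range-merging approach (the stride and the clientData mirror of the buffer are hypothetical):

var BYTES_PER_SPRITE = 6 * 13 * 4; // whatever your per-sprite vertex footprint is

function uploadDirtyRanges(gl, buffer, clientData, dirtyIndices) {
  dirtyIndices.sort(function (a, b) { return a - b; }); // buffer order
  gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
  var i = 0;
  while (i < dirtyIndices.length) {
    var first = dirtyIndices[i];
    var last = first;
    // Extend the run while the next dirty sprite is directly adjacent.
    while (i + 1 < dirtyIndices.length && dirtyIndices[i + 1] === last + 1) {
      last = dirtyIndices[++i];
    }
    var byteOffset = first * BYTES_PER_SPRITE;
    var floatOffset = byteOffset / 4;
    var floatCount = ((last - first + 1) * BYTES_PER_SPRITE) / 4;
    gl.bufferSubData(gl.ARRAY_BUFFER, byteOffset,
                     clientData.subarray(floatOffset, floatOffset + floatCount));
    i++;
  }
}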


 

I know what you mean, but maybe to update those 10k sprites, you can find an optimal set of 3 bufferSubData calls that covers various ranges of changed and unchanged data inside the buffer, but in such a manner that the overall work done for transfers AND CPU work for preparing the transfers is much lower, if that makes sense?  :D Maybe it's a pipe dream and this is actually theoretically impossible to do in a fast manner... but I hope not!

 

If those 10k sprites were scattered all over the place, what you want is impossible (literally). If they are batched together in, say, 3 places, then you could use an algorithm that merges changes into groups of bufferSubData calls (which is really simple: loop over all changed sprites in order; as long as they are adjacent you keep batching, and as soon as there is a hole you issue a bufferSubData call). Or alternatively just use a separate buffer, or place those dynamic sprites at the end of the regular buffer. Why find some fancy algorithm when you can do the work with ease at an earlier stage? :)

 

 

Not so sure about it being impossible... surely it's inefficient if we have to scan the list of all sprites every frame, but if we only track changes, maybe we could construct a list of optimal transfer ranges? It would weird me out somewhat if there was no algorithm for this - it almost sounds like one of those coding interview questions! Okay, with 10k out of 100k sprites this might be hard to do, but what if it's more like... 1k out of 100k? On the other hand, we probably never have frames where less than 1% of the sprites, text instances and whatnot change.

Edited by agleed

