Drawing lots of 2D boxes; should I batch the draw call, and what's the best way to do that?

Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

12 replies to this topic

#1Cornstalks  Crossbones+   -  Reputation: 7007

Posted 07 November 2012 - 08:17 AM

When batching draw calls, how does one handle different transform matrices (position/rotation) for each object? Does each object's transformation matrix have to be applied before the batch draw call?

Specifically, I'm using OpenGL ES 2.0 to draw lots of 2D boxes. I can draw each one individually just fine, but seeing as I'm targeting mobile platforms, I'm looking to try to squeeze as much performance out of this as I can. The boxes aren't textured; mostly I'm interested in minimizing draw calls to save CPU time. The more CPU time I save, the larger my physics simulation can be.

• If I draw a lot of boxes and want to batch the draw calls together into one call, does that require that I apply the transformation matrices of each object to its vertices before copying all the objects into a single buffer for drawing?
• When batches are drawn, how does one differentiate between the objects being drawn (if that's even possible)? Perhaps my understanding of batching is wrong, but as I currently understand it, you can't: you take the vertices of all the objects to draw and copy them into one buffer, so it just looks like a whole bunch of vertices. This would also require using GL_TRIANGLES instead of GL_TRIANGLE_STRIP (unless degenerate triangles were inserted), so that two different objects aren't connected by a stray triangle.
• If you had a bunch of 2D boxes to draw, each with its own position, rotation, and scaling, but all of the same color (and no textures), how would you (personally) draw them? As a follow up, what if each had its own color; does that change things?
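For the GL_TRIANGLE_STRIP case in the second bullet, here is a minimal sketch of how separate quads could be stitched into one strip with degenerate triangles. The `Vec2` type and the `appendQuadStrip` and `stitchedStripVertexCount` names are made up for illustration, and each quad is assumed to arrive as 4 vertices already in strip order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };  // hypothetical 2D vertex type

// Appends one 4-vertex quad (in strip order) to `out`. If `out` already holds
// a strip, two extra vertices bridge the gap so that the join only produces
// zero-area (degenerate) triangles, which the GPU does not rasterize.
void appendQuadStrip(std::vector<Vec2>& out, const Vec2 (&quad)[4]) {
    // An even vertex count keeps the winding order of the new quad unchanged.
    assert(out.size() % 2 == 0);
    if (!out.empty()) {
        const Vec2 last = out.back();
        out.push_back(last);     // repeat last vertex of the previous quad
        out.push_back(quad[0]);  // repeat first vertex of the new quad
    }
    for (const Vec2& v : quad) {
        out.push_back(v);
    }
}

// Number of vertices in a stitched strip holding `boxes` quads:
// 4 for the first quad, then 6 (2 bridge + 4 real) for each further quad.
std::size_t stitchedStripVertexCount(std::size_t boxes) {
    return boxes == 0 ? 0 : 6 * boxes - 2;
}
```

With this layout N boxes take 6N - 2 vertices, barely fewer than the 6N that non-indexed GL_TRIANGLES needs, so stitching strips is not a big win for quads on its own.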

I don't know a lot about drawing optimization techniques. I'm comfortable drawing things to the screen, but I'm not a fancy graphics programmer.
[ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

#2Kaptein  Prime Members   -  Reputation: 2208

Posted 07 November 2012 - 08:31 AM

your understanding is correct
you can:
1. position vertices so that you only need 1 (or 2) matrices to determine their position in space
basically, if you have no choice but to transform them each frame, you can use the animation approach:
use a dynamic VBO, and transform each vertex each frame, and render everything in one go
this is a reasonable approach in many cases

2. use several draw calls: use one matrix that you translate back and forth, draw a range of vertices at a time
glDrawArrays takes 3 parameters: mode, first, and count
so you would use the "first" parameter: start with 0, then jump to say 4, 8, 12, 16.. if you only draw 1 box at a time
this is very slow though, so if you can, draw 100 boxes at a time

i use both, since if you have a lot of vertices it may not be in your best interest to transform all of them each frame
instead using a few extra calls on groups of vertices that belong together is better
but it depends on your data
i'm sure other people can name other solutions, but as long as your boxes don't move, you should be able to do one or the other without problem
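
A minimal sketch of option 1 (transform on the CPU, then render everything in one go), assuming each box is described by a center, a precomputed sin/cos pair, and half extents. The `Box` struct and `buildBatch` function are hypothetical names for illustration, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-box state (assumption: rotation is stored as sin/cos).
struct Box {
    float x, y;    // center position
    float s, c;    // sin and cos of the rotation angle
    float hw, hh;  // half width and half height
};

// Fills `out` with 6 transformed vertices (two triangles, x/y interleaved)
// per box, ready to be drawn as one GL_TRIANGLES batch.
void buildBatch(const std::vector<Box>& boxes, std::vector<float>& out) {
    // Unit-square corners listed as two triangles.
    static const float corner[6][2] = {
        {-1.0f, -1.0f}, {1.0f, -1.0f}, {1.0f, 1.0f},
        {-1.0f, -1.0f}, {1.0f, 1.0f}, {-1.0f, 1.0f},
    };
    out.clear();
    out.reserve(boxes.size() * 12);
    for (const Box& b : boxes) {
        for (const auto& k : corner) {
            const float lx = k[0] * b.hw;  // scale to the box size
            const float ly = k[1] * b.hh;
            out.push_back(b.x + b.c * lx - b.s * ly);  // rotate, then translate
            out.push_back(b.y + b.s * lx + b.c * ly);
        }
    }
    assert(out.size() == boxes.size() * 12);
}
```

The resulting array could then be uploaded once per frame (for example with glBufferData and GL_DYNAMIC_DRAW) and drawn with a single glDrawArrays(GL_TRIANGLES, 0, 6 * boxCount) call.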

Edited by Kaptein, 07 November 2012 - 08:33 AM.

#3BitMaster  Crossbones+   -  Reputation: 5991

Posted 07 November 2012 - 09:12 AM

I don't have a lot of experience on mobile platforms, but shouldn't the glDraw*Instanced family of calls solve your problem? You have one VBO containing the box geometry. In your shader you then have something like this
#define MAX_INSTANCES SomeReasonableValue

uniform mat4 mvp[MAX_INSTANCES];
...
outVertex = mvp[gl_InstanceID] * inVertex;

The only bottleneck (that is, the number of glDraw*Instanced calls required to draw N boxes) there will be how much space you have for uniforms. If you are limited to certain kinds of transformations (for example only translations, uniform scale and/or rotations around one axis) for the boxes, you could try to send only those parameters to the program and build an instance-specific matrix on the fly. Of course, building that matrix for every vertex might well be more costly than setting the larger uniform matrices. However, if you are limited to translations only, this could be better:
uniform mat4 mvp;
uniform vec3 translations[MAX_INSTANCES];
...
outVertex = mvp * (inVertex + vec4(translations[gl_InstanceID], 0));
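
Since the uniform array has a fixed size, N boxes would have to be split into several instanced draws. A small sketch of that bookkeeping, with the hypothetical names `Batch` and `splitIntoBatches`; `maxInstances` stands in for MAX_INSTANCES and has to fit within the limit the driver reports for GL_MAX_VERTEX_UNIFORM_VECTORS (a mat4 occupies 4 of those vectors).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One instanced draw: upload `count` per-instance uniforms starting at
// instance `first`, then issue a single draw call for them.
struct Batch {
    std::size_t first;
    std::size_t count;
};

// Splits `total` instances into consecutive batches of at most `maxInstances`.
std::vector<Batch> splitIntoBatches(std::size_t total, std::size_t maxInstances) {
    assert(maxInstances > 0);
    std::vector<Batch> batches;
    for (std::size_t first = 0; first < total; first += maxInstances) {
        batches.push_back({first, std::min(maxInstances, total - first)});
    }
    return batches;
}
```

For example, 250 boxes with room for 100 instances per call would need 3 draw calls covering 100, 100, and 50 boxes.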


Edited by BitMaster, 07 November 2012 - 09:13 AM.

#4Cornstalks  Crossbones+   -  Reputation: 7007

Posted 07 November 2012 - 09:53 AM

> your understanding is correct
> you can:
> 1. position vertices so that you only need 1 (or 2) matrices to determine their position in space
> basically, if you have no choice but to transform them each frame, you can use the animation approach:
> use a dynamic VBO, and transform each vertex each frame, and render everything in one go
> this is a reasonable approach in many cases

That's one option I'm considering.

> 2. use several draw calls: use one matrix that you translate back and forth, draw a range of vertices at a time
> glDrawArrays takes 3 parameters: mode, first, and count
> so you would use the "first" parameter: start with 0, then jump to say 4, 8, 12, 16.. if you only draw 1 box at a time
> this is very slow though, so if you can, draw 100 boxes at a time

Well, each box has its own transformation matrix because they're all independently movable, so I'm guessing this would require drawing one box at a time (using this method)?

> i'm sure other people can name other solutions, but as long as your boxes don't move, you should be able to do one or the other without problem

The boxes certainly move, as they're part of a physics simulation, which unfortunately is what makes this complicated.

> I don't have a lot of experience on mobile platforms, but shouldn't the glDraw*Instanced family of calls solve your problem?

I can certainly draw with them, but I'm trying to find ways to a) minimize the number of draw calls and b) put as much of the computation on the GPU instead of the CPU. I don't know how to draw things in batches without first applying each object's transformation matrix to all of its vertices on the CPU, and then using the transformed data in the draw call. I don't know if there's a different/better way to do this, because right now the options I'm seeing are a) make a draw call for each object and don't apply the transformations on the CPU, or b) apply the transformations on the CPU and make a batched draw call. I'm debating between the two and am interested if a third option exists.

> You have one VBO containing the box geometry. In your shader you then have something like this
>
> #define MAX_INSTANCES SomeReasonableValue
>
> uniform mat4 mvp[MAX_INSTANCES];
> ...
> outVertex = mvp[gl_InstanceID] * inVertex;


That's a neat idea, but glDraw*Instanced() drawing didn't appear until OpenGL ES 3.0, and I'm stuck with 2.0

#5Olof Hedman  Crossbones+   -  Reputation: 4152

Posted 07 November 2012 - 10:01 AM

For GLES 2.0 you have no choice but to do the transformation on the CPU and load a dynamic VBO each frame if you have dynamic objects you want to batch.

On the plus side, you have fewer calculations in your shader and can use that extra performance for making the pixels prettier (or draw more boxes before hitting the GPU limit).

Edited by Olof Hedman, 07 November 2012 - 10:10 AM.

#6BitMaster  Crossbones+   -  Reputation: 5991

Posted 07 November 2012 - 10:01 AM

What about adding N box geometries to the same VBO, just one after the other? For each geometry, add an additional integer attribute which is constant for each box (0 to N-1). Then you basically have a hand-rolled glDraw*Instanced in batches of at most N, with your integer attribute taking the role of gl_InstanceID.
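
A rough sketch of how such a buffer could be filled on the CPU. One caveat: GLSL ES 1.00 (the shading language of ES 2.0) only accepts float-based attribute types, so this sketch stores the per-box index as a float and assumes the shader converts it with int(). `InstancedVertex`, `buildInstancedQuads`, and the attribute and uniform names in the comments are all made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex layout: unit-quad corner plus the index of the box it
// belongs to. The index is a float because ES 2.0 attributes cannot be ints.
struct InstancedVertex {
    float x, y;
    float instance;
};

// Builds `n` copies of a unit quad (two triangles each), one after the other.
// This buffer is static: only the per-box uniforms change from frame to frame.
//
// Assumed shader side (GLSL ES 1.00, names are illustrative):
//   attribute vec2 inVertex;
//   attribute float inInstance;
//   uniform vec4 transforms[MAX_INSTANCES];
//   ... vec4 t = transforms[int(inInstance)];
std::vector<InstancedVertex> buildInstancedQuads(std::size_t n) {
    assert(n <= (1u << 24));  // floats represent integers exactly up to 2^24
    static const float corner[6][2] = {
        {-1.0f, -1.0f}, {1.0f, -1.0f}, {1.0f, 1.0f},
        {-1.0f, -1.0f}, {1.0f, 1.0f}, {-1.0f, 1.0f},
    };
    std::vector<InstancedVertex> v;
    v.reserve(n * 6);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < 6; ++k) {
            v.push_back({corner[k][0], corner[k][1], static_cast<float>(i)});
        }
    }
    return v;
}
```

Each frame would then only need to update the per-box uniform array before a single draw call per batch of N boxes.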

Edited by BitMaster, 07 November 2012 - 10:02 AM.

#7Olof Hedman  Crossbones+   -  Reputation: 4152

Posted 07 November 2012 - 10:08 AM

> What about adding N box geometries to the same VBO, just one after the other? For each geometry, add an additional integer attribute which is constant for each box (0 to N-1). Then you basically have a hand-rolled glDraw*Instanced in batches of at most N, with your integer attribute taking the role of gl_InstanceID.

You'd need to add that attribute to each vertex, so a lot of extra integers.
I guess you'd have to put the matrices in a texture too.
My gut says it will be slower than just transforming on the CPU, but I can't say I know.

Edited by Olof Hedman, 07 November 2012 - 10:08 AM.

#8Olof Hedman  Crossbones+   -  Reputation: 4152

Posted 07 November 2012 - 10:16 AM

Hmm, I think I disregarded the case of 2D and only trivial shading.

If so, I guess more unorthodox methods could yield results, especially if the rest of the simulation taxes the CPU.

The vertices of a 2D box are less data than a matrix though, so I still say an efficient CPU transform is probably the best.

#9BitMaster  Crossbones+   -  Reputation: 5991

Posted 07 November 2012 - 10:24 AM

Well, that will largely depend on exactly which transformations are needed. In the pure 2D case you can get away with a mat2x3 and still be completely general. For translation with rotation you can get away with a single vec3 (2D translation and angle) or maybe a vec4 (2D translation and precalculated sin(angle) and cos(angle)). I guess the 'best' solution to this problem will be extremely domain-specific, so the more ideas Cornstalks has lying around, the better.
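
To make the vec4 variant concrete, here is a sketch of the arithmetic written as plain C++ so it can be checked off-device; the same two lines would presumably go into the vertex shader. `PackedTransform`, `pack`, and `apply` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };

// The "vec4" from the post: 2D translation plus precomputed sin/cos.
struct PackedTransform {
    float tx, ty;  // translation
    float s, c;    // sin(angle), cos(angle)
};

// Packs a translation and an angle in radians into the 4-float form.
PackedTransform pack(float tx, float ty, float angle) {
    return {tx, ty, std::sin(angle), std::cos(angle)};
}

// Rotates `v` by the stored angle, then translates it.
Vec2 apply(const PackedTransform& t, Vec2 v) {
    // A valid sin/cos pair has unit length (within rounding error).
    assert(std::fabs(t.s * t.s + t.c * t.c - 1.0f) < 1e-3f);
    return {t.tx + t.c * v.x - t.s * v.y,
            t.ty + t.s * v.x + t.c * v.y};
}
```

That is 4 floats per box instead of the 16 of a mat4, at the cost of a few extra multiplies per vertex.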

Edited by BitMaster, 07 November 2012 - 10:25 AM.

#10mhagain  Crossbones+   -  Reputation: 10103

Posted 07 November 2012 - 11:03 AM

The one recommendation I haven't seen is to consider jumping up to ES3. Now, that may not be viable for you, but if it is you'll have instancing support, so happy days - you're in the promised land.

If you can't do that then you've got a balancing act between the cost of splitting batches versus the cost of updating your vertex data in a manner that would allow you to take it all in a single batch.

For a desktop implementation without either instancing or glMapBufferRange (which would be required to update VBOs in a reasonable manner and without stalling the pipeline - again, ES3 would make that problem go away too) my gut inclination would be to drop the use of VBOs altogether, use client-side arrays in system memory, and transform on the CPU. Note that I said "desktop" here; I'm not certain how much of the following is going to apply to a mobile implementation so take it with an appropriately sized grain of salt.

Before proceeding it needs to be noted that ES2 does allow use of client-side arrays in this manner.

The main rationale behind this is that updating a VBO can be a horribly expensive operation - if you get it wrong it can be orders of magnitude more expensive than just not using VBOs at all. The reason why is that if the VBO is currently in use for drawing your program will not be able to immediately update it - instead it must stall, wait for all pending drawing operations to complete, then the update can happen. Do this a few too many times per frame and some implementations will plunge you to single digit framerates.

I'm guessing that you don't really want that to happen. ;)

So let's look at transforming a box on the CPU. This is not as horrible as it may appear at first glance.

First thing is to use indexed drawing (via glDrawElements) which will reduce the number of vertices that need to be transformed from 24 to 8 - that's quite a significant saving already.

Second thing is to look at the transformation itself. There are several shortcuts you can take here, with an obvious one being to check if the box needs to be rotated - if it doesn't then the transformation collapses from a full set of matrix calculation/multiplies to 3 additions. Nice! The same applies to scaling; again you can collapse the full transform to something much much simpler (and faster).

One other factor here is that the indices used for drawing many boxes are going to be static - they'll never change, so you can just set them up once and reuse them as needed. You'll need to burn a bit of extra memory to set up indices for multiple boxes, but I believe that the tradeoff is worth it.
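
As an illustration of the static index idea, here is a sketch that builds the indices once for many boxes. It assumes 2D rectangles with 4 vertices each (rather than 3D cubes) and 16-bit indices, since core ES 2.0 only guarantees GL_UNSIGNED_BYTE and GL_UNSIGNED_SHORT for glDrawElements. `buildQuadIndices` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds indices for `quadCount` quads whose 4 vertices are stored back to
// back in the vertex array. Each quad becomes two triangles (6 indices).
// The result never changes, so it can be created once and reused every frame.
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount) {
    // 16-bit indices can address 65536 vertices, i.e. 16384 quads.
    assert(quadCount <= 16384);
    static const std::uint16_t pattern[6] = {0, 1, 2, 2, 1, 3};
    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        const std::size_t base = q * 4;  // first vertex of this quad
        for (std::uint16_t offset : pattern) {
            indices.push_back(static_cast<std::uint16_t>(base + offset));
        }
    }
    return indices;
}
```

With this, only 4 vertices per rectangle need to be transformed each frame instead of 6, and the draw would be a single glDrawElements(GL_TRIANGLES, 6 * quadCount, GL_UNSIGNED_SHORT, ...) call.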

You could also get a further reduction in vertex submission by just not bothering to draw cube faces that are facing away from the viewpoint, but that would mess things up a little with your static indices (although you could work around it by collapsing them to 0-area triangles and reusing the same vertex for all of them). I'd maybe save that one for a later avenue of potential optimization if needed.

If my advice about VBOs turns out to be wrong on mobile platforms (i.e. if the cost of updating is lower than I estimate) then you're in a nice position where you can use a dynamic VBO, a static index buffer, and just fill/draw. I'm not sure I'd be happy mixing client-side vertex data with an index buffer, but my limited experience of mobile platforms means that I can't really comment further on that one.

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.

#11Cornstalks  Crossbones+   -  Reputation: 7007

Posted 07 November 2012 - 12:33 PM

Yeah, the best option will likely be very specific to my case. Specifically, this is a 2D game on Android where I'm using Box2D to simulate physical interactions of a large number of 2D boxes/rectangles. The boxes are not textured, and at the moment I'm considering making them all one color. The boxes can freely move and rotate during the simulation. Box2D gives me a translation vector (x and y) and rotation vector (precomputed sin and cos) for each box, and I know each box's size (I'm considering making them all the same size).

> The one recommendation I haven't seen is to consider jumping up to ES3. Now, that may not be viable for you, but if it is you'll have instancing support, so happy days - you're in the promised land.

Unfortunately, I can't, as I'm targeting Android devices and the best thing available is ES2.0.

> First thing is to use indexed drawing (via glDrawElements) which will reduce the number of vertices that need to be transformed from 24 to 8 - that's quite a significant saving already.

These are 2D boxes, so the savings are significantly reduced (but still present). Are the savings still significant enough, do you think?

One thought I've had (there's a problem with it though) is to make one buffer that holds transformation matrices (really just vec4s representing the object's translation and rotation vectors) for every object, and then when drawing use an index array to index into the transformation matrix buffer. That way, each vertex can reference the corresponding box's transformation matrix and the box's transformation matrix only needs to be sent once. Each update would require updating the transformation matrix buffer. The problem, however, is another vertex buffer would be needed to define the 4 vertices for each box. This other vertex buffer only needs 4 elements, as all the boxes can be represented with the same vertices and a different transformation matrix. However, I don't think I can specify two index buffers, one which indexes into the transformation matrix buffer and the other which indexes into the little vertex buffer.

I'm seriously considering abandoning batching altogether at this point and just drawing each box individually (and transforming on the GPU using a uniform matrix passed in). Vertex data and buffer indices remain the same for each draw call. The only thing that would change is the uniform matrix. Thoughts?
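
For reference, a sketch of what the per-box uniform could look like, assuming the translation and sin/cos values come straight from the physics engine and the shader expects a column-major mat4. `makeModelMatrix` is a hypothetical helper, not a Box2D or OpenGL function.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Builds a column-major 4x4 matrix (the layout glUniformMatrix4fv expects
// when transpose is GL_FALSE) that rotates by the angle whose sin/cos are
// `s`/`c` and then translates by (tx, ty).
std::array<float, 16> makeModelMatrix(float tx, float ty, float s, float c) {
    // A valid sin/cos pair has unit length (within rounding error).
    assert(std::fabs(s * s + c * c - 1.0f) < 1e-3f);
    return {{
        c,    s,    0.0f, 0.0f,  // column 0: rotated x axis
        -s,   c,    0.0f, 0.0f,  // column 1: rotated y axis
        0.0f, 0.0f, 1.0f, 0.0f,  // column 2: z is untouched in 2D
        tx,   ty,   0.0f, 1.0f,  // column 3: translation
    }};
}
```

Per box, the draw loop would then be roughly glUniformMatrix4fv(location, 1, GL_FALSE, m.data()) followed by one draw call on the shared 4-vertex quad.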

#12mhagain  Crossbones+   -  Reputation: 10103

Posted 07 November 2012 - 06:40 PM

Hmmm - when I saw the word "box" I automatically assumed a 3D shape (even if projected onto a 2D view), but could you clarify - are you talking "boxes" as I assumed with 6 sides, 8 corners, or are you talking rectangles? I'd withdraw a huge chunk of my previous post if the latter (and happily accept negative rep on it too).

> I'm seriously considering abandoning batching altogether at this point and just drawing each box individually (and transforming on the GPU using a uniform matrix passed in). Vertex data and buffer indices remain the same for each draw call. The only thing that would change is the uniform matrix. Thoughts?

Worth benchmarking and seeing how you go. It's incredibly simple to implement and may turn out to be not a problem at all.

Edited by mhagain, 07 November 2012 - 06:40 PM.

#13Cornstalks  Crossbones+   -  Reputation: 7007

Posted 07 November 2012 - 07:42 PM

> Hmmm - when I saw the word "box" I automatically assumed a 3D shape (even if projected onto a 2D view), but could you clarify - are you talking "boxes" as I assumed with 6 sides, 8 corners, or are you talking rectangles? I'd withdraw a huge chunk of my previous post if the latter (and happily accept negative rep on it too).

Boxes as in 2D rectangles and squares. 4 vertices, 2 triangles. I voted you up because even though a good amount of what you were talking about doesn't really apply in my particular case, there are things that you mentioned that I do appreciate because they may be very helpful in future projects.

I've got some basic rendering working now using the method in the last paragraph of my previous post. I plan on doing some stress testing and benchmarking to see if the rendering is enough of a bottleneck to try to optimize more, though I doubt it will be at this point.
