
## Most efficient way to batch drawings

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

14 replies to this topic

### #1Retsu90  Members

208

Posted 27 December 2012 - 06:02 AM

Hi, I'm interested in a bit of theory about the best optimization methods for OpenGL 3.0 (where a lot of functions became deprecated).

In my current 2D framework, every sprite has its own program with its own uniform values. Every sprite is drawn separately and, now that I have switched from 2.1 to 3.0, every sprite has its own projection and view matrices. My goal now is to batch as many vertices as possible, and these are some ideas:

1) Use only one program for everything. There is a single projection matrix, so I can group the vertices, send the shader values via glVertexAttribPointer and draw everything with one call. The problem is the model-view matrix: there would be a single one for every vertex, which isn't what I want because every sprite has its own matrix.

2) Continue to use several shaders. The projection matrix is shared between programs (how can I do that?), and every sprite has its own shader with its own model-view matrix and uniform values. The problem here is that I need to switch programs between sprite draws.

Neither of these ideas works as I expected, so I'm here to ask what the most efficient way to batch drawing in OpenGL 3.0 is.

### #2samoth  Members

8927

Posted 27 December 2012 - 07:05 AM

If only one or two transform matrices are unique for every sprite, I see no reason why you can't draw them all in one single call. You can index into a uniform array or into a buffer texture to read these, using e.g. gl_InstanceID in the vertex shader if you use instancing (or gl_VertexID divided by 4 otherwise).

Or, you can generate quads from points in the geometry shader and use either gl_VertexID or gl_PrimitiveID (which are the same in that particular case) as an index (in that case, transform is done in the GS too). A sprite likely does not have a dozen output attributes, so the geometry shader should be reasonably efficient, too.

Either solution is a thousand times more efficient than binding different uniforms (or even shaders!) for every sprite, or for some subset of sprites that you have determined with some clever batching algorithm.
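As a rough sketch of the non-instanced variant described above (indexing a uniform array with gl_VertexID divided by 4), the following shows what the vertex shader might look like. All names (`u_projection`, `u_modelView`, `a_position`, `a_texcoord`, the batch size of 32) are assumptions for illustration, not code from this thread. It also assumes indexed quads with 4 unique vertices per sprite; with an unindexed triangle list of 6 vertices per sprite the divisor would be 6 instead.

```cpp
#include <string>

// Assumption: each sprite contributes exactly 4 unique vertices and is drawn
// with glDrawElements, so gl_VertexID (the index value) / 4 is the sprite number.
constexpr int kVerticesPerSprite = 4;

// CPU-side mirror of the shader's index computation, handy for sanity checks.
constexpr int spriteIndexForVertex(int vertexId) {
    return vertexId / kVerticesPerSprite;
}

// GLSL 1.30 vertex shader (the shading language version that ships with GL 3.0).
// The array size of 32 is a hypothetical batch size chosen to stay well below
// the minimum guaranteed number of vertex uniform components.
const std::string kSpriteVertexShader = R"glsl(
#version 130
uniform mat4 u_projection;      // shared by every sprite in the batch
uniform mat4 u_modelView[32];   // one matrix per sprite
in vec3 a_position;
in vec2 a_texcoord;
out vec2 v_texcoord;
void main() {
    int sprite = gl_VertexID / 4;   // 4 vertices per sprite
    v_texcoord = a_texcoord;
    gl_Position = u_projection * u_modelView[sprite] * vec4(a_position, 1.0);
}
)glsl";
```

With instancing, `gl_InstanceID` would replace the division, at the cost of requiring instanced draw calls.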

Edited by samoth, 27 December 2012 - 07:05 AM.

### #3max343  Members

346

Posted 27 December 2012 - 08:12 AM

> 1) Use only one program for everything. There is a single projection matrix, so I can group the vertices, send the shader values via glVertexAttribPointer and draw everything with one call. The problem is the model-view matrix: there would be a single one for every vertex, which isn't what I want because every sprite has its own matrix.

I'm not sure what you mean by this, but maybe what you want is instancing.

> 2) Continue to use several shaders. The projection matrix is shared between programs (how can I do that?), and every sprite has its own shader with its own model-view matrix and uniform values. The problem here is that I need to switch programs between sprite draws.

Uniform buffers make sharing easy.

Edited by max343, 27 December 2012 - 08:13 AM.

### #4Retsu90  Members

208

Posted 27 December 2012 - 08:50 AM

> If only one or two transform matrices are unique for every sprite, I see no reason why you can't draw them all in one single call. You can index into a uniform array or into a buffer texture to read these, using e.g. gl_InstanceID in the vertex shader if you use instancing (or gl_VertexID divided by 4 otherwise).
>
> Or, you can generate quads from points in the geometry shader and use either gl_VertexID or gl_PrimitiveID (which are the same in that particular case) as an index (in that case, transform is done in the GS too). A sprite likely does not have a dozen output attributes, so the geometry shader should be reasonably efficient, too.
>
> Either solution is a thousand times more efficient than binding different uniforms (or even shaders!) for every sprite, or for some subset of sprites that you have determined with some clever batching algorithm.

So if I have 100 sprites, I should send 100 model-view matrices with glUniformMatrix4fv and select them with gl_VertexID/4?

> > 1) Use only one program for everything. There is a single projection matrix, so I can group the vertices, send the shader values via glVertexAttribPointer and draw everything with one call. The problem is the model-view matrix: there would be a single one for every vertex, which isn't what I want because every sprite has its own matrix.
>
> I'm not sure what you mean by this, but maybe what you want is instancing.
>
> > 2) Continue to use several shaders. The projection matrix is shared between programs (how can I do that?), and every sprite has its own shader with its own model-view matrix and uniform values. The problem here is that I need to switch programs between sprite draws.
>
> Uniform buffers make sharing easy.

Yes, I mean instancing (I only learned what instancing is just now). Would you recommend sending the matrices in a uniform array or in a texture?

### #5max343  Members

346

Posted 27 December 2012 - 09:09 AM

I always prefer using uniform buffers. Initially the piping is a bit tricky to understand, but once you grasp that part, their advantages over textures are apparent.

BTW, OpenGL 3 supports instancing.

Edited by max343, 27 December 2012 - 09:11 AM.

### #6max343  Members

346

Posted 27 December 2012 - 10:31 AM

> So if I have 100 sprites, I should send 100 model-view matrices with glUniformMatrix4fv and select them with gl_VertexID/4?

I didn't read into it the first time, but the answer is no. A big no. It's much better to use uniform buffers for something this big (or for something that you're going to share). In fact it's better to limit the usage of global uniforms only to those cases in which the overhead of using the buffer is greater.
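To make the uniform buffer suggestion a little more concrete, here is a minimal sketch of a CPU-side structure that mirrors a std140 uniform block holding one shared projection matrix and 100 per-sprite matrices. The block name `SpriteBlock`, the member names and the binding point 0 are assumptions for illustration. Uniform buffers need OpenGL 3.1 or the ARB_uniform_buffer_object extension, so the GL calls appear only as comments here.

```cpp
#include <cstddef>

constexpr int kMaxSprites = 100;  // the 100 sprites from the question

// Assumed GLSL counterpart (std140 layout):
//   layout(std140) uniform SpriteBlock {
//       mat4 projection;
//       mat4 modelView[100];
//   };
// In std140 a mat4 is four vec4 columns: 64 bytes, 16-byte aligned, and an
// array of mat4 has no extra padding between elements.
struct Mat4 {
    float m[16];  // column-major, as OpenGL expects
};

struct SpriteBlock {
    Mat4 projection;
    Mat4 modelView[kMaxSprites];
};

static_assert(sizeof(Mat4) == 64, "mat4 must occupy 64 bytes to match std140");
static_assert(offsetof(SpriteBlock, modelView) == 64,
              "modelView must start right after projection");

// Total number of bytes to upload for one batch.
constexpr std::size_t spriteBlockSize() { return sizeof(SpriteBlock); }

// Upload sketch (needs a live GL context, shown as comments only):
//   glGenBuffers(1, &ubo);
//   glBindBuffer(GL_UNIFORM_BUFFER, ubo);
//   glBufferData(GL_UNIFORM_BUFFER, sizeof(SpriteBlock), &block, GL_DYNAMIC_DRAW);
//   glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);
//   GLuint idx = glGetUniformBlockIndex(program, "SpriteBlock");
//   glUniformBlockBinding(program, idx, 0);
```

Because the buffer is bound to a binding point rather than to a program, several programs can read the same block, which is what makes sharing the projection matrix straightforward.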

### #7Retsu90  Members

208

Posted 27 December 2012 - 09:27 PM

Okay, I've reduced the number of shaders to just one and implemented VBOs. I'm unpacking the triangle strip into a triangle list inside a structure of size 512 * sizeof(Vertex). When the structure is full, I build and draw the VBO with this:

    glBufferData(GL_ARRAY_BUFFER, m_vertexcacheIndex * sizeof(SuperVertex), m_vertexcache, GL_DYNAMIC_DRAW);
    glVertexAttribPointer(vert_position, 3, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(0 * sizeof(float)));
    glVertexAttribPointer(vert_texture, 3, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(3 * sizeof(float)));
    glVertexAttribPointer(vert_color, 4, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(6 * sizeof(float)));
    glDrawArrays(GL_TRIANGLES, 0, m_vertexcacheIndex);
    m_vertexcacheIndex = 0;

where m_vertexcacheIndex is the vertex count inside the structure, m_vertexcache is the structure itself and SuperVertex is the structure definition. I profiled the software with gDEBugger: before the VBO I was making 12k GL calls per frame, now only 120, but performance is worse. Before it was 720 fps, now 350...

Edited by Retsu90, 27 December 2012 - 09:29 PM.

### #8Retsu90  Members

208

Posted 28 December 2012 - 09:23 AM

I did some tests:

1) Call glVertexAttribPointer and glDrawArrays with GL_TRIANGLE_STRIP for every sprite (the original approach, from before I created this post): 498 fps. The stride here is 0, which means that vertex position, texture position and color are in separate structures.

2) Cache the vertices in an array of 1024 structures. I copy the vertices I'm passing into the cache with memcpy. When the array is full, its content is drawn with glVertexAttribPointer and glDrawElements with GL_TRIANGLE_STRIP. I'm indexing the vertices here. The stride is 0. 589 fps!!!

3) Same as above, but vertex position, texture position and color are in the same structure, which means I only need to call memcpy once to copy the sprite model. I was expecting an improvement. 399 fps.

4) Same as 3, but this time I'm unpacking the vertices from GL_TRIANGLE_STRIP to GL_TRIANGLES. I pass the 4 vertices and a function unpacks them into 6 vertices. This way I don't need indexed vertices. It takes more memory, but it reaches 562 fps!

5) Same as 3, but this time I'm using a VBO: only 270 fps.

6) Same as 4 but with a VBO: 278 fps.

Assuming I'm not doing anything wrong, the best mode is the second. It doesn't take much memory and the indexing is easy to do. With this I can hardcode some basic models and index them. Unpacking vertices from strip to list can take a lot of resources and doesn't improve things much. I should avoid the all-in-one structures (I read in the OpenGL documentation that they are implemented for D3D compatibility) and store every attribute in a separate structure. For some reason, the VBO decreases performance, and with it SwapBuffers takes a lot of CPU. However, all these methods are CPU-limited, because the GPU isn't fully used. Much of the CPU time is drained by memcpy and SwapBuffers.

EDIT: I ran the same tests, with the same unmodified software, on another computer with an Intel HD3000 (the first tests ran on a Radeon 4870HD): 62, 178, 124, 163, 97, 207 fps. The VBO with a triangle list is much faster this time. I'm starting to get confused...
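For readers following along, the strip-to-list unpacking from tests 4 and 6 could look roughly like the sketch below. The `Vec2` type and the function name are hypothetical and the real vertex structure in the framework is not shown in the thread. The second triangle swaps its first two corners so that both triangles keep the same winding as in the strip.

```cpp
#include <array>

// Hypothetical minimal vertex; the framework's real vertex carries more data.
struct Vec2 {
    float x, y;
};

// Expand one quad given in GL_TRIANGLE_STRIP order (v0, v1, v2, v3) into two
// GL_TRIANGLES triangles: (v0, v1, v2) and (v2, v1, v3).
std::array<Vec2, 6> unpackQuadStrip(const std::array<Vec2, 4>& strip) {
    return {{strip[0], strip[1], strip[2],
             strip[2], strip[1], strip[3]}};
}
```

This trades 50% more vertex data per sprite for not needing an index buffer, which matches the memory cost noted in test 4.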

Edited by Retsu90, 28 December 2012 - 06:32 PM.

### #9Aks9  Members

1399

Posted 29 December 2012 - 08:46 AM

> For some reason, the VBO decreases performance, and with it SwapBuffers takes a lot of CPU. However, all these methods are CPU-limited, because the GPU isn't fully used. Much of the CPU time is drained by memcpy and SwapBuffers.
>
> EDIT: I ran the same tests, with the same unmodified software, on another computer with an Intel HD3000 (the first tests ran on a Radeon 4870HD): 62, 178, 124, 163, 97, 207 fps. The VBO with a triangle list is much faster this time. I'm starting to get confused...

If you gave us more details about how you measured the time, maybe we could find the cause. SwapBuffers is not a time-consuming call in itself. The reason it takes time is that it waits for drawing to finish. That implies your measured time is incorrect. How did you measure it?

### #10Retsu90  Members

208

Posted 29 December 2012 - 11:58 AM

> > For some reason, the VBO decreases performance, and with it SwapBuffers takes a lot of CPU. However, all these methods are CPU-limited, because the GPU isn't fully used. Much of the CPU time is drained by memcpy and SwapBuffers.
> >
> > EDIT: I ran the same tests, with the same unmodified software, on another computer with an Intel HD3000 (the first tests ran on a Radeon 4870HD): 62, 178, 124, 163, 97, 207 fps. The VBO with a triangle list is much faster this time. I'm starting to get confused...
>
> If you gave us more details about how you measured the time, maybe we could find the cause. SwapBuffers is not a time-consuming call in itself. The reason it takes time is that it waits for drawing to finish. That implies your measured time is incorrect. How did you measure it?

I'm measuring it with gDEBugger, setting SwapBuffers as end-of-frame. With the Visual Studio profiler, I can clearly see that SwapBuffers takes 50% of the CPU time in a single frame.

### #11mattdesl  Members

176

Posted 29 December 2012 - 12:49 PM

A VBO should be performant enough for the vast majority of cases. If you need better performance you can pass points and expand them to triangles in a geometry shader.

Read up here on some techniques for VBO optimization with sprite batching:

http://www.java-gaming.org/topics/opengl-lightning-fast-managed-vbo-mapping/28209/view.html

Since ultimately the performance may vary depending on the driver, the absolute "fastest" solution is to use whatever works best for the driver. For example, in the intro cutscene of your game you might benchmark a few different rendering techniques, and pick whichever runs the fastest.

### #12Aks9  Members

1399

Posted 30 December 2012 - 04:43 AM

> I'm measuring it with gDEBugger, setting SwapBuffers as end-of-frame. With the Visual Studio profiler, I can clearly see that SwapBuffers takes 50% of the CPU time in a single frame.

Well, I don't know how gDEBugger measures execution time, but I have to draw your attention to the following facts:

1. All issued GL commands execute on both CPU and GPU.

2. CPU execution time is usually very short (of course it depends on a command), commands are set in a command queue and the control is returned to the CPU.

3. SwapBuffers, as its name implies, exchanges the front and back buffers. In order to do that, it flushes command queue and waits until the drawing is finished. It is probably dependent on the implementation, but on my laptop with Windows Vista, it is a blocking function. Take a look at the attached picture.

Blue lines represent CPU time, while red ones represent GPU time. Although you could say SwapBuffers consumes 78% of the frame time, that is simply not true. The answer is in the blue line in the "Frame" window: the GPU takes about 13 ms to render the frame, although the CPU is busy for only 0.67 ms. That's what I was talking about.

4. With vsync on, the frame rate can only be 120 (rarely), 60, 30, 15, etc. What you have posted is an effective frame rate. So, it is better to use execution time instead of FPS.

5. An effective frame rate greater than 120 induces performance-state changes, since the GPU is not utilized enough. It is very hard to profile an application in such circumstances. That's why I proposed performance-state tracking alongside profiling (take a look at OpenGL Insights, pp. 527-534).

### #13mhagain  Members

12429

Posted 02 January 2013 - 05:59 AM

If you're running slower with a VBO then you're doing something wrong - most likely case is that the code you're using to update the VBO is causing sync points which are killing your framerate.  This is a common enough failing - you just can't treat a VBO as if it were just another block of memory that you can freely write to, read from, etc as if it were a regular pointer.  You should post the code you're using to update your VBO and it will be possible to comment further, but for now, and to get you started, a read of this article is recommended: http://www.opengl.org/wiki/Buffer_Object_Streaming
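One way to avoid such sync points, along the lines of the linked article, is to write each batch into a fresh region of the buffer and to re-specify ("orphan") the storage when the buffer wraps around. The sketch below covers only the CPU-side bookkeeping; the class and member names are hypothetical, and the GL calls are shown as comments because they need a live context.

```cpp
#include <cstddef>

// Where the next batch should be written, and whether the buffer storage
// must be orphaned first because the write position wrapped around.
struct StreamSlot {
    std::size_t offset;  // byte offset inside the VBO
    bool orphan;         // true: re-specify the storage before writing
};

class StreamCursor {
public:
    explicit StreamCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes), head_(0) {}

    // Reserve `bytes` for the next batch. Assumes bytes <= capacity.
    StreamSlot reserve(std::size_t bytes) {
        bool orphan = false;
        if (head_ + bytes > capacity_) {
            head_ = 0;      // start over at the beginning...
            orphan = true;  // ...on fresh storage, so the GPU can keep the old one
        }
        const StreamSlot slot{head_, orphan};
        head_ += bytes;
        return slot;
    }

private:
    std::size_t capacity_;
    std::size_t head_;
};

// Usage sketch with GL calls (comments only):
//   StreamSlot s = cursor.reserve(batchBytes);
//   if (s.orphan)
//       glBufferData(GL_ARRAY_BUFFER, capacity, nullptr, GL_STREAM_DRAW);
//   void* dst = glMapBufferRange(GL_ARRAY_BUFFER, s.offset, batchBytes,
//                                GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
//   memcpy(dst, vertices, batchBytes);
//   glUnmapBuffer(GL_ARRAY_BUFFER);
//   // then draw, starting at the vertex that corresponds to s.offset
```

The unsynchronized mapping is only safe because each region is written once per orphaned storage; overwriting a region that the GPU may still be reading would bring the stall back or corrupt the frame.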

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.

### #14mhagain  Members

12429

Posted 03 January 2013 - 08:44 AM

> I should avoid the all-in-one structures (I read in the OpenGL documentation that they are implemented for D3D compatibility)

Not sure where in the OpenGL docs you read that, but you should be aware that the ability to use interleaved attribs has been part of OpenGL since the original GL_EXT_vertex_array in 1995.  In the common case interleaving should in fact be the faster option; there are certain cases for sure where it may be slower (such as using software T&L, doing a separate shadow pass, etc) but if none of those cases apply and if it still runs slower for you then - again - you've got something else wrong.
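For reference, an interleaved layout matching the glVertexAttribPointer calls posted earlier in the thread (3 floats of position, 3 of texture coordinates, 4 of color) might be declared as below. `SuperVertex` is the name used in that post, but its actual definition was never shown, so the member names here are assumptions.

```cpp
#include <cstddef>

// Assumed definition: all attributes of one vertex stored contiguously.
struct SuperVertex {
    float position[3];
    float texture[3];
    float color[4];
};

static_assert(sizeof(SuperVertex) == 10 * sizeof(float),
              "no padding expected between the float members");

// The stride passed to every glVertexAttribPointer call is the struct size,
// and each attribute's offset is its member offset.
constexpr std::size_t kStride = sizeof(SuperVertex);
constexpr std::size_t kPositionOffset = offsetof(SuperVertex, position);
constexpr std::size_t kTextureOffset = offsetof(SuperVertex, texture);
constexpr std::size_t kColorOffset = offsetof(SuperVertex, color);
```

Deriving the offsets with `offsetof` instead of hand-written multiples of `sizeof(float)` keeps the attribute setup correct if the structure changes.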


### #15Retsu90  Members

208

Posted 04 January 2013 - 09:08 AM

Okay, over the last few days I rewrote the entire sprite system. I'm using a precalculated unsigned short array for vertex indices, and I copy the four vertices of each sprite into an array that is used as a cache. Now the framework reaches 997 fps with 20000 triangles. I'm currently using glVertexPointer and glDrawElements for OpenGL 2.1 compatibility, and I bind only one texture per frame. I discovered that the rendering isn't really CPU-limited: I overclocked my video card (it was underclocked to save power) and the framework now reaches 1800 fps.

As for VBOs, I don't understand exactly how to initialize and use them properly. Isn't it the same thing as caching all the vertices in main memory and then sending them all together before calling SwapBuffers? I also forgot to mention that roughly 90% of the vertices change every frame, so caching them on the video card doesn't have a great effect... I also don't understand how to buffer the uniforms and how to use them. Is it also possible to avoid glBindTexture? I know that I can pack the textures into a single big texture, but I'm asking whether there is another way to switch the texture binding within a batch/buffer.
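A precalculated unsigned short index array for quads, as described in this post, could be generated along these lines. This is a sketch with a hypothetical function name, and the corner order (0, 1, 2, 2, 1, 3) assumes each sprite's four vertices are stored in triangle-strip order. Since the indices are 16-bit, one batch can address at most 65536 vertices, i.e. 16384 sprites.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build 6 indices (two triangles) for each sprite's 4 vertices.
// Assumption: spriteCount <= 16384 so every index fits in 16 bits.
std::vector<std::uint16_t> buildQuadIndices(std::size_t spriteCount) {
    static const std::uint16_t kPattern[6] = {0, 1, 2, 2, 1, 3};
    std::vector<std::uint16_t> indices;
    indices.reserve(spriteCount * 6);
    for (std::size_t sprite = 0; sprite < spriteCount; ++sprite) {
        const std::size_t base = sprite * 4;  // first vertex of this sprite
        for (std::uint16_t corner : kPattern) {
            indices.push_back(static_cast<std::uint16_t>(base + corner));
        }
    }
    return indices;
}
```

Because the pattern never changes, the array can be built once at startup and reused for every batch; only the vertex data needs to be refreshed each frame.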
