# OpenGL Most efficient way to batch drawings

This topic is 1839 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hi, I'm interested on a bit of theory about the best methods of optimization for OpenGL 3.0 (where a lot of function became deprecated).

On my current 2D framework, every sprite has own program with own values inside the uniform. Every sprite is draw separately and, now that I switched from 2.1 to 3.0, every sprite has own matrix Projection and View. Now my goal is to batch most vertexes possible and these are some ideas:

1) Use only one program for everything. The projection matrix is one, i can group the vertexes and send via glVertexAttribArray the values for shader and draw everything with one call. The problem is the model view matrix, that should be one for every vertex and this isn't the thing that I want because every sprite has own matrix.

2) Continue to use various shader. The projection matrix is shared between programs (how can I do it?), every sprite has own shader with own model view matrix and uniform values. The problem here is that I need to switch the program between sprite draws.

None of these ideas work as I expected so now I'm here to ask you what is it the most efficient way to batch drawings in OpenGL 3.0.

##### Share on other sites
If only one or two transform matrices are unique for every sprite, I see no reason why you can't draw them all in one single call. You can index into a uniform array or into a buffer texture to read these, using e.g. gl_InstanceID in the vertex shader if you use instancing (or gl_VertexID divided by 4 otherwise).

Or, you can generate quads from points in the geometry shader and use either gl_VertexID or gl_PrimitiveID (which are the same in that particular case) as an index (in that case, transform is done in the GS too). A sprite likely does not have a dozen output attributes, so the geometry shader should be reasonably efficient, too.

Either solution is a thousand times more efficient than binding different uniforms (or even shaders!) for every sprite, or for some subset of sprites that you have determined with some clever batching algorithm. Edited by samoth

##### Share on other sites
1) Use only one program for everything. The projection matrix is one, i can group the vertexes and send via glVertexAttribArray the values for shader and draw everything with one call. The problem is the model view matrix, that should be one for every vertex and this isn't the thing that I want because every sprite has own matrix.
I'm not sure what you mean by this, but maybe what you want is instancing.

2) Continue to use various shader. The projection matrix is shared between programs (how can I do it?), every sprite has own shader with own model view matrix and uniform values. The problem here is that I need to switch the program between sprite draws.
Uniform buffers make sharing easy. Edited by max343

##### Share on other sites
If only one or two transform matrices are unique for every sprite, I see no reason why you can't draw them all in one single call. You can index into a uniform array or into a buffer texture to read these, using e.g. gl_InstanceID in the vertex shader if you use instancing (or gl_VertexID divided by 4 otherwise).

Or, you can generate quads from points in the geometry shader and use either gl_VertexID or gl_PrimitiveID (which are the same in that particular case) as an index (in that case, transform is done in the GS too). A sprite likely does not have a dozen output attributes, so the geometry shader should be reasonably efficient, too.

Either solution is a thousand times more efficient than binding different uniforms (or even shaders!) for every sprite, or for some subset of sprites that you have determined with some clever batching algorithm.

So if I have 100 sprites I should send 100 view model matrix with glUniformMatrix4fv and select them with gl_VertexID/4?

1) Use only one program for everything. The projection matrix is one, i can group the vertexes and send via glVertexAttribArray the values for shader and draw everything with one call. The problem is the model view matrix, that should be one for every vertex and this isn't the thing that I want because every sprite has own matrix.
I'm not sure what you mean by this, but maybe what you want is instancing.

2) Continue to use various shader. The projection matrix is shared between programs (how can I do it?), every sprite has own shader with own model view matrix and uniform values. The problem here is that I need to switch the program between sprite draws.
Uniform buffers make sharing easy.

Yes, I mean instancing (I saw what instancing is it only now). Do you recommend me to send matrices in an uniform array or in a texture?

##### Share on other sites
I always prefer using uniform buffers. Initially the piping is a bit tricky to understand, but once you grasp that part, their advantages over textures are apparent.

BTW, OpenGL 3 supports instancing. Edited by max343

##### Share on other sites
So if I have 100 sprites I should send 100 view model matrix with glUniformMatrix4fv and select them with gl_VertexID/4?

I didn't read into it the first time, but the answer is no. A big no. It's much better to use uniform buffers for something this big (or for something that you're going to share). In fact it's better to limit the usage of global uniforms only to those cases in which the overhead of using the buffer is greater.

##### Share on other sites

Okay, I reduced the uses of shaders to one only and I've implemented the uses of VBO. I'm unpacking the triangle strip to a triangle list into a structure with 512 * sizeof(Vertex) size. I'm building and drawing the VBO when the structure is filled with this:

glBufferData(GL_ARRAY_BUFFER, m_vertexcacheIndex * sizeof(SuperVertex), m_vertexcache, GL_DYNAMIC_DRAW);
glVertexAttribPointer(vert_position, 3, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(0 * sizeof(float)));
glVertexAttribPointer(vert_texture, 3, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(3 * sizeof(float)));
glVertexAttribPointer(vert_color, 4, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(6 * sizeof(float)));
glDrawArrays(GL_TRIANGLES, 0, m_vertexcacheIndex);
m_vertexcacheIndex = 0;

where m_vertexcacheIndex is the vertices count inside th structure, m_vertexcache is the structure itself and supervertex is the structure definition. I debugged the software with gDEBugger, before VBO I was doing 12k gl calls per frame, now only 120 calls but I have bad performances. Before 720fps, now 350...

Edited by Retsu90

##### Share on other sites

I did some tests:

1) Call glVertexAttribPointer and glDrawArrays with GL_TRIANGLE_STRIP for every sprite (the original mode before to create this post), reaches 498fps. The stride here is 0, this mean that vertex position, texture position and color are in separate structures.

2) Cache the vertices in an array of 1024 structures. I'm copying the vertices that I'm passing to the cache with a memcpy. When the array is full, the content is drawn with glVertexAttribPointer and glDrawElements with GL_TRIANGLE_STRIP. I'm indexing the vertices here. The stride is 0. 589fps!!!

3) Same as above, but vertex position, texture position and color are on the same structure, this mean that I need to call memcpy to copy the sprite model, only once. I was expecting an improvment. 399fps.

4) Same as 4, but this time I'm unpacking the vertices from GL_TRIANGLE_STRIP to GL_TRIANGLES. I'm passing the 4 vertices and a function unpack them to 6 vertices. With this I don't need of indexed vertices. This takes much memory but the fps reached are 562!

5) Same as 3, but this time I'm using VBO: only 270fps.

6) Same as 4 but with VBO: 278fps.

Supposing that I'm not doing nothing's wrong, the best mode is the second. It doesn't take much memory and the indexing mode is easy to do. With this I can hardcode some basic models and indexing them. The vertex unpacking from STRIP to LIST can takes a lot of resources and it doesn't improve so much. I should avoid the structures all-in-one (I read from OpenGL documentation that it's implemented for D3D compatibility) and stores every attrib in a separate structure. For some reason, VBO decrease the performances and with this, SwapBuffer takes a lot of CPU. However all this methods are CPU-limited, because the GPU isn't totally used. Much of the CPU is drawined by memcpy and SwapBuffer.

EDIT: I tried the same tests with the same software without edits on another computer that handle a Intel HD3000 (the first tests run on a Radeon 4870HD): 62, 178, 124, 163, 97, 207fps. VBO with triangle list is much faster this time. I'm starting to be confused...

Edited by Retsu90

##### Share on other sites
For some reason, VBO decrease the performances and with this, SwapBuffer takes a lot of CPU. However all this methods are CPU-limited, because the GPU isn't totally used. Much of the CPU is drawined by memcpy and SwapBuffer.

EDIT: I tried the same tests with the same software without edits on another computer that handle a Intel HD3000 (the first tests run on a Radeon 4870HD): 62, 178, 124, 163, 97, 207fps. VBO with triangle list is much faster this time. I'm starting to be confused...

If you gave us more details about the way you have measured the time, maybe we could find the cause. SwapBuffers is not a time-consuming instruction. The reason it take time is waiting for drawing to finish. That implies your measured time is incorrect. How did you measured it?

##### Share on other sites
For some reason, VBO decrease the performances and with this, SwapBuffer takes a lot of CPU. However all this methods are CPU-limited, because the GPU isn't totally used. Much of the CPU is drawined by memcpy and SwapBuffer.

EDIT: I tried the same tests with the same software without edits on another computer that handle a Intel HD3000 (the first tests run on a Radeon 4870HD): 62, 178, 124, 163, 97, 207fps. VBO with triangle list is much faster this time. I'm starting to be confused...

If you gave us more details about the way you have measured the time, maybe we could find the cause. SwapBuffers is not a time-consuming instruction. The reason it take time is waiting for drawing to finish. That implies your measured time is incorrect. How did you measured it?

I'm measuring it with gDEBugger, setting SwapBuffer as end-of-frame. With the profiling of Visual Studio, I can see clearly that SwapBuffers takes the 50% of the CPU in a single frame.

##### Share on other sites

A VBO should be performant enough for the vast majority of cases. If you need better performance you can pass points and expand them to triangles in a geometry shader.

Read up here on some techniques for VBO optimization with sprite batching:

http://www.java-gaming.org/topics/opengl-lightning-fast-managed-vbo-mapping/28209/view.html

Since ultimately the performance may vary depending on the driver, the absolute "fastest" solution is to use whatever works best for the driver. For example, in the intro cutscene of your game you might benchmark a few different rendering techniques, and pick whichever runs the fastest.

##### Share on other sites
| measuring it with gDEBugger, setting SwapBuffer as end-of-frame. With the profiling of Visual Studio, I can see clearly that SwapBuffers takes the 50% of the CPU in a single frame.

Well, I don't know how gDEBugger measures execution time, but I have to draw your attention to the following facts:

1. All issued GL commands execute on both CPU and GPU.

2. CPU execution time is usually very short (of course it depends on a command), commands are set in a command queue and the control is returned to the CPU.

3. SwapBuffers, as its name implies, exchanges the front and back buffers. In order to do that, it flushes command queue and waits until the drawing is finished. It is probably dependent on the implementation, but on my laptop with Windows Vista, it is a blocking function. Take a look at the attached picture.

Blue lines represent CPU time, while red ones represent GPU time. Although you could say SwapBuffers consumes 78% of the frame time, it is simple not the truth. The answer is in the blue line in the window "Frame". GPU takes about 13ms to render the frame, although CPU is utilized only 0.67ms. That's what I talked about.

4. Frame-rate can be only 120 (rarely), 60, 30, 15, etc. if vsync is on. What you have posted is an effective frame-rate. So, it is better to use time of execution instead of FPS.

5. Having effective frame-rate greater than 120 induce performance state changing, since GPU is not utilized enough. It is very hard to profile application in such circumstances. That's why I proposed a performance state tracking alongside with profiling (take a look at OpenGL Insights, pg.527-534.).

##### Share on other sites

If you're running slower with a VBO then you're doing something wrong - most likely case is that the code you're using to update the VBO is causing sync points which are killing your framerate.  This is a common enough failing - you just can't treat a VBO as if it were just another block of memory that you can freely write to, read from, etc as if it were a regular pointer.  You should post the code you're using to update your VBO and it will be possible to comment further, but for now, and to get you started, a read of this article is recommended: http://www.opengl.org/wiki/Buffer_Object_Streaming

##### Share on other sites
I should avoid the structures all-in-one (I read from OpenGL documentation that it's implemented for D3D compatibility)

Not sure where in the OpenGL docs you read that, but you should be aware that the ability to use interleaved attribs has been part of OpenGL since the original GL_EXT_vertex_array in 1995.  In the common case interleaving should in fact be the faster option; there are certain cases for sure where it may be slower (such as using software T&L, doing a separate shadow pass, etc) but if none of those cases apply and if it still runs slower for you then - again - you've got something else wrong.

##### Share on other sites

Okay, in these days I rewrote the entire sprite system. I'm using a pre-calculated unsigned short array for vertex indices and I'm copying the four vertices for each sprite in an array that is used as a cache. Now the framework reaches 997 fps with 20000 triangles. I'm currently using glVertexPointer and glDrawElements due to OpenGL 2.1 compatibility. I'm binding only one texture per frame. I discovered that the rendering isn't really CPU-limited, in fact I overclocked my video-card (it was in under clock to save power) and the framework reaches 1800fps. For VBO I don't understand exactly how to initialize and use it properly, it isn't the same thing to cache all the vertices in main memory then send all together before to call SwapBuffers? I forgot also to mention that more or less the 90% of the vertices changes every frame, so caching them inside the video-card has't a great effect... Also I don't understand how to buffer the uniforms and how to use them. It's possible also to avoid the glBindTexture? I know that I can upload the textures in a single big texture, but I'm asking if there is another way to switch with a batch/buffer the texture binding.

• ### Similar Content

• By _OskaR
Hi,
I have an OpenGL application but without possibility to wite own shaders.
I need to perform small VS modification - is possible to do it in an alternative way? Do we have apps or driver modifictions which will catch the shader sent to GPU and override it?
• By xhcao
Does sync be needed to read texture content after access texture image in compute shader?
My simple code is as below,
glUseProgram(program.get());
glBindImageTexture(0, texture[0], 0, GL_FALSE, 3, GL_READ_ONLY, GL_R32UI);
glBindImageTexture(1, texture[1], 0, GL_FALSE, 4, GL_WRITE_ONLY, GL_R32UI);
glDispatchCompute(1, 1, 1);
// Does sync be needed here?
glUseProgram(0);
GL_TEXTURE_CUBE_MAP_POSITIVE_X + face, texture[1], 0);
glReadPixels(0, 0, kWidth, kHeight, GL_RED_INTEGER, GL_UNSIGNED_INT, outputValues);

Compute shader is very simple, imageLoad content from texture[0], and imageStore content to texture[1]. Does need to sync after dispatchCompute?

• My question: is it possible to transform multiple angular velocities so that they can be reinserted as one? My research is below:

• I have this code below in both my vertex and fragment shader, however when I request glGetUniformLocation("Lights[0].diffuse") or "Lights[0].attenuation", it returns -1. It will only give me a valid uniform location if I actually use the diffuse/attenuation variables in the VERTEX shader. Because I use position in the vertex shader, it always returns a valid uniform location. I've read that I can share uniforms across both vertex and fragment, but I'm confused what this is even compiling to if this is the case.

#define NUM_LIGHTS 2
struct Light
{
vec3 position;
vec3 diffuse;
float attenuation;
};
uniform Light Lights[NUM_LIGHTS];

• By pr033r
Hello,
I have a Bachelor project on topic "Implenet 3D Boid's algorithm in OpenGL". All OpenGL issues works fine for me, all rendering etc. But when I started implement the boid's algorithm it was getting worse and worse. I read article (http://natureofcode.com/book/chapter-6-autonomous-agents/) inspirate from another code (here: https://github.com/jyanar/Boids/tree/master/src) but it still doesn't work like in tutorials and videos. For example the main problem: when I apply Cohesion (one of three main laws of boids) it makes some "cycling knot". Second, when some flock touch to another it scary change the coordination or respawn in origin (x: 0, y:0. z:0). Just some streng things.
I followed many tutorials, change a try everything but it isn't so smooth, without lags like in another videos. I really need your help.
My code (optimalizing branch): https://github.com/pr033r/BachelorProject/tree/Optimalizing
Exe file (if you want to look) and models folder (for those who will download the sources):
http://leteckaposta.cz/367190436
Thanks for any help...

• 14
• 16
• 10
• 17
• 9