# OpenGL Most efficient way to batch drawings

Hi, I'm interested in a bit of theory about the best optimization methods for OpenGL 3.0 (where a lot of functions became deprecated).

In my current 2D framework, every sprite has its own program with its own uniform values. Every sprite is drawn separately and, now that I have switched from 2.1 to 3.0, every sprite has its own projection and view matrices. My goal now is to batch as many vertices as possible, and these are some ideas:

1) Use only one program for everything. There is a single projection matrix, so I can group the vertices, send the per-vertex values to the shader via glVertexAttribPointer, and draw everything with one call. The problem is the model-view matrix: it would be the same for every vertex, which is not what I want, because every sprite has its own matrix.

2) Continue to use several shaders. The projection matrix is shared between programs (how can I do that?), and every sprite has its own shader with its own model-view matrix and uniform values. The problem here is that I need to switch programs between sprite draws.

Neither of these ideas works as I expected, so I'm here to ask: what is the most efficient way to batch drawing in OpenGL 3.0?

##### Share on other sites
If only one or two transform matrices are unique for every sprite, I see no reason why you can't draw them all in one single call. You can index into a uniform array or into a buffer texture to read these, using e.g. gl_InstanceID in the vertex shader if you use instancing (or gl_VertexID divided by 4 otherwise).

Or, you can generate quads from points in the geometry shader and use either gl_VertexID or gl_PrimitiveID (which are the same in that particular case) as an index (in that case, transform is done in the GS too). A sprite likely does not have a dozen output attributes, so the geometry shader should be reasonably efficient, too.

Either solution is a thousand times more efficient than binding different uniforms (or even shaders!) for every sprite, or for some subset of sprites that you have determined with some clever batching algorithm. Edited by samoth
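A minimal sketch of the `gl_VertexID` lookup described above, assuming four unique vertices per sprite drawn through an index buffer (so `gl_VertexID` is 0..3 for the first sprite, 4..7 for the second, and so on). The uniform names, attribute names and the batch size of 32 are illustrative assumptions, not something from the original post. The shader is written for GLSL 1.30, the version that accompanies OpenGL 3.0. With a non-indexed triangle list (six vertices per sprite) the divisor would be 6 instead of 4.

```cpp
#include <cassert>
#include <string>

// Hypothetical batch size; it must match the array size in the shader below.
// Query GL_MAX_VERTEX_UNIFORM_COMPONENTS to find the real upper limit.
constexpr int kMaxSpritesPerBatch = 32;

// Vertex shader source (GLSL 1.30). All names are illustrative.
const std::string kBatchVertexShader = R"glsl(#version 130
uniform mat4 u_projection;       // shared by the whole batch
uniform mat4 u_modelView[32];    // one matrix per sprite in the batch
in vec3 a_position;
in vec2 a_texCoord;
out vec2 v_texCoord;
void main() {
    int sprite = gl_VertexID / 4;   // 4 unique vertices per sprite
    v_texCoord = a_texCoord;
    gl_Position = u_projection * u_modelView[sprite] * vec4(a_position, 1.0);
}
)glsl";

// CPU-side mirror of the shader's lookup, handy when filling the matrix array.
constexpr int spriteIndexForVertex(int vertexId) { return vertexId / 4; }
```

The whole batch is then one `glDrawElements` call, with the matrix array re-uploaded once per batch rather than once per sprite.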

##### Share on other sites
> 1) Use only one program for everything. There is a single projection matrix, so I can group the vertices, send the per-vertex values to the shader via glVertexAttribPointer, and draw everything with one call. The problem is the model-view matrix: it would be the same for every vertex, which is not what I want, because every sprite has its own matrix.

I'm not sure what you mean by this, but maybe what you want is instancing.

> 2) Continue to use several shaders. The projection matrix is shared between programs (how can I do that?), and every sprite has its own shader with its own model-view matrix and uniform values. The problem here is that I need to switch programs between sprite draws.

Uniform buffers make sharing easy. Edited by max343

##### Share on other sites
> If only one or two transform matrices are unique for every sprite, I see no reason why you can't draw them all in one single call. You can index into a uniform array or into a buffer texture to read these, using e.g. gl_InstanceID in the vertex shader if you use instancing (or gl_VertexID divided by 4 otherwise).
>
> Or, you can generate quads from points in the geometry shader and use either gl_VertexID or gl_PrimitiveID (which are the same in that particular case) as an index (in that case, transform is done in the GS too). A sprite likely does not have a dozen output attributes, so the geometry shader should be reasonably efficient, too.
>
> Either solution is a thousand times more efficient than binding different uniforms (or even shaders!) for every sprite, or for some subset of sprites that you have determined with some clever batching algorithm.

So if I have 100 sprites, I should send 100 model-view matrices with glUniformMatrix4fv and select them with gl_VertexID/4?

> > 1) Use only one program for everything. There is a single projection matrix, so I can group the vertices, send the per-vertex values to the shader via glVertexAttribPointer, and draw everything with one call. The problem is the model-view matrix: it would be the same for every vertex, which is not what I want, because every sprite has its own matrix.
>
> I'm not sure what you mean by this, but maybe what you want is instancing.
>
> > 2) Continue to use several shaders. The projection matrix is shared between programs (how can I do that?), and every sprite has its own shader with its own model-view matrix and uniform values. The problem here is that I need to switch programs between sprite draws.
>
> Uniform buffers make sharing easy.

Yes, I mean instancing (I only just found out what instancing is). Would you recommend sending the matrices in a uniform array or in a texture?

##### Share on other sites
I always prefer using uniform buffers. Initially the piping is a bit tricky to understand, but once you grasp that part, their advantages over textures are apparent.

BTW, OpenGL 3 supports instancing. Edited by max343

##### Share on other sites
> So if I have 100 sprites, I should send 100 model-view matrices with glUniformMatrix4fv and select them with gl_VertexID/4?

I didn't read into it the first time, but the answer is no. A big no. It's much better to use uniform buffers for something this big (or for something that you're going to share). In fact it's better to limit the usage of global uniforms only to those cases in which the overhead of using the buffer is greater.
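As a rough sketch of what sharing matrices through a uniform buffer can look like on the CPU side: the struct below mirrors a hypothetical `std140` block, and the GL calls appear only as comments because they need a live context. The block name `SharedMatrices`, the binding point 0 and the `setOrtho` helper are assumptions made for illustration. Uniform buffer objects are core from OpenGL 3.1; on a 3.0 context they depend on the driver exposing `GL_ARB_uniform_buffer_object`.

```cpp
#include <cassert>
#include <cstddef>

// CPU-side mirror of a hypothetical GLSL block:
//   layout(std140) uniform SharedMatrices { mat4 projection; mat4 view; };
// Under std140 a mat4 occupies four vec4 columns (64 bytes), so two tightly
// packed float[16] arrays match the block layout.
struct SharedMatrices {
    float projection[16];  // column-major
    float view[16];        // column-major
};
static_assert(sizeof(SharedMatrices) == 128, "must match the std140 block size");

// Fills a column-major orthographic projection, the usual choice for 2D.
inline void setOrtho(float (&m)[16], float left, float right,
                     float bottom, float top, float zNear, float zFar) {
    for (float& v : m) v = 0.0f;
    m[0]  =  2.0f / (right - left);
    m[5]  =  2.0f / (top - bottom);
    m[10] = -2.0f / (zFar - zNear);
    m[12] = -(right + left) / (right - left);
    m[13] = -(top + bottom) / (top - bottom);
    m[14] = -(zFar + zNear) / (zFar - zNear);
    m[15] =  1.0f;
}

// Upload outline (needs a GL context, so shown as comments only):
//   glBindBuffer(GL_UNIFORM_BUFFER, ubo);
//   glBufferData(GL_UNIFORM_BUFFER, sizeof(SharedMatrices), &data, GL_DYNAMIC_DRAW);
//   glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);
//   // once per program that declares the block:
//   glUniformBlockBinding(program,
//                         glGetUniformBlockIndex(program, "SharedMatrices"), 0);
```

Every program that declares the block and is bound to the same binding point reads the same projection, so it is uploaded once rather than once per program.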

##### Share on other sites

Okay, I reduced the number of shaders to just one and implemented the use of a VBO. I'm unpacking the triangle strip into a triangle list, stored in a structure of 512 * sizeof(Vertex) bytes. When the structure is full, I build and draw the VBO with this:

```cpp
// Upload the cached vertices (the buffer is re-specified on every flush).
glBufferData(GL_ARRAY_BUFFER, m_vertexcacheIndex * sizeof(SuperVertex), m_vertexcache, GL_DYNAMIC_DRAW);
// Interleaved layout: 3 floats position, 3 floats texture, 4 floats color.
glVertexAttribPointer(vert_position, 3, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(0 * sizeof(float)));
glVertexAttribPointer(vert_texture, 3, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(3 * sizeof(float)));
glVertexAttribPointer(vert_color, 4, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(6 * sizeof(float)));
glDrawArrays(GL_TRIANGLES, 0, m_vertexcacheIndex);
m_vertexcacheIndex = 0;  // reset the cache for the next batch
```

where m_vertexcacheIndex is the vertex count inside the structure, m_vertexcache is the structure itself, and SuperVertex is the structure definition. I profiled the software with gDEBugger: before the VBO I was making 12k GL calls per frame, now only 120, but performance is worse. Before I had 720 fps, now 350...

Edited by Retsu90

##### Share on other sites

I did some tests:

1) Call glVertexAttribPointer and glDrawArrays with GL_TRIANGLE_STRIP for every sprite (the original mode, before I created this post): 498 fps. The stride here is 0, which means that vertex position, texture position and color are in separate structures.

2) Cache the vertices in an array of 1024 structures. I copy the vertices that I pass to the cache with a memcpy. When the array is full, the content is drawn with glVertexAttribPointer and glDrawElements with GL_TRIANGLE_STRIP. I'm indexing the vertices here. The stride is 0. 589 fps!!!

3) Same as above, but vertex position, texture position and color are in the same structure, which means that I need to call memcpy only once to copy the sprite model. I was expecting an improvement. 399 fps.

4) Same as 3, but this time I'm unpacking the vertices from GL_TRIANGLE_STRIP to GL_TRIANGLES. I pass the 4 vertices and a function unpacks them into 6 vertices. With this I don't need indexed vertices. This takes more memory, but it reaches 562 fps!

5) Same as 3, but this time I'm using a VBO: only 270 fps.

6) Same as 4, but with a VBO: 278 fps.
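The strip-to-list unpacking used in tests 4 and 6 can be sketched as follows. This is not the poster's code: `Vertex` is a simplified stand-in and `unpackStripQuad` is a hypothetical helper name. It assumes each sprite arrives as four vertices in triangle-strip order.

```cpp
#include <array>
#include <cassert>

struct Vertex { float x, y, z; };  // simplified stand-in for the real vertex

// Expands one quad given in triangle-strip order (v0, v1, v2, v3) into two
// independent triangles. The second triangle is emitted as (v2, v1, v3) so
// that both triangles keep the same winding as they had in the strip.
inline std::array<Vertex, 6> unpackStripQuad(const std::array<Vertex, 4>& q) {
    return {{ q[0], q[1], q[2],  q[2], q[1], q[3] }};
}
```

The trade-off is size: six full vertices per sprite instead of four vertices plus six small indices, which is the extra memory mentioned in test 4.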

Assuming that I'm not doing anything wrong, the best mode is the second. It doesn't take much memory and the indexing is easy to do. With it I can hardcode some basic models and index them. Unpacking the vertices from strip to list can take a lot of resources and doesn't improve things much. I should avoid the all-in-one structures (I read in the OpenGL documentation that this is implemented for D3D compatibility) and store every attribute in a separate structure. For some reason, the VBO decreases performance, and with it SwapBuffers takes a lot of CPU. However, all these methods are CPU-limited, because the GPU isn't fully used. Much of the CPU time is drained by memcpy and SwapBuffers.

EDIT: I ran the same tests with the same unmodified software on another computer with an Intel HD 3000 (the first tests ran on a Radeon HD 4870): 62, 178, 124, 163, 97 and 207 fps. The VBO with a triangle list is much faster this time. I'm starting to get confused...

Edited by Retsu90

##### Share on other sites
> For some reason, the VBO decreases performance, and with it SwapBuffers takes a lot of CPU. However, all these methods are CPU-limited, because the GPU isn't fully used. Much of the CPU time is drained by memcpy and SwapBuffers.
>
> EDIT: I ran the same tests with the same unmodified software on another computer with an Intel HD 3000 (the first tests ran on a Radeon HD 4870): 62, 178, 124, 163, 97 and 207 fps. The VBO with a triangle list is much faster this time. I'm starting to get confused...

If you gave us more details about how you measured the time, maybe we could find the cause. SwapBuffers is not a time-consuming call. The reason it takes time is that it waits for drawing to finish. That implies your measured time is incorrect. How did you measure it?

##### Share on other sites
> > For some reason, the VBO decreases performance, and with it SwapBuffers takes a lot of CPU. However, all these methods are CPU-limited, because the GPU isn't fully used. Much of the CPU time is drained by memcpy and SwapBuffers.
> >
> > EDIT: I ran the same tests with the same unmodified software on another computer with an Intel HD 3000 (the first tests ran on a Radeon HD 4870): 62, 178, 124, 163, 97 and 207 fps. The VBO with a triangle list is much faster this time. I'm starting to get confused...
>
> If you gave us more details about how you measured the time, maybe we could find the cause. SwapBuffers is not a time-consuming call. The reason it takes time is that it waits for drawing to finish. That implies your measured time is incorrect. How did you measure it?

I'm measuring it with gDEBugger, setting SwapBuffers as the end-of-frame marker. With the Visual Studio profiler, I can clearly see that SwapBuffers takes 50% of the CPU time in a single frame.

##### Share on other sites

A VBO should be performant enough for the vast majority of cases. If you need better performance you can pass points and expand them to triangles in a geometry shader.

Read up here on some techniques for VBO optimization with sprite batching:

http://www.java-gaming.org/topics/opengl-lightning-fast-managed-vbo-mapping/28209/view.html

Since ultimately the performance may vary depending on the driver, the absolute "fastest" solution is to use whatever works best for the driver. For example, in the intro cutscene of your game you might benchmark a few different rendering techniques, and pick whichever runs the fastest.
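A start-up benchmark of that kind could be sketched like this. The `Technique` struct and `pickFastest` function are hypothetical names, and the sketch measures wall-clock time only. In a real renderer each `drawFrame` would need to end with `glFinish()` (or be timed with GPU timer queries where available), otherwise queued GPU work is not counted.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Technique {
    std::string name;
    std::function<void()> drawFrame;  // hypothetical: renders one full frame
};

// Runs each technique for `frames` frames and returns the index of the one
// with the lowest total wall-clock time.
inline std::size_t pickFastest(const std::vector<Technique>& techniques, int frames) {
    assert(!techniques.empty() && frames > 0);
    using Clock = std::chrono::steady_clock;
    std::size_t best = 0;
    Clock::duration bestTime = Clock::duration::max();
    for (std::size_t i = 0; i < techniques.size(); ++i) {
        const auto start = Clock::now();
        for (int f = 0; f < frames; ++f) techniques[i].drawFrame();
        const auto elapsed = Clock::now() - start;
        if (elapsed < bestTime) { bestTime = elapsed; best = i; }
    }
    return best;
}
```

Running a few hundred frames per technique, and discarding the first few, should give steadier numbers than a handful of frames.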

##### Share on other sites
> I'm measuring it with gDEBugger, setting SwapBuffers as the end-of-frame marker. With the Visual Studio profiler, I can clearly see that SwapBuffers takes 50% of the CPU time in a single frame.

Well, I don't know how gDEBugger measures execution time, but I have to draw your attention to the following facts:

1. All issued GL commands execute on both the CPU and the GPU.

2. CPU execution time is usually very short (of course, it depends on the command): commands are placed in a command queue and control returns to the caller.

3. SwapBuffers, as its name implies, exchanges the front and back buffers. In order to do that, it flushes the command queue and waits until the drawing is finished. It probably depends on the implementation, but on my laptop with Windows Vista it is a blocking function. Take a look at the attached picture.

Blue lines represent CPU time, while red ones represent GPU time. Although you could say SwapBuffers consumes 78% of the frame time, that is simply not true. The answer is in the blue line in the "Frame" window. The GPU takes about 13 ms to render the frame, although the CPU is utilized for only 0.67 ms. That's what I was talking about.

4. With vsync on, the frame rate can only be 120 (rarely), 60, 30, 15, etc. What you have posted is an effective frame rate. So it is better to use execution time instead of FPS.

5. Having an effective frame rate greater than 120 induces performance state changes, since the GPU is not utilized enough. It is very hard to profile an application in such circumstances. That's why I proposed performance state tracking alongside profiling (take a look at OpenGL Insights, pp. 527-534).
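To make point 4 concrete, here is a small sketch that converts FPS figures into time per frame. The helper names are made up for illustration. Applied to the numbers reported earlier in the thread, the drop from 720 to 350 fps is a cost of roughly 1.47 ms per frame. The same 1.47 ms added to a 60 fps frame (16.67 ms) would leave it at about 55 fps, which shows why FPS differences at very high frame rates look more dramatic than they are.

```cpp
#include <cassert>

// Converts a frames-per-second figure into milliseconds per frame.
inline double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra milliseconds per frame when the frame rate changes.
inline double frameCostMs(double fpsBefore, double fpsAfter) {
    return frameTimeMs(fpsAfter) - frameTimeMs(fpsBefore);
}
```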

##### Share on other sites

If you're running slower with a VBO then you're doing something wrong - most likely case is that the code you're using to update the VBO is causing sync points which are killing your framerate.  This is a common enough failing - you just can't treat a VBO as if it were just another block of memory that you can freely write to, read from, etc as if it were a regular pointer.  You should post the code you're using to update your VBO and it will be possible to comment further, but for now, and to get you started, a read of this article is recommended: http://www.opengl.org/wiki/Buffer_Object_Streaming
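One of the techniques the linked article describes is re-specifying ("orphaning") the buffer before each upload, so that the driver can hand out fresh storage instead of waiting for the GPU to finish with the old contents. The sketch below shows only the call order. `StreamBufferApi` is a hypothetical seam introduced here so the logic can run without a GL context; the comments name the real GL calls it would forward to.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Thin seam over the two GL calls involved. With a real context these would
// forward to:
//   allocate(bytes)         -> glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_STREAM_DRAW)
//   write(offset, bytes, p) -> glBufferSubData(GL_ARRAY_BUFFER, offset, bytes, p)
struct StreamBufferApi {
    std::function<void(std::size_t)> allocate;
    std::function<void(std::size_t, std::size_t, const void*)> write;
};

// Orphans the buffer storage, then fills the fresh storage with this batch.
// The capacity stays the same on every call so the driver can recycle blocks.
inline void streamVertices(const StreamBufferApi& gl, std::size_t capacityBytes,
                           const void* data, std::size_t bytes) {
    assert(bytes <= capacityBytes);
    gl.allocate(capacityBytes);  // orphan: same size each time, no data
    gl.write(0, bytes, data);    // upload the vertices for this batch
}
```

Whether this beats plain client-side arrays depends on the driver, which fits the mixed results reported on the two GPUs earlier in the thread.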

##### Share on other sites
> I should avoid the all-in-one structures (I read in the OpenGL documentation that this is implemented for D3D compatibility)

Not sure where in the OpenGL docs you read that, but you should be aware that the ability to use interleaved attribs has been part of OpenGL since the original GL_EXT_vertex_array in 1995.  In the common case interleaving should in fact be the faster option; there are certain cases for sure where it may be slower (such as using software T&L, doing a separate shadow pass, etc) but if none of those cases apply and if it still runs slower for you then - again - you've got something else wrong.
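For reference, an interleaved layout like the one used earlier in the thread can be described with `offsetof` instead of hand-counted float offsets. The member names below are assumptions; the sizes (3 floats position, 3 floats texture, 4 floats color) are inferred from the `glVertexAttribPointer` calls posted above.

```cpp
#include <cassert>
#include <cstddef>

// Layout inferred from the attribute pointer calls earlier in the thread.
struct SuperVertex {
    float position[3];
    float texture[3];
    float color[4];
};

// All members are floats, so no padding is expected between or after them.
static_assert(sizeof(SuperVertex) == 10 * sizeof(float), "unexpected padding");

// Byte offsets to pass (cast to a pointer) as the last argument of
// glVertexAttribPointer, with sizeof(SuperVertex) as the stride.
constexpr std::size_t kPositionOffset = offsetof(SuperVertex, position);
constexpr std::size_t kTextureOffset  = offsetof(SuperVertex, texture);
constexpr std::size_t kColorOffset    = offsetof(SuperVertex, color);
```

Deriving the offsets from the struct keeps the attribute setup correct if a member is later added or reordered.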

##### Share on other sites

Okay, over the last few days I rewrote the entire sprite system. I'm using a pre-calculated unsigned short array for vertex indices, and I copy the four vertices of each sprite into an array that is used as a cache. Now the framework reaches 997 fps with 20000 triangles. I'm currently using glVertexPointer and glDrawElements for OpenGL 2.1 compatibility. I'm binding only one texture per frame. I discovered that the rendering isn't really CPU-limited: I overclocked my video card (it was underclocked to save power) and the framework now reaches 1800 fps.

As for VBOs, I don't understand exactly how to initialize and use them properly. Isn't it the same thing as caching all the vertices in main memory and then sending them all together before calling SwapBuffers? I also forgot to mention that roughly 90% of the vertices change every frame, so caching them on the video card doesn't have much effect... I also don't understand how to buffer the uniforms and how to use them. Is it also possible to avoid glBindTexture? I know that I can pack the textures into a single big texture, but I'm asking whether there is another way to switch the texture binding within a batch/buffer.
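A pre-calculated index array of the kind described above might be built like this. This is a sketch, not the poster's code: `buildQuadIndices` is a hypothetical name, and it assumes each sprite's four vertices are stored in triangle-strip order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the index pattern for `spriteCount` quads whose four vertices are
// stored in triangle-strip order. Computed once and reused every frame.
inline std::vector<std::uint16_t> buildQuadIndices(std::size_t spriteCount) {
    // 16-bit indices can address 65536 vertices, i.e. at most 16384 quads.
    assert(spriteCount <= 16384);
    std::vector<std::uint16_t> indices;
    indices.reserve(spriteCount * 6);
    const std::uint16_t pattern[6] = {0, 1, 2, 2, 1, 3};
    for (std::size_t s = 0; s < spriteCount; ++s) {
        const std::size_t base = s * 4;
        for (std::uint16_t p : pattern)
            indices.push_back(static_cast<std::uint16_t>(base + p));
    }
    return indices;
}
```

The result is passed to `glDrawElements(GL_TRIANGLES, spriteCount * 6, GL_UNSIGNED_SHORT, ...)`. Because the pattern never changes, only the vertex data has to be refreshed each frame.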
