# OpenGL Most efficient way to batch drawings

## 13 posts in this topic

Hi, I'm interested in a bit of theory about the best optimization methods for OpenGL 3.0 (where a lot of functions became deprecated).

In my current 2D framework, every sprite has its own program with its own uniform values. Every sprite is drawn separately and, now that I've switched from 2.1 to 3.0, every sprite has its own projection and view matrices. My goal now is to batch as many vertices as possible, and these are some ideas:

1) Use only one program for everything. There is a single projection matrix, so I can group the vertices, send the attribute values to the shader via glVertexAttribPointer, and draw everything with one call. The problem is the model-view matrix: it would be the same for every vertex, which isn't what I want because every sprite has its own matrix.

2) Continue to use several shaders. The projection matrix is shared between programs (how can I do that?), and every sprite has its own shader with its own model-view matrix and uniform values. The problem here is that I need to switch programs between sprite draws.

Neither of these ideas works as I expected, so I'm here to ask what the most efficient way to batch drawing in OpenGL 3.0 is.

##### Share on other sites
If only one or two transform matrices are unique for every sprite, I see no reason why you can't draw them all in one single call. You can index into a uniform array or into a buffer texture to read these, using e.g. gl_InstanceID in the vertex shader if you use instancing (or gl_VertexID divided by 4 otherwise).

Or, you can generate quads from points in the geometry shader and use either gl_VertexID or gl_PrimitiveID (which are the same in that particular case) as an index (in that case, transform is done in the GS too). A sprite likely does not have a dozen output attributes, so the geometry shader should be reasonably efficient, too.

Either solution is a thousand times more efficient than binding different uniforms (or even shaders!) for every sprite, or for some subset of sprites that you have determined with some clever batching algorithm. Edited by samoth
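
A minimal sketch of the gl_VertexID lookup described above, assuming indexed drawing with four vertices per sprite (with an unpacked list of six vertices per sprite the divisor would be 6). The shader text, the uniform names `u_projection` and `u_modelView`, and the batch size of 48 are illustrative assumptions, not code from this thread:

```cpp
#include <cassert>

// Hypothetical batch size: 48 mat4 plus one projection matrix is
// 784 float components, which stays under a 1024-component limit.
constexpr int kVerticesPerSprite = 4;
constexpr int kMaxSpritesPerBatch = 48;  // must match the array size in the shader

// Hypothetical GLSL 1.30 vertex shader: every vertex picks its sprite's
// matrix from a uniform array by integer-dividing gl_VertexID.
static const char* kSpriteVertexShader = R"glsl(
#version 130
uniform mat4 u_projection;
uniform mat4 u_modelView[48];
in vec3 vert_position;
in vec2 vert_texture;
out vec2 frag_texture;
void main() {
    int sprite = gl_VertexID / 4;   // 4 vertices per sprite
    frag_texture = vert_texture;
    gl_Position = u_projection * u_modelView[sprite] * vec4(vert_position, 1.0);
}
)glsl";

// CPU-side mirror of the shader's index expression, handy when filling
// the matrix array in the same order as the vertices.
inline int spriteIndexForVertex(int vertexId) {
    assert(vertexId >= 0);
    assert(vertexId < kVerticesPerSprite * kMaxSpritesPerBatch);
    return vertexId / kVerticesPerSprite;
}
```

With instancing the same lookup would use gl_InstanceID directly instead of the division.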

##### Share on other sites
> 1) Use only one program for everything. There is a single projection matrix, so I can group the vertices, send the attribute values to the shader via glVertexAttribPointer, and draw everything with one call. The problem is the model-view matrix: it would be the same for every vertex, which isn't what I want because every sprite has its own matrix.

I'm not sure what you mean by this, but maybe what you want is instancing.

> 2) Continue to use several shaders. The projection matrix is shared between programs (how can I do that?), and every sprite has its own shader with its own model-view matrix and uniform values. The problem here is that I need to switch programs between sprite draws.

Uniform buffers make sharing easy. Edited by max343

##### Share on other sites
> If only one or two transform matrices are unique for every sprite, I see no reason why you can't draw them all in one single call. You can index into a uniform array or into a buffer texture to read these, using e.g. gl_InstanceID in the vertex shader if you use instancing (or gl_VertexID divided by 4 otherwise).
>
> Or, you can generate quads from points in the geometry shader and use either gl_VertexID or gl_PrimitiveID (which are the same in that particular case) as an index (in that case, transform is done in the GS too). A sprite likely does not have a dozen output attributes, so the geometry shader should be reasonably efficient, too.
>
> Either solution is a thousand times more efficient than binding different uniforms (or even shaders!) for every sprite, or for some subset of sprites that you have determined with some clever batching algorithm.

So if I have 100 sprites, should I send 100 model-view matrices with glUniformMatrix4fv and select them with gl_VertexID / 4?

> 1) Use only one program for everything. There is a single projection matrix, so I can group the vertices, send the attribute values to the shader via glVertexAttribPointer, and draw everything with one call. The problem is the model-view matrix: it would be the same for every vertex, which isn't what I want because every sprite has its own matrix.
>
> I'm not sure what you mean by this, but maybe what you want is instancing.
>
> 2) Continue to use several shaders. The projection matrix is shared between programs (how can I do that?), and every sprite has its own shader with its own model-view matrix and uniform values. The problem here is that I need to switch programs between sprite draws.
>
> Uniform buffers make sharing easy.

Yes, I mean instancing (I only just learned what instancing is). Do you recommend sending the matrices in a uniform array or in a texture?

##### Share on other sites
I always prefer using uniform buffers. Initially the piping is a bit tricky to understand, but once you grasp that part, their advantages over textures are apparent.

BTW, OpenGL 3 supports instancing. Edited by max343

##### Share on other sites
> So if I have 100 sprites, should I send 100 model-view matrices with glUniformMatrix4fv and select them with gl_VertexID / 4?

I didn't read into it the first time, but the answer is no. A big no. It's much better to use uniform buffers for something this big (or for something that you're going to share). In fact it's better to limit the usage of global uniforms only to those cases in which the overhead of using the buffer is greater.
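
To put rough numbers on "something this big", here is a small sketch of the capacity arithmetic. It assumes the limits are queried at runtime with glGetIntegerv (GL_MAX_VERTEX_UNIFORM_COMPONENTS for plain uniforms, GL_MAX_UNIFORM_BLOCK_SIZE for uniform blocks); the figures 1024 components and 16384 bytes used in the usage note are, as far as I know, the minimums the specification guarantees, and real hardware usually offers more. The function names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");

constexpr std::size_t kFloatsPerMat4 = 16;
constexpr std::size_t kBytesPerMat4 = kFloatsPerMat4 * sizeof(float);  // 64 bytes

// Plain uniforms are limited in float components.
inline std::size_t maxMat4InDefaultUniforms(std::size_t maxUniformComponents) {
    return maxUniformComponents / kFloatsPerMat4;
}

// Uniform blocks are limited in bytes; a mat4 occupies 64 bytes in std140.
inline std::size_t maxMat4InUniformBlock(std::size_t maxBlockBytes) {
    return maxBlockBytes / kBytesPerMat4;
}

// Number of draw batches needed when each batch can carry perBatch matrices.
inline std::size_t batchesNeeded(std::size_t spriteCount, std::size_t perBatch) {
    assert(perBatch > 0);
    return (spriteCount + perBatch - 1) / perBatch;
}
```

Under those assumed minimums, 100 per-sprite matrices would not fit in plain uniforms in one go (64 at most, before counting any other uniform), while a single uniform block could hold up to 256.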

##### Share on other sites

Okay, I've reduced the number of shaders to just one and implemented VBOs. I'm unpacking the triangle strips into a triangle list stored in a structure of 512 * sizeof(Vertex) bytes. When the structure is full, I build and draw the VBO with this:

```cpp
// Upload the cached vertices to the currently bound VBO.
glBufferData(GL_ARRAY_BUFFER, m_vertexcacheIndex * sizeof(SuperVertex), m_vertexcache, GL_DYNAMIC_DRAW);
// Interleaved layout: position (3 floats), texture (3 floats), color (4 floats).
glVertexAttribPointer(vert_position, 3, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(0 * sizeof(float)));
glVertexAttribPointer(vert_texture, 3, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(3 * sizeof(float)));
glVertexAttribPointer(vert_color, 4, GL_FLOAT, GL_FALSE, sizeof(SuperVertex), BUFFER_OFFSET(6 * sizeof(float)));
// Draw the whole batch, then reset the cache.
glDrawArrays(GL_TRIANGLES, 0, m_vertexcacheIndex);
m_vertexcacheIndex = 0;
```

where m_vertexcacheIndex is the vertex count inside the structure, m_vertexcache is the structure itself and SuperVertex is the vertex type. I profiled the program with gDEBugger: before the VBO I was making 12k GL calls per frame, now only 120, but performance is worse. Before it ran at 720 fps, now 350...

Edited by Retsu90

##### Share on other sites

I did some tests:

1) Call glVertexAttribPointer and glDrawArrays with GL_TRIANGLE_STRIP for every sprite (the original approach, from before I created this post): 498 fps. The stride here is 0, which means that vertex positions, texture coordinates and colors are in separate arrays.

2) Cache the vertices in an array of 1024 structures, copying the vertices into the cache with memcpy. When the array is full, its content is drawn with glVertexAttribPointer and glDrawElements with GL_TRIANGLE_STRIP, using indexed vertices. The stride is 0. 589 fps!!!

3) Same as above, but vertex position, texture coordinates and color are in the same structure, which means I need to call memcpy only once to copy the sprite model. I was expecting an improvement. 399 fps.

4) Same as 3, but this time I'm unpacking the vertices from GL_TRIANGLE_STRIP to GL_TRIANGLES. I pass the 4 vertices and a function unpacks them into 6 vertices. This way I don't need indexed vertices. It takes more memory, but it reaches 562 fps!

5) Same as 3, but using a VBO: only 270 fps.

6) Same as 4 but with a VBO: 278 fps.
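
For reference, the strip-to-list unpacking mentioned in test 4 can be sketched as below. The `Vertex` type and the function name are hypothetical stand-ins, and the vertex order assumes the usual strip convention where the second triangle's first two vertices are swapped to keep the winding consistent:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal vertex; the real one would also carry texture and color.
struct Vertex {
    float x, y;
};

// Expands one quad given in GL_TRIANGLE_STRIP order (v0, v1, v2, v3) into
// two GL_TRIANGLES triangles: (v0, v1, v2) and (v2, v1, v3).
inline void appendQuadAsTriangles(const std::array<Vertex, 4>& strip,
                                  std::vector<Vertex>& out) {
    static const std::size_t order[6] = {0, 1, 2, 2, 1, 3};
    for (std::size_t i : order) {
        out.push_back(strip[i]);
    }
    // The output should always hold whole quads (6 vertices each).
    assert(out.size() % 6 == 0);
}
```

This costs 6 vertices per sprite instead of 4, which is the extra memory noted in test 4.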

Assuming I'm not doing anything wrong, the second approach is the best. It doesn't take much memory and the indexing is easy to do; with it I can hardcode some basic models and index them. Unpacking vertices from strip to list can take a lot of resources and doesn't improve things much. I should avoid all-in-one structures (I read in the OpenGL documentation that they exist for D3D compatibility) and store every attribute in a separate array. For some reason, VBOs decrease performance, and with them SwapBuffers takes a lot of CPU. However, all these methods are CPU-limited, because the GPU isn't fully used. Much of the CPU time is drained by memcpy and SwapBuffers.

EDIT: I ran the same tests with the same unmodified software on another computer with an Intel HD 3000 (the first tests ran on a Radeon HD 4870): 62, 178, 124, 163, 97, 207 fps. The VBO with a triangle list is much faster this time. I'm starting to get confused...

Edited by Retsu90

##### Share on other sites
> For some reason, VBOs decrease performance, and with them SwapBuffers takes a lot of CPU. However, all these methods are CPU-limited, because the GPU isn't fully used. Much of the CPU time is drained by memcpy and SwapBuffers.
>
> EDIT: I ran the same tests with the same unmodified software on another computer with an Intel HD 3000 (the first tests ran on a Radeon HD 4870): 62, 178, 124, 163, 97, 207 fps. The VBO with a triangle list is much faster this time. I'm starting to get confused...

If you gave us more details about how you measured the time, maybe we could find the cause. SwapBuffers is not a time-consuming call in itself. The reason it takes time is that it waits for drawing to finish. That implies your measured time is incorrect. How did you measure it?

##### Share on other sites
> If you gave us more details about how you measured the time, maybe we could find the cause. SwapBuffers is not a time-consuming call in itself. The reason it takes time is that it waits for drawing to finish. That implies your measured time is incorrect. How did you measure it?

I'm measuring it with gDEBugger, setting SwapBuffers as the end-of-frame marker. With the Visual Studio profiler, I can clearly see that SwapBuffers takes 50% of the CPU time in a single frame.

##### Share on other sites

A VBO should be performant enough for the vast majority of cases. If you need better performance you can pass points and expand them to triangles in a geometry shader.

Read up here on some techniques for VBO optimization with sprite batching:

http://www.java-gaming.org/topics/opengl-lightning-fast-managed-vbo-mapping/28209/view.html

Since ultimately the performance may vary depending on the driver, the absolute "fastest" solution is to use whatever works best for the driver. For example, in the intro cutscene of your game you might benchmark a few different rendering techniques, and pick whichever runs the fastest.

##### Share on other sites
> I'm measuring it with gDEBugger, setting SwapBuffers as the end-of-frame marker. With the Visual Studio profiler, I can clearly see that SwapBuffers takes 50% of the CPU time in a single frame.

Well, I don't know how gDEBugger measures execution time, but I have to draw your attention to the following facts:

1. All issued GL commands execute on both CPU and GPU.

2. CPU execution time is usually very short (of course, it depends on the command): commands are placed in a command queue and control returns to the CPU.

3. SwapBuffers, as its name implies, exchanges the front and back buffers. In order to do that, it flushes the command queue and waits until the drawing is finished. It is probably implementation-dependent, but on my laptop with Windows Vista it is a blocking function. Take a look at the attached picture.

Blue lines represent CPU time, while red ones represent GPU time. Although you could say SwapBuffers consumes 78% of the frame time, that is simply not true. The answer is in the blue line in the "Frame" window. The GPU takes about 13 ms to render the frame, although the CPU is busy for only 0.67 ms. That's what I was talking about.

4. The frame rate can only be 120 (rarely), 60, 30, 15, etc. if vsync is on. What you have posted is an effective frame rate. So it is better to use execution time instead of FPS.

5. Having an effective frame rate greater than 120 induces performance-state changes, since the GPU is not utilized enough. It is very hard to profile an application in such circumstances. That's why I proposed performance-state tracking alongside profiling (take a look at OpenGL Insights, pp. 527-534).
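
To illustrate point 4, a tiny sketch that converts the frame rates posted earlier in the thread into frame times. The helper names are hypothetical; the arithmetic is just 1000 / fps:

```cpp
#include <cassert>

// Milliseconds spent on one frame at a given effective frame rate.
inline double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra milliseconds per frame when the rate drops from fpsBefore to fpsAfter.
inline double addedCostMs(double fpsBefore, double fpsAfter) {
    return frameTimeMs(fpsAfter) - frameTimeMs(fpsBefore);
}
```

Seen this way, the drop from 720 fps to 350 fps is roughly 1.39 ms versus 2.86 ms per frame, an added cost of about 1.5 ms, which is far easier to reason about than the raw FPS figures.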

##### Share on other sites

If you're running slower with a VBO then you're doing something wrong - most likely case is that the code you're using to update the VBO is causing sync points which are killing your framerate.  This is a common enough failing - you just can't treat a VBO as if it were just another block of memory that you can freely write to, read from, etc as if it were a regular pointer.  You should post the code you're using to update your VBO and it will be possible to comment further, but for now, and to get you started, a read of this article is recommended: http://www.opengl.org/wiki/Buffer_Object_Streaming
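
As a rough illustration of one streaming pattern (append each batch at a fresh offset and re-specify the buffer storage only when it is full), here is the bookkeeping part as plain C++. `StreamCursor` and `StreamSlot` are hypothetical names, the GL calls appear only in comments, and whether this avoids sync points on a given driver is an assumption to be verified by profiling:

```cpp
#include <cassert>
#include <cstddef>

// Where to write the next batch inside the VBO.
struct StreamSlot {
    std::size_t offset;  // byte offset for this batch
    bool orphan;         // true: re-specify the buffer storage before writing
};

class StreamCursor {
public:
    explicit StreamCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes), head_(0) {}

    // Reserves `bytes` for one batch. When the batch does not fit in the
    // remaining space, the caller is told to orphan the buffer, e.g. with
    //   glBufferData(GL_ARRAY_BUFFER, capacity, NULL, GL_STREAM_DRAW);
    // and writing restarts at offset 0. The batch itself could then be
    // uploaded with glBufferSubData(GL_ARRAY_BUFFER, offset, bytes, data).
    StreamSlot reserve(std::size_t bytes) {
        assert(bytes <= capacity_);
        const bool orphan = head_ + bytes > capacity_;
        if (orphan) {
            head_ = 0;
        }
        const StreamSlot slot{head_, orphan};
        head_ += bytes;
        return slot;
    }

private:
    std::size_t capacity_;  // total size of the VBO in bytes
    std::size_t head_;      // first unused byte
};
```

The point of the design is that data the GPU may still be reading is never overwritten: each batch lands in unused space, and a full buffer is replaced rather than rewritten.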

##### Share on other sites
> I should avoid the structures all-in-one (I read from OpenGL documentation that it's implemented for D3D compatibility)

Not sure where in the OpenGL docs you read that, but you should be aware that the ability to use interleaved attribs has been part of OpenGL since the original GL_EXT_vertex_array in 1995.  In the common case interleaving should in fact be the faster option; there are certain cases for sure where it may be slower (such as using software T&L, doing a separate shadow pass, etc) but if none of those cases apply and if it still runs slower for you then - again - you've got something else wrong.
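
For concreteness, an interleaved vertex matching the offsets used in the glVertexAttribPointer calls posted earlier (0, 3 and 6 floats, with a stride of sizeof(SuperVertex)) might look like this. The member names are guesses, since the actual SuperVertex definition was never posted:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical interleaved layout: all attributes of one vertex are adjacent
// in memory, so a vertex fetch touches one contiguous 40-byte record.
struct SuperVertex {
    float position[3];  // attribute offset 0
    float texture[3];   // attribute offset 3 * sizeof(float)
    float color[4];     // attribute offset 6 * sizeof(float)
};

// These must hold for the posted attribute pointers to be valid.
static_assert(offsetof(SuperVertex, position) == 0, "position first");
static_assert(offsetof(SuperVertex, texture) == 3 * sizeof(float), "texture after position");
static_assert(offsetof(SuperVertex, color) == 6 * sizeof(float), "color after texture");
static_assert(sizeof(SuperVertex) == 10 * sizeof(float), "no padding expected");

// Size in bytes of a cache holding `count` vertices.
inline std::size_t bytesForVertices(std::size_t count) {
    return count * sizeof(SuperVertex);
}
```

Checking the layout at compile time rules out one possible cause of a slow or broken interleaved path, namely offsets that do not match the struct.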

##### Share on other sites

Okay, over the last few days I rewrote the entire sprite system. I'm using a pre-calculated unsigned short array for vertex indices, and I copy the four vertices of each sprite into an array that is used as a cache. Now the framework reaches 997 fps with 20000 triangles. I'm currently using glVertexPointer and glDrawElements for OpenGL 2.1 compatibility. I'm binding only one texture per frame. I discovered that the rendering isn't really CPU-limited: I overclocked my video card (it was underclocked to save power) and the framework reaches 1800 fps. As for VBOs, I don't understand exactly how to initialize and use them properly. Isn't it the same thing as caching all the vertices in main memory and then sending them all together before calling SwapBuffers? I also forgot to mention that about 90% of the vertices change every frame, so caching them on the video card doesn't have a great effect... I also don't understand how to buffer the uniforms and how to use them. Is it also possible to avoid glBindTexture? I know that I can pack the textures into a single big texture, but I'm asking whether there is another way to switch texture bindings within a batch/buffer.
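
A sketch of what such a pre-calculated index array could look like, assuming each sprite stores its four vertices in strip order and is drawn as two GL_TRIANGLES triangles. The function name and the index pattern are assumptions, since the actual code was not posted:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds indices for `quadCount` sprites with 4 vertices each. Quad q uses
// vertices 4q .. 4q+3 and emits triangles (0, 1, 2) and (2, 1, 3) relative
// to its first vertex. The result can be computed once and reused every frame.
inline std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount) {
    // 16-bit indices can address at most 65536 vertices, i.e. 16384 quads.
    assert(quadCount <= 65536 / 4);
    static const std::uint16_t pattern[6] = {0, 1, 2, 2, 1, 3};
    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        const std::size_t base = q * 4;
        for (std::uint16_t p : pattern) {
            indices.push_back(static_cast<std::uint16_t>(base + p));
        }
    }
    return indices;
}
```

With unsigned short indices, 20000 triangles (10000 sprites) fit comfortably in a single array under this scheme.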
