# Efficient instancing in OpenGL

The game I'm working on should be able to render dense forests with many trees and detailed foliage. I have been using instancing for drawing pretty much everything, but even so, I have lately hit some performance issues.

My implementation stores instance data in uniforms. I restrict object transformations to translation, uniform scale and rotation about one axis. For the rotation, I pass sin(angle) and cos(angle) as uniforms, so 6 floats are passed per instance. This way, I can easily draw 256 instances at once by calling glUniform4fv, glUniform2fv and glDrawElementsInstancedBaseVertex once per batch. I use this particular draw command because my large VBOs store multiple meshes.
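To make the 6-float layout concrete, here is a minimal CPU-side sketch of how such instance records could be packed and what the vertex shader would compute from them. The names `InstanceData`, `makeInstance`, `transformVertex` and `flatten` are hypothetical, and the rotation axis is assumed to be +Y, which the post does not state.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical per-instance record: 6 floats, matching the layout described
// above (translation, uniform scale, sin and cos of the rotation angle).
struct InstanceData {
    float posScale[4]; // x, y, z, uniform scale -> one element of a vec4 array
    float sinCos[2];   // sin(angle), cos(angle) -> one element of a vec2 array
};

// Build the record from a position, a scale and a rotation angle in radians.
inline InstanceData makeInstance(float x, float y, float z, float scale, float angle) {
    return InstanceData{{x, y, z, scale}, {std::sin(angle), std::cos(angle)}};
}

// CPU reference of the per-vertex transform: rotate, scale, then translate.
// ASSUMPTION: the single rotation axis is +Y.
inline void transformVertex(const InstanceData& inst, const float in[3], float out[3]) {
    const float s = inst.sinCos[0];
    const float c = inst.sinCos[1];
    const float k = inst.posScale[3];
    out[0] = k * ( c * in[0] + s * in[2]) + inst.posScale[0];
    out[1] = k * in[1]                    + inst.posScale[1];
    out[2] = k * (-s * in[0] + c * in[2]) + inst.posScale[2];
}

// Split a batch into the two flat float arrays that a vec4 uniform array and
// a vec2 uniform array would be filled from.
inline void flatten(const std::vector<InstanceData>& batch,
                    std::vector<float>& vec4s, std::vector<float>& vec2s) {
    assert(batch.size() <= 256); // batch size mentioned in the post
    vec4s.clear();
    vec2s.clear();
    for (const InstanceData& inst : batch) {
        vec4s.insert(vec4s.end(), inst.posScale, inst.posScale + 4);
        vec2s.insert(vec2s.end(), inst.sinCos, inst.sinCos + 2);
    }
}
```

With arrays like these, a batch would presumably be uploaded with something like `glUniform4fv(loc4, count, vec4s.data())` and `glUniform2fv(loc2, count, vec2s.data())`, where `loc4` and `loc2` are placeholder uniform locations.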

Lately I have noticed that the performance is too low for my purposes. I used gDebugger in an attempt to find the bottleneck. The frame rate was initially roughly 40 FPS. Lowering texture resolution had no effect. Disabling raster commands had a negligible effect. Disabling draw commands boosted the frame rate to over 100 FPS. My conclusion is that execution is neither CPU-bound nor raster-operation-bound, but limited by vertex processing.

I'm also using impostors for the trees, and level of detail for the meshes, but I have the feeling that I should be able to draw more instances of the meshed trees than what I'm currently able to. I actually had quite ok FPS of 80 with just the trees in place, but adding the foliage (a lot of instances of small low poly meshes) dropped the FPS to 40. Disabling either the trees or the foliage increases the FPS significantly. Disabling the terrain, which uses a lot of polygons, has no effect, so I think the issue is not being just bound by polygon count.

Could it be that uploading the uniform data is the limiting factor?

For some of the instanced object types, such as the trees, the transformation data is static and is stored in the leaf nodes of a bounding volume hierarchy (BVH) in proper arrays, so that glUniform* can be called without further assembly of data. It would then make sense to actually store these arrays in video memory. What is the best way to do this these days? I think that VBOs are used in conjunction with glVertexAttribDivisor. To me this does not seem a very neat approach, as "vertex attributes" are used for something that is clearly of a "uniform" nature. But anyway, I could then make one huge VBO for the entire BVH and store a base instance value and an instance count for each leaf node. To render a leaf node, I would then use glDrawElementsInstancedBaseVertexBaseInstance. This requires the GL 4.2 core spec, which might be a bit too high. Are there better options? I also have objects (the foliage) for which the transformation data is dynamic (updated occasionally), as they are only placed around the camera. What would be the best way to store and transfer the transformation data in this case?
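The "one huge VBO plus a base instance and count per leaf" idea can be sketched on the CPU side as follows. This is only an illustration under assumptions: `Instance`, `LeafRange` and `packLeaves` are hypothetical names, and the leaves are modelled as plain vectors rather than as real BVH nodes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Instance { float data[6]; }; // hypothetical 6-float instance record

// What each leaf would need to remember in order to issue its draw call.
struct LeafRange {
    std::uint32_t baseInstance;  // index of the leaf's first instance in the shared array
    std::uint32_t instanceCount; // number of instances stored for this leaf
};

// Concatenate the per-leaf arrays into one array (the contents of a single
// instance VBO) and record where each leaf's instances start.
inline std::vector<LeafRange> packLeaves(const std::vector<std::vector<Instance>>& leaves,
                                         std::vector<Instance>& packed) {
    std::vector<LeafRange> ranges;
    ranges.reserve(leaves.size());
    packed.clear();
    for (const auto& leaf : leaves) {
        assert(packed.size() + leaf.size() <= 0xFFFFFFFFu); // must fit a 32-bit index
        ranges.push_back({static_cast<std::uint32_t>(packed.size()),
                          static_cast<std::uint32_t>(leaf.size())});
        packed.insert(packed.end(), leaf.begin(), leaf.end());
    }
    return ranges;
}
```

The `packed` array would be uploaded once, its attributes given a divisor of 1 with `glVertexAttribDivisor`, and each visible leaf drawn with its `instanceCount` and `baseInstance`. Where GL 4.2 is not available, a commonly used fallback is to re-specify the instance attribute pointers with a byte offset of `baseInstance * sizeof(Instance)` before each leaf's draw call; whether that is fast enough here would need measuring.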

##### Share on other sites

> I actually had quite ok FPS of 80 with just the trees in place, but adding the foliage (a lot of instances of small low poly meshes) dropped the FPS to 40.

Don't use FPS to measure performance. Always use milliseconds per frame. 80fps is 12.5ms per frame. 40fps is 25ms per frame.
This probably means that your foliage is taking 25-12.5 = 12.5ms on the GPU.
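The frame-time arithmetic above can be written as two small helpers. The function names are made up for illustration.

```cpp
#include <cassert>

// Convert a frame rate into milliseconds per frame.
inline double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra time per frame, in ms, when the frame rate drops from fpsBefore to fpsAfter.
inline double addedCostMs(double fpsBefore, double fpsAfter) {
    return frameTimeMs(fpsAfter) - frameTimeMs(fpsBefore);
}
```

For the numbers in this thread, `addedCostMs(80.0, 40.0)` gives 12.5 ms, while a drop of the same 40 FPS from 200 to 160 costs only 1.25 ms, which is why frame time is the more honest unit.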
Instancing ONLY helps you out on the CPU-side of things - it does nothing to lighten the GPU-side workload. What kind of pixel shader is used on the foliage polys? How much overdraw is there? Is blending / alpha testing enabled? How many pixels are covered (counting overdrawn pixels multiple times)? Does changing the frame-buffer resolution impact performance? What kind of GPU are you using?

You can use the high performance timer to measure the CPU duration of different operations per frame and EXT_timer_query to measure the GPU duration of different parts per frame. Measure the CPU and GPU costs so you're sure which one is actually the bottleneck.

It's also a good idea to time how long glSwapBuffers takes to execute -- if significant CPU time is spent in that call, it usually indicates that the CPU is waiting on the GPU (or is waiting for a vblank with vsync enabled -- disable vsync when profiling to avoid this).

> Could it be that uploading the uniform data is the limiting factor?

If so, you'd probably notice a significant amount of CPU time spent inside the functions that map/update those buffers.

##### Share on other sites

> Don't use FPS to measure performance. Always use milliseconds per frame. 80fps is 12.5ms per frame. 40fps is 25ms per frame.
> This probably means that your foliage is taking 25-12.5 = 12.5ms on the GPU.

Good remark. I'm actually aware of this, but old habits die hard. I know that drop from 80 to 40 fps is significant in terms of frame time. If I was experiencing a drop from 200 to 160, I would not be writing this.

> Instancing ONLY helps you out on the CPU-side of things - it does nothing to lighten the GPU-side workload. What kind of pixel shader is used on the foliage polys? How much overdraw is there? Is blending / alpha testing enabled? How many pixels are covered (counting overdrawn pixels multiple times)? Does changing the frame-buffer resolution impact performance?

The shaders for trees and foliage are mostly simple, although the vertices are animated in the vertex shader, where a sin function is evaluated. I'm using alpha testing, not blending. Objects (render batches) are roughly ordered front to back, but there is probably still significant overdraw. However, I thought that these factors were ruled out by the fact that disabling raster commands in gDebugger had practically no effect. I would expect that this also rules out being limited by frame-buffer bandwidth, although I will try lowering the frame-buffer resolution when I get a chance.

> You can use the high performance timer to measure the CPU duration of different operations per frame and EXT_timer_query to measure the GPU duration of different parts per frame.
>
> It's also a good idea to time how long glSwapBuffers takes to execute -- if significant CPU time is spent in that call, it usually indicates that the CPU is waiting on the GPU.

Didn't know about EXT_timer_query. This should prove useful in general. In this case, I could measure the time spent by glUniform* to upload the instance data. I will try to time glSwapBuffers.

##### Share on other sites

I tried measuring the time that my "drawFoliage" function takes by averaging over 10 frames. This function basically keeps the instance data up to date, calls glUniform*, does the shader and texture binds, and issues the instanced draw call. I get 0 ms or 1 ms. I tried this with both SDL_GetTicks and clock_gettime. I think both yield 1 ms accuracy on my machine, so the result is not that accurate. The timer library Hodgman linked also seems to promise at least 1 ms accuracy. Maybe I'll give it a try to see if I'm lucky and get more precision.
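For sub-millisecond CPU timings averaged over several frames, one option (not the library mentioned above, just a standard-library sketch) is `std::chrono::steady_clock`. Its resolution is implementation-defined, but on desktop platforms it is typically much finer than 1 ms. The class name `SectionTimer` is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Accumulates the CPU time of one code section over several frames and
// reports the mean duration in milliseconds as a floating-point value.
class SectionTimer {
    using Clock = std::chrono::steady_clock;

public:
    // Call right before the section to be measured.
    void begin() { start_ = Clock::now(); }

    // Call right after the section; adds one sample.
    void end() {
        total_ += Clock::now() - start_;
        ++samples_;
    }

    // Mean duration per sample in milliseconds, with fractional part.
    double averageMs() const {
        assert(samples_ > 0);
        return std::chrono::duration<double, std::milli>(total_).count() / samples_;
    }

private:
    Clock::time_point start_{};
    Clock::duration total_{0};
    int samples_ = 0;
};
```

Usage would be `timer.begin(); drawFoliage(); timer.end();` each frame and reading `timer.averageMs()` after, say, 10 frames. Note that this only measures the CPU time spent submitting the commands, not the time the GPU takes to execute them.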

I also measured the time taken by SDL_GL_SwapBuffers(). It turned out to be between 0 ms and 36 ms! It seems also to be the case that zero and a larger value alternate.

I changed the shaders for the foliage to simpler ones with neither normal mapping nor animation, but it had no effect.

Wouldn't the most accurate way to measure CPU time spent in functions be the use of a profiler? Often the large amount of time used in the initialization routines makes the results a bit difficult to read, but the call graph is pretty instructive. Of course per frame timing is not possible this way.

##### Share on other sites

It looks a lot like you're seeing the effects of buffering here.  I suggest adding a glFinish call before you get your start time, and another before you get your end time, which will give you the actual time that the driver and your hardware spend doing work; otherwise all that you're really measuring is the time taken to add everything to a command buffer: effectively the equivalent of a handful of memcpy calls.

The glFinish before getting your start time ensures that any pending work is completed before it returns, so that doesn't skew your results.

The glFinish before getting your end time ensures that the work you've just submitted is completed before it returns, otherwise the GL calls you're profiling may not actually issue on the GPU until up to 3 frames later.
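The flush-before, flush-after pattern described here can be sketched as a small helper. To keep the example compilable without an OpenGL context, the flush is passed in as a callable; in real code it would wrap `glFinish`. The name `timeFlushedMs` is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Times `work` with a full pipeline flush before and after it.
// `finish` stands in for glFinish (an assumption made so this sketch does
// not depend on GL headers); `work` is the block of GL calls being profiled.
template <typename Finish, typename Work>
double timeFlushedMs(Finish&& finish, Work&& work) {
    using Clock = std::chrono::steady_clock;
    finish();                        // drain anything submitted earlier
    const auto start = Clock::now();
    work();                          // submit the commands to be measured
    finish();                        // block until they have actually executed
    const auto stop = Clock::now();
    assert(stop >= start);           // steady_clock never goes backwards
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

In a renderer this might be called as `timeFlushedMs([] { glFinish(); }, [] { drawFoliage(); })`. Since `glFinish` stalls the pipeline, this is something to use while profiling only, not in a shipping frame loop.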

##### Share on other sites

> It looks a lot like you're seeing the effects of buffering here. I suggest adding a glFinish call before you get your start time, and another before you get your end time, which will give you the actual time that the driver and your hardware spend doing work; otherwise all that you're really measuring is the time taken to add everything to a command buffer: effectively the equivalent of a handful of memcpy calls.
>
> The glFinish before getting your start time ensures that any pending work is completed before it returns, so that doesn't skew your results.
>
> The glFinish before getting your end time ensures that the work you've just submitted is completed before it returns, otherwise the GL calls you're profiling may not actually issue on the GPU until up to 3 frames later.

Thanks for the tip.

I tried this:

```cpp
glFinish();
long int time = getTimeMilliSec();
SDL_GL_SwapBuffers();
glFinish();
cout << getTimeMilliSec() - time << endl;
```

Now I get 0 or 1 ms.

##### Share on other sites

It once again turned out that something completely different was going on. I first removed the call to the "drawFoliage" function, but surprisingly the framerate was still low. After removing the piece of code that pushes the foliage data into the BVH the framerate increased back to a good level. I then enabled my BVH visualizer code to see what was going on. I made a little illustration of this: http://www.perilouspenguin.com/pics/illustration.jpg .

My BVH works so that it starts off as a quadtree, but each object belongs to a single leaf only. This is achieved by refitting the bounding box of a node each time the objects have been divided into four subnodes. This makes it possible to have ready-to-use render lists in the leaves. Of course, it is assumed that no single object spans a significant portion of the map. This has worked well thus far, but for some reason adding the foliage groups (which are small-area collections of low-poly foliage objects) totally screwed up the BVH node distribution. As a result, many more objects were considered visible at any location than before. I managed to mitigate this issue somewhat by increasing the BVH subdivision depth, but I feel that this is not the proper fix.
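One subdivision step of a quadtree with refitted bounds can be sketched as below. This is not the code from the post: it is a simplified 2D illustration with hypothetical names (`Box`, `Child`, `splitAndRefit`, `overlaps`), and it assumes objects are assigned to quadrants by the centre of their bounding box. It shows one way refitted child boxes can grow past the split planes and overlap their siblings, which makes more leaves pass a visibility test. Whether this is what happened with the foliage groups is not known.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

struct Box { float minX, minZ, maxX, maxZ; }; // 2D footprint of an object or node

// Smallest box containing both a and b.
inline Box merge(const Box& a, const Box& b) {
    return {std::min(a.minX, b.minX), std::min(a.minZ, b.minZ),
            std::max(a.maxX, b.maxX), std::max(a.maxZ, b.maxZ)};
}

struct Child {
    std::vector<Box> objects; // each object lands in exactly one child
    Box bounds{};             // refitted bounds: union of the objects' boxes
};

// Assign each object to the quadrant containing its centre, then refit each
// child's bounds to the objects it actually received.
inline std::array<Child, 4> splitAndRefit(const std::vector<Box>& objects,
                                          float splitX, float splitZ) {
    std::array<Child, 4> children;
    for (const Box& o : objects) {
        const float cx = 0.5f * (o.minX + o.maxX);
        const float cz = 0.5f * (o.minZ + o.maxZ);
        const int q = (cx >= splitX ? 1 : 0) + (cz >= splitZ ? 2 : 0);
        Child& c = children[q];
        c.bounds = c.objects.empty() ? o : merge(c.bounds, o);
        c.objects.push_back(o);
    }
    return children;
}

// True if the interiors of the two boxes intersect.
inline bool overlaps(const Box& a, const Box& b) {
    return a.minX < b.maxX && b.minX < a.maxX &&
           a.minZ < b.maxZ && b.minZ < a.maxZ;
}
```

With two small objects on either side of a split at x = 10, the refitted child boxes stay disjoint. Adding one wide object whose centre is just left of the split stretches the left child's box across it, so both children are then hit by queries near the boundary. Visualizing the refitted boxes, as was done here, is a quick way to check for this.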
