Efficient instancing in OpenGL

5 comments, last by jmakitalo 9 years, 9 months ago

The game I'm working on should be able to render dense forests with many trees and detailed foliage. I have been using instancing for drawing pretty much everything, but even so, I have lately hit some performance issues.

My implementation is based on storing instance data in uniforms. I restrict the object transformations so that only translation, uniform scale and rotation about one axis are allowed. For the rotation part, I pass sin(angle) and cos(angle) as uniforms. Thus 6 floats are passed per instance (3 for translation, 1 for scale, 2 for rotation). This way, I can easily draw 256 instances at once by invoking glUniform4fv, glUniform2fv and glDrawElementsInstancedBaseVertex per batch. This particular draw command is used because I use large VBOs that store multiple meshes.
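To make the 6-float layout concrete, here is a minimal C++ sketch of how such a batch could be packed on the CPU, together with a CPU reference of the transform the vertex shader would apply. The names InstanceBatch, Vec3 and transformVertex are made up for illustration, and the rotation axis is assumed to be y; the post does not say which axis is used.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-instance layout: 6 floats, split the same way the post
// splits them between a vec4 array and a vec2 array.
struct InstanceBatch {
    std::vector<float> posScale; // x, y, z, uniform scale (4 floats per instance)
    std::vector<float> sinCos;   // sin(angle), cos(angle) (2 floats per instance)

    std::size_t size() const { return sinCos.size() / 2; }

    void add(float x, float y, float z, float scale, float angle) {
        posScale.insert(posScale.end(), {x, y, z, scale});
        sinCos.insert(sinCos.end(), {std::sin(angle), std::cos(angle)});
    }
};

struct Vec3 { float x, y, z; };

// CPU reference of what the vertex shader would compute for instance i:
// rotate about the y axis (an assumption), then scale, then translate.
inline Vec3 transformVertex(const InstanceBatch& b, std::size_t i, Vec3 v) {
    assert(i < b.size());
    const float* ps = &b.posScale[4 * i];
    const float s = b.sinCos[2 * i];
    const float c = b.sinCos[2 * i + 1];
    const Vec3 r{c * v.x + s * v.z, v.y, -s * v.x + c * v.z};
    return {ps[0] + ps[3] * r.x, ps[1] + ps[3] * r.y, ps[2] + ps[3] * r.z};
}
```

With this layout, posScale.data() and sinCos.data() are already in the form that glUniform4fv and glUniform2fv expect, with size() as the element count.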

Lately I have noticed that the performance is too low for my purposes. I used gDebugger in an attempt to find the bottleneck. The frame rate was initially roughly 40 FPS. Lowering texture resolution had no effect. Disabling raster commands had negligible effect. Disabling draw commands boosted the frame rate to over 100 FPS. Thus I guess the conclusion is that the execution is neither CPU bound nor raster-operation bound, but is limited by vertex processing.

I'm also using impostors for the trees, and level of detail for the meshes, but I have the feeling that I should be able to draw more instances of the meshed trees than I currently can. I actually had a quite OK frame rate of 80 FPS with just the trees in place, but adding the foliage (a lot of instances of small low-poly meshes) dropped it to 40 FPS. Disabling either the trees or the foliage increases the frame rate significantly. Disabling the terrain, which uses a lot of polygons, has no effect, so I don't think the issue is simply the polygon count.

Could it be that uploading the uniform data is the limiting factor?

For some of the instanced object types, such as the trees, the transformation data is static and is stored in the leaf nodes of a bounding volume hierarchy (BVH) in proper arrays, so that glUniform* can be called without further assembly of data. It would then make sense to actually store these arrays in video memory. What is the best way to do this these days? I think that VBOs are used in conjunction with glVertexAttribDivisor. To me this does not seem a very neat approach, as "vertex attributes" are used for something that is clearly of "uniform" nature. But anyway, I could then make a huge VBO for the entire BVH and store a base instance value and number of instances for each leaf node. To render a leaf node, I would then use glDrawElementsInstancedBaseVertexBaseInstance. This is in the GL 4.2 core spec, which might be a bit too high. Are there better options? I also have objects (the foliage) for which the transformation data is dynamic (updated occasionally), as they are only placed around the camera. What would be the best way to store/transfer the transformation data in this case?
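The "one huge VBO plus base instance per leaf" idea can be sketched on the CPU side as follows. This is only an illustration under assumptions: Instance, LeafRange, PackedInstances and packLeaves are hypothetical names, and the actual buffer upload and draw calls are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Instance { float x, y, z, scale, sinA, cosA; }; // 6 floats, as in the post

// Hypothetical leaf record: where this leaf's instances live in the shared array.
struct LeafRange {
    std::uint32_t baseInstance;
    std::uint32_t instanceCount;
};

struct PackedInstances {
    std::vector<Instance> data;    // would be uploaded once into a single VBO
    std::vector<LeafRange> leaves; // one entry per BVH leaf, in input order
};

// Concatenate the per-leaf arrays and remember each leaf's offset and length.
PackedInstances packLeaves(const std::vector<std::vector<Instance>>& perLeaf) {
    PackedInstances out;
    for (const auto& leaf : perLeaf) {
        assert(out.data.size() + leaf.size() <= UINT32_MAX);
        out.leaves.push_back({static_cast<std::uint32_t>(out.data.size()),
                              static_cast<std::uint32_t>(leaf.size())});
        out.data.insert(out.data.end(), leaf.begin(), leaf.end());
    }
    return out;
}
```

At draw time, a leaf's instanceCount and baseInstance would go into the instancecount and baseinstance arguments of glDrawElementsInstancedBaseVertexBaseInstance, with the instance attributes set up using glVertexAttribDivisor(index, 1). On contexts below GL 4.2, one possible fallback is to re-point the instance attributes with glVertexAttribPointer at byte offset baseInstance * sizeof(Instance) before each leaf's draw call.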

Thank you in advance.


I actually had quite ok FPS of 80 with just the trees in place, but adding the foliage (a lot of instances of small low poly meshes) dropped the FPS to 40.

Don't use FPS to measure performance. Always use milliseconds per frame. 80 fps is 12.5 ms per frame. 40 fps is 25 ms per frame.
This probably means that your foliage is taking 25 - 12.5 = 12.5 ms on the GPU.
Instancing ONLY helps you out on the CPU side of things - it does nothing to lighten the GPU-side workload. What kind of pixel shader is used on the foliage polys? How much overdraw is there? Is blending / alpha testing enabled? How many pixels are covered (counting overdrawn pixels multiple times)? Does changing the frame-buffer resolution impact performance? What kind of GPU are you using?
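The frame-time arithmetic above is easy to wrap in two small helpers. This is just a sketch; the function names frameTimeMs and addedCostMs are made up for illustration.

```cpp
#include <cassert>

// Milliseconds per frame for a given frame rate.
inline double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra time per frame when the frame rate falls from fpsBefore to fpsAfter.
inline double addedCostMs(double fpsBefore, double fpsAfter) {
    return frameTimeMs(fpsAfter) - frameTimeMs(fpsBefore);
}
```

For example, addedCostMs(80, 40) gives 12.5 ms, while a drop from 200 to 160 fps works out to only 1.25 ms, which is why the same 40 fps difference can mean very different costs.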

You can use the high performance timer to measure the CPU duration of different operations per frame and EXT_timer_query to measure the GPU duration of different parts per frame. Measure the CPU and GPU costs so you're sure which one is actually the bottleneck.

It's also a good idea to time how long glSwapBuffers takes to execute -- if significant CPU time is spent in that call, it usually indicates that the CPU is waiting on the GPU (or is waiting for a vblank with vsync enabled -- disable vsync when profiling to avoid this).

Could it be that uploading the uniform data is the limiting factor?

If so, you'd probably notice a significant amount of CPU time spent inside the functions that map/update those buffers.


Don't use FPS to measure performance. Always use milliseconds per frame. 80 fps is 12.5 ms per frame. 40 fps is 25 ms per frame.
This probably means that your foliage is taking 25 - 12.5 = 12.5 ms on the GPU.

Good remark. I'm actually aware of this, but old habits die hard. I know that a drop from 80 to 40 fps is significant in terms of frame time. If I were experiencing a drop from 200 to 160, I would not be writing this.


Instancing ONLY helps you out on the CPU side of things - it does nothing to lighten the GPU-side workload. What kind of pixel shader is used on the foliage polys? How much overdraw is there? Is blending / alpha testing enabled? How many pixels are covered (counting overdrawn pixels multiple times)? Does changing the frame-buffer resolution impact performance?

The shaders for trees and foliage are mostly simple, although the vertices are animated in the vertex shader, where a sin function is evaluated. I'm using alpha testing, not blending. Objects (render batches) are roughly ordered front to back, but there is probably still significant overdraw. However, I thought that these factors were ruled out by the fact that disabling raster commands in gDebugger had practically no effect. I would expect that this also rules out being limited by frame-buffer bandwidth, although I will try lowering the frame-buffer resolution when I get a chance.


You can use the high performance timer to measure the CPU duration of different operations per frame and EXT_timer_query to measure the GPU duration of different parts per frame.

It's also a good idea to time how long glSwapBuffers takes to execute -- if significant CPU time is spent in that call, it usually indicates that the CPU is waiting on the GPU.

Didn't know about EXT_timer_query. This should prove useful in general. In this case, I could measure the time spent by glUniform* to upload the instance data. I will try to time glSwapBuffers.

I tried measuring the time that my "drawFoliage" takes by averaging over 10 frames. This function basically keeps the instance data up to date and calls glUniform*, does shader and texture binds and issues the instanced draw call. I get 0 ms or 1 ms. I tried this with both SDL_GetTicks and clock_gettime. I think both yield 1 ms accuracy on my machine, so the result is not that accurate. The timer library Hodgman linked also seems to promise at least 1 ms accuracy. Maybe I'll give it a try to see if I'm lucky and get more precision.
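For sub-millisecond CPU timings averaged over several frames, one option is std::chrono from the C++11 standard library. The sketch below is not what was used in this thread; SectionTimer is a hypothetical helper, and the actual resolution of steady_clock depends on the platform.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Accumulates the CPU time of a code section over several frames
// and reports the mean as fractional milliseconds.
class SectionTimer {
public:
    using Clock = std::chrono::steady_clock;

    void begin() { start_ = Clock::now(); }

    void end() {
        total_ += Clock::now() - start_;
        ++samples_;
    }

    // Mean duration per begin()/end() pair, in milliseconds.
    double meanMs() const {
        assert(samples_ > 0);
        return std::chrono::duration<double, std::milli>(total_).count() / samples_;
    }

    std::size_t samples() const { return samples_; }

private:
    Clock::time_point start_{};
    Clock::duration total_{0};
    std::size_t samples_ = 0;
};
```

Calling begin() and end() around the section each frame and reading meanMs() after 10 frames would give the same average as above, but without rounding each sample to whole milliseconds. Note that this still only measures the CPU time spent submitting commands, not the time the GPU spends executing them.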

I also measured the time taken by SDL_GL_SwapBuffers(). It turned out to be between 0 ms and 36 ms! It seems also to be the case that zero and a larger value alternate.

I changed the shaders for the foliage to simpler ones with no normal mapping nor animation, but it had no effect.

Wouldn't the most accurate way to measure CPU time spent in functions be the use of a profiler? Often the large amount of time used in the initialization routines makes the results a bit difficult to read, but the call graph is pretty instructive. Of course per frame timing is not possible this way.

It looks a lot like you're seeing the effects of buffering here. I suggest adding a glFinish call before you get your start time, and another before you get your end time, which will give you the actual time that the driver and your hardware spend doing work; otherwise all that you're really measuring is the time taken to add everything to a command buffer: effectively the equivalent of a handful of memcpy calls.

The glFinish before getting your start time ensures that any pending work is completed before it returns, so that doesn't skew your results.

The glFinish before getting your end time ensures that the work you've just submitted is completed before it returns, otherwise the GL calls you're profiling may not actually issue on the GPU until up to 3 frames later.
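The bracketing pattern described above can be written as a small generic helper. This is a sketch under assumptions: timeWithFinishMs is a made-up name, and the finish and work arguments are callables so that the helper itself does not depend on any GL headers. In a real program, finish would call glFinish and work would issue the draw calls being measured.

```cpp
#include <cassert>
#include <chrono>

// Times `work` with a blocking `finish` before the start and before the end,
// so that queued-but-unexecuted commands do not distort the measurement.
template <class Finish, class Work>
double timeWithFinishMs(Finish finish, Work work) {
    using Clock = std::chrono::steady_clock;
    finish();                                  // drain everything queued earlier
    const Clock::time_point t0 = Clock::now();
    work();                                    // issue the calls being measured
    finish();                                  // block until those calls complete
    const Clock::time_point t1 = Clock::now();
    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    assert(ms >= 0.0);                         // steady_clock never goes backwards
    return ms;
}
```

A call could look like timeWithFinishMs([]{ glFinish(); }, []{ drawFoliage(); }). Because glFinish stalls the pipeline, this is only suitable for profiling runs, not for shipping code.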

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

It looks a lot like you're seeing the effects of buffering here. I suggest adding a glFinish call before you get your start time, and another before you get your end time, which will give you the actual time that the driver and your hardware spend doing work; otherwise all that you're really measuring is the time taken to add everything to a command buffer: effectively the equivalent of a handful of memcpy calls.

The glFinish before getting your start time ensures that any pending work is completed before it returns, so that doesn't skew your results.

The glFinish before getting your end time ensures that the work you've just submitted is completed before it returns, otherwise the GL calls you're profiling may not actually issue on the GPU until up to 3 frames later.

Thanks for the tip.

I tried this:


glFinish();
long int time = getTimeMilliSec();
SDL_GL_SwapBuffers();
glFinish();
cout << getTimeMilliSec()-time << endl;

Now I get 0 or 1 ms.

It once again turned out that something completely different was going on. I first removed the call to the "drawFoliage" function, but surprisingly the framerate was still low. After removing the piece of code that pushes the foliage data into the BVH the framerate increased back to a good level. I then enabled my BVH visualizer code to see what was going on. I made a little illustration of this: http://www.perilouspenguin.com/pics/illustration.jpg .

My BVH works so that it starts off as a quadtree, but each object belongs to a single leaf only. This is achieved by refitting the bounding box of a node each time the objects have been divided into four subnodes. This makes it possible to have ready-to-use render lists in the leaves. Of course it is assumed that no single object spans a significant portion of the map. This has worked well thus far, but for some reason adding the foliage groups (which are small-area collections of low-poly foliage objects) totally screwed up the BVH node distribution. Thus many more objects were considered visible at any location than before. I managed to mitigate this issue somewhat by increasing the BVH subdivision depth, but I feel that this is not the proper fix.
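For readers unfamiliar with this kind of structure, here is a minimal 2D sketch of a quadtree whose node boxes are refitted to their contents, so that each object ends up in exactly one leaf. It is not my actual code: the names Box, Node, refit, build and countObjects are illustrative, and splitting at the centre of the refitted box and assigning objects by their centre point are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Box { float minX, minY, maxX, maxY; };

// Smallest box enclosing all object boxes (the "refit" step).
Box refit(const std::vector<Box>& objs) {
    assert(!objs.empty());
    Box b = objs.front();
    for (const Box& o : objs) {
        b.minX = std::min(b.minX, o.minX);
        b.minY = std::min(b.minY, o.minY);
        b.maxX = std::max(b.maxX, o.maxX);
        b.maxY = std::max(b.maxY, o.maxY);
    }
    return b;
}

struct Node {
    Box bounds{};
    std::vector<Box> objects;       // render list, filled only in leaves
    std::unique_ptr<Node> child[4];
    bool isLeaf() const { return !child[0] && !child[1] && !child[2] && !child[3]; }
};

// Builds a subtree for a non-empty set of objects.
std::unique_ptr<Node> build(std::vector<Box> objs, int depth, std::size_t maxPerLeaf) {
    std::unique_ptr<Node> node(new Node);
    node->bounds = refit(objs);     // box is fitted to contents, not a fixed grid cell
    if (depth == 0 || objs.size() <= maxPerLeaf) {
        node->objects = std::move(objs);
        return node;
    }
    const float cx = 0.5f * (node->bounds.minX + node->bounds.maxX);
    const float cy = 0.5f * (node->bounds.minY + node->bounds.maxY);
    std::vector<Box> parts[4];
    for (const Box& o : objs) {
        // Assign by object centre so each object lands in exactly one quadrant.
        const int qx = 0.5f * (o.minX + o.maxX) >= cx;
        const int qy = 0.5f * (o.minY + o.maxY) >= cy;
        parts[2 * qy + qx].push_back(o);
    }
    for (int q = 0; q < 4; ++q)
        if (!parts[q].empty())
            node->child[q] = build(std::move(parts[q]), depth - 1, maxPerLeaf);
    return node;
}

// Total number of objects stored in the leaves below n.
std::size_t countObjects(const Node& n) {
    std::size_t total = n.objects.size();
    for (const auto& c : n.child)
        if (c) total += countObjects(*c);
    return total;
}
```

Because child boxes are refitted, sibling boxes can overlap, and a single large object (or a group treated as one object) inflates the box of every node above it. With a structure like this, that is one way more leaves than expected can pass a visibility test.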

