theagentd

Custom pseudo-instancing faster than real instancing on Nvidia cards?


EDIT: This seems to be caused by glDrawElementsInstanced() being extremely CPU intensive. See my fourth post further down.

 

Hello.

I have a situation where I need to render a large number of instances of a few different, very simple meshes (100-300 triangles). Rendering them one by one turned out to be too CPU intensive, so instancing seemed like the perfect solution, except for the fact that it requires OGL3. Therefore I came up with a "pseudo"-instancing method where I duplicated and stored my model 128 times in a VBO (instead of just once) and could therefore render up to 128 tiles in a single draw call by uploading the instance positions to a uniform vec3[] that the shader uses to position each instance.
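Roughly, the idea boils down to the sketch below (plain C-style GL calls for illustration; my actual code is Java/LWJGL, and names like MAX_BATCH, copyIndex and instancePositions are placeholders). Each vertex of the duplicated mesh carries an index telling the shader which slot of the uniform array it belongs to.

#include <GL/glew.h>   /* any loader exposing the GL entry points works */

/* Minimal sketch of the pseudo-instancing path, not my exact code.
 * Assumes: the VBO holds 128 copies of the mesh, each vertex has a "copyIndex"
 * attribute in [0,127], the index buffer covers all 128 copies in order, and
 * the vertex shader declares "uniform vec3 instancePositions[128];" and adds
 * instancePositions[int(copyIndex)] to the vertex position. */
#define MAX_BATCH 128

void drawPseudoInstanced(GLuint program,
                         const float *positions,  /* 3 floats per instance    */
                         int instanceCount,       /* total instances to draw  */
                         int indicesPerMesh)      /* indices in ONE mesh copy */
{
    GLint loc = glGetUniformLocation(program, "instancePositions[0]");
    glUseProgram(program);
    /* Vertex/index buffers and attributes are assumed to be bound already. */

    for (int first = 0; first < instanceCount; first += MAX_BATCH) {
        int batch = instanceCount - first;
        if (batch > MAX_BATCH) batch = MAX_BATCH;

        /* Upload this batch's positions into the uniform array... */
        glUniform3fv(loc, batch, positions + 3 * first);

        /* ...and draw 'batch' mesh copies with a single glDrawElements call:
         * the first batch * indicesPerMesh indices reference exactly the
         * first 'batch' copies stored in the VBO. */
        glDrawElements(GL_TRIANGLES, batch * indicesPerMesh,
                       GL_UNSIGNED_SHORT, (const void *)0);
    }
}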

Now I've also implemented an OGL3 version where I upload my per-instance data using super-efficient, manually synchronized VBO mapping instead of glUniform3f, and render the geometry using real instancing. However, this turned out to be remarkably slower on Nvidia hardware, to the point where my pseudo-instancing was 40% faster than real instancing. On AMD and Intel hardware, on the other hand, real instancing is (sometimes much) faster.

 

Here's my test result data (all numbers are FPS). Test1 = real instancing, Test2 = pseudo-instancing.

 

 

AMD HD5500: Test1 = 27, Test2 = 22
AMD HD6970: Test1 = 225, Test2 = 57
AMD HD7790: Test1 = 195, Test2 = 35
Nvidia GTX 295: Test1 = 68, Test2 = 89
Nvidia GTX 460M (laptop): Test1 = 83, Test2 = 87
Intel HD5000*: Test1 = 60, Test2 = 54
Nvidia GT 630 (rebranded 500-series GPU): Test1 = 54, Test2 = 55
 
*Tested at a much lower rendering resolution to reduce the fragment bottleneck, so the FPS numbers for this card are much higher than they should be compared to the other cards.
 
 
My pseudo-instancing uses 128x more bandwidth (and memory), around 10x more draw calls and much less efficient memory uploading to the GPU than real instancing. I cannot for the love of god fathom why in the world this would be faster on Nvidia cards.

It wouldn't be 128x more bandwidth though, right? The same number of vertices needs to be read and processed regardless of whether you're "instancing" or not (unless your model is so tiny that all of its vertices fit in the pre-transform cache?).


Well, yes, but since there are so few vertices I assume that at least some data might still be available in some kind of cache. Regardless, it shouldn't be faster for a number of other reasons.


My best guess is that your "super efficient manually synchronized VBO mapping" actually isn't.  Could you go into some more detail about how you accomplish this?  Because synchronization when mapping buffer objects is exactly the kind of thing that could cause something that is supposed to be faster to become mysteriously slower, and exactly the kind of thing that you could see wildly divergent results on different GPUs with (it would actually be the driver causing trouble though).


It looks to me like the real instancing path is a bit off. If it only works properly on two out of seven GPUs, then I suspect you will have to make a few adjustments for compatibility; some of those cards should show the same distinct improvement that two of them show. It looks like the nVidia drivers are being forced to perform a software emulation on the CPU. It could be something as simple as using an extension that's too new; it can take a while for manufacturers to play catch-up with one another.

There is an instancing demo in the PowerVR SDK, available for Windows and Linux, and there is another example published by ARM for Mali as well. Maybe if you compare what you are doing with what they are doing, you'll be able to get it working on more cards.

 

http://www.imgtec.com/powervr/insider/sdkdownloads/index.asp

http://malideveloper.arm.com/develop-for-mali/sdks/opengl-es-sdk-for-linux/


I'm sorry for taking so long to respond. Work is killing me...

 

@mhagain

My manual "synchronization" is actually no synchronization at all. I'm depending on a rolling buffer approach where I allocate and resize VBOs as they are needed and then ensure that the same VBO is not reused until at least 6 frames have passed. 6 frames is a lot of time and should be much longer than the OpenGL driver is prepared to let the GPU fall behind before stalling the CPU, and neither decreasing or increasing this value has an effect on performance (although low values introduce artifacts of course). Regardless, the fact that I am using GL_MAP_UNSYNCHRONIZED_BIT should disable all synchronization and be the fastest way of doing this. I don't really care if this is not 100% correct or safe at this point, I'm just saying that at the moment I'm not doing any 2-way communication with the GPU at all, so I don't see any possible way that the performance problems on Nvidia cards are my fault.

 

@marcClintDion

My instancing algorithm is as simple as it can get. I simply upload a buffer (using the above described VBO handling) filled with 3D-positions (16-bit values, padded from 6 to 8 bytes) of where to render each instance, which is read into the shader as a per-instance attribute (glVertexAttribDivisor(instancePositionLocation, 1)). Then everything is drawn using a single call to glDrawElementsInstanced().
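In plain GL calls the setup amounts to roughly this (a simplified sketch, not my exact code; instanceVbo is a placeholder for whatever buffer the handler returned):

#include <GL/glew.h>

/* Per-instance position attribute fed from the dynamic VBO, advanced once per
 * instance, then a single instanced draw per mesh type. */
void setupInstanceAttrib(GLuint instanceVbo, GLint instancePositionLocation)
{
    glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
    glEnableVertexAttribArray(instancePositionLocation);
    /* 3 x 16-bit values per instance, 8-byte stride because of the 2 bytes
     * of padding mentioned above. */
    glVertexAttribPointer(instancePositionLocation, 3, GL_SHORT, GL_FALSE,
                          8, (const void *)0);
    /* Advance this attribute once per instance instead of once per vertex. */
    glVertexAttribDivisor(instancePositionLocation, 1);
}

void drawInstances(int indicesPerMesh, int instanceCount)
{
    /* One call draws every instance of this mesh type. */
    glDrawElementsInstanced(GL_TRIANGLES, indicesPerMesh, GL_UNSIGNED_SHORT,
                            (const void *)0, instanceCount);
}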

 

Concerning performance, 5 out of 7 cards perform as I expect. The AMD HD5500 and the Intel HD5000 are both very limited by fragment performance, not vertex performance. I'd also argue that the AMD cards are too slow when doing pseudo-instancing, not the other way around. The performance numbers also add up when comparing the cards:

 

GTX 295 vs HD7790: The GTX 295 was only running on one GPU. When both are enabled I get around 90% higher FPS, which is very close to the HD7790. Those two cards have very similar theoretical computing performance.

 

 

Instancing render() method: http://pastie.org/8063921 (simplified) Shader: http://pastie.org/8063953
Pseudo-instancing render() method: http://pastie.org/8063921 (simplified) Shader: http://pastie.org/8063948
 
 
I've also tested the performance of simply using glBufferData() instead of glMapBufferRange(..., GL_MAP_UNSYNCHRONIZED_BIT). Here are the test results from my GTX 295 using both GPUs. This is a new scene so these numbers are not comparable to the ones in my previous post.
 
Instancing + glMapBufferRange: 112 FPS
Instancing + glBufferData: 141 FPS
Pseudo-instancing: 153 FPS
 
Similar performance numbers have been confirmed on cards from every series from the 200 series up to the 600 series, including laptops and both high-end and low-end GPUs. So far it seems easiest to just let Nvidia cards use the pseudo-instancing version while everything else uses instancing + glMapBufferRange.
 
Sorry for the wall of text... >_<

Can you clarify this please?

 

I allocate and resize VBOs as they are needed

 

Are you doing this at runtime? Allocating and resizing VBOs at runtime can be quite an expensive operation. If you are, I'd recommend that you junk it and look at buffer object streaming instead. With streaming you only need one buffer object, not 6, and the driver will automatically manage the multi-buffering for you (i.e. you don't need to guess when the driver will no longer be using the buffer; the driver knows, and will give you either a new block of memory or a previously used block based on that knowledge).
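One common form of the streaming I mean is orphaning; roughly (a minimal C-style sketch with placeholder names, not tied to your code):

#include <GL/glew.h>

/* Single-VBO streaming via orphaning: every frame the old storage is detached
 * and the driver hands back fresh memory, managing the multi-buffering itself. */
void streamInstanceData(GLuint streamVbo, const void *data, GLsizeiptr bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, streamVbo);

    /* Orphan the previous contents; any in-flight draws keep using the old
     * storage while this buffer name gets a new block. */
    glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STREAM_DRAW);

    /* Fill the freshly allocated storage; no synchronization needed. */
    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, data);
}

Mapping the buffer with glMapBufferRange and GL_MAP_INVALIDATE_BUFFER_BIT is another way to express the same thing.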


First of all: I've found that the reason for the slowdown does not seem to lie in how I upload the data. See below.

 

But to answer your question: My VBO handler works like this:

 - handler.nextFrame() is called when a new frame starts. It notifies the handler that a new frame has started and that it should start using the next set of VBOs. When X (= 6) frames have passed it will loop around and start using the first set of buffers again.

 - handler.nextBuffer() retrieves a previously allocated VBO. If it runs out of VBOs (the number of VBOs needed per frame depends on how many different types of meshes that were needed) it will allocate a new one and store it so that it can be reused later.

 - vbo.ensureCapacity() ensures that the retrieved VBO has enough capacity. If it is too small, it calls glBufferData() to resize it to the requested capacity.

 - vbo.map()/unmap() simply call glMapBufferRange() and glUnmapBuffer().

 

All in all, this means that new buffers are almost never allocated except during the first 6 frames (the number of mesh types is constant), and they are almost as rarely resized. Once there are enough buffers and those buffers are big enough, it won't have to do anything at all. In my test scene this stabilizes after looking around for around 2-3 seconds. The reduction in VRAM efficiency is not a problem since each buffer is less than 50 KB in size.
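In C-like pseudocode the handler boils down to roughly this (the real code is Java; all names, sizes and the fixed-size pool are illustrative only):

#include <GL/glew.h>

#define FRAMES_IN_FLIGHT   6
#define MAX_VBOS_PER_FRAME 64   /* no growth/bounds handling here, for brevity */

typedef struct {
    GLuint     id;
    GLsizeiptr capacity;
} StreamVbo;

static StreamVbo pools[FRAMES_IN_FLIGHT][MAX_VBOS_PER_FRAME];
static int       poolSize[FRAMES_IN_FLIGHT];
static int       frame, nextInPool;

void handlerNextFrame(void)                  /* handler.nextFrame() */
{
    frame = (frame + 1) % FRAMES_IN_FLIGHT;
    nextInPool = 0;
}

StreamVbo *handlerNextBuffer(void)           /* handler.nextBuffer() */
{
    if (nextInPool == poolSize[frame]) {     /* ran out: allocate a new VBO */
        StreamVbo *v = &pools[frame][poolSize[frame]++];
        glGenBuffers(1, &v->id);
        v->capacity = 0;
    }
    return &pools[frame][nextInPool++];
}

void vboEnsureCapacity(StreamVbo *v, GLsizeiptr bytes)  /* vbo.ensureCapacity() */
{
    glBindBuffer(GL_ARRAY_BUFFER, v->id);
    if (bytes > v->capacity) {               /* resize only when too small */
        glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STREAM_DRAW);
        v->capacity = bytes;
    }
}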

 

 

 

 

I did some profiling to identify potential bottlenecks, and discovered something very suspicious. glDrawElementsInstanced() is taking a very large chunk of the time it takes to render a frame!

 

 - The rendering resolution was set to 192x108 (1920/10 x 1080/10) to ensure I'm not fragment limited.

 - The view distance was increased a lot so much more of the world is visible.

 - The meshes were replaced with simple quads (4 verts, 2 tris).

 

This resulted in around 101 000 meshes being drawn per frame. Then I modified my pseudo-instancing renderer to instead render using instancing. Data is still uploaded using glUniform() and rendered in batches of 128 as usual, but this completely eliminates any use of (dynamic) VBOs! I did the test on my GTX 295 with only one GPU enabled. Using profiling I could determine the following things:

 

Pseudo-instancing

 - Runs at 112 FPS.

 - Frustum culling takes 80.6% of the time.

 - glUniform() takes 8.1% of the time, glDrawElements() takes 2.4% = 10.5% of the time.

 - The remaining 8.9% are miscellaneous OpenGL calls and some collision detection.

 - GPU-Z reports 53% GPU usage, 9% memory controller load.

 

These results are pretty much expected.

 

Pseudo-instancing code with glDrawElementsInstanced() call instead:

 - Runs at 54 FPS.

 - glDrawElementsInstanced() suddenly accounts for 49.4% of the CPU time!

 - Frustum culling takes only 37.7% of the time.

 - glUniform() takes around 7.5% of the time.

 - GPU-Z reports 88% GPU usage, 4% memory controller load.

 

 

These results pretty much prove that this has to be a driver bug, and that I mistakenly blamed glMapBufferRange() for the slowness. The inflated GPU load makes no sense. The weirdest part is that the CPU overhead of glDrawElementsInstanced() seems to scale with the number of instances drawn, effectively making it pretty useless. Of course it's faster to do one glDrawElementsInstanced() call instead of 101 000 glDrawElements() calls each frame, but batching those meshes into 790 glDrawElements() calls is still more than 10x faster!


OK... at this stage can you post some code?  Your descriptions are great but there still seems to be something weird happening that's pushing you off the fast path, and it's difficult to tell from descriptions alone.


I've narrowed down the problem to glDrawElementsInstanced() specifically. glDrawArraysInstanced() does not suffer from the insane CPU usage.
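For reference, the two draw paths being compared boil down to this (plain GL calls; the actual test is the Java/LWJGL program linked below, and the non-indexed path draws compatibility-profile quads):

#include <GL/glew.h>

/* Same quad geometry in both cases; only the presence of an index buffer differs. */
void drawQuadsArraysInstanced(int instanceCount)
{
    /* Non-indexed path: 4 vertices per quad, no GL_ELEMENT_ARRAY_BUFFER. */
    glDrawArraysInstanced(GL_QUADS, 0, 4, instanceCount);
}

void drawQuadsElementsInstanced(int instanceCount)
{
    /* Indexed path: 6 indices forming 2 triangles per quad; this is the call
     * with the enormous CPU overhead on Nvidia drivers. */
    glDrawElementsInstanced(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT,
                            (const void *)0, instanceCount);
}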

 

I've made a small program that renders 524 288 small 1-pixel quads over a window.

 

 - My pseudo-instancing method manages 93 FPS (uses glDrawElements() and draws 512 instances per batch).

 - Rendering 512 instances per batch using glDrawArraysInstanced() gives me 43 FPS, which is decent.

 - Rendering 512 instances per batch using glDrawElementsInstanced() gives me an abysmal 10 FPS.

 - Rendering all instances using a single call to glDrawArraysInstanced() gives me 104 FPS, as expected!

 - Rendering all instances using a single call to glDrawElementsInstanced() gives me 11 FPS.

 

The test can be found here: http://www.mediafire.com/?2i0vd909uw3salk

 

 - Run by starting run.bat. You need to have Java installed. If it can't find Java, try to hardcode the path to java.exe in the bat-file.

 - The program pops up an option box where you can choose one of the 5 modes above. The pseudo-instancing mode and the pure instancing modes are the interesting ones.

 - FPS is printed to the console every second.

 

The source can be found here: http://pastie.org/8069712 and requires LWJGL for OpenGL access. Shader source can be found in the shaders/ directory that comes with the test program.

 

 

So... How do I report this so Nvidia actually listens? Isn't this a pretty serious bug?


So... How do I report this so Nvidia actually listens?

I'm not quite sure, but making an account at developer.nvidia.com would probably be a good start, and then posting either on their forums or via the "contact" form.
The small, self-contained reproduction test is great for them to be able to see exactly what you're on about.

I haven't looked at your code, but simply adding an index buffer shouldn't double your CPU-side frame-times I would think...
I ran your test on my PC (Q6600 CPU, GTX 460, driver 9.18.13.2018, 12/05/2013) and got (from left to right buttons): 77, 43, 31, 83, 42.

Thank you very much! That means that the performance quirks exist (at least) on the following cards:

GTX 295

GTX 460m (laptop)

GTX 460

GT 630

GTX 680

 

The only difference between the indexed renderer and the array renderer is that one uses 6 indices to form 2 GL_TRIANGLES and the other effectively does the same thing internally using GL_QUADS but without an index buffer. This isn't a problem when just rendering quads, but everything I'm rendering in my engine uses indexed triangles.

 

EDIT: I've posted this on the Nvidia developer forums: https://devtalk.nvidia.com/default/topic/548150/opengl/performance-bug-in-gldrawelementsinstanced/

