Custom pseudo-instancing faster than real instancing on Nvidia cards?

Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

11 replies to this topic

#1 theagentd   Members   -  Reputation: 426


Posted 15 June 2013 - 04:44 PM

EDIT: This seems to be caused by glDrawElementsInstanced() being extremely CPU intensive. See my 4th post further down.



I have this situation where I need to render a large number of instances of a few different very simple meshes (100-300 triangles). Rendering them one by one turned out to be too CPU intensive, so instancing seemed like the perfect solution except for the fact that it requires OGL3. Therefore I came up with a "pseudo"-instancing method where I duplicated and stored my model 128 times in a VBO (instead of just once) and could therefore render up to 128 tiles in a single draw call by uploading instance positions to a uniform vec3[] which was used by a shader to position each instance.
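To make the duplicated-VBO idea concrete, here is a minimal sketch in plain Java with no OpenGL calls. It only builds the CPU-side arrays that would be uploaded once. The class and method names are hypothetical, and the extra per-vertex copy index is an assumption about how the shader could find its entry in the uniform vec3[]; the post does not say how that lookup is actually done.

```java
final class PseudoInstanceMesh {
    // Assumed vertex layout: x, y, z, followed by the index of the copy.
    static final int FLOATS_PER_VERTEX = 4;

    /** Repeats the mesh positions 'copies' times and tags each vertex with its copy index. */
    static float[] duplicateVertices(float[] positions, int copies) {
        int vertexCount = positions.length / 3;
        float[] out = new float[vertexCount * copies * FLOATS_PER_VERTEX];
        int o = 0;
        for (int c = 0; c < copies; c++) {
            for (int v = 0; v < vertexCount; v++) {
                out[o++] = positions[v * 3];
                out[o++] = positions[v * 3 + 1];
                out[o++] = positions[v * 3 + 2];
                out[o++] = c; // the shader would use this to index the uniform array
            }
        }
        return out;
    }

    /** Repeats the index list 'copies' times, offsetting each copy to its own vertices. */
    static int[] duplicateIndices(int[] indices, int vertexCount, int copies) {
        int[] out = new int[indices.length * copies];
        for (int c = 0; c < copies; c++) {
            for (int i = 0; i < indices.length; i++) {
                out[c * indices.length + i] = indices[i] + c * vertexCount;
            }
        }
        return out;
    }
}
```

With arrays like these, drawing k instances (k <= 128) in one call would mean uploading k positions to the uniform array and passing k * indices.length as the index count to glDrawElements().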

Now I've also implemented an OGL3 version where I upload my per-instance data using super efficient manually synchronized VBO mapping instead of using glUniform3f and render the geometry using real instancing. However, this turned out to be remarkably slower on Nvidia hardware up to the point where my pseudo-instancing was 40% faster than real instancing. On the other hand, on AMD and Intel hardware real instancing is (sometimes much) faster.


Here are my test results in FPS. Test1 = real instancing, Test2 = pseudo-instancing.



AMD HD5500
Test1: 27
Test2: 22
AMD HD6970
Test1: 225
Test2: 57
AMD HD7790
Test1: 195
Test2: 35
Nvidia GTX 295
Test1: 68
Test2: 89
Nvidia GTX 460M (laptop)
Test1: 83
Test2: 87
Intel HD5000*
Test1: 60
Test2: 54
GT 630 (rebranded 500 series GPU)
Test1: 54
Test2: 55
*Tested at much lower rendering resolutions to reduce the fragment bottleneck, so the FPS numbers on this test are much higher than they should be compared to the other cards.

My pseudo-instancing uses 128x more bandwidth (and memory), around 10x more draw calls and much less efficient memory uploading to the GPU than real instancing. I cannot for the love of god fathom why in the world this would be faster on Nvidia cards.

Edited by theagentd, 21 June 2013 - 12:58 PM.


#2 phil_t   Crossbones+   -  Reputation: 2479


Posted 15 June 2013 - 06:15 PM

It wouldn't be 128x more bandwidth though, right? The same number of vertices need to be read and processed regardless of if you're "instancing" or not (unless your model is so tiny that all its vertices fit in the pre-transform cache?).

#3 theagentd   Members   -  Reputation: 426


Posted 15 June 2013 - 06:22 PM

Well, yes, but since there are so few vertices I assume that at least some data might still be available in some kind of cache. Regardless, it shouldn't be faster for a number of other reasons.

#4 mhagain   Crossbones+   -  Reputation: 6292


Posted 16 June 2013 - 05:38 AM

My best guess is that your "super efficient manually synchronized VBO mapping" actually isn't.  Could you go into some more detail about how you accomplish this?  Because synchronization when mapping buffer objects is exactly the kind of thing that could cause something that is supposed to be faster to become mysteriously slower, and exactly the kind of thing that you could see wildly divergent results on different GPUs with (it would actually be the driver causing trouble though).

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.

#5 marcClintDion   Members   -  Reputation: 431


Posted 16 June 2013 - 01:48 PM

It looks to me like the real instancing algorithm is a bit off.  If it only works properly on 2 out of 7 GPUs then I suspect you will have to make a few adjustments for compatibility; some of those cards should show the same distinct improvement that two of them show.   It looks like the nVidia drivers are being forced to perform a software emulation on the CPU.  It could be something as simple as using an extension that's too new; it can take a while for manufacturers to play catch-up with one another.

There is an instancing demo available in the PowerVR SDK, available for Windows and Linux, and there is another example published by Mali as well.   If you compare what you are doing with what they are doing, you might be able to get it working on more cards.




Consider it pure joy, my brothers and sisters, whenever you face trials of many kinds, because you know that the testing of your faith produces perseverance. Let perseverance finish its work so that you may be mature and complete, not lacking anything.

#6 theagentd   Members   -  Reputation: 426


Posted 20 June 2013 - 12:42 PM

I'm sorry for taking so long to respond. Work is killing me...



My manual "synchronization" is actually no synchronization at all. I'm depending on a rolling buffer approach where I allocate and resize VBOs as they are needed and then ensure that the same VBO is not reused until at least 6 frames have passed. 6 frames is a lot of time and should be much longer than the OpenGL driver is prepared to let the GPU fall behind before stalling the CPU, and neither decreasing nor increasing this value has an effect on performance (although low values introduce artifacts of course). Regardless, the fact that I am using GL_MAP_UNSYNCHRONIZED_BIT should disable all synchronization and be the fastest way of doing this. I don't really care if this is not 100% correct or safe at this point, I'm just saying that at the moment I'm not doing any 2-way communication with the GPU at all, so I don't see any possible way that the performance problems on Nvidia cards are my fault.



My instancing algorithm is as simple as it can get. I simply upload a buffer (using the above described VBO handling) filled with 3D-positions (16-bit values, padded from 6 to 8 bytes) of where to render each instance, which is read into the shader as a per-instance attribute (glVertexAttribDivisor(instancePositionLocation, 1)). Then everything is drawn using a single call to glDrawElementsInstanced().
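As an illustration of the per-instance layout described above (three 16-bit values padded from 6 to 8 bytes), here is a small sketch using only java.nio. The class and method names are hypothetical, and the choice of a zeroed fourth short as padding is an assumption; the post only states the sizes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class InstancePacking {
    // 3 shorts of position data plus 1 short of padding per instance.
    static final int STRIDE_BYTES = 8;

    /** Packs x,y,z triples into a direct buffer with an 8-byte stride per instance. */
    static ByteBuffer pack(short[] xyz) {
        if (xyz.length % 3 != 0) {
            throw new IllegalArgumentException("expected x,y,z triples");
        }
        int count = xyz.length / 3;
        ByteBuffer buf = ByteBuffer.allocateDirect(count * STRIDE_BYTES)
                                   .order(ByteOrder.nativeOrder());
        for (int i = 0; i < count; i++) {
            buf.putShort(xyz[i * 3]);
            buf.putShort(xyz[i * 3 + 1]);
            buf.putShort(xyz[i * 3 + 2]);
            buf.putShort((short) 0); // padding, never read by the shader
        }
        buf.flip(); // position = 0, limit = bytes written
        return buf;
    }
}
```

A buffer packed this way would be described to OpenGL with a stride of 8 bytes for the instance-position attribute, together with the glVertexAttribDivisor(instancePositionLocation, 1) call mentioned above.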


Concerning performance, 5 out of 7 perform as I expect. The AMD HD5500 and the Intel HD5000 are both very limited by fragment performance, not vertex performance. I'd also like to argue that the AMD cards are too slow when doing pseudo-instancing, not the other way around. The performance numbers also add up when comparing the cards:


GTX 295 vs HD7790: The GTX 295 was only running on one GPU. When both are enabled I get around 90% higher FPS, which is very close to the HD7790. Those two cards have very similar theoretical computing performance.



Instancing render() method: http://pastie.org/8063921 (simplified) Shader: http://pastie.org/8063953
Pseudo-instancing render() method: http://pastie.org/8063921 (simplified) Shader: http://pastie.org/8063948

I've also tested the performance of simply using glBufferData() instead of glMapBufferRange(..., GL_MAP_UNSYNCHRONIZED_BIT). Here are the test results from my GTX 295 using both GPUs. This is a new scene so these numbers are not comparable to the ones in my previous post.

Instancing + glMapBufferRange: 112 FPS
Instancing + glBufferData: 141 FPS
Pseudo-instancing: 153 FPS

Similar performance numbers have been confirmed on cards from all series from the 200 series up to the 600 series, including laptop, high-end and low-end GPUs. So far it seems easiest to just let Nvidia cards use the pseudo-instancing version while everything else uses Instancing + glMapBufferRange.

Sorry for the wall of text... >_<

Edited by theagentd, 20 June 2013 - 12:43 PM.

#7 mhagain   Crossbones+   -  Reputation: 6292


Posted 20 June 2013 - 04:07 PM

Can you clarify this please?


I allocate and resize VBOs as they are needed


Are you doing this at runtime?  Allocating and resizing VBOs at runtime can be quite an expensive operation.  If you are doing this at runtime, I'd recommend that you junk it and look at buffer object streaming instead.  With streaming you only need one buffer object, not 6, and the driver will automatically manage the multi-buffering for you (i.e. you don't need to make a guess at when the driver will be no longer using the buffer; the driver will know and give you either a new block of memory or a previously used block based on this knowledge).
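One common form of buffer object streaming is to append each frame's data to a single buffer and "orphan" it (re-specify its storage with glBufferData() and a null data pointer) when it fills up, so the driver can hand back fresh memory. The sketch below shows only the offset bookkeeping in plain Java; the class name is hypothetical, the GL calls appear only in comments, and this is one possible reading of the advice rather than a definitive implementation.

```java
final class StreamCursor {
    private final int capacity;   // size of the single streaming buffer in bytes
    private int offset = 0;       // next free byte in the current storage
    private int orphanCount = 0;  // how often the storage was re-specified

    StreamCursor(int capacity) {
        this.capacity = capacity;
    }

    /** Returns the byte offset at which 'bytes' of new data can be written. */
    int reserve(int bytes) {
        if (bytes > capacity) {
            throw new IllegalArgumentException("request exceeds buffer capacity");
        }
        if (offset + bytes > capacity) {
            // Real code would orphan here, e.g. glBufferData(target, capacity, null data, GL_STREAM_DRAW),
            // and then continue writing at the start of the new storage.
            orphanCount++;
            offset = 0;
        }
        int start = offset;
        offset += bytes;
        // Real code would now map [start, start + bytes) with glMapBufferRange() and fill it.
        return start;
    }

    int orphanCount() {
        return orphanCount;
    }
}
```

The point of the design is that the application never has to guess how many frames the GPU is behind; it only decides when to ask for fresh storage.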

#8 theagentd   Members   -  Reputation: 426


Posted 21 June 2013 - 09:10 AM

First of all: I've found that the reason for the slowdown does not seem to lie in how I upload the data. See below.


But to answer your question: My VBO handler works like this:

 - handler.nextFrame() is called when a new frame starts. It notifies the handler that a new frame has started and that it should start using the next set of VBOs. When X (= 6) frames have passed it will loop around and start using the first set of buffers again.

 - handler.nextBuffer() retrieves a previously allocated VBO. If it runs out of VBOs (the number of VBOs needed per frame depends on how many different types of meshes were needed) it will allocate a new one and store it so that it can be reused later.

 - vbo.ensureCapacity() ensures that the retrieved VBO has enough capacity. If it is too small it will call glBufferData() to resize it to the requested capacity.

 - vbo.map()/unmap() simply call glMapBufferRange() and glUnmapBuffer().


All in all, this means that new buffers are almost never allocated except during the first 6 frames (the number of mesh types is constant), and that they are almost as rarely resized. Once there are enough buffers and those buffers are big enough it won't have to do anything at all. In my test scene this stabilizes after looking around for around 2-3 seconds. The reduction in VRAM efficiency is not a problem since each buffer is less than 50 KB in size.
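The handler described above can be sketched without any OpenGL calls to show why allocations stop after the first 6 frames. Everything here is hypothetical naming: RollingBufferPool stands in for the handler, Vbo for a buffer wrapper, and the places where a real version would call glGenBuffers() or glBufferData() are marked in comments.

```java
import java.util.ArrayList;
import java.util.List;

final class RollingBufferPool {
    /** Stand-in for a VBO wrapper; tracks size only. */
    static final class Vbo {
        int capacity; // bytes currently allocated
        int resizes;  // number of times the storage had to grow

        void ensureCapacity(int bytes) {
            if (bytes > capacity) {
                capacity = bytes; // a real version would call glBufferData() here
                resizes++;
            }
        }
    }

    private final List<List<Vbo>> sets = new ArrayList<>();
    private int currentSet = 0;
    private int nextIndex = 0;
    private int allocations = 0;

    RollingBufferPool(int framesInFlight) {
        for (int i = 0; i < framesInFlight; i++) {
            sets.add(new ArrayList<>());
        }
    }

    /** Switches to the next set of buffers, wrapping around after the last set. */
    void nextFrame() {
        currentSet = (currentSet + 1) % sets.size();
        nextIndex = 0;
    }

    /** Returns the next unused buffer of the current set, allocating one if the set is exhausted. */
    Vbo nextBuffer() {
        List<Vbo> set = sets.get(currentSet);
        if (nextIndex == set.size()) {
            set.add(new Vbo()); // a real version would call glGenBuffers() here
            allocations++;
        }
        return set.get(nextIndex++);
    }

    int allocations() {
        return allocations;
    }

    /** Simulates 'frames' frames with 6 buffer sets and returns the total number of allocations. */
    static int allocationsAfter(int frames, int buffersPerFrame) {
        RollingBufferPool pool = new RollingBufferPool(6);
        for (int f = 0; f < frames; f++) {
            pool.nextFrame();
            for (int b = 0; b < buffersPerFrame; b++) {
                pool.nextBuffer().ensureCapacity(1024);
            }
        }
        return pool.allocations();
    }
}
```

With a constant number of buffers per frame, the allocation count stops growing once every set has been visited once, which matches the behaviour described above.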





I did some profiling to identify potential bottlenecks, and discovered something very suspicious. glDrawElementsInstanced() is taking a very large chunk of the time it takes to render a frame!


 - The rendering resolution was set to 192x108 (1920/10 x 1080/10) to ensure I'm not fragment limited.

 - The view distance was increased a lot so much more of the world is visible.

 - The meshes were replaced with simple quads (4 verts, 2 tris).


This resulted in around 101 000 meshes being drawn per frame. Then I modified my pseudo-instancing renderer so that it can also issue its draw calls with glDrawElementsInstanced() instead of glDrawElements(). Data is still uploaded using glUniform() and rendered in batches of 128 as usual, which completely eliminates any use of (dynamic) VBOs! I did the test on my GTX 295 with only one GPU enabled. Profiling the unmodified pseudo-instancing code (glDrawElements()) gave the following:



 - Runs at 112 FPS.

 - Frustum culling takes 80.6% of the time.

 - glUniform() takes 8.1% of the time and glDrawElements() takes 2.4%, for a total of 10.5%.

 - The remaining 8.9% are miscellaneous OpenGL calls and some collision detection.

 - GPU-Z reports 53% GPU usage, 9% memory controller load.


These results are pretty much expected.


Pseudo-instancing code with glDrawElementsInstanced() call instead:

 - Runs at 54 FPS.

 - glDrawElementsInstanced() suddenly accounts for 49.4% of the CPU time!

 - Frustum culling takes only 37.7% of the time.

 - glUniform() takes around 7.5% of the time.

 - GPU-Z reports 88% GPU usage, 4% memory controller load.



These results pretty much prove that this has got to be a driver bug, and that I mistakenly blamed glMapBufferRange() for the slowness. The inflated GPU load makes no sense. The weirdest part is that the CPU overhead of glDrawElementsInstanced() seems to scale with the number of instances drawn, effectively making it pretty useless. Of course it's faster to do one glDrawElementsInstanced() call instead of doing 101 000 glDrawElements() calls each frame, but batching together those meshes into 790 glDrawElements() calls is still more than 10x faster!
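The arithmetic behind these figures can be checked with a few lines of Java. The helper names are hypothetical, and the time estimate assumes that the profiler percentages are shares of the whole frame time (1000 / FPS milliseconds), which the post does not state explicitly.

```java
final class DrawCallCost {
    /** Number of draw calls needed to cover all instances with a fixed batch size (rounds up). */
    static int batchCount(int instances, int batchSize) {
        return (instances + batchSize - 1) / batchSize;
    }

    /** Milliseconds per frame spent in a call, from the frame rate and the call's share of frame time in percent. */
    static double millisPerFrame(double fps, double percentOfFrame) {
        return (1000.0 / fps) * (percentOfFrame / 100.0);
    }

    public static void main(String[] args) {
        // 101 000 meshes in batches of 128
        System.out.println(batchCount(101_000, 128));
        // glDrawElements(): 2.4% of the frame at 112 FPS
        System.out.println(millisPerFrame(112, 2.4));
        // glDrawElementsInstanced(): 49.4% of the frame at 54 FPS
        System.out.println(millisPerFrame(54, 49.4));
    }
}
```

Under that assumption, 101 000 meshes in batches of 128 gives 790 draw calls, and the draw calls cost roughly 0.21 ms per frame with glDrawElements() versus roughly 9.15 ms with glDrawElementsInstanced().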

#9 mhagain   Crossbones+   -  Reputation: 6292


Posted 21 June 2013 - 06:54 PM

OK... at this stage can you post some code?  Your descriptions are great but there still seems to be something weird happening that's pushing you off the fast path, and it's difficult to tell from descriptions alone.

#10 theagentd   Members   -  Reputation: 426


Posted 22 June 2013 - 09:54 AM

I've narrowed down the problem to glDrawElementsInstanced() specifically. glDrawArraysInstanced() does not suffer from the insane CPU usage.


I've made a small program that renders 524 288 small 1-pixel quads over a window.


 - My pseudo-instancing method manages 93 FPS (uses glDrawElements() and draws 512 instances per batch).

 - Rendering 512 instances per batch using glDrawArraysInstanced() gives me 43 FPS, which is decent.

 - Rendering 512 instances per batch using glDrawElementsInstanced() gives me an abysmal 10 FPS.

 - Rendering all instances using a single call to glDrawArraysInstanced() gives me 104 FPS as expected! :D

 - Rendering all instances using a single call to glDrawElementsInstanced() gives me 11 FPS. :(


The test can be found here: http://www.mediafire.com/?2i0vd909uw3salk


 - Run by starting run.bat. You need to have Java installed. If it can't find Java, try to hardcode the path to java.exe in the bat-file.

 - The program pops up an option box where you can choose one of the 5 modes above. The pseudo-instancing mode and the pure instancing modes are the interesting ones.

 - FPS is printed to the console every second.


The source can be found here: http://pastie.org/8069712 and requires LWJGL for OpenGL access. Shader source can be found in the shaders/ directory that comes with the test program.



So... How do I report this so Nvidia actually listens? Isn't this a pretty serious bug?

Edited by theagentd, 22 June 2013 - 11:25 AM.

#11 Hodgman   Moderators   -  Reputation: 23949


Posted 22 June 2013 - 10:06 AM

So... How do I report this so Nvidia actually listens?

I'm not quite sure, but making an account at developer.nvidia.com would probably be a good start, and then reporting it either on their forums or via the "contact" form.
The small, self-contained reproduction test is great for them to be able to see exactly what you're on about.

I haven't looked at your code, but simply adding an index buffer shouldn't double your CPU-side frame-times I would think...
I ran your test on my PC (Q6600 CPU, GTX460, driver 12/05/2013) and got (from left to right buttons): 77 43 31 83 42

Edited by Hodgman, 22 June 2013 - 10:22 AM.

#12 theagentd   Members   -  Reputation: 426


Posted 22 June 2013 - 11:06 AM

Thank you very much! That means that the performance quirks exist (at least) on the following cards:

GTX 295

GTX 460M (laptop)

GTX 460

GT 630

GTX 680


The only difference between the indexed renderer and the array renderer is that one uses 6 indices to form 2 GL_TRIANGLES and the other effectively does the same thing internally using GL_QUADS but without an index buffer. This isn't a problem when just rendering quads, but everything I'm rendering in my engine uses indexed triangles.


EDIT: I've posted this on the Nvidia developer forums: https://devtalk.nvidia.com/default/topic/548150/opengl/performance-bug-in-gldrawelementsinstanced/

Edited by theagentd, 22 June 2013 - 11:18 AM.
